In this article, I share the results of an experiment that looked for class names in comments, that is, tokens that refer to classes existing in the Pharo environment, across all Pharo classes and methods. The experiment was conducted on a Pharo 11 image and relied on a regular expression to detect UpperCamelCase tokens, the form that Pharo's naming convention prescribes for class names. Once a token was detected, the next step was to look up the corresponding class definition.
Analysis over class comments
I started with class comments. The environment contains 9816 classes in total. Of these, 8928 have a comment that does not start with ‘Please comment me’, which is the beginning of the default Pharo class comment. In other words, developers have replaced the default comment with their own in about 91% of the classes.
I used the following regular expression in Pharo:
'\b[A-Z][a-zA-Z\d]+\b' asRegex
I applied it to all classes that have custom comments to detect every UpperCamelCase token, such as OrderedCollection. In total, 3371 classes have comments containing tokens that match this pattern, and these comments hold 7505 references to classes, with duplicates removed within each comment. In other words, a token that appears several times in the same comment is counted only once, which also speeds up the lookup of classes in the system.
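The detection and per-comment de-duplication described above can be sketched as follows. This is a minimal Python illustration, not the Pharo code used in the experiment. The pattern is the one quoted above, while `class_references`, the sample comment and the `known_classes` set are made up for the example and stand in for a class comment and the classes of the image.

```python
import re

# Same pattern as in the article: an uppercase letter followed by
# one or more letters or digits, delimited by word boundaries.
TOKEN_PATTERN = re.compile(r"\b[A-Z][a-zA-Z\d]+\b")


def class_references(comment, known_classes):
    """Return the distinct tokens in `comment` that name a known class."""
    tokens = set(TOKEN_PATTERN.findall(comment))  # duplicates collapse here
    return tokens & known_classes


# Hypothetical data standing in for a class comment and the image's classes.
known_classes = {"OrderedCollection", "Set", "Color"}
comment = "I wrap an OrderedCollection. My OrderedCollection is never nil."

print(sorted(class_references(comment, known_classes)))
```

Here `My` matches the pattern but is dropped because no class has that name, and `OrderedCollection` is reported once even though it appears twice.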
With these results, I selected the 20 classes with the largest number of references in their comments and checked the recall and precision ratios. The table below summarizes 5 of them, to avoid overcrowding the blog post with all 20 classes:
Class Name | Recall ratio | Precision ratio |
EncoderForSistaV1 | 1 | 0.72 |
IRBytecodeGenerator | 1 | 0.86 |
ClyNavigationEnvironment | 1 | 0.86 |
RSLocation | 1 | 0.25 |
SettingBrowser | 1 | 0.73 |
On one hand, the recall ratio, also known as the true positive rate, measures the ability to find all the true instances. It is high when the number of missed results (false negatives) is low. On the other hand, the precision ratio measures the ability to report only relevant matches and avoid irrelevant ones (false positives). It is high when the number of incorrect matches is low.
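As a quick illustration of the two ratios, here is a minimal Python sketch. The function names and the counts are hypothetical and are not taken from the experiment.

```python
def precision(true_positives, false_positives):
    """Share of reported matches that are real class references."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of real class references that were reported."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts: 18 correct matches, 7 incorrect ones,
# and no reference missed.
print(round(precision(18, 7), 2))
print(recall(18, 0))
```

With no false negatives the recall is 1, whatever the number of false positives, which is why the two ratios have to be read together.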
In the table above, the recall ratio for all classes is 1, indicating that the regular expression successfully matched every UpperCamelCase token in the comments. Results can only be missed if a class name is misspelled or written without following the naming convention. The precision ratio is lower because certain words used for other purposes were incorrectly identified as class references, producing false positives. Some of them start a sentence (e.g. Set the variable …) and others open a code block (e.g. ``` Smalltalk ```).
Analysis over method comments
Running the same search over methods, I found 4,062 classes containing 30,340 commented methods, with a total of 40,422 comments. On closer examination, 7.5% of these commented methods, spread over 368 classes, contain references to existing classes in the environment, for a total of 3,270 references.
With these results, I selected the 20 classes with the largest number of references in their method comments and checked the recall and precision ratios. The table below summarizes 5 of them, to avoid overcrowding the blog post with all 20 classes:
Class Name | Number of commented methods with references | Recall ratio | Precision ratio |
RSLineBuilder | 12 | 1 | 0.94 |
RSNormalizer | 7 | 1 | 0.82 |
RSLabel | 9 | 1 | 0.86 |
Color | 26 | 1 | 0.95 |
LargeInteger | 17 | 1 | 1 |
Boosting performance
Based on the results above, precision could be significantly improved by excluding certain tokens, such as Smalltalk, Set, True, False, Key, etc., which are known to produce false positives. In addition, excluding code blocks from comments (i.e., everything between ``` Smalltalk and the closing ```) would not only speed up the search but also ensure that only class names mentioned in the prose of a comment are returned, which improves the precision ratio further.
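The two enhancements can be sketched together. This is a minimal Python illustration under stated assumptions, not the search algorithm used in the experiment: code blocks are assumed to be delimited by three back-ticks, the stop-list contains only the tokens named above, and `filtered_tokens` and the sample comment are hypothetical.

```python
import re

FENCE = "`" * 3  # three back-ticks, assumed to delimit code blocks
TOKEN_PATTERN = re.compile(r"\b[A-Z][a-zA-Z\d]+\b")
# Anything between two fences, newlines included, as short as possible.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
# Tokens named in the article as frequent false positives.
STOP_TOKENS = {"Smalltalk", "Set", "True", "False", "Key"}


def filtered_tokens(comment):
    """Distinct UpperCamelCase tokens outside code blocks and the stop-list."""
    prose = CODE_BLOCK.sub(" ", comment)  # drop code blocks first
    return set(TOKEN_PATTERN.findall(prose)) - STOP_TOKENS


# Hypothetical comment with one sentence and one code block.
comment = (
    "Set the color with a Color instance.\n"
    f"{FENCE}Smalltalk\n"
    "RSLabel new color: Color red\n"
    f"{FENCE}\n"
)
print(sorted(filtered_tokens(comment)))
```

Only `Color` from the sentence survives: `Set` is on the stop-list and `RSLabel` sits inside the code block. A blunt stop-list also discards genuine references to classes such as `Set`, so it trades some recall for precision.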
This work has concentrated on class names rather than method names. Searching for method references in comments is likely to be more challenging, for two main reasons. First, Pharo's method naming convention is camelCase starting with a lowercase letter, which increases the risk of false positives: many ordinary words used in a sentence to describe some functionality coincide with method names. Second, linking a token to its method definition is harder, because the same method name may be defined in multiple classes, which complicates association and retrieval.
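The second difficulty can be illustrated with a small Python sketch. The `methods` list, the `implementors` index and the `resolve` function are hypothetical and only stand in for the methods of an image.

```python
from collections import defaultdict

# Hypothetical (class, selector) pairs standing in for the image's methods.
methods = [
    ("OrderedCollection", "add:"),
    ("Set", "add:"),
    ("Color", "lighter"),
]

# Index every selector by the classes that define it.
implementors = defaultdict(set)
for class_name, selector in methods:
    implementors[selector].add(class_name)


def resolve(selector):
    """Return the single defining class, or None if absent or ambiguous."""
    classes = implementors.get(selector, set())
    return next(iter(classes)) if len(classes) == 1 else None


print(resolve("lighter"))
print(resolve("add:"))
```

A class name resolves to at most one class, whereas `add:` in this example has two candidate definitions, so the token alone is not enough to pick one.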
To address the problem, my initial idea was to use AI to identify the tokens that refer to classes and methods. I discussed this idea with AI experts, but they discouraged it for this particular project. They explained that a considerable amount of training data, tens of thousands of records, would be necessary for a model to accurately detect tokens in fewer than 10,000 methods and classes. In any case, the precision and recall ratios are already high, and the enhancements suggested above could improve them further.
Why do we need this?
Following this experiment, I see several reasons why this could be useful:
- Enhancing Code Documentation: Programs are commented to help other developers understand the purpose and usage of each piece of code. When class names are clearly mentioned in comments, they provide additional context about the classes and their relationships, making the codebase easier to understand.
- Navigation: In Pharo, every class name enclosed in back-ticks within a comment, like `OrderedCollection`, becomes clickable in view mode, which helps the developer navigate to the class definition. However, during our search, we noticed that many class names are not marked in this manner. So why not take advantage of this search and enclose in back-ticks every token found in comments that resolves to an existing class?
- Maintenance: During this search I also found tokens that conform to Pharo's naming convention for classes but refer to classes that no longer exist or are deprecated. I think this could be solved by extending the rename-class refactoring in Pharo so that it updates references not only in source code but also in comments, limited to those enclosed in back-ticks. For this to work, however, comments would first have to be refactored so that class names are enclosed correctly.
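The idea from the Navigation point can be sketched as follows. This is a minimal Python illustration rather than an existing Pharo tool: `mark_class_references`, the sample comment and the `known_classes` set are hypothetical, and the pattern only recognizes a token as already marked when a back-tick touches it directly.

```python
import re

# UpperCamelCase token with no back-tick directly before or after it.
UNMARKED_TOKEN = re.compile(r"(?<!`)\b[A-Z][a-zA-Z\d]+\b(?!`)")


def mark_class_references(comment, known_classes):
    """Enclose tokens naming a known class in back-ticks; leave the rest."""
    def wrap(match):
        token = match.group(0)
        return f"`{token}`" if token in known_classes else token

    return UNMARKED_TOKEN.sub(wrap, comment)


# Hypothetical data standing in for a comment and the image's classes.
known_classes = {"OrderedCollection", "Color"}
comment = "Returns an OrderedCollection of `Color` values. See OldPalette."
print(mark_class_references(comment, known_classes))
```

In this example `OrderedCollection` gets enclosed, the already marked `Color` is left alone, and `OldPalette` stays untouched because it does not resolve to a class. Tokens of that last kind are the candidates to review under the Maintenance point.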