-
Notifications
You must be signed in to change notification settings - Fork 1
Add 'recognized_source_url' data and 'has_recognized_source' metric #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #30 +/- ##
==========================================
+ Coverage 18.54% 21.39% +2.84%
==========================================
Files 31 32 +1
Lines 1332 1351 +19
==========================================
+ Hits 247 289 +42
+ Misses 1085 1062 -23 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
dgkf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great. I left a lot a few suggestions, but don't take that as a sign that this isn't phenomenal work.
The most significant suggested change is regarding the decision to return NA when no url is found. Instead, I think it would be more appropriate to return a 0-length vector. This will just help to avoid the need for any special NA-handling for anyone that wants to operate on this data.
Really appreciate the thorough descriptors, test cases and clean code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to R/data_recognized_source.R to match the new name?
| "The URL of the package's source control repository on a recognized", | ||
| "hosting platform, determined by matching against an allow-list of", | ||
| "known domains. Extracted from the URL and BugReports fields in the", | ||
| "DESCRIPTION file. The allow-list can be customized; see ?options for", | ||
| "details." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description field allows for Rd-style infotex syntax so that we can get rich formatting rendered to html/text. Most importantly, we can wrap things in \\code{...} when applicable:
| "The URL of the package's source control repository on a recognized", | |
| "hosting platform, determined by matching against an allow-list of", | |
| "known domains. Extracted from the URL and BugReports fields in the", | |
| "DESCRIPTION file. The allow-list can be customized; see ?options for", | |
| "details." | |
| "The \\acronym{URL} of the package's source control repository on a", | |
| "recognized hosting platform, determined by matching against an", | |
| "allow-list of known domains. Extracted from the \\acronym{URL} and ", | |
| "\\code{BugReports} fields in the \\code{DESCRIPTION} file. The ", | |
| "allow-list can be customized; see \\code{?options} for details." |
| # Try to find a URL matching known source control domains | ||
| # We use case-insensitive matching for domains | ||
| for (url in all_urls) { | ||
| for (domain in source_control_domains) { | ||
| if (grepl(domain, url, ignore.case = TRUE)) { | ||
| return(url) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| # No known source control URL found | ||
| NA_character_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a unique case, but it could be possible to have multiple source code urls (for example, if a package migrated from GitHub to codeberg, and is mirroring changes across both like I do for the options package).
In those situations, I think we might want to return all recognized domains.
Also, since our domains contain regex special characters (most importantly, the .), we need to either escape these characters or match on the fixed string (treating it as a raw string instead of a regex pattern).
| # Try to find a URL matching known source control domains | |
| # We use case-insensitive matching for domains | |
| for (url in all_urls) { | |
| for (domain in source_control_domains) { | |
| if (grepl(domain, url, ignore.case = TRUE)) { | |
| return(url) | |
| } | |
| } | |
| } | |
| # No known source control URL found | |
| NA_character_ | |
| is_src_url <- vapply( | |
| tolower(source_control_domains), | |
| grepl, | |
| logical(length(all_urls)), | |
| x = tolower(all_urls), | |
| fixed = TRUE | |
| ) | |
| source_control_domains[colSums(is_src_url) > 0] |
This uses a vectorized approach, but it's really not performance-critical. There are a couple quirks in here.
- Generally we'd use
grepl(fixed = TRUE), but this is incompatible withignore.case. Instead, we just convert both strings to lowercase during comparison. - I also opted to return a vector of length 0 instead of
NA_character_when no recognized url is found. This helps keep the function more type-stable.
| "Indicates whether the package has a source code repository on a", | ||
| "recognized hosting platform from an allow-list of known domains.", | ||
| "Inferred from the URL and BugReports fields in the DESCRIPTION file.", | ||
| "See ?options for customizing the allow-list." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, here we can use Rd syntax
| "Indicates whether the package has a source code repository on a", | |
| "recognized hosting platform from an allow-list of known domains.", | |
| "Inferred from the URL and BugReports fields in the DESCRIPTION file.", | |
| "See ?options for customizing the allow-list." | |
| "Indicates whether the package has a source code repository on a", | |
| "recognized hosting platform from an allow-list of known domains.", | |
| "Inferred from the \\acronym{URL} and \\code{BugReports} fields in the", | |
| "\\code{DESCRIPTION} file. See \\code{?options} for customizing the", | |
| "allow-list." |
| "See ?options for customizing the allow-list." | ||
| ), | ||
| function(pkg, resource, field, ...) { | ||
| !is.na(pkg$recognized_source_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you agree with the above changes, instead returning a 0-length vector when no source url is discovered, then this would change to
| !is.na(pkg$recognized_source_url) | |
| length(pkg$recognized_source_url) > 0L |
This implements metric described in #13 . As discussed, it uses allow-list approach. The list of repositories is, of course, up for the debate.
I decided to name it "has recognized source" to underline the fact that it is based on the known list, not universal.
I tried to follow other standards from the package, including using
{options}for getting and setting the allow list.