Skip to content

Conversation

@dominik-appsilon
Copy link

This implements metric described in #13 . As discussed, it uses allow-list approach. The list of repositories is, of course, up for the debate.

I decided to name it "has recognized source" to underline the fact that it is based on the known list, not universal.

I tried to follow other standards from the package, including using {options} for getting and setting the allow list.

@codecov-commenter
Copy link

codecov-commenter commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 5.26316% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 21.39%. Comparing base (5f49059) to head (667bdee).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
R/data_source_control.R 5.26% 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #30      +/-   ##
==========================================
+ Coverage   18.54%   21.39%   +2.84%     
==========================================
  Files          31       32       +1     
  Lines        1332     1351      +19     
==========================================
+ Hits          247      289      +42     
+ Misses       1085     1062      -23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dominik-appsilon dominik-appsilon marked this pull request as draft December 18, 2025 14:01
@dominik-appsilon dominik-appsilon marked this pull request as ready for review December 18, 2025 15:22
Copy link
Collaborator

@dgkf dgkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great. I left a lot a few suggestions, but don't take that as a sign that this isn't phenomenal work.

The most significant suggested change is regarding the decision to return NA when no url is found. Instead, I think it would be more appropriate to return a 0-length vector. This will just help to avoid the need for any special NA-handling for anyone that wants to operate on this data.

Really appreciate the thorough descriptors, test cases and clean code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to R/data_recognized_source.R to match the new name?

Comment on lines +9 to +13
"The URL of the package's source control repository on a recognized",
"hosting platform, determined by matching against an allow-list of",
"known domains. Extracted from the URL and BugReports fields in the",
"DESCRIPTION file. The allow-list can be customized; see ?options for",
"details."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description field allows for Rd-style infotex syntax so that we can get rich formatting rendered to html/text. Most importantly, we can wrap things in \\code{...} when applicable:

Suggested change
"The URL of the package's source control repository on a recognized",
"hosting platform, determined by matching against an allow-list of",
"known domains. Extracted from the URL and BugReports fields in the",
"DESCRIPTION file. The allow-list can be customized; see ?options for",
"details."
"The \\acronym{URL} of the package's source control repository on a",
"recognized hosting platform, determined by matching against an",
"allow-list of known domains. Extracted from the \\acronym{URL} and ",
"\\code{BugReports} fields in the \\code{DESCRIPTION} file. The ",
"allow-list can be customized; see \\code{?options} for details."

Comment on lines +39 to +50
# Try to find a URL matching known source control domains
# We use case-insensitive matching for domains
for (url in all_urls) {
for (domain in source_control_domains) {
if (grepl(domain, url, ignore.case = TRUE)) {
return(url)
}
}
}

# No known source control URL found
NA_character_
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a unique case, but it could be possible to have multiple source code urls (for example, if a package migrated from GitHub to codeberg, and is mirroring changes across both like I do for the options package).

In those situations, I think we might want to return all recognized domains.

Also, since our domains contain regex special characters (most importantly, the .), we need to either escape these characters or match on the fixed string (treating it as a raw string instead of a regex pattern).

Suggested change
# Try to find a URL matching known source control domains
# We use case-insensitive matching for domains
for (url in all_urls) {
for (domain in source_control_domains) {
if (grepl(domain, url, ignore.case = TRUE)) {
return(url)
}
}
}
# No known source control URL found
NA_character_
is_src_url <- vapply(
tolower(source_control_domains),
grepl,
logical(length(all_urls)),
x = tolower(all_urls),
fixed = TRUE
)
source_control_domains[colSums(is_src_url) > 0]

This uses a vectorized approach, but it's really not performance-critical. There are a couple quirks in here.

  • Generally we'd use grepl(fixed = TRUE), but this is incompatible with ignore.case. Instead, we just convert both strings to lowercase during comparison.
  • I also opted to return a vector of length 0 instead of NA_character_ when no recognized url is found. This helps keep the function more type-stable.

Comment on lines +62 to +65
"Indicates whether the package has a source code repository on a",
"recognized hosting platform from an allow-list of known domains.",
"Inferred from the URL and BugReports fields in the DESCRIPTION file.",
"See ?options for customizing the allow-list."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, here we can use Rd syntax

Suggested change
"Indicates whether the package has a source code repository on a",
"recognized hosting platform from an allow-list of known domains.",
"Inferred from the URL and BugReports fields in the DESCRIPTION file.",
"See ?options for customizing the allow-list."
"Indicates whether the package has a source code repository on a",
"recognized hosting platform from an allow-list of known domains.",
"Inferred from the \\acronym{URL} and \\code{BugReports} fields in the",
"\\code{DESCRIPTION} file. See \\code{?options} for customizing the",
"allow-list."

"See ?options for customizing the allow-list."
),
function(pkg, resource, field, ...) {
!is.na(pkg$recognized_source_url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you agree with the above changes, instead returning a 0-length vector when no source url is discovered, then this would change to

Suggested change
!is.na(pkg$recognized_source_url)
length(pkg$recognized_source_url) > 0L

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants