Wikimedia cross-wiki coordination and L10n/i18n. Mainly active on Wikiquote, Wiktionary, Wikisource, Commons, Wikidata, Wikibooks. And of course Meta-Wiki, translatewiki.net.
Contact me by MediaWiki.org email or user talk.
Also, I've tried the link from a recent post and it doesn't even work: it produces an empty post after one or two redirects. It seems nobody is using those links, as nobody has noticed.
Another reason to do this is that Facebook doesn't even allow sharing links to some Wikimedia projects.
Thanks for the update on the XML data dumps list. I see there's progress on the other side: https://phabricator.wikimedia.org/T382947#10476420 . Hopefully this will allow the dumps to be re-enabled soon.
IIRC these (and the OAI feeds) were added back in the day when the WMF got some corporate contribution to provide specialised data feeds. I imagine any contractual obligations have long expired (if they even existed), but I don't know who could verify that.
The query itself will remain, so getting fresh results should be nothing more than a submit query away.
By running more tests and using the Mann-Whitney U test we can tell whether a performance regression is statistically significant. That way we can make sure we only alert on real regressions, which decreases the number of false alerts and the time spent investigating them.
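An illustrative sketch of the idea (not the actual alerting code): compare timing samples from two runs with a one-sided Mann-Whitney U test and alert only when the slowdown is statistically significant. The normal approximation, the alpha threshold, and all names here are assumptions for illustration.

```python
# Sketch, assuming timing samples in milliseconds from a baseline run and a
# candidate run. Uses a normal approximation to the U distribution.
import math

def _midranks(values):
    """Ranks (1-based), averaging ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_p(x, y):
    """One-sided p-value for H1: x tends to be smaller than y."""
    n1, n2 = len(x), len(y)
    ranks = _midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # tie correction omitted
    z = (u1 - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)

def is_regression(baseline_ms, candidate_ms, alpha=0.01):
    """Alert only when the candidate run is significantly slower."""
    return mann_whitney_p(baseline_ms, candidate_ms) < alpha

baseline = [101, 99, 103, 100, 98, 102, 100, 101, 99, 100]
noisy = [104, 97, 105, 99, 96, 103, 100, 102, 98, 101]       # same speed, noisier
slower = [121, 119, 123, 120, 118, 122, 120, 121, 119, 120]  # real regression

print(is_regression(baseline, noisy))   # False: noise alone doesn't alert
print(is_regression(baseline, slower))  # True: a genuine slowdown alerts
```

The point of the rank test is that it makes no normality assumption about the timing distributions, so noisy but unshifted samples don't trigger an alert.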
We certainly don't want to be in the way. Feel free to delete the VMs. I was hoping to double-check there's nothing to salvage in the local mounts, but usually there shouldn't be anything anyway.
As an update, I created the account and luckily we were still in time for this round of submissions (CLDR 46). It's always a good time to ask me for a CLDR account! Six months tend to fly by.
Maybe it could be retrieved from a very early dump or by some other means.
@Hydriz Can I upgrade the VMs to Debian 11 one of these weekends? The only reason not to that I can think of is that some scripts may require Python 2, but that's still available in Debian 11.
@HShaikh Please don't propagate myths. https://aeon.co/essays/the-tragedy-of-the-commons-is-a-false-and-dangerous-myth
I'm closing this task as unclear and not pertaining to MediaWiki core, mostly because it mixes different user groups and permissions, some of which are Wikimedia-specific.
This reminds me a bit of https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool , which I believe focused on identifying easy concepts like numbers. I've not used it in years.
https://www.mediawiki.org/wiki/Special:RecentChanges?useskin=vector&uselang=ksh after disabling JavaScript:
@Mazevedo Here's an example old ticket which may or may not be relevant any more. :)
Do you want to focus on the exonyms in languages which are supported by MediaWiki core (or at least translatewiki.net) but not in CLDR?
That was with all namespaces.
Current status
After the latest run
Mostly fixed upstream.
Not clear to me why this doi:10.1038/s41586-023-06291-2 got an arXiv ID but not a PMC ID https://en.wikipedia.org/w/index.php?title=PubMed&diff=prev&oldid=1195324840
The new round seems to be going fine so far https://en.wikipedia.org/w/index.php?title=Special:Contributions/OAbot&target=OAbot&dir=prev&offset=20240107000000&limit=50
The non-Unpaywall side continues at T228702.
We're still discarding excess merges from Dissemin, similar to the 2019 logic https://github.com/dissemin/oabot/commit/e3c74bff735c1ef16ee333dde2ac4bdd20949635 . We're not currently using the Dissemin title matches, but if we did, it would not be enough to check for a title, author, and year match: https://en.wikipedia.org/w/index.php?title=User_talk%3AOAbot&diff=1194216712&oldid=1193993325 .
There are over 6.5 million PMC matches and only some 640k matches by title and author, of which some 62k appear without a PMCID match, so perhaps we can just ignore those europepmc matches:
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via pmcid lookup)"
6499014
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via OAI-PMH title and first author match)"
637491
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep "oa repository (via OAI-PMH title and first author match)" | grep -vc "oa repository (via pmcid lookup)"
62310
Both papers on Unpaywall have evidence "oa repository (via OAI-PMH title and first author match)" although the PMC side exposes a link to the correct DOI. The CrossRef API has the page range like "113-128", "283-288", so it may be possible to check for the number of pages.
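A minimal sketch of the page-count check suggested above: derive the number of pages implied by a CrossRef-style page field such as "113-128", so it could be compared with the page count of the candidate full text. The function name and regex are assumptions, not OAbot code.

```python
# Hypothetical helper: turn a CrossRef "page" field into a page count.
import re

def pages_from_range(page_field):
    """Return the page count implied by a range like '113-128', or None."""
    m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", page_field.strip())
    if not m:
        return None  # single page, article number, roman numerals, etc.
    first, last = int(m.group(1)), int(m.group(2))
    if last < first:
        return None  # truncated ranges like "283-8" need extra handling
    return last - first + 1

print(pages_from_range("113-128"))   # 16
print(pages_from_range("283-288"))   # 6
print(pages_from_range("e0123456"))  # None (article identifier, not a range)
```

Page fields are messy (truncated ranges, article IDs, en dashes), so this could only be a heuristic filter, not a hard check.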
So we won't suggest edits like this either https://en.wikipedia.org/w/index.php?title=Saccharomyceta&curid=68064105&diff=1194087545&oldid=1182890284 as we don't get non-repository URLs from other sources.
A sample of what kind of URLs we're talking about
Only 35k or so of these are in the best_oa_location (sometimes even when a separate match for arxiv exists, like doi:10.1002/rsa.20071 / oai:CiteSeerX.psu:10.1.1.237.8456 / oai:arXiv.org:math/0209357 ).
Not sure how to narrow this down; we're talking about some 500k matches from CiteSeerX (out of 900k):
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep citeseerx | grep "oa repository (via OAI-PMH doi match)" | jq -r 'select(.oa_locations | .[] | .endpoint_id == "CiteSeerX.psu" and .evidence == "oa repository (via OAI-PMH doi match)" )|.doi' | wc -l
505747
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep -c citeseerx
887759
Another example where URL priorities changed: https://en.wikipedia.org/w/index.php?title=Balbinot_1&diff=prev&oldid=1193722831 (but there was no doi-access=free).
The recent change to sort all URLs https://github.com/dissemin/oabot/commit/ddab25a5ee71e2f23fe4b8dfb5a28c8da333a922 allowed the bot to perform https://en.wikipedia.org/w/index.php?title=Serafim_Kalliadasis&diff=prev&oldid=1193717235 , while previously it would probably only have suggested the first URL https://eprints.qut.edu.au/134215/1/134215p.pdf . http://hdl.handle.net/10044/1/55290 is the 3rd suggestion from Unpaywall and https://arxiv.org/abs/1609.05938 is the 8th.
That's hopefully fixed in https://github.com/dissemin/oabot/commit/1cd61525a8cc5d8378e60f63555cf291e1bb4660
I've manually updated the leaderboard with https://github.com/nemobis/oabot/commit/4917289ac7b49ca5176129d9f19ae5355ac84b72
The last row created was
https://en.wikipedia.org/w/index.php?title=Lyman_E._Johnson&diff=prev&oldid=1191724248 was not supposed to happen as the existing URL returns a PDF.
Latest run
Still room for improvement
Some doi-access=free being re-added now:
$ find -maxdepth 1 -type f -print0 | xargs -0 -P16 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("doi-access=free")) | .orig_string' | grep doi | grep -Eo 'doi *= *[^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+(\.([a-z]{,8}|[0-9-]{9})\b)?' | sort | uniq -c | sort -nr | head -n 40
    546 10.1146/annurev
    409 10.1007/s
    186 10.4202/app.
    178 10.1016/j.
    176 10.1016/j.cub
    156 10.1126/science.
    124 10.1038/s
     96 10.1016/j.cretres
     84 10.1111/pala.
     78 10.1017/jpa.
     72 10.1074/jbc.
     66 10.1002/ar.
     61 10.5252/geodiversitas
     56 10.11646/zootaxa.
     52 10.5852/ejt.
     52 10.5852/cr
     52 10.1016/j.palaeo
     52 10.1002/spp
     48 10.1016/j.jhevol
     46 10.1093/zoolinnean
     44 10.5962/bhl.part
     44 10.1111/j.
     42 10.1016/s
     41 10.3140/bull.geosci
     39 10.1016/j.cell
     39 10.1002/ajb
     38 10.4049/jimmunol.
     38 10.1017/pab.
     33 10.1038/nature
     32 10.1111/j.1475-4983
     31 10.37828/em.
     31 10.1093/mnras
     28 10.1111/j.1096-3642
     27 10.5962/p.
     27 10.2476/asjaa.
     25 10.7203/sjp.
     25 10.1016/j.revpalbo
     23 10.1002/ajpa.
     21 10.24425/agp.
     21 10.1093/bioinformatics
Currently with some 160k pages found:
$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
  15725 www.jstor.org
  14451 dx.doi.org
  12927 doi.org
   9520 www.sciencedirect.com
   6442 www.researchgate.net
   5630 www.tandfonline.com
   5491 onlinelibrary.wiley.com
   4498 www.cambridge.org
   3824 pubmed.ncbi.nlm.nih.gov
   3477 link.springer.com
   3182 muse.jhu.edu
   3024 linkinghub.elsevier.com
   2928 www.nature.com
   2770 journals.sagepub.com
   2065 www.academia.edu
   1934 pubs.acs.org
   1896 academic.oup.com
   1736 www.persee.fr
   1520 www.science.org
   1473 semanticscholar.org
   1247 www.journals.uchicago.edu
   1210 archive.org
   1128 books.google.com
    956 ieeexplore.ieee.org
    854 www.oxforddnb.com
    789 brill.com
    707 doi.wiley.com
    646 www.semanticscholar.org
    620 zenodo.org
    571 www.degruyter.com
After a broader run
$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
   3020 dx.doi.org
   2666 www.jstor.org
   2569 doi.org
   2116 www.sciencedirect.com
   1217 www.researchgate.net
   1105 onlinelibrary.wiley.com
   1011 www.tandfonline.com
    822 www.cambridge.org
    789 pubmed.ncbi.nlm.nih.gov
    748 linkinghub.elsevier.com
    685 link.springer.com
    630 www.nature.com
    522 journals.sagepub.com
    453 muse.jhu.edu
    435 pubs.acs.org
    361 www.academia.edu
    351 semanticscholar.org
    341 academic.oup.com
    338 www.science.org
    301 archive.org
    244 www.persee.fr
    210 www.journals.uchicago.edu
    187 books.google.com
    180 ieeexplore.ieee.org
    157 pubs.geoscienceworld.org
    150 doi.wiley.com
    149 www.semanticscholar.org
    120 pubs.rsc.org
    119 brill.com
    108 link.aps.org
How to sample JSTOR DOIs which look closed:
$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("doi-access=|")) | .orig_string' | grep 2307 | grep -Eo "10.2307/[0-9]+" | sort | shuf -n 40
Currently the most represented domains would be:
$ find -maxdepth 1 -type f -mtime -1 -print0 | xargs -0 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
    916 dx.doi.org
    723 www.sciencedirect.com
    658 doi.org
    519 www.jstor.org
    312 onlinelibrary.wiley.com
    292 linkinghub.elsevier.com
    267 www.researchgate.net
    221 www.tandfonline.com
    218 www.cambridge.org
    204 link.springer.com
    182 pubmed.ncbi.nlm.nih.gov
    179 www.nature.com
    152 journals.sagepub.com
    131 pubs.acs.org
    102 www.science.org
     94 academic.oup.com
     93 semanticscholar.org
     87 archive.org
     79 www.academia.edu
     74 pubs.geoscienceworld.org
     55 doi.wiley.com
     54 www.journals.uchicago.edu
     52 pubs.rsc.org
     50 muse.jhu.edu
     49 www.semanticscholar.org
     47 ieeexplore.ieee.org
     43 iopscience.iop.org
     42 link.aps.org
     37 xlink.rsc.org
     35 aip.scitation.org
Need to check how many url-access=limited we'd add to non-DOI citations like AdsAbs https://en.wikipedia.org/w/index.php?title=T_Scorpii&diff=prev&oldid=1188735108
We should not replace an existing url-access with another for the same URL, as happened in https://en.wikipedia.org/w/index.php?title=Soft_skills&diff=prev&oldid=1188731807 (even though I'd argue the archive.org inlibrary items are more "limited" than "registration").
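A minimal sketch of the guard this implies: before proposing url-access, check whether the citation already carries one and, if so, propose nothing. The regex and function names are made up for illustration; the real bot parses templates properly rather than regexing wikitext.

```python
# Hypothetical guard: never replace an existing url-access value.
import re

def has_url_access(citation_wikitext):
    """True if the template already sets a url-access parameter."""
    return re.search(r"\|\s*url-access\s*=\s*\w", citation_wikitext) is not None

def propose_url_access(citation_wikitext, value):
    """Return a proposed parameter addition, or None if one already exists."""
    if has_url_access(citation_wikitext):
        return None  # leave the editor-chosen value alone
    return "|url-access=" + value

cite = "{{cite book |title=Example |url=https://archive.org/details/x |url-access=registration}}"
print(propose_url_access(cite, "limited"))  # None: parameter already present
```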
I've manually deleted the older suggestions so now the numbers will be lower.
find ~/www/python/src/bot_cache -mtime +3 -delete
Some ISSNs
$ find ~/www/python/src/bot_cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep issn | grep -Eo 'issn *= *[0-9-]{8,9}' | grep -Eo '[0-9-]{8,9}' | sort | uniq -c | sort -nr | head -n 40
     87 0036-8075
     46 0004-637
     45 1476-4687
     45 0004-6256
     39 0191-2917
     39 0098-7484
     33 0028-0836
     28 1044-0305
     25 0067-0049
     24 0080-4606
     24 0021-8693
     19 2156-2202
     19 1396-0466
     18 1538-4365
     17 0148-0227
     17 0031-4005
     17 0022-0949
     16 0950-9232
     16 0304-3975
     16 0278-2715
     16 0140-6736
     16 0035-8711
     16 0028-646
     16 0002-7294
     15 1944-8007
     15 1538-4357
     15 0301-4223
     15 0031-949
     15 0006-3568
     15 0003-9926
     14 2330-4804
     14 1475-4983
     14 0271-5333
     13 0272-4634
     13 0097-3165
     13 0080-4630
     12 2515-5172
     12 1631-0683
     12 1364-5021
     12 0094-8276
Or to catch some more ISSNs:
$ find ~/www/python/src/bot_cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep doi= | grep -Eo 'doi *=[^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+\b(\.?([a-z]{,8}|[0-9-]{8,9})\b)?' | sort | uniq -c | sort -nr | head -n 40
    390 10.1126/science.
    260 10.1001/jama.
    244 10.1074/jbc.
    235 10.1038/sj.onc
    155 10.1098/rsbm.
    116 10.1098/rstb.
    111 10.1525/aa.
    110 10.1098/rspa.
    104 10.1242/jeb.
    104 10.1111/j.
    100 10.5210/fm.
    100 10.1377/hlthaff.
     99 10.1016/j.
     91 10.1098/rstl.
     86 10.1093/mnras
     74 10.1242/jcs.
     68 10.1167/iovs.
     68 10.1001/archinte.
     62 10.1542/peds.
     61 10.1111/j.1469-8137
     60 10.1098/rsta.
     57 10.1111/j.1558-5646
     55 10.1001/archneur.
     53 10.1111/j.1096-3642
     52 10.1001/archpsyc.
     48 10.3732/ajb.
     46 10.1002/art.
     43 10.1038/sj.mp
     43 10.1016/j.febslet
     42 10.1093/hmg
     41 10.1111/j.1432-1033
     41 10.1016/j.jacc
     40 10.1093/acrefore
     40 10.1001/archopht.
     39 10.1098/rspb.
     39 10.1093/molbev
     38 10.1001/archpedi.
     37 10.1242/dev.
     37 10.1111/j.1475-4983
     36 10.1016/j.jasms
Some of the most common DOI segments slated for doi-access=free removal in today's run:
$ find ~/www/python/src/bot_cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep doi= | grep -Eo 'doi *=[^"|]+' | grep -Eo '10\.[0-9]+/[a-z]+(\.([a-z]{,8}|[0-9-]{9})\b)?' | sort | uniq -c | sort -nr | head -n 40
    392 10.1126/science.
    351 10.1074/jbc.
    260 10.1001/jama.
    236 10.1038/sj.onc
    209 10.1007/s
    176 10.1016/s
    173 10.1038/s
    155 10.1098/rsbm.
    147 10.1146/knowable
    139 10.1038/d
    116 10.1098/rstb.
    111 10.1525/aa.
    110 10.1098/rspa.
    104 10.1242/jeb.
    104 10.1111/j.
    100 10.5210/fm.
    100 10.1377/hlthaff.
     99 10.1016/j.
     91 10.1098/rstl.
     86 10.1093/mnras
     76 10.1242/jcs.
     75 10.1167/iovs.
     68 10.1001/archinte.
     62 10.1542/peds.
     61 10.1111/j.1469-8137
     60 10.1098/rsta.
     57 10.1111/j.1558-5646
     55 10.1001/archneur.
     53 10.1111/j.1096-3642
     52 10.1001/archpsyc.
     48 10.3732/ajb.
     46 10.1038/nature
     46 10.1002/art.
     45 10.1038/sj.mp
     43 10.1016/j.febslet
     42 10.1111/j.1432-1033
     42 10.1093/hmg
     41 10.1016/j.jacc
     41 10.1007/bf
     40 10.1093/acrefore
You can look at the effect of the captcha on known-human users (e.g. IPs from some institutional range).
And currently
$ find ~/www/python/src/cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep url= | grep -Eo 'url=[^"|]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 40
   1427 doi.org
   1229 dx.doi.org
   1180 www.sciencedirect.com
    940 www.jstor.org
    875 web.archive.org
    736 onlinelibrary.wiley.com
    606 www.researchgate.net
    591 www.nature.com
    586 www.tandfonline.com
    408 www.cambridge.org
    376 archive.org
    337 link.springer.com
    328 linkinghub.elsevier.com
    310 www.escholarship.org
    302 journals.sagepub.com
    283 www.academia.edu
    265 academic.oup.com
    261 pubmed.ncbi.nlm.nih.gov
    259 www.biodiversitylibrary.org
    244 books.google.com
    238 www.science.org
    224 babel.hathitrust.org
    220 zenodo.org
    212 nrs.harvard.edu
    184 ieeexplore.ieee.org
    177 digitalcommons.law.yale.edu
    176 www.journals.uchicago.edu
    166 urn.kb.se
    164 pubs.acs.org
    123 www.bioone.org
    118 nbn-resolving.de
    117 philarchive.org
    110 muse.jhu.edu
    110 link.aps.org
    105 www.research.manchester.ac.uk
    100 bioone.org
     87 www.aeaweb.org
     86 www.osti.gov
     79 pubs.rsc.org
     77 dspace.lboro.ac.uk
I made reports upstream for Journal of Biological Chemistry (already fixed), Journal of Asian Studies/Duke University Press, Annual Review of Public Health, AAS journals, AME journals. I manually removed their doi-access=free removals in the queue (they were around 10 % of the total, I think, including all 10.1146/annurev DOIs some of which are not open yet).