Page MenuHomePhabricator

Commons database is growing way too fast
Open, Needs TriagePublic

Description

s4 is now the largest section by a wide margin:

s8 (noch eine Kopie).png (810×1 px, 157 KB)

It has basically doubled since 2019 (while many other sections got smaller with the optimizations in place). Many optimizations we have done improved the storage (e.g. the image metadata storage change in late 2021) but with this rate of growth no optimization can prevent major issues in a year or two.

It's just three tables that are growing really fast: categorylinks, templatelinks, externallinks (in total, they are responsible for 800GB). The rest don't seem to be too problematic:

table (Kopie).png (1×2 px, 452 KB)

By a quick look, I think we can do some easy fixes and it should drastically reduce the database growth:

For externallinks:

  • Use interwiki links/pagelinks instead of raw https links.

For templatelinks (Most used templates):

  • Merge some templates that are both heavily used and only used by the same users
  • Use redirect target in heavily used templates. e.g. Template:Location_dec is used 5 million times, using the redirect target removes 5m rows from templatelinks (I can give the list of heavily used redirect templates)
  • Migrate some functionalities to software or parser functions, etc. to avoid having a dedicated template used in basically every page (e.g. Template:Dir)

For categorylinks (Most used categories):

Templates and categories added by Module:SDC tracking might be contributing to the issue.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table. Maybe if it could be batched with something else, sure but not on its own.

But generally, thank you so much for cleaning these up and improve the database stability!

I actually missed this:

That’s a good question. As far as I can tell from looking at a random file on wikidata.org and on an external wiki with InstantCommons, both the TemplateStyles and Commons’ MediaWiki:Filepage.css are loaded on other wikis (Filepage code here), whereas its common.css isn’t loaded. So if we want to keep the behavior consistent, I guess MediaWiki:Filepage.css makes more sense. WDYT @Ladsgroup?

Than yes, please move it to Filepage.css

I specced out this proposal in some more detail here: https://commons.wikimedia.org/wiki/Module_talk:Information#Proposal:_Move_template_styles_to_site-wide_CSS

I just made this change to Module:Information, so most of the 87 million templatelinks to Module:Information/styles.css‏‎ should start to disappear now.

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table.

Good point. I was trying to create SQL query to find template redirects transcluded in other templates but failed so far. There a single edit will fix great many pages. Can anybody with better SQL skills help?

I also wrote Module:Self which implements 7 templates called from Template:Self including already discussed Template:Self/is-pd-expired. Template:Self is used on 36M pages.

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table.

Good point. I was trying to create SQL query to find template redirects transcluded in other templates but failed so far. There a single edit will fix great many pages. Can anybody with better SQL skills help?

I'll give it a try.

I also wrote Module:Self which implements 7 templates called from Template:Self including already discussed Template:Self/is-pd-expired. Template:Self is used on 36M pages.

Thank you so much! it looks super nice. Once the updates trickle in, I'll do an optimize table of templatelinks to reclaim some space.

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Let me know how it goes.

Change 997425 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Schema: Drop iwl_title from one of indexes

https://gerrit.wikimedia.org/r/997425

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Let me know how it goes.

I think I replaced most calls from templates to Template:Autotranslate, (also Template:Autotranslate/clone 1 and few more clones needed due to T196464 issue) with calls directly to Module:Autotranslate. In a month or two when the DB catches up with the change we will see if I missed any.

one way you can reduce future usage of redirects is make TemplateWizard and visualeditor insert the target whenever a redirect is selected.

Change 997425 merged by jenkins-bot:

[mediawiki/core@master] Schema: Drop iwl_prefix_from_title from iwlinks

https://gerrit.wikimedia.org/r/997425

Note: once T61245: Review the PageNotice extension for deployment is resolved, we can include some common wikitext (though rendered differently in different pages) in all file pages, whose content will not be treated as template link, so it will not be updated when the common message is changed.

Potentially we can automatically include an license template, or even infobox, but such will be confusing when there are still manual infobox and license template in wikitext. One solution is hide the automatic template (display:none) by default, and show them via TemplateStyles once the manual ones are removed; I am not sure whether it works.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

Unfortunately I had to revert the removal of Template:License_template_tag. It appears that the UploadWizard extension tests for the presence of {{License template tag}} when a custom license is specified. See mw.UploadWizardLicenseInput.js and CommonSettings.php.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

That template is used for many different things that you probably just broke. I see uploadwizard and no license bot already. Please don't do this in such an uncontrolled way. Please check what is using it and switch it before trying again.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

That template is used for many different things that you probably just broke. I see uploadwizard and no license bot already. Please don't do this in such an uncontrolled way. Please check what is using it and switch it before trying again.

We did identify some tools which were using it in the past, but it seems like no all of them. It was unexpected that Commons:UploadWizard was relying on it, especially since there are other ways of tracking presence of license template.

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

I have a question for MediaWiki developers.

Currently Module:Autotranslate, with 72M uses, is used to transclude the best language subtemplate based on user language. It uses Lua's mw.language.getFallbacksFor function to get fallback language codes and mw.title.exists to check which language subpages exist and which do not. Lua function mw.title.exists is considered an expensive function.

User @Tacsipacsi proposed a rewrite which replaces calls to mw.title.exists with calls directly to frame:expandTemplate with pcall function wrapper used to catch errors, like non-existing page. The rewrite seems to work exactly like the current version, but is not using any "expensive" functions and no rows are created in the pagelinks table. So the question is, is the rewrite a good idea? We can write new Lua function mimicking mw.title.exists using frame:expandTemplate with pcall wrapper, which according to current metrics would be less expensive and have smaller BD footprint. However I find it hard to believe that expanding a template is less expensive than checking if it exists, and I am weary that maybe the issue is with the metrics we currently use for tracking expensive calls.

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

I’ve started cleaning up Module:Caller title usage (now that T224810 is ready), which causes a large number of reparses. To help cleaning up the externallinks table, I halt this cleanup on Commons until the train arrives, so that those reparses also remove the external links as a side effect.

However I find it hard to believe that expanding a template is less expensive than checking if it exists, and I am weary that maybe the issue is with the metrics we currently use for tracking expensive calls.

Expanding a template is not less expensive than checking if it exists; it’s less expensive than checking if it exists and expanding it. In this particular module, there are two cases:

  • The given language subpage exists. In this case, currently we manually check if it exists, then expand the template, which consists of a check whether the template exists (MediaWiki cannot know that we’ve already done that check) and actually expanding the template. With my proposed code, the manual check is skipped, which is obviously a net win.
  • The given language doesn’t exist. In this case, currently we manually check if it exists, and finding that it doesn’t, we don’t go further. With my proposed code, the manual check is skipped, so it’s the MediaWiki transclusion code that does the existence check and finds that it fails, reporting the error. It’s the same amount of existence checks, so it shouldn’t have worse performance.

Rewrite of Module:Autotranslate discussed went live last week.

Ebrahim subscribed.

About Template:Dir used on 112M pages, likely dozen times on each page. That template is imbedded in almost every non-lua-based template that shows some text. Lua based templates (like Template:Information and other infoboxes) use mw.language:getDir function. We do have 181k templates on Commons and good fraction of them show something, so it would be a big job to replace that template. The easiest solution would be to add it to "software or parser functions" which would use the current format, the way {{!}} template was replaced by a parser function. Whatever way it is implemented it would be good to keep it in synch with Lua function. Shall we create a ticket for that?

Following your fantastic idea, T365189 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1032542

This comment was removed by Bugreporter.

An update: We optimized templatelinks in Commons and now it's 184GB which is roughly 100GB smaller than what it used to be \o/ It's still too large but much better than ~290GB it used to be.

A massive thank you to anyone who has been cleaning this up.

https://commons.wikimedia.org/w/index.php?search=deepcategory:"Files_with_no_machine-readable_author"+dpla

i noticed that out of the 7 million in Category:Files with no machine-readable author, 3.8 are DPLA uploads.

but they had author info in their source repos.

if the uploader could have done their job right...

With implementation of T359761 / T366623 and replacement of {{dir}} and {{bcp47}} templates with brand new parser functions, we should loose some more templatelinks (124M +5M to be precise). I also replaced many calls to {{Autotranslate}} templates in template namespace with the Lua calls directly to c:Module:Autotranslate which might reduce it even further.

This is a good practice I believe, we should see other most used templates in Commons https://commons.wikimedia.org/wiki/Special:MostTranscludedPages and replace them with more efficient solutions. Maybe we can even have a dedicated extension for Commons, if Commons database is that huge it shows how important it is and how more we should make the software more efficient for the sake of it, benefiting all the wikis at the end.

With implementation of T359761 / T366623 and replacement of {{dir}} and {{bcp47}} templates with brand new parser functions, we should loose some more templatelinks (124M +5M to be precise). I also replaced many calls to {{Autotranslate}} templates in template namespace with the Lua calls directly to c:Module:Autotranslate which might reduce it even further.

We currently have multiple ways of translating templates. Maybe have a look if we can switch more to the translation extension to replace autotranslate and int:lang? Not sure if we already have a task for that?

Maybe have a look if we can switch more to the translation extension to replace autotranslate and int:lang? Not sure if we already have a task for that?

We do have Commons:Translators_noticeboard, but not much of a push to move translations from old format to translation extension. I tried to convert a single template: Template:Sisterprojects/i18n since someone had edit request for it. After a over a week of effort I managed to copy 107 out of ~150 languages, before I moved to other stuff (I might come back and finish it). With 10's of thousands of templates translating things it might not be sustainable to switch to translation extension without some better tools for copying.

We discovered another major pain in the neck related to translation extension: when replacing {{dir}} and {{bcp47}} templates with new parser functions we changed non-translated parts of few thousand templates managed by translation extension, like this. None of those changes propagated to the language subtemplates without manual multi-step process of approving each affected template for re-translation. Wider use of translation extension would make changes like this even harder. I wonder if there are bots to help with basic translation extension tasks like mark for translation some list of pages.

Also a funny fact: pages on Commons that use translation extension use {{Autotranslate}} template for displaying the correct subtemplate. See for example Template:Upload_campaign_thanks_ADM20. So translation extension is only used for managing language subtemplates, but not choosing the right one.

We currently have multiple ways of translating templates. Maybe have a look if we can switch more to the translation extension to replace autotranslate and int:lang? Not sure if we already have a task for that?

My understanding is that the Translate extension folks (and/or other mediawiki developers broadly) don't want to support the Commons usecase of automatically translating templates into the interface language (which is what autotranslate does, even for templates where the actual translation is handled by the Translate extension).

This functionality is heavily used in user warning templates, where it could in theory be replaced with a link to a translated page elsewhere, but this would decrease the effectiveness of the warnings for non-English speakers. It is also vital on file description pages to translate license information into the interface language. In theory these templates could be replaced with an interface page that uses structured data to highlight important source, license, authorship, and attribution information for re-users, but that will require significant efforts on improving the structured data data model, upload and editing tool support, and backfilling for existing files.

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

Regarding this, there will still be 53,506,154 links to https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses/ coming from MediaWiki:Wm-license-cc-conditions-share_alike-text & friends. These cannot be replaced by the current :creativecommons: interwiki, as its target url is //creativecommons.org/licenses/$1.

I assume that changing that target URL is not feasible for compatibility reasons. Would it be desirable to add, like, a :ccorg: interwiki that targets //creativecommons.org/$1?

Thanks for the replies related to translating. Dead end for this task.

My understanding is that the Translate extension folks (and/or other mediawiki developers broadly) don't want to support the Commons usecase of automatically translating templates into the interface language (which is what autotranslate does, even for templates where the actual translation is handled by the Translate extension).

I would like to support this feature. There are similar cases in translatewiki that prevents from fully using translatable pages. Comments are welcome in T313748: Allow translatable templates to be shown in the user interface language.

I assume that changing that target URL is not feasible for compatibility reasons. Would it be desirable to add, like, a :ccorg: interwiki that targets //creativecommons.org/$1?

I think so, it's probably better than trying hacks like [[creativecommons:../foo]]. Proposed at https://meta.wikimedia.org/wiki/Talk:Interwiki_map#ccorg

Change #1082873 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/WikimediaMessages@master] Use ccorg interwiki for CC license text links

https://gerrit.wikimedia.org/r/1082873

Change #1082874 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/WikimediaMessages@master] Add new wm-license-cc-pd-mark-wikilink message

https://gerrit.wikimedia.org/r/1082874

The new ccorg interwiki has now been added, so I've submitted two patches:

Change #1082873 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use ccorg interwiki for CC license text links

https://gerrit.wikimedia.org/r/1082873

Change #1082874 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Add new wm-license-cc-pd-mark-wikilink message

https://gerrit.wikimedia.org/r/1082874

Change #1084925 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/WikimediaMessages@master] Remove now-unused wm-license-cc-pd-mark-link message

https://gerrit.wikimedia.org/r/1084925

Change #1084925 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Remove now-unused wm-license-cc-pd-mark-link message

https://gerrit.wikimedia.org/r/1084925

Noting that I applied a change to Module:Core. Previously, links to edit Wikidata statements were external; I converted them to internal.

The impact is hard to calculate. There are are over 66 million external links to Wikidata with the path LIKE "/wiki/Q%". The Creator template has about 5 million transclusions. Conservatively assuming an average of 3 edit links per transclusion, that'd be 15 million rows. (This could be much higher, especially with multiple transclusions on a single page, as IIRC only one transclusion per page is included in the count?)

I also applied changes to Template:Cc-by-sa-layout, Template:Cc-by-layout, Template:Cc-by-sa-layout/styles.css, and MediaWiki:Filepage.css. These should remove almost all of the templatelinks to Template:Cc-by-sa-layout/styles.css: just under 70 million.