Wikibase: Introduce separate database configuration for term store
Open, Needs Triage | Public | 8 Estimated Story Points

Description

To accommodate the growing storage usage, Wikidata will serve the terms-related tables from a separate server/cluster from the rest of the MediaWiki/Wikibase tables.

To make use of this, Wikibase's persistence logic needs to use the right database/cluster for the relevant queries.

Currently in Wikibase, database details and connections are modeled through DomainDb classes.
It is expected that Wikibases other than Wikidata won't need to separate their database tables, so the two "pointers" could still lead to one database. The "term-database" connection setting should be optional and, unless specified otherwise, default to the "general" MediaWiki/Wikibase database.
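
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not Wikibase code: `DbTarget`, `resolve_terms_target`, and the database names are hypothetical, and only the cluster names s8 and x3 (the latter tentative) come from this task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DbTarget:
    """Hypothetical description of where a group of tables lives."""
    cluster: Optional[str]  # None means the wiki's default cluster
    database: str


def resolve_terms_target(general: DbTarget, terms: Optional[DbTarget] = None) -> DbTarget:
    """Return the target for term store queries, defaulting to the general one."""
    return terms if terms is not None else general


# A third-party Wikibase that configures nothing: both "pointers" lead to one database.
wikibase_default = resolve_terms_target(DbTarget(None, "my_wikibase"))

# A Wikidata-like setup where the term store is pointed at a separate cluster
# (database name on the new cluster is an assumption).
wikidata_split = resolve_terms_target(
    DbTarget("s8", "wikidatawiki"),
    DbTarget("x3", "wikidatawiki"),
)
```

With this shape, callers never need to know whether a separate term database was configured; they always ask for the terms target.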

Original description for history/context below:


The term store is currently approaching 340 GB on Wikidata and is slowly heading back toward wb_terms-era sizes. To allow splitting s8 into a core cluster and a dedicated cluster for the term store (tentatively called x3), the term store needs to use a different domain DB than Wikidata's core DB (this should be configurable and initially point to the Wikidata DB). This would free up space for Wikidata, reduce write pressure, and allow for more horizontal scaling.

As an example, see how RepoDomainDb is injected in TermStoreWriterFactory or DatabaseTermStoreWriterBase.
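
To illustrate the injection pattern referred to here, the following is a toy sketch under stated assumptions: the `DomainDb` protocol, `FixedDomainDb`, and `TermStoreWriter` below are hypothetical stand-ins and do not reproduce the real Wikibase classes or their signatures. The point is only that the writer depends on whatever database object it is handed, so swapping the core object for a terms-specific one requires no change in the writer itself.

```python
from typing import Protocol


class DomainDb(Protocol):
    """Hypothetical minimal interface: something that names a write target."""

    def write_target(self) -> str: ...


class FixedDomainDb:
    """Hypothetical implementation that always returns one configured target."""

    def __init__(self, target: str) -> None:
        self._target = target

    def write_target(self) -> str:
        return self._target


class TermStoreWriter:
    """Toy writer that only reports which target it would write to."""

    def __init__(self, db: DomainDb) -> None:
        self._db = db

    def describe_write(self, term: str) -> str:
        return f"write {term!r} to {self._db.write_target()}"


repo_db = FixedDomainDb("core")
terms_db = FixedDomainDb("terms")

# Injecting the terms-specific object redirects the writer without touching its code.
writer = TermStoreWriter(terms_db)
```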

Event Timeline

Created T351820: Move Wikidata term store to separate database cluster as the more general task for the project suggested in the task description, since this task is limited to the code changes in Wikibase IIUC.

Does WPP in the title stand for Wikibase Product Platform Team WPP or something else? (Asking because the current subscribers are closer to Wikidata Dev Team / Wikidata Dev Team (Wikidata.org Slice), and I’m also not sure this task belongs with product platform.)

I’ll also be very interested in how moving the term store to a separate cluster, with (I assume) separate transactions, will affect the deadlocks we’re currently seeing in the term store (T283198).

Created T351820: Move Wikidata term store to separate database cluster as the more general task for the project suggested in the task description, since this task is limited to the code changes in Wikibase IIUC.

Thanks

Does WPP in the title stand for Wikibase Product Platform Team WPP or something else? (Asking because the current subscribers are closer to Wikidata Dev Team / Wikidata Dev Team (Wikidata.org Slice), and I’m also not sure this task belongs with product platform.)

I was asked by Itamar, maybe I misunderstood them :D

No misunderstanding at all :) I just meant that the following project tags should be added: wmde-wikidata-tech and Wikibase Product Platform Team WPP. Also, I think I was thrown off a bit by the parent task; let's keep all discussion here, I'd say.

More information from the parent task T351820#9352829:

Clarification: we will not do anything until at least the start of the next US FY (as we need to budget a couple more DBs for extra headroom), so we have at least six months.

Part of my confusion was because to me this feels more like our responsibility, but I’m also happy for the product platform team to take it over ^^

Using virtual domains should make this quite a bit easier (famous last words)

Hi, we are currently buying the hardware for this. Any updates?

So who’s responsible for this task at WMDE? The current project columns kind of sound like neither the Wikidata team nor the Wikibase Product Platform Team consider themselves responsible (“Radar” / “outside WPP”), which would be unfortunate.

@Ladsgroup hallochen, when would you ideally like an answer by?

As soon as possible. We had time a year ago.

WMDE-leszek renamed this task from "WPP: Make term store database configurable" to "Wikibase: Introduce separate database configuration for term store". Nov 25 2024, 1:54 PM
WMDE-leszek updated the task description.

Change #1102897 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Introduce TermsDomainDb

https://gerrit.wikimedia.org/r/1102897

Change #1104692 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] TermsDomainDb: Avoid ConnectionManager & ReplicationWaiter

https://gerrit.wikimedia.org/r/1104692

Change #1104964 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Make TermsDomainDb an interface

https://gerrit.wikimedia.org/r/1104964

Change #1102897 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Introduce TermsDomainDb

https://gerrit.wikimedia.org/r/1102897

Change #1104692 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] TermsDomainDb: Avoid ConnectionManager & ReplicationWaiter

https://gerrit.wikimedia.org/r/1104692

Change #1104964 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Make TermsDomainDb an interface

https://gerrit.wikimedia.org/r/1104964

Change #1105383 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Add virtual-wikibase-terms virtual domain

https://gerrit.wikimedia.org/r/1105383

Change #1108777 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Use flushSnapshot instead of write connection TRX flag

https://gerrit.wikimedia.org/r/1108777

Change #1108793 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/core@master] Add ILBFactory::getAutoCommitPrimaryConnection()

https://gerrit.wikimedia.org/r/1108793

Change #1108777 abandoned by Jakob:

[mediawiki/extensions/Wikibase@master] Use flushSnapshot instead of write connection TRX flag

Reason:

Doesn't work. Doing I4d3e872596d7fd7c994252dd892ff5691f59c0d0 instead.

https://gerrit.wikimedia.org/r/1108777

Change #1108793 merged by jenkins-bot:

[mediawiki/core@master] Add ILBFactory::getAutoCommitPrimaryConnection()

https://gerrit.wikimedia.org/r/1108793

Change #1109432 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Add ADR 27: Drop the wbt_type table

https://gerrit.wikimedia.org/r/1109432

Change #1109443 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Remove $flags param from TermsDomainDb::getReadConnection

https://gerrit.wikimedia.org/r/1109443

Change #1109444 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Avoid directly using CONN_TRX_AUTOCOMMIT in TermsDomainDb

https://gerrit.wikimedia.org/r/1109444

Change #1109443 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Remove $flags param from TermsDomainDb::getReadConnection

https://gerrit.wikimedia.org/r/1109443

Change #1109444 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Avoid directly using CONN_TRX_AUTOCOMMIT in TermsDomainDb

https://gerrit.wikimedia.org/r/1109444

Change #1109690 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Sanitize term type IDs when running update.php

https://gerrit.wikimedia.org/r/1109690

Change #1110720 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Remove term type ID services

https://gerrit.wikimedia.org/r/1110720

Note that the example queries for the term store (https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/docs/storage/terms.md) include the wbt_type table, and dropping this table should be considered a breaking change.

databases are explicitly excluded from https://www.mediawiki.org/wiki/Stable_interface_policy

And https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy:

We acknowledge that third party tools on Cloud VPS and Toolforge may rely on the Wikibase database schema. Because of this, changes to the available tables and fields are subject to the above notification policy. However, note that the database schema is not designed to be a public API, and less consideration is given to backwards compatibility.

I imagine a view could be added (or something) to the cloud replicas to reduce breakages?

Note that the example queries for the term store (https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/docs/storage/terms.md) include the wbt_type table, and dropping this table should be considered a breaking change.

These are internal examples for the Wikibase documentation, showing how the tables work. They are not generally meant for use elsewhere, as SQL is not generally a public interface.

I imagine a view could be added (or something) to the cloud replicas to reduce breakages?

Yeah, that'd work, but it should be temporary. Using views to keep b/c permanently adds a lot of extra complexity and conflicts with the main use case of views, which is to hide private data (it has happened that private data actually got leaked because of a b/c layer), so I really want to keep the views as simple as possible to reduce the risk of a data leak.

Checking in on the other ticket

They will be part of the new cluster, that will get replicated to wiki replicas.

It's likely that tools etc using these tables will break anyway if they are not also left in place on the wikidatawiki cloud replicas for some time.
So it probably doesn't make too much sense to just keep the type table etc there

Two things are being confused here (that's why there are two tickets):

  • Getting rid of the term type table and replacing it with constants: This will happen quite soon, hopefully in the next week or so. It can have a temporary b/c view to avoid breaking tools, but the view should eventually be removed.
  • Moving the tables to a dedicated cluster: This will happen in a quarter or so. There is no way we can provide b/c, because the tables will live on physically separate hosts. Since it's rather far off, we can start communicating to tool owners that this will happen and might break their tools (unless they change them). We can provide a DNS record that would initially point to s8 (wikidata_term_store.analytics.db.svc.wikimedia.cloud would point to wikidatawiki.analytics.db.svc.wikimedia.cloud for now and later change to the new cluster). But again, given that this is a bit far in the future, we haven't started the work on that yet.
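
The first item above, replacing lookups in the term type table with constants, might look roughly like this. It is a sketch under assumptions: the type names and numeric IDs are illustrative and not confirmed by this task. In practice the constants would have to match the IDs already stored in the existing rows.

```python
from enum import IntEnum


class TermType(IntEnum):
    # Hypothetical IDs; real values must match what the term tables already store.
    LABEL = 1
    DESCRIPTION = 2
    ALIAS = 3


def term_type_id(name: str) -> int:
    """Map a type name to its constant ID without querying a type table."""
    try:
        return TermType[name.upper()].value
    except KeyError:
        raise ValueError(f"unknown term type: {name!r}") from None
```

Because the mapping lives in code, queries no longer need a join against a type table, which is what allows the table to be dropped.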

Change #1110810 had a related patch set uploaded (by Jakob; author: Jakob):

[mediawiki/extensions/Wikibase@master] Drop wbt_type table

https://gerrit.wikimedia.org/r/1110810

+1, sounds good to leave a temporary view in place until the big move then