Page MenuHomePhabricator

Normalize pagelinks table
Closed, ResolvedPublic

Description

Once T299417: Normalize templatelinks table is done.

Related Objects

StatusSubtypeAssignedTask
OpenNone
Resolvedsimon04
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedABran-WMF
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedJAllemandou
ResolvedLadsgroup
ResolvedSBisson
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
DeclinedMarostegui
ResolvedMarostegui
ResolvedUmherirrender
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 966213 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud

https://gerrit.wikimedia.org/r/966213

Change 966213 merged by Ladsgroup:

[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud

https://gerrit.wikimedia.org/r/966213

Hey @Ladsgroup , we're using that table in the image suggestions production data pipeline, see this query.
While I'll try to monitor this ticket and mailing lists announcements, it would be great if you could ping me in advance before the breaking change rolls out, so that we can take action.
You could also leave a comment in T350007: [M] Adapt image suggestions to comply with breaking database schema changes if you prefer, thanks!

Sure. Will ping before dropping the columns in the ticket.

Before dropping columns, please notify WP:VPT at enwp and similar pages on other projects. Dropping these columns will break a lot of regular and ad hoc queries using Quarry, {{database report}}, etc. Identifying and modifying them will be a major exercise.

Before dropping columns, please notify WP:VPT at enwp and similar pages on other projects. Dropping these columns will break a lot of regular and ad hoc queries using Quarry, {{database report}}, etc. Identifying and modifying them will be a major exercise.

I will but noting that this has been already announced multiple times and in multiple venues: https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/KWWEWNFVXDXOHZBJO32NQLBKC5LUTEAB/

Change 977218 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki

https://gerrit.wikimedia.org/r/977218

Change 977218 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki

https://gerrit.wikimedia.org/r/977218

I changed PK of pagelinks in fawiki in beta cluster:

ALTER TABLE pagelinks DROP PRIMARY KEY, ADD PRIMARY KEY (`pl_from`, `pl_target_id`);

Then, stopped writing the old columns and then dropped them and indexes:

ALTER TABLE pagelinks DROP COLUMN pl_namespace, DROP COLUMN pl_title;
DROP INDEX pl_namespace ON pagelinks;
DROP INDEX pl_backlinks_namespace ON pagelinks;

Size of the table went from 9.1M to 448K but the catch here is that majority of size redaction came from running optimize when changing the PK, making it go to 996K, so the net gain here is cutting size of the table to around 45% which is pretty good regardless.

Actually my measurements were incorrect, the impact is bigger let me redo that.

Running it on simplewiki gave 38% of the original size. On top, you'd get a clean up due to running optimize table which can in itself cut the size of the table from half to one fifth depending on the wiki.

Fully understand the reasoning behind this normalization. But (unexpectedly) this will result in some queries to run substantially slower. For instance:

old query (3 seconds)
SELECT
  page_title,
  pl_title,
  CASE
    WHEN c2.cl_to IS NULL THEN NULL
    ELSE '{{yes}}'
  END AS under_construction
FROM page JOIN pagelinks
  ON page_id = pl_from
LEFT JOIN categorylinks c2
  ON page_id = c2.cl_from
  AND c2.cl_to = 'صفحه‌های_گسترده_در_دست_ساخت'
WHERE
  page_namespace = 0
  AND pl_namespace IN (2, 3)
  AND NOT EXISTS (
    SELECT c1.cl_to
    FROM categorylinks c1
    WHERE
      page_id = c1.cl_from
      AND c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

versus"

new query (3.5 minutes)
SELECT
  page_title,
  lt_title,
  CASE
    WHEN c2.cl_to IS NULL THEN NULL
    ELSE '{{yes}}'
  END AS under_construction
FROM page
JOIN pagelinks
  ON page_id = pl_from
JOIN linktarget 
  ON pl_target_id = lt_id
LEFT JOIN categorylinks c2
  ON page_id = c2.cl_from
  AND c2.cl_to = 'صفحه‌های_گسترده_در_دست_ساخت'
WHERE
  page_namespace = 0
  AND lt_namespace IN (2, 3)
  AND NOT EXISTS (
    SELECT c1.cl_to
    FROM categorylinks c1
    WHERE
      page_id = c1.cl_from
      AND c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

It would be nice if we had analytics on the queries run against pagelinks table (both in production and in replicas) and what the average or median query duration looks like over time.

In looking at the EXPLAIN results for those two queries, the first one runs more quickly not only because of denormalization, but also because the WHERE condition on pl_namespace uses the pl_namespace index on the pagelinks table. However, it seems like the secondary does *not* use the lt_namespace_title [[index]] on the linktargets table (where I expected it would be used as a partial index). Any ideas on how to improve my query or the indexes?

Maybe this is what you're looking for:

SELECT
  page_title,
  lt_title,
  CASE
    WHEN c2.cl_to IS NULL THEN NULL
    ELSE '{{yes}}'
  END AS under_construction
FROM page
JOIN pagelinks
  ON page_id = pl_from
JOIN linktarget 
  ON pl_target_id = lt_id
LEFT JOIN categorylinks c2
  ON page_id = c2.cl_from
WHERE
  c2.cl_to = 'صفحه‌های_گسترده_در_دست_ساخت'
  AND page_namespace = 0
  AND lt_namespace IN (2, 3)
  AND page_id NOT IN (
    SELECT c1.cl_from
    FROM categorylinks c1
    WHERE
      c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

(if I understood your query correctly, you want articles that are in daste sakht but not naamzade hazfe sari?)

No. The query finds all articles linking to a User or User_talk page, and distinguishes which has an "under construction" category. So the last part is optional, not mandatory. Your suggested query makes it mandatory.

But thanks for taking a look.

ah I see, unfortunately I don't think there is a way to make the query instantly fast as the knowledge that has the highest cardinality (pl from ns + pltarget ns) now is literally sitting in two different table and none of the fields have enough cardinality on their own to make query in the respective tables faster. I think it'll get a bit worse once the data is properly populated in linktarget (and more links to user and user talk ns start being added to linktarget) but OTOH, dropping the old columns makes all queries faster by increasing memory lookups and this took 48s for me:

SELECT
  page_title,
  lt_title,
  CASE
    WHEN page_id in (SELECT cl_from from categorylinks WHERE cl_to = 'صفحه‌های_گسترده_در_دست_ساخت') THEN '{{yes}}'
    ELSE NULL
  END AS under_construction
FROM page
JOIN pagelinks
  ON page_id = pl_from
JOIN linktarget 
  ON pl_target_id = lt_id
WHERE
  page_namespace = 0
  AND pl_from_namespace = 0
  AND lt_namespace IN (2, 3)
  AND page_id NOT IN (
    SELECT c1.cl_from
    FROM categorylinks c1
    WHERE
      c1.cl_to = 'مقاله‌های_نامزد_حذف_سریع'
  )

(adding pl_from_namespace = 0 helps a bit)

@Huji Hi, the data has been fully populated for fawiki. If you want check and switch your tools and queries, it should be fine to do so now.

Change #1024450 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Run pagelinks migration for third parties too

https://gerrit.wikimedia.org/r/1024450

Change #1024450 merged by jenkins-bot:

[mediawiki/core@master] Run pagelinks migration for third parties too

https://gerrit.wikimedia.org/r/1024450

We are about to start dropping the old columns everywhere.

What is this change, and what should be announced in Tech News (if anything)?

We are about to start dropping the old columns everywhere.

Does "everywhere" include Manual:Pagelinks table, which has yet to be updated this year and doesn't indicate that any columns are deprecated? Nor document the new way?

@Huji Hi, the data has been fully populated for fawiki. If you want check and switch your tools and queries, it should be fine to do so now.

I migrated all my code. In most cases, queries are running just a tad slower. In a few cases, they seem to run a tad faster. In all cases, my tests showed the results were identical.

Change #1028778 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis

https://gerrit.wikimedia.org/r/1028778

Change #1028778 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis

https://gerrit.wikimedia.org/r/1028778

Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:32:32Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:38:11Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-07T16:05:02Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] (duration: 32m 29s)

What is this change, and what should be announced in Tech News (if anything)?

Something along the lines of:

Two columns of pagelinks table (pl_namespace, pl_title) are being dropped soon. Users must use pl_target_id instead (foreign key to linktarget table). See T222224 for reasoning.

@Ladsgroup Added to https://meta.wikimedia.org/wiki/Tech/News/2024/20 -- Thanks for the draft wording, always appreciated!
I think I've gotten the <code> highlighting and non-translation snippets correct, but please glance-check.
It will be frozen for translations in ~20 hours, if any tweaks (or link additions/changes) are needed before then.

It doesn't work

Can't use pagelinks without linktarget; can't use linktarget without pagelinks

Chicken and egg problem

It doesn't work

Can't use pagelinks without linktarget; can't use linktarget without pagelinks

Chicken and egg problem

The join columns were wrong: Here is the correct query: https://quarry.wmcloud.org/query/82698

Thanks. So Quiddity, the advice: "Users must use pl_target_id instead (a foreign key to the linktarget table)." wasn't sufficient to clearly explain to me what I needed to do. I think, to better explain for rusty SQL users like me:

Two columns of the pagelinks table (pl_namespace, pl_title) are being dropped soon. Users must use two columns of the new linktarget table instead (lt_namespace, lt_title).

In your existing SQL queries:

  1. Replace JOIN pagelinks with JOIN linktarget and pl_ with lt_ in the ON statement
  2. Below that add JOIN pagelinks ON lt_id = pl_target_id

Thanks for the detailed recommendation! Updated at https://meta.wikimedia.org/wiki/Tech/News/2024/20

Sidenote (Tech News insider stuff): I'm hesitant about including that much technical detail (as it's a burden for the translators, and confusing + bulky for those completely unfamiliar with the topic...), but we're trying to improve Tech News (cf. T363876) so we need to experiment with some aspects in order to get clear example-based feedback, which I think this will help with. :)

Change #1043756 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] schema: Drop old pagelinks columns

https://gerrit.wikimedia.org/r/1043756

I am running MW 1_43-wmf9 and started running update.php yesterday night. It seems that the updater is very slow (running migrateLinksTable on pagelinks). I now started to run this migration in a separate process. Do you have any idea how long the migration might take? My first estimation is 12 days for our 12M pages. This seems very slow, does that make sense?

I am running MW 1_43-wmf9 and started running update.php yesterday night. It seems that the updater is very slow (running migrateLinksTable on pagelinks). I now started to run this migration in a separate process. Do you have any idea how long the migration might take? My first estimation is 12 days for our 12M pages. This seems very slow, does that make sense?

It depends on how connected those pages are (=size of pagelinks table). 12 days for 12M sounds about right. You could potentially increase the batch size if they get connected to one set of popular pages mostly.

Thank you. This is very helpful. I think one of the problems is that we are also writing to the wiki simultaneously.

MariaDB [(none)]> SELECT table_name AS `Table`, round(((data_length + index_length) / 1024 / 1024 / 1024), 2) `Size in GB`  FROM information_schema.TABLES where table_name like '%links' order by (data_length + index_length) desc;
+---------------+------------+
| Table         | Size in GB |
+---------------+------------+
| pagelinks     |      22.43 |
| externallinks |       9.92 |
| templatelinks |       7.45 |
| categorylinks |       0.09 |
| iwlinks       |       0.01 |

And it is a bit faster than originally expected. So maybe only 6 days. (Edit: It was actually much faster real 1510m58.783s)

Change #1043756 merged by jenkins-bot:

[mediawiki/core@master] schema: Drop old pagelinks columns

https://gerrit.wikimedia.org/r/1043756

I didn’t want to spam here… It looks like @CodeReviewBot is smarter than I thought. Sorry!

Ladsgroup claimed this task.
Ladsgroup edited projects, added Data-Persistence; removed Patch-For-Review.
Ladsgroup moved this task from Inbox to Epic - Database on the Data-Persistence board.