
Incorrect backlog cutoff for redirects in the New Page Patrol queue
Closed, ResolvedPublic2 Estimated Story Points

Description

The backlog length for the New Pages Queue for articles is 90 days, but several weeks ago editors realized that redirects were dropping off the queue after only 30 days. In practice, this means that many, if not most, redirects will not be reviewed. This problem would be solved if both backlogs were 90 days long.

Note that there should currently be no redirects older than 20-something days in the queue: once we noticed the problem, a few editors made sure to keep the back of the queue patrolled.

Event Timeline

JJMC89 subscribed.

It looks like cron/updatePageTriageQueue.php is using a hardcoded 30 days instead of the configured PageTriageMaxAge to remove pages from the queue.
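
For illustration, a minimal sketch of the suspected pattern (hypothetical PHP, not the actual extension code; PageTriageMaxAge is the configured value mentioned above):

// Hypothetical sketch of the bug: the cron job derives its deletion cutoff
// from a hardcoded constant rather than the PageTriageMaxAge setting.
$maxAgeDays = 30; // hardcoded: redirects get purged after 30 days
$cutoff = wfTimestamp( TS_MW, time() - $maxAgeDays * 86400 );

// Honoring the configured backlog (90 days on enwiki) would look like:
$maxAgeDays = $wgPageTriageMaxAge; // assumed global for the config setting
$cutoff = wfTimestamp( TS_MW, time() - $maxAgeDays * 86400 );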

JTannerWMF moved this task from Inbox to External on the Growth-Team board.
JTannerWMF added subscribers: Niharika, JTannerWMF.

We are tagging the Community-Tech team, as this relates to a top wishlist item. CC: @Niharika

Niharika triaged this task as Medium priority.Jul 9 2019, 11:34 PM
Niharika set the point value for this task to 2.
Niharika moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Update: When @MusikAnimal looked into the history of the code, this behavior (i.e., redirects being dropped after 30 days in the queue) appeared to be an intentional design choice, though we do not know the reason (perhaps database storage concerns?). @Barkeep49, do you know why this decision may have been made?

The patch that changed redirects to be purged after 30 days was https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/24939, which was merged by @kaldari. Perhaps he remembers the reasoning?

My suspicion is simply performance. At the time of writing, there are 31,741 redirects in the queue (reviewed + unreviewed, over 30 days). This compares to 22,646 mainspace pages + 4,086 drafts + 17,983 user pages = 44,715 non-redirects. So if we triple the redirect window (90 days), we get 95,223 redirects, more than twice as many as all other types of pages combined. The database servers in 2012 were surely not as powerful as they are today, so maybe we can get away with storing that much more data. It's still an awful lot, though.
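
The same projection as plain arithmetic (all figures from the paragraph above):

// Queue composition at the time of the comment above.
$redirects30d = 31741;                 // redirects in the 30-day queue
$nonRedirects = 22646 + 4086 + 17983;  // = 44715 articles + drafts + user pages
$redirects90d = 3 * $redirects30d;     // = 95223 with a 90-day window
// 95223 > 2 * 44715 = 89430, i.e. more than twice all non-redirects combined.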

We should weigh the costs and benefits. Patrolling redirects seems much less specialized than new page reviewing. I suspect your average [[offensive term]] redirect to [[famous person]] will get picked up by recent changes patrollers, no? Do we have any idea of how many redirects are typically deleted or corrected as a result of NPP?

This discussion might be better held at https://en.wikipedia.org/wiki/Wikipedia_talk:New_pages_patrol/Reviewers but pinging @Rosguill and @DannyS712 as the two people who have dived deep into redirects most recently.

I'd say that on a typical day of patrolling the back end of the queue, I'll go through 150-300 articles, send 5-10 to RfD, tag around 5 with G5 or R3, and retarget or convert-to-dab another 5. Attack redirects are less frequent; I come across a handful per week.

These numbers can swing quite a bit, though, because errors and/or vandalism on redirects are often repeated by the same editor multiple times in a row. There have been days where 30+ redirects ended up bundled together for RfD, and days where I've run into 10+ G5s in a row.

For me, part of the issue is how I review redirects: I have a Chrome extension that lets me mass-open links in new tabs. I go to the Special:NewPages feed of unpatrolled redirects and open 100 at a time, close any that need a second look or aren't obviously acceptable, and then mark the remaining tabs as patrolled. The rate limit on patrolling means I have to pause after each one.

Something I've been thinking about for redirect patrolling is extending my bot's task of patrolling redirects to create a pseudo-group of "autopatrolled redirect creators", which would ease the thousands of redirects that need to be patrolled; I was waiting for T223828 before investigating the task and looking for consensus. I've opened a preliminary discussion at https://en.wikipedia.org/wiki/Wikipedia_talk:New_pages_patrol/Reviewers#Initial_thoughts_-_autopatrolled_redirects. The bot task could save a few dozen per week, but personally, I think the rate limit on patrolling redirects is a big hurdle.
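
For context, the feed being patrolled here can also be pulled through the standard API; a rough sketch (the parameter choices are my assumptions about the equivalent query, not part of the workflow described above):

// Fetch up to 100 unpatrolled new mainspace redirects, roughly the same
// data that backs the Special:NewPages redirect feed. Note that seeing
// !patrolled results requires an account with patrol rights.
$url = 'https://en.wikipedia.org/w/api.php?' . http_build_query( [
    'action' => 'query',
    'list' => 'recentchanges',
    'rctype' => 'new',
    'rcshow' => '!patrolled|redirect',
    'rcnamespace' => 0,
    'rclimit' => 100,
    'format' => 'json',
] );
$data = json_decode( file_get_contents( $url ), true );
foreach ( $data['query']['recentchanges'] as $rc ) {
    echo $rc['title'], "\n";
}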

Okay, so it sounds like there is a healthy amount of redirect patrolling using Page Curation. I suppose we should talk to the DBAs about this. If they think all those extra rows are fine, then there's no reason not to bump the expiry to 90 days, provided you're okay with the longer backlog.

Personally, I think the rate limit on patrolling redirects is a big hurdle.

Are you sure it's a rate limit on patrolling, and not the MediaWiki API itself?


The popup window says "An error occurred while marking the page as reviewed: You've exceeded your rate limit. Please wait some time and try again." This only appears to arise when using the Page Curation toolbar; using the "mark this page as patrolled" button at the bottom with the toolbar disabled works fine (though that button also moves around on the page if there is a redirect template, so the toolbar is more convenient). https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=ratelimits says that my rate limits are:

Ratelimits
{
    "batchcomplete": "",
    "query": {
        "userinfo": {
            "id": 34581532,
            "name": "DannyS712",
            "ratelimits": {
                "move": {
                    "user": {
                        "hits": 8,
                        "seconds": 60
                    },
                    "extendedmover": {
                        "hits": 16,
                        "seconds": 60
                    }
                },
                "edit": {
                    "user": {
                        "hits": 90,
                        "seconds": 60
                    }
                },
                "badcaptcha": {
                    "user": {
                        "hits": 30,
                        "seconds": 60
                    }
                },
                "emailuser": {
                    "user": {
                        "hits": 20,
                        "seconds": 86400
                    }
                },
                "changeemail": {
                    "user": {
                        "hits": 4,
                        "seconds": 86400
                    }
                },
                "rollback": {
                    "user": {
                        "hits": 10,
                        "seconds": 60
                    },
                    "rollbacker": {
                        "hits": 100,
                        "seconds": 60
                    }
                },
                "purge": {
                    "user": {
                        "hits": 30,
                        "seconds": 60
                    }
                },
                "linkpurge": {
                    "user": {
                        "hits": 30,
                        "seconds": 60
                    }
                },
                "renderfile": {
                    "user": {
                        "hits": 700,
                        "seconds": 30
                    }
                },
                "renderfile-nonstandard": {
                    "user": {
                        "hits": 70,
                        "seconds": 30
                    }
                },
                "cxsave": {
                    "user": {
                        "hits": 10,
                        "seconds": 30
                    }
                },
                "urlshortcode": {
                    "user": {
                        "hits": 50,
                        "seconds": 120
                    }
                },
                "pagetriage-mark-action": {
                    "user": {
                        "hits": 1,
                        "seconds": 3
                    }
                },
                "pagetriage-tagging-action": {
                    "user": {
                        "hits": 1,
                        "seconds": 10
                    }
                },
                "thanks-notification": {
                    "user": {
                        "hits": 10,
                        "seconds": 60
                    }
                },
                "badoath": {
                    "user": {
                        "hits": 10,
                        "seconds": 60
                    }
                }
            }
        }
    }
}
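
The relevant entry above is pagetriage-mark-action: 1 hit per 3 seconds, i.e. at most 20 curation reviews per minute. Any paced client would have to sleep between calls; a sketch (markAsReviewed() is a hypothetical stand-in for a POST to the PageTriage API, not real extension code):

// With pagetriage-mark-action limited to 1 hit per 3 seconds, marking
// 100 redirects as reviewed takes at least 300 seconds.
$limitSeconds = 3;
$pageIds = [ /* page IDs to mark as reviewed */ ];
foreach ( $pageIds as $pageId ) {
    markAsReviewed( $pageId ); // hypothetical helper wrapping the API call
    sleep( $limitSeconds );    // stay under the rate limit
}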

The popup window says "An error occurred while marking the page as reviewed: You've exceeded your rate limit. Please wait some time and try again."
...

Got it, I see now where this is happening in the code https://github.com/wikimedia/mediawiki-extensions-PageTriage/blob/master/includes/Api/ApiPageTriageAction.php#L31-L33. I think we could easily exempt redirects from this, if there was consensus to do so.
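
A minimal sketch of what such an exemption might look like (hypothetical, assuming the module checks the limit with the standard pingLimiter() mechanism; not the actual code at the link above):

// Hypothetical: skip the rate-limit check when the page being marked as
// reviewed is a redirect, so redirect patrollers are not throttled.
if ( !$title->isRedirect() && $user->pingLimiter( 'pagetriage-mark-action' ) ) {
    $this->dieWithError( 'apierror-ratelimited' );
}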


I'm not sure that is the best idea, though. Per BEANS, I've emailed you my concern; feel free to post it here if you think it isn't an issue.


Not quite BEANS-worthy in my opinion, but thanks for the caution! I only wanted to state that removing this throttling for redirects is technically possible. Is anyone else hitting the rate limit? I ask because you have a bot account to get around it, no? Anyway, we might be getting a little off-topic :)

I'll try to do the math to see just how much of an impact the extra redirects will have on database storage, considering there's associated metadata too (pagetriage_page_tags), and not just the rows in pagetriage_page. I also noticed that while we don't expose things like category/reference counts (and even AfC state!) for redirects, there are still rows for them in the database, so maybe fixing that would give us more wiggle room.

I removed the Community-Tech tag from this ticket, as the potential changes are not a part of the Community Wishlist Survey.

I dug around the code tonight. The code responsible for deleting redirects older than 30 days is the SQL query in the cron job at https://github.com/wikimedia/mediawiki-extensions-PageTriage/blob/58ef5381d2c1d5455a92090477bd6488544f3bf0/cron/updatePageTriageQueue.php#L80-L102. This would be an easy fix; I may submit a patch soon.
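
In rough terms, the fix only needs to widen the cutoff used for redirects; a sketch of the shape of the change (variable names are assumptions, not a verbatim excerpt of the file linked above):

// Current behavior: redirects are purged on a hardcoded 30-day cutoff.
$redirectCutoff = $dbr->timestamp( time() - 30 * 86400 );
// The fix simply widens this window (the merged patch uses 6 months),
// leaving the rest of the deletion query untouched.
$redirectCutoff = $dbr->timestamp( time() - 180 * 86400 );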

Change 815960 had a related patch set uploaded (by Novem Linguae; author: Novem Linguae):

[mediawiki/extensions/PageTriage@master] Change redirect delete cron job cutoff to 6 months

https://gerrit.wikimedia.org/r/815960

[Attached: image.png (383×773 px, 29 KB), a chart of the projected database size increase]

Here's an analysis of the database size increase. It is based on the assumption that there are 40,000 redirects for every 30-day period. Here are a couple of Quarry queries: 1, 2

Currently @Rosguill reviews redirects every day, and we have several bot tasks that also review redirects. This keeps the redirect queue at exactly the 30-day mark (the green data in the chart).

If these numbers seem too high, we could lower the cutoff to something in between, such as 90 or 60 days.
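
Under the stated assumption of roughly 40,000 redirects per 30-day period, the row-count projection behind the chart works out as follows (a back-of-the-envelope sketch, not output of the Quarry queries):

// Moving redirects from a 30-day cutoff to a 6-month cutoff.
$redirectsPerPeriod = 40000;  // assumed redirects created per 30 days
$extraPeriods = 5;            // 6 months adds five more 30-day periods
$extraRows = $redirectsPerPeriod * $extraPeriods; // = 200000 extra queue rows
// Each pagetriage_page row also has associated pagetriage_page_tags rows,
// so total database growth is a multiple of this figure.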

TheresNoTime subscribed.

Tagging DBA for comment on the potential performance impact of this change, per T227250#5363690 and the summary by @Novem_Linguae above; no immediate urgency afaics.

root@db2112:/srv/sqldata/enwiki# ls -Ssh | grep -i pagetriage
233M pagetriage_page_tags.ibd
 61M pagetriage_log.ibd
 27M pagetriage_page.ibd
 64K pagetriage_tags.ibd

This table is small enough not to be a big issue, and it won't grow much further given that it still has a cutoff (just a longer one). I think it's fine unless @Marostegui has a different idea.

Change 815960 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Change redirect delete cron job cutoff to 6 months

https://gerrit.wikimedia.org/r/815960

Novem_Linguae claimed this task.
Novem_Linguae moved this task from Waiting for enwiki deploy to Done on the PageTriage board.