Setting “preferred rank” for latest population statements

I noticed that many German cities have lots of population statements with a “point in time” qualifier, none of which has the preferred rank. Usually, it’s recommended to mark the “current” population (typically the latest statement) by setting it to preferred rank, so that it can easily be queried with the wdt: prefix, which only returns the best-ranked statements. In general, determining the best statement requires human oversight (the latest statement could come from a sketchy source, so perhaps the second-latest would be preferred), but in most of these cases, all population statements on an item have the same reference (some population table by some state statistical office). I decided that I could fix those cases with a bot.

(Note: all this text was written after-the-fact, so I’m sometimes talking about the code in the past tense.)

Preface: pywikibot update

Wikidata was recently updated for improved quantity precision handling (see here for details). pywikibot was already updated for these changes, but the version installed on PAWS (or at least, on my server) was outdated. To fix this, I updated pywikibot from the terminal:

cd /srv/paws/pwb
git pull
mv ../lib/python3.4/site-packages/pywikibot{,-bak}
cp -a pywikibot/ ../lib/python3.4/site-packages/

Putting together the bot

I assembled the bot in typical PAWS fashion: playing around until things started falling together :)

We start with some boilerplate, copied from older bots:

We get a test item and play with it. (Most commands in this phase are no longer visible, but it’s essentially lots of tab completion and dir(…).)

This function gets the “point in time” qualifier of a claim and converts it to an ISO 8601 date/time string, which can be compared with the </> operators (the normal WbTime class isn’t comparable).
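A minimal sketch of the idea, not the original cell: it works on the plain JSON form of a statement (as returned by the Wikidata API) rather than on pywikibot objects, and the name point_in_time is made up for illustration. The resulting strings compare correctly with < and > only for positive four-digit years, which is enough for population figures.

```python
from typing import Optional

POINT_IN_TIME = "P585"  # Wikidata property "point in time"


def point_in_time(statement: dict) -> Optional[str]:
    """Return the "point in time" qualifier as an ISO 8601 string, or None."""
    qualifiers = statement.get("qualifiers", {}).get(POINT_IN_TIME, [])
    for snak in qualifiers:
        if snak.get("snaktype") == "value":
            # Wikidata stores times like "+2015-12-31T00:00:00Z";
            # drop the leading sign so the result is a plain ISO 8601 string.
            return snak["datavalue"]["value"]["time"].lstrip("+")
    return None


# Hypothetical example statement, trimmed to the relevant fields.
example = {
    "qualifiers": {
        "P585": [{
            "snaktype": "value",
            "property": "P585",
            "datavalue": {
                "value": {"time": "+2015-12-31T00:00:00Z", "precision": 11},
                "type": "time",
            },
        }]
    },
}

print(point_in_time(example))  # 2015-12-31T00:00:00Z
```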

Search for the latest and second-latest population claim of the item.
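One way to do this search in a single pass, sketched under the assumption that each claim has already been reduced to a (time string, statement id) pair. The helper name latest_two and the ids are hypothetical.

```python
from typing import List, Optional, Tuple

Dated = Tuple[str, str]  # (ISO 8601 time string, statement id)


def latest_two(dated: List[Dated]) -> Tuple[Optional[Dated], Optional[Dated]]:
    """Return the latest and second-latest entries (None where missing)."""
    latest: Optional[Dated] = None
    second: Optional[Dated] = None
    for entry in dated:
        if latest is None or entry[0] > latest[0]:
            # New latest; the previous latest becomes second-latest.
            latest, second = entry, latest
        elif second is None or entry[0] > second[0]:
            second = entry
    return latest, second


claims = [
    ("2013-12-31T00:00:00Z", "a"),
    ("2015-12-31T00:00:00Z", "b"),
    ("2014-12-31T00:00:00Z", "c"),
]
latest, second = latest_two(claims)
print(latest[1], second[1])  # b c
```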

A function that returns the “reference URL” source of a claim, if it exists.

I only want to set the latest statement to preferred rank if it has exactly the same source as the second-latest one. Can we compare sources by simply comparing their JSON content? Turns out, yes!
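To illustrate why comparing JSON works, here is a small sketch on plain dictionaries shaped like the Wikidata statement JSON. The helper names and example URLs are made up, and the original compared the JSON of pywikibot objects rather than raw dictionaries. Serialising with sorted keys makes the comparison independent of key order. The order of the references themselves still matters in this sketch.

```python
import json


def sources_json(statement: dict) -> str:
    """Serialise the snaks of all references of a statement deterministically."""
    snaks = [ref.get("snaks", {}) for ref in statement.get("references", [])]
    return json.dumps(snaks, sort_keys=True)


def make_statement(url: str) -> dict:
    """Build a hypothetical statement with one "reference URL" (P854) reference."""
    return {
        "references": [{
            "snaks": {
                "P854": [{
                    "snaktype": "value",
                    "property": "P854",
                    "datavalue": {"value": url, "type": "string"},
                }]
            }
        }]
    }


a = make_statement("https://example.org/population-table")
b = make_statement("https://example.org/population-table")
c = make_statement("https://example.org/other-table")

print(sources_json(a) == sources_json(b))  # True
print(sources_json(a) == sources_json(c))  # False
```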

Set the rank to preferred. (This doesn’t yet do anything on Wikidata, it only sets a field of the local value.)
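The same "local only" behaviour can be shown on the JSON form of a statement, where the rank is just a string field ("preferred", "normal" or "deprecated"). This is a sketch with a hypothetical helper name, not the pywikibot call used in the notebook.

```python
import copy


def with_preferred_rank(statement: dict) -> dict:
    """Return a copy of the statement with its rank set to "preferred"."""
    updated = copy.deepcopy(statement)
    updated["rank"] = "preferred"
    return updated


# Hypothetical statement with a placeholder id.
original = {"id": "Q0$placeholder", "rank": "normal"}
changed = with_preferred_rank(original)

print(original["rank"], changed["rank"])  # normal preferred
```

Nothing is sent to Wikidata here; the change only exists in the local copy until it is saved in a separate step.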

Actually edit the item on Wikidata.

Put all of this in a function (and also print what we’re doing). By this time, I had also created a wiki user page that I could link to in the edit summary.

Second test run. (There’s no print output – that’s what reminded me to add a print to the function above.)

Third test run, now with print.

A query that finds us all cities with more than one wdt: population statement.
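The query itself is not shown, so the following is only a guess at its general shape, assuming "cities" means instances of (a subclass of) city (Q515) located in Germany (Q183). Because wdt: only returns best-ranked statements, an item with several wdt: population values has no single preferred statement.

```python
# Hypothetical SPARQL query for the Wikidata Query Service,
# where the wd: and wdt: prefixes are predefined.
QUERY = """
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q515;   # instance of (a subclass of) city
        wdt:P17 wd:Q183;             # country: Germany
        wdt:P1082 ?population.       # best-ranked population values
}
GROUP BY ?item
HAVING(COUNT(?population) > 1)
"""

print(QUERY.strip())
```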

A generator for the query…

…and now we run the query and process all the items.

Turns out I was a bit sloppy and didn’t handle some Nones correctly.

Update processItem to correctly handle missing sources (if source1 is not None…).

Another test edit (successful).

Run the query again. We immediately run into the next None issue.

Update refUrlSource to handle missing claims.
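Both None fixes can be sketched together on plain statement JSON. The function names below only loosely mirror refUrlSource and the check inside processItem; the real versions work on pywikibot claims and may differ in detail.

```python
from typing import Optional

REFERENCE_URL = "P854"  # Wikidata property "reference URL"


def ref_url_source(statement: Optional[dict]) -> Optional[str]:
    """Return the first "reference URL" of a statement, or None."""
    if statement is None:
        # Missing claim, e.g. an item with only one dated population statement.
        return None
    for reference in statement.get("references", []):
        for snak in reference.get("snaks", {}).get(REFERENCE_URL, []):
            if snak.get("snaktype") == "value":
                return snak["datavalue"]["value"]
    return None


def should_prefer_latest(latest: Optional[dict], second: Optional[dict]) -> bool:
    """Only prefer the latest statement if both have the same, existing source."""
    source1 = ref_url_source(latest)
    source2 = ref_url_source(second)
    return source1 is not None and source1 == source2


def statement_with_url(url: str) -> dict:
    """Build a hypothetical statement carrying a single reference URL."""
    snak = {"snaktype": "value", "datavalue": {"value": url, "type": "string"}}
    return {"references": [{"snaks": {REFERENCE_URL: [snak]}}]}


same_a = statement_with_url("https://example.org/table")
same_b = statement_with_url("https://example.org/table")
unsourced = {"references": []}

print(should_prefer_latest(same_a, same_b))        # True
print(should_prefer_latest(same_a, None))          # False
print(should_prefer_latest(unsourced, unsourced))  # False
```

Requiring source1 to be present means two unsourced statements are never treated as "same source", which leaves those items for manual review.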

Run the query once more. This time, it works! It runs for quite a while before something crashes.

(I was a bit confused about what caused this error. The actual reason is explained further down, but at first I just continued to run the bot.)

Next morning, I came back to find the bot finished. But my main query still returned results, so I just ran it again…

…and again…

…and then I noticed that these edits weren’t showing up in my contributions or on the items’ histories, so I stopped the bot again. What’s the query again?

I realized the mistake, and fixed the query:

I got the same error about the redirect again (see above), and realized what the actual reason was:

Run the bot one last time…

…and we’re done. The remaining 12 results of the query were fixed manually: some had two preferred statements – someone had added a newer one and didn’t reset the older statement to normal rank – and some had a more complicated source situation.