Studying Wikipedia Edits by Tag

This guide will help you understand how to gather edit data from Wikipedia and work with it.

In this tutorial, we will use edit data to analyze to what degree editing has shifted to mobile -- i.e. editing on a mobile device instead of a desktop computer.

We can identify mobile edits via a specific edit tag that is recorded with an edit. Data on which tags are associated with which edits is available via the MediaWiki dumps or the API.

Every language edition of Wikipedia has its own edit tag tables and history dumps. In this tutorial, we will work with simplewiki, the Simple English edition of Wikipedia.

Now, let's take a look at the data we will be working with.

Change_tag_def Data

We will be working with the change_tag_def data, which stores information about tags. This data will help us identify the tag we need.

As you can see, the data is quite small (4.0 K), so it will be very quick to process.

About the data

The change_tag_def table in the simplewiki-latest-change_tag_def.sql dump looks like:

ctd_id  ctd_name      ctd_user_defined  ctd_count
1       mw-replace    0                 8781
2       visualeditor  0                 278031
3       mw-undo       0                 50455
4       mw-rollback   0                 64141
5       mobile edit   0                 202660

Change_tag Data

We will also be working with the change_tag data, which tracks tags for revisions, logs, and recent changes. Using this data, we will be able to identify the revisions associated with a given tag.

As you can see, the data is not that large (11 M), so it will be quick to process.

About the data

The change_tag table in the simplewiki-latest-change_tag.sql dump looks like:

ct_id  ct_rc_id  ct_log_id  ct_rev_id  ct_params  ct_tag_id
1      2515963   NULL       2489613    NULL       39
2      2518988   NULL       2492598    NULL       39
3      2542101   NULL       2515172    NULL       39
4      2594933   NULL       2515174    NULL       39
5      2694931   NULL       2667475    NULL       39

XML History Dump

This is the full revision history dump in XML format.

The dataset is larger -- 470 MB compressed -- but can still be processed in a few minutes.

About the data

The history dump file looks like:

<siteinfo>
  <sitename>Wikipedia</sitename>
  <dbname>simplewiki</dbname>
  <base>https://simple.wikipedia.org/wiki/Main_Page</base>
  <generator>MediaWiki 1.36.0-wmf.32</generator>
  <case>first-letter</case>
  <namespaces>
    <namespace key="-2" case="first-letter">Media</namespace>
    <namespace key="-1" case="first-letter">Special</namespace>
    .
    .
    <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
  </namespaces>
</siteinfo>
<page>
  <title>April</title>
  <ns>0</ns>
  <id>1</id>
  <revision>
    <id>2130</id>
    <timestamp>2003-03-27T09:24:48Z</timestamp>
    <contributor>
      <username>Ams80</username>
      <id>667428</id>
    </contributor>
    <comment>Edited from English Wikipedia</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="724" id="2130" />
    <sha1>g1vldyqwrp9b6ot9f4t27k5n2lmibva</sha1>
  </revision>

After some site metadata, you can see a page element with metadata about the page, followed by a list of revisions from oldest to newest, where each revision is an edit along with metadata about that edit.

We will use this data to find the total number of edits made from mobile. We can do that by filtering the dump against the revision ids from the change_tag table that are associated with mobile edits.

Finding the tag id associated with mobile edits

Since our goal is to find how mobile edits have evolved over time, we need to separate out only those revisions that are associated with mobile edits. To do that, we will use the change_tag_def table from the simplewiki database to find the unique tag id (ctd_id) associated with mobile edits.

We will write a make_connection helper to connect to the host and database.

And a query helper to wrap some of the boilerplate of querying. This helper returns all the results of a query, as sketched below.
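A minimal sketch of these two helpers, assuming a PAWS-like environment where pymysql is installed, credentials live in ~/.my.cnf, and the Wiki Replicas host follows the *.analytics.db.svc.wikimedia.cloud naming; adjust these for your own setup:

import pymysql

def make_connection(wiki):
    """Connect to the replica database for the given wiki, e.g. 'simplewiki'."""
    return pymysql.connect(
        host=f'{wiki}.analytics.db.svc.wikimedia.cloud',  # assumed replica host
        read_default_file='~/.my.cnf',                    # assumed credentials file
        database=f'{wiki}_p',
        charset='utf8'
    )

def query(conn, sql):
    """Execute a SQL statement and return all result rows."""
    with conn.cursor() as cursor:
        cursor.execute(sql)
        return cursor.fetchall()

conn = make_connection('simplewiki')
print(query(conn, "SELECT ctd_id FROM change_tag_def WHERE ctd_name = 'mobile edit'"))
# -> ((5,),) on simplewiki at the time of writing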

Observation:

We can see that the tag id associated with mobile edit is 5, which we can use to filter revision ids from the change_tag table.

Finding the revision ids associated with mobile edit

Now that we know the tag id associated with mobile edit, we need to find the revision ids associated with it. To do that, we will work with the change_tag table in the simplewiki database and filter out those revision ids where the tag id (ct_tag_id) is 5 (mobile edit).

We reuse the make_connection and query helpers defined above to run this query.
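A sketch of the revision-id query, keeping the ids in a set for fast membership checks when we scan the history dump later:

conn = make_connection('simplewiki')
rows = query(conn, 'SELECT ct_rev_id FROM change_tag WHERE ct_tag_id = 5')
mobile_rev_ids = {row[0] for row in rows}  # revision ids tagged 'mobile edit'
print(len(mobile_rev_ids))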

Finding the total number of mobile edits and non-mobile edits

We have all the revision ids associated with mobile edit, so now we need to calculate the total mobile and non-mobile edits using the history dump. We can separate mobile edits from the rest by checking each revision id in the dump against our set of mobile revision ids.

To process the Wikipedia XML history dumps, we will use the mwxml library. You can learn more about the library here: https://github.com/mediawiki-utilities/python-mwxml
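A sketch of that scan with mwxml, assuming the dump has been downloaded locally as simplewiki-latest-pages-meta-history.xml.bz2 (the filename is an assumption) and that mobile_rev_ids is the set built above:

import bz2
from collections import defaultdict

import mwxml

mobile_by_year = defaultdict(int)
non_mobile_by_year = defaultdict(int)

dump = mwxml.Dump.from_file(bz2.open('simplewiki-latest-pages-meta-history.xml.bz2', 'rb'))
for page in dump:
    for revision in page:
        year = int(str(revision.timestamp)[:4])  # timestamps stringify with the year first
        if revision.id in mobile_rev_ids:
            mobile_by_year[year] += 1
        else:
            non_mobile_by_year[year] += 1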

Finally, we have the total number of mobile and non-mobile edits made to Wikipedia by year.

An interesting thing we can observe here is that until 2012 there is not a single mobile edit, but after 2012 the number of mobile edits grows rapidly, to the point that in 2020 and 2021 the number of mobile edits is almost the same as the number of non-mobile edits.

Here, we are observing the entire edit history, but what if we only want to look at one particular Wikipedia article in one particular year?

We can select a particular article by filtering on page.title.

Calculating the number of mobile and non-mobile edits made to the article "Europe" in 2020 using the dump
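A sketch of that calculation, reusing mwxml and the mobile_rev_ids set; the dump filename is the same assumption as above:

europe_mobile = europe_non_mobile = 0

dump = mwxml.Dump.from_file(bz2.open('simplewiki-latest-pages-meta-history.xml.bz2', 'rb'))
for page in dump:
    if page.title != 'Europe':
        continue  # skip every page except the article we care about
    for revision in page:
        if str(revision.timestamp)[:4] != '2020':
            continue  # keep only revisions made in 2020
        if revision.id in mobile_rev_ids:
            europe_mobile += 1
        else:
            europe_non_mobile += 1

print(europe_mobile, europe_non_mobile)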

Accessing the Edit Tag APIs

Using the dump data, we were able to find the total number of mobile and non-mobile edits for a particular article in a particular year. However, this is very costly, as we have to loop through the whole XML dump to obtain that data.

Hence, the Revisions API can be a much simpler way to access data about edit tags for a given article, if you know which articles you are interested in and are interested in relatively few of them (e.g., hundreds or low thousands).

In addition, the APIs are up to date, while the MediaWiki dumps are always at least several days behind -- i.e. they are snapshots at specific points in time -- so the data you get from the dumps might differ from the APIs if edits have been made to a page in the intervening days.

We will use the mwapi library to work with the APIs. You can learn more about mwapi here: https://pypi.org/project/mwapi/

Calculating the number of mobile and non-mobile edits made to the article "Europe" in 2020 using the API
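A sketch of the same calculation with mwapi, paging through the Revisions API via its continuation support; the user_agent string is a placeholder you should replace with your own contact information:

import mwapi

session = mwapi.Session('https://simple.wikipedia.org',
                        user_agent='edit-tag-tutorial (your-email@example.com)')  # placeholder

mobile = non_mobile = 0
for response in session.get(action='query', prop='revisions',
                            titles='Europe', rvprop='ids|timestamp|tags',
                            rvdir='newer',
                            rvstart='2020-01-01T00:00:00Z',
                            rvend='2020-12-31T23:59:59Z',
                            rvlimit='max', formatversion=2,
                            continuation=True):
    for page in response['query']['pages']:
        for rev in page.get('revisions', []):
            if 'mobile edit' in rev['tags']:
                mobile += 1
            else:
                non_mobile += 1

print(mobile, non_mobile)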

Using the API, we were able to calculate the number of mobile and non-mobile edits made to the article "Europe" in 2020, and the result matches the one we obtained from the dump data.

However, using the API is much easier and faster than using the dump.

Example Analyses of Edit Tag Data

Now let's look at some examples of what we can do with the data we gathered about edit tags for various Wikipedia articles.

How has mobile editing changed over the last ~20 years?

Plotting a pie chart of the edit data to visualize the percentage of mobile and non-mobile edits over the ~20-year time frame:
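A minimal matplotlib sketch, assuming the mobile_by_year and non_mobile_by_year counters from the dump scan above:

import matplotlib.pyplot as plt

total_mobile = sum(mobile_by_year.values())
total_non_mobile = sum(non_mobile_by_year.values())

plt.pie([total_mobile, total_non_mobile],
        labels=['mobile edits', 'non-mobile edits'],
        autopct='%1.2f%%')  # print each slice's percentage on the chart
plt.title('Mobile vs. non-mobile edits, all years')
plt.show()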

If we look at the historical data, we can see that only 3.88% of all edits were made from mobile.

Plotting a stacked bar plot of total mobile and non-mobile edits by year:
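A sketch of the stacked bars, again assuming the per-year counters from the dump scan:

years = sorted(set(mobile_by_year) | set(non_mobile_by_year))
mobile_counts = [mobile_by_year[y] for y in years]
non_mobile_counts = [non_mobile_by_year[y] for y in years]

plt.bar(years, non_mobile_counts, label='non-mobile edits')
plt.bar(years, mobile_counts, bottom=non_mobile_counts, label='mobile edits')  # stack mobile on top
plt.xlabel('year')
plt.ylabel('edits')
plt.legend()
plt.show()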

Here, we can observe that mobile edits were not very common until 2015 but have grown since 2016. In 2018 and 2019, the total number of mobile edits is almost the same as the total number of non-mobile edits.

So we can use edit tag data to find many meaningful insights.

Forking a PAWS notebook

If you want to build on the work of an existing public notebook, you can create a copy of it for your personal use (AKA "fork"), and upload it to your PAWS control panel.

1) Get the URL of another public PAWS notebook. Example: https://public.paws.wmcloud.org/YOURUSERNAME/YOURNOTEBOOK.ipynb

2) Add ?format=raw to the end of the URL to download a raw .ipynb file. Example: https://public.paws.wmcloud.org/YOURUSERNAME/YOURNOTEBOOK.ipynb?format=raw

3) Log in to your PAWS account and use the Upload button to upload this copy into your own control panel.