How to Use Wikipedia Edits by Tag


Tags are brief messages that MediaWiki automatically places next to certain edits (sometimes referred to as changes to a particular article or as the contributions of a particular editor) or logged actions. Edit tags can be found in page histories, recent changes and certain special pages.

Tags are useful for maintenance as well as administration. Edit tags can also be used to detect potentially harmful edits or software bugs. They also help identify how an edit was made: for example, via mobile, with VisualEditor, or using GettingStarted. Read more about tags here.

This tutorial is a simple guide on how to gather edit data from Wikipedia, specifically mobile edits, which are edits made from a mobile device (web or app). We will see how to extract data from history dumps as well as from an API, and how to visualize the data we extract. So let's dive into it!


Extract Edits from dumps

Wikimedia data dumps are collections of data that include Wikimedia content and metadata, search indexes and more. They are snapshots, so they are not always up to date or consistent, but they are a good way to get a general picture. You can read more about them here.

After a brief introduction, this section explores each of the dumps and then parses them to study mobile edits.

Every language edition of Wikipedia has its own edit tag tables and history dumps. You can find all the dbnames here. The wiki we use in this tutorial, simplewiki, is a simplified English version of Wikipedia, found here.

You can replace the LANGUAGE parameter with the dbname of whichever wiki you want to study. For example, 'arwiki' can be used to study Arabic Wikipedia or 'enwiki' for English Wikipedia.

As an example, you can extract data as below.
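The following is a minimal sketch of how the dump file paths might be set up. The directory `/public/dumps/public`, the `latest` snapshot name, the `dump_path` helper, the `HISTORY_DUMP_FN` name and the exact file suffixes are assumptions for illustration; adjust them to wherever your copy of the dumps lives.

```python
import os

# Assumed values: change these to match your environment.
LANGUAGE = "simplewiki"   # dbname of the wiki to study, e.g. "arwiki" or "enwiki"
DUMP_DATE = "latest"      # assumed snapshot name; a dated snapshot directory also works
DUMP_DIR = os.path.join("/public/dumps/public", LANGUAGE, DUMP_DATE)  # assumed location


def dump_path(suffix):
    """Build the path of one dump file (hypothetical helper)."""
    return os.path.join(DUMP_DIR, f"{LANGUAGE}-{DUMP_DATE}-{suffix}")


# Tag descriptions, tag assignments and revision history, respectively.
TAG_DESC_DUMP_FN = dump_path("change_tag_def.sql.gz")
TAG_DUMP_FN = dump_path("change_tag.sql.gz")
HISTORY_DUMP_FN = dump_path("stub-meta-history.xml.gz")

print(TAG_DESC_DUMP_FN)
print(TAG_DUMP_FN)
print(HISTORY_DUMP_FN)
```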

Exploring a Change description dump

To take a closer look at the dumps, we can use the zcat command. Its output gives you an idea of the structure of the files and helps you figure out how to parse them.

As an example, let's take a closer look at the change description dump.
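If you would rather stay in Python than call zcat, a rough equivalent is to read the first few lines of the compressed file with the gzip module. This is only a sketch: `peek` is a hypothetical helper, and lines are truncated because the INSERT lines in SQL dumps can be very long.

```python
import gzip
from itertools import islice


def peek(path, num_lines=5, max_chars=200):
    """Return the first `num_lines` lines of a gzip-compressed text file.

    Each line is cut to `max_chars` characters so that very long
    INSERT statements do not flood the output.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fin:
        return [line.rstrip("\n")[:max_chars] for line in islice(fin, num_lines)]
```

For example, `for line in peek(TAG_DESC_DUMP_FN, num_lines=40): print(line)` would print the start of the change description dump, assuming `TAG_DESC_DUMP_FN` holds its path.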

Looking at the output more closely, the CREATE TABLE statement shows that each datapoint has 4 fields (ctd_id, ... , ctd_count). A description of each field can be found here. Viewing dumps in this way is important because the structure of the files determines which parsing methods are practical.

The data that we want is on lines that start with INSERT INTO change_tag_def VALUES.

The third datapoint (3,'mw-undo',0,50455) can be interpreted as follows for clarity:
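As an illustrative sketch, the datapoint can be labelled with the column names of the change_tag_def table. The two middle column names (ctd_name, ctd_user_defined) and the short descriptions are taken from the MediaWiki schema rather than from the output above, so check them against the CREATE TABLE statement in your own dump.

```python
from collections import namedtuple

# Column order assumed from the change_tag_def schema.
TagDef = namedtuple("TagDef", ["ctd_id", "ctd_name", "ctd_user_defined", "ctd_count"])

# Short, informal descriptions of each column.
FIELD_NOTES = {
    "ctd_id": "numeric ID of the tag",
    "ctd_name": "name of the tag",
    "ctd_user_defined": "1 if the tag was defined by users, 0 if by software",
    "ctd_count": "number of times the tag has been applied",
}

row = TagDef(3, "mw-undo", 0, 50455)
for field, value in row._asdict().items():
    print(f"{field} = {value!r}  ({FIELD_NOTES[field]})")
```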

Exploring a Change tag dump

We can use the following command to inspect a change tag dump as well.

Observe the CREATE TABLE statement. Each datapoint has 6 fields (ct_id, ct_rc_id, ... , ct_tag_id).

A description of the fields in the data can be found here.

The data that we want is on lines that start with INSERT INTO change_tag VALUES...

The first datapoint (1,2515963,NULL,2489613,NULL,39) is explained for clarity:
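A small sketch of that interpretation follows. The column names between ct_rc_id and ct_tag_id (ct_log_id, ct_rev_id, ct_params) are taken from the MediaWiki change_tag schema and are an assumption here; SQL NULL is represented as Python `None`.

```python
from collections import namedtuple

# Column order assumed from the change_tag schema.
ChangeTag = namedtuple(
    "ChangeTag",
    ["ct_id", "ct_rc_id", "ct_log_id", "ct_rev_id", "ct_params", "ct_tag_id"],
)

row = ChangeTag(1, 2515963, None, 2489613, None, 39)

print(f"row ID:              {row.ct_id}")
print(f"recent-changes ID:   {row.ct_rc_id}")
print(f"log entry ID:        {row.ct_log_id}")   # None: the tag is not on a logged action
print(f"revision ID:         {row.ct_rev_id}")   # the edit that carries the tag
print(f"tag parameters:      {row.ct_params}")
print(f"tag ID (see ctd_id): {row.ct_tag_id}")   # joins to change_tag_def.ctd_id
```

In other words, this row says that revision 2489613 carries the tag whose ID is 39 in the change description dump.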

Exploring a History dump

As the last task, let's observe the history dump. It can be observed in the same manner by using the following command

After some dump-level metadata, you will see a page object with some metadata about the page, followed by a list of revisions from oldest to newest. Each revision is an edit together with metadata about that edit. The start of the data for the April page below can also be viewed here, and its metadata here.

Identifying the change tag associated with a mobile edit

Time to get our hands dirty. To get a better idea of how to use dumps, let's loop through the TAG_DESC_DUMP_FN file and identify the change tag ID associated with mobile edits. For additional tags and extended descriptions on Simple English Wikipedia, see here.

In this example the gzip library is used to decompress the file; read more about it here.

The tuples we need come after VALUES in the INSERT statement. The extracted data is therefore processed and the tuples are stored as strings. In the end, you can loop through this array of strings and find the ID number that corresponds to a mobile edit.
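The steps above could be sketched as follows. This is not necessarily how the original tutorial code did it: the regular expression, the helper names and the assumption that the tag is named `mobile edit` are all illustrative, and the pattern assumes the four-column layout of change_tag_def.

```python
import gzip
import re

# One (ctd_id,'ctd_name',ctd_user_defined,ctd_count) tuple; the name may
# contain backslash-escaped characters.
TAG_DEF_ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)',(\d+),(\d+)\)")


def iter_tag_defs(lines):
    """Yield (ctd_id, ctd_name, ctd_user_defined, ctd_count) from INSERT lines."""
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip comments, CREATE TABLE, etc.
        for match in TAG_DEF_ROW.finditer(line):
            yield (int(match.group(1)), match.group(2),
                   int(match.group(3)), int(match.group(4)))


def find_tag_id(lines, tag_name="mobile edit"):
    """Return the ID of the tag called `tag_name`, or None if it is absent."""
    for ctd_id, ctd_name, _user_defined, _count in iter_tag_defs(lines):
        if ctd_name == tag_name:
            return ctd_id
    return None


def find_tag_id_in_dump(path, tag_name="mobile edit"):
    """Same as find_tag_id, but reads a gzip-compressed SQL dump."""
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        return find_tag_id(fin, tag_name)
```

Calling `find_tag_id_in_dump(TAG_DESC_DUMP_FN)` would then give the tag ID to look for in the change tag dump, assuming `TAG_DESC_DUMP_FN` points at the change description dump.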

Storing the revision IDs associated with mobile edits

To observe the revision IDs that are associated with a mobile edit we can loop through the TAG_DUMP_FN. This is very similar to the above method as both dumps have a similar structure.
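A minimal sketch of this step is shown below. The regular expression assumes the six-column layout described earlier (with the revision ID in the fourth position and the tag ID in the last), and `collect_rev_ids` is a hypothetical helper name.

```python
import gzip
import re

# One (ct_id,ct_rc_id,ct_log_id,ct_rev_id,ct_params,ct_tag_id) tuple.
CHANGE_TAG_ROW = re.compile(
    r"\((\d+),(\d+|NULL),(\d+|NULL),(\d+|NULL),(NULL|'(?:[^'\\]|\\.)*'),(\d+)\)"
)


def collect_rev_ids(lines, tag_id):
    """Return the set of revision IDs that carry the tag with ID `tag_id`."""
    rev_ids = set()
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue
        for match in CHANGE_TAG_ROW.finditer(line):
            ct_rev_id = match.group(4)
            ct_tag_id = int(match.group(6))
            # Rows without a revision ID describe logged actions, not edits.
            if ct_tag_id == tag_id and ct_rev_id != "NULL":
                rev_ids.add(int(ct_rev_id))
    return rev_ids


def collect_rev_ids_from_dump(path, tag_id):
    """Same as collect_rev_ids, but reads a gzip-compressed SQL dump."""
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        return collect_rev_ids(fin, tag_id)
```

A set is used so that checking whether a given revision is a mobile edit stays fast when the history dump is scanned later.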

Comparison of mobile vs non-mobile edits made in each year

To observe the history dump better, let us parse through it and record how many mobile vs. non-mobile edits were made in each year. For this we will be using mwxml which is a Python library that allows you to easily manage the XML dumps.
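The counting could look roughly like the sketch below. It assumes that iterating an mwxml dump yields pages, that iterating a page yields revisions, and that each revision exposes `id` and a `timestamp` with a `strftime` method; the function names are hypothetical. The mwxml import is placed inside the function so that the counting logic can be tried without the library installed.

```python
import gzip
from collections import Counter


def count_edits_by_year(pages, mobile_rev_ids):
    """Count mobile and non-mobile revisions per year.

    `pages` is any iterable of pages, where each page is an iterable of
    revisions with `.id` and `.timestamp` attributes.
    """
    mobile, non_mobile = Counter(), Counter()
    for page in pages:
        for revision in page:
            year = int(revision.timestamp.strftime("%Y"))
            if revision.id in mobile_rev_ids:
                mobile[year] += 1
            else:
                non_mobile[year] += 1
    return mobile, non_mobile


def count_edits_in_dump(path, mobile_rev_ids):
    """Run count_edits_by_year over a gzip-compressed XML history dump."""
    import mwxml  # third-party: pip install mwxml

    with gzip.open(path) as fin:
        dump = mwxml.Dump.from_file(fin)
        return count_edits_by_year(dump, mobile_rev_ids)
```

Here `mobile_rev_ids` would be the set of revision IDs gathered from the change tag dump in the previous step.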

Accessing edit tag APIs

The Revisions API can be a much simpler way to access data about edit tags for a given article if you know what articles you are interested in and are interested in relatively few articles (e.g., hundreds or low thousands).

NOTE: the APIs are up to date, while the MediaWiki dumps are snapshots taken at specific points in time and are always at least several days behind. The data you get from the dumps might therefore differ from the APIs if edits have been made to a page in the intervening days.
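As a rough sketch of such a request, the code below asks the Revisions API for the IDs, timestamps and tags of a page's most recent revisions and keeps those tagged as mobile edits. The Simple English Wikipedia endpoint, the `mobile edit` tag name, the User-Agent string and the helper names are assumptions for illustration; only the standard library is used.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://simple.wikipedia.org/w/api.php"  # assumed endpoint


def build_revisions_url(title, limit=50):
    """Build a Revisions API query for the latest `limit` revisions of a page."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|tags",
        "rvlimit": str(limit),
    }
    return API_URL + "?" + urlencode(params)


def extract_tagged_revisions(response, tag="mobile edit"):
    """Return (revid, timestamp) pairs for revisions that carry `tag`."""
    tagged = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            if tag in rev.get("tags", []):
                tagged.append((rev["revid"], rev["timestamp"]))
    return tagged


def fetch_tagged_revisions(title, tag="mobile edit", limit=50):
    """Query the live API (needs network access) and filter by tag."""
    request = Request(
        build_revisions_url(title, limit),
        headers={"User-Agent": "edit-tags-tutorial-example/0.1"},  # placeholder
    )
    with urlopen(request) as resp:
        return extract_tagged_revisions(json.load(resp), tag)
```

For instance, `fetch_tagged_revisions("April")` would return the mobile edits among the last 50 revisions of the April page. Fetching a page's full history would additionally require following the API's continuation parameters, which this sketch leaves out.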

Visualization of tagged edit data

To visualize the data, we can use matplotlib, a plotting library for Python. In this tutorial we will only plot the number of mobile edits per year, but you can extend this to other graphs and diagrams of your choice, for example to include the other edits or a comparison between the two.
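A minimal plotting sketch is given below. It assumes the per-year counts are available as a mapping from year to number of mobile edits; `to_series` and `plot_mobile_edits` are hypothetical helper names, and matplotlib is imported inside the plotting function so the data preparation can be run on its own.

```python
def to_series(counts_by_year):
    """Turn a {year: count} mapping into parallel, year-sorted lists."""
    years = sorted(counts_by_year)
    return years, [counts_by_year[year] for year in years]


def plot_mobile_edits(counts_by_year, out_path=None):
    """Draw a bar chart of mobile edits per year.

    If `out_path` is given the figure is saved there; otherwise it is shown.
    """
    import matplotlib.pyplot as plt  # third-party: pip install matplotlib

    years, counts = to_series(counts_by_year)
    fig, ax = plt.subplots()
    ax.bar([str(year) for year in years], counts)
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of mobile edits")
    ax.set_title("Mobile edits per year")
    if out_path is not None:
        fig.savefig(out_path)
    else:
        plt.show()
    return fig
```

To compare mobile and non-mobile edits, one option is to call `ax.bar` twice with slightly offset positions, or to plot the two series as lines on the same axes.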