Studying Wikipedia Edits by Tag

This tutorial is for anyone who wants to learn how to gather edit data from Wikimedia projects. Such data can show, for example, how much editing has shifted to mobile, i.e. editing on a mobile device instead of a desktop computer.

Learning outcomes:

Table Of Contents

Introduction
Accessing Tagged Edits via Dumps
Accessing the Edit Tag APIs
Example Analyses of Edit Tag Data
Further reading
Footnotes

Introduction

Data on which tags are associated with which edits is available via the MediaWiki dumps or the API, and each route has its own set of steps. Throughout the tutorial, we'll focus on the mobile edit tag. At the end, as a pointer for further reading, the most frequently used tags are visualized.

Accessing Tagged Edits via Dumps

Wikimedia Dumps are a free and reusable resource used for archiving, for bot editing of the wikis, and for providing the data in an easily queryable format, among other things. More information on dumps can be gathered here. For this tutorial, we will use the dumps to access edits made from mobile devices, i.e. edits carrying the mobile edit tag. For easier understanding, we will break the process into the following steps:

  1. Understanding the overview and structure of dumps
  2. Gathering the data by parsing the dump file
  3. Analysing and Visualising

Understanding the Overview and Structure of Dumps

There are a few important things to note before working with Dumps.

Description

The TAG_DUMP_DESC_FN dump file contains the table change_tag_def, which lets us look up the ctd_id (tag ID) that links each tag to revisions. Inspecting the file shows the following table structure:

Field             Type                 Null  Key  Default  Extra
ctd_id            int(10) unsigned     NO    PRI  NULL     auto_increment
ctd_name          varbinary(255)       NO    UNI  NULL
ctd_user_defined  tinyint(1)           NO    MUL  NULL
ctd_count         bigint(20) unsigned  NO    MUL  0

The data we want is on the lines that start with INSERT INTO change_tag_def VALUES... The third data point, (3,'mw-undo',0,50455), can be interpreted as:

3: unique ID for the tag -- can use this to join with the change_tag table information

'mw-undo': descriptive name -- in this case, the tag is applied when an editor uses the "undo" link to revert an edit

0: not defined by a user

50455: has been used for 50,455 edits in Simple English Wikipedia

This can also be seen here.
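The interpretation above can be turned into a small parser. The following is a minimal sketch, not the tutorial's own code: the function name parse_tag_defs and the regular expression are illustrative assumptions, and the pattern assumes tag names contain no escaped quotes. The sample line contains only the mw-undo row quoted above.

```python
import re

# Matches one (ctd_id,'ctd_name',ctd_user_defined,ctd_count) tuple.
# Assumption: tag names contain no escaped single quotes.
TAG_DEF_ROW = re.compile(r"\((\d+),'([^']*)',(\d+),(\d+)\)")


def parse_tag_defs(insert_line):
    """Return {tag_name: (tag_id, user_defined, count)} for one INSERT line."""
    tags = {}
    for tag_id, name, user_defined, count in TAG_DEF_ROW.findall(insert_line):
        tags[name] = (int(tag_id), bool(int(user_defined)), int(count))
    return tags


# Sample line holding only the row discussed in the text.
sample = "INSERT INTO `change_tag_def` VALUES (3,'mw-undo',0,50455);"
tag_defs = parse_tag_defs(sample)
undo_id, undo_user_defined, undo_count = tag_defs["mw-undo"]
print(undo_id, undo_user_defined, undo_count)
```

Looking up the mobile edit tag would work the same way: index the returned dictionary by the tag name and keep the first element as the tag ID to join against change_tag.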

Description

The TAG_DUMP_FN dump file contains the table change_tag, which is the one we are interested in. Inspecting the file shows the following table structure:

Field      Type              Null  Key  Default  Extra
ct_id      int(10) unsigned  NO    PRI  NULL     auto_increment
ct_rc_id   int(10) unsigned  YES   MUL  NULL
ct_log_id  int(10) unsigned  YES   MUL  NULL
ct_rev_id  int(10) unsigned  YES   MUL  NULL
ct_params  blob              YES        NULL
ct_tag_id  int(10) unsigned  NO    MUL  NULL

From this table we'll get the revision ID (ct_rev_id) of each revision that carries a specific tag ID (ct_tag_id).

The data we want is on the lines that start with INSERT INTO change_tag VALUES... The first data point, (1,2515963,NULL,2489613,NULL,39), can be interpreted as:

1: unique ID -- can be ignored

2515963: links to recent changes table but can be ignored for now

NULL: doesn't link to logging table

2489613: revision ID of an edit that was tagged -- this will allow us to tie tags to the XML history

NULL: additional parameters that can be ignored

39: the tag that was applied to edit 2489613 was #39 (which happens to be 'New user creating interrogative pages' if we check the change_tag_def table above)

This can also be seen here.

Description

In the XML history dump, after some header metadata, you can see a page object with metadata about the page, followed by a list of revisions from oldest to newest. Each revision is an edit and includes its own metadata. The beginning of the data for the April page is also here and its metadata here.

Gathering the data by parsing the dump files

To gather the revisions associated with mobile edits:

The dump contains several INSERT statements. For each INSERT statement, we loop through its rows and use the find rev ids function to get the mobile edit revision IDs, adding the values returned by the function to a running set.
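One way this loop could look is sketched below. It is an illustration under stated assumptions rather than the tutorial's original function: the names find_rev_ids and collect_rev_ids, the regular expression, and the assumption that ct_params is either NULL or a quoted string without parentheses are all mine. The sample row is the one interpreted earlier, so tag ID 39 is used only because it appears in that row.

```python
import re

# Matches one (ct_id,ct_rc_id,ct_log_id,ct_rev_id,ct_params,ct_tag_id) tuple.
# Assumption: ct_params is NULL or a quoted string with no quotes inside.
CHANGE_TAG_ROW = re.compile(
    r"\((\d+),(\d+|NULL),(\d+|NULL),(\d+|NULL),(NULL|'[^']*'),(\d+)\)"
)


def find_rev_ids(insert_line, tag_id):
    """Return the set of revision IDs in one INSERT line that carry tag_id."""
    rev_ids = set()
    for _ct_id, _rc_id, _log_id, rev_id, _params, row_tag in CHANGE_TAG_ROW.findall(
        insert_line
    ):
        # Rows without a revision (e.g. log entries) have NULL here.
        if rev_id != "NULL" and int(row_tag) == tag_id:
            rev_ids.add(int(rev_id))
    return rev_ids


def collect_rev_ids(lines, tag_id):
    """Accumulate matching revision IDs over every INSERT statement."""
    rev_ids = set()
    for line in lines:
        if line.startswith("INSERT INTO"):
            rev_ids.update(find_rev_ids(line, tag_id))
    return rev_ids


# The first data point from the text; 39 is the tag ID on that row.
sample_lines = [
    "-- header comment\n",
    "INSERT INTO `change_tag` VALUES (1,2515963,NULL,2489613,NULL,39);\n",
]
tagged_rev_ids = collect_rev_ids(sample_lines, tag_id=39)
print(tagged_rev_ids)
```

For a real run, the lines would come from the dump file itself, for example by iterating over gzip.open(TAG_DUMP_FN, "rt") if the file is gzip-compressed, and tag_id would be the ID found for the mobile edit tag in change_tag_def.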

Analysing and Visualising

Now we are in a position to loop through HISTORY_DUMP_FN and record how many mobile and non-mobile edits were made each year.
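A streaming pass over the history XML might be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the sample XML is made up to mimic the page/revision layout described earlier, and the code relies only on each revision having direct id and timestamp children.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def count_edits_by_year(xml_file, mobile_rev_ids):
    """Count (year, 'mobile' | 'non-mobile') pairs over all revisions."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "revision":
            continue
        rev_id = year = None
        # Only direct children: the contributor's <id> is nested one level deeper.
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                rev_id = int(child.text)
            elif name == "timestamp":
                year = int(child.text[:4])
        if rev_id is not None and year is not None:
            kind = "mobile" if rev_id in mobile_rev_ids else "non-mobile"
            counts[(year, kind)] += 1
        elem.clear()  # free the revision's text and children once counted
    return counts


# Made-up sample in the shape of a history dump (not real data).
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>April</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <timestamp>2019-05-01T00:00:00Z</timestamp>
      <contributor><username>A</username><id>7</id></contributor>
    </revision>
    <revision>
      <id>101</id>
      <timestamp>2020-01-02T03:04:05Z</timestamp>
    </revision>
  </page>
</mediawiki>"""

yearly_counts = count_edits_by_year(io.BytesIO(SAMPLE_XML), mobile_rev_ids={101})
print(dict(yearly_counts))
```

Because iterparse accepts any binary file object, a compressed dump could be passed directly, for example bz2.open(HISTORY_DUMP_FN, "rb") if the history file is bz2-compressed.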

Accessing the Edit Tag APIs

In this section, we'll focus on the following operations:

Choose a topic and find the mobile and non-mobile tagged edits made in the year 2020 using the dumps

Here we choose the topic Deaths in 2020 and the year 2020. To use a different topic and/or year, change the variables TOPIC and/or YEAR respectively.

Use the API endpoint to find the mobile and non-mobile tagged edits made in the year 2020 for the topic chosen above

The Revisions API can be a much simpler way to access data about edit tags for a given article if you

  1. know what articles you are interested in
  2. are interested in relatively few articles (e.g., hundreds or low thousands)

NOTE: the APIs are up to date, while the MediaWiki dumps are snapshots taken at specific points in time and are always at least several days behind, so the data you get from the dumps might differ from the API results if edits have been made to a page in the intervening days.

We'll follow these steps to get the data from the API:

  1. Create a session variable and set parameters for get request
  2. Loop through the data and store it
  3. Analyse count of mobile and non mobile edits

Create a session variable and set parameters for get request

It is best practice to include a contact email in the user agent. This is generally private information, though, so do not change it to yours if you are working in the PAWS environment or adding to a GitHub repo.

The user_agent helps identify the request if there's an issue.

We add all the parameters necessary to filter the data we are interested in.

NOTE: rvdir: Direction to list in. (enum)

older: List newest revisions first (default) --rvstart/rvstartid has to be higher than rvend/rvendid

newer: List oldest revisions first --rvstart/rvstartid has to be lower than rvend/rvendid

For a further description of the parameters, follow this.
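The session and parameter setup might look like the sketch below. Several things here are assumptions rather than the tutorial's own choices: the use of the third-party requests library, the Simple English Wikipedia endpoint in API_URL, the user-agent string, and the helper names. The parameter names themselves (prop=revisions, rvprop, rvlimit, rvdir, rvstart, rvend) are standard Revisions API parameters.

```python
# Assumed endpoint: Simple English Wikipedia, as used in the dump examples.
API_URL = "https://simple.wikipedia.org/w/api.php"


def build_revision_params(title, year):
    """Parameters asking for every revision of `title` made during `year`."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|tags",  # tags is what tells us about mobile edits
        "rvlimit": "max",
        "rvdir": "newer",  # oldest first, so rvstart must be lower than rvend
        "rvstart": f"{year}-01-01T00:00:00Z",
        "rvend": f"{year}-12-31T23:59:59Z",
        "format": "json",
        "formatversion": "2",
    }


def make_session(contact):
    """Create a requests session whose User-Agent carries a contact address."""
    import requests  # third-party; imported here so the helpers above need no install

    session = requests.Session()
    # Hypothetical user-agent text; replace `contact` with a shared, non-private address.
    session.headers.update({"User-Agent": f"edit-tag-tutorial ({contact})"})
    return session


params = build_revision_params("Deaths in 2020", 2020)
print(params["rvstart"], params["rvend"])
```

Setting the user agent on the session once means every later request carries it without repeating the header.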

Loop through the data and store it

We'll now loop through the data and count the number of mobile and non-mobile edits. This is also available here.
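A counting loop that also follows the API's continuation tokens could be sketched as follows. The function name is hypothetical, `session` is assumed to be any object with a requests-style get(url, params=...) method whose result has .json(), and the tag name "mobile edit" is the one this tutorial focuses on.

```python
def count_mobile_edits(session, api_url, params, mobile_tag="mobile edit"):
    """Page through revision results and count mobile vs. non-mobile edits."""
    counts = {"mobile": 0, "non-mobile": 0}
    request = dict(params)
    while True:
        data = session.get(api_url, params=request).json()
        pages = data.get("query", {}).get("pages", [])
        if isinstance(pages, dict):  # formatversion=1 keys pages by page ID
            pages = pages.values()
        for page in pages:
            for rev in page.get("revisions", []):
                kind = "mobile" if mobile_tag in rev.get("tags", []) else "non-mobile"
                counts[kind] += 1
        if "continue" not in data:
            return counts
        # Merge the continuation values into the original parameters and repeat.
        request = {**params, **data["continue"]}
```

With a real session this would be called as count_mobile_edits(session, API_URL, params), where API_URL and params are whatever endpoint and query parameters were set up in the previous step.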

Comparison and conclusion

To compare the findings from the two approaches, we'll use pie charts.
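A side-by-side comparison could be sketched like this, assuming matplotlib is available and that each approach produced a dictionary of the form {"mobile": n, "non-mobile": m}. The function names and the panel titles are illustrative.

```python
def edit_shares(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {kind: 0.0 for kind in counts}
    return {kind: 100.0 * n / total for kind, n in counts.items()}


def plot_comparison(dump_counts, api_counts):
    """Draw one pie chart per approach and return the figure."""
    import matplotlib.pyplot as plt  # third-party; imported lazily

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    panels = [("Dumps", dump_counts), ("API", api_counts)]
    for ax, (title, counts) in zip(axes, panels):
        ax.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.1f%%")
        ax.set_title(title)
    return fig


# Made-up counts, only to show the shape of the input.
example_shares = edit_shares({"mobile": 1, "non-mobile": 3})
print(example_shares)
```

If the dump snapshot and the API call cover the same revisions, the two pies should match; a difference would point to edits made after the dump was taken.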