My first Wikipedia edit data extraction and analysis

Table of Contents

  1. Overview
  2. Getting started
    1. Prerequisites
      1. Skills
      2. Accounts
      3. Libraries
  3. Step-by-step guide
    1. Extracting and inspecting edit data from the history dumps
    2. Analyzing edit data extracted from the history dumps
    3. Extracting edit data from Wikimedia's API
    4. Analyzing edit data extracted from Wikimedia's API

Overview

Data from Wikipedia and the other Wikimedia projects is available to the public through several sources. To access Wikipedia's edit metadata, two sources can be used: the history dumps and Wikimedia's REST API.

History dumps are large queryable files whose data can be inconsistent, incomplete, and at least several days behind (they represent specific snapshots in time), whereas the API is a continuously maintained interface that provides up-to-date data.

Both sources are useful and can be combined for complex analysis. This guide uses Python and both sources to study Wikipedia's edits through edit tags. These tags help automatically classify and describe the nature of edits, and they are often used to identify harmful behaviour.

This guide will teach you how to:

  1. Extract and inspect edit data from the history dumps
  2. Analyze edit data extracted from the history dumps
  3. Extract edit data from Wikimedia's API
  4. Analyze edit data extracted from Wikimedia's API

Getting started

Prerequisites

Skills

Accounts

Libraries

Step-by-step guide

Step 1. Extracting and inspecting edit data from the history dumps

Every language on Wikipedia has its own edit tag tables and history dumps. You can find all the dbnames as follows:
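
A minimal sketch of one way to do this, assuming the requests library and the publicly served all.dblist file on noc.wikimedia.org (an assumed source, not the only one):

    import requests

    # Fetch the public list of all Wikimedia database names (one dbname per line).
    # The URL below is an assumed source for this sketch.
    DBLIST_URL = 'https://noc.wikimedia.org/conf/dblists/all.dblist'

    response = requests.get(DBLIST_URL)
    response.raise_for_status()

    dbnames = [line.strip() for line in response.text.splitlines() if line.strip()]
    print(len(dbnames), 'wikis found')
    print(dbnames[:10])  # 'simplewiki', 'enwiki', 'arwiki', ... appear in this list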

The dbname simplewiki in the list above corresponds to the Simple English version of Wikipedia. You can replace the LANGUAGE parameter below with, for example, 'arwiki' to study Arabic Wikipedia or 'enwiki' for English Wikipedia.

You will also set other parameters: DUMP_DIR for the directory on the PAWS server that holds the Wikimedia dumps, and three specific datasets in this directory that will be used in this guide.
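
A possible setup is sketched below; the dump directory and the exact file names are assumptions based on the standard PAWS dump mount and Wikimedia's dump naming scheme, so adjust them to what you actually find on the server.

    # LANGUAGE is the dbname of the wiki to study; replace it as described above.
    LANGUAGE = 'simplewiki'

    # Directory on the PAWS server that holds the latest dumps for this wiki
    # (assumed path based on the standard dump mount).
    DUMP_DIR = f'/public/dumps/public/{LANGUAGE}/latest/'

    # The three datasets used in this guide (file names assumed):
    TAG_DESC_DUMP_FN = f'{LANGUAGE}-latest-change_tag_def.sql.gz'      # edit tag definitions
    TAG_DUMP_FN = f'{LANGUAGE}-latest-change_tag.sql.gz'               # tag <-> revision mapping
    HISTORY_DUMP_FN = f'{LANGUAGE}-latest-pages-meta-history.xml.bz2'  # full revision history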

Now you will extract these history dump datasets.
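
One approach, sketched below, is to read the compressed files in place as text streams rather than fully unpacking them (the history dump in particular can be very large); the open_dump helper is a name introduced for this guide, not part of any library.

    import os
    import gzip
    import bz2

    def open_dump(filename):
        """Open a compressed dump from DUMP_DIR as a streaming text file."""
        path = os.path.join(DUMP_DIR, filename)
        if filename.endswith('.gz'):
            return gzip.open(path, 'rt', encoding='utf-8', errors='replace')
        if filename.endswith('.bz2'):
            return bz2.open(path, 'rt', encoding='utf-8', errors='replace')
        return open(path, 'r', encoding='utf-8', errors='replace')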

Inspect the top of the change tag definitions dump (TAG_DESC_DUMP_FN) to see what it looks like. As you can see from the CREATE TABLE statement, each datapoint has 4 fields (ctd_id, ... , ctd_count).
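
For example, printing the first few dozen lines with the open_dump helper sketched above is enough to see the CREATE TABLE statement:

    from itertools import islice

    # Print the top of the change tag definitions dump.
    with open_dump(TAG_DESC_DUMP_FN) as fin:
        for line in islice(fin, 50):
            print(line.rstrip())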

And the data that we want is on lines that start with INSERT INTO change_tag_def VALUES... The third datapoint (3,'mw-undo',0,50455) can be interpreted as: ctd_id = 3 is the tag's internal ID, ctd_name = 'mw-undo' is the tag's name, ctd_user_defined = 0 means the tag is defined by the software rather than by users, and ctd_count = 50455 is the number of changes that carry this tag.
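
A simplified sketch for collecting these rows is to scan the INSERT lines with a regular expression; this assumes tag names contain no escaped quotes or commas, which holds for this table but not for arbitrary SQL.

    import re

    # Matches (ctd_id,'ctd_name',ctd_user_defined,ctd_count) tuples.
    CTD_ROW_RE = re.compile(r"\((\d+),'([^']*)',(\d+),(\d+)\)")

    tag_def = {}  # ctd_id -> (tag name, number of tagged changes)
    with open_dump(TAG_DESC_DUMP_FN) as fin:
        for line in fin:
            if line.startswith('INSERT INTO'):
                for ctd_id, name, user_defined, count in CTD_ROW_RE.findall(line):
                    tag_def[int(ctd_id)] = (name, int(count))

    print(tag_def.get(3))  # e.g. ('mw-undo', 50455), the datapoint discussed above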

Now inspect the top of the change tag dump TAG_DUMP_FN to see what it looks like. As you can see from the CREATE TABLE statement, each datapoint has 6 fields (ct_id, ct_rc_id, ... , ct_tag_id).

And the data that we want is on lines that start with INSERT INTO change_tag VALUES... The first datapoint (1,2515963,NULL,2489613,NULL,39) can be interpreted as: ct_id = 1 is the row's ID, ct_rc_id = 2515963 points to the corresponding recent changes entry, ct_log_id = NULL means there is no associated log entry, ct_rev_id = 2489613 is the revision (edit) the tag applies to, ct_params = NULL means the tag has no extra parameters, and ct_tag_id = 39 is the tag's ID, which can be looked up in change_tag_def.
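
The same regular-expression approach can be sketched for this table to map each tagged revision to its tag IDs; values here are integers, NULL, or a quoted string, so a per-row pattern is sufficient for this sketch.

    import re
    from collections import defaultdict

    # Matches (ct_id,ct_rc_id,ct_log_id,ct_rev_id,ct_params,ct_tag_id) tuples.
    CT_ROW_RE = re.compile(r"\((\d+),(\d+|NULL),(\d+|NULL),(\d+|NULL),(NULL|'[^']*'),(\d+)\)")

    rev_tags = defaultdict(list)  # ct_rev_id -> list of ct_tag_id
    with open_dump(TAG_DUMP_FN) as fin:
        for line in fin:
            if line.startswith('INSERT INTO'):
                for ct_id, rc_id, log_id, rev_id, params, tag_id in CT_ROW_RE.findall(line):
                    if rev_id != 'NULL':
                        rev_tags[int(rev_id)].append(int(tag_id))

    print(len(rev_tags), 'tagged revisions')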

Inspect the top of the history dump (HISTORY_DUMP_FN) to see what it looks like. After some metadata, you can see a page element with some metadata about the page, followed by a list of revisions from oldest to newest, where each revision is an edit together with metadata about that edit.
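
Printing the first lines of the compressed XML, for example with the open_dump helper sketched above, shows the page metadata and the first revisions:

    from itertools import islice

    # Print the top of the XML history dump.
    with open_dump(HISTORY_DUMP_FN) as fin:
        for line in islice(fin, 80):
            print(line.rstrip())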

The start of the data for the April page below can also be viewed here and its metadata here.

Step 2. Analyzing edit data extracted from the history dumps
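
One possible analysis, sketched below, joins the structures built in step 1 (tag_def and rev_tags) to count how often each edit tag appears among tagged revisions:

    from collections import Counter

    # Count tag occurrences across all tagged revisions.
    tag_counts = Counter()
    for rev_id, tag_ids in rev_tags.items():
        for tag_id in tag_ids:
            name, _ = tag_def.get(tag_id, (f'unknown-{tag_id}', 0))
            tag_counts[name] += 1

    for name, count in tag_counts.most_common(10):
        print(f'{name:30} {count}')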

Step 3. Extracting edit data from Wikimedia's API

To open a session with the API, use mwapi.Session, which expects the following parameters: mwapi.Session(host, user_agent=None, formatversion=None, api_path=None, timeout=None, session=None, **session_params)

First, combine the LANGUAGE variable defined in step 1 with SITENAME to build the host. Next, as a best practice, configure the user agent with a descriptive label (e.g., our tutorial_label) and your contact email.
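
A sketch of opening the session and requesting recent edits with their tags follows; the host construction from LANGUAGE and SITENAME, the tutorial_label user agent, and the example email are assumptions and placeholders of this sketch.

    import mwapi

    # Build the API host from the dbname set in step 1 (assumed construction).
    SITENAME = '.wikipedia.org'
    host = 'https://' + LANGUAGE.replace('wiki', '') + SITENAME  # e.g. https://simple.wikipedia.org

    # Descriptive user agent: a label for this tutorial plus a contact email (placeholders).
    session = mwapi.Session(host, user_agent='tutorial_label <your-email@example.com>')

    # Example query: the latest recent changes, including their edit tags.
    response = session.get(
        action='query',
        list='recentchanges',
        rcprop='ids|timestamp|tags',
        rclimit=10,
    )
    for change in response['query']['recentchanges']:
        print(change['revid'], change['timestamp'], change['tags'])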

Step 4. Analyzing edit data extracted from Wikimedia's API
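
One possible analysis, sketched below, counts the tags attached to the recent changes fetched in step 3; re-run the query with a larger rclimit, or use API continuation, for a more representative sample:

    from collections import Counter

    # Count how often each edit tag appears in the fetched recent changes.
    api_tag_counts = Counter()
    for change in response['query']['recentchanges']:
        api_tag_counts.update(change['tags'])

    for name, count in api_tag_counts.most_common(10):
        print(f'{name:30} {count}')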