Studying Wikipedia Page Protections

This notebook provides a tutorial on how to study page protections on Wikipedia, either via the Mediawiki dumps or the API. It has three stages:

Accessing the Page Protection Dumps

This is an example of how to parse through Mediawiki dumps and determine what sorts of edit protections are applied to a given Wikipedia article.

1. Extract data from the dump file.

Inspect the page_restrictions table
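A minimal sketch of this step, assuming the dump was downloaded as a gzipped SQL file (the filename `enwiki-latest-page_restrictions.sql.gz` below is a placeholder for whichever snapshot you use) and that the rows can be pulled out of the `INSERT` statements with a regular expression:

```python
import gzip
import re

# Placeholder filename -- adjust to the snapshot you downloaded from
# https://dumps.wikimedia.org/enwiki/
DUMP_FILE = 'enwiki-latest-page_restrictions.sql.gz'

# Each row of the page_restrictions table is dumped inside
# "INSERT INTO `page_restrictions` VALUES (...),(...),...;" statements.
row_pattern = re.compile(
    r"\((\d+),'(.*?)','(.*?)',(\d+),(NULL|\d+),(NULL|'.*?'),(\d+)\)")

rows = []
with gzip.open(DUMP_FILE, 'rt', encoding='utf-8', errors='replace') as f:
    for line in f:
        if line.startswith('INSERT INTO'):
            rows.extend(row_pattern.findall(line))

print(f'Parsed {len(rows)} page restriction rows')
print(rows[:3])
```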

2. Save page_restrictions table into a Pandas DataFrame

We save the data into a Pandas DataFrame because the Pandas library provides high-performance, easy-to-use data structures and data analysis tools.
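As a sketch, the parsed rows can be loaded into a DataFrame with column names taken from the MediaWiki page_restrictions table schema (the `rows` variable comes from the parsing sketch above):

```python
import pandas as pd

# Column names follow the MediaWiki page_restrictions table schema.
columns = ['pr_page', 'pr_type', 'pr_level', 'pr_cascade',
           'pr_user', 'pr_expiry', 'pr_id']

df = pd.DataFrame(rows, columns=columns)

# Cast numeric columns parsed as strings back to integers.
df['pr_page'] = df['pr_page'].astype(int)
df['pr_cascade'] = df['pr_cascade'].astype(int)
df['pr_id'] = df['pr_id'].astype(int)

df.head()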

3. Inspect the DataFrame

As the previous table shows, for each pr_page (page ID) there can be more than one record, one for each type of protection that the page has.
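For example, a quick way to confirm this in the DataFrame (a hypothetical inspection, reusing `df` from above):

```python
# Count how many protection records each page has; values greater than 1
# show that a single pr_page can appear multiple times.
df.groupby('pr_page').size().value_counts()
```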

Accessing the Page Protection APIs

The Page Protection API can be a much simpler way to access data about page protections for a given article if you know what articles you are interested in and are interested in relatively few articles (e.g., hundreds or low thousands).

NOTE: the APIs are up-to-date while the Mediawiki dumps are always at least several days behind -- i.e., they are specific snapshots in time -- so the data you get from the Mediawiki dumps might differ from the APIs if a page's protections have changed in the intervening days.

1. Select 10 random page IDs from the data gathered from the Mediawiki dump to look up via the API

2. Request the 10 random page IDs from the API (see the sketch after this list)

3. Examine the API results and compare them to the data from the Mediawiki dump
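A sketch of the request in step 2, using the MediaWiki query API with `prop=info&inprop=protection` (the random seed and User-Agent string below are arbitrary choices):

```python
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

# Sample 10 random page IDs from the dump-based DataFrame.
random_ids = df['pr_page'].drop_duplicates().sample(10, random_state=42).tolist()

params = {
    'action': 'query',
    'prop': 'info',
    'inprop': 'protection',
    'pageids': '|'.join(str(pid) for pid in random_ids),
    'format': 'json',
    'formatversion': 2,
}

result = requests.get(API_URL, params=params,
                      headers={'User-Agent': 'page-protection-tutorial'}).json()

# Each page entry carries a 'protection' list of dicts (type, level, expiry).
for page in result['query']['pages']:
    print(page['pageid'], page.get('title'), page.get('protection', []))
```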

After inspecting the data provided by the API on the protection types of the selected pages, we observe that the number of attributes differs from those in the dump file. In the API, protection information is stored in a dict with 3 keys (type, level, expiry); there is no information about cascading protection or the future per-user edit restriction field found in the dump files. However, the missing per-user edit restriction field is not a problem because that field is not in use.

Now we are going to process the data into a format that allows us to easily compare the two sources of information field by field.

Create a DataFrame with the data gathered from the Mediawiki dump, restricted to the 10 random IDs, for easy comparison with the API data.

Create a DataFrame with the data for the ten pages from the API for easy comparison with the dump file data.

Compare DataFrames

df_api_protection: data from Mediawiki API

df_ten_dump: data from Mediawiki dump

The Pandas compare method lets us compare one DataFrame with another and shows the differences, if any exist. The resulting DataFrame shows that there are no differences between df_api_protection and df_ten_dump, since each compared cell shows a NaN value.
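For illustration, a call of this kind (assuming the two DataFrames share the same index and columns, and using `keep_shape=True` so that equal cells show up as NaN) might look like:

```python
# keep_shape=True keeps every compared cell; equal cells appear as NaN,
# so an all-NaN result means the two sources agree.
df_ten_dump.compare(df_api_protection, keep_shape=True)
```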

It is safe to say that this result is not representative of all the data, as we are only comparing 10 IDs out of 84,338 possible (about 0.012% of the total data).

Example Analyses of Page Protection Data

Here we show some examples of things we can do with the data that we gathered about the protections for various Wikipedia articles. You'll want to come up with some questions to ask of the data as well. For this, you might need to gather additional data, such as information about the articles' contributors from the API.

Process and Analyze the data

We are going to use the data about page protections gathered from the Mediawiki dump and saved earlier in the DataFrame df.

Descriptive statistics

There are 84,338 unique IDs in the DataFrame. Each page can have more than one type of protection, such as edit, move, or upload, and as we can observe, the majority of the pages in this DataFrame have two types of protection: edit and move.

The most common protection level for the move and upload types is full protection (sysop in the data), while for the edit type it is semi-protection (autoconfirmed in the data).

We are going to select a sample of articles to get additional information about them from the API.

The most representative protection types in this data are move and edit; from now on we are going to work only with those two.

Make Request to API for anonymous contributors

To make some kind of inference we need additional information about the articles, so we are going to request information about the anonymous contributors for 3,000 IDs randomly selected from the page protection data gathered from the Mediawiki dump. The intuition is that the pages with the greatest number of anonymous contributors may be exposed to more vandalism and therefore need some kind of protection.

The main reason to request only this many IDs from the API is that it is a time-consuming process, as we can see from the reports above.
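A sketch of how such a request might look, using `prop=contributors` (which reports an `anoncontributors` count) and batching the page IDs; continuation handling and error checking are omitted for brevity, and the sample size, seed, and sleep interval are arbitrary:

```python
import time
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'
HEADERS = {'User-Agent': 'page-protection-tutorial'}

# 3,000 page IDs sampled from the dump-based DataFrame.
sample_ids = df['pr_page'].drop_duplicates().sample(3000, random_state=42).tolist()

anon_counts = {}
# The API accepts up to 50 page IDs per request for most users.
for i in range(0, len(sample_ids), 50):
    batch = sample_ids[i:i + 50]
    params = {
        'action': 'query',
        'prop': 'contributors',
        'pclimit': 'max',
        'pageids': '|'.join(str(pid) for pid in batch),
        'format': 'json',
        'formatversion': 2,
    }
    result = requests.get(API_URL, params=params, headers=HEADERS).json()
    for page in result['query']['pages']:
        # 'anoncontributors' is the count of anonymous (IP) contributors.
        anon_counts[page['pageid']] = page.get('anoncontributors', 0)
    time.sleep(0.5)  # be polite to the API
```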

Merge the data gathered from the API with the data from the protection table

Since not all pages have every protection type for which new columns were created, some null values were generated. We are going to replace these null values with 0.
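One possible way to build this merged table (a sketch; `df`, `sample_ids`, and `anon_counts` are the hypothetical names used in the earlier sketches):

```python
import pandas as pd

# Pivot the protection records so each protection type becomes its own column
# (1 if the page has that protection, 0 otherwise).
protection_flags = (
    df[df['pr_page'].isin(sample_ids)]
    .assign(has_protection=1)
    .pivot_table(index='pr_page', columns='pr_type',
                 values='has_protection', aggfunc='max')
)

# Attach the anonymous contributor counts gathered from the API.
anon_df = pd.Series(anon_counts, name='anoncontributors').rename_axis('pr_page')

merged = protection_flags.join(anon_df, how='outer').fillna(0)
merged.head()
```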

Descriptive statistics

As we can see, most of the articles in the sample have very few or no anonymous contributors.

If we explore the data by type of protection, we observe that when pages have edit protection, the average number of anonymous contributors is lower than when the page does not have this protection.

And if we explore the levels of edit protection, we notice that the average number of anonymous contributors per level increases as the level of protection decreases.

On the other hand, when it comes to move protection, the average number of anonymous contributors is greater than when the page does not have this protection.

The point-biserial correlation is used to measure the relationship between a binary variable and a continuous variable. For this data sample, the correlation between the edit protection type and the number of anonymous contributors is negative and weak.
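For reference, this correlation can be computed with SciPy (a sketch assuming the merged table from the earlier sketches, with an `edit` flag column and an `anoncontributors` column):

```python
from scipy import stats

# Binary variable: whether the page has edit protection (1/0);
# continuous variable: number of anonymous contributors.
r, p_value = stats.pointbiserialr(merged['edit'], merged['anoncontributors'])
print(f'point-biserial r = {r:.3f}, p = {p_value:.3g}')
```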

Nevertheless, we are going to try to train a naive model with the data we have.

Predictive Model

To protect pages from vandalism Wikipedia allows for some articles to become protected, where only certain users can make revisions to the page. However, with over six million articles in the English Wikipedia, it is very difficult for editors to monitor all pages to suggest articles in need of edit protection.

Therefore, considering the problem of deciding whether a Wikipedia article should be edit-protected or not, we can formulate it as a binary classification task and propose a very simplistic set of features to decide which pages to protect, based on (1) the number of anonymous contributors and (2) other types of protection, such as move.

We are going to use a Naive Bayes (NB) classifier that predicts whether a page has edit protection or not.

The model is very simplistic and the set of features is very limited, but it can be a good starting point for an improved version in future analyses.

Train model
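A sketch of such a model using scikit-learn's GaussianNB, assuming the merged table from the earlier sketches with `anoncontributors`, `move`, and `edit` columns (the exact accuracy will depend on the sample drawn):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Features: number of anonymous contributors and whether the page has
# move protection. Target: whether the page has edit protection.
X = merged[['anoncontributors', 'move']]
y = merged['edit'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2%}')
```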

The NB classifier is 85.75% accurate. This means that 85.75 percent of the time the classifier is able to make the correct prediction as to whether or not a page has edit protection. These results suggest that our feature set of 2 attributes is a reasonable indicator of the edit-protection class, but we would need to add more features to obtain more accurate results.

Future Analyses