Studying Wikipedia Page Protections

This notebook provides a tutorial on how to study page protections on Wikipedia via either the MediaWiki dumps or the API. It has three stages:

Accessing the Page Protection Dumps

This is an example of how to parse the MediaWiki dumps and determine what sorts of edit protections are applied to a given Wikipedia article.

1. Extract data from the dump file.

Inspect the page_restrictions table

2. Save page_restrictions table into a Pandas DataFrame

We save the data into a pandas DataFrame because the pandas library provides high-performance, easy-to-use data structures and data analysis tools.
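Steps 1 and 2 can be sketched as follows. This is a minimal sketch: the dump filename is an assumption (adjust it to the snapshot you downloaded), the regex is based on the tuple format of the standard page_restrictions SQL dump, and parse_dump is a hypothetical helper name.

```python
import gzip
import re

import pandas as pd

# Assumed local path to the snapshot; adjust to the wiki/date you downloaded.
DUMP_PATH = "enwiki-latest-page_restrictions.sql.gz"

# Columns of the page_restrictions table in the MediaWiki schema.
COLUMNS = ["pr_page", "pr_type", "pr_level", "pr_cascade", "pr_expiry", "pr_id"]

# Each tuple in the dump's INSERT statements looks like:
#   (12,'edit','sysop',0,'infinity',7)  or  (34,'move','autoconfirmed',0,NULL,8)
ROW_RE = re.compile(r"\((\d+),'([^']*)','([^']*)',(\d+),(?:'([^']*)'|NULL),(\d+)\)")

def parse_dump(path):
    """Parse a page_restrictions SQL dump into one DataFrame row per restriction."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO"):
                for m in ROW_RE.finditer(line):
                    pr_page, pr_type, pr_level, pr_cascade, pr_expiry, pr_id = m.groups()
                    rows.append((int(pr_page), pr_type, pr_level,
                                 int(pr_cascade), pr_expiry, int(pr_id)))
    return pd.DataFrame(rows, columns=COLUMNS)
```

A streaming line-by-line parse like this avoids loading the whole (potentially large) dump into memory at once.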

3. Inspect the DataFrame

As the previous table shows, for each pr_page (page ID) there can be more than one record, one for each type of protection that the page has.

Accessing the Page Protection APIs

The Page Protection API can be a much simpler way to access data about page protections if you know which articles you are interested in and those articles are relatively few (e.g., hundreds or low thousands).

NOTE: the APIs are up to date, while the MediaWiki dumps are always at least several days behind -- i.e., they are snapshots at specific points in time -- so the data you get from the MediaWiki dumps might differ from the APIs if a page's protections have changed in the intervening days.

1. Select 10 random page IDs from the data gathered from the MediaWiki dump

2. Query the API for the 10 random IDs

3. Examine the API results and compare them to the data from the MediaWiki dump
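Step 2 above (querying the API) can be sketched as follows. The endpoint and the action=query, prop=info, inprop=protection parameters are the documented MediaWiki API route for protection information; fetch_protections and parse_protections are our own hypothetical helper names.

```python
import requests
import pandas as pd

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_protections(page_ids):
    """Query the MediaWiki API for the protection settings of a batch of page IDs."""
    params = {
        "action": "query",
        "prop": "info",
        "inprop": "protection",
        "pageids": "|".join(str(pid) for pid in page_ids),
        "format": "json",
    }
    resp = requests.get(API_URL, params=params)
    resp.raise_for_status()
    return parse_protections(resp.json())

def parse_protections(result):
    """Flatten an API response into one row per (page, protection) pair."""
    rows = []
    for page in result["query"]["pages"].values():
        for pr in page.get("protection", []):
            rows.append({"pr_page": page["pageid"],
                         "pr_type": pr["type"],
                         "pr_level": pr["level"],
                         "pr_expiry": pr["expiry"]})
    return pd.DataFrame(rows)
```

Separating the HTTP call from the response parsing makes the parsing logic easy to test on a saved response without hitting the network.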

Create a DataFrame with the data gathered from the MediaWiki dump, restricted to the 10 random IDs.

Create a DataFrame with the API data

Compare DataFrames

df_api_protection: data from the MediaWiki API

df_ten_dump: data from the MediaWiki dump

The pandas compare method compares one DataFrame with another and shows the differences, if any exist. The resulting DataFrame shows that there are no differences between df_api_protection and df_ten_dump, since every compared cell shows a NaN value (with keep_shape=True, cells that are equal appear as NaN).
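For example, with two small identical frames (toy data standing in for the real ones):

```python
import pandas as pd

df_a = pd.DataFrame({"pr_page": [1, 2], "pr_level": ["sysop", "autoconfirmed"]})
df_b = df_a.copy()

# keep_shape=True keeps every row and column; cells that are equal show as NaN.
diff = df_a.compare(df_b, keep_shape=True)
assert diff.isna().all().all()

# Without keep_shape, identical frames yield an empty DataFrame.
assert df_a.compare(df_b).empty
```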

Example Analyses of Page Protection Data

Here we show some examples of things we can do with the data that we gathered about the protections of various Wikipedia articles. You'll want to come up with some questions to ask of the data as well. For this, you might need to gather additional data.

Descriptive statistics

TODO: give an overview of basic details about page protections and any conclusions you reach based on the analyses you do below

We are going to use the data about page protections gathered from the MediaWiki dump, saved in the DataFrame df.

There are 84,338 unique page IDs in the DataFrame. Each page can have more than one type of protection, such as edit, move, or upload, and as we can observe, the majority of the pages in this DataFrame have two types of protection: edit and move.

The most common protection level for the move and upload types is full protection (sysop in the data), and for the edit type it is semi-protection (autoconfirmed in the data).
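These statistics can be computed along the following lines (toy data standing in for df; the column names follow the page_restrictions schema):

```python
import pandas as pd

# Toy stand-in for the dump DataFrame df (same column names as page_restrictions).
df = pd.DataFrame({
    "pr_page":  [1, 1, 2, 2, 3],
    "pr_type":  ["edit", "move", "edit", "move", "upload"],
    "pr_level": ["autoconfirmed", "sysop", "autoconfirmed", "sysop", "sysop"],
})

# Number of distinct protected pages.
n_pages = df["pr_page"].nunique()

# Frequency of each protection type.
type_counts = df["pr_type"].value_counts()

# Most common protection level per protection type.
top_level = df.groupby("pr_type")["pr_level"].agg(lambda s: s.mode().iloc[0])
```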

We will select a sample of 3000 articles and get additional information about them from the API.

The most common protection types in this data are move and edit; from here on we will work only with those two.

Make Request to API for anonymous contributors

Request information about the anonymous contributors for the 3000 IDs randomly selected from the protection_page data gathered from the MediaWiki dump.

The main reason to request just 3000 IDs from the API is that it is a time-consuming process, as we can see from the timing reports above.
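A sketch of such a request follows. prop=contributors with its anoncontributors count is the documented API route for this; fetch_anon_contributors and parse_anon are hypothetical helper names. Note that pageids accepts at most 50 IDs per request for regular users, which is part of why 3000 IDs take a while.

```python
import requests
import pandas as pd

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_anon_contributors(page_ids):
    """Query prop=contributors for up to 50 page IDs in one request."""
    params = {
        "action": "query",
        "prop": "contributors",
        "pclimit": "max",
        "pageids": "|".join(str(pid) for pid in page_ids),
        "format": "json",
    }
    resp = requests.get(API_URL, params=params)
    resp.raise_for_status()
    return parse_anon(resp.json())

def parse_anon(result):
    """Extract the anonymous-contributor count reported for each page."""
    rows = [{"pr_page": p["pageid"],
             "anon_contributors": p.get("anoncontributors", 0)}
            for p in result["query"]["pages"].values()]
    return pd.DataFrame(rows)
```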

Merge the data gathered from the API with the data from the protection table

Since not all pages have every protection type for which new columns were created, some null values were generated. We are going to replace those null values with 0.
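The merge-and-fill step can be sketched as follows (toy frames; the real column names may differ from these assumed ones):

```python
import pandas as pd

# Toy stand-in: protection types pivoted into columns, one row per page.
df_dump = pd.DataFrame({"pr_page": [1, 2, 3],
                        "edit": ["autoconfirmed", None, "autoconfirmed"],
                        "move": ["sysop", "sysop", None]})
df_api = pd.DataFrame({"pr_page": [1, 2, 3],
                       "anon_contributors": [14, 3, 27]})

# Left-join the API data onto the dump data by page ID, then fill the nulls
# left by pages that lack one of the protection types.
merged = df_dump.merge(df_api, on="pr_page", how="left").fillna(0)
```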

Predictive Model

TODO: Train and evaluate a predictive model on the data you gathered for the above descriptive statistics. Describe what you learned from the model or how it would be useful.
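One possible starting point for this TODO is sketched below. It is purely illustrative: the data is synthetic, the relationship between anonymous-contributor counts and protection level is invented for the example (not an empirical finding), and scikit-learn is one of several reasonable modeling libraries.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged data: predict whether a page's edit
# protection is full (sysop) or semi (autoconfirmed) from its
# anonymous-contributor count. The rule below is assumed for illustration.
rng = np.random.default_rng(0)
anon = rng.integers(0, 100, size=500)
is_sysop = (anon < 20).astype(int)  # 1 = sysop, 0 = autoconfirmed (toy rule)

X = anon.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, is_sysop, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

With the real merged DataFrame, the features would instead come from its columns (e.g., anon_contributors) and the label from the protection level of a given type.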

Future Analyses

TODO: Describe any additional analyses you can think of that would be interesting (and why) -- even if you are not sure how to do them.