Lab 1 - Revision Histories

Modified: Sep 28, Jennings Anderson

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the first of five lab notebooks exploring introductory data extraction and analysis with Wikipedia data. This lab uses a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses here with more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Import modules and set up the environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.
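One possible set of imports for the steps below, assuming pymysql as the MySQL client and matplotlib for plotting (the inline magic presumes a Jupyter environment like PAWS):

```python
# Core libraries for database access, analysis, and plotting.
import pymysql
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Render plots inside the notebook.
%matplotlib inline
```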

Set up a connection to the English Wikipedia database

Retrieve the host name, username, and (encrypted) password associated with your account so you can login to the database.

Connect to the system using the credentials from above, then select the database for the English Wikipedia.
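A sketch of the connection step, assuming the standard Wikimedia setup where replica credentials live in ~/replica.my.cnf; the host and database names here are assumptions based on the replica naming conventions:

```python
import configparser
import os

# Read the database credentials PAWS provides.
config = configparser.ConfigParser()
config.read(os.path.expanduser('~/replica.my.cnf'))

# Connect to the English Wikipedia replica; the "_p" suffix marks
# the public (redacted) copy of the database.
conn = pymysql.connect(
    host='enwiki.labsdb',            # assumed replica host name
    user=config['client']['user'],
    password=config['client']['password'],
    database='enwiki_p',
    charset='utf8'
)
```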

What tables are available for use?
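A quick way to check, using pandas to run a raw query over the connection:

```python
# List every table exposed on the replica.
pd.read_sql_query('SHOW TABLES', conn)
```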

Important note

Article page titles need to be separated by underscores "_" rather than spaces. It's easy to replace spaces with underscores using the replace method called on the string.
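For example:

```python
# Replace spaces with underscores; 'Bill Clinton' is only an
# illustrative title here.
page_title = 'Bill Clinton'.replace(' ', '_')
print(page_title)  # Bill_Clinton
```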

Get revisions for a single article

Write a function get_page_revisions that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields need to be cleaned up by converting them to other data types to help with analysis in later steps.
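A minimal sketch of such a function, assuming the pre-2019 MediaWiki schema in which user names still live in revision.rev_user_text and the replicas return bytestrings that need decoding:

```python
def get_page_revisions(page_title, namespace, conn):
    """Return a DataFrame with every revision of one article."""
    query = """
        SELECT rev_id, rev_timestamp, rev_user, rev_user_text, rev_len
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE page_title = %(title)s AND page_namespace = %(ns)s
        ORDER BY rev_timestamp
    """
    rev_df = pd.read_sql_query(
        query, conn,
        params={'title': page_title.replace(' ', '_'), 'ns': namespace})

    # The replica returns bytestrings; decode and convert types so the
    # later steps can do date arithmetic and numeric analysis.
    rev_df['rev_user_text'] = rev_df['rev_user_text'].str.decode('utf8')
    rev_df['rev_timestamp'] = pd.to_datetime(
        rev_df['rev_timestamp'].str.decode('utf8'), format='%Y%m%d%H%M%S')
    rev_df['rev_date'] = rev_df['rev_timestamp'].dt.date
    return rev_df
```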

Use the function to get the revisions for a single article and use the .head() method to inspect the first 5 rows in the DataFrame.
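For example (the article is an arbitrary illustrative choice):

```python
rev_df = get_page_revisions('Mitt Romney', 0, conn)
rev_df.head()
```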

Count the total number of revisions in the article.

Count the total number of unique users contributing to the article.
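Both counts are one-liners over the DataFrame from above:

```python
print('Total revisions:', len(rev_df))
print('Unique contributors:', rev_df['rev_user_text'].nunique())
```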

When was the first revision made to the article?

When was the most recent revision made to the article?
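Since rev_timestamp is already a datetime, the first and most recent revisions are just the min and max:

```python
print('First revision:', rev_df['rev_timestamp'].min())
print('Most recent revision:', rev_df['rev_timestamp'].max())
```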

How has the size of the article changed over time?

Who made the largest change to the article?
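A sketch of both questions; the "largest change" is read here as the largest absolute net change in article size between consecutive revisions:

```python
# Article size over time.
ax = rev_df.plot(x='rev_timestamp', y='rev_len', legend=False)
ax.set_ylabel('Article size (bytes)')

# The revision with the largest absolute change from its predecessor.
largest = rev_df.loc[rev_df['rev_len'].diff().abs().idxmax()]
largest[['rev_user_text', 'rev_timestamp', 'rev_len']]
```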

TJA Response

This appears to be an error or some sort of vandalism in which the content was duplicated over and over.

Group revisions by date

Multiple revisions can be made in a single day. We want to count the number of revisions per day and track how the article's length has changed over time.

What dates had the most revisions in a single day? A simple way of doing this is to count the number of times each "rev_date" occurs in the data with the value_counts method. Let's look at the top 10, five of which fall in October 2012, in the weeks before Election Day.
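For example:

```python
# The ten dates with the most revisions.
rev_df['rev_date'].value_counts().head(10)
```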

We can use the groupby functionality to create "baby" DataFrames, each containing all the rows that share a particular value. For example, grouping by date gives one baby DataFrame per day, holding all the revisions made on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.

Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group method and look at the first 5 entries.
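A sketch of both steps; the date passed to get_group is only a placeholder key:

```python
from datetime import date

# Group revisions by the day they were made.
rev_gb_date = rev_df.groupby('rev_date')

# Inspect the "baby" DataFrame for one (placeholder) date.
rev_gb_date.get_group(date(2012, 10, 3)).head()
```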

Compute unique revisions per day

We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.

Reindex the axis to make it continuous over the date range.

What was the most active day of editing?
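One way to do the aggregation, reindexing, and most-active-day lookup together, assuming the rev_gb_date grouping from above:

```python
# Unique revisions per day.
revs_per_day = rev_gb_date['rev_id'].nunique()

# Reindex onto a continuous daily range; days with no edits become 0.
revs_per_day.index = pd.to_datetime(revs_per_day.index)
date_range = pd.date_range(revs_per_day.index.min(),
                           revs_per_day.index.max(), freq='D')
revs_per_day = revs_per_day.reindex(date_range, fill_value=0)

# The single most active day of editing and its revision count.
revs_per_day.idxmax(), revs_per_day.max()
```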

Plot the number of revisions by day.

This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.

Plot the cumulative sum of the number of revisions over time using the cumsum method.
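A sketch of all three plots:

```python
# Raw daily revision counts.
ax = revs_per_day.plot()
ax.set_ylabel('Revisions per day')
plt.show()

# Smooth the noise with a 60-day rolling mean.
ax = revs_per_day.rolling(60).mean().plot()
ax.set_ylabel('Revisions per day (60-day rolling mean)')
plt.show()

# Cumulative number of revisions over time.
ax = revs_per_day.cumsum().plot()
ax.set_ylabel('Cumulative revisions')
```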

Compute article length by day

Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.

Plot the reindexed series, making the y-axis be on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.
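A sketch, reusing the date_range built earlier; days without edits simply show as gaps:

```python
# Median article size per day, on a continuous daily index.
median_len = rev_gb_date['rev_len'].median()
median_len.index = pd.to_datetime(median_len.index)
median_len = median_len.reindex(date_range)

# Log-scaled y-axis to expose the early, much smaller article.
ax = median_len.plot(logy=True)
ax.set_ylabel('Article size (bytes, log scale)')
```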

Compute diffs on time and article size

The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff method on the "rev_len" series, we store the resulting differences as "rev_len_diff".

Compare the first five elements of the two series to verify it's worked as intended.
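For example:

```python
# Net change in size relative to the previous revision.
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# The first diff is NaN since there is no previous revision.
rev_df[['rev_len', 'rev_len_diff']].head()
```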

Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare) the "rev_len" value is null, which makes the diff null as well. Make the first revision's "rev_len_diff" equal to the size of the article and then forward-fill the other NaN values with ffill.
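A sketch of that cleanup:

```python
# The first revision has no predecessor, so treat its diff as the
# full size of the initial article.
rev_df.loc[rev_df.index[0], 'rev_len_diff'] = rev_df['rev_len'].iloc[0]

# Forward-fill the NaNs left behind by deleted revisions.
rev_df['rev_len_diff'] = rev_df['rev_len_diff'].ffill()
```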

Similarly, the "rev_timestamp" series only records the time when the revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff method again on the "rev_timestamp" series, we store the resulting differences as "rev_timestamp_diff".

The resulting values are Timedeltas, which we convert back to floating-point numbers of days by dividing by np.timedelta64(1, 'D').

Compare the first five elements of the two series to verify it's worked as intended.
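For example:

```python
# Time elapsed since the previous revision, as Timedeltas.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

# Convert the Timedeltas to floating-point days.
rev_df['rev_timestamp_diff'] = (rev_df['rev_timestamp_diff']
                                / np.timedelta64(1, 'D'))

rev_df[['rev_timestamp', 'rev_timestamp_diff']].head()
```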

Combine all the computations into one daily DataFrame

Because we added new series to the "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregate functions we did before into one step by defining an "agg_dict" that applies an aggregation function over the values in a specific column. This will return a DataFrame with only a subset of columns with the aggregated values we specified.
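A sketch of the combined step; the particular statistics in agg_dict are one reasonable reading of the aggregations above:

```python
# Re-group so the grouping sees the new diff columns.
rev_gb_date = rev_df.groupby('rev_date')

agg_dict = {'rev_id': 'nunique',           # unique revisions per day
            'rev_len': 'median',           # median article size
            'rev_len_diff': 'median',      # median size of a change
            'rev_timestamp_diff': 'mean'}  # mean time since last edit

daily_df = rev_gb_date.agg(agg_dict)

# Put the summary on the same continuous daily index as before.
daily_df.index = pd.to_datetime(daily_df.index)
daily_df = daily_df.reindex(date_range)
daily_df.head()
```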

Plot the changes in the average time since the last revision by day.

That's pretty noisy. We can apply the rolling method over a 60-day window and get the mean to smooth out some of the noise.

Similarly, what was the median size of a change to the article over time?
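A sketch of the three plots, assuming the daily_df built above:

```python
# Mean time since the last revision, by day.
ax = daily_df['rev_timestamp_diff'].plot()
ax.set_ylabel('Mean days since last revision')
plt.show()

# The same series smoothed with a 60-day rolling mean.
ax = daily_df['rev_timestamp_diff'].rolling(60).mean().plot()
ax.set_ylabel('Mean days since last revision (60-day mean)')
plt.show()

# Median size of a change over time, also smoothed.
ax = daily_df['rev_len_diff'].rolling(60).mean().plot()
ax.set_ylabel('Median change size (bytes, 60-day mean)')
```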