Lab 4 - Pageviews

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the third of five lab notebooks that will explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab will extend the methods in the previous two labs about analyzing a single article's revision histories and analyzing the hyperlink networks around a single Wikipedia page. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

Import modules and setup environment

Load all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

Define an article to examine pageview dynamics.

Get pageview data for a single article

Details about the Wikimedia REST API for pageviews are available here. Unfortunately, this data endpoint only provides information going back to July 1, 2015.
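As a minimal sketch of how a request to this endpoint is put together, the helper below builds the URL for one article. The function name `build_pageviews_url`, its default parameter values, and the example article "Albert Einstein" are assumptions made for illustration, not part of the original lab code.

```python
from urllib.parse import quote

# Base path of the Wikimedia REST API per-article pageviews endpoint.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(article, start, end, project="en.wikipedia.org",
                        access="all-access", agent="user", granularity="daily"):
    """Return the request URL for one article's pageviews.

    `start` and `end` are date strings in YYYYMMDD format.
    (Hypothetical helper; the defaults are illustrative choices.)
    """
    # Titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE_URL, project, access, agent, title,
                     granularity, start, end])


# Example article chosen only for illustration.
url = build_pageviews_url("Albert Einstein", "20150701", "20150731")
print(url)
```

Setting `agent="user"` asks for traffic classified as human readers; `"all-agents"` would also include spiders and automated traffic.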

This is what the API returns as an example.

Write a function that requests pageviews from January 1, 2015 until yesterday. In practice, the earliest data returned will start somewhere between May and August 2015, depending on the article.
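One possible shape for such a function is sketched here. It assumes the API response is a JSON object whose `items` list holds one record per day with `timestamp` and `views` fields. The names `parse_pageviews` and `get_pageviews`, and the User-Agent string, are placeholders rather than the lab's original code.

```python
import json
from datetime import date, timedelta
from urllib.parse import quote
from urllib.request import Request, urlopen

import pandas as pd


def parse_pageviews(payload):
    """Convert the API's JSON payload into a Series of daily views indexed by date."""
    items = payload.get("items", [])
    # Timestamps look like "2015070100": YYYYMMDD followed by a two-digit hour.
    dates = pd.to_datetime([item["timestamp"][:8] for item in items],
                           format="%Y%m%d")
    views = [item["views"] for item in items]
    return pd.Series(views, index=dates, name="views", dtype="int64").sort_index()


def get_pageviews(article, start="20150101", project="en.wikipedia.org"):
    """Fetch daily pageviews for `article` from `start` until yesterday.

    Hypothetical helper: it makes a network request when called.
    """
    end = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
    title = quote(article.replace(" ", "_"), safe="")
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           f"{project}/all-access/user/{title}/daily/{start}/{end}")
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(url, headers={"User-Agent": "pageviews-lab-example"})
    with urlopen(request) as response:
        return parse_pageviews(json.load(response))
```

Keeping the parsing step separate from the network call makes it easy to check the parser against a saved response without hitting the API again.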

Get the data for your page.

Interpret page view results

What does the pageview activity look like? Are there any bursts of attention? What might these bursts be linked to?

Use a logarithmic scaling for the y-axis to see more of the detail in the lower-traffic days.
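A small sketch of the log-scaled plot, using synthetic stand-in data (a flat baseline with one artificial burst) instead of real pageviews. The variable names and the numbers are invented for illustration only.

```python
import pandas as pd

# Synthetic stand-in data: 60 days at 100 views with one artificial burst.
dates = pd.date_range("2016-01-01", periods=60, freq="D")
views = pd.Series(100, index=dates, name="views")
views.iloc[30] = 50_000

# logy=True puts the y-axis on a logarithmic scale, so the quiet days
# are not flattened against the x-axis by the burst.
ax = views.plot(logy=True, figsize=(10, 4))
ax.set_xlabel("Date")
ax.set_ylabel("Pageviews (log scale)")
```

On a linear axis the baseline of 100 views would be nearly invisible next to a 50,000-view spike; on the log axis both remain readable.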

What are the dates for the biggest pageview outliers? Here we define an "outlier" to be more than 3 standard deviations above the average number of pageviews over the time window.
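The outlier rule above can be sketched as follows, assuming the pageviews are held in a pandas Series indexed by date. The function name `find_outliers` and the synthetic data are illustrative assumptions.

```python
import pandas as pd


def find_outliers(views, n_std=3):
    """Return the days whose views exceed mean + n_std * standard deviation,
    sorted from largest to smallest."""
    threshold = views.mean() + n_std * views.std()
    return views[views > threshold].sort_values(ascending=False)


# Synthetic stand-in data: 30 quiet days plus one artificial burst.
dates = pd.date_range("2016-01-01", periods=30, freq="D")
views = pd.Series([100.0] * 30, index=dates)
views.iloc[10] = 10_000

outliers = find_outliers(views)
print(outliers)
```

Note that the burst itself inflates both the mean and the standard deviation, so on very spiky pages this rule can miss smaller bursts.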

How much of the total pageview activity occurred on these days compared to the rest of the pageviews?
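One way to answer this is to sum the views on outlier days and divide by the total, as in the sketch below. It reuses the same 3-standard-deviation rule; `outlier_share` is a hypothetical helper name and the data are synthetic.

```python
import pandas as pd


def outlier_share(views, n_std=3):
    """Fraction of all pageviews that fell on outlier days."""
    threshold = views.mean() + n_std * views.std()
    is_outlier = views > threshold
    return views[is_outlier].sum() / views.sum()


# Synthetic stand-in data: 30 quiet days plus one artificial burst.
dates = pd.date_range("2016-01-01", periods=30, freq="D")
views = pd.Series([100.0] * 30, index=dates)
views.iloc[10] = 10_000

share = outlier_share(views)
print(f"{share:.1%} of pageviews occurred on outlier days")
```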

How does pageview activity change over the course of a week?
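A minimal sketch for the weekly pattern: group the daily Series by day of the week and average. The helper name `weekday_profile` and the synthetic weekday/weekend values are assumptions for illustration.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_profile(views):
    """Average pageviews for each day of the week, Monday through Sunday."""
    by_day = views.groupby(views.index.day_name()).mean()
    # groupby sorts the day names alphabetically; restore calendar order.
    return by_day.reindex(WEEKDAYS)


# Synthetic stand-in data: two weeks starting on Monday, 2016-01-04,
# with 100 views on weekdays and 50 on weekends.
dates = pd.date_range("2016-01-04", periods=14, freq="D")
views = pd.Series([100, 100, 100, 100, 100, 50, 50] * 2, index=dates)

profile = weekday_profile(views)
print(profile)
```

Using the median instead of the mean would make the profile less sensitive to the burst days identified earlier.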

Compare pageviews to another page

Let's write a function that takes a list of article names and returns a DataFrame indexed by date, with one column per article and the daily number of pageviews as values.
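A sketch of such a function is below. To keep it self-contained, it takes the single-article fetcher as an argument; both `get_pageviews_frame` and the `fetch` parameter are hypothetical names, and `fetch` is assumed to return a date-indexed Series for one article.

```python
import pandas as pd


def get_pageviews_frame(articles, fetch):
    """Build a DataFrame indexed by date with one column per article.

    `fetch` is any callable that maps an article name to a Series of
    daily pageviews indexed by date (an assumption of this sketch).
    """
    columns = {}
    for article in articles:
        try:
            columns[article] = fetch(article)
        except Exception as error:
            # A page with no pageview data should not stop the whole run.
            print(f"Skipping {article!r}: {error}")
    # pandas aligns the Series on the union of their dates and fills
    # days missing for an article with NaN.
    return pd.DataFrame(columns).sort_index()
```

Skipping failures instead of raising is a design choice that suits long lists of articles, where a single missing page would otherwise discard everything fetched so far.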

Enter two related pages whose pageview behavior you want to compare.

Get both of their data.

Plot the data.

What is the correlation coefficient between these two articles' behavior?
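As a sketch, the Pearson correlation between two columns can be computed with pandas. The optional log transform is my own addition, offered because pageview counts are heavily skewed; the function name and the toy data are illustrative.

```python
import numpy as np
import pandas as pd


def pageview_correlation(df, article_a, article_b, log=False):
    """Pearson correlation between two articles' daily pageviews.

    With log=True the counts are transformed with log(1 + x) first,
    which reduces the influence of a few extreme burst days.
    """
    pair = df[[article_a, article_b]].dropna()
    if log:
        pair = np.log1p(pair)
    return pair[article_a].corr(pair[article_b])


# Toy stand-in data, not real pageviews.
dates = pd.date_range("2016-01-01", periods=4, freq="D")
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 4, 6, 8]}, index=dates)

r = pageview_correlation(df, "A", "B")
print(r)
```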

How did the ratio between the two articles' pageviews change over time?
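One way to look at this is to divide one column by the other after smoothing both with a rolling mean, as sketched here. The 7-day window and the name `pageview_ratio` are assumptions chosen for illustration.

```python
import pandas as pd


def pageview_ratio(df, numerator, denominator, window=7):
    """Ratio of two articles' pageviews after a rolling-mean smoothing."""
    smoothed = df[[numerator, denominator]].rolling(window, min_periods=1).mean()
    # Days where the denominator is zero produce inf in the result.
    return smoothed[numerator] / smoothed[denominator]


# Toy stand-in data, not real pageviews.
dates = pd.date_range("2016-01-01", periods=10, freq="D")
df = pd.DataFrame({"A": [10] * 10, "B": [5] * 10}, index=dates)

ratio = pageview_ratio(df, "A", "B")
print(ratio.head())
```

A 7-day window also averages out the weekly cycle, so the remaining movement in the ratio reflects longer-term shifts in relative attention.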

Use the functions for resolving redirects and getting page outlinks from prior labs.

Get the outlinks.

Get the data.

This stage may take several minutes.

What are the most-viewed articles in the hyperlink network?
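Assuming the pageviews for the linked articles are in a DataFrame with one column per article, the most-viewed articles can be found by summing each column. The helper name `most_viewed` and the toy data are illustrative.

```python
import pandas as pd


def most_viewed(df, n=10):
    """Total pageviews per article, largest first, limited to the top n."""
    return df.sum().sort_values(ascending=False).head(n)


# Toy stand-in data, not real pageviews.
dates = pd.date_range("2016-01-01", periods=2, freq="D")
df = pd.DataFrame({"A": [1, 1], "B": [5, 5], "C": [3, 3]}, index=dates)

top = most_viewed(df)
print(top)
```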

Most and least correlated articles

Which articles are most correlated with each other?

List out the 10 most correlated articles.
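A sketch of one way to rank article pairs: compute the full correlation matrix, keep only its upper triangle so each pair appears once, and sort. The function name `correlation_pairs` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlation_pairs(df):
    """Return every pair of articles with its Pearson correlation,
    sorted from most to least correlated."""
    corr = df.corr()
    # Keep the upper triangle only; k=1 excludes the diagonal of
    # self-correlations, so each pair is listed exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    pairs.index.names = ["article_a", "article_b"]
    return pairs.sort_values(ascending=False)


# Toy stand-in data, not real pageviews.
dates = pd.date_range("2016-01-01", periods=4, freq="D")
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [1, 2, 3, 5],
                   "C": [4, 3, 2, 1]}, index=dates)

pairs = correlation_pairs(df)
print(pairs.head(10))
```

Because the result is sorted, `pairs.head(10)` gives the most correlated pairs and `pairs.tail(10)` the least correlated ones.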

Inspect this correlation from the raw data.

Look at the 10 least-correlated articles.

Plot the correlation between the two most anti-correlated articles. These show some kinda wacky properties that are interesting to explore or think more about.