Studying the Wikipedia Clickstream

This notebook is a tutorial on studying reader behavior through the monthly Wikipedia Clickstream dumps. It has three stages:

Accessing Wikipedia Clickstream Dumps

This stage shows how to parse the Wikipedia clickstream dump for a wiki and gather the data for a given article.

Task 1

How many unique "destination" articles are in the dump? Output: 4310848

I used pandas to read and manipulate the file. Since the dump is tab-separated, I set the delimiter to \t and supplied column names so the attributes are clear. While running the code, I found rows with stray commas in article titles, which I skipped as malformed. The dataset is huge, so the kernel kept stopping; setting explicit dtypes made reading somewhat faster, but loading everything at once still failed beyond about 10,000,000 rows. I therefore read the file in chunks, kept only the destination column, and used nunique() (accumulated across chunks) to count the distinct destination articles.
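The chunked approach described above can be sketched as follows. The clickstream dump has four tab-separated columns (source, destination, link type, count); the path passed in is a placeholder for whichever monthly dump file you downloaded.

```python
import pandas as pd

# Column names for the clickstream TSV: source, destination, link type, count.
COLUMNS = ["prev", "curr", "type", "n"]

def count_unique_destinations(path, chunksize=1_000_000):
    """Stream the dump in chunks and count distinct destination articles."""
    seen = set()
    for chunk in pd.read_csv(
        path,
        sep="\t",
        names=COLUMNS,
        usecols=["curr"],          # only the destination column is needed
        dtype={"curr": "string"},  # explicit dtype avoids slow type inference
        quoting=3,                 # csv.QUOTE_NONE; titles may contain quotes
        on_bad_lines="skip",       # skip malformed rows (stray separators)
        chunksize=chunksize,
    ):
        seen.update(chunk["curr"].dropna().unique())
    return len(seen)
```

Accumulating a set of titles across chunks is necessary because calling nunique() on each chunk separately would double-count articles that appear in more than one chunk.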

What proportion of pageviews in the dataset have search engines (other-search) as the source? 11.85%
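The proportion can be computed with the same chunked pattern, summing the count column overall and for rows whose source is other-search; a minimal sketch:

```python
import pandas as pd

COLUMNS = ["prev", "curr", "type", "n"]

def search_engine_share(path, chunksize=1_000_000):
    """Fraction of total clickstream counts whose source is 'other-search'."""
    search, total = 0, 0
    for chunk in pd.read_csv(
        path, sep="\t", names=COLUMNS, usecols=["prev", "n"],
        dtype={"prev": "string", "n": "int64"},
        quoting=3, on_bad_lines="skip", chunksize=chunksize,
    ):
        total += chunk["n"].sum()
        search += chunk.loc[chunk["prev"] == "other-search", "n"].sum()
    return search / total
```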

What was the most common source-destination pair? Highest: (other-empty, Main_Page) with 173895698 occurrences
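One chunk-friendly way to find the top pair is to aggregate counts per (source, destination) pair within each chunk, accumulate them in a Counter, and take the maximum at the end. This is a sketch of that approach, not necessarily how the result above was produced:

```python
from collections import Counter

import pandas as pd

COLUMNS = ["prev", "curr", "type", "n"]

def top_pair(path, chunksize=1_000_000):
    """Most frequent (source, destination) pair by total count."""
    counts = Counter()
    for chunk in pd.read_csv(
        path, sep="\t", names=COLUMNS, usecols=["prev", "curr", "n"],
        dtype={"prev": "string", "curr": "string", "n": "int64"},
        quoting=3, on_bad_lines="skip", chunksize=chunksize,
    ):
        # Aggregate within the chunk first, then fold into the running totals
        # so pairs split across chunk boundaries are still summed correctly.
        grouped = chunk.groupby(["prev", "curr"])["n"].sum()
        for pair, n in grouped.items():
            counts[pair] += n
    return counts.most_common(1)[0]
```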

Task 2

Finding Your Article in Other Languages

The langlinks API is a simple (automatic) way to get all the other-language versions of a particular article. For example, you can query it for List of dinosaurs of the Morrison Formation and see that the article exists not just in English but also in French.
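A minimal stdlib sketch of querying the langlinks API via the MediaWiki Action API (the User-Agent string here is a placeholder; Wikimedia asks API clients to identify themselves):

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def langlinks_url(title):
    """Build the MediaWiki API URL for an article's language links."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",   # all language links in one request
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def get_langlinks(title):
    """Return {lang_code: title} for the article's other-language versions."""
    req = urllib.request.Request(
        langlinks_url(title),
        headers={"User-Agent": "clickstream-tutorial/0.1"},  # placeholder UA
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    links = {}
    for page in data["query"]["pages"].values():
        for ll in page.get("langlinks", []):
            links[ll["lang"]] = ll["*"]
    return links
```

For example, get_langlinks("List of dinosaurs of the Morrison Formation") should include an "fr" entry pointing at the French title of the list.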

Task 3

Compare Reader Behavior across Languages

Here you want to explore what's similar and what's different between how readers interact with your article depending on the language they are reading it in. Provide hypotheses for any large differences you see. You don't have to do any formal statistical tests unless you want to -- it can be just observations you have about the data. Feel free to focus on the article you chose or expand to other articles.

Remember, you can use the langlinks API to see whether an article in one language is the same as one in another language. For instance, in the French clickstream dataset we see that readers went from the article Formation de Morrison to Liste des dinosaures de la formation de Morrison on French Wikipedia 15 times in January. From the English clickstream dataset, we can see that readers went from the article Morrison Formation to List of dinosaurs of the Morrison Formation on English Wikipedia 401 times in January. With the langlinks API, we can verify that this reading path is equivalent in French and English (same source and destination article, just different languages).
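One way to line the two languages up is to map the French titles into English title space using the langlinks mapping, then join the clickstream rows on the (source, destination) pair. This sketch uses a hand-written mapping and the January counts quoted above; in practice the mapping would come from the langlinks API.

```python
import pandas as pd

# Hypothetical clickstream rows for the same reading path in two languages,
# using the January counts quoted above (en: 401, fr: 15).
en = pd.DataFrame(
    [("Morrison_Formation",
      "List_of_dinosaurs_of_the_Morrison_Formation", 401)],
    columns=["prev", "curr", "n"])
fr = pd.DataFrame(
    [("Formation_de_Morrison",
      "Liste_des_dinosaures_de_la_formation_de_Morrison", 15)],
    columns=["prev", "curr", "n"])

# French-to-English title mapping, as the langlinks API would return it.
fr_to_en = {
    "Formation_de_Morrison": "Morrison_Formation",
    "Liste_des_dinosaures_de_la_formation_de_Morrison":
        "List_of_dinosaurs_of_the_Morrison_Formation",
}

# Translate the French rows into English title space, then join on the pair
# so each row holds the same reading path's count in both languages.
fr_norm = fr.assign(prev=fr["prev"].map(fr_to_en),
                    curr=fr["curr"].map(fr_to_en))
merged = en.merge(fr_norm, on=["prev", "curr"], suffixes=("_en", "_fr"))
```

With the counts aligned like this, you can normalise by each wiki's total traffic before comparing, since raw counts mostly reflect audience size rather than reader behavior.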

Future Analyses

TODO: Describe what additional patterns you might want to explore in the data (and why). You don't have to know how to do the analyses.