Studying the Wikipedia Clickstream

This notebook provides a tutorial on studying reader behavior with the monthly Wikipedia Clickstream dumps. It has three stages:

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse the Wikipedia clickstream dump for a wiki and gather the data for a given article.
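A minimal sketch of that step, assuming the dump file has already been downloaded from https://dumps.wikimedia.org/other/clickstream/ (the exact filename used below is an assumption; the four-column `prev`/`curr`/`type`/`n` TSV layout is the documented clickstream format):

```python
import gzip

import pandas as pd

# Each row of a clickstream dump has four tab-separated fields:
# prev (source), curr (destination), type, and n (the click count).
COLUMNS = ["prev", "curr", "type", "n"]


def load_article_rows(path, article):
    """Stream the gzipped TSV dump and keep only rows whose
    destination page is `article`."""
    rows = []
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 4 and parts[1] == article:
                rows.append(parts)
    df = pd.DataFrame(rows, columns=COLUMNS)
    df["n"] = df["n"].astype(int)
    return df


# Hypothetical usage (the month in the filename is an assumption):
# cat_food = load_article_rows("clickstream-enwiki-2023-01.tsv.gz", "Cat_food")
```

Streaming line by line avoids loading the full dump (hundreds of MB compressed) into memory just to extract one article.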

For the English clickstream dataset, the article "Cat_food" has 58 rows. If we group the data frame by source and destination, the number of rows stays the same after grouping, which means there are no rows sharing the same source and destination.
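The grouping check can be sketched like this, using a toy frame in place of the real Cat_food rows (the values are illustrative, not taken from an actual dump):

```python
import pandas as pd

# Toy stand-in for the rows of one article.
df = pd.DataFrame(
    {
        "prev": ["Cat", "other-search", "Dog"],
        "curr": ["Cat_food", "Cat_food", "Cat_food"],
        "type": ["link", "external", "link"],
        "n": [50, 100, 10],
    }
)

# If every (prev, curr) pair is unique, grouping cannot shrink the frame,
# so equal row counts before and after grouping imply no duplicates.
grouped = df.groupby(["prev", "curr"], as_index=False)["n"].sum()
no_duplicates = len(grouped) == len(df)
print(no_duplicates)  # True here: every source/destination pair is unique
```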

I have tested a few other articles that are not shown here, and the number of rows before and after grouping is always the same, so there is no duplicate data we need to pay extra attention to for this exercise.

However, for further analysis, a data-cleaning step is advisable whenever we cannot be sure the data is clean.
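One possible cleaning step is sketched below; what counts as "clean" here is my assumption (drop rows with missing fields, coerce counts to integers, and collapse any duplicate rows by summing their counts), not a procedure from the clickstream documentation:

```python
import pandas as pd


def clean_clickstream(df):
    """Minimal cleaning sketch for a clickstream data frame with
    columns prev, curr, type, n."""
    # Drop rows with missing source, destination, or count.
    out = df.dropna(subset=["prev", "curr", "n"]).copy()
    # Coerce counts to numbers, discarding anything unparseable.
    out["n"] = pd.to_numeric(out["n"], errors="coerce")
    out = out.dropna(subset=["n"])
    out["n"] = out["n"].astype(int)
    # Collapse duplicate (prev, curr, type) rows by summing counts.
    return out.groupby(["prev", "curr", "type"], as_index=False)["n"].sum()
```

Summing the counts of duplicate rows (rather than keeping only the first) preserves the total number of recorded clicks.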