Studying the Wikipedia Clickstream

This notebook is a tutorial on studying reader behavior using the monthly Wikipedia Clickstream dumps. It has three stages:

Accessing Wikipedia Clickstream Dumps

This section shows how to parse the Wikipedia Clickstream dump for a wiki and gather the rows for a given article.
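As a minimal sketch of this step, the code below assumes the dump is the gzipped, tab-separated file published under dumps.wikimedia.org/other/clickstream/, with four columns per row: source (`prev`), destination (`curr`), link type, and count (`n`). The function names `iter_clickstream`, `rows_for_article`, and `read_dump` are illustrative and not taken from the original notebook, and the sample rows are made up.

```python
import csv
import gzip


def iter_clickstream(lines):
    """Yield (prev, curr, link_type, n) tuples from tab-separated lines."""
    # QUOTE_NONE: article titles may contain quote characters that
    # should be kept literally rather than interpreted by the csv module.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows
        prev, curr, link_type, n = row
        try:
            yield prev, curr, link_type, int(n)
        except ValueError:
            continue  # skip rows whose count is not an integer


def rows_for_article(rows, title):
    """Keep rows where the article is either the source or the destination."""
    return [r for r in rows if r[0] == title or r[1] == title]


def read_dump(path):
    """Stream rows from a local .tsv.gz dump (path is supplied by the caller)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from iter_clickstream(f)


# Made-up rows in the assumed dump format, for illustration only.
sample_lines = [
    "A\tB\tlink\t10\n",
    "other-search\tB\texternal\t5\n",
    "B\tC\tlink\t3\n",
    "C\tD\tlink\t7\n",
]
sample_rows = list(iter_clickstream(sample_lines))
print(rows_for_article(sample_rows, "B"))
```

Streaming the file row by row keeps memory use low, which matters because a full monthly dump for a large wiki has many millions of rows.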

Reading the Wikipedia Clickstream data

Number of Unique Destinations?

Proportion of Pageviews from "other-search"?

Most Common Source-Destination Pair?

Most common: (other-search, Richard_Ramirez), with a count of 4258415
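The three questions above can be answered in a single pass over the parsed rows. The sketch below assumes each row is a `(prev, curr, link_type, n)` tuple and that "proportion" means the share of total counts whose source is `other-search`; both are assumptions, and the `summarize` helper and the toy rows are hypothetical.

```python
from collections import Counter


def summarize(rows):
    """Return unique destinations, other-search share, and the top pair."""
    rows = list(rows)  # allow several passes even if given a generator
    destinations = {curr for _, curr, _, _ in rows}
    total = sum(n for _, _, _, n in rows)
    from_search = sum(n for prev, _, _, n in rows if prev == "other-search")

    pair_counts = Counter()
    for prev, curr, _, n in rows:
        pair_counts[(prev, curr)] += n
    top_pair, top_n = pair_counts.most_common(1)[0] if pair_counts else (None, 0)

    return {
        "unique_destinations": len(destinations),
        "other_search_share": from_search / total if total else 0.0,
        "top_pair": top_pair,
        "top_pair_count": top_n,
    }


# Toy rows, not real clickstream data.
toy_rows = [
    ("other-search", "X", "external", 60),
    ("A", "X", "link", 30),
    ("X", "B", "link", 10),
]
summary = summarize(toy_rows)
print(summary)
```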

The Popular Article

Validate that the article is truly popular

Retrieve rows containing the popular article

Constructing the multipartite graph with NetworkX

Graph Explanation

The graph is multipartite (tripartite, to be exact). We divide the articles into three groups: source articles, the popular article, and destination articles.

We arrange these groups from left to right in that order, so the arrows generally flow from left to right as well.

However, some articles are both a source and a destination. These appear as arrows pointing from the middle back to the left.
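The grouping described above can be sketched as follows. This is an illustrative reconstruction, not the notebook's original code: `tripartite_groups` and `build_graph` are hypothetical names, rows are assumed to be `(prev, curr, link_type, n)` tuples, and the toy rows are made up. NetworkX is imported only inside `build_graph`, which uses `DiGraph` and `multipartite_layout` with a `subset` node attribute.

```python
def tripartite_groups(rows, article):
    """Split rows around `article` into sources, destinations, and overlap."""
    sources = {}       # source title -> count of clicks into the article
    destinations = {}  # destination title -> count of clicks out of the article
    for prev, curr, _, n in rows:
        if curr == article and prev != article:
            sources[prev] = sources.get(prev, 0) + n
        elif prev == article and curr != article:
            destinations[curr] = destinations.get(curr, 0) + n
    both = set(sources) & set(destinations)
    return sources, destinations, both


def build_graph(sources, destinations, article):
    """Build a DiGraph laid out in three columns (requires networkx)."""
    import networkx as nx  # third-party dependency

    G = nx.DiGraph()
    G.add_node(article, subset=1)  # middle column
    for title, n in sources.items():
        G.add_node(title, subset=0)  # left column
        G.add_edge(title, article, weight=n)
    for title, n in destinations.items():
        if title not in sources:
            G.add_node(title, subset=2)  # right column
        # An article that is also a source stays on the left, so this
        # edge points from the middle back to the left.
        G.add_edge(article, title, weight=n)
    pos = nx.multipartite_layout(G, subset_key="subset")
    return G, pos


# Toy rows, not real clickstream data.
toy_rows = [
    ("A", "P", "link", 40),
    ("B", "P", "link", 20),
    ("P", "C", "link", 15),
    ("P", "A", "link", 5),
]
sources, destinations, both = tripartite_groups(toy_rows, "P")
print(sources, destinations, both)
```

The returned `pos` dictionary can be passed to `nx.draw` to place the three groups in columns.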

For the graph below,

Finding Your Article in Other Languages

The langlinks API is a simple (automatic) way to get all the other language versions of a particular article. For example, you can do this with List of dinosaurs of the Morrison Formation and see that it exists not just in English but also in French.
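A minimal sketch of such a lookup, using only the standard library. It assumes the MediaWiki Action API endpoint at `https://<lang>.wikipedia.org/w/api.php` with `action=query`, `prop=langlinks`, and `formatversion=2`, and it assumes the response lists pages under `query.pages`, each with a `langlinks` list of `lang`/`title` entries; check this shape against a live response. The function names, the `User-Agent` string, and the hand-written sample response are hypothetical, and result continuation is ignored.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def langlinks_url(title, lang="en"):
    """Build the query URL for the language links of one article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


def parse_langlinks(response):
    """Map language code -> article title from a decoded JSON response."""
    links = {}
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["title"]
    return links


def fetch_langlinks(title, lang="en"):
    """Perform the HTTP request (needs network access)."""
    request = Request(
        langlinks_url(title, lang),
        headers={"User-Agent": "clickstream-tutorial-sketch/0.1"},  # placeholder
    )
    with urlopen(request) as resp:
        return parse_langlinks(json.load(resp))


# Hand-written response in the assumed shape, so parsing runs offline.
sample_response = {
    "query": {
        "pages": [
            {
                "title": "List of dinosaurs of the Morrison Formation",
                "langlinks": [
                    {
                        "lang": "fr",
                        "title": "Liste des dinosaures de la formation de Morrison",
                    }
                ],
            }
        ]
    }
}
url = langlinks_url("List of dinosaurs of the Morrison Formation")
links = parse_langlinks(sample_response)
print(links)
```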

Retrieve rows containing the popular article

Constructing the multipartite graph with NetworkX

Source and destination at the same time

For the graph below,

Compare Reader Behavior across Languages

Here you want to explore what's similar and what's different between how readers interact with your article depending on the language they are reading it in. Provide hypotheses for any large differences you see. You don't have to do any formal statistical tests unless you want to -- it can be just observations you have about the data. Feel free to focus on the article you chose or expand to other articles.

Remember, you can use the langlinks API to see whether an article in one language is the same as one in another language. For instance, in the French clickstream dataset we see that someone went from the article Formation de Morrison to Liste des dinosaures de la formation de Morrison in French Wikipedia 15 times in January. From the English clickstream dataset, we could see that someone went from the article Morrison Formation to List of dinosaurs of the Morrison Formation in English Wikipedia 401 times in January. With the langlinks API, we can verify that this reading path is equivalent in French and English (same source and destination article, just different languages).
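The check described above can be sketched as a small alignment step: translate each French (source, destination) pair into English titles, then look it up in the English counts. The counts 15 and 401 come from the example in the text; the `fr_to_en` mapping is written by hand here and would normally be built from langlinks results. The names `translate_pair` and `compare_paths` are hypothetical. The sketch assumes clickstream titles use underscores, so titles returned with spaces would need converting first.

```python
def translate_pair(pair, title_map):
    """Translate a (source, destination) pair, or None if a title is unmapped."""
    src, dst = pair
    if src in title_map and dst in title_map:
        return title_map[src], title_map[dst]
    return None


def compare_paths(fr_counts, en_counts, fr_to_en):
    """List (english_pair, french_count, english_count) for shared paths."""
    shared = []
    for fr_pair, fr_n in fr_counts.items():
        en_pair = translate_pair(fr_pair, fr_to_en)
        if en_pair is not None and en_pair in en_counts:
            shared.append((en_pair, fr_n, en_counts[en_pair]))
    return shared


# Hand-written mapping for the example in the text.
fr_to_en = {
    "Formation_de_Morrison": "Morrison_Formation",
    "Liste_des_dinosaures_de_la_formation_de_Morrison":
        "List_of_dinosaurs_of_the_Morrison_Formation",
}
fr_counts = {
    ("Formation_de_Morrison",
     "Liste_des_dinosaures_de_la_formation_de_Morrison"): 15,
}
en_counts = {
    ("Morrison_Formation",
     "List_of_dinosaurs_of_the_Morrison_Formation"): 401,
}
shared_paths = compare_paths(fr_counts, en_counts, fr_to_en)
print(shared_paths)
```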

Royal Danish Navy Ships

Patrol Boat


Above are the cases where the source articles are the same, just in different languages (English vs. French). The pageview counts are shown. In the French graph, the popular article does not point to any other article (there is no destination article with the popular article as its source). Hence, the French graph is only bipartite instead of tripartite.

The French graph is much simpler, but it resembles the first graph (the English one). To be precise, the French graph is a subset of the English graph. The edge thickness represents the pageview count relative to that graph. This normalization is needed because, for example, the pageview counts in the French graph are much smaller than those in the English graph. If both graphs used the same scale, the French graph's edges would be very thin.

Since the edge thickness is normalized within each graph, it means something different in the English graph than in the French graph. That is, edges of the same thickness represent the same pageview count within one graph, but different pageview counts across graphs.
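One way to express this per-graph normalization is to scale each count by the largest count in its own graph. This is only a sketch: the original notebook's exact scaling is not shown, `edge_widths` and its `max_width`/`min_width` parameters are hypothetical, and apart from 401 and 15 (taken from the example above) the counts are made up.

```python
def edge_widths(counts, max_width=8.0, min_width=0.5):
    """Scale counts to line widths relative to the largest count in one graph."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {
        edge: max(min_width, max_width * n / peak)
        for edge, n in counts.items()
    }


# 401 and 15 come from the text; the other two counts are invented.
english = {("Morrison_Formation", "List"): 401, ("Other", "List"): 100}
french = {("Formation_de_Morrison", "Liste"): 15, ("Autre", "Liste"): 5}

widths_en = edge_widths(english)
widths_fr = edge_widths(french)
print(widths_en)
print(widths_fr)
```

In this toy example the heaviest edge in each graph gets the full width, even though one stands for 401 pageviews and the other for 15, which is why thickness cannot be compared across graphs.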

Future Analyses

TODO: Describe what additional patterns you might want to explore in the data (and why). You don't have to know how to do the analyses.

Descriptive Analytics: PageRank Centrality

Inferential Analytics: Homophily

Predictive Analytics