Studying the Wikipedia Clickstream

This notebook is a tutorial on studying reader behavior using the monthly Wikipedia Clickstream dumps. It has three stages:

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse the Wikipedia Clickstream dump for a wiki and gather the data for a given article.
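As a rough illustration of the parsing step, the sketch below streams a gzipped clickstream file row by row. It assumes the published dump layout of four tab-separated columns (source, destination, type, count) with no header row. The function name `iter_clickstream` and the file name in the usage comment are placeholders, not taken from the notebook.

```python
import gzip


def iter_clickstream(path):
    """Yield (prev, curr, link_type, n) tuples from a gzipped clickstream TSV.

    Assumes four tab-separated columns per line and no header row.
    Lines that do not fit this shape are skipped rather than raising.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 4:
                continue  # malformed line
            prev, curr, link_type, n = parts
            try:
                yield prev, curr, link_type, int(n)
            except ValueError:
                continue  # count column is not an integer


# Hypothetical usage (placeholder file name):
# for prev, curr, link_type, n in iter_clickstream("clickstream-enwiki-YYYY-MM.tsv.gz"):
#     ...
```

Streaming the file this way keeps memory use low, since a full month of a large wiki contains tens of millions of rows.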

Task 1

Extracting the required data

1. How many unique "destination" articles are in the dump?

2. What proportion of pageviews in the dataset have search engines (other-search) as the source?

3. What was the most common source-destination pair?


All three questions can be answered in a single pass over the dataset by performing every computation inside one loop, which reduces the total running time. However, the code here is split into three separate loops to make it easier to follow.
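For comparison, a single-pass version might look like the sketch below. It is not the notebook's code: the function name `summarize` and the toy rows are made up for illustration, and the input is assumed to be an iterable of `(source, destination, type, count)` tuples.

```python
from collections import Counter


def summarize(rows):
    """Answer the three questions in one pass over (prev, curr, type, n) rows."""
    destinations = set()     # unique destination articles
    total_views = 0          # all pageviews in the dataset
    search_views = 0         # pageviews whose source is "other-search"
    pair_counts = Counter()  # clicks per (source, destination) pair

    for prev, curr, _link_type, n in rows:
        destinations.add(curr)
        total_views += n
        if prev == "other-search":
            search_views += n
        pair_counts[(prev, curr)] += n

    top_pair = pair_counts.most_common(1)[0] if pair_counts else None
    return {
        "unique_destinations": len(destinations),
        "search_proportion": search_views / total_views if total_views else 0.0,
        "top_pair": top_pair,
    }


# Toy rows, invented for illustration only.
toy_rows = [
    ("other-search", "A", "external", 60),
    ("B", "A", "link", 30),
    ("A", "C", "link", 10),
]
summary = summarize(toy_rows)
```

The trade-off is the one described above: one loop is faster on a large dump, while three loops keep each question's logic separate.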

Task 2

The destination article chosen for this task is "Kotaro_Uchikoshi". The article exists in English (en) and Japanese (ja) and had at least 1000 page views in the month of January (checked via the tool:

The steps for choosing an article that meets the given criteria are listed below (see Section A of the Appendix for the detailed process):

Pull all the data in the clickstream dataset for the chosen article (both as a source and destination)
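A minimal sketch of this step, assuming rows are `(source, destination, type, count)` tuples and that titles use underscores as in the dump. The function name `split_flows` and the toy rows are hypothetical.

```python
def split_flows(rows, title):
    """Split rows into inflow (title is the destination) and outflow (title is the source)."""
    inflow, outflow = [], []
    for row in rows:
        prev, curr, _link_type, _n = row
        if curr == title:
            inflow.append(row)
        if prev == title:
            outflow.append(row)
    return inflow, outflow


# Toy rows, invented for illustration only.
toy_rows = [
    ("other-search", "Kotaro_Uchikoshi", "external", 40),
    ("Article_P", "Kotaro_Uchikoshi", "link", 10),
    ("Kotaro_Uchikoshi", "Article_Q", "link", 35),
    ("Kotaro_Uchikoshi", "Article_R", "link", 25),
    ("Article_P", "Article_Q", "link", 5),
]
inflow, outflow = split_flows(toy_rows, "Kotaro_Uchikoshi")

# Total clicks arriving at and leaving the article.
inflow_total = sum(n for *_rest, n in inflow)
outflow_total = sum(n for *_rest, n in outflow)
```

A single loop is used so that `rows` can also be a generator that is consumed only once.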

Visualizing the data to show common pathways to and from the article

From the above plot, we notice that the outflow exceeds the inflow, i.e., the total number of clicks leading to the chosen article is lower than the total number of clicks leading from the chosen article to other articles. This should not be interpreted as an inconsistency. One possible reason is that people open links in multiple tabs as they read an article, so a single pageview can produce several records with that page as the referrer. The data therefore suggests that some fraction of readers followed more than one link from the chosen page. One further, more speculative, reading is that many readers wanted additional context to understand the topic of the article and followed the linked pages to get it.

We notice that both inflow and outflow have a clear maximum. The most popular source for the article is "other-search", with around 2220 clicks, and the most popular destination is "Too Kyo Games", with approximately 2900 clicks.
The other relatively popular sources are "AI The Somnium Files", "other-empty" and "Zero Escape".
The other relatively popular destinations are "AI The Somnium Files", "The Girl in Twilight" and "Infinity (Video Game Series)".
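A ranking like the one described above can be obtained with a small helper such as the following sketch. The name `top_k` and the toy rows are illustrative assumptions, and the counts are invented rather than taken from the dump.

```python
import heapq


def top_k(rows, k=3):
    """Return the k rows with the largest click count (last field of each row)."""
    return heapq.nlargest(k, rows, key=lambda row: row[3])


# Toy inflow rows (source, destination, type, count), invented for illustration.
toy_inflow = [
    ("Article_P", "Article_X", "link", 12),
    ("other-search", "Article_X", "external", 90),
    ("Article_Q", "Article_X", "link", 30),
    ("other-empty", "Article_X", "external", 45),
]
top_sources = [prev for prev, _curr, _type, _n in top_k(toy_inflow, k=3)]
```

For outflow rows the same helper applies; only the field read from each result changes from the source to the destination.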

On further inspection of the most frequently visited links from the Kotaro Uchikoshi Wikipedia page, we observe that all of them are present in the introductory paragraph of the article. We also observe that each of them appears at least 3 times on the article page. This suggests that readers tend to click on links in the first paragraph, or that links which appear repeatedly draw more attention and clicks.

We can immediately see that even though only a few external sources lead to the chosen article, they have high frequencies compared to sources of type link.
The above plot has not been created for outflow_df, since all of its rows have type link.
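The comparison between source types can also be checked numerically. The sketch below groups inflow rows by type and reports the number of sources, the total clicks and the mean clicks per source. It uses plain tuples rather than the notebook's DataFrames, and the name `summarize_by_type` and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_by_type(inflow):
    """Group (prev, curr, type, n) rows by type and aggregate click counts."""
    stats = defaultdict(lambda: {"sources": 0, "clicks": 0})
    for _prev, _curr, link_type, n in inflow:
        stats[link_type]["sources"] += 1
        stats[link_type]["clicks"] += n
    # Add the mean clicks per source for each type.
    return {
        link_type: {**s, "mean_clicks": s["clicks"] / s["sources"]}
        for link_type, s in stats.items()
    }


# Toy inflow rows, invented for illustration only.
toy_inflow = [
    ("other-search", "Article_X", "external", 100),
    ("other-empty", "Article_X", "external", 50),
    ("Article_P", "Article_X", "link", 10),
    ("Article_Q", "Article_X", "link", 5),
    ("Article_R", "Article_X", "link", 15),
]
by_type = summarize_by_type(toy_inflow)
```

In this toy data the external type has fewer sources but a higher mean click count per source, which is the pattern described above.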

From the heatmap above, we can make the following inferences: