Studying the Wikipedia Clickstream

This notebook is a tutorial on studying reader behavior using the monthly Wikipedia Clickstream dumps. It has three stages:

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse the Wikipedia Clickstream dumps for a wiki and gather the data for a given article.

Task 1

Extracting the required data

We begin by extracting the first 1,000,000 entries of the dump into the file "data.csv". This sample of the data is used for the analysis that follows.
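A minimal sketch of this extraction step, assuming the dump is a gzip-compressed, line-oriented text file. The helper name `extract_head` and the input file name in the usage comment are hypothetical, not taken from the original notebook.

```python
import gzip
from itertools import islice


def extract_head(src_path, dst_path, n_rows=1_000_000):
    """Copy the first n_rows lines of a gzipped dump into an uncompressed file.

    Returns the number of lines actually written, which is smaller than
    n_rows when the dump is shorter than that.
    """
    written = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_rows lines, so the rest of the dump is never read.
        for line in islice(src, n_rows):
            dst.write(line)
            written += 1
    return written


# Hypothetical usage (the input file name is an assumption):
# extract_head("clickstream-dump.tsv.gz", "data.csv")
```

Streaming the file line by line keeps memory use flat regardless of the size of the dump.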

Now let's load the tab-separated data from "data.csv" into a pandas DataFrame.
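The loading step might look like the sketch below. It assumes the file has no header row and four tab-separated fields. The column names "Source", "Destination", "Type" and "Frequency" are assumptions chosen to match the names used in this notebook, and the three sample rows are made-up toy values that stand in for "data.csv".

```python
import csv
import io

import pandas as pd

# Assumed column names; they are not read from the file itself.
COLUMNS = ["Source", "Destination", "Type", "Frequency"]


def load_clickstream(path_or_buffer):
    """Read a headerless, tab-separated clickstream sample into a DataFrame."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        names=COLUMNS,
        quoting=csv.QUOTE_NONE,   # article titles may contain quote characters
        keep_default_na=False,    # keep titles such as "NA" as plain strings
        dtype={"Frequency": "int64"},
    )


# Toy rows for illustration only; in the notebook this would be "data.csv".
sample = io.StringIO(
    "other-search\tArticle_A\texternal\t50\n"
    "Article_A\tArticle_B\tlink\t30\n"
    "other-empty\tArticle_B\texternal\t20\n"
)
df = load_clickstream(sample)
```

Disabling quote handling and default NA parsing is a defensive choice: article titles are free text, so they should reach the DataFrame unchanged.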

1. How many unique "destination" articles are in the dump?
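One way to answer this, sketched on a toy DataFrame: count the distinct values in the destination column. The column name "Destination" and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for the loaded dump sample.
df = pd.DataFrame(
    {
        "Source": ["other-search", "Article_A", "other-empty"],
        "Destination": ["Article_A", "Article_B", "Article_B"],
        "Frequency": [50, 30, 20],
    }
)

# nunique() counts each distinct destination title once.
n_destinations = df["Destination"].nunique()
print(n_destinations)
```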

2. What proportion of pageviews in the dataset have search engines (other-search) as the source?

To find the proportion of entries whose source is "other-search", we compute the relative frequency of each unique value in the "Source" column and extract the value for that source. Note that this counts rows (source-destination pairs) rather than weighting each row by its "Frequency".
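The sketch below shows the row-based proportion described above and, for comparison, a variant weighted by click count, which may be closer to a share of pageviews. The weighted variant is an addition for illustration, not the notebook's original method. Column names and sample rows are assumptions.

```python
import pandas as pd

# Toy data standing in for the loaded dump sample.
df = pd.DataFrame(
    {
        "Source": ["other-search", "Article_A", "other-search", "Article_B"],
        "Destination": ["Article_A", "Article_B", "Article_C", "Article_C"],
        "Frequency": [60, 20, 10, 10],
    }
)

# Share of rows whose source is "other-search".
row_share = df["Source"].value_counts(normalize=True).get("other-search", 0.0)

# Share of clicks: each row is weighted by its frequency.
is_search = df["Source"] == "other-search"
weighted_share = df.loc[is_search, "Frequency"].sum() / df["Frequency"].sum()

print(row_share, weighted_share)
```

The two numbers differ whenever search-originated rows carry more or fewer clicks than the average row, so it is worth stating which one is being reported.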

3. What was the most common source-destination pair?

For the most common source-destination pair, we find the index of the row with the highest "Frequency" and read the source and destination from that row.
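A minimal sketch of this lookup, assuming each source-destination pair appears in exactly one row. Column names and sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for the loaded dump sample.
df = pd.DataFrame(
    {
        "Source": ["other-search", "Article_A", "other-empty"],
        "Destination": ["Article_A", "Article_B", "Article_B"],
        "Frequency": [50, 30, 20],
    }
)

# idxmax() returns the index label of the row with the largest frequency.
top_row = df.loc[df["Frequency"].idxmax()]
top_pair = (top_row["Source"], top_row["Destination"], int(top_row["Frequency"]))
print(top_pair)
```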

Task 2

Finding a popular destination article

Identify and store data corresponding to unique sources

Choosing an appropriate article for visualization

The chosen destination article for this task is "Kotaro_Uchikoshi". The article is available in English (en) and Japanese (ja), with at least 1000 page views in the month of January (checked with the Langviews tool: https://pageviews.toolforge.org/langviews/).

Visualizing the data to show common pathways to and from the article

From the above map, we can infer that the most common source through which people arrive at the chosen article is "other-search". The most common destination from the chosen article is "Too_Kyo_Games".

From the above plot, we notice that the outflow exceeds the inflow, i.e., the total number of clicks leading to the chosen article is less than the total number of clicks leading from it to other articles. This should not be interpreted as an inconsistency. One possible reason is that people open links in multiple tabs as they read an article, so a single pageview can produce multiple records with that page as the referrer. The data therefore suggest that some readers followed more than one link from the chosen page. It may also indicate that readers wanted more background on the topic and followed the linked pages to find it, although the clickstream alone cannot confirm this.
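The inflow and outflow totals compared above could be computed along the lines of the sketch below. The helper `split_flows`, the column names, and the toy rows (with the placeholder title "Article_X" instead of the chosen article) are assumptions for illustration.

```python
import pandas as pd


def split_flows(df, article):
    """Return (inflow, outflow) rows for one article title."""
    inflow = df[df["Destination"] == article]   # clicks arriving at the article
    outflow = df[df["Source"] == article]       # clicks leaving the article
    return inflow, outflow


# Toy data standing in for the loaded dump sample.
df = pd.DataFrame(
    {
        "Source": ["other-search", "Article_Y", "Article_X", "Article_X"],
        "Destination": ["Article_X", "Article_X", "Article_Y", "Article_Z"],
        "Frequency": [40, 10, 35, 25],
    }
)

inflow, outflow = split_flows(df, "Article_X")
total_in = int(inflow["Frequency"].sum())
total_out = int(outflow["Frequency"].sum())
print(total_in, total_out)
```

In this toy sample the outflow total is larger than the inflow total, mirroring the situation described above.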

Both inflow and outflow have a clear maximum. The most popular source of the article is "other-search", with a frequency of around 2220, and the most popular destination is "Too Kyo Games", with a frequency of approximately 2900.
The other relatively popular sources are "AI The Somnium Files", "other-empty" and "Zero Escape".
The other relatively popular destinations are "AI The Somnium Files", "The Girl in Twilight" and "Infinity (Video Game Series)".

We observe that the majority of inflow frequencies lie in the range of (1, 230). The majority of outflow frequencies lie in the range of (1, 250).