Studying the Wikipedia Clickstream

This notebook provides a tutorial for how to study reader behavior via the monthly Wikipedia Clickstream dumps. It has three stages:

Table of Contents

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse through the Wikipedia clickstream dumps for a wiki and gather the data for a given article.

Accessing the Clickstream Data

Every language edition of Wikipedia has its own clickstream dump. You can find all the dbnames (e.g., enwiki) here: for example, you could replace the LANGUAGE parameter of 'enwiki' with 'arwiki' to study Arabic Wikipedia.

List all languages available -- the largest is English at 384MB compressed. Some of these could easily be loaded into memory, but English is large enough that you wouldn't want to do that. The language codes can be decoded as follows:
de = German
en = English
es = Spanish
fa = Persian
fr = French
it = Italian
ja = Japanese
pl = Polish
pt = Portuguese
ru = Russian
zh = Chinese
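The dumps for a given month live in a single directory, so the download URLs can be built directly from the language codes above. A minimal sketch, assuming the standard dump layout on dumps.wikimedia.org and the January 2021 snapshot:

```python
# Build the expected clickstream dump URLs for January 2021.
# The directory layout (other/clickstream/YYYY-MM/) and filename
# pattern are based on the published dumps.
LANGUAGES = ["de", "en", "es", "fa", "fr", "it", "ja", "pl", "pt", "ru", "zh"]
MONTH = "2021-01"
BASE_URL = f"https://dumps.wikimedia.org/other/clickstream/{MONTH}"

dump_urls = {
    lang: f"{BASE_URL}/clickstream-{lang}wiki-{MONTH}.tsv.gz"
    for lang in LANGUAGES
}
print(dump_urls["en"])
```

From here the file for any of the listed languages can be fetched with a standard HTTP download.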

Inspect the data

Inspect the top of the English January 2021 clickstream dump to see what it looks like. The format is described here:
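Since the dump is gzip-compressed TSV, the first rows can be read without decompressing the whole file. A small sketch (the filename assumes the January 2021 English dump has already been downloaded to the working directory):

```python
import gzip

def head(path, n=5):
    """Return the first n lines of a gzip-compressed TSV dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [f.readline().rstrip("\n") for _ in range(n)]

# Example usage:
# for line in head("clickstream-enwiki-2021-01.tsv.gz"):
#     print(line)
```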

The first datapoint ("List_of_dinosaurs_of_the_Morrison_Formation Epanterias link 18") can be interpreted as: 18 times in the month of January 2021, someone was reading List_of_dinosaurs_of_the_Morrison_Formation (this is the "source" or "prev") and then they clicked on a link in that article and opened up the Epanterias article (this is the "destination" or "curr").

As you can see, readers got to the Epanterias article in at least six other ways:

You might notice that this doesn't account for all the pageviews to that article in January (see here). That's because, for privacy reasons, a particular path between articles must occur at least 10 times to be reported.

Working with the data

Now that we have our data, we can focus on analyzing the English clickstream data for the things that we are interested in. To do this we need the Python package pandas to create a DataFrame of the clickstream data.

We also need to decompress the file using 'gzip' since it is in tsv.gz format, as we previously saw. In addition, since the data file is so large, we need to break the DataFrame into 'chunks' so as not to exceed memory limits. We can then loop through these chunks and gather the information we care about. The questions we want to answer are:


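The chunked processing described above can be sketched as follows. The column names follow the documented clickstream format; the per-source aggregation is just one illustrative example of a question you might answer while looping over chunks:

```python
import pandas as pd

# Column names per the clickstream format documentation.
COLS = ["source", "destination", "type", "count"]

def aggregate_by_source(path, chunksize=500_000):
    """Sum click counts per source article, streaming the file in chunks
    so the full dump never has to fit in memory at once."""
    totals = {}
    # pandas infers gzip compression from a .tsv.gz filename automatically;
    # quoting=3 is csv.QUOTE_NONE, since titles may contain quote characters.
    for chunk in pd.read_csv(path, sep="\t", names=COLS,
                             chunksize=chunksize, quoting=3):
        for source, count in chunk.groupby("source")["count"].sum().items():
            totals[source] = totals.get(source, 0) + count
    return totals
```

The same loop structure works for any other per-chunk aggregation: compute a partial result on each chunk, then merge it into a running total.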
For the next step we will choose an article from our dataset that has at least 1,000 pageviews and 20 unique sources, and that shows up in at least one other language with at least 1,000 pageviews in January 2021. For this tutorial we will choose the article 'Studio Ghibli'. We can then check the views for this page, or others, with this tool here.

Then we will pull all the data in the clickstream dataset for our chosen article (both as a source and as a destination) and visualize it, using a Sankey diagram, to show what the common pathways to and from the article are. The library used to create the diagram is Plotly, and it has documentation here.
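The filtering and diagram-building step can be sketched as below. This assumes the clickstream rows are already in a DataFrame `df` with the standard columns; `build_sankey_data` and its "(out)" suffix for outbound nodes are illustrative choices (the suffix keeps inbound and outbound pathways on separate sides of the diagram even when the same article appears in both):

```python
import pandas as pd

COLS = ["source", "destination", "type", "count"]

def build_sankey_data(df, article):
    """Collect inbound and outbound clickstream flows for one article as
    (labels, sources, targets, values) index lists for a Sankey diagram."""
    inbound = df[df["destination"] == article]
    outbound = df[df["source"] == article]
    labels = [article]          # node 0 is the article itself
    idx = {article: 0}
    sources, targets, values = [], [], []
    for _, row in inbound.iterrows():
        label = row["source"]
        if label not in idx:
            idx[label] = len(labels)
            labels.append(label)
        sources.append(idx[label])
        targets.append(0)
        values.append(row["count"])
    for _, row in outbound.iterrows():
        label = row["destination"] + " (out)"  # keep out-nodes distinct
        if label not in idx:
            idx[label] = len(labels)
            labels.append(label)
        sources.append(0)
        targets.append(idx[label])
        values.append(row["count"])
    return labels, sources, targets, values

# Rendering with Plotly (assumes plotly is installed):
# import plotly.graph_objects as go
# labels, s, t, v = build_sankey_data(df, "Studio Ghibli")
# fig = go.Figure(go.Sankey(node=dict(label=labels),
#                           link=dict(source=s, target=t, value=v)))
# fig.show()
```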