Studying the Wikipedia Clickstream

This notebook provides a tutorial for studying reader behavior via the monthly Wikipedia Clickstream dumps. It has three stages:

Table of Contents

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse the Wikipedia clickstream dumps for a wiki and gather the data for a given article.

Accessing the Clickstream Data

Every language edition of Wikipedia has its own clickstream dump. You can find all the dbnames (e.g., enwiki) here: https://www.mediawiki.org/w/api.php?action=sitematrix. For example, you could replace the LANGUAGE parameter of 'enwiki' with 'arwiki' to study Arabic Wikipedia.

List all languages available -- the largest is English at 384MB compressed. Some of these could easily be loaded into memory, but English is large enough that you wouldn't want to do that. The language codes can be decoded as follows:
de = German
en = English
es = Spanish
fa = Persian
fr = French
it = Italian
ja = Japanese
pl = Polish
pt = Portuguese
ru = Russian
zh = Chinese
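One way to see where each language's dump lives is to construct the download URLs directly. The base URL and file-naming pattern below are assumptions based on the public layout of the clickstream dumps; adjust MONTH for the month you want.

```python
# Construct download URLs for the monthly clickstream dumps.
BASE_URL = "https://dumps.wikimedia.org/other/clickstream"
MONTH = "2021-01"  # January 2021
LANGUAGES = ["de", "en", "es", "fa", "fr", "it", "ja", "pl", "pt", "ru", "zh"]

def clickstream_url(lang, month=MONTH):
    """Return the URL of the gzipped TSV dump for one language and month."""
    return f"{BASE_URL}/{month}/clickstream-{lang}wiki-{month}.tsv.gz"

for lang in LANGUAGES:
    print(lang, clickstream_url(lang))
```

Each URL points at a tsv.gz file, which is what we download and parse in the rest of the notebook.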

Inspect the data

Inspect the top of the English January 2021 clickstream dump to see what it looks like. The format is described here: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
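A minimal way to peek at the top of the dump without decompressing the whole file -- the filename in the usage note is a placeholder for wherever you saved the download:

```python
import gzip

def head_clickstream(path, n=10):
    """Yield the first n rows of a gzipped clickstream TSV as
    (prev, curr, type, count) tuples."""
    with gzip.open(path, mode="rt", encoding="utf-8") as fin:
        for i, line in enumerate(fin):
            if i >= n:
                break
            prev, curr, link_type, count = line.rstrip("\n").split("\t")
            yield prev, curr, link_type, int(count)
```

For example: `for row in head_clickstream("clickstream-enwiki-2021-01.tsv.gz"): print(row)`.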

The first datapoint ("List_of_dinosaurs_of_the_Morrison_Formation Epanterias link 18") can be interpreted as: 18 times in the month of January 2021, someone was reading https://en.wikipedia.org/wiki/List_of_dinosaurs_of_the_Morrison_Formation (this is the "source" or "prev") and then clicked on a link in that article and opened this article: https://en.wikipedia.org/wiki/Epanterias (this is the "destination" or "curr").

As you can see, readers got to the Epanterias article in at least six other ways:

You might notice that this doesn't account for all the pageviews to that article in January (see here). That's because, for privacy reasons, a particular path between articles must occur at least 10 times to be reported.

Working with the data

Now that we have our data, we can focus on analyzing the English clickstream data for the things we are interested in. To do this we need the Python package pandas to create a dataframe of the clickstream data.

We also need to decompress the file using 'gzip' compression, since it is in tsv.gz format as we previously saw. In addition, since the data file is so large, we need to break the dataframe into 'chunks' to avoid exceeding memory limits. We can then loop through these chunks and collect the information we care about. The questions we want to answer are:
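The chunked read itself can be sketched as follows. The chunk size is an assumption to tune for your machine, and the column names follow the clickstream format:

```python
import pandas as pd

# Clickstream TSV columns: source page, destination page, link type, count.
COLUMNS = ["prev", "curr", "type", "n"]

def rows_for_article(path, title, chunksize=500_000):
    """Collect every row where `title` appears as source or destination,
    reading the gzipped TSV in chunks so it never sits fully in memory."""
    pieces = []
    for chunk in pd.read_csv(path, sep="\t", compression="gzip",
                             header=None, names=COLUMNS,
                             chunksize=chunksize):
        mask = (chunk["prev"] == title) | (chunk["curr"] == title)
        pieces.append(chunk[mask])
    return pd.concat(pieces, ignore_index=True)
```

The same loop structure works for any per-chunk aggregation, e.g. summing counts per destination instead of filtering on one title.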

Visualization

For the next step we will choose an article from our dataset that has at least 1000 pageviews and 20 unique sources, and that shows up in at least one other language with at least 1000 pageviews in January 2021. For this tutorial we will choose the article 'Studio Ghibli'. We can then check the views for this page (or others) with this tool here.

Then we will pull all the data in the clickstream dataset for our chosen article (both as a source and destination) and visualize the data using a Sankey diagram, to show the common pathways to and from the article. The library used to create the diagram is Plotly; its documentation is here.

Finding Your Article in Other Languages

The langlinks API is a simple (automatic) way to get all the other language versions of a particular article. For example, you can do this with List of dinosaurs of the Morrison Formation and see that it exists not just in English but also in French.

Using the MediaWiki API

For the article we visualized above, we will use the API to gather its page titles in all of the other clickstream languages (if they exist).

To do this we will need the MediaWiki API client (mwapi) and will need to set a contact email to access the API. NOTE: it is best practice to include a contact email in user agents. Generally this is private information, though, so do not change it to yours if you are working in the PAWS environment or adding to a GitHub repo.

We also need to add additional parameters here to query the pages we're interested in. We can find the documentation for Langlinks here, which allows us to query just the article we want in every available language.

Visualization in another language

For at least one language in which the article exists and which has a corresponding clickstream dataset -- we will choose Spanish -- we will loop through that clickstream dataset and gather all the relevant data (as we did in English for the visualization above).

Compare Reader Behavior across Languages

Here you want to explore what's similar and what's different between how readers interact with your article depending on the language they are reading it in. Provide hypotheses for any large differences you see. You don't have to do any formal statistical tests unless you want to -- it can be just observations you have about the data. Feel free to focus on the article you chose or expand to other articles.

Remember, you can use the langlinks API to see whether an article in one language is the same as one in another language. For instance, in the French clickstream dataset we see that someone went from the article Formation de Morrison to Liste des dinosaures de la formation de Morrison in French Wikipedia 15 times in January. From the English clickstream dataset, we could see that someone went from the article Morrison Formation to List of dinosaurs of the Morrison Formation in English Wikipedia 401 times in January. With the langlinks API, we can verify that this reading path is equivalent in French and English (same source and destination article, just different languages).

Comparison

Comparing the English and Spanish 'Studio Ghibli' pages' clickstream data, a few things stand out. Mainly, some major destinations from the English page are not present at all in the Spanish data. Looking at the wiki pages for both (English and Spanish) reveals that they differ in both structure and content, notably not containing all the same links.

We can get a list of all of the links from our English 'Studio Ghibli' page and then check to see how many of them have a Spanish equivalent by querying the langlinks API.
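A sketch of the first half of that check, using the MediaWiki links API directly and following 'continue' pagination. The user agent is a placeholder; each returned title can then be passed to the langlinks query above to test for a Spanish equivalent:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "clickstream-tutorial <your-email@example.com>"}

def get_links(title):
    """Return all main-namespace article titles linked from `title`."""
    links = []
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,   # main (article) namespace only
        "pllimit": "max",
        "format": "json",
        "formatversion": 2,
    }
    while True:
        data = requests.get(API_URL, params=params, headers=HEADERS).json()
        page = data["query"]["pages"][0]
        links.extend(link["title"] for link in page.get("links", []))
        if "continue" not in data:  # no more pages of results
            return links
        params.update(data["continue"])
```

The fraction of these links with no Spanish langlink gives the ~27% figure discussed below.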

Looking at our English visualization data we can check some of the top destinations from the English article and see if they have an equivalent Spanish article.

The third most popular single destination from the English page, 'List of Studio Ghibli works', does not exist in Spanish. Comparing the total number of articles that exist in both languages, we see that Spanish is missing ~27% of the articles that appear in the English clickstream. Further analysis would be needed to see how the lack of articles in some languages shapes the clickstream data across languages as a whole.

Future Analyses

Two topics which could potentially provide more insight with further analysis are:

The source-to-destination flow for both is primarily search engine/other, or more well-known movies => less well-known movies and people associated with the studio. Further analysis may reveal whether the trend of general => specific continues to hold further into the clickstream data. It would be interesting to follow users' paths in the clickstream and see if their 'rabbit hole' continues to get more and more specific, or if at some point it broadens again. For example, if a user goes 'google search' => 'Studio Ghibli' => 'Spirited Away' => a specific cast member, do they stop there, or do they at some point reach a more general link and repeat the process? Is this pattern repeated in the greater clickstream data?

Does the varying performance of the studio's movies in regions that speak the article's language create statistically significant differences in clickstream usage? For example, if a certain band in a certain country is 'bigger than the Beatles', would a trend like that be reflected in the clickstream data? If certain movies haven't been translated into that language, are they much less prevalent in the clickstream?