Studying the Wikipedia Clickstream

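This notebook provides a tutorial on studying reader behavior with the monthly Wikipedia Clickstream dumps. It has three stages: accessing the clickstream dumps, finding your article in other languages, and comparing reader behavior across languages.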

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse through the Wikipedia clickstream dumps for a wiki and gather the data for a given article.

Due to memory constraints, we loop through the data and read it in chunks of 10,000 rows using pandas. In each chunk we name the columns source, destination, medium, and pageview. The medium column describes how readers arrived at the destination, and the pageview column counts how many times readers reached the destination from that source.
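The chunked read described above can be sketched as follows. This is a minimal sketch, not the notebook's original cell: the helper name read_clickstream is hypothetical, and it assumes the dump is a tab-separated file without a header row. Disabling quoting and default NA parsing is our own precaution, so that titles containing quote characters or titles such as "NaN" are kept as plain strings.

```python
import csv

import pandas as pd

# Column names used throughout this notebook, in the order they appear in the dump.
COLUMNS = ["source", "destination", "medium", "pageview"]


def read_clickstream(path, chunksize=10_000):
    """Return an iterator of DataFrames with at most `chunksize` rows each."""
    return pd.read_csv(
        path,                    # a .tsv or .tsv.gz path; compression is inferred
        sep="\t",
        names=COLUMNS,
        header=None,             # assumption: the dump has no header row
        chunksize=chunksize,
        quoting=csv.QUOTE_NONE,  # treat quote characters in titles literally
        keep_default_na=False,   # keep titles such as "NaN" or "null" as strings
        dtype={"source": str, "destination": str, "medium": str, "pageview": "int64"},
    )
```

Each later step can then iterate with `for chunk in read_clickstream(path):`, where path points to a locally downloaded dump file.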

For each chunk, we find the unique destinations and add them to a list. After the loop, we remove duplicates from this list and use its length as the number of unique destinations in the entire dataset.
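A minimal sketch of this count is shown below. It swaps the list for a set, which removes duplicates as it goes and gives the same final number. The function name count_unique_destinations is hypothetical, and chunks is assumed to be any iterable of DataFrames with a destination column.

```python
import pandas as pd


def count_unique_destinations(chunks):
    """Count distinct destination titles across all chunks."""
    seen = set()
    for chunk in chunks:
        # unique() shrinks the chunk before it is merged into the running set
        seen.update(chunk["destination"].unique())
    return len(seen)
```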

Similarly, we compute the sum of all pageviews and the sum of pageviews for rows that have 'other-search' as the source. After the loop, we use these two values to calculate the proportion of pageviews in the dataset that came from search engines.
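The two running sums and the final ratio can be sketched like this. The function name search_proportion is hypothetical, and chunks is assumed to be an iterable of DataFrames with source and pageview columns.

```python
import pandas as pd


def search_proportion(chunks):
    """Share of all pageviews whose source is 'other-search'."""
    total = 0
    from_search = 0
    for chunk in chunks:
        total += int(chunk["pageview"].sum())
        is_search = chunk["source"] == "other-search"
        from_search += int(chunk.loc[is_search, "pageview"].sum())
    # Guard against an empty dataset to avoid dividing by zero
    return from_search / total if total else 0.0
```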

We also track the maximum pageview value: we find the maximum in each chunk and update the running maximum whenever a later chunk has a larger value. This gives the largest pageview value in the entire dataset, which we use to find the most popular source-destination pair.
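One way to keep the running maximum is sketched below. Instead of storing only the largest value and looking up the pair afterwards, this version remembers the whole row, which is a small variation on the step described above. The function name most_popular_pair is hypothetical.

```python
import pandas as pd


def most_popular_pair(chunks):
    """Return (source, destination, pageview) for the row with the most pageviews."""
    best = None
    for chunk in chunks:
        if chunk.empty:
            continue
        # Row with the largest pageview value in this chunk
        row = chunk.loc[chunk["pageview"].idxmax()]
        if best is None or row["pageview"] > best[2]:
            best = (row["source"], row["destination"], int(row["pageview"]))
    return best
```

If several rows tie for the maximum, the first one encountered is kept.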

First, we shortlist the data based on pageviews. The list 'candidate_articles' contains only destinations that have been viewed 1,000 times.
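A possible version of this shortlist is sketched below. It rests on two assumptions that the text does not settle: that the pageviews of a destination are summed over all of its sources, and that the threshold means "at least 1,000". The function name candidate_articles and the min_views parameter are hypothetical.

```python
from collections import Counter

import pandas as pd


def candidate_articles(chunks, min_views=1000):
    """Destinations whose summed pageviews reach `min_views` (assumed: >=)."""
    totals = Counter()
    for chunk in chunks:
        per_destination = chunk.groupby("destination")["pageview"].sum()
        # Counter.update adds the new counts to the running totals
        totals.update(per_destination.to_dict())
    return [title for title, views in totals.items() if views >= min_views]
```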

Now we select 15 articles as final candidates from the candidate articles list. We loop through the candidate list and, for each candidate, get all the clickstream rows where it is the destination. To count its unique sources, we remove rows with duplicate sources and count the remaining rows. If the article has 20 or more unique sources, we add its name to the final_candidate list. We continue this process until we have 15 final candidates.
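The selection can be sketched as below. Rather than rescanning the data once per candidate, this version collects the sources of every candidate in a single pass and then keeps the first 15 that qualify, which is a variation on the loop described above. The function and parameter names are hypothetical.

```python
import pandas as pd


def final_candidates(chunks, candidates, min_sources=20, limit=15):
    """First `limit` candidates that have at least `min_sources` distinct sources."""
    wanted = set(candidates)
    sources = {title: set() for title in candidates}
    for chunk in chunks:
        hits = chunk[chunk["destination"].isin(wanted)]
        for destination, group in hits.groupby("destination"):
            # A set keeps each source once, like dropping duplicate rows
            sources[destination].update(group["source"])
    selected = [t for t in candidates if len(sources[t]) >= min_sources]
    return selected[:limit]
```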

The selected popular article for this task is 'Ethical consumerism'. Again we loop through the dataset in chunks of 10,000 rows. In each iteration, we append the rows that have the popular article as the source to 'source', and the rows that have it as the destination to 'destination'. Finally, we turn both 'source' and 'destination' into pandas data frames so that we can plot them.
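A minimal sketch of this collection step follows. The function name paths_for_article is hypothetical. Note that the dump files write spaces in titles as underscores, so the title passed in should use that form (for example Ethical_consumerism).

```python
import pandas as pd

COLUMNS = ["source", "destination", "medium", "pageview"]


def _combine(parts):
    """Concatenate the collected pieces, or return an empty frame if none matched."""
    if not parts:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(parts, ignore_index=True)


def paths_for_article(chunks, title):
    """Return (source, destination) frames: rows leaving and rows reaching `title`."""
    source_parts, destination_parts = [], []
    for chunk in chunks:
        outgoing = chunk[chunk["source"] == title]
        incoming = chunk[chunk["destination"] == title]
        if not outgoing.empty:
            source_parts.append(outgoing)
        if not incoming.empty:
            destination_parts.append(incoming)
    return _combine(source_parts), _combine(destination_parts)
```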

'source' contains the rows where the source is 'Ethical consumerism', so these rows can be considered popular paths from the selected article. We plot them with the destinations on the x-axis and the pageviews on the y-axis.

We plot the rows where the destination is 'Ethical consumerism' in the same way. These rows can be considered popular paths to the selected article.
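Both plots can be produced with one helper, sketched below as a bar chart drawn with matplotlib. The function names and the limit of 10 bars are our own assumptions, added to keep the x-axis readable; the text above does not say how many paths are plotted.

```python
import pandas as pd


def top_paths(frame, label_column, n=10):
    """The `n` rows with the most pageviews, keeping the label and pageview columns."""
    ordered = frame.sort_values("pageview", ascending=False).head(n)
    return ordered[[label_column, "pageview"]]


def plot_paths(frame, label_column, title, n=10):
    """Bar chart with the articles on the x-axis and pageviews on the y-axis."""
    import matplotlib.pyplot as plt  # imported here so top_paths works without it

    top = top_paths(frame, label_column, n)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(top[label_column], top["pageview"])
    ax.set_xlabel(label_column)
    ax.set_ylabel("pageview")
    ax.set_title(title)
    ax.tick_params(axis="x", labelrotation=90)  # long titles overlap otherwise
    fig.tight_layout()
    return ax
```

For paths from the article, call `plot_paths(source, "destination", "Paths from the article")`; for paths to the article, call `plot_paths(destination, "source", "Paths to the article")`.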

Finding Your Article in Other Languages

The langlinks API is a simple (automatic) way to get all the other language versions of a particular article. For example, you can do this with List of dinosaurs of the Morrison Formation and see that it exists not just in English but also in French.

After checking in the Pageview tools, we found that our selected article is available in 17 languages. We set the titles parameter to our selected article and the lllimit parameter to 100 so that the API returns up to 100 langlinks. The default limit is 10, which is why we raised it. We then query the API with these parameters to get the titles of the selected popular article in all the other available languages.
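The query can be sketched as below using only the standard library. The helper names and the User-Agent string are hypothetical, and the parser assumes the API's default JSON layout, in which each langlink carries its language code under "lang" and its title under "*".

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def langlinks_params(title, limit=100):
    """Query parameters for the langlinks of one article."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": limit,  # the API returns 10 langlinks unless this is raised
        "format": "json",
    }


def parse_langlinks(response):
    """Map language code to title from a decoded API response."""
    links = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["*"]
    return links


def fetch_langlinks(title, limit=100):
    """Call the API over the network and return {language code: title}."""
    url = API_URL + "?" + urllib.parse.urlencode(langlinks_params(title, limit))
    request = urllib.request.Request(
        url, headers={"User-Agent": "clickstream-tutorial-sketch/0.1"}  # hypothetical
    )
    with urllib.request.urlopen(request) as reply:
        return parse_langlinks(json.load(reply))
```

Calling `fetch_langlinks("Ethical consumerism")` should return a dictionary keyed by language code. This sketch does not follow continuation, so an article with more langlinks than the limit would be truncated.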

Now we retrieve the Italian clickstream dataset for the same article to plot the popular paths from and to the article.

Here we plot only the popular paths to and from the selected article in the Italian clickstream.

Compare Reader Behavior across Languages

Here you want to explore what's similar and what's different between how readers interact with your article depending on the language they are reading it in. Provide hypotheses for any large differences you see. You don't have to do any formal statistical tests unless you want to -- it can be just observations you have about the data. Feel free to focus on the article you chose or expand to other articles.

Remember, you can use the langlinks API to see whether an article in one language is the same as one in another language. For instance, in the French clickstream dataset we see that someone went from the article Formation de Morrison to Liste des dinosaures de la formation de Morrison in French Wikipedia 15 times in January. From the English clickstream dataset, we could see that someone went from the article Morrison Formation to List of dinosaurs of the Morrison Formation in English Wikipedia 401 times in January. With the langlinks API, we can verify that this reading path is equivalent in French and English (same source and destination article, just different languages).

After viewing the 'source' and 'destination' data frames, we can see that there is a path from Sustainability to Ethical consumerism. There is also a path from Ethical consumerism to Amazon. Now we will check whether similar paths exist in other languages as well.

Now we use the langlinks API to get the titles of these articles in other languages. We select the languages in which both articles are present. From the response above we select 'lang': 'it' (Italian) and 'lang': 'ru' (Russian). Now we will check whether a similar path exists in the Italian and Russian clickstream datasets.
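Once the translated titles are known, checking for a path comes down to looking for one source-destination pair in the other language's clickstream. A minimal sketch, with a hypothetical function name and chunks assumed to be an iterable of DataFrames from that language's dump:

```python
import pandas as pd


def path_views(chunks, source_title, destination_title):
    """Total pageviews recorded for one source-destination pair (0 if absent)."""
    total = 0
    for chunk in chunks:
        match = (chunk["source"] == source_title) & (
            chunk["destination"] == destination_title
        )
        total += int(chunk.loc[match, "pageview"].sum())
    return total
```

A result of 0 means the pair does not appear in the dump. As far as we know, the published dumps leave out pairs with very few views, so 0 is not proof that no reader ever followed the path.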

Just as in the English clickstream, readers in the Italian clickstream clicked through from 'Sostenibilità' (Sustainability) to 'Consumo critico' (Ethical consumerism), but there is no similar path from the popular article to Amazon.

No similar path appears in the Russian clickstream data, so for these paths reader behavior differs between English Wikipedia and Russian Wikipedia. The behavior is, however, somewhat similar between English Wikipedia and Italian Wikipedia.

Future Analyses

TODO: Describe what additional patterns you might want to explore in the data (and why). You don't have to know how to do the analyses.

We can compare the pageviews of the same article in different languages to see whether it has gained similar popularity across them. If an article has very high pageviews in both English and French, that would suggest it is popular among both English and French speakers.

The growth of pageviews across the clickstream data of different months can also be explored. If an article's pageviews grow steadily, that would suggest it is a helpful article whose information many people regularly need. If an article has very high pageviews in the first months after publishing but the growth then stagnates, that would suggest the article was trendy at that particular time and later lost its relevance to people's lives.
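As a starting point for both ideas, the sketch below totals the pageviews of one article in a clickstream and then computes month-over-month growth from such totals. The function names are hypothetical, and the month labels in the usage note are placeholders rather than real results.

```python
import pandas as pd


def article_views(chunks, title):
    """Sum of pageviews over all rows whose destination is `title`."""
    return sum(
        int(chunk.loc[chunk["destination"] == title, "pageview"].sum())
        for chunk in chunks
    )


def monthly_growth(views_by_month):
    """Table of pageviews per month and their relative change from the month before."""
    series = pd.Series(views_by_month).sort_index()
    return pd.DataFrame({"pageview": series, "growth": series.pct_change()})
```

For the language comparison, `article_views` can be run once per language with the translated title. For the growth analysis, it can be run once per monthly dump and the totals passed to `monthly_growth` as a dictionary such as `{"2022-01": ..., "2022-02": ...}`.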