Studying the Wikipedia Clickstream

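This notebook is a tutorial on studying reader behavior with the monthly Wikipedia Clickstream dumps. It has three stages: accessing the clickstream dumps, finding your article in other languages, and comparing reader behavior across languages.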

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse through the Wikipedia clickstream dumps for a wiki and gather the data for a given article.
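As a minimal sketch of that parsing step, the snippet below assumes the dump is a tab-separated file with four columns (source article, destination article, link type, count). The helper names `iter_clickstream`, `rows_for_article`, and `open_dump` are hypothetical, and the sample rows are illustrative stand-ins rather than real dump contents.

```python
import csv
import gzip
import io


def iter_clickstream(lines):
    """Yield (source, destination, link_type, count) tuples from TSV lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines instead of failing mid-file
        source, destination, link_type, count = row
        yield source, destination, link_type, int(count)


def rows_for_article(lines, title):
    """Collect every row where `title` is either the source or the destination."""
    return [row for row in iter_clickstream(lines) if title in (row[0], row[1])]


def open_dump(path):
    """Open a gzip-compressed dump as text; `path` is wherever you saved the file."""
    return gzip.open(path, mode="rt", encoding="utf-8", newline="")


# Tiny in-memory stand-in for a dump file (the counts here are illustrative).
SAMPLE = (
    "other-search\tMorrison_Formation\texternal\t1200\n"
    "Morrison_Formation\tList_of_dinosaurs_of_the_Morrison_Formation\tlink\t401\n"
    "other-search\tTanya_Roberts\texternal\t300\n"
)

article_rows = rows_for_article(io.StringIO(SAMPLE), "Morrison_Formation")
```

With a downloaded dump, the same call would read `rows_for_article(open_dump(path), "Morrison_Formation")`. Streaming line by line keeps memory use low, since only the rows for the chosen article are kept.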

Questions Tackled

Note: Using a sample size of 20,000 entries.

Unique destination articles

Proportion of articles opened from search engines

Analysis

1. The number of unique destination articles was 134,232.

2. The number of independent search-engine entries (other-search) leading to a destination article was 118,884, a proportion of 27.6%.

3. The highest source-destination pair value was the (other-search, Tanya_Roberts) pair, with 1,583,356 searches.
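The three figures above could be computed along the following lines. This is a sketch under assumptions: rows are `(source, destination, type, count)` tuples, the `summarize` helper is hypothetical, the sample counts are made up, and the share is taken over rows, which is only one possible reading of the proportion reported above.

```python
# Hypothetical rows in (source, destination, type, count) form; counts are made up.
rows = [
    ("other-search", "Tanya_Roberts", "external", 50),
    ("other-search", "Marvel_Cinematic_Universe", "external", 30),
    ("Iron_Man", "Marvel_Cinematic_Universe", "link", 20),
    ("other-empty", "Tanya_Roberts", "external", 10),
]


def summarize(rows):
    """Return unique-destination count, search-row share, and the largest pair."""
    unique_destinations = {dest for _source, dest, _type, _count in rows}
    search_rows = [row for row in rows if row[0] == "other-search"]
    top_pair = max(rows, key=lambda row: row[3])  # row with the highest count
    return {
        "unique_destinations": len(unique_destinations),
        "search_rows": len(search_rows),
        "search_row_share": len(search_rows) / len(rows),
        "top_pair": (top_pair[0], top_pair[1], top_pair[3]),
    }


summary = summarize(rows)
```

On the sample rows, `summary["top_pair"]` is the `other-search` to `Tanya_Roberts` row, mirroring the shape of the third finding.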

Plotting the Proportion of Pageviews in the Dataset That Have Search Engines as the Source
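One way to produce such a plot is sketched below. It weights each row by its count so the result is a share of pageviews rather than of rows. The function names `pageview_shares` and `plot_shares` and the sample counts are assumptions for illustration, and matplotlib is imported lazily so the share calculation runs without it.

```python
from collections import defaultdict

SEARCH_SOURCE = "other-search"  # source label for search-engine traffic


def pageview_shares(rows):
    """Return {label: fraction of total pageviews} for search vs. all other sources."""
    totals = defaultdict(int)
    for source, _dest, _type, count in rows:
        label = "search engines" if source == SEARCH_SOURCE else "all other sources"
        totals[label] += count
    grand_total = sum(totals.values())  # assumes at least one row with count > 0
    return {label: n / grand_total for label, n in totals.items()}


def plot_shares(shares):
    """Draw a simple bar chart of the shares (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the rest works without it

    fig, ax = plt.subplots()
    ax.bar(list(shares), list(shares.values()))
    ax.set_ylabel("Share of pageviews")
    ax.set_title("Pageviews by source")
    return fig


# Hypothetical rows; the counts are made up for illustration.
rows = [
    ("other-search", "Tanya_Roberts", "external", 60),
    ("other-search", "Marvel_Cinematic_Universe", "external", 20),
    ("Iron_Man", "Marvel_Cinematic_Universe", "link", 15),
    ("other-empty", "Tanya_Roberts", "external", 5),
]

shares = pageview_shares(rows)
```

In a notebook, calling `plot_shares(shares)` in a cell would render the chart.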

Conclusion

Marvel Cinematic Universe was the most popular destination article among the 1,000,000 rows analysed. The next most popular language was German, which I checked as well.

Checking for shared articles in the German dataset after reformatting it

Finding Your Article in Other Languages

The langlinks API is a simple (automatic) way to get all the other language versions of a particular article. For example, you can do this with List of dinosaurs of the Morrison Formation and see that it exists not just in English but also in French.
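A minimal sketch of such a lookup is shown below, using the MediaWiki Action API with `action=query` and `prop=langlinks`. The helper names are hypothetical, the sample response is hand-written to mirror the `formatversion=2` shape rather than captured from the live API, and result continuation is not handled.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def langlinks_params(title):
    """Build query parameters asking for all language links of one page."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
        "formatversion": "2",
    }


def parse_langlinks(response):
    """Map language code -> article title from a query response dict."""
    links = {}
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            # formatversion=2 uses "title"; the older format uses "*".
            links[link["lang"]] = link.get("title", link.get("*"))
    return links


def fetch_langlinks(title, api_url=API_URL):
    """Call the API over HTTP and return the parsed language links."""
    url = api_url + "?" + urllib.parse.urlencode(langlinks_params(title))
    request = urllib.request.Request(
        url, headers={"User-Agent": "clickstream-tutorial-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as resp:
        return parse_langlinks(json.load(resp))


# Hand-written response in the assumed formatversion=2 shape (not live data).
sample_response = {
    "query": {
        "pages": [
            {
                "title": "List of dinosaurs of the Morrison Formation",
                "langlinks": [
                    {"lang": "fr", "title": "Liste des dinosaures de la formation de Morrison"}
                ],
            }
        ]
    }
}

links = parse_langlinks(sample_response)
```

Against the live API, `fetch_langlinks("List of dinosaurs of the Morrison Formation")` would return the same kind of dictionary, keyed by language code.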

Compare Reader Behavior across Languages

Here you want to explore what's similar and what's different between how readers interact with your article depending on the language they are reading it in. Provide hypotheses for any large differences you see. You don't have to do any formal statistical tests unless you want to -- it can be just observations you have about the data. Feel free to focus on the article you chose or expand to other articles.

Remember, you can use the langlinks API to see whether an article in one language is the same as one in another language. For instance, in the French clickstream dataset we see that someone went from the article Formation de Morrison to Liste des dinosaures de la formation de Morrison in French Wikipedia 15 times in January. From the English clickstream dataset, we could see that someone went from the article Morrison Formation to List of dinosaurs of the Morrison Formation in English Wikipedia 401 times in January. With the langlinks API, we can verify that this reading path is equivalent in French and English (same source and destination article, just different languages).
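The comparison described above can be sketched as follows. The French-to-English title map stands in for what the langlinks lookups would return, the helper names `pair_counts` and `compare_paths` are hypothetical, and the two rows reuse the January counts quoted above.

```python
# Hypothetical French -> English title map, e.g. assembled from langlinks results.
FR_TO_EN = {
    "Formation_de_Morrison": "Morrison_Formation",
    "Liste_des_dinosaures_de_la_formation_de_Morrison":
        "List_of_dinosaurs_of_the_Morrison_Formation",
}


def pair_counts(rows, title_map=None):
    """Return {(source, destination): count}, translating titles if a map is given."""
    counts = {}
    for source, dest, _type, count in rows:
        if title_map is not None:
            # Drop pairs where either title has no known equivalent.
            if source not in title_map or dest not in title_map:
                continue
            source, dest = title_map[source], title_map[dest]
        counts[(source, dest)] = counts.get((source, dest), 0) + count
    return counts


def compare_paths(en_rows, fr_rows, fr_to_en):
    """Return {(source, destination): (english_count, french_count)} for shared paths."""
    en = pair_counts(en_rows)
    fr = pair_counts(fr_rows, fr_to_en)
    return {pair: (en[pair], fr[pair]) for pair in en.keys() & fr.keys()}


en_rows = [
    ("Morrison_Formation", "List_of_dinosaurs_of_the_Morrison_Formation", "link", 401),
]
fr_rows = [
    ("Formation_de_Morrison", "Liste_des_dinosaures_de_la_formation_de_Morrison", "link", 15),
]

shared = compare_paths(en_rows, fr_rows, FR_TO_EN)
```

Non-article sources such as `other-search` are dropped by this sketch unless they are added to the title map, so it only compares article-to-article paths.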

Future Analyses

TODO: Describe what additional patterns you might want to explore in the data (and why). You don't have to know how to do the analyses.