Wikipedia Clickstream

About the data:

The data contains counts of (referer, resource) or (source, destination) pairs extracted from the request logs of Wikipedia.

Format of the Data

The data has 4 columns: Source, Destination, Type and TotalNum

Inspecting the Wikipedia Clickstream data

Observation:

The data has 4 columns and 31,983,523 rows in total.

Importing the data as Chunks
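A minimal sketch of chunked reading with pandas. It assumes the dump is a tab-separated file without a header row, so the column names used in this notebook are passed in explicitly. The helper name `read_chunks` and the in-memory sample are made up for illustration; the real dump file would be passed in place of the sample.

```python
import csv
import io

import pandas as pd

# Column names used in this notebook; the file is assumed to have no header row.
COLUMNS = ["Source", "Destination", "Type", "TotalNum"]


def read_chunks(path_or_buffer, chunksize=1_000_000):
    """Return an iterator of DataFrames with at most `chunksize` rows each."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        names=COLUMNS,
        quoting=csv.QUOTE_NONE,  # article titles may contain quote characters
        keep_default_na=False,   # keep titles such as "NA" as plain strings
        chunksize=chunksize,
    )


# Tiny made-up sample standing in for the real dump file.
SAMPLE = (
    "other-search\tHolland\texternal\t300\n"
    "Netherlands\tHolland\tlink\t120\n"
    "other-empty\tMain_Page\texternal\t900\n"
)

# Count rows without ever holding the whole file in memory.
total_rows = sum(len(chunk) for chunk in read_chunks(io.StringIO(SAMPLE), chunksize=2))
print(total_rows)
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to the available RAM.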

How many unique "destination" articles are in the dump?

Approach:

The dump is too large to load into a single pandas DataFrame, so it is read in chunks. From each chunk, only the Destination column is added to a set, which keeps just the unique destinations.
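The approach above can be sketched as follows. `chunks` is assumed to be any iterable of DataFrames, such as the iterator returned by `pd.read_csv(..., chunksize=...)`; here two tiny made-up chunks stand in for the real dump, and the function name is hypothetical.

```python
import pandas as pd


def count_unique_destinations(chunks):
    """Count distinct Destination values across an iterable of DataFrame chunks."""
    seen = set()
    for chunk in chunks:
        seen.update(chunk["Destination"])  # the set drops repeated titles
    return len(seen)


# Two tiny made-up chunks standing in for the real dump.
chunks = [
    pd.DataFrame({"Destination": ["Holland", "Main_Page"]}),
    pd.DataFrame({"Destination": ["Holland", "Amsterdam"]}),
]
n_unique = count_unique_destinations(chunks)
print(n_unique)
```

When reading the real file, passing `usecols=["Destination"]` to `pd.read_csv` would avoid parsing the other three columns at all.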

Observation:

The dump contains 4,283,759 unique destination articles.

What proportion of pageviews in the dataset have search engines (other-search) as the source?

Approach:

The dump is too large to load into a single pandas DataFrame, so it is read in chunks to find the proportion of pageviews that have a search engine as the source:
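One way to compute this proportion chunk by chunk is to keep two running totals, one for all pageviews and one for pageviews whose source is other-search. This is a sketch under the assumption that TotalNum holds the pageview count of each row; the function name and sample values are made up.

```python
import pandas as pd


def search_proportion(chunks, source="other-search"):
    """Share of all pageviews whose Source equals `source`."""
    search_views = 0
    total_views = 0
    for chunk in chunks:
        total_views += int(chunk["TotalNum"].sum())
        is_search = chunk["Source"] == source
        search_views += int(chunk.loc[is_search, "TotalNum"].sum())
    return search_views / total_views


# Made-up chunks standing in for the real dump.
chunks = [
    pd.DataFrame({"Source": ["other-search", "Netherlands"], "TotalNum": [300, 100]}),
    pd.DataFrame({"Source": ["other-empty"], "TotalNum": [600]}),
]
proportion = search_proportion(chunks)
print(round(proportion, 2))
```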

Observation:

A proportion of 0.47 of the pageviews in the dataset have a search engine (other-search) as the source, which means that almost half of the Wikipedia pageviews in January came through search engines.

What was the most common source-destination pair?

Approach:

The dump is too large to load into a single pandas DataFrame, so it is read in chunks to find the most common source-destination pair:
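A possible chunked approach is to keep only the k largest rows of each chunk and then rank those survivors once more at the end. The sketch assumes that each (Source, Destination) pair appears on exactly one row of the dump, so no counts need to be added up across rows; the function name and sample counts are made up.

```python
import pandas as pd


def top_pairs(chunks, k=5):
    """Return the k rows with the largest TotalNum across all chunks."""
    # Any row in the global top k must also be in the top k of its own chunk.
    best_per_chunk = [chunk.nlargest(k, "TotalNum") for chunk in chunks]
    return pd.concat(best_per_chunk, ignore_index=True).nlargest(k, "TotalNum")


# Made-up chunks standing in for the real dump.
chunks = [
    pd.DataFrame({
        "Source": ["other-search", "other-empty"],
        "Destination": ["Holland", "Main_Page"],
        "Type": ["external", "external"],
        "TotalNum": [300, 900],
    }),
    pd.DataFrame({
        "Source": ["Netherlands"],
        "Destination": ["Holland"],
        "Type": ["link"],
        "TotalNum": [120],
    }),
]
top = top_pairs(chunks, k=1)
most_common = (top.iloc[0]["Source"], top.iloc[0]["Destination"])
print(most_common)
```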

Observation:

The most common source-destination pair in the January 2021 dataset was (other-empty, Main_Page).

However, this pair is not very interesting. To find a more interesting pair, only pairs whose type is link are considered (link means that the source and the request are both articles and the source links to the request).
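Restricting the ranking to link-type pairs only needs a filter on the Type column before the largest rows are picked. As before, this is a sketch that assumes one row per pair; the function name and the sample counts are made up and do not come from the real dump.

```python
import pandas as pd


def top_pairs_of_type(chunks, pair_type="link", k=5):
    """Top k rows by TotalNum among rows whose Type equals `pair_type`."""
    best_per_chunk = [
        chunk[chunk["Type"] == pair_type].nlargest(k, "TotalNum")
        for chunk in chunks
    ]
    return pd.concat(best_per_chunk, ignore_index=True).nlargest(k, "TotalNum")


# Made-up chunks standing in for the real dump.
chunks = [
    pd.DataFrame({
        "Source": ["other-empty", "Main_Page"],
        "Destination": ["Main_Page", "Deaths_in_2021"],
        "Type": ["external", "link"],
        "TotalNum": [900, 500],
    }),
    pd.DataFrame({
        "Source": ["Netherlands", "other-search"],
        "Destination": ["Holland", "Holland"],
        "Type": ["link", "external"],
        "TotalNum": [120, 300],
    }),
]
top_links = top_pairs_of_type(chunks, pair_type="link", k=2)
print(top_links[["Source", "Destination", "TotalNum"]])
```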

Observation:

The most common interesting source-destination pair in the January 2021 dataset was (Main_Page, Deaths_in_2021), which is likely due to the pandemic that began in 2020 and took many lives. It is followed by pairs with prominent US political figures as the source and their spouses as the destination; the recent US election could be the reason for this high traffic.

Finding destination article from the dataset that is Relatively Popular

Approach:

The dump is too large to load into a single pandas DataFrame, so it is read in chunks to find a destination article that is relatively popular:
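The notebook does not state its exact popularity criterion, so the following is only one plausible sketch: aggregate the total pageviews and the number of sources for every destination, chunk by chunk, and then filter by thresholds. It assumes one row per (Source, Destination) pair, so the row count per destination equals its number of unique sources. The function name and the thresholds are hypothetical.

```python
import pandas as pd


def destination_stats(chunks):
    """Total pageviews and number of source rows for every destination."""
    parts = [
        chunk.groupby("Destination")["TotalNum"].agg(["sum", "count"])
        for chunk in chunks
    ]
    # A destination can occur in several chunks, so add the partial results up.
    stats = pd.concat(parts).groupby(level=0).sum()
    return stats.rename(columns={"sum": "Pageviews", "count": "Sources"})


# Made-up chunks standing in for the real dump.
chunks = [
    pd.DataFrame({
        "Source": ["other-search", "other-empty"],
        "Destination": ["Holland", "Main_Page"],
        "TotalNum": [300, 900],
    }),
    pd.DataFrame({
        "Source": ["Netherlands"],
        "Destination": ["Holland"],
        "TotalNum": [120],
    }),
]
stats = destination_stats(chunks)

# Hypothetical thresholds for "relatively popular".
popular = stats[(stats["Sources"] >= 2) & (stats["Pageviews"] >= 400)]
print(popular)
```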

Observation:

Choosing the article Holland (index: 50030) because it has 59 unique sources and 43756 pageviews, and it shows up in 76 languages with more than 1000 pageviews.

https://pageviews.toolforge.org/langviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2021-01-01&end=2021-01-31&sort=views&direction=1&view=list&page=Holland

Analyzing Holland Article Wiki Data

Getting all the data with either Holland as source or destination

Getting all the data with either Holland as source or destination from the entire dataset
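Collecting these rows from the whole dump can be done by filtering each chunk and concatenating the small results. A minimal sketch, assuming `chunks` is an iterable of DataFrames with Source and Destination columns; the function name and the sample rows are made up.

```python
import pandas as pd


def rows_for_article(chunks, title):
    """All rows in which `title` is either the Source or the Destination."""
    parts = [
        chunk[(chunk["Source"] == title) | (chunk["Destination"] == title)]
        for chunk in chunks
    ]
    return pd.concat(parts, ignore_index=True)


# Made-up chunks standing in for the real dump.
chunks = [
    pd.DataFrame({
        "Source": ["Netherlands", "other-empty"],
        "Destination": ["Holland", "Main_Page"],
        "TotalNum": [120, 900],
    }),
    pd.DataFrame({
        "Source": ["Holland", "Main_Page"],
        "Destination": ["Amsterdam", "Deaths_in_2021"],
        "TotalNum": [80, 500],
    }),
]
holland = rows_for_article(chunks, "Holland")
inbound = holland[holland["Destination"] == "Holland"]   # pathways to the article
outbound = holland[holland["Source"] == "Holland"]       # pathways from the article
print(len(holland), len(inbound), len(outbound))
```

The filtered result is small enough to keep in memory, so the later plots can work on an ordinary DataFrame.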

Visualize the data to show what the common pathways to and from the article are

Common pathways to the Article "Holland"

Using a bar plot to visualize which type of source brings the most traffic to the Holland article
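The numbers behind such a bar plot are the pageview shares per Type. The sketch below computes them for a made-up inbound table; the variable names and counts are illustrative only.

```python
import pandas as pd

# Made-up inbound rows for one destination article.
inbound = pd.DataFrame({
    "Source": ["other-search", "Netherlands", "Main_Page"],
    "Type": ["external", "link", "other"],
    "TotalNum": [900, 80, 20],
})

# Share of inbound pageviews contributed by each source type.
type_share = inbound.groupby("Type")["TotalNum"].sum() / inbound["TotalNum"].sum()
type_share = type_share.sort_values(ascending=False)
print(type_share)
```

With matplotlib installed, `type_share.plot.bar()` would draw these shares as bars.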

Observation: It seems that more than 90% of people reach the Holland article from external sources. Only a few arrive through direct links from another Wikipedia article, and very few arrive another way, such as a search from another article or from the Wikipedia main page.

Using a line chart to visualize the distribution of the sources of traffic to the Holland article

X-ticks are disabled because the tick labels are difficult to read when there are so many sources.

The chart is used to observe the distribution and to see if there is skewness toward large values.

Observation:

We can observe a right-skewed distribution, so most of the data falls to the left, i.e., only a few (source, destination) pairs are popular.

Almost all (97.5%) of the pageviews to the Holland article come from the top 10 sources.
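A figure like this top-10 share can be computed by sorting the inbound counts and dividing the sum of the largest n by the overall sum. The sketch below uses made-up counts and a hypothetical helper name.

```python
import pandas as pd


def top_share(inbound, n=10):
    """Fraction of all pageviews contributed by the n largest sources."""
    views = inbound["TotalNum"].sort_values(ascending=False)
    return views.head(n).sum() / views.sum()


# Made-up inbound rows for one destination article.
inbound = pd.DataFrame({
    "Source": ["other-search", "Netherlands", "Main_Page", "Amsterdam"],
    "TotalNum": [500, 300, 150, 50],
})
share = top_share(inbound, n=2)
print(f"{share:.1%}")
```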

Using a bar chart to visualize common pathways to the Holland article

Top 10 sources to the Holland article

Finding common pathways to the Holland article where the type is other (the source and the request are both articles but the source does not link to the request; this can happen when clients search or spoof their referer)

Using a bar chart to visualize common pathways to the Holland article where the relation type is other

Observation:

Most of the searches for Holland seem to come from the Wikipedia main page, and most of the rest come from articles on other European countries.

Finding common pathways to the Holland article where the type is link (the source and the request are both articles and the source links to the request)

X-ticks are disabled because the tick labels are difficult to read when there are so many sources.

The chart is used to observe the distribution and to see if there is skewness toward large values.

Observation:

We can observe a right-skewed distribution, so most of the data falls to the left, i.e., only a few (source, destination) pairs are popular.

Using a bar chart to visualize common pathways to the Holland article where the relation is a direct link

Since the distribution is right-skewed, only the top 10 pairs are taken into consideration.

Observation:

Most of the direct-link traffic to Holland from other Wikipedia articles seems to come from the Netherlands article, a name by which Holland is also known, and most of the other source articles are closely related to Holland.

Common pathways from the Article "Holland"

X-ticks are disabled because there are too many tick labels to read.

The chart is used to observe the distribution and to see if there is skewness toward large values.

Observation:

We can observe a heavily right-skewed distribution, so most of the data falls to the left, i.e., only a few (source, destination) pairs are popular.

Using a bar chart to visualize the most common pathways from the Holland article

Since the distribution is right-skewed, only the top 10 pairs are taken into consideration.

Summary:

Common pathways to the Article "Holland"

Observations

Common pathways from the Article "Holland"

Observations

Finding Article in Other Languages

Compare Reader Behavior across Languages (using the German (de) edition)

Getting dataset with Holland from another language for analysis

Sankey Diagram to visualize (Source, Destination) relationship

Sankey diagrams are used because they emphasize the major transfers or flows within a system. They help to visualize the common pathways to and from the Holland article in both languages, so that reader behavior in the two languages can be compared.
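A Sankey diagram needs a list of node labels plus, for each flow, the index of its source node, the index of its target node, and its size. The sketch below builds that structure from (Source, Destination, TotalNum) rows with plain pandas. The function name and sample rows are made up, and the notebook's own plotting code is not shown here, so this is only an illustration of the data preparation.

```python
import pandas as pd


def sankey_inputs(pairs):
    """Turn (Source, Destination, TotalNum) rows into Sankey node and link lists."""
    # Unique labels in order of first appearance, sources first.
    labels = list(dict.fromkeys(list(pairs["Source"]) + list(pairs["Destination"])))
    index = {label: i for i, label in enumerate(labels)}
    links = {
        "source": [index[s] for s in pairs["Source"]],
        "target": [index[d] for d in pairs["Destination"]],
        "value": pairs["TotalNum"].tolist(),
    }
    return labels, links


# Made-up pathways to and from one article.
pairs = pd.DataFrame({
    "Source": ["Netherlands", "Holland"],
    "Destination": ["Holland", "Amsterdam"],
    "TotalNum": [120, 80],
})
labels, links = sankey_inputs(pairs)
print(labels)
print(links)
```

If plotly is used for drawing, these lists match the shape expected by `plotly.graph_objects.Sankey(node=dict(label=labels), link=links)`.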