Goals: Build a pipeline that connects NLP (using the spaCy library), machine learning, and feature extraction; try to work semantic analysis into it; and don't forget to use pandas-profiling.

Studying the Wikipedia Clickstream

This notebook provides a tutorial on how to study reader behavior via the monthly Wikipedia Clickstream dumps. It has three stages:

Accessing Wikipedia Clickstream Dumps

This is an example of how to parse through the Wikipedia clickstream dumps for a wiki and gather the data for a given article.

Installations required to run this notebook
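A minimal sketch of the installation step; the exact package list is an assumption based on the goals above (spaCy is included only because the goals mention it), so adjust versions and packages for your environment:

```python
# Run once. Package list is an assumption based on the goals above;
# pin versions as needed for your environment.
!pip install pandas matplotlib pandas-profiling spacy
!python -m spacy download en_core_web_sm
```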

The clickstream contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and which links they click on once there. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. For more information and documentation, see the link in the references section below.
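As a sketch of how one monthly dump can be pulled into pandas, the snippet below reads an English-Wikipedia file straight from the dumps server. The month (2023-01) and the exact URL are assumptions; browse https://dumps.wikimedia.org/other/clickstream/ for the file you actually want. The raw file is a gzipped, headerless TSV whose columns we name Source, Destination, Type, and n.

```python
import csv
import pandas as pd

# Monthly English-Wikipedia clickstream dump. The month and URL are assumptions;
# the file is a few hundred MB, so this read takes a while.
DUMP_URL = "https://dumps.wikimedia.org/other/clickstream/2023-01/clickstream-enwiki-2023-01.tsv.gz"

# Gzipped TSV with no header row: (referer, resource, link type, count).
clickstream = pd.read_csv(
    DUMP_URL,
    sep="\t",
    header=None,
    names=["Source", "Destination", "Type", "n"],
    quoting=csv.QUOTE_NONE,
    keep_default_na=False,  # keep article titles such as "NaN" as plain strings
)

print(clickstream.shape)
clickstream.head()
```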

Explore and understand the data

The data has 4 columns: Source, Destination, Type and n

Also analyse the Type column for the month of January: compute the proportion of each link type and visualise it with pie charts, bar graphs, etc.
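A possible way to look at those proportions, assuming the clickstream DataFrame loaded above (the Type column in the dumps takes the values link, external, and other):

```python
import matplotlib.pyplot as plt

# Total clicks per link type ("link", "external", "other") for the month.
type_counts = clickstream.groupby("Type")["n"].sum().sort_values(ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

type_counts.plot.pie(autopct="%1.1f%%", ax=axes[0])
axes[0].set_ylabel("")
axes[0].set_title("Share of clicks by type")

type_counts.plot.bar(ax=axes[1])
axes[1].set_ylabel("total clicks")
axes[1].set_title("Clicks by type")

plt.tight_layout()
plt.show()
```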

Articles chosen: first, Artificial Intelligence, picked out of interest in the education field; next, Self-driving cars, picked because it involves both education and marketing interest (for example, which additional companies have got involved?).

!! Fact check: the chosen article should have at least 1,000 pageviews and 20 unique sources in the dataset !!
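One way to verify this, assuming the clickstream DataFrame above; the underscored page title is an assumption about how the article appears in the dump, so change it for the article you pick:

```python
ARTICLE = "Artificial_intelligence"  # assumed dump spelling; change for your article

# All rows where readers arrived at the chosen article.
arrivals = clickstream[clickstream["Destination"] == ARTICLE]

total_views = arrivals["n"].sum()
unique_sources = arrivals["Source"].nunique()

print(f"{ARTICLE}: {total_views:,} referred pageviews from {unique_sources} unique sources")
assert total_views >= 1000 and unique_sources >= 20, "Article fails the fact-check threshold"
```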

Creating DataFrames using pandas

We build one DataFrame with the rows where our article appears as the source (outflows) and another where it appears as the destination (inflows), as sketched below.
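A minimal sketch, assuming the clickstream DataFrame and the ARTICLE name defined above:

```python
# Rows where our article is the referer (outflows) ...
outflows = (
    clickstream[clickstream["Source"] == ARTICLE]
    .sort_values("n", ascending=False)
    .reset_index(drop=True)
)

# ... and rows where our article is the destination (inflows).
inflows = (
    clickstream[clickstream["Destination"] == ARTICLE]
    .sort_values("n", ascending=False)
    .reset_index(drop=True)
)

outflows.head()
```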

Now that we have pulled all the data relating to our article and prepared the DataFrames for analysis, we can start exploring it. A quick pandas-profiling report, as called for in the goals, is sketched below.
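A minimal sketch, assuming the inflows DataFrame built above; note that pandas-profiling has since been renamed to ydata-profiling, so the import may need adjusting in newer environments:

```python
from pandas_profiling import ProfileReport

# One-click exploratory report over the inflow rows:
# distributions, missing values, correlations, etc.
profile = ProfileReport(inflows, title=f"Clickstream inflows for {ARTICLE}", minimal=True)
profile.to_notebook_iframe()
```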

Let us first show pictorial representations of:

  1. Sources -> Article and Article -> Destinations, i.e. the inflows and outflows

Note: these representations vary with the number of occurrences (the n column), as sketched below.
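A sketch of those inflow/outflow views, assuming the inflows and outflows DataFrames built above and showing only the most frequent pairs:

```python
import matplotlib.pyplot as plt

TOP_N = 15  # how many sources/destinations to display

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Inflows: where readers came from before reaching the article.
inflows.head(TOP_N).set_index("Source")["n"].plot.barh(ax=axes[0])
axes[0].set_title(f"Top {TOP_N} sources -> {ARTICLE}")
axes[0].set_xlabel("occurrences (n)")
axes[0].invert_yaxis()

# Outflows: which links readers clicked after reading the article.
outflows.head(TOP_N).set_index("Destination")["n"].plot.barh(ax=axes[1])
axes[1].set_title(f"{ARTICLE} -> top {TOP_N} destinations")
axes[1].set_xlabel("occurrences (n)")
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()
```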