EDA plays a crucial role in understanding the what, why, and how of the problem statement. Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics, often with visualizations, to unravel insights that lead us towards building robust models.

Importing different modules and exploring their usage

Pandas is an extensive data manipulation tool built on the NumPy package. Its key data structure, the DataFrame, allows us to store and manipulate tabular data as rows of observations and columns of variables.

NumPy is a general-purpose array-processing package for scientific computation.

json handles JavaScript Object Notation, a lightweight data-interchange format. I have run multiple examples to get acquainted with the JSON format, its syntax, and its datatypes: string, number, object (JSON object), array, boolean, and null.
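For instance, a minimal sketch covering those datatypes (the JSON string here is made up purely for illustration):

```python
import json

# A made-up JSON string covering every JSON datatype:
# string, number, object, array, boolean and null.
raw = '{"title": "Moon", "sections": 4, "stats": {"mt": 0.7}, "tags": ["a", "b"], "published": true, "note": null}'

parsed = json.loads(raw)             # JSON text -> Python dict
print(type(parsed), parsed["stats"]["mt"])

print(json.dumps(parsed, indent=2))  # Python dict -> JSON text
```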

re is Python's built-in library for working with regular expressions. I learnt the different functions of re and its special sequences.
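A quick sketch of the functions I tried (the pattern and sample text are made up for illustration):

```python
import re

text = "publishedDate: 20170315, publishedDate: 20161101"

# findall returns every non-overlapping match of the pattern
print(re.findall(r"20\d{6}", text))        # ['20170315', '20161101']

# search returns the first match object (or None if nothing matches)
print(re.search(r"\d{8}", text).group())   # '20170315'

# sub replaces every match with the given string
print(re.sub(r"\d{8}", "<date>", text))
```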

Seaborn and Matplotlib are imported to perform some visualizations, since analysis results come across better as graphics: infographics have a quicker impact than text paragraphs describing the results. Data visualisation is an essential part of analysis since it allows even non-programmers to decipher trends and patterns.
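Putting it all together, the import cell looks roughly like this (the styling calls are my own choices):

```python
import json
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()             # seaborn's default plot styling
# %matplotlib inline  (notebook magic so plots render inside the notebook)
```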

I have imported the MediaWiki API and read through all the sections and actions that can be performed, such as querying.

Interaction with the API

To see what the data looks like, I have run the following.
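A minimal sketch of how the data could be pulled, assuming the ContentTranslation cxpublishedtranslations list of the MediaWiki query API; the endpoint, parameters, language pair, and response field names are assumptions made for illustration:

```python
import requests
import pandas as pd

API_URL = "https://en.wikipedia.org/w/api.php"         # assumed endpoint

params = {
    "action": "query",
    "list": "cxpublishedtranslations",                 # published Content Translation records
    "from": "en",                                      # assumed source language
    "to": "es",                                        # assumed target language
    "limit": 500,                                      # same limit referred to later on
    "format": "json",
}

response = requests.get(API_URL, params=params).json()
translations = response["result"]["translations"]      # assumed response structure

df = pd.DataFrame(translations)
df.head()
```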

Let's have a look at the data's dimensionality, feature names, and feature types.
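Something along these lines, with df being the DataFrame built above:

```python
print(df.shape)     # (rows, columns) -> dimensionality
print(df.columns)   # feature names
print(df.dtypes)    # feature types
df.head()           # a glance at the first few observations
```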

To check whether there are any missing or null values:
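For example:

```python
# number of missing values in each column
df.isnull().sum()
```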

Therefore, there are no null values.

There are 11 columns.

Noted the observations.

As discussed in the earlier Contribution #1 (the Jupyter Notebook named kickstart, covering exploration of the data and the other modules), stats is a Python dictionary, so accessing the value for each corresponding key can be awkward. I therefore made each value of stats into a separate column in the data. stats has the highest importance, as it carries most of the information used in the analysis.
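A sketch of how the stats dictionary can be split out, assuming each row's stats is a dict with keys such as any, human, mt and mtSectionsCount:

```python
# One new column per key of the stats dict ...
stats_df = df["stats"].apply(pd.Series)

# ... attached alongside the original columns
df = pd.concat([df.drop(columns=["stats"]), stats_df], axis=1)
df.head()
```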

Let's check the newer publications-translations, since we are more focused on recent trends, though observing the history may also lead to interesting conclusions. Okay, let's do both: nothing missing, no confusion.

As we have observed from the above results, the dtype of the 'publishedDate' column is object. We should convert it into an int datatype so we can perform comparison operations to extract information about recent activity.

Using the astype method, I have changed the publishedDate datatype to int64.
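For instance:

```python
# publishedDate arrives as an object (string) column; make it numeric so it can be compared
df["publishedDate"] = df["publishedDate"].astype("int64")
df.dtypes
```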

Let us look at the frequency of publications by publishedDate.
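A sketch of that frequency plot, assuming publishedDate is a MediaWiki-style timestamp such as 20170315123456, whose first four digits are the year:

```python
# Take the first four digits of the timestamp as the publication year
year = df["publishedDate"].astype(str).str[:4].astype(int)

# Count publications per year and plot the distribution
year.value_counts().sort_index().plot(kind="bar")
plt.xlabel("Year of publication")
plt.ylabel("Number of published translations")
plt.show()
```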

Most of the publications-translations were done in 2016–2017!

Why not after that period? Was everything finished during that period? To be studied!

Let us try to get the data from 2018 onwards.
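One way to restrict the frame on the DataFrame side, again assuming the numeric MediaWiki-style timestamp:

```python
# Keep only translations published on or after 1 January 2018
recent = df[df["publishedDate"] >= 20180101000000]
print(recent.shape)
```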

Oops! There is not much data available within the limit (500) we have set, so let's go another 6 months backwards.

Now that we have seen the recent years and visualised the stats fields:

any -> machine translation or human translation (either)
mt -> machine translation
human -> human translation
mtSectionsCount -> number of sections translated

let us now combine them and see whether this leads us to new conclusions.
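A sketch of one way to put them side by side, assuming the mt and human columns produced from stats hold numeric shares (with missing values where no machine translation was involved):

```python
# Average machine vs human share per publication year
year = df["publishedDate"].astype(str).str[:4].astype(int)
shares = df[["mt", "human"]].apply(pd.to_numeric, errors="coerce")

comparison = shares.assign(year=year).groupby("year").mean()
comparison.plot(kind="bar")
plt.ylabel("Average share of translated content")
plt.title("Machine vs human translation by year")
plt.show()
```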

The two plots above and their conclusions are not bad, but maybe we have to compare different parameters against each other to get more meaningful observations. Let's do it.

Let us make use of scatter plots and try to draw some insights from them.
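A minimal scatter-plot sketch, again assuming numeric mt and human columns:

```python
# Each point is one published translation: machine share vs human share
sns.scatterplot(data=df, x="mt", y="human")
plt.xlabel("Machine-translated share")
plt.ylabel("Human-translated share")
plt.show()
```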