Understanding the impact of the Content Translation tool at its best

Contribution no 4

Step into Machine Learning

Contribution no 5

See keyword extraction

Importing different modules and exploring their usage

Pandas is an extensive data manipulation tool built on the NumPy package. Its key data structure, the DataFrame, lets you store and manipulate tabular data as rows of observations and columns of variables.
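As a small illustration of the DataFrame idea, here is a minimal sketch. The column names and rows are made up for this example and are not taken from the real data.

```python
import pandas as pd

# Hypothetical rows: one row per observation, one column per variable.
df = pd.DataFrame(
    {
        "sourceTitle": ["Moon", "Tea", "River"],
        "targetLanguage": ["hi", "es", "hi"],
        "sections": [4, 6, 2],
    }
)

print(df.shape)                       # (rows, columns)

# Boolean filtering: keep only the rows translated into Hindi.
hindi_rows = df[df["targetLanguage"] == "hi"]
print(hindi_rows)

# Column-wise aggregation.
mean_sections = df["sections"].mean()
print(mean_sections)
```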

NumPy is a general-purpose array-processing package for scientific computation.

json handles JSON (JavaScript Object Notation), a lightweight data-interchange format. I ran multiple examples to get acquainted with the JSON format, its syntax, and its data types: string, number, object (JSON object), array, boolean, and null.
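One such example might look like the sketch below. The JSON text is hypothetical and only chosen so that every JSON data type appears once.

```python
import json

# Hypothetical JSON text that uses every JSON data type.
text = (
    '{"title": "Moon", "views": 42, "ratio": 0.5, '
    '"stats": {"mt": 1}, "tags": ["a", "b"], '
    '"published": true, "deletedBy": null}'
)

record = json.loads(text)             # JSON text -> Python objects
for key, value in record.items():
    # Shows how each JSON type maps to a Python type.
    print(key, type(value).__name__)

round_trip = json.dumps(record)       # Python objects -> JSON text
```

Objects become dict, arrays become list, true/false become bool, and null becomes None.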

re is Python's built-in library for working with regular expressions. I learnt its different functions and special sequences.
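A few of those functions and sequences in action, as a minimal sketch. The URL and the timestamp are hypothetical inputs chosen for illustration.

```python
import re

# Hypothetical URL in the style of a translated article link.
url = "https://hi.wikipedia.org/wiki/Moon_(satellite)"

# re.search finds the first match; the groups capture language code and title.
match = re.search(r"https://([a-z\-]+)\.wikipedia\.org/wiki/(.+)", url)
lang, title = match.group(1), match.group(2)

# re.sub replaces every underscore with a space.
readable = re.sub(r"_", " ", title)

# re.findall returns all runs of word characters (\w).
words = re.findall(r"\w+", readable)

# re.fullmatch with \d (digit) and {14} (repetition) checks a 14-digit timestamp.
is_timestamp = re.fullmatch(r"\d{14}", "20170315120000") is not None

print(lang, readable, words, is_timestamp)
```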

Seaborn and Matplotlib are imported for visualization. Analysis results come across well as graphics, which have a quicker and stronger impact than paragraphs of text describing the same results. Data visualisation is an essential part of analysis, since it lets even non-programmers decipher trends and patterns.

I have imported the MediaWiki API and gone through all of its sections and the actions that can be performed, such as querying.

Interaction with API

To see what the data looks like, I ran the following:
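A request of this kind could be sketched as below. This is not the notebook's own cell: it assumes the `cxpublishedtranslations` list of the MediaWiki Action API with `from`, `to`, `limit` and `offset` parameters, an English to Hindi language pair, and a made-up User-Agent string. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(source_lang, target_lang, limit=500, offset=0):
    """Build the request URL for published translations (parameters are assumptions)."""
    params = {
        "action": "query",
        "list": "cxpublishedtranslations",
        "from": source_lang,
        "to": target_lang,
        "limit": limit,
        "offset": offset,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_translations(source_lang, target_lang, limit=500):
    """Download one page of results. Needs network access, so it is not called here."""
    request = urllib.request.Request(
        build_query_url(source_lang, target_lang, limit),
        headers={"User-Agent": "content-translation-notebook-example"},  # hypothetical
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


url = build_query_url("en", "hi")
print(url)
```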

EDA plays a crucial role in understanding the what, why, and how of the problem statement. Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics, often with visualizations, to uncover insights that help in building robust models.

Let's have a look at data dimensionality, feature names, and feature types.
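These three checks map onto a handful of pandas attributes. A minimal sketch on hypothetical records (the real frame comes from the API response and has more columns):

```python
import pandas as pd

# Hypothetical records standing in for the real API data.
df = pd.DataFrame(
    [
        {"translationId": "101", "sourceTitle": "Moon", "publishedDate": "20170315120000"},
        {"translationId": "102", "sourceTitle": "Tea", "publishedDate": "20160802093000"},
    ]
)

shape = df.shape                  # dimensionality: (rows, columns)
names = df.columns.tolist()       # feature names
print(shape)
print(names)
print(df.dtypes)                  # feature types; text is usually `object` (a string dtype in newer pandas)
df.info()                         # the same information in one summary
```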

Next, check whether there are any missing or null values.
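The check itself is a one-liner in pandas. The sketch below uses a hypothetical frame that deliberately contains one missing value, so that the counts are not all zero and the output is easier to read.

```python
import pandas as pd

# Hypothetical records with one deliberately missing target title.
df = pd.DataFrame(
    {
        "sourceTitle": ["Moon", "Tea", "River"],
        "targetTitle": ["Luna", None, "Rio"],
    }
)

missing_per_column = df.isnull().sum()    # number of nulls in each column
has_missing = df.isnull().values.any()    # True if any cell at all is null
print(missing_per_column)
print(has_missing)
```

When every count is zero and the second value is False, the data has no missing values.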

So there are no null values.

There are 11 columns.

Observations noted.

As discussed in the earlier Contribution #1, in the Jupyter notebook named "kickstart- exploration of the data and other modules", stats is a Python dictionary, so accessing the value for each corresponding key can be awkward. I therefore turned each value of stats into a separate column in the data. stats has the highest importance, since it says the most in this analysis.
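One way to spread a dictionary column into separate columns is `pd.json_normalize`. This is a sketch on hypothetical values, and the notebook may have used a different method.

```python
import pandas as pd

# Hypothetical records; `stats` holds a dictionary in every row.
df = pd.DataFrame(
    [
        {"sourceTitle": "Moon", "stats": {"any": 0.9, "human": 0.4, "mt": 0.5, "mtSectionsCount": 3}},
        {"sourceTitle": "Tea", "stats": {"any": 1.0, "human": 1.0, "mt": 0.0, "mtSectionsCount": 0}},
    ]
)

# Turn each dictionary into a row of its own columns.
stats_columns = pd.json_normalize(df["stats"].tolist())

# Replace the dictionary column with the new flat columns.
df = pd.concat([df.drop(columns="stats"), stats_columns], axis=1)
print(df)
```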

Let's check for the newer publications (translations), as we are more focused on recent trends, though observing the history may lead to interesting conclusions! Okay, let's do both! No missing, no confusion.

As we observed from the above results, the 'publishedDate' column has type object. We should convert it to an int datatype so that we can perform comparison operations and extract information about recent activity.

Using the astype method, I changed the publishedDate datatype to int64.
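The conversion can be sketched as follows, assuming the dates are stored as strings in YYYYMMDDHHMMSS form. That format and the sample values are assumptions for this example.

```python
import pandas as pd

# Hypothetical timestamps stored as strings.
df = pd.DataFrame({"publishedDate": ["20170315120000", "20160802093000", "20180110081500"]})

# Convert the text column to 64-bit integers.
df["publishedDate"] = df["publishedDate"].astype("int64")
print(df["publishedDate"].dtype)

# Numeric comparison now works: keep rows published on or after 1 January 2017.
recent = df[df["publishedDate"] >= 20170101000000]
print(recent)
```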

Let us look at the frequency of publications by publishedDate.
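A per-year count is one simple way to look at this frequency. The sketch assumes integer timestamps in YYYYMMDDHHMMSS form, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical integer timestamps.
df = pd.DataFrame(
    {"publishedDate": [20170315120000, 20160802093000, 20170920101010, 20180110081500]}
)

# Integer division drops the MMDDHHMMSS part and leaves the four-digit year.
df["year"] = df["publishedDate"] // 10**10

# Count publications per year, ordered by year.
per_year = df["year"].value_counts().sort_index()
print(per_year)
```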

Most of the translations were published in 2016-2017!

Why not after that period? Was everything finished during that period? To be studied!

Let us try to get the data from 2018 onward.

Oops! There is not much data available within the limit (500) we have set, so let's go another 6 months back.
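The date filter and its widening by six months can be sketched like this. The helper name `published_since` and the sample rows are hypothetical, and integer YYYYMMDDHHMMSS timestamps are assumed.

```python
import pandas as pd

# Hypothetical records with integer timestamps.
df = pd.DataFrame(
    {
        "sourceTitle": ["Moon", "Tea", "River", "Salt"],
        "publishedDate": [20170315120000, 20170802093000, 20171120101010, 20180110081500],
    }
)


def published_since(frame, cutoff):
    """Return the rows whose integer timestamp is on or after `cutoff`."""
    return frame[frame["publishedDate"] >= cutoff]


from_2018 = published_since(df, 20180101000000)        # 1 January 2018 onward
from_mid_2017 = published_since(df, 20170701000000)    # six months further back
print(len(from_2018), len(from_mid_2017))
```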

Now that we have seen the recent years and visualised the stats, here is what each field stands for:

any -> Machine translation or human translation
mt -> Machine translation
human -> Human translation
mtSectionsCount -> Number of sections translated

Let us now combine both of them and see whether that leads us to new conclusions.

The two plots above are not so bad. Maybe we have to compare different parameters in order to get meaningful observations. Let's do it.

Let us make use of scatter plots and try to grab some insights from them.
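A scatter plot of two stats columns against each other could be drawn as in this sketch. The values are hypothetical, and the choice of `mt` against `human` with marker size taken from `mtSectionsCount` is only one possible pairing.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; this line is not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stats values.
df = pd.DataFrame(
    {
        "human": [0.4, 1.0, 0.0, 0.7],
        "mt": [0.5, 0.0, 1.0, 0.2],
        "mtSectionsCount": [3, 0, 6, 1],
    }
)

fig, ax = plt.subplots()
# One point per translation; larger markers mean a higher mtSectionsCount.
ax.scatter(df["mt"], df["human"], s=20 + 30 * df["mtSectionsCount"])
ax.set_xlabel("mt")
ax.set_ylabel("human")
ax.set_title("mt vs human per translation (hypothetical data)")
```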

Let us identify which, and how many, articles have after-translation effort such as edits or translation by human or machine. How are they changed? When are they changed or translated? How much of them is translated?

First, identify which articles have been translated by either human or machine.

Second, identify which articles have been translated exclusively by machine.
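Both selections are boolean filters on the stats columns. The sketch below assumes that a positive `any` means some translation happened and that "exclusively machine" means `mt` is positive while `human` is zero. Those readings of the fields, and the sample values, are assumptions.

```python
import pandas as pd

# Hypothetical stats values.
df = pd.DataFrame(
    {
        "sourceTitle": ["Moon", "Tea", "River", "Salt"],
        "any": [0.9, 1.0, 0.8, 0.0],
        "human": [0.4, 1.0, 0.0, 0.0],
        "mt": [0.5, 0.0, 0.8, 0.0],
    }
)

# Translated by human or machine: the combined measure is positive.
translated = df[df["any"] > 0]

# Exclusively machine: some machine translation and no human contribution.
machine_only = df[(df["mt"] > 0) & (df["human"] == 0)]

print(translated["sourceTitle"].tolist())
print(machine_only["sourceTitle"].tolist())
```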

Sort by multiple columns:
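With pandas this is `sort_values` with a list of columns and a matching list of sort directions. The columns chosen and the values are hypothetical.

```python
import pandas as pd

# Hypothetical stats values.
df = pd.DataFrame(
    {
        "sourceTitle": ["Moon", "Tea", "River", "Salt"],
        "mtSectionsCount": [3, 0, 3, 1],
        "human": [0.4, 1.0, 0.1, 0.7],
    }
)

# Sort by mtSectionsCount (largest first), breaking ties by human (smallest first).
ranked = df.sort_values(by=["mtSectionsCount", "human"], ascending=[False, True])
print(ranked)
```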

Robust comparisons often lead us to healthy conclusions!

Let's keep a detective's eye on them so that no intelligent comparison escapes us.

Better comparison among the stats: the central part of data analysis


"On top of the content that was translated, which the above notebook demonstrates ways to access, more data can be accessed about the translations and what occurred after them. Try comparing statistics about edits, pageviews, etc. between the source and translated versions of articles. More advanced analyses in a project might eventually compare translated articles with similar articles that were not translated or classify edits based upon their 'type' for more fine-grained analyses of what happens to translated articles. "We would hope for a mixed-methods approach that uses both quantitative analyses (e.g., edit counts, topics that are more frequently translated, etc.) and qualitative analyses (e.g., content analysis of translated pages and subsequent edits, talk pages, etc.).

Now heading towards the actual quantitative analysis

First come the pageviews.
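Project-level pageviews can be requested from the Wikimedia REST API. The sketch below only builds the request URLs for the aggregate endpoint. The date range, the monthly granularity and the `user` agent filter are assumptions for illustration, and `aggregate_url` is a hypothetical helper.

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def aggregate_url(project, start, end, access="all-access", agent="user", granularity="monthly"):
    """Build an aggregate pageviews URL; start and end use the YYYYMMDDHH format."""
    return f"{BASE}/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}"


# The three projects to compare; the date range is an assumption.
projects = ["commons.wikimedia.org", "en.wikipedia.org", "hi.wikipedia.org"]
urls = {project: aggregate_url(project, "2019010100", "2019030100") for project in projects}

for project, url in urls.items():
    print(project, url)
```

Fetching each URL should return JSON with an `items` list, where every item carries a `views` count for one time bucket.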

Seeing such a significant number of views for Hindi Wikipedia too, even though literacy rates in India are low, is really surprising, isn't it?

Hence we have compared the views across Commons, English Wikipedia, and Hindi Wikipedia.

Hmm, interesting topic ahead! Let us see which articles got the top views, in Hindi and in English. Let us see Hindi first!

How about English?
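The most viewed articles per project are available from the `top` endpoint of the Wikimedia pageviews REST API. In the sketch below, the month, the helper names and the sample payload (including its view counts) are hypothetical. The payload only imitates the shape of a real response.

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def top_url(project, year, month, day="all-days", access="all-access"):
    """Build the URL for the most viewed articles of a project in a given month."""
    return f"{BASE}/top/{project}/{access}/{year}/{month:02d}/{day}"


def top_titles(payload, n=5):
    """Return the titles of the n best-ranked articles from a response payload."""
    articles = payload["items"][0]["articles"]
    ranked = sorted(articles, key=lambda article: article["rank"])
    return [article["article"] for article in ranked[:n]]


# Hypothetical payload with made-up view counts.
sample = {
    "items": [
        {
            "articles": [
                {"article": "Special:Search", "views": 400, "rank": 2},
                {"article": "Main_Page", "views": 1000, "rank": 1},
                {"article": "Moon", "views": 90, "rank": 3},
            ]
        }
    ]
}

hindi_url = top_url("hi.wikipedia.org", 2019, 3)
english_url = top_url("en.wikipedia.org", 2019, 3)
titles = top_titles(sample, n=2)
print(hindi_url)
print(titles)
```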

Some Insights!

We can see that in both languages Main_Page has the highest views, which is to be expected.

Special:Search is in the top 5 in both languages (position 2 in English, position 4 in Hindi), but not at the same position. Is that because Hindi-speaking people don't know about it, or are they more interested in the topics at positions 2 and 3?

Let us compare with more focus

Okay, views are done!

Meanwhile, what is going on at the developer side?

Once the articles are published, will they stay the same forever?

No changes? Impossible, right?

Because the world is constantly changing,

the articles are often revised and updated.

Now let us dig into the edits section!

Taking https://pawspublic.wmflabs.org/pawspublic/User:Isaac_(WMF)/Content%20Translation%20Example.ipynb as the basis
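One way to look at the edits of an article is its revision history. The sketch below assumes the `prop=revisions` module of the MediaWiki Action API, which may differ from what the referenced notebook does. The helper name and the example title are hypothetical.

```python
import urllib.parse


def revisions_url(lang, title, limit=50):
    """Build a request URL for the latest revisions of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|size|comment",  # fields returned for each revision
        "rvlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


# Hypothetical example: revisions of one article on Hindi Wikipedia.
url = revisions_url("hi", "Moon")
print(url)
```

Comparing revision timestamps with the publishedDate of a translation would show which edits happened after the translation was published.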