Outreachy. Mediawiki. Impact of content translation tools

This task is for the Outreachy internship. Made by Ekaterina Kharitonova.

The Content Translation tool has supported the creation of over 400,000 articles across a large variety of Wikipedia language projects. We have very little understanding, however, of what parts of the tool work well and what happens to the articles after they have been created. For instance, what types of article sections are translated as-is? What sections are changed substantially? Do translated articles see subsequent editing and linking to other articles in the project?

This task links to a number of related projects that could happen around these larger questions about the adoption of translated content on Wikipedia. We would hope for a mixed-methods approach that uses both quantitative analyses (e.g., edit counts, topics that are more frequently translated, etc.) and qualitative analyses (e.g., content analysis of translated pages and subsequent edits, talk pages, etc.).

This task involves two mini-tasks: Quantitative Exploration of Content Translation Tools and Qualitative Exploration of Content Translation Tools.

Qualitative Exploration - What types of content are translated and what happens to these articles once they are created? Go through the edit histories for a few articles and begin to identify whether any trends emerge about the types of edits that happen to translated articles. Compare the translated and source articles in their current state. What types of content were added after the translation? Are the articles diverging in content or staying similar? What sorts of discussions occur on the talk pages of translated articles?

Quantitative Exploration - Try comparing statistics about edits, pageviews, etc. between the source and translated versions of articles. More advanced analyses in a project might eventually compare translated articles with similar articles that were not translated or classify edits based upon their 'type' for more fine-grained analyses of what happens to translated articles.

Importing the data

I've chosen to look at English-to-Russian translation. As of 26 May 2019, almost 16,700 Wikipedia articles had been translated into Russian, which puts Russian in 6th place in this category. Source: https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:ContentTranslationStats Russian also takes 2nd place in the category of unfinished translations (English to Russian). This raises the question: why is that happening?
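A request for the published English-to-Russian translations could be built along these lines. This is only a sketch, not the notebook's original code: it assumes the Content Translation extension's `cxpublishedtranslations` list module on the MediaWiki Action API with `from`, `to`, `limit`, and `offset` parameters, and the helper name `build_query` is hypothetical. Check the API help page on the wiki before relying on the parameter names.

```python
from urllib.parse import urlencode

# Action API endpoint of Russian Wikipedia.
API_URL = "https://ru.wikipedia.org/w/api.php"


def build_query(source="en", target="ru", limit=500, offset=0):
    """Build a URL asking for one page of published translations.

    Assumption: the parameter names below follow the Content Translation
    API (list=cxpublishedtranslations); verify them against the API help.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "cxpublishedtranslations",
        "from": source,
        "to": target,
        "limit": limit,
        "offset": offset,
    }
    return API_URL + "?" + urlencode(params)


# Paging through results would mean increasing the offset step by step.
first_page_url = build_query()
second_page_url = build_query(offset=500)
print(first_page_url)
```

The URLs are only constructed here, not fetched, so the sketch runs without network access.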

The oldest translation in our sample happened on 2017-10-06 and the most recent on 2019-01-04. Let's clean our dataframe a bit.

Cleaning and accessing the translation stats

During the cleaning process we need:

Gathering additional information

Here we'll gather information about:

Now I want to get the number of views of the original and translated pages since the date of translation.

We go through the dictionary and sum the views for each page. Some pages were deleted; their titles are copied into an errors list. We do this for both the translated and the original articles.
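The summation step can be sketched roughly as follows. This is a minimal sketch, not the notebook's original code: it assumes each page's response has the shape returned by the Wikimedia Pageviews REST API (an `items` list whose entries carry a `views` count), and that a deleted page shows up as a response without an `items` key. The names `sum_views`, `sample`, `totals`, and `errors` are hypothetical.

```python
def sum_views(responses):
    """Return (totals, errors) for a dict mapping page title -> API response."""
    totals = {}
    errors = []
    for title, response in responses.items():
        items = response.get("items")
        if items is None:
            # No daily data came back (for example, the page was deleted),
            # so remember the title instead of counting views.
            errors.append(title)
            continue
        totals[title] = sum(item["views"] for item in items)
    return totals, errors


# Made-up sample data, for illustration only.
sample = {
    "Page A": {"items": [{"views": 3}, {"views": 5}]},
    "Page B": {"detail": "not found"},
}
totals, errors = sum_views(sample)
print(totals, errors)
```

The same function can be applied once to the translated titles and once to the original titles.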

Data Analysis

Here is what I want to look at:

  1. Translated articles with the most views in our sample
  2. The relative difference in the number of views of the original and translated articles from the day of translation
  3. Articles whose translated version has more views than the original
  4. Articles with the highest coefficient of human translation. We can analyze the History and Talk pages of these articles and see how some sections are translated.
  5. Articles with the highest number of revisions

We can see that the most popular translated articles from 2017-10-06 to 2019-01-04 are:

We can see that translated pages usually have far fewer views than the originals. The median coefficient is 0.007.
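The comparison can be sketched as below. It assumes the coefficient is the translated article's views divided by the original's views, and that the summary value is a median; both are assumptions about the analysis, not confirmed details. The function name `view_ratios`, the keys (hypothetical translation identifiers shared by both dictionaries), and all numbers are made up for illustration.

```python
from statistics import median


def view_ratios(translated_views, original_views):
    """Map each key to translated views divided by original views."""
    ratios = {}
    for key, translated in translated_views.items():
        original = original_views.get(key)
        if not original:
            # Skip entries whose original views are missing or zero.
            continue
        ratios[key] = translated / original
    return ratios


# Made-up view counts, keyed by a hypothetical translation identifier.
translated = {"t1": 10, "t2": 300, "t3": 5}
original = {"t1": 1000, "t2": 200, "t3": 0}

ratios = view_ratios(translated, original)
typical_ratio = median(ratios.values())
# Entries where the translation is viewed more than the original.
more_popular = [key for key, ratio in ratios.items() if ratio > 1]
print(typical_ratio, more_popular)
```

A ratio above 1 marks an article whose translated version is viewed more often than the original, which is the third item in the analysis plan.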

Here are titles of articles (original title) along with the country/subject they belong to:

It looks like most of the top articles relate to regions near Russia. Also, some of them are political in nature.

Here are top articles by human translation along with the subject they belong to:

Let's choose a couple of articles and look at them closely. For that, we'll use another API.

I noticed that the translator added links to well-known places and people, for example the School of Journalism at The University of Montana, Jack Anderson, and the names of various newspapers (such as The Washington Post and Washingtonian). The word order was also changed: the translation tool doesn't always put words in the right order or grammatical case. The text structure changed slightly as well, with some compound sentences divided into several shorter ones. Finally, the translation tool sometimes picks words that don't fit the context, and these are usually replaced with more appropriate ones.

The History page includes comments such as: links to various pages of the English site of James Grady for easy source searching; Russian language improvement.

Second place goes to a title that we've already seen before: The Cockpit (OVA). Let's explore it.

We see that the translated article is much longer than the original. The Russian translator added information about anime reviews and criticism and about the anime soundtrack, and the plot description became more detailed.

What's coming:

This is just the beginning of my analysis. During the internship at Wikimedia I plan: