Some notes about the cxpublishedtranslations API:

* Example and overview given here:
* In the parameters, you may notice there is an 'offset'. The API will only return 500 results at a time and, given that there are often more than 500 translated articles, this parameter tells the API which results you want.
* For the cxpublishedtranslations API, the articles are sorted by the date they were translated, so the deeper into the dataset you go (the higher the offset), the newer the translations are. We do not have full data for articles that were translated before 2016-01-22, so I chose an offset (20000) that would get me a list of articles translated after that date. You may have to try a few offsets to find one that works. If the results are empty, that probably means your offset is too high.
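To make the offset behaviour concrete, here is a minimal sketch of paging through results 500 at a time and stopping when a page comes back empty. The endpoint URL, the `from`/`to`/`limit` parameter names, the `result` → `translations` response shape, and the helper names (`build_params`, `fetch_page`, `iter_translations`) are assumptions for illustration, not taken from these notes; check them against the API overview before relying on them.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint
PAGE_SIZE = 500  # the API returns at most 500 results per request


def build_params(source_lang, target_lang, offset, limit=PAGE_SIZE):
    """Query parameters for one page of published translations (names assumed)."""
    return {
        "action": "query",
        "list": "cxpublishedtranslations",
        "from": source_lang,
        "to": target_lang,
        "limit": limit,
        "offset": offset,
        "format": "json",
    }


def fetch_page(params):
    """Fetch one page over HTTP.

    Assumes the response looks like {"result": {"translations": [...]}}.
    """
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "cx-notes-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data.get("result", {}).get("translations", [])


def iter_translations(source_lang, target_lang, start_offset,
                      fetch=fetch_page, max_pages=3):
    """Yield translations page by page, starting at start_offset.

    Stops after max_pages pages or at the first empty page, which
    probably means the offset has run past the end of the data.
    """
    offset = start_offset
    for _ in range(max_pages):
        page = fetch(build_params(source_lang, target_lang, offset))
        if not page:
            break
        yield from page
        offset += len(page)
```

For example, `list(iter_translations("en", "es", start_offset=20000))` would request up to three pages starting at offset 20000; the language pair here is only a placeholder.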

Get set of translated articles to dig more deeply into

Alternative view of the data via Pandas
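A minimal sketch of the tabular view, assuming the API results are a list of dictionaries. The two records below are hypothetical placeholders, and the field names (`translationId`, `sourceTitle`, `targetTitle`) are assumptions about the response rather than something stated in these notes.

```python
import pandas as pd

# Hypothetical placeholder records standing in for the API results;
# the field names are assumed, not confirmed by these notes.
translations = [
    {"translationId": "1", "sourceTitle": "Example A", "targetTitle": "Ejemplo A"},
    {"translationId": "2", "sourceTitle": "Example B", "targetTitle": "Ejemplo B"},
]

# One row per translated article, one column per field.
df = pd.DataFrame(translations)

# Keep only the columns of interest, in a fixed order.
df = df[["translationId", "sourceTitle", "targetTitle"]]
print(df.head())
```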

Get corresponding parallel translation

Parallel translations can be accessed through either the dump files or the API. Use the dump files if you are planning on analyzing the entire corpus (or a large proportion) of translated articles. The API is best for looking at a few examples.

Dump files

* The dump files give you local access to the parallel translations and are the "friendly" way to access the data, especially if you are looking at a lot of examples.
* Download the most recent .text.json.gz file based on the instructions here:
* Upload the dump file to PAWS. There is an upload button on your dashboard (click on the PAWS logo in the top right to get there).
* There is a bug where the dump files contain extra commas that break the JSON syntax and lead to an error if you call json.load directly on the fin variable. Instead, you have to remove the extra commas before loading the dump file.
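One way to work around the extra commas is to read the file as text, collapse each run of consecutive commas into a single comma, and only then parse it. This is a sketch under assumptions: the function name `load_dump` is made up for illustration, the whole file is assumed to fit in memory, and the regex would also alter any repeated commas that occur inside string values, so spot-check the result.

```python
import gzip
import json
import re


def load_dump(path):
    """Load a gzipped JSON dump, tolerating runs of extra commas.

    Caveat (assumption): the regex cannot tell structural commas from
    commas inside string values, so text containing ",," is changed too.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        raw = fin.read()
    # Collapse ",," or ", ,\n," etc. into a single ",".
    cleaned = re.sub(r",(\s*,)+", ",", raw)
    return json.loads(cleaned)
```

Reading the whole file at once keeps the sketch short; for a very large dump a streaming approach may be needed instead.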


The API is the quickest way to access the parallel translations, but it is best for looking at just a few examples. See overview:
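As a rough sketch of what a single-translation request might look like, the helper below only builds the request URL. The module name `contenttranslationcorpora`, the `translationid` and `striphtml` parameters, and the endpoint are assumptions based on my recollection of the API, and `corpora_url` and the ID `12345` are hypothetical; confirm them against the overview before use.

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def corpora_url(translation_id, strip_html=True):
    """Build a request URL for one translation's parallel text.

    Module and parameter names are assumptions, not confirmed here.
    """
    params = {
        "action": "query",
        "list": "contenttranslationcorpora",
        "translationid": translation_id,
        "format": "json",
    }
    if strip_html:
        # MediaWiki boolean parameters are true when present,
        # so the key is only added when stripping is wanted.
        params["striphtml"] = "true"
    return API_URL + "?" + urllib.parse.urlencode(params)


# 12345 is a hypothetical translation ID used only for illustration.
print(corpora_url(12345))
```

The translation ID would come from the list of published translations retrieved earlier, one request per article, which is why this route suits a handful of examples rather than the whole corpus.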