Accessing Wikidata Content

This notebook provides a tutorial for how to access content in Wikidata, either via the JSON dumps or the API. It has three stages:

Accessing the Wikidata JSON Dumps

This is an example of how to parse through the JSON dumps and gather statements (property:value pairs) for all items with at least one Wikipedia sitelink. This can be adjusted for whatever filtering is desired. Of note, the JSON dumps are over 50 GB compressed, so processing them can easily take a full day. If this is done via PAWS, the service will time out.
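The parsing loop above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the dump path is a hypothetical placeholder, and the Wikipedia-sitelink filter is approximate (sitelink keys ending in `wiki`, minus a few known sister projects such as `commonswiki`).

```python
import bz2
import json

DUMP_PATH = "latest-all.json.bz2"  # hypothetical path to a downloaded dump

# sitelink keys that end in "wiki" but are not Wikipedias (approximate list)
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki", "mediawikiwiki"}

def parse_item(line):
    """Parse one line of the dump; return (qid, claims, sitelinks) or None.

    In the dump, the first and last lines are '[' and ']'; every other line
    is one item's JSON object followed by a trailing comma.
    """
    line = line.strip()
    if line in ("[", "]") or not line:
        return None
    item = json.loads(line.rstrip(","))
    if item.get("type") != "item":
        return None
    sitelinks = item.get("sitelinks", {})
    # keep only items with at least one Wikipedia sitelink (e.g. "enwiki")
    if not any(k.endswith("wiki") and k not in NON_WIKIPEDIA for k in sitelinks):
        return None
    return item["id"], item.get("claims", {}), sitelinks

def iterate_dump(path=DUMP_PATH):
    """Stream qualifying items without decompressing the dump to disk first."""
    with bz2.open(path, mode="rt") as fin:
        for line in fin:
            parsed = parse_item(line)
            if parsed is not None:
                yield parsed
```

Streaming line by line through `bz2.open` keeps memory use flat, which matters at this dump size.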

Show an example of the data

Accessing the Wikidata APIs

The Wikidata APIs can be much faster for accessing data about Wikidata items if you know which items you are interested in and need relatively few of them (e.g., hundreds or low thousands). To demonstrate, we'll show how to use the wbgetentities endpoint, which allows you to get all the statements and sitelinks associated with a Wikidata item. We choose a random sample of 10 items from the JSON dump to compare.
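A minimal sketch of a wbgetentities call, using only the standard library; the helper names here are our own, and `props=claims|sitelinks` reflects the statements-and-sitelinks use case described above (wbgetentities accepts up to 50 IDs per request).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.wikidata.org/w/api.php"

def build_wbgetentities_url(qids):
    """Build a wbgetentities request URL for up to 50 item IDs."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),          # e.g. "Q42|Q64"
        "props": "claims|sitelinks",    # statements and sitelinks only
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)

def fetch_entities(qids):
    """Fetch the 'entities' mapping (QID -> entity data) for the given items."""
    with urlopen(build_wbgetentities_url(qids)) as resp:
        return json.load(resp)["entities"]
```

For larger samples, batch the IDs into groups of 50 and pause between requests to stay within API etiquette.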

NOTE: the APIs are up-to-date while the JSON dumps are always at least several days behind -- i.e., they are snapshots at specific points in time -- so the data you get from the JSON dumps may differ from the APIs if users have edited the Wikidata items in the intervening days.

Example Analyses of Wikidata data

Here we show some examples of things we can do with the data that we gathered about Wikidata items. We'll work with the larger dataset we gathered from the JSON dumps. There are many research questions you might come up with about Wikidata items, but we're going to examine the following:

There is a lot of content on wikis and much more that could be added, but relatively few editors. As a result, there are always going to be topics that have excellent coverage and others that have only basic information. Ideally, the topics that are covered well are also the "important" ones. We are going to do a basic analysis to see how well this holds up in Wikidata. Specifically, we choose to look at the relationship between the number of statements that a given Wikidata item has (a rough proxy of quality / coverage) and the number of languages in which there is a corresponding Wikipedia article (a rough proxy for interest / importance of the topic).

Descriptive statistics

Basic details on number of statements and number of sitelinks per item. Conclusions:

Relationship between the two variables:

Relationship between # of sitelinks and # of statements. Conclusions:
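Since both counts are skewed with heavy tails, a rank-based correlation is the natural measure of this relationship. A self-contained Spearman implementation (Pearson correlation on average ranks, handling ties) might look like:

```python
def rank(values):
    """Average 1-based ranks, with tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))
```

In practice `scipy.stats.spearmanr` does the same thing; the hand-rolled version just makes the rank transformation explicit.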

Outlier analysis

We now visually examine the outliers -- i.e., items with high quality but low importance, or low quality but high importance. Conclusions:
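One simple way to pick out those two kinds of outliers is to score each item by the gap between its (log-scaled) sitelink and statement counts and flag items more than a couple of standard deviations from the mean. The function below is our own illustrative heuristic, not the notebook's method:

```python
import math

def flag_outliers(items, k=2.0):
    """Flag items whose sitelink/statement balance is extreme.

    items: dict mapping qid -> (num_statements, num_sitelinks)
    Returns (many_sitelinks_few_statements, many_statements_few_sitelinks).
    """
    scores = {q: math.log1p(sl) - math.log1p(st)
              for q, (st, sl) in items.items()}
    mean = sum(scores.values()) / len(scores)
    sd = (sum((v - mean) ** 2 for v in scores.values()) / len(scores)) ** 0.5
    # high score: high importance (sitelinks) but low quality (statements)
    hi = [q for q, v in scores.items() if v > mean + k * sd]
    # low score: high quality but low importance
    lo = [q for q, v in scores.items() if v < mean - k * sd]
    return hi, lo
```

The log transform keeps a handful of mega-items (countries, famous people) from dominating the score.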

Predictive Model

We've established that there is a clear relationship between # of statements (quality) and # of sitelinks (importance) and that that relationship also depends on whether the item is about a person. Now we want to see how accurately we can predict the number of sitelinks from the number of statements about an item. This can tell us which items we would expect to have articles written about them in more languages.

NOTE: the model presented below is very simplistic and so tells us little beyond what the correlational analyses did, but adding more variables and interactions between variables would allow for a more nuanced analysis of outliers.
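A model of roughly this simplicity can be sketched as an ordinary least-squares fit on log-transformed counts; this is our own minimal stand-in, not the notebook's exact model, and the helper names are ours.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares: fit ys ~ a + b * xs, return (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def fit_sitelink_model(statement_counts, sitelink_counts):
    """Fit log1p(sitelinks) ~ a + b * log1p(statements)."""
    xs = [math.log1p(s) for s in statement_counts]
    ys = [math.log1p(s) for s in sitelink_counts]
    return fit_line(xs, ys)

def predict_sitelinks(num_statements, a, b):
    """Predict sitelink count for an item, back-transformed from log scale."""
    return math.expm1(a + b * math.log1p(num_statements))
```

Items whose actual sitelink count falls far below the prediction are candidates for "articles we would expect in more languages".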

Future Analyses

At this stage, we'd ideally add more data and variables. For example: