Checking The Met open data consistency against Wikidata

This set of scripts checks the consistency between the Met data set and Wikidata.

For objects, we should check the title, accession number, creator, creation date, and "instance of" info.

Import Met CSV database into a dataframe

This is more than 400,000 rows, so it may take 5-10 seconds or more to load.
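
A minimal sketch of the import; the filename MetObjects.csv (the Met's open access dump) and its path are assumptions:

    import pandas as pd

    # Load the Met open access CSV; low_memory=False avoids mixed-dtype
    # warnings on a file with 400,000+ rows
    met_df = pd.read_csv('MetObjects.csv', low_memory=False)
    len(met_df)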

Examine the structure of rows

Take a look at some of the rows. NaN ("not a number") is how Pandas represents blank cells from the CSV file.
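
For example, a handful of random rows can be pulled like this:

    # Five random rows; blank CSV cells show up as NaN
    met_df.sample(5)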

Statistics on most used and unique terms

The most frequently used terms can be found in the row labeled "top" (see the sketch after this list). In summary:

  1. The Drawings and Prints department has the most items.
  2. Object number "62.635" is used four times, which we can investigate below.
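
A sketch of how that summary is produced; describe() on the string columns yields the count, unique, top, and freq rows:

    # "top" is the most frequent value in each column, "freq" is its count
    met_df.describe(include='object')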

Examine random artist names

Public domain works

Most frequent artists

Photographer Walker Evans leads the list here, followed by publishers of trade and baseball cards (Kinney Brothers Tobacco, Allen & Ginter, W. Duke, Sons, et al.), ephemera, and other materials.
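
A minimal sketch of the tally, assuming the creator column is named 'Artist Display Name' as in the Met CSV:

    # Most frequent creator strings
    met_df['Artist Display Name'].value_counts().head(10)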

Anonymous creators

Many entries in the Met database are "Unknown", "Anonymous" or "Anonymous..." with the addition of some qualifying details. Sometimes it is "Anonymous|..." with a pipe symbol, and other details are included. Here are the most frequent uses of "Anonymous." We should somehow capture this info in Wikidata, though we might instead choose not to and simply infer the era from the artwork's Wikidata inception time.
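
One way to pull that tally, a sketch assuming the same 'Artist Display Name' column:

    # Entries beginning with "Anonymous", with or without qualifying details
    anon = met_df[met_df['Artist Display Name'].str.startswith('Anonymous', na=False)]
    anon['Artist Display Name'].value_counts().head(10)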

Question - Should we keep this distinction between "Unknown" and "Anonymous"?

Question - Should we capture the nationality and era of the "Anonymous" creator in Wikidata?

Let's take a look at some examples of using a pipe (|) in the Anonymous entries.
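
A sketch of pulling those examples; regex=False makes Pandas match the pipe literally rather than as a regex alternation:

    # "Anonymous|..." entries: an Anonymous creator plus other details
    piped = met_df[met_df['Artist Display Name'].str.contains('Anonymous|', regex=False, na=False)]
    piped['Artist Display Name'].sample(10)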

Question - Is there a consistent interpretation of the pipe symbol, in terms of it being an "and" or an "or", or does it depend on the department and how each one uses it?

Send Wikidata Query to pick up Met objects

Met objects are currently (April 2019) modeled in slightly different ways, so one goal of the project is to normalize this and make it consistent. There are two different methods to pick up Met objects:

  1. Anything with Met ID (P3634)
  2. Anything with inventory number (P217) qualified with collection (P195) set to Met (Q160236)

For the SPARQL query, these two are combined with UNION, and optional fields are returned.
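
A sketch of such a query run against the Wikidata Query Service; the exact optional fields in the real notebook may differ:

    import requests

    # Two routes to a Met object, combined with UNION:
    #  1. Met object ID (P3634) present
    #  2. inventory number (P217) qualified with collection (P195) = Met (Q160236)
    query = """
    SELECT ?item ?metid ?inventorynumber ?inception WHERE {
      { ?item wdt:P3634 ?metid . }
      UNION
      { ?item p:P217 [ ps:P217 ?inv ; pq:P195 wd:Q160236 ] . }
      OPTIONAL { ?item wdt:P3634 ?metid . }
      OPTIONAL { ?item wdt:P217 ?inventorynumber . }
      OPTIONAL { ?item wdt:P571 ?inception . }
    }
    """
    r = requests.get('https://query.wikidata.org/sparql',
                     params={'query': query, 'format': 'json'},
                     headers={'User-Agent': 'met-wikidata-consistency-check'})
    wdq_json = r.json()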

Examine some random records to understand the JSON structure

Now that we've made the SPARQL query, which may take 5-10 seconds, let's look at some of the raw JSON that is returned to understand how to parse it.
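
For instance, a sketch assuming the query result from above is in wdq_json:

    import json
    import random

    # The JSON has a results -> bindings list; each binding maps a SPARQL
    # variable to a {type, value} pair
    for row in random.sample(wdq_json['results']['bindings'], 2):
        print(json.dumps(row, indent=2))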

Convert the WDQ JSON result to a dataframe

Make life easier by converting the JSON to a Pandas DataFrame, which is basically a 2D spreadsheet-like data structure. We're also going to do some integrity checks as we import. Most of the data are strings and numbers, but "inception" is a formal date string in the format +1984-01-01T00:00:00Z, and it's possible Wikidata has dates that validate but are illogical, like year 0. Python will error out on these, and they show up in pink below.
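
A sketch of the conversion, flagging inceptions Python's datetime refuses to parse; the wdq_json name comes from the sketches above, and normalizing a missing metid to 0 matches the test used later:

    from datetime import datetime
    import pandas as pd

    rows = []
    for b in wdq_json['results']['bindings']:
        inception = b['inception']['value'] if 'inception' in b else None
        if inception:
            try:
                # Wikidata dates look like +1984-01-01T00:00:00Z; strip any
                # leading + and the trailing Z before parsing. Year 0 or
                # negative years raise ValueError here.
                datetime.strptime(inception.lstrip('+').rstrip('Z'),
                                  '%Y-%m-%dT%H:%M:%S')
            except ValueError:
                print('Suspect inception:', b['item']['value'], inception)
                inception = None
        rows.append({
            'qid': b['item']['value'].rsplit('/', 1)[-1],
            # Met object IDs are numeric; use 0 when P3634 is absent
            'metid': int(b['metid']['value']) if 'metid' in b else 0,
            'inventorynumber': b['inventorynumber']['value'] if 'inventorynumber' in b else None,
            'inception': inception,
        })
    wd_df = pd.DataFrame(rows)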

Problem - It is also possible that inception is set to "unknown value" in Wikidata, which is tricky to handle in Python.

In SPARQL parlance, it would be tested like this:

    ?item wdt:P571 ?date .
    FILTER isBlank(?date) .

We'll have to figure out how best to represent this while doing our data work, since Python's datetime module is quite strict. Some research indicates there is a real need for this kind of outlier handling, but there is no simple or pat solution.

(https://stackoverflow.com/questions/6697770/allowing-invalid-dates-in-python-datetime)

Examine some random records to check they are being imported correctly

Wikidata items for The Met with 'anonymous' as creator

Wikidata items believed to be Met objects but missing Met Object ID statement

These items don't have Met Object ID (P3634) set but are in the list because inventory number (P217) with collection (P195) set to The Met (Q160236) was present. We test for these by checking whether Met ID = 0.
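
A sketch of that filter, assuming the wd_df frame from the conversion above, where a missing metid was normalized to 0:

    # Items that matched only via the P217/P195 route
    wd_missing_metid_df = wd_df[wd_df['metid'] == 0]
    len(wd_missing_metid_df)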

Take a peek at 10 random entries returned

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_metid_df) that supplies qid and Object Number/inventory number

Test some rows for sanity:

Generate Quickstatements to fix the problem
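
A minimal sketch of emitting the statements, assuming the met_wd_df join from above and that the Met CSV's 'Object ID' column holds the P3634 value:

    # One Quickstatements line per matched item, adding Met object ID (P3634)
    for _, row in met_wd_df.iterrows():
        print('{}|P3634|"{}"'.format(row['qid'], row['Object ID']))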

Wikidata items believed to be Met objects but missing inventory number statement

These items don't have inventory number (P217) set but are in the list because Met Object ID (P3634) was set. We test to see if the Wikidata results for Met items have inventorynumber set to None.

The FIX for this would be to generate and ingest Quickstatements to fill in the inventory number.

Something like:

    Q61876946|P217|"2003.161"|P195|Q160236

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_inventory_df) that supplies qid and matched metid

Generate Quickstatements to fix the problem
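
A sketch mirroring the example line above; the joined-frame name met_wd_inv_df (the result of the join just described) is an assumption:

    # Add inventory number (P217) qualified with collection (P195) = The Met (Q160236)
    for _, row in met_wd_inv_df.iterrows():
        print('{}|P217|"{}"|P195|Q160236'.format(row['qid'], row['Object Number']))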