Checking The Met open data consistency against Wikidata

This set of scripts checks the consistency between the Met data set and Wikidata.

For objects, we should check the title, accession number, creator, creation date, and instance of (P31) information.

Import Met CSV database into a dataframe

The file has more than 400,000 rows, so loading may take 10-20 seconds or more.
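A minimal loading sketch, assuming the MetObjects.csv file from the Met's openaccess GitHub repository has been downloaded to the working directory (the file name and location are assumptions):

import pandas as pd

# MetObjects.csv is assumed to be a local copy of the Met open access CSV;
# low_memory=False avoids mixed-dtype warnings on a file with 400,000+ rows
met_df = pd.read_csv('MetObjects.csv', low_memory=False)
len(met_df)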

Examine the structure of rows

Take a look at some of the rows. NaN ("not a number") is how pandas represents a blank field from the CSV file.
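One way to sample rows, assuming the met_df dataframe from the import step above:

# Show a few random rows; blank CSV fields appear as NaN
met_df.sample(5)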

Statistics on most used and unique terms

The most frequently used terms can be found in the row labeled "top" of the summary statistics (see the sketch after this list). In summary:

  1. The Drawings and Prints department has the most items
  2. Object number "62.635" is used four times, which we can investigate below.
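A sketch of how these statistics can be produced with pandas, assuming the met_df dataframe and the "Object Number" column name from the Met CSV:

# describe(include='all') reports count, unique, top (most frequent value)
# and freq for every column, including non-numeric ones
met_df.describe(include='all')

# Follow up on the duplicated object number noted above
met_df[met_df['Object Number'] == '62.635']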

Examine random artist names

Public domain works

Most frequent artists

Photographer Walker Evans leads the list here, followed by publishers of trade or baseball cards (Kinney Brothers Tobacco, Allen & Ginter, W. Duke, Sons, et al), ephemera and other materials.
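A frequency count along these lines, assuming the "Artist Display Name" column from the Met CSV:

# The ten most frequent creator strings in the data set
met_df['Artist Display Name'].value_counts().head(10)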

Anonymous creators

Many entries in the Met database are "Unknown", "Anonymous", or "Anonymous..." with some qualifying details added. Sometimes the form is "Anonymous|..." with a pipe symbol followed by other details. Here are the most frequent uses of "Anonymous." We should somehow capture this info in Wikidata, but it's possible we might not want to, and instead just infer the era from the artwork's Wikidata inception time.
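One way to tally these variants, again assuming the "Artist Display Name" column (a sketch, not the notebook's exact code):

# Count creator strings that begin with "Anonymous"
artists = met_df['Artist Display Name'].dropna()
artists[artists.str.startswith('Anonymous')].value_counts().head(10)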

Question - Should we keep this distinction on "Unknown" and "Anonymous"?

Question - Should we capture the nationality and era of the "Anonymous" creator in Wikidata?

We may want to sidestep these questions for now.

Let's take a look at some examples that use a pipe (|) in the Anonymous entries.
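A filtering sketch, assuming the same "Artist Display Name" column:

# regex=False treats the pipe literally rather than as regex alternation
mask = met_df['Artist Display Name'].str.contains('Anonymous|', regex=False, na=False)
met_df.loc[mask, ['Object Number', 'Artist Display Name']].sample(5)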

Question - Is there a consistent interpretation of the pipe symbol, in terms of it being an "and" or an "or", or does its meaning depend on how each department uses it?

Analyzing textile classifications and object names

Examine some Credit Lines to extract a date

We can extract a year from the "Credit Line" column to set the collection (P195) qualifier start time (P580). The object page on The Met web site has a "Provenance" section, but this does not seem to be accessible from the API or the dump, and it also seems to differ from the Credit Line. Example:

https://www.metmuseum.org/art/collection/search/681504

Question - Is this a reasonable inference?

Question - Why is the above example inconsistent? The Met API/database has no date, but the object page says 1000 BC - 1 AD, which seems to be an incorrect range of dates.

A method for extracting the year:
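A regex-based sketch (the pattern, and the assumption that credit lines typically end in a four-digit year like "Rogers Fund, 1917", are mine, not the notebook's):

import re

def extract_year(credit_line):
    # Return the first plausible four-digit year (1800-2029) found in the
    # credit line, or None when the field is blank or has no year
    if not isinstance(credit_line, str):
        return None
    match = re.search(r'\b(1[89]\d{2}|20[0-2]\d)\b', credit_line)
    return int(match.group(1)) if match else None

met_df['Credit Line'].map(extract_year).dropna().head()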

Send Wikidata Query to pick up Met objects

Met objects are currently (April 2019) modeled in slightly different ways on Wikidata, so one goal of the project is to normalize this and make it consistent. There are currently two different methods to pick up Met objects:

  1. Anything with Met Object ID (P3634)
  2. Anything with inventory number (P217) qualified with collection (P195) set to Met (Q160236)

For the SPARQL query, these two patterns are combined with UNION, and optional fields are returned.
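A sketch of such a query, sent from Python with requests (the variable names and the exact set of optional fields are assumptions; the notebook's real query may differ):

import requests

# UNION of the two modeling patterns: P3634 directly, or P217 qualified
# with P195 = The Met (Q160236); OPTIONAL fills in whatever else exists
query = '''
SELECT ?item ?metid ?inventorynumber ?inception WHERE {
  { ?item wdt:P3634 ?metid . }
  UNION
  { ?item p:P217 ?stmt .
    ?stmt ps:P217 ?inventorynumber ;
          pq:P195 wd:Q160236 . }
  OPTIONAL { ?item wdt:P3634 ?metid . }
  OPTIONAL { ?item wdt:P217 ?inventorynumber . }
  OPTIONAL { ?item wdt:P571 ?inception . }
}
'''

r = requests.get('https://query.wikidata.org/sparql',
                 params={'query': query, 'format': 'json'})
wdq_json = r.json()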

Examine some random records to understand the JSON structure

Now that we've made the SPARQL query, which may take 5-10 seconds, let's look at some of the raw JSON that is returned to understand how to parse it.
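A quick inspection sketch, assuming the wdq_json result from the query above:

import random

# Each binding maps a variable name to a dict with 'type' and 'value' keys
for record in random.sample(wdq_json['results']['bindings'], 3):
    print(record)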

Convert the WDQ JSON result to a dataframe

Make life easier by converting the JSON to a pandas DataFrame, which is basically a 2D spreadsheet-like data structure. We're also going to do some integrity checks as we import. Most of the data are strings and numbers, but "inception" is a formal date string in the format +1984-01-01T00:00:00Z, and it's possible Wikidata has dates that validate but are illogical, like year 0. The import will error out on these, and they show up in pink below.

Problem - It is also possible that inception is set to "unknown value" in Wikidata, which is tricky to handle in Python.

In SPARQL parlance, it would be tested like this:

?item wdt:P571 ?date .

FILTER isBLANK(?date) .

We'll have to figure out how best to represent this while doing our data work, since Python's datetime module is quite strict. Some research indicates there is real demand for this type of outlier handling, but there is no simple or pat solution.

(https://stackoverflow.com/questions/6697770/allowing-invalid-dates-in-python-datetime)
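A defensive parsing sketch along those lines (the function name and the choice to map bad dates to None are assumptions):

from datetime import datetime

def parse_wd_date(value):
    # Parse a Wikidata time string like '+1984-01-01T00:00:00Z';
    # return None for unknown values and for dates Python rejects
    if not value:
        return None
    try:
        return datetime.strptime(value.lstrip('+'), '%Y-%m-%dT%H:%M:%SZ')
    except ValueError:
        # e.g. year 0 is out of range for Python; BCE dates also land here
        return None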

Examine some random records to check they are being imported correctly

Wikidata items for The Met with 'anonymous' as creator

Wikidata items believed to be Met objects but missing Met Object ID statement

These items don't have Met Object ID (P3634) set, but are in the list because inventory number and collection -> The Met Museum were set. We test to see if Met ID = 0.
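A filtering sketch, assuming the converted dataframe is called wd_df and the conversion step mapped a missing P3634 to 0 (both assumptions):

# Items that matched only via inventory number + collection
wd_missing_metid_df = wd_df[wd_df['metid'] == 0]
len(wd_missing_metid_df)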

Outlier problems

The following entries had issues matching what is in the Met database: either the Wikidata entries are outdated, or there are some data problems.

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_metid_df) that supplies qid and Object Number/inventory number

Test some rows for sanity:

Good - we fixed 82 items missing the Met ID on May 24, 2019, so now we are in sync.

Generate Quickstatements to fix the problem

In case you need to create Quickstatements, here they are:
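A generation sketch in the Quickstatements pipe format, assuming the joined dataframe above and the Met CSV's numeric "Object ID" column as the P3634 value:

# One line per statement: item|property|"value"
for _, row in met_with_qid_df.iterrows():
    print('%s|P3634|"%s"' % (row['qid'], row['Object ID']))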

Wikidata items believed to be Met objects but missing inventory number statement

These items don't have inventory number (P217) set, but are in the list because Met Object ID (P3634) was set. We test whether the Wikidata results for Met items have inventorynumber set to None.

The fix for this would be to generate and ingest Quickstatements to fill in the inventory number.

Something like:

Q61876946|P217|"2003.161"|P195|Q160236

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_inventory_df) that supplies qid and matched metid

Generate Quickstatements to fix the problem
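A combined sketch of the join and the Quickstatements output, in the same format as the example above (dataframe and column names are assumptions):

# metid comes back from SPARQL as a string; align dtypes before joining
wd_missing_inventory_df['metid'] = wd_missing_inventory_df['metid'].astype(int)

met_missing_inv_df = met_df.merge(
    wd_missing_inventory_df[['qid', 'metid']],
    left_on='Object ID',
    right_on='metid',
    how='inner')

# Inventory number (P217) qualified with collection (P195) = The Met (Q160236)
for _, row in met_missing_inv_df.iterrows():
    print('%s|P217|"%s"|P195|Q160236' % (row['qid'], row['Object Number']))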