Compare The Met DB against Wikidata titles

This is an experiment to check the consistency between the Met data set and Wikidata.

One approach is to examine an object's title, accession number, creator, creation date, and "instance of" information, and see how many of these match between the two databases.

  1. Met: Load the entire Met database of 400,000 items into a dataframe.
  2. Wikidata: Return all items with Met Object ID (P3634) set into a dataframe.
  3. Compare the basic indicators between the two using fuzzy comparisons.
  4. Flag any objects that have no similarities at all.

Import Met CSV database

This is more than 400,000 rows, so loading may take 5-10 seconds or more.
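A minimal sketch of the import step, assuming a local copy of the Met open access dump named MetObjects.csv (the filename and the low_memory setting are assumptions, not necessarily the notebook's exact call):

import pandas as pd

# Load the Met open access CSV into a dataframe.
# "MetObjects.csv" is assumed to be a local copy of the Met's open access dump;
# low_memory=False avoids pandas' mixed-dtype warnings on a file this size.
met_df = pd.read_csv("MetObjects.csv", low_memory=False)
print(len(met_df), "rows loaded")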

Sanity check

Take a look at some of the rows. NaN ("not a number") indicates a blank field in the CSV file.
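For example, assuming met_df from the import above:

# Peek at a few random rows; blank CSV cells come through as NaN.
met_df.sample(5)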

Columns of the Met database

Isolate Met DB columns of interest
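A sketch of the column selection; the column names below are assumptions based on the Met open access CSV header and may need adjusting:

# Columns assumed to exist in the Met CSV -- adjust to the actual header row.
columns_of_interest = [
    "Object ID",           # joins against Met Object ID (P3634) on Wikidata
    "Object Number",       # accession / inventory number (P217)
    "Title",
    "Artist Display Name",
    "Object Date",
    "Object Name",         # rough stand-in for "instance of"
]
met_slim_df = met_df[columns_of_interest].copy()
met_slim_df.head()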

Send Wikidata Query to pick up Met objects

Met objects are currently (April 2019) modeled in slightly different ways on Wikidata, so one goal of the project is to normalize this and make it consistent. There are currently two different methods to pick up Met objects:

  1. Anything with Met ID (P3634)
  2. Anything with inventory number (P217) qualified with collection (P195) set to Met (Q160236)

For the SPARQL query, these two are combined with UNION, and optional fields are returned.
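A sketch of what such a query might look like, sent to the WDQS endpoint with requests; the variable names and the particular optional fields are illustrative, not necessarily the project's exact query:

import requests

# Illustrative query: items found either through Met Object ID (P3634) or through an
# inventory number (P217) qualified with collection (P195) = Met (Q160236), combined
# with UNION; creator and inception are returned as optional fields.
QUERY = """
SELECT ?item ?itemLabel ?metid ?inventorynumber ?creator ?inception WHERE {
  { ?item wdt:P3634 ?metid . }
  UNION
  { ?item p:P217 ?invstatement .
    ?invstatement ps:P217 ?inventorynumber ;
                  pq:P195 wd:Q160236 . }
  OPTIONAL { ?item wdt:P170 ?creator . }
  OPTIONAL { ?item wdt:P571 ?inception . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "met-wikidata-consistency-check"},
)
wdq_json = r.json()
print(len(wdq_json["results"]["bindings"]), "results")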

Convert the WDQ JSON result to a dataframe

Make life easier by converting the JSON to a Pandas Dataframe, which is basically a 2D spreadsheet-like data structure. We're also going to do some integrity checks as we import. Most of the data are strings and numbers, but "inception" is a formal date string in the format +1984-01-01T00:00:00Z, and it's possible Wikidata has dates that validate but are illogical, like year 0. The import will error out on these, and the errors show up in pink below.
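A rough sketch of that conversion, using the wdq_json result from the query sketch above; the strict strptime call is what makes illogical dates like year 0 raise an error:

from datetime import datetime
import pandas as pd

def parse_inception(value):
    # Wikidata returns dates like "+1984-01-01T00:00:00Z"; datetime refuses
    # illogical values such as year 0, which is exactly the check we want here.
    return datetime.strptime(value.lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")

rows = []
for b in wdq_json["results"]["bindings"]:
    rows.append({
        "qid": b["item"]["value"].rsplit("/", 1)[-1],
        "metid": b.get("metid", {}).get("value"),
        "inventorynumber": b.get("inventorynumber", {}).get("value"),
        "label": b.get("itemLabel", {}).get("value"),
        "inception": parse_inception(b["inception"]["value"]) if "inception" in b else None,
    })
wd_df = pd.DataFrame(rows)
wd_df.head()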

Problem - It is also possible that inception is set to "unknown value" in Wikidata, which is tricky to handle in Python.

In SPARQL parlance, it would be tested like this:

?item wdt:P571 ?date .
FILTER isBLANK(?date) .

We'll have to figure out how best to represent this while doing our data work, since Python's datetime module is quite strict. Some research indicates there is real demand for this type of outlier handling, but there is no simple or pat solution.

(https://stackoverflow.com/questions/6697770/allowing-invalid-dates-in-python-datetime)
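One possible workaround, sketched under the assumptions of the earlier conversion code: treat anything that is not a plain literal (missing bindings, and "unknown value" nodes, which do not come back as literals) as None, and swallow illogical dates instead of raising. Swapping this in for the strict parse above would keep the import from stopping; it is an assumption about how to cope, not the notebook's settled approach.

from datetime import datetime

def inception_or_none(binding):
    node = binding.get("inception")
    # Missing bindings and "unknown value" nodes (anything that is not a plain
    # literal) both map to None.
    if node is None or node.get("type") != "literal":
        return None
    try:
        return datetime.strptime(node["value"].lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None   # illogical dates such as year 0 become None instead of raising

wd_df["inception"] = [inception_or_none(b) for b in wdq_json["results"]["bindings"]]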

Examine some random records to check they are being imported correctly
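For example (the conversion of metid to a nullable integer is an assumption added here so the later joins against the CSV's integer Object ID line up):

# Make the Met ID numeric, then spot-check a few random rows.
wd_df["metid"] = pd.to_numeric(wd_df["metid"], errors="coerce").astype("Int64")
wd_df.sample(5)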

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_metid_df) that supplies qid and Object Number/inventory number

Test some rows for sanity:

This needs some checking - if we had 96 missing Q items but only 82 come back after the merge, there are some errors that need resolving.
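A quick way to see which items failed to match, using the frame and column names assumed above:

# Which accession numbers from the Wikidata side found no Met row?
unmatched = set(wd_missing_metid_df["inventorynumber"]) - set(met_df["Object Number"])
print(len(unmatched), "items did not merge")
sorted(unmatched)[:10]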

Generate Quickstatements to fix the problem

Wikidata items believed to be Met objects but missing an inventory number statement

These items don't have inventory number set but are in the list because Met Object ID (P3634) was set. We test whether the Wikidata results for Met items have inventorynumber set to None.

The fix for this would be to generate and ingest Quickstatements to fill in the inventory number.

Something like:

Q61876946|P217|"2003.161"|P195|Q160236

Use Pandas equivalent of a database join

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds a new columns from the Wikidata query (wd_missing_inventory_df) that supplies qid and matched metid

Generate Quickstatements to fix the problem
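A sketch of the generation step, producing lines in the same shape as the example above; fix_df is the assumed name of the merged frame from the previous step, and the output filename is arbitrary:

# Emit one Quickstatements line per merged item, e.g.
#   Q61876946|P217|"2003.161"|P195|Q160236
lines = [
    f'{qid}|P217|"{objnum}"|P195|Q160236'
    for qid, objnum in zip(fix_df["qid"], fix_df["Object Number"])
]
print("\n".join(lines[:5]))                          # preview a few lines
with open("quickstatements_p217.txt", "w") as fh:    # paste/upload into Quickstatements
    fh.write("\n".join(lines))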