Compare Met paintings against Wikidata

Check to see if Wikidata has paintings miscategorized compared to the Met's database.

  1. Met: Load the entire Met database of 400,000 items into a dataframe.
  2. Wikidata: Return all items with Met Object ID (P3634) set into a dataframe.
  3. See if the Met database's "Object Name=Painting" matches what Wikidata has
  4. Flag any objects that don't match

Import Met CSV database

This is more than 400,000 rows, so loading may take 5-10 seconds or more
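A minimal sketch of this step, assuming the Met open-access CSV is available locally as `MetObjects.csv` (the filename and the tiny stand-in rows below are assumptions, not the real data):

```python
import io
import pandas as pd

# In the real notebook, the full ~400,000-row Met CSV would be loaded like this
# (hypothetical local path; low_memory=False avoids mixed-dtype warnings):
# met_df = pd.read_csv("MetObjects.csv", low_memory=False)

# Tiny stand-in CSV so this sketch is self-contained:
sample_csv = io.StringIO(
    "Object ID,Object Name,Title,Artist Display Name\n"
    "436535,Painting,Wheat Field with Cypresses,Vincent van Gogh\n"
    "45734,Vase,,\n"
)
met_df = pd.read_csv(sample_csv)
print(met_df.shape)  # (2, 4)
```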

Sanity check

Take a look at some of the rows. NaN ("not a number") is how pandas represents a blank field from the CSV file.
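A quick sanity-check sketch (the rows here are illustrative stand-ins, not real Met records); note how the empty CSV field surfaces as NaN:

```python
import io
import pandas as pd

# Small stand-in for the Met CSV; the blank Title field becomes NaN on import
met_df = pd.read_csv(io.StringIO(
    "Object ID,Object Name,Title\n"
    "436535,Painting,Wheat Field with Cypresses\n"
    "45734,Vase,\n"
))
print(met_df.sample(2, random_state=0))   # look at a couple of random rows
print(met_df["Title"].isna().sum())       # count of blank titles
```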

Columns of the Met database

Isolate Met DB columns of interest
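A sketch of column selection (the exact set of columns kept here is an assumption; the names match the Met CSV headers):

```python
import io
import pandas as pd

# Stand-in row so the sketch is self-contained
met_df = pd.read_csv(io.StringIO(
    "Object ID,Object Number,Object Name,Title,Artist Display Name,Credit Line\n"
    "436535,1993.132,Painting,Wheat Field with Cypresses,Vincent van Gogh,Annenberg\n"
))

# Keep only the columns the comparison needs
cols = ["Object ID", "Object Number", "Object Name", "Title", "Artist Display Name"]
met_df = met_df[cols]
print(list(met_df.columns))
```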

Send Wikidata Query to pick up all Met objects

Met objects are currently (April 2019) modeled in two slightly different ways in Wikidata, so one goal of the project is to normalize this and make it consistent. There are currently two different methods to pick up Met objects:

  1. Anything with Met ID (P3634)
  2. Anything with inventory number (P217) qualified with collection (P195) set to Met (Q160236)

For a SPARQL query, these two are combined with UNION, and optional fields returned.
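The combined query can be sketched like this (variable names and the optional fields chosen are illustrative assumptions); the request itself is shown but not executed here:

```python
# Sketch of the combined SPARQL query for the Wikidata Query Service
endpoint = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?metid ?invnum ?inception WHERE {
  { ?item wdt:P3634 ?metid . }            # method 1: Met Object ID
  UNION
  { ?item p:P217 ?invstatement .
    ?invstatement ps:P217 ?invnum ;
                  pq:P195 wd:Q160236 . }  # method 2: inventory number qualified with Met
  OPTIONAL { ?item wdt:P571 ?inception . }
}
"""

# To actually run it (not executed in this sketch):
# import requests
# r = requests.get(endpoint, params={"query": query, "format": "json"})
# data = r.json()
print("UNION" in query)  # True
```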

Convert the WDQ JSON result to a dataframe

Make life easier by converting the JSON to a pandas DataFrame, which is basically a 2D spreadsheet-like data structure. We're also going to do some integrity checks as we import. Most of the data are strings and numbers, but "inception" is a formal date string in the format +1984-01-01T00:00:00Z, and it's possible Wikidata has dates that validate but are illogical, like year 0. The import will error out on these, and they show up in pink below.
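A sketch of the conversion with the date check folded in (the JSON bindings and QIDs below are hypothetical stand-ins for a real WDQS result); rather than erroring out, this version records a None for unparseable dates:

```python
import datetime
import pandas as pd

# Hypothetical slice of a WDQS JSON result (simplified bindings)
bindings = [
    {"item": {"value": "http://www.wikidata.org/entity/Q18689458"},
     "inception": {"value": "+1889-01-01T00:00:00Z"}},
    {"item": {"value": "http://www.wikidata.org/entity/Q99999999"},
     "inception": {"value": "+0000-01-01T00:00:00Z"}},  # illogical year 0
]

rows = []
for b in bindings:
    qid = b["item"]["value"].rsplit("/", 1)[-1]
    raw = b.get("inception", {}).get("value")
    try:
        # Strip the leading "+" and parse; year 0 raises ValueError
        date = datetime.datetime.strptime(raw.lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")
    except (ValueError, AttributeError):
        date = None  # flag for review instead of crashing the import
    rows.append({"qid": qid, "inception": date})

wd_metitems_df = pd.DataFrame(rows)
print(wd_metitems_df)
```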

Problem - It is also possible that inception is set to "unknown value" in Wikidata, which is tricky to handle in Python.

In SPARQL parlance, it would be tested like this:

?item wdt:P571 ?date .
FILTER isBLANK(?date) .

We'll have to figure out how best to represent this while doing our data work, since Python's datetime module is quite strict. Some research indicates there is real demand for this kind of outlier handling, but there is no simple or pat solution.

(https://stackoverflow.com/questions/6697770/allowing-invalid-dates-in-python-datetime)
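One workable sketch: in the WDQS JSON results, an "unknown value" comes back as a blank node (binding type "bnode") rather than a literal, so it can be detected and mapped to None before parsing. The binding shape here is an assumption about the JSON format:

```python
# Hypothetical WDQS binding for an "unknown value" inception
binding = {"inception": {"type": "bnode", "value": "t1234567"}}

inception = binding["inception"]
# Treat a blank node like a missing date instead of trying to parse it
date = None if inception.get("type") == "bnode" else inception["value"]
print(date)  # None
```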

Examine some random records to check they are being imported correctly

Met paintings in Wikidata already

Do an "inner" join that makes a new dataframe based on the Met database (met_df) but adds new columns from the Wikidata query (wd_metitems_df) supplying the QID and the Object Number/inventory number
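The join can be sketched with `pandas.DataFrame.merge` (the frames, column names, and QID below are illustrative stand-ins):

```python
import pandas as pd

# Stand-in Met database and Wikidata query results
met_df = pd.DataFrame({"Object ID": [436535, 45734],
                       "Object Name": ["Painting", "Vase"]})
wd_metitems_df = pd.DataFrame({"metid": [436535],
                               "qid": ["Q18689458"],
                               "invnum": ["1993.132"]})

# Inner join: keep only Met rows that have a matching Wikidata item,
# pulling in the qid and inventory number columns
merged_df = met_df.merge(wd_metitems_df, how="inner",
                         left_on="Object ID", right_on="metid")
print(merged_df)
```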

Merged dataframe with Met paintings DB and Wikidata QIDs

Met paintings NOT in Wikidata

Make a new dataframe that excludes all paintings whose Met Object ID (P3634) is already in Wikidata, and show some sample rows
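This exclusion is an anti-join; one way to sketch it is with `isin` and negation (the rows and IDs below are illustrative stand-ins):

```python
import pandas as pd

# Stand-in Met database and the Met IDs already present in Wikidata
met_df = pd.DataFrame({"Object ID": [436535, 437980, 45734],
                       "Object Name": ["Painting", "Painting", "Vase"]})
wd_metitems_df = pd.DataFrame({"metid": [436535]})

# Restrict to paintings, then drop rows whose Object ID is already in Wikidata
paintings = met_df[met_df["Object Name"] == "Painting"]
not_in_wd = paintings[~paintings["Object ID"].isin(wd_metitems_df["metid"])]
print(not_in_wd)
```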

All Met paintings not yet in Wikidata, but which have no title or artist name.