Investigating The Met collection through Data

The following Python script performs some basic data analysis to better understand The Met's database dump of 400,000+ objects and to provide insights on how to integrate it with linked open data projects, especially Wikidata. Send any questions to andrew.lih@gmail.com.

Basic stats on Met collection

Below are some stats showing how full or empty each column in the database is.
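As a rough illustration of how such fill rates can be computed, here is a minimal sketch using only the standard library. The sample rows are made up, and the column names are assumptions about the export, not values taken from the real file.

```python
import csv
import io

# Tiny stand-in for the Met CSV; the column names and rows are made up.
SAMPLE = """Object ID,Department,Country,Dimensions
1,Egyptian Art,Egypt,
2,Drawings and Prints,,10 x 12 in.
3,Islamic Art,Iran,
4,Egyptian Art,Egypt,H. 5 cm
"""

def fill_rates(handle):
    """Return {column: fraction of rows with a non-blank value}."""
    reader = csv.DictReader(handle)
    filled = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in filled:
            # Short rows yield None for missing columns, so guard with "or".
            if (row[name] or "").strip():
                filled[name] += 1
    return {name: (count / total if total else 0.0)
            for name, count in filled.items()}

rates = fill_rates(io.StringIO(SAMPLE))
# Print the emptiest columns first.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:12s} {rate:6.1%}")
```

For the real file, the same function could be given an open file handle instead of the in-memory sample.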

Breakdown of columns

What departments have the largest number of objects?
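One simple way to answer this is a frequency count over the department column. The sketch below uses hypothetical department values rather than the real data.

```python
from collections import Counter

# Hypothetical department values, one per object row.
departments = [
    "Drawings and Prints", "Egyptian Art", "Drawings and Prints",
    "Photographs", "Drawings and Prints", "Egyptian Art",
]

department_counts = Counter(departments)

# most_common() returns (value, count) pairs, largest count first.
for name, count in department_counts.most_common(3):
    print(f"{count:6d}  {name}")
```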

Object details

Here's a typical object returned from the API or read from the spreadsheet.

Dimensions

One particularly challenging field is 'Dimensions', which may be delimited by newlines and varies dramatically between 2D objects and 3D objects such as vases, furniture, and coins.

Because of this, the file needs a proper CSV parser rather than line-by-line processing. Simple UNIX tools, and even MS Excel, are typically fooled when the CSV file breaks up 'Dimensions' into multiple lines like this.
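To make the problem concrete, here is a small sketch contrasting a naive line split with Python's csv module. The record is made up, and it assumes the multi-line value is enclosed in double quotes, which is what lets the csv module keep it together.

```python
import csv
import io

# Made-up records; the first Dimensions value spans two physical lines.
RAW = (
    'Object ID,Title,Dimensions\n'
    '1,Vase,"H. 10 in. (25.4 cm)\nDiam. 4 in. (10.2 cm)"\n'
    '2,Print,"Sheet: 8 x 10 in."\n'
)

# Naive approach: one record per physical line, as line-oriented tools assume.
naive_records = RAW.splitlines()[1:]

# csv keeps a quoted field together even when it contains newlines.
# When reading a real file, open it with newline="" as the csv docs advise.
parsed_records = list(csv.DictReader(io.StringIO(RAW)))

print(len(naive_records))    # physical lines after the header
print(len(parsed_records))   # logical records

# The individual measurements can then be split out of the single field.
parts = parsed_records[0]["Dimensions"].split("\n")
print(parts)
```

The naive count is one higher than the parsed count because the vase record occupies two physical lines.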

By country

Objects that have country "Iran|Iran"

Iranian and Persian content

The Country column has both "Iran" and "Iran|Iran" as separate entries. Here's an analysis of the different timespans on the objects in each classification.
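A minimal sketch of that comparison follows. The rows are hypothetical, and the begin and end years are assumed to be signed integers with negative values for BCE. Note the exact string match: a substring test for "Iran" would merge the two values.

```python
# Hypothetical rows: (Country, begin year, end year); negative years are BCE.
rows = [
    ("Iran", -3000, -2500),
    ("Iran", 1200, 1299),
    ("Iran|Iran", 1500, 1599),
    ("Iran|Iran", 900, 1100),
    ("Egypt", -2000, -1900),
]

def timespans(rows, countries):
    """Return {country: (earliest begin year, latest end year, object count)}."""
    spans = {}
    for country, begin, end in rows:
        if country not in countries:   # exact match keeps the values separate
            continue
        lo, hi, n = spans.get(country, (begin, end, 0))
        spans[country] = (min(lo, begin), max(hi, end), n + 1)
    return spans

spans = timespans(rows, {"Iran", "Iran|Iran"})
for country, (lo, hi, n) in spans.items():
    print(f"{country:10s} {lo:6d} to {hi:6d}  ({n} objects)")
```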

Iran

Iran|Iran

Egyptian artifacts

Objects with country=Egypt make up the largest proportion of database entries. They span from hundreds of thousands of years ago to the modern day. To focus on artworks, we filter out objects older than 6000 BCE.
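The filter can be sketched as follows, assuming dates are stored as signed years so that 6000 BCE is -6000. The rows and the tuple layout are hypothetical.

```python
EARLIEST_YEAR = -6000  # 6000 BCE expressed as a signed year

# Hypothetical (Object ID, Country, begin year) rows.
objects = [
    (101, "Egypt", -300000),   # far older than the cutoff, excluded
    (102, "Egypt", -2465),
    (103, "Egypt", 1850),
    (104, "Iran", -500),       # not Egypt, excluded
    (105, "Egypt", -6000),     # exactly on the cutoff, kept
]

egyptian = [
    obj for obj in objects
    if obj[1] == "Egypt" and obj[2] >= EARLIEST_YEAR
]
egyptian_ids = [obj[0] for obj in egyptian]
print(egyptian_ids)
```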

Distribution of Egyptian objects

Below is a histogram showing the different time frames for Egyptian objects. Some of the peaks may be false, because the dates of some objects are estimates rounded to a convenient year (e.g., 2000 BCE).
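The binning behind such a histogram, plus a rough check for rounded dates, might look like the sketch below. The years are made up, and the 500-year bin width is an arbitrary choice for illustration.

```python
from collections import Counter

BIN_WIDTH = 500  # years per bin; an arbitrary choice for illustration

# Hypothetical begin years for Egyptian objects (negative = BCE).
begin_years = [-2000, -2000, -2000, -1981, -1550, -1390, -664, -30, 150]

def bin_start(year, width=BIN_WIDTH):
    """Left edge of the bin containing year; floor division handles negatives."""
    return (year // width) * width

histogram = Counter(bin_start(y) for y in begin_years)
for start in sorted(histogram):
    print(f"{start:6d} to {start + BIN_WIDTH - 1:6d}  {'#' * histogram[start]}")

# Share of dates that are exact multiples of 100, a rough sign of rounding.
round_share = sum(1 for y in begin_years if y % 100 == 0) / len(begin_years)
print(f"{round_share:.0%} of dates fall on a round century")
```

A high share of round-century dates in one bin suggests that the peak reflects cataloguing estimates more than a real cluster of objects.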

Classification analysis

A general breakdown shows that prints, photographs and drawings are the most prevalent.

Some general questions to be addressed:

All objects

Just public domain released objects

For all PD objects, a breakdown of the classifications

Only for objects that have an image

Met highlights

The object "hdf" contains an efficient copy of all the "Met highlights" items, which number around 2,000.

How many of the Met highlight objects are declared as "public domain"?

This count is the maximum number of highlights that could be expected to be in Commons with an image.
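Counting the public domain highlights amounts to filtering on two boolean columns. In this sketch the rows are made up, and both the column names and the "True"/"False" string encoding are assumptions about the export.

```python
import csv
import io

# Made-up rows; booleans are assumed to be stored as "True"/"False" strings.
SAMPLE = """Object ID,Is Highlight,Is Public Domain
1,True,True
2,True,False
3,False,True
4,True,True
"""

def is_true(value):
    """Interpret a CSV cell as a boolean, tolerating case and whitespace."""
    return (value or "").strip().lower() == "true"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
highlights = [r for r in rows if is_true(r["Is Highlight"])]
pd_highlights = [r for r in highlights if is_true(r["Is Public Domain"])]

print(f"{len(pd_highlights)} of {len(highlights)} highlights are public domain")
```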

Checking Met Highlights against Wikidata

The variable "hstring" is a manual list of the 1,900+ Met highlight objects. How many of these are in Wikidata, by manual match?

How many of the 1,900+ Met highlight objects are in Wikidata now?

Perform query to get all Wikidata QIDs of collection->Met, subject has role->collection highlight

Perform query to get all Wikidata QIDs that match the manual list of highlight items (hstring)
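The first of these queries, which looks for a collection statement qualified with "subject has role", could be built roughly as below. The identifiers are assumptions that should be verified on Wikidata before use: P195 for collection, P2868 for subject has role, Q160236 for The Met, Q29188408 for collection highlight, and P3634 for the Met object ID. The function name is hypothetical, and the sketch only builds the query text; it does not send it to the Wikidata Query Service.

```python
def highlight_query(collection_qid="Q160236", role_qid="Q29188408"):
    """Build a SPARQL query for items whose collection statement carries
    a 'subject has role' qualifier. All IDs are assumptions to verify."""
    return f"""
SELECT ?item ?metId WHERE {{
  ?item p:P195 ?statement .
  ?statement ps:P195 wd:{collection_qid} ;
             pq:P2868 wd:{role_qid} .
  OPTIONAL {{ ?item wdt:P3634 ?metId . }}
}}
""".strip()

query = highlight_query()
print(query)
```

The p:/ps:/pq: prefixes are used because the role is a qualifier on the collection statement, not a separate statement on the item.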

Differences between the two highlights lists

If we compare the list of highlights found through "collection/subject has role" with the manual list of object IDs, we find some discrepancies. These seem to be Wikidata items that have "subject has role"->"collection highlight" but that come back as false when checked against the Met API.

All highlight items that have "collection->Met" and "subject has role->collection highlight" but are not in the manually matched rows

This means they are old or mislabeled entries in Wikidata, since the Met no longer considers them highlights. In a call, Jennie Choi of The Met confirmed that these are indeed former highlights and that the claim should be removed.

All highlight items in the manually matched rows that have no statement "collection->Met" and "subject has role"

This implies that the statement with qualifier needs to be added to Wikidata.
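Both comparisons above are plain set differences between the two lists of QIDs. A minimal sketch, using hypothetical QIDs and variable names:

```python
# Hypothetical QIDs returned by each of the two queries.
by_statement = {"Q100", "Q200", "Q300", "Q400"}    # collection + role qualifier
by_manual_list = {"Q200", "Q300", "Q400", "Q500"}  # matched from hstring

# In Wikidata as highlights, but not on the Met's current list:
# the qualifier is stale and is a candidate for removal.
stale_in_wikidata = by_statement - by_manual_list

# On the Met's current list, but lacking the qualified statement:
# the statement with qualifier is a candidate for addition.
missing_in_wikidata = by_manual_list - by_statement

print(sorted(stale_in_wikidata))
print(sorted(missing_in_wikidata))
```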