Image Similarity Tool¶

This notebook is a tutorial for exploring image similarity. The goal is to design an API that, given an input image, returns a collection of similar images.

Install the required dependencies, then download the data and the model.

In [1]:
# install ml dependencies
! pip install tensorflow 
! pip install tensorflow_hub
! pip install opencv-python

# download a python file with helper methods for image similarity
! curl -L https://analytics.wikimedia.org/published/datasets/one-off/image_similarity/image_similarity_tools.py -o image_similarity_tools.py

# download data    
! curl -L https://analytics.wikimedia.org/published/datasets/one-off/image_similarity/microtask_data.tar.gz -o microtask_data.tar.gz
! tar -xf microtask_data.tar.gz
! rm microtask_data.tar.gz
Collecting tensorflow
  Using cached tensorflow-2.6.0-cp38-cp38-manylinux2010_x86_64.whl (458.4 MB)
Collecting google-pasta~=0.2
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Requirement already satisfied: wheel~=0.35 in /srv/paws/lib/python3.8/site-packages (from tensorflow) (0.37.0)
Collecting h5py~=3.1.0
  Using cached h5py-3.1.0-cp38-cp38-manylinux1_x86_64.whl (4.4 MB)
Collecting clang~=5.0
  Using cached clang-5.0-py3-none-any.whl
Collecting typing-extensions~=3.7.4
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting six~=1.15.0
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting absl-py~=0.10
  Using cached absl_py-0.14.1-py3-none-any.whl (131 kB)
Collecting numpy~=1.19.2
  Using cached numpy-1.19.5-cp38-cp38-manylinux2010_x86_64.whl (14.9 MB)
Collecting tensorboard~=2.6
  Using cached tensorboard-2.6.0-py3-none-any.whl (5.6 MB)
Requirement already satisfied: protobuf>=3.9.2 in /srv/paws/lib/python3.8/site-packages (from tensorflow) (3.18.0)
Collecting tensorflow-estimator~=2.6
  Using cached tensorflow_estimator-2.6.0-py2.py3-none-any.whl (462 kB)
Collecting astunparse~=1.6.3
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting grpcio<2.0,>=1.37.0
  Using cached grpcio-1.41.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.9 MB)
Collecting termcolor~=1.1.0
  Using cached termcolor-1.1.0-py3-none-any.whl
Collecting wrapt~=1.12.1
  Using cached wrapt-1.12.1-cp38-cp38-linux_x86_64.whl
Collecting keras-preprocessing~=1.1.2
  Using cached Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Collecting flatbuffers~=1.12.0
  Using cached flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting opt-einsum~=3.3.0
  Using cached opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Collecting gast==0.4.0
  Using cached gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting keras~=2.6
  Using cached keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
Requirement already satisfied: markdown>=2.6.8 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (3.3.4)
Collecting tensorboard-plugin-wit>=1.6.0
  Using cached tensorboard_plugin_wit-1.8.0-py3-none-any.whl (781 kB)
Collecting google-auth-oauthlib<0.5,>=0.4.1
  Using cached google_auth_oauthlib-0.4.6-py2.py3-none-any.whl (18 kB)
Requirement already satisfied: setuptools>=41.0.0 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (57.5.0)
Requirement already satisfied: requests<3,>=2.21.0 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (2.26.0)
Collecting tensorboard-data-server<0.7.0,>=0.6.0
  Using cached tensorboard_data_server-0.6.1-py3-none-manylinux2010_x86_64.whl (4.9 MB)
Collecting google-auth<2,>=1.6.3
  Using cached google_auth-1.35.0-py2.py3-none-any.whl (152 kB)
Requirement already satisfied: werkzeug>=0.11.15 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (2.0.1)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /srv/paws/lib/python3.8/site-packages (from google-auth<2,>=1.6.3->tensorboard~=2.6->tensorflow) (4.2.2)
Collecting rsa<5,>=3.1.4
  Using cached rsa-4.7.2-py3-none-any.whl (34 kB)
Collecting pyasn1-modules>=0.2.1
  Using cached pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /srv/paws/lib/python3.8/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow) (1.3.0)
Collecting pyasn1<0.5.0,>=0.4.6
  Using cached pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
Requirement already satisfied: charset-normalizer~=2.0.0 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (2021.5.30)
Requirement already satisfied: idna<4,>=2.5 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (1.26.7)
Requirement already satisfied: oauthlib>=3.0.0 in /srv/paws/lib/python3.8/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow) (3.1.1)
Installing collected packages: pyasn1, six, rsa, pyasn1-modules, google-auth, tensorboard-plugin-wit, tensorboard-data-server, numpy, grpcio, google-auth-oauthlib, absl-py, wrapt, typing-extensions, termcolor, tensorflow-estimator, tensorboard, opt-einsum, keras-preprocessing, keras, h5py, google-pasta, gast, flatbuffers, clang, astunparse, tensorflow
  Attempting uninstall: six
    Found existing installation: six 1.16.0
    Uninstalling six-1.16.0:
      Successfully uninstalled six-1.16.0
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.2
    Uninstalling typing-extensions-3.10.0.2:
      Successfully uninstalled typing-extensions-3.10.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bokeh 2.4.0 requires typing-extensions>=3.10.0, but you have typing-extensions 3.7.4.3 which is incompatible.
Successfully installed absl-py-0.14.1 astunparse-1.6.3 clang-5.0 flatbuffers-1.12 gast-0.4.0 google-auth-1.35.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.41.0 h5py-3.1.0 keras-2.6.0 keras-preprocessing-1.1.2 numpy-1.19.5 opt-einsum-3.3.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 rsa-4.7.2 six-1.15.0 tensorboard-2.6.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.0 tensorflow-2.6.0 tensorflow-estimator-2.6.0 termcolor-1.1.0 typing-extensions-3.7.4.3 wrapt-1.12.1
Collecting tensorflow_hub
  Using cached tensorflow_hub-0.12.0-py2.py3-none-any.whl (108 kB)
Requirement already satisfied: numpy>=1.12.0 in /srv/paws/lib/python3.8/site-packages (from tensorflow_hub) (1.19.5)
Requirement already satisfied: protobuf>=3.8.0 in /srv/paws/lib/python3.8/site-packages (from tensorflow_hub) (3.18.0)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.12.0
Collecting opencv-python
  Using cached opencv_python-4.5.3.56-cp38-cp38-manylinux2014_x86_64.whl (49.9 MB)
Requirement already satisfied: numpy>=1.17.3 in /srv/paws/lib/python3.8/site-packages (from opencv-python) (1.19.5)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.5.3.56
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2535  100  2535    0     0  46090      0 --:--:-- --:--:-- --:--:-- 46090
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 62.3M  100 62.3M    0     0  28.0M      0  0:00:02  0:00:02 --:--:-- 28.0M

Import the libraries. If this fails after installing the dependencies, restart the kernel and run this cell again.

In [2]:
%load_ext autoreload
%autoreload 2

import cv2
import image_similarity_tools
import os
import random
2021-10-11 13:10:39.628038: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-11 13:10:39.628084: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-11 13:10:42.330725: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-10-11 13:10:42.330771: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-10-11 13:10:42.330829: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (jupyter--41marachi-5f-41): /proc/driver/nvidia/version does not exist
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt
16384/10484 [==============================================] - 0s 0us/step
24576/10484 [======================================================================] - 0s 0us/step

Data¶

The dataset contains pictures of three categories: dog, fox, and wolf. Each image file name is the name of the corresponding file on Wikimedia Commons. We can iterate over the image files in their respective category folders in the data directory.

In [3]:
for root, dirs, files in os.walk("data"):
    if len(files)>0:
        print(f'category {os.path.basename(root)} contains {len(files)} images')
        print(f'\texample images: {files[:2]}')
category Wolf contains 290 images
	example images: ['Loups_siberie.jpg', 'Canis_lupus_signatus_(Kerkrade_Zoo)_26.jpg']
category Fox contains 142 images
	example images: ['Kew_Gardens_-_London_-_September_2008_(2958753889).jpg', 'Vulpes_vulpes_qtl1.jpg']
category Dog contains 283 images
	example images: ['Henry_Tenré.jpg', '(2)_Isha_female_rajapalayam.jpg']
In [4]:
image_similarity_tools.run_me()

Let's start with an initial analysis of the images in the dataset. This can be done by iterating over the files and using a dict to accumulate results, but feel free to install other libraries. A sketch of one possible approach follows the empty cell below.

  • number of images: total and per category
  • resolution of images: average, and are there outliers that might cause problems?
  • file size: total, average, and per category
In [ ]:
# TODO compute statistics
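
For reference, here is a minimal sketch of one way to compute these statistics with plain dicts, assuming only the data layout shown above (data/<Category>/<file>.jpg); cv2 is used solely to read each image's resolution.

In [ ]:
import os
import cv2
from collections import defaultdict

counts = defaultdict(int)   # number of images per category
sizes = defaultdict(int)    # total file size in bytes per category
resolutions = []            # (height, width) of every readable image

for root, dirs, files in os.walk("data"):
    category = os.path.basename(root)
    for f in files:
        path = os.path.join(root, f)
        counts[category] += 1
        sizes[category] += os.path.getsize(path)
        img = cv2.imread(path)
        if img is not None:
            resolutions.append(img.shape[:2])

print(f'total images: {sum(counts.values())}')
for category in counts:
    print(f'\t{category}: {counts[category]} images, '
          f'{sizes[category] / 1e6:.1f} MB total, '
          f'{sizes[category] / counts[category] / 1e3:.0f} kB on average')

heights = [h for h, w in resolutions]
widths = [w for h, w in resolutions]
print(f'average resolution: {sum(heights) / len(heights):.0f} x {sum(widths) / len(widths):.0f}')
print(f'height range: {min(heights)}-{max(heights)}, width range: {min(widths)}-{max(widths)}')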

We can load images as NumPy arrays using the OpenCV library (cv2), and we can plot images using the utilities provided in image_similarity_tools.

In [ ]:
fox = 'data/Fox/Sierra_Nevada_red_fox_1_(cropped).jpg'
dog = 'data/Dog/Yacare_De_El_Siledin.jpg'
wolf = 'data/Wolf/WPZ_Gray_Wolf_02.jpg'

images = [cv2.imread(f) for f in [fox,dog,wolf]]
print(f'the type of fox is {type(images[0])} with shape {images[0].shape}')
image_similarity_tools.plot_images(1,3,images)

We can run inference on an image by passing its file path to image_similarity_tools.classify_image, which returns a list of predictions over the ImageNet labels.

In [ ]:
# print top three likely categories for some images
for image in [fox,dog,wolf]:
    predictions = image_similarity_tools.classify_image(image)
    print(f'Top predictions for file {image}')
    for label, prob in sorted(predictions, key=lambda kv: kv[1],reverse=True)[:3]:
        print(f'\t{label} : {prob}')

Image similarity API¶

In this section we will implement an "API" (in practice just a Python function) that takes an input image and returns a set of similar images from the dataset.

The algorithm argument specifies which approach should be used to generate the image recommendations:

  • "metadata-based": hacky example algorithm that uses the fact that the file path contains the category
  • "random": returns random images
  • "deeplearning": algorithm that uses the label predictions from the run_inference_on_image method to generate similar images
  • "color_distribution": algorithm that uses the distribution of colors in images to recommend similar images

One direct approach is to first extract features from each image and then use a distance metric (e.g. Euclidean) to find the nearest neighbors; a sketch of this approach follows the function skeleton below.

This microtask is based on a Wikimania workshop on image similarity; the workshop notebooks contain useful code for extracting features from an image.

In [ ]:
def get_similar_images(input_image, algorithm, n_recommendation=5):
    """
    Given an input image, this method returns a list of similar images.

    :param input_image: file name of an image
    :param algorithm: algorithm to use for recommendations
    :param n_recommendation: number of images to recommend
    :return: list of image recommendations
    """
    if algorithm == "metadata-based":
        # the category is encoded in the file path, e.g. data/Fox/<file>.jpg
        category_dir, _ = input_image.rsplit("/", 1)
        recommendations = random.sample(os.listdir(category_dir), n_recommendation)
        return [f'{category_dir}/{f}' for f in recommendations]
    elif algorithm == "random":
        # TODO complete
        pass
    elif algorithm == "deeplearning":
        # TODO complete
        pass
    elif algorithm == "color_distribution":
        # TODO complete
        pass
    else:
        raise ValueError(f'algorithm {algorithm} not implemented')
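
As a concrete starting point for the "color_distribution" branch, here is a minimal sketch of the extract-features-then-nearest-neighbors approach described above: a normalized 3D color histogram as the feature vector and Euclidean distance for ranking. The helper names (color_histogram, nearest_by_color) are illustrative and not part of image_similarity_tools.

In [ ]:
import numpy as np

def color_histogram(image_path, bins=8):
    """Return a normalized 3D color histogram of an image as a flat feature vector."""
    image = cv2.imread(image_path)
    hist = cv2.calcHist([image], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def nearest_by_color(input_image, candidates, n_recommendation=5):
    """Rank candidate image paths by Euclidean distance between color histograms."""
    query = color_histogram(input_image)
    distances = []
    for path in candidates:
        if path == input_image:
            continue
        distances.append((np.linalg.norm(query - color_histogram(path)), path))
    distances.sort(key=lambda d: d[0])
    return [path for _, path in distances[:n_recommendation]]

# usage: gather all candidate paths with os.walk("data") as in the statistics cell,
# then call nearest_by_color(wolf, all_images)

In practice the candidate histograms would be computed once and cached rather than recomputed for every query, which already hints at the scaling questions in the Future work section.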

Example queries¶

In [ ]:
get_similar_images(
    input_image='data/Wolf/Timberwolf_Juli_2009_Zoo_Hannover.jpg',
    algorithm="metadata-based")  
In [ ]:
get_similar_images(
    input_image='data/Wolf/Timberwolf_Juli_2009_Zoo_Hannover.jpg',
    algorithm="color_distribution")
In [ ]:
get_similar_images(
    input_image='data/Wolf/Timberwolf_Juli_2009_Zoo_Hannover.jpg',    
    algorithm="deeplearning")

Analysis¶

Let's generate a summary of how the different algorithms perform.

  • For each algorithm, calculate how often the recommended images fall in the same overall category (Dog/Fox/Wolf) as the input image. Can we compute a precision/recall? One possible setup is sketched after the cell below.
  • How do the precision/recall results compare to the "random" baseline algorithm?
In [ ]:
# TODO: build out analyses per instructions above
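
Here is a minimal sketch of such an evaluation, assuming the data layout data/<Category>/<file>.jpg: "precision" here is simply the fraction of recommended images whose category matches the query image's category, averaged over a random sample of query images. The helper names are illustrative.

In [ ]:
import os
import random

def category_of(path):
    """Extract the category (Dog/Fox/Wolf) from a path like data/Fox/<file>.jpg."""
    return os.path.basename(os.path.dirname(path))

def precision_at_k(algorithm, sample_size=20, k=5, seed=0):
    """Average fraction of same-category recommendations over a sample of query images."""
    all_images = [os.path.join(root, f)
                  for root, _, files in os.walk("data") for f in files]
    random.seed(seed)
    scores = []
    for query in random.sample(all_images, sample_size):
        recommendations = get_similar_images(query, algorithm, n_recommendation=k) or []
        if recommendations:
            matches = sum(category_of(r) == category_of(query) for r in recommendations)
            scores.append(matches / len(recommendations))
    return sum(scores) / len(scores) if scores else float('nan')

# the metadata-based algorithm should score close to 1.0 by construction;
# branches that are still TODO return no recommendations and therefore print nan
for algorithm in ["metadata-based", "random", "deeplearning", "color_distribution"]:
    print(algorithm, precision_at_k(algorithm))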

Future work¶

TODO:

  • What are the most computationally expensive steps in the work done in this notebook so far?
  • Can you identify potential challenges in scaling this API to cover more images? For example, if you had 10 million images in the dataset instead of 715, what changes might you make to get_similar_images to keep the API fast?
In [ ]: