Image similarity tool

This notebook provides a tutorial to explore image similarity. The goal is to design an API that, given an input image, returns a collection of similar images.
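To make the goal concrete, here is a minimal sketch of what such an API could look like. It assumes each image has already been reduced to a numeric feature vector and ranks candidates by cosine similarity; the notebook does not prescribe either choice, and the names `cosine_similarity`, `most_similar` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (name, score) pairs from the index that are closest to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real image features (assumption).
index = {
    "fox.jpg": [0.9, 0.1, 0.0],
    "wolf.jpg": [0.1, 0.9, 0.2],
    "dog.jpg": [0.2, 0.8, 0.3],
}
print(most_similar([0.15, 0.85, 0.25], index, k=2))
```

In a real tool the vectors would come from a feature extractor and the linear scan would likely be replaced by an index structure, but the input and output shape of the API can stay the same.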

Install the required dependencies, then download the data and the model.

In [1]:
# install ml dependencies
! pip install tensorflow 
! pip install tensorflow_hub
! pip install opencv-python

# download a python file with helper methods for image similarity
! curl -L https://analytics.wikimedia.org/published/datasets/one-off/image_similarity/image_similarity_tools.py -o image_similarity_tools.py

# download data
! curl -L https://analytics.wikimedia.org/published/datasets/one-off/image_similarity/microtask_data.tar.gz -o microtask_data.tar.gz
! tar -xf microtask_data.tar.gz
! rm microtask_data.tar.gz
Collecting tensorflow
  Downloading tensorflow-2.6.0-cp38-cp38-manylinux2010_x86_64.whl (458.4 MB)
Collecting keras~=2.6
  Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
Requirement already satisfied: protobuf>=3.9.2 in /srv/paws/lib/python3.8/site-packages (from tensorflow) (3.18.1)
Requirement already satisfied: wheel~=0.35 in /srv/paws/lib/python3.8/site-packages (from tensorflow) (0.37.0)
Collecting wrapt~=1.12.1
  Downloading wrapt-1.12.1.tar.gz (27 kB)
Collecting six~=1.15.0
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting h5py~=3.1.0
  Downloading h5py-3.1.0-cp38-cp38-manylinux1_x86_64.whl (4.4 MB)
Collecting gast==0.4.0
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting astunparse~=1.6.3
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers~=1.12.0
  Downloading flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting opt-einsum~=3.3.0
  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Collecting typing-extensions~=3.7.4
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting google-pasta~=0.2
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting grpcio<2.0,>=1.37.0
  Downloading grpcio-1.41.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.9 MB)
Collecting absl-py~=0.10
  Downloading absl_py-0.15.0-py3-none-any.whl (132 kB)
Collecting termcolor~=1.1.0
  Downloading termcolor-1.1.0.tar.gz (3.9 kB)
Collecting numpy~=1.19.2
  Downloading numpy-1.19.5-cp38-cp38-manylinux2010_x86_64.whl (14.9 MB)
Collecting clang~=5.0
  Downloading clang-5.0.tar.gz (30 kB)
Collecting tensorboard~=2.6
  Downloading tensorboard-2.7.0-py3-none-any.whl (5.8 MB)
Collecting keras-preprocessing~=1.1.2
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Collecting tensorflow-estimator~=2.6
  Downloading tensorflow_estimator-2.7.0-py2.py3-none-any.whl (463 kB)
Collecting google-auth<3,>=1.6.3
  Downloading google_auth-2.3.2-py2.py3-none-any.whl (155 kB)
Requirement already satisfied: requests<3,>=2.21.0 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (2.26.0)
Requirement already satisfied: markdown>=2.6.8 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (3.3.4)
Collecting tensorboard-plugin-wit>=1.6.0
  Downloading tensorboard_plugin_wit-1.8.0-py3-none-any.whl (781 kB)
Collecting tensorboard-data-server<0.7.0,>=0.6.0
  Downloading tensorboard_data_server-0.6.1-py3-none-manylinux2010_x86_64.whl (4.9 MB)
Collecting google-auth-oauthlib<0.5,>=0.4.1
  Downloading google_auth_oauthlib-0.4.6-py2.py3-none-any.whl (18 kB)
Requirement already satisfied: setuptools>=41.0.0 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (58.2.0)
Requirement already satisfied: werkzeug>=0.11.15 in /srv/paws/lib/python3.8/site-packages (from tensorboard~=2.6->tensorflow) (2.0.2)
Collecting pyasn1-modules>=0.2.1
  Downloading pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /srv/paws/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow) (4.2.4)
Collecting rsa<5,>=3.1.4
  Downloading rsa-4.7.2-py3-none-any.whl (34 kB)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /srv/paws/lib/python3.8/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow) (1.3.0)
Collecting pyasn1<0.5.0,>=0.4.6
  Downloading pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
Requirement already satisfied: idna<4,>=2.5 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (3.2)
Requirement already satisfied: charset-normalizer~=2.0.0 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (2021.5.30)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /srv/paws/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard~=2.6->tensorflow) (1.26.7)
Requirement already satisfied: oauthlib>=3.0.0 in /srv/paws/lib/python3.8/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow) (3.1.1)
Building wheels for collected packages: clang, termcolor, wrapt
  Building wheel for clang (setup.py) ... done
  Created wheel for clang: filename=clang-5.0-py3-none-any.whl size=30692 sha256=469055f501e91cef1fc42e69248003c86ac99fa781b613cad8fb8c1e2bb847d9
  Stored in directory: /home/paws/.cache/pip/wheels/f1/60/77/22b9b5887bd47801796a856f47650d9789c74dc3161a26d608
  Building wheel for termcolor (setup.py) ... done
  Created wheel for termcolor: filename=termcolor-1.1.0-py3-none-any.whl size=4847 sha256=022209d01739e6265d6a921df4544fc9f9e02d47f05f67c61871735da3ae1662
  Stored in directory: /home/paws/.cache/pip/wheels/a0/16/9c/5473df82468f958445479c59e784896fa24f4a5fc024b0f501
  Building wheel for wrapt (setup.py) ... done
  Created wheel for wrapt: filename=wrapt-1.12.1-cp38-cp38-linux_x86_64.whl size=78578 sha256=a96babaa40d461d2b5fae41d28770ee1f5b739f4b05474e4c50c38c010d89d4b
  Stored in directory: /home/paws/.cache/pip/wheels/5f/fd/9e/b6cf5890494cb8ef0b5eaff72e5d55a70fb56316007d6dfe73
Successfully built clang termcolor wrapt
Installing collected packages: pyasn1, six, rsa, pyasn1-modules, google-auth, tensorboard-plugin-wit, tensorboard-data-server, numpy, grpcio, google-auth-oauthlib, absl-py, wrapt, typing-extensions, termcolor, tensorflow-estimator, tensorboard, opt-einsum, keras-preprocessing, keras, h5py, google-pasta, gast, flatbuffers, clang, astunparse, tensorflow
  Attempting uninstall: six
    Found existing installation: six 1.16.0
    Uninstalling six-1.16.0:
      Successfully uninstalled six-1.16.0
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.2
    Uninstalling typing-extensions-3.10.0.2:
      Successfully uninstalled typing-extensions-3.10.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bokeh 2.4.0 requires typing-extensions>=3.10.0, but you have typing-extensions 3.7.4.3 which is incompatible.
Successfully installed absl-py-0.15.0 astunparse-1.6.3 clang-5.0 flatbuffers-1.12 gast-0.4.0 google-auth-2.3.2 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.41.1 h5py-3.1.0 keras-2.6.0 keras-preprocessing-1.1.2 numpy-1.19.5 opt-einsum-3.3.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 rsa-4.7.2 six-1.15.0 tensorboard-2.7.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.0 tensorflow-2.6.0 tensorflow-estimator-2.7.0 termcolor-1.1.0 typing-extensions-3.7.4.3 wrapt-1.12.1
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/srv/paws/bin/python3 -m pip install --upgrade pip' command.
Collecting tensorflow_hub
  Downloading tensorflow_hub-0.12.0-py2.py3-none-any.whl (108 kB)
Requirement already satisfied: protobuf>=3.8.0 in /srv/paws/lib/python3.8/site-packages (from tensorflow_hub) (3.18.1)
Requirement already satisfied: numpy>=1.12.0 in /srv/paws/lib/python3.8/site-packages (from tensorflow_hub) (1.19.5)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.12.0
Collecting opencv-python
  Downloading opencv_python-4.5.4.58-cp38-cp38-manylinux2014_x86_64.whl (60.3 MB)
Requirement already satisfied: numpy>=1.17.3 in /srv/paws/lib/python3.8/site-packages (from opencv-python) (1.19.5)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.5.4.58

Import the libraries. If the import fails right after installing the dependencies, restart the kernel and run the cell again.

In [1]:
%load_ext autoreload
%autoreload 2

import cv2
import image_similarity_tools
import os
import random
2021-10-30 18:40:37.106653: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /srv/paws/lib/python3.8/site-packages/cv2/../../lib64:
2021-10-30 18:40:37.106693: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_118/1035847415.py in <module>
      3 
      4 import cv2
----> 5 import image_similarity_tools
      6 import os
      7 import random

~/image_similarity_tools.py in <module>
      5 import os
      6 import tensorflow as tf
----> 7 import tensorflow_hub as hub
      8 
      9 '''

/srv/paws/lib/python3.8/site-packages/tensorflow_hub/__init__.py in <module>
     86 
     87 
---> 88 from tensorflow_hub.estimator import LatestModuleExporter
     89 from tensorflow_hub.estimator import register_module_for_export
     90 from tensorflow_hub.feature_column import image_embedding_column

/srv/paws/lib/python3.8/site-packages/tensorflow_hub/estimator.py in <module>
     60 
     61 
---> 62 class LatestModuleExporter(tf.compat.v1.estimator.Exporter):
     63   """Regularly exports registered modules into timestamped directories.
     64 

/srv/paws/lib/python3.8/site-packages/tensorflow/python/util/lazy_loader.py in __getattr__(self, item)
     60 
     61   def __getattr__(self, item):
---> 62     module = self._load()
     63     return getattr(module, item)
     64 

/srv/paws/lib/python3.8/site-packages/tensorflow/python/util/lazy_loader.py in _load(self)
     43     """Load the module and insert it into the parent's globals."""
     44     # Import the target module and insert it into the parent's namespace
---> 45     module = importlib.import_module(self.__name__)
     46     self._parent_module_globals[self._local_name] = module
     47 

/usr/lib/python3.8/importlib/__init__.py in import_module(name, package)
    125                 break
    126             level += 1
--> 127     return _bootstrap._gcd_import(name[level:], package, level)
    128 
    129 

/srv/paws/lib/python3.8/site-packages/tensorflow_estimator/__init__.py in <module>
      8 import sys as _sys
      9 
---> 10 from tensorflow_estimator._api.v1 import estimator
     11 
     12 del _print_function

/srv/paws/lib/python3.8/site-packages/tensorflow_estimator/_api/v1/estimator/__init__.py in <module>
      8 import sys as _sys
      9 
---> 10 from tensorflow_estimator._api.v1.estimator import experimental
     11 from tensorflow_estimator._api.v1.estimator import export
     12 from tensorflow_estimator._api.v1.estimator import inputs

/srv/paws/lib/python3.8/site-packages/tensorflow_estimator/_api/v1/estimator/experimental/__init__.py in <module>
      8 import sys as _sys
      9 
---> 10 from tensorflow_estimator.python.estimator.canned.dnn import dnn_logit_fn_builder
     11 from tensorflow_estimator.python.estimator.canned.kmeans import KMeansClustering as KMeans
     12 from tensorflow_estimator.python.estimator.canned.linear import LinearSDCA

/srv/paws/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py in <module>
     25 from tensorflow.python.framework import ops
     26 from tensorflow.python.util.tf_export import estimator_export
---> 27 from tensorflow_estimator.python.estimator import estimator
     28 from tensorflow_estimator.python.estimator.canned import head as head_lib
     29 from tensorflow_estimator.python.estimator.canned import optimizers

/srv/paws/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py in <module>
     68 
     69 @estimator_export(v1=['estimator.Estimator'])
---> 70 @doc_controls.inheritable_header("""\
     71   Warning: Estimators are not recommended for new code.  Estimators run
     72   `v1.Session`-style code which is more difficult to write correctly, and

AttributeError: module 'tensorflow.tools.docs.doc_controls' has no attribute 'inheritable_header'
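The traceback ends inside tensorflow_estimator, and the install log shows tensorflow 2.6.0 installed next to tensorflow-estimator 2.7.0. A mismatch between those release lines is a plausible cause of the missing attribute, although this notebook does not confirm it. The sketch below lists the installed versions and checks whether two version strings share the same major.minor; the helper names are hypothetical, and it assumes plain numeric version strings such as "2.6.0".

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return the (major, minor) components of a version string as integers."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_release_line(version_a: str, version_b: str) -> bool:
    """True when both versions share the same major.minor release line."""
    return major_minor(version_a) == major_minor(version_b)


# Report what is actually installed in the current environment.
for name in ("tensorflow", "tensorflow-estimator", "keras"):
    print(name, installed_version(name))

# Versions taken from the install log in this notebook.
print(same_release_line("2.6.0", "2.7.0"))  # False
```

If the check reports a mismatch, installing a tensorflow-estimator release from the same line as tensorflow and restarting the kernel is one thing to try.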

Data

The dataset contains pictures from three categories: dog, fox and wolf. Each image file name is the name of the file on Wikimedia Commons. We can iterate over the image files in their respective category folders in the data directory.

In [3]:
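# walk the data directory; each category is a sub-folder holding its images
for root, dirs, files in os.walk("data"):
    if files:  # skip folders that contain no files, such as the top-level one
        print(f'category {os.path.basename(root)} contains {len(files)} images')
        print(f'\texample images: {files[:2]}')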
category Wolf contains 290 images
	example images: ['Loups_siberie.jpg', 'Canis_lupus_signatus_(Kerkrade_Zoo)_26.jpg']
category Fox contains 142 images
	example images: ['Kew_Gardens_-_London_-_September_2008_(2958753889).jpg', 'Vulpes_vulpes_qtl1.jpg']
category Dog contains 283 images
	example images: ['Henry_Tenré.jpg', '(2)_Isha_female_rajapalayam.jpg']
In [ ]:
image_similarity_tools.run_me()

Let's start with an initial analysis of the images in the dataset. This can be done by iterating over the files and using a dict to accumulate the results, but feel free to install other libraries.

  • number of images: total and per category
  • resolution of images: average, and are there outliers that might cause problems?
  • file size: total, average and per category
In [ ]:
# TODO compute statistics
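One possible way to approach the statistics, as a sketch rather than a reference solution: walk the category folders, accumulate counts, file sizes and resolutions in a dict, then reduce them to totals and averages. The function names and the `read_shape` parameter are hypothetical; the default reader assumes OpenCV is installed and imports it lazily.

```python
import os
from collections import defaultdict
from statistics import mean


def read_shape_cv2(path):
    """Return (height, width) of an image, or None if OpenCV cannot decode it."""
    import cv2  # imported lazily so the rest of the sketch runs without OpenCV

    pixels = cv2.imread(path)
    return None if pixels is None else pixels.shape[:2]


def collect_stats(data_dir, read_shape=read_shape_cv2):
    """Accumulate image count, file sizes and resolutions per category folder."""
    stats = defaultdict(lambda: {"count": 0, "bytes": [], "shapes": []})
    for root, _dirs, files in os.walk(data_dir):
        category = os.path.basename(root)
        for name in files:
            path = os.path.join(root, name)
            entry = stats[category]
            entry["count"] += 1
            entry["bytes"].append(os.path.getsize(path))
            shape = read_shape(path)
            if shape is not None:
                entry["shapes"].append(shape)
    return dict(stats)


def summarize(stats):
    """Reduce the accumulated lists to totals, averages and extremes per category."""
    summary = {}
    for category, entry in stats.items():
        row = {
            "count": entry["count"],
            "total_bytes": sum(entry["bytes"]),
            "mean_bytes": mean(entry["bytes"]),
        }
        if entry["shapes"]:
            row["mean_height"] = mean(h for h, _w in entry["shapes"])
            row["mean_width"] = mean(w for _h, w in entry["shapes"])
            # Smallest and largest images by pixel count, to spot outliers.
            row["min_shape"] = min(entry["shapes"], key=lambda s: s[0] * s[1])
            row["max_shape"] = max(entry["shapes"], key=lambda s: s[0] * s[1])
        summary[category] = row
    return summary


if os.path.isdir("data"):
    for category, row in sorted(summarize(collect_stats("data")).items()):
        print(category, row)
```

Dataset-wide totals follow by summing `count` and `total_bytes` over the categories. Passing a different `read_shape` makes it easy to swap in another image library.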

The image pixels can be loaded as NumPy arrays using the OpenCV library, and images can be displayed using image_similarity_tools.plot.

In [2]:
fox = 'data/Fox/Sierra_Nevada_red_fox_1_(cropped).jpg'
dog = 'data/Dog/Yacare_De_El_Siledin.jpg'
wolf = 'data/Wolf/WPZ_Gray_Wolf_02.jpg'
images_fn = [fox, dog, wolf]

# load pixels; each array has shape (height, width, channels)
image_pixels = [cv2.imread(f) for f in images_fn]
for fn, im in zip(images_fn, image_pixels):
    print(f'the type of {fn} is {type(im)} with shape {im.shape}')

# display
image_similarity_tools.plot(images_fn, (1, 3))
the type of data/Fox/Sierra_Nevada_red_fox_1_(cropped).jpg is <class 'numpy.ndarray'> with shape (268, 600, 3)
the type of data/Dog/Yacare_De_El_Siledin.jpg is <class 'numpy.ndarray'> with shape (398, 600, 3)
the type of data/Wolf/WPZ_Gray_Wolf_02.jpg is <class 'numpy.ndarray'> with shape (399, 600, 3)
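Two properties of `cv2.imread` are worth keeping in mind when working with these arrays: it returns `None` instead of raising when a file cannot be read, and it stores the channels in BGR order rather than RGB. The sketch below illustrates both on a synthetic array, so it needs only NumPy; `bgr_to_rgb` is a hypothetical helper, not part of image_similarity_tools.

```python
import numpy as np


def bgr_to_rgb(pixels):
    """Reverse the channel axis of a (height, width, 3) BGR array."""
    if pixels is None:
        raise ValueError("image could not be read (cv2.imread returns None on failure)")
    return pixels[..., ::-1]


# Synthetic 2x4 "image" that is pure blue in OpenCV's BGR order.
bgr = np.zeros((2, 4, 3), dtype=np.uint8)
bgr[..., 0] = 255

rgb = bgr_to_rgb(bgr)
height, width, channels = rgb.shape
print(height, width, channels)  # 2 4 3
print(rgb[0, 0].tolist())       # [0, 0, 255]
```

The shapes printed above also show that the three example images have different heights, so any step that compares pixels or feeds a model will probably need to resize them to a common shape first.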