This notebook is a tutorial on image similarity. The goal is to design an API that, given an input image, returns a collection of similar images.
Install the required dependencies, then download the data and the model.
# install ml dependencies
! pip install tensorflow
! pip install tensorflow_hub
! pip install opencv-python
# download a python file with helper methods for image similarity
! curl -L https://analytics.wikimedia.org/published/datasets/one-off/image_similarity/image_similarity_tools.py -o image_similarity_tools.py
# download data
! curl -L https://analytics.wikimedia.org/published/datasets/one-off/image_similarity/microtask_data.tar.gz -o microtask_data.tar.gz
! tar -xf microtask_data.tar.gz
! rm microtask_data.tar.gz
[pip output truncated: tensorflow-2.6.0, tensorflow-hub-0.12.0, and opencv-python-4.5.3.56 installed successfully; curl downloaded image_similarity_tools.py and microtask_data.tar.gz]
Import the libraries. If the imports fail after installing the dependencies, restart the kernel.
%load_ext autoreload
%autoreload 2
import cv2
import image_similarity_tools
import os
import random
[TensorFlow warnings about missing CUDA libraries truncated; they can be ignored when no GPU is available]
[ImageNet labels downloaded from https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt]
There are pictures in three categories: Dog, Fox, and Wolf. Each image file name is the name of the corresponding file on Wikimedia Commons. We can iterate over the image files in their respective category folders in the data directory.
for root, dirs, files in os.walk("data"):
if len(files)>0:
print(f'category {os.path.basename(root)} contains {len(files)} images')
print(f'\texample images: {files[:2]}')
category Wolf contains 290 images
	example images: ['Loups_siberie.jpg', 'Canis_lupus_signatus_(Kerkrade_Zoo)_26.jpg']
category Fox contains 142 images
	example images: ['Kew_Gardens_-_London_-_September_2008_(2958753889).jpg', 'Vulpes_vulpes_qtl1.jpg']
category Dog contains 283 images
	example images: ['Henry_Tenré.jpg', '(2)_Isha_female_rajapalayam.jpg']
image_similarity_tools.run_me()
Let's start with an initial analysis of the images in the dataset. This can be done by iterating over the files and using a dict to accumulate results, but feel free to install other libraries.
# TODO compute statistics
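As a starting point, here is a minimal sketch of such an analysis; the particular statistics (file count and mean file size per category) are just one example of what could be accumulated.
# One possible sketch: accumulate per-category counts and file sizes in a dict.
# Which statistics to compute is left open; this only illustrates the pattern.
stats = {}
for root, dirs, files in os.walk("data"):
    if files:
        sizes = [os.path.getsize(os.path.join(root, f)) for f in files]
        stats[os.path.basename(root)] = {
            "count": len(files),
            "mean_size_kb": round(sum(sizes) / len(sizes) / 1024, 1),
        }
print(stats)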
We can load images as numpy arrays using the OpenCV library, and we can plot them using the utilities provided in image_similarity_tools.
fox = 'data/Fox/Sierra_Nevada_red_fox_1_(cropped).jpg'
dog = 'data/Dog/Yacare_De_El_Siledin.jpg'
wolf = 'data/Wolf/WPZ_Gray_Wolf_02.jpg'
images = [cv2.imread(f) for f in [fox,dog,wolf]]
print(f'the type of fox is {type(images[0])} with shape {images[0].shape}')
image_similarity_tools.plot_images(1,3,images)
We can run inference on an image by passing its file path to image_similarity_tools.classify_image, which returns a list of predictions over the ImageNet labels.
# print top three likely categories for some images
for image in [fox,dog,wolf]:
predictions = image_similarity_tools.classify_image(image)
print(f'Top predictions for file {image}')
for label, prob in sorted(predictions, key=lambda kv: kv[1],reverse=True)[:3]:
print(f'\t{label} : {prob}')
In this section we will implement an "API" (in practice just a Python function) that takes an input image and returns a set of similar images from the dataset.
The algorithm argument specifies which approach is used to generate the image recommendations.
One direct approach is to first extract features from the image and then use a distance metric (e.g. Euclidean) to find its nearest neighbors.
This microtask is based on a Wikimania workshop on image similarity; the workshop notebooks contain useful code for extracting features from an image.
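To make the idea concrete, here is a minimal sketch of feature extraction plus nearest-neighbor lookup, assuming a generic image feature-vector module from TensorFlow Hub; the MobileNet URL below is an arbitrary choice, not necessarily the extractor used in the workshop.
import numpy as np
import tensorflow_hub as hub

# One arbitrary choice of extractor: any TF Hub image feature-vector
# module with a 224x224 input would work the same way.
embedder = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/5")

def extract_features(path):
    # OpenCV loads BGR; convert to RGB, resize, and scale to [0, 1].
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224)).astype("float32") / 255.0
    return embedder(image[None, ...]).numpy()[0]

def nearest_neighbors(query_image, candidate_images, n=5):
    # Rank candidates by Euclidean distance in feature space.
    query = extract_features(query_image)
    ranked = sorted(
        candidate_images,
        key=lambda c: np.linalg.norm(query - extract_features(c)))
    return ranked[:n]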
def get_similar_images(input_image, algorithm, n_recommendation=5):
"""
Given an input image, this method returns a list of similar images.
:param input_image: file name of an image
    :param algorithm: algorithm to use for recommendations
    :param n_recommendation: number of recommendations to return
:return: list of image recommendations
"""
if algorithm=="metadata-based":
category_dir, _ = input_image.rsplit("/",1)
recommendations = random.sample(os.listdir(category_dir),n_recommendation)
return [f'{category_dir}/{f}' for f in recommendations]
elif algorithm == "random":
# TODO complete
pass
elif algorithm == "deeplearning":
# TODO complete
pass
elif algorithm == "color_distribution":
# TODO complete
        pass
else:
raise ValueError(f'algorithm {algorithm} not implemented')
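For reference, the random baseline could take a shape like the sketch below: sample uniformly from all images in the dataset, ignoring the input image entirely. This is one possible solution, not the prescribed one.
# One possible shape for the "random" baseline: ignore the input image
# and sample n_recommendation files uniformly from the whole dataset.
def random_recommendations(data_dir="data", n_recommendation=5):
    all_images = [
        os.path.join(root, f)
        for root, dirs, files in os.walk(data_dir)
        for f in files
    ]
    return random.sample(all_images, n_recommendation)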
get_similar_images(
input_image='data/Wolf/Timberwolf_Juli_2009_Zoo_Hannover.jpg',
algorithm="metadata-based")
get_similar_images(
input_image='data/Wolf/Timberwolf_Juli_2009_Zoo_Hannover.jpg',
algorithm="color_distribution")
get_similar_images(
input_image='data/Wolf/Timberwolf_Juli_2009_Zoo_Hannover.jpg',
algorithm="deeplearning")
Let's generate a summary of how the different approaches perform.
# TODO: build out analyses per instructions above
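One possible summary metric (a sketch, assuming the directory layout above encodes the ground-truth category): precision@k, i.e. the fraction of recommendations that share the query image's category.
# Sketch: precision@k per algorithm, using the category folder as the label.
def precision_at_k(algorithm, queries, k=5):
    hits, total = 0, 0
    for query in queries:
        category = query.split("/")[1]  # e.g. 'Wolf' from 'data/Wolf/...'
        recommendations = get_similar_images(query, algorithm,
                                             n_recommendation=k) or []
        hits += sum(1 for rec in recommendations if f'/{category}/' in rec)
        total += k
    return hits / total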
TODO: