## Language-tag similarity

    topics_parent_lang_df = pd.read_csv('topics_parent_lang.csv', encoding='utf8', header=None, index_col=0)
    topics_parent_lang_dict = topics_parent_lang_df[1].to_dict()
    topics_parent_lang_dict

    with open('topics_parent_lang.json', 'w') as f:
        json.dump(topics_parent_lang_dict, f)

Get the names of all the images.

We want to align `image_tags_dict` with the files in `lang_images`, but the filenames written to disk are not encoded the same way as the names on Wikipedia because of Unicode normalization. It appears that writing to disk decomposed accented characters into base characters plus combining modifiers (a decomposed form such as NFD), whereas Wikipedia uses the composed representation (such as NFC), so names that look identical do not compare equal.
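A minimal sketch of one way to reconcile the two encodings, assuming the mismatch is a composed/decomposed difference as described: normalize both sides to NFC with the standard library's `unicodedata.normalize` before comparing. The filename, the tag list, and the names `nfc`, `tags_by_name`, and `files_on_disk` are made up for illustration and are not the notebook's actual data.

```python
import unicodedata


def nfc(name: str) -> str:
    """Return the NFC (composed) form of a filename."""
    return unicodedata.normalize("NFC", name)


# Hypothetical example: the same visible name in two encodings.
composed = "Caf\u00e9.jpg"      # 'é' as one code point, as on Wikipedia
decomposed = "Cafe\u0301.jpg"   # 'e' + combining acute accent, as found on disk

# The raw strings differ, so a plain dictionary lookup would miss.
assert composed != decomposed
assert nfc(composed) == nfc(decomposed)

# Stand-ins for the tag dictionary and the directory listing.
tags_by_name = {composed: ["building"]}
files_on_disk = [decomposed]

# Normalize the dictionary keys once, then look up normalized filenames.
normalized_tags = {nfc(name): tags for name, tags in tags_by_name.items()}
matched = [f for f in files_on_disk if nfc(f) in normalized_tags]
```

Normalizing both sides, rather than only the filenames, avoids depending on which form either source happens to use.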

These are images that were downloaded and tagged but apparently do not appear in any language.
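One way such a list could be computed is a set difference between the tagged image names and the union of names seen across all languages. The sketch below assumes a hypothetical mapping `images_by_lang` from language code to image names. The notebook's real data structures may differ, and the function name `tagged_but_unused` is illustrative only.

```python
import unicodedata


def tagged_but_unused(tagged_names, images_by_lang):
    """Return tagged image names that appear in no language's image list.

    Names are compared in NFC form so that composed and decomposed
    spellings of the same name count as equal.
    """
    used = {
        unicodedata.normalize("NFC", name)
        for names in images_by_lang.values()
        for name in names
    }
    return sorted(
        name for name in tagged_names
        if unicodedata.normalize("NFC", name) not in used
    )


# Hypothetical data for illustration.
example_tagged = ["A.jpg", "B.jpg", "Cafe\u0301.jpg"]
example_by_lang = {"en": ["A.jpg"], "fr": ["Caf\u00e9.jpg"]}
orphans = tagged_but_unused(example_tagged, example_by_lang)
```

In this example only `B.jpg` is reported, because the decomposed `Café.jpg` matches its composed counterpart after normalization.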