Load supporting data

There are some bad images we should avoid because they appear frequently in many articles. Define them here.
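A rough sketch of what this definition might look like (the filenames below are placeholders, not the actual list identified during the analysis); a set gives fast membership checks when filtering image lists:

# Placeholder examples of frequently reused maintenance images; replace with the actual filenames found in the data
bad_images = {
    'File:Commons-logo.svg',
    'File:Wiki letter w.svg',
    'File:Question book-new.svg',
}

def is_bad_image(file_title):
    """Hypothetical helper: return True if the file should be excluded from the analysis."""
    return file_title in bad_images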

Some basic setup for testing.

Data scraping functions

There are five sets of data scraping functions:

We also use some helper functions:
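One helper the image functions below depend on is chunk_list, which splits a long list of file titles into API-sized batches. A minimal sketch, assuming the real helper behaves this way:

def chunk_list(l, chunk_size):
    """Split list l into consecutive chunks of at most chunk_size items.
    Sketch of the helper assumed by get_global_usage below."""
    return [l[i:i + chunk_size] for i in range(0, len(l), chunk_size)]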

Wikidata scraping
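A minimal sketch of how interlanguage connections could be pulled from Wikidata with the wbgetentities endpoint; the function name and details are assumptions, not the notebook's actual implementation:

import requests

def get_wikidata_sitelinks(page_title, lang='en'):
    """Sketch: look up the Wikidata item for a page and return its sitelinks,
    i.e. the titles of the same article in other language editions."""
    params = {
        'action': 'wbgetentities',
        'format': 'json',
        'sites': '{0}wiki'.format(lang),
        'titles': page_title,
        'props': 'sitelinks',
    }
    response = requests.get('https://www.wikidata.org/w/api.php', params=params).json()
    sitelinks = {}
    for entity in response.get('entities', {}).values():
        for site, link in entity.get('sitelinks', {}).items():
            sitelinks[site] = link['title']
    return sitelinks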

Categories

We need to get all the members of categories, specifically to define the set of pages we will look at.
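A sketch of how this can be done with the MediaWiki categorymembers endpoint, following API continuation; the real get_category_members may differ in details such as namespace filtering:

import requests

def get_category_members(category, lang='en'):
    """Sketch: return the titles of all main-namespace pages in a category.
    Assumes the category name includes the 'Category:' prefix."""
    members = []
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmnamespace': 0,
        'cmlimit': 500,
    }
    while True:
        response = requests.get('https://{0}.wikipedia.org/w/api.php'.format(lang), params=params).json()
        members += [member['title'] for member in response['query']['categorymembers']]
        if 'continue' in response:
            params.update(response['continue'])
        else:
            break
    return members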

Testing

A combination of functions is needed to get all the links that connect language versions of the same article.

interlanguage_link_usage = get_interlanguage_link_usage(page_title)
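One plausible implementation of get_interlanguage_link_usage queries the langlinks prop on the source language edition; the sketch below is an assumption, not the notebook's actual code:

import requests

def get_interlanguage_link_usage(page_title, lang='en'):
    """Sketch: return a dictionary mapping language codes to the title of the
    corresponding article in that language edition."""
    params = {
        'action': 'query',
        'format': 'json',
        'titles': page_title,
        'prop': 'langlinks',
        'lllimit': 500,
    }
    response = requests.get('https://{0}.wikipedia.org/w/api.php'.format(lang), params=params).json()
    interlanguage_links = {}
    for page in response['query']['pages'].values():
        for link in page.get('langlinks', []):
            interlanguage_links[link['lang']] = link['*']
    return interlanguage_links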

Images

A combination of functions is needed to gather all the image usage data we need.
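For example, the file names used on a single article can be collected with the images prop; the helper name below is hypothetical:

import requests

def get_page_images(page_title, lang='en'):
    """Sketch: return the list of file titles used on a page."""
    params = {
        'action': 'query',
        'format': 'json',
        'titles': page_title,
        'prop': 'images',
        'imlimit': 500,
    }
    response = requests.get('https://{0}.wikipedia.org/w/api.php'.format(lang), params=params).json()
    image_titles = []
    for page in response['query']['pages'].values():
        image_titles += [image['title'] for image in page.get('images', [])]
    return image_titles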

I don't trust the globalusage endpoint after a lot of testing and debugging:

def get_global_usage(page_image_list, lang='en'):
    """The function accepts a list of filenames and returns a dictionary containing the other languages in which they appear

    page_image_list - a list of file names
    lang - a string (typically two letter ISO 639-1 code) for the language edition, defaults to "en"

    Returns:
    filelink_dict - a dictionary keyed by filename, returning a dictionary keyed by language code, returning a list of the article titles in which the image appears
    """
    # Assumes requests is imported and chunk_list is defined earlier in the notebook
    _filelink_dict = dict()
    page_image_chunks = chunk_list(page_image_list, 5)

    for image_chunk in page_image_chunks:
        # Query the current chunk of file titles rather than the hard-coded debugging values left in the original
        query_url = "https://{0}.wikipedia.org/w/api.php".format(lang)
        params = {
            'action': 'query',
            'format': 'json',
            'titles': '|'.join(image_chunk),
            'prop': 'globalusage',
            'guprop': 'url|namespace',
            'gulimit': 500,
        }
        json_response = requests.get(query_url, params=params).json()

        for _id, payload in json_response['query']['pages'].items():
            file_title = 'File:' + payload['title'].split(':')[1]

            if 'globalusage' in payload:
                _image_list = payload['globalusage']

                # Empty dictionary to be keyed by language with a list of article titles as values
                clean_image_dict = {}

                for linked_image in _image_list:
                    # The API may return the namespace as a string or an int, so compare as strings
                    if 'ns' in linked_image and str(linked_image['ns']) == '0':
                        title = linked_image['title']
                        # Use a separate name so the lang parameter used in the query is not overwritten
                        link_lang = linked_image['wiki'].split('.')[0]
                        if 'wikipedia.org' in linked_image['wiki']:
                            if link_lang in clean_image_dict:
                                clean_image_dict[link_lang].append(title)
                            else:
                                clean_image_dict[link_lang] = [title]

                if any(len(page_list) > 25 for page_list in clean_image_dict.values()):
                    print("Unusually high global usage on {0}".format(file_title))

                _filelink_dict[file_title] = clean_image_dict

    return _filelink_dict

Testing images

Revisions
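A sketch of how revision metadata could be collected with the revisions prop, following API continuation to walk the full history; the notebook's own revision scraper may differ:

import requests

def get_page_revisions(page_title, lang='en'):
    """Sketch: return revision metadata (ids, timestamp, user, size) for a page."""
    revisions = []
    params = {
        'action': 'query',
        'format': 'json',
        'titles': page_title,
        'prop': 'revisions',
        'rvprop': 'ids|timestamp|user|size',
        'rvlimit': 500,
        'rvdir': 'newer',
    }
    while True:
        response = requests.get('https://{0}.wikipedia.org/w/api.php'.format(lang), params=params).json()
        for page in response['query']['pages'].values():
            revisions += page.get('revisions', [])
        if 'continue' in response:
            params.update(response['continue'])
        else:
            break
    return revisions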

Testing revisions

Scrape the data

Now the fun part where we extract all the data. These steps may take several hours to complete, will generate hundreds of files (JSON and images), and may use more than 100 MB of disk space on your local machine.

Scrape category members

This should only need to be run once to get the page_title_list we use at the start.

This is commented out to avoid accidentally regenerating and changing the saved page title list.

categories_by_decade = ["Category:1950s coups d'état and coup attempts","Category:1960s coups d'état and coup attempts",
                        "Category:1970s coups d'état and coup attempts","Category:1980s coups d'état and coup attempts",
                        "Category:1990s coups d'état and coup attempts","Category:2000s coups d'état and coup attempts",
                        "Category:2010s coups d'état and coup attempts"]

# Collect the unique members of every decade category
page_titles = list()
for category in categories_by_decade:
    decade_members = get_category_members(category)
    for member in decade_members:
        if member not in page_titles:
            page_titles.append(member)

# Keep only pages whose titles contain a coup-related keyword
keywords = ['coup','incident','uprising','crisis','conspiracy','revolution',
            'putsch','operation','insurrection','plot','project']

#page_title_list = [i for i in page_titles if any(j.lower() in [k.lower() for k in i.split(' ')] for j in keywords)]
page_title_list = [i for i in page_titles if any(j.lower() in i.lower() for j in keywords)]

print("There are {0} total page titles".format(len(page_title_list)))

with open('page_title_list_new.json','w') as f:
    json.dump(page_title_list,f)

Scrape file image usage

Pull images

I copied the errors from image downloading to "image_download_errors.txt" so these files can be downloaded manually. These errors occur when an image does not exist on Wikimedia Commons and is hosted only on the local language edition.
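For reference, a minimal sketch of how a file could be downloaded by resolving its URL with the imageinfo prop, trying Wikimedia Commons first and falling back to the local language edition when the file is only hosted there; the helper and directory names are assumptions:

import os
import requests

def download_image(file_title, lang='en', out_dir='images'):
    """Sketch: resolve a file's URL via the imageinfo prop and save it locally."""
    params = {
        'action': 'query',
        'format': 'json',
        'titles': file_title,
        'prop': 'imageinfo',
        'iiprop': 'url',
    }
    for host in ['commons.wikimedia.org', '{0}.wikipedia.org'.format(lang)]:
        response = requests.get('https://{0}/w/api.php'.format(host), params=params).json()
        for page in response['query']['pages'].values():
            if 'imageinfo' in page:
                url = page['imageinfo'][0]['url']
                os.makedirs(out_dir, exist_ok=True)
                local_path = os.path.join(out_dir, url.split('/')[-1])
                # Wikimedia asks for a descriptive User-Agent on bulk requests
                r = requests.get(url, headers={'User-Agent': 'coup-image-scraper (research notebook)'})
                with open(local_path, 'wb') as f:
                    f.write(r.content)
                return local_path
    return None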

Scrape revisions

Make graphs

Old and busted code

Scrape image category memberships

Get the category memberships

We should also clean up the image files to include the roughly 120 missing images that Emily downloaded by hand because of missing or broken prefixes, or because the files do not appear on Wikimedia Commons.

Scrape other images in categories