This notebook provides a tutorial on how to study page protections on Wikipedia via either the Mediawiki dumps or the API. It has three stages: accessing page protection data through the Mediawiki dumps, accessing the same data through the API, and analyzing the data that was gathered.
This is an example of how to parse through Mediawiki dumps and determine what sorts of edit protections are applied to a given Wikipedia article.
# TODO: add other libraries here as necessary
import gzip # necessary for decompressing dump file into text format
# Every language on Wikipedia has its own page restrictions table
# you can find all the dbnames (e.g., enwiki) here: https://www.mediawiki.org/w/api.php?action=sitematrix
# for example, you could replace the LANGUAGE parameter of 'enwiki' with 'arwiki' to study Arabic Wikipedia
LANGUAGE = 'enwiki'
# e.g., enwiki -> en.wikipedia (this is necessary for the API section)
SITENAME = LANGUAGE.replace('wiki', '.wikipedia')
# directory on PAWS server that holds Wikimedia dumps
DUMP_DIR = "/public/dumps/public/{0}/latest/".format(LANGUAGE)
DUMP_FN = '{0}-latest-page_restrictions.sql.gz'.format(LANGUAGE)
# The dataset isn't huge -- 1.1 MB -- so should be quick to process in full
!ls -shH "{DUMP_DIR}{DUMP_FN}"
1.1M /public/dumps/public/enwiki/latest/enwiki-latest-page_restrictions.sql.gz
# Inspect the first 1000 characters of the page protections dump to see what it looks like
# As you can see from the CREATE TABLE statement, each datapoint has 7 fields (pr_page, pr_type, ... , pr_id)
# A description of the fields in the data can be found here:
# https://www.mediawiki.org/wiki/Manual:Page_restrictions_table
# And the data that we want is on lines that start with INSERT INTO `page_restrictions` VALUES...
# The first datapoint (1086732,'edit','sysop',0,NULL,'infinity',1307) can be interpreted as:
# 1086732: page ID 1086732 (en.wikipedia.org/wiki/?curid=1086732)
# 'edit': has edit protections
# 'sysop': that require sysop permissions (https://en.wikipedia.org/wiki/Wikipedia:User_access_levels#Administrator)
# 0: does not cascade to other pages
# NULL: no user-specific restrictions
# 'infinity': restriction does not expire automatically
# 1307: table primary key -- has no meaning by itself
!zcat "{DUMP_DIR}{DUMP_FN}" | head -46 | cut -c1-1000
-- MySQL dump 10.16  Distrib 10.1.45-MariaDB, for debian-linux-gnu (x86_64)
--
-- Host: 10.64.48.13    Database: enwiki
-- ------------------------------------------------------
-- Server version	10.1.43-MariaDB
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `page_restrictions`
--
DROP TABLE IF EXISTS `page_restrictions`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `page_restrictions` (
  `pr_page` int(8) NOT NULL DEFAULT '0',
  `pr_type` varbinary(255) NOT NULL DEFAULT '',
  `pr_level` varbinary(255) NOT NULL DEFAULT '',
  `pr_cascade` tinyint(4) NOT NULL DEFAULT '0',
  `pr_user` int(10) unsigned DEFAULT NULL,
  `pr_expiry` varbinary(14) DEFAULT NULL,
  `pr_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`pr_id`),
  UNIQUE KEY `pr_pagetype` (`pr_page`,`pr_type`),
  KEY `pr_typelevel` (`pr_type`,`pr_level`),
  KEY `pr_level` (`pr_level`),
  KEY `pr_cascade` (`pr_cascade`)
) ENGINE=InnoDB AUTO_INCREMENT=869230 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `page_restrictions`
--
/*!40000 ALTER TABLE `page_restrictions` DISABLE KEYS */;
INSERT INTO `page_restrictions` VALUES (1086732,'edit','sysop',0,NULL,'infinity',1307),(1086732,'move','sysop',0,NULL,'infinity',1308),(1266562,'edit','autoconfirmed',0,NULL,'infinity',1358),(1266562,'move','autoconfirmed',0,NULL,'infinity',1359),(1534334,'edit','autoconfirmed',0,NULL,NULL,1437),(1534334,'move','autoconfirmed',0,NULL,NULL,1438),(1654125,'edit','autoconfirmed',0,NULL,NULL,1664),(1654125,'move','autoconfirmed',0,NULL,NULL,1665),(1654622,'edit','autoconfirmed',0,NULL,NULL,1672),(1654622,'move','autoconfirmed',0,NULL,NULL,1673),(1654633,'edit','autoconfirmed',0,NULL,NULL,1674),(1654633,'move','autoconfirmed',0,NULL,NULL,1675),(1654645,'edit','autoconfirmed',0,NULL,NULL,1676),(1654645,'move','autoconfirmed',0,NULL,NULL,1677),(1654656,'edit','autoconfirmed',0,NULL,NULL,1682),(1654656,'move','autoconfirmed',0,NULL,NULL,1683),(1654662,'edit','autoconfirmed',0,NULL,NULL,1684),(1654662,'move','autoconfirmed',0,NULL,NULL,1685),(1654673,'edit','autoconfirmed',0,NULL,NULL,1686),(16
gzip: stdout: Broken pipe
# TODO: Complete example that loops through all page restrictions in the dump file above and extracts data
# The Python gzip library will allow you to decompress the file for reading: https://docs.python.org/3/library/gzip.html#gzip.open
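# One possible sketch (not the only approach): stream the gzipped SQL dump in text mode,
# keep only the INSERT INTO lines, and split each parenthesized row into its seven fields.
# This assumes field values never contain commas or parentheses, which holds for this table.
import gzip
import re

protections = []  # list of (pr_page, pr_type, pr_level, pr_cascade, pr_user, pr_expiry, pr_id)
with gzip.open(DUMP_DIR + DUMP_FN, 'rt', encoding='utf-8', errors='replace') as fin:
    for line in fin:
        if line.startswith('INSERT INTO `page_restrictions` VALUES'):
            # each row looks like: (1086732,'edit','sysop',0,NULL,'infinity',1307)
            for row in re.findall(r'\(([^)]*)\)', line):
                pr_page, pr_type, pr_level, pr_cascade, pr_user, pr_expiry, pr_id = row.split(',')
                protections.append((int(pr_page),
                                    pr_type.strip("'"),
                                    pr_level.strip("'"),
                                    int(pr_cascade),
                                    None if pr_user == 'NULL' else int(pr_user),
                                    None if pr_expiry == 'NULL' else pr_expiry.strip("'"),
                                    int(pr_id)))

print(len(protections), 'restrictions parsed; first row:', protections[0])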
The Page Protection API can be a much simpler way to access data about page protections if you already know which articles you are interested in and the number of articles is relatively small (e.g., hundreds or low thousands).
NOTE: the APIs are up-to-date, while the Mediawiki dumps are always at least several days behind -- i.e., they represent specific snapshots in time -- so the data you get from the Mediawiki dumps might differ from the APIs if a page's protections have changed in the intervening days.
# TODO: add other libraries here as necessary
import mwapi # useful for accessing Wikimedia API
# TODO: Gather ten random page IDs from the data gathered from the Mediawiki dump to get data for from the API
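# One possible sketch: sample ten page IDs from the `protections` list built in the dump
# section above (that list name comes from the sketch there, not a required variable name)
import random

random.seed(0)  # arbitrary seed so the sample is reproducible
page_ids = random.sample(sorted({row[0] for row in protections}), 10)
print(page_ids)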
# mwapi documentation: https://pypi.org/project/mwapi/
# user_agent helps identify the request if there's an issue and is best practice
tutorial_label = 'Page Protection API tutorial (mwapi)'
# NOTE: it is best practice to include a contact email in user agents
# generally this is private information though, so do not change it to yours if you are
# working in the PAWS environment or adding to a Github repo for Outreachy;
# instead, you can leave this as my (isaac's) email or switch it to your Mediawiki username
# e.g., Isaac (WMF) for https://www.mediawiki.org/wiki/User:Isaac_(WMF)
contact_email = 'isaac@wikimedia.org'
session = mwapi.Session('https://{0}.org'.format(SITENAME), user_agent='{0} -- {1}'.format(tutorial_label, contact_email))
# TODO: You'll have to add additional parameters here to query the pages you're interested in
# API endpoint: https://www.mediawiki.org/w/api.php?action=help&modules=query%2Binfo
# More details: https://www.mediawiki.org/wiki/API:Info
params = {'action':'query',
'prop':'info'}
# TODO: make request to API for data
# TODO: examine API results and compare to data from Mediawiki dump to see if they are the same and explain any discrepancies
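# One possible sketch, assuming `page_ids` holds the ten sampled page IDs from above.
# `inprop=protection` asks the info module to also return each page's protection settings.
params = {'action': 'query',
          'prop': 'info',
          'inprop': 'protection',
          'pageids': '|'.join(str(pid) for pid in page_ids)}
result = session.get(**params)
# 'pages' is keyed by page ID; each entry in a page's 'protection' list has a
# type (e.g., edit), level (e.g., sysop), and expiry
for page_id, page in result['query']['pages'].items():
    print(page_id, page.get('title'), page.get('protection', []))
# differences from the dump rows for these pages generally mean the protections
# changed after the dump snapshot was taken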
Here we show some examples of things we can do with the data that we gathered about the protections for various Wikipedia articles. You'll want to come up with some questions to ask of the data as well. For this, you might need to gather additional data, such as the page table, which is available in the same DUMP_DIR directory under the name {LANGUAGE}-latest-page.sql.gz.
# TODO: add any imports of data analysis / visualization libraries here as necessary
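# for example (this particular stack is an assumption, not a requirement):
import pandas as pd
import matplotlib.pyplot as plt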
TODO: give an overview of basic details about page protections and any conclusions you reach based on the analyses you do below
# TODO: do basic analyses here to understand the data
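# One possible starting point (assumes the `protections` tuples from the dump section):
# count how many restrictions exist for each (type, level) combination
from collections import Counter

type_level_counts = Counter((pr_type, pr_level)
                            for _, pr_type, pr_level, _, _, _, _ in protections)
for (pr_type, pr_level), count in type_level_counts.most_common():
    print(pr_type, pr_level, count)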
TODO: Train and evaluate a predictive model on the data you gathered for the above descriptive statistics. Describe what you learned from the model or how it would be useful.
# imports
# TODO: preprocess data
# TODO: train model
# TODO: evaluate model
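# A minimal sketch, not the required solution: predict pr_level from pr_type and whether
# the restriction is indefinite, using only the `protections` tuples from the dump section.
# Richer features (page length, views, edit counts, ...) would need to be gathered separately.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

features = [{'type': pr_type, 'indefinite': int(pr_expiry in (None, 'infinity'))}
            for _, pr_type, _, _, _, pr_expiry, _ in protections]
labels = [pr_level for _, _, pr_level, _, _, _, _ in protections]

X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    test_size=0.2, random_state=0)
vectorizer = DictVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, model.predict(vectorizer.transform(X_test))))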
TODO: Describe any additional analyses you can think of that would be interesting (and why) -- even if you are not sure how to do them.