This notebook provides a tutorial on how to study page protections on Wikipedia, either via the MediaWiki dumps or via the API. It has three stages:

• Accessing the Page Protection dumps
• Accessing the Page Protection API
• Example analysis of page protection data (both descriptive statistics and learning a predictive model)

## Accessing the Page Protection Dumps¶

This is an example of how to parse through the MediaWiki dumps and determine what sorts of edit protections are applied to a given Wikipedia article.
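Protections live in the `page_restrictions` SQL table of the dumps (typically a file like `{LANGUAGE}wiki-latest-page_restrictions.sql.gz`). Below is a minimal sketch of pulling `(page_id, type, level)` tuples out of the `INSERT` statements; the column order shown follows the standard MediaWiki schema, but the filename and row format are assumptions you should verify against your dump.

```python
import gzip
import re

# One tuple inside a page_restrictions INSERT statement. Assumed column order
# (standard MediaWiki schema):
# (pr_page, pr_type, pr_level, pr_cascade, pr_user, pr_expiry, pr_id)
ROW_RE = re.compile(
    r"\((\d+),'([^']*)','([^']*)',(\d+),(NULL|\d+),(NULL|'[^']*'),(\d+)\)"
)

def parse_restrictions(sql_text):
    """Yield (page_id, protection_type, protection_level) tuples."""
    for m in ROW_RE.finditer(sql_text):
        yield int(m.group(1)), m.group(2), m.group(3)

def parse_dump(path):
    """Stream restrictions out of a gzipped page_restrictions dump file."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO"):
                yield from parse_restrictions(line)

# Example: one row as it would appear in the dump
sample = "INSERT INTO `page_restrictions` VALUES (1234,'edit','sysop',0,NULL,'infinity',5678);"
print(list(parse_restrictions(sample)))
# [(1234, 'edit', 'sysop')]
```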

## Accessing the Page Protection APIs¶

The Page Protection API can be a much simpler way to access data about page protections if you already know which articles you are interested in and that set is relatively small (e.g., hundreds or low thousands of articles).
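Protection information is exposed through the standard `action=query` endpoint with `prop=info&inprop=protection`. Here is a stdlib-only sketch; the `User-Agent` string is a placeholder (Wikipedia asks clients to send a descriptive one), and the response shape assumes `formatversion=2`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # swap "en" for other language editions

def build_url(titles):
    """Build the request URL for the page-protection info endpoint."""
    params = {
        "action": "query",
        "prop": "info",
        "inprop": "protection",
        "titles": "|".join(titles),  # up to 50 titles per request
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)

def get_protections(titles):
    """Return {title: [protection dicts]} for the given article titles."""
    req = Request(build_url(titles),
                  headers={"User-Agent": "protection-tutorial/0.1 (placeholder)"})
    with urlopen(req) as resp:
        data = json.load(resp)
    return {page["title"]: page.get("protection", [])
            for page in data["query"]["pages"]}

# Example (uncomment to hit the live API; protections change over time):
# get_protections(["Barack Obama"])
```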

NOTE: the APIs are up-to-date, while the MediaWiki dumps are always at least several days behind -- i.e., they are snapshots at specific points in time -- so the data you get from the MediaWiki dumps might differ from the APIs if a page's protections have changed in the intervening days.

### The protection counts from the dump data and the API data differ¶

There is a mismatch between the numbers of move and edit protections obtained from the two sources. Here we compare the counts of move and edit protections from the dump-derived DataFrame and from the API; since the API data is the more up-to-date of the two, the counts differ, and we will use the API data for further analysis.

### Example Analyses of Page Protection Data¶

Here we show some examples of things we can do with the data that we gathered about the protections for various Wikipedia articles. You'll want to come up with some questions to ask of the data as well. For this, you might need to gather additional data such as:

• The page table, which, for example, can be found in the DUMP_DIR under the name {LANGUAGE}-latest-page.sql.gz
• Selecting a sample of, for example, 100 articles and getting additional information about them from other API endpoints.

## Analysis based on counts of protection types¶

TODO: give an overview of basic details about page protections and any conclusions you reach based on the analyses you do below

1. The rarest protection-type combination is "Edit" together with "Upload", which occurs 0 times.
2. The protection type "Move" has the highest frequency in the dataframe.
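The counts behind these observations can be computed with `value_counts`. A sketch on toy data, assuming a DataFrame with `pr_type` and `pr_level` columns (the column names, like the data, are stand-ins for the real notebook's):

```python
import pandas as pd

# Toy stand-in for the dump-derived data; real column names may differ.
df = pd.DataFrame({
    "pr_type":  ["move", "move", "edit", "move", "upload"],
    "pr_level": ["sysop", "sysop", "autoconfirmed", "autoconfirmed", "sysop"],
})

# Frequency of each protection type; "move" is the most frequent here
type_counts = df["pr_type"].value_counts()
print(type_counts)
```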

## Observation 1¶

### Conditional probabilities show the probability of protection type given the protection level¶

1. The highest probability occurs for "Sysop": if a page's protection level is Sysop, then with probability 0.700727 the protection type of that page is "Move".

2. A similar pattern appears for the "Edit" and "Autoconfirmed" combination, where the probability that the protection type is Edit (given that the level is "Autoconfirmed") is 0.614525.

## Observation 2¶

### Conditional probabilities show the probability of protection level given the protection type¶

1. The highest probability occurs for "Upload": if a page's protection type is Upload, then with probability 0.838095 the protection level of that page is "Sysop".

2. A similar pattern appears for the "Edit" and "Autoconfirmed" combination, where the probability that the protection level is Autoconfirmed (given that the type is "Edit") is 0.655481.
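Both conditional tables can be built from one normalized crosstab. A sketch on toy data (the `pr_type`/`pr_level` column names are assumptions); `normalize="index"` makes each row sum to 1, i.e. each row is a conditional distribution:

```python
import pandas as pd

df = pd.DataFrame({
    "pr_type":  ["move", "move", "edit", "edit", "edit", "upload"],
    "pr_level": ["sysop", "sysop", "autoconfirmed", "autoconfirmed", "sysop", "sysop"],
})

# P(type | level): each protection level's row sums to 1
p_type_given_level = pd.crosstab(df["pr_level"], df["pr_type"], normalize="index")

# P(level | type): each protection type's row sums to 1
p_level_given_type = pd.crosstab(df["pr_type"], df["pr_level"], normalize="index")

print(p_type_given_level.loc["sysop", "move"])   # 0.5 in this toy data
print(p_level_given_type.loc["upload", "sysop"]) # 1.0 in this toy data
```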

## Preprocessing the data for analysis¶

Accessing the data through the API

### Analysis based on Correlation¶

The `watchers` and `visitingwatchers` variables, along with `length`, are correlated with one another.
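The correlation claim can be checked with `DataFrame.corr()`. A sketch on made-up numbers — in the real notebook these columns come from the API — with the column names assumed to match those above:

```python
import pandas as pd

# Toy stand-in for the API-derived page features
df = pd.DataFrame({
    "length":           [1000, 5000, 20000, 80000],
    "watchers":         [12, 60, 240, 900],
    "visitingwatchers": [5, 30, 130, 400],
})

# Pairwise Pearson correlations; strongly positive here by construction
corr = df.corr()
print(corr.round(2))
```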

### Predictive Model¶

TODO: Train and evaluate a predictive model on the data you gathered for the above descriptive statistics. Describe what you learned from the model or how it would be useful.
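A sketch of the kind of comparison described below, using scikit-learn. Synthetic data stands in for the real features (e.g. `length`, `watchers`, `visitingwatchers`) and a binary target (e.g. sysop-level protection vs. not) — the feature/target choice is an assumption, not the notebook's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real page-protection features and labels
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

acc_logreg = accuracy_score(y_test, logreg.predict(X_test))
acc_tree = accuracy_score(y_test, tree.predict(X_test))
print(f"logistic regression: {acc_logreg:.2f}, decision tree: {acc_tree:.2f}")
```

Held-out accuracy is the simplest comparison; with imbalanced protection labels, precision/recall per class would be more informative.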

## Logistic regression works better than the decision tree, but it still does not give strong results.¶

### Future Analyses¶

TODO: Describe any additional analyses you can think of that would be interesting (and why) -- even if you are not sure how to do them.

1. Future analyses would need more useful data. Here, much of the data accessed through the APIs was not informative for our task, which forced us to drop it. Since prediction is a more complex task, it needs more data to give better results.

2. Secondly, the number of watchers could be predicted, but this would require additional relevant data.

3. Thirdly, page data that includes the number of clicks could be used to infer the popularity of each page.