Wikipedia Edits: Burstiness Analysis¶

Christian Bouwense

Introduction¶

The world can be thought of as a set of systems. These consist of components and interactions between them. Think of two people friending each other on Facebook, or internet routers exchanging packets of information. On the surface, these systems may seem very different, but the underlying principle of actors and actions is the same.

If we can understand the behavior of these systems - what they have in common and what they do not - we could react to our world in a more efficient way. We could have an algorithm to allocate FEMA funds more accurately based on Hurricane patterns before disasters, or know when to administer what type of treatment based on cancer metabolization patterns.

What is Burstiness?¶

This is not a new idea. Mathematicians have tried to model systems for decades. However, classical statistics have not given very good results. The general idea is that human interaction is random, and if that is the case we should be able to model interactions with random statistical models. Unfortunately, the world is not that simple, and many different systems have very dissimilar behaviors.

Fortunately, there is a new way of modelling system behavior: burstiness. Burstiness is a measure of how regular, random, or "bursty" interactions in an system are. If system behavior is "bursty," there are long periods of inactivity, followed by short bursts of activity. On the opposite side of the spectrum, systems that are very regularly timed interactions are "anti-bursty," such as time between heartbeats at rest. The middle of the scale is completely random behavior, such as the time inbetween winning a slot machine.

Methodology¶

We can do better than just describing burstiness in English; this paper by Barabasi introduced the idea of burstiness, and provides mathematical ways of measuring it.

For this experiment, we will study the system of Wikipedia article editing. This is a very interesting case because it involves human and non-human interaction (bots can make edits as well), over many different subjects of varying degrees of public interest. They also have a ton of data.

In order to measure the burstiness of Wikipedia editing (I'll be using "edits" and "revisions" interchangeably), we need a program to download the data, and then measure its burstiness. For this I'll be using Python, notably using the mwapi module, which lets you query Wikipedia's databases.

The Code¶

Above every code cell, I will be putting descriptions of what the code is doing, and how the code is doing it. If you do not understand Python code, then reading these descriptions should give you a good idea of what's going on. If you are interested in the nitty-gritty of the Python, then reading the "how" should give you some insight. (I have marked the high level explanations with "Description" and nitty-gritty explanations with "Technical" to save people time!)

Importing Modules¶

Technical: Here we just import some modules that we will need in the program. The most notable one, as I have mentioned earlier, is the inclusion of "mwapi" which is an interfact between Python and REST APIs which we can send to Wikipedia. An interesting point is that mwapi is actually a wrapper for another module called mw, where you need to actually craft the REST calls yourself.

Getting the Data¶

Description: Here I am creating a bit of code that will ask Wikipedia for the revision data of any article we want. You can specify which article you want data for, a time period, and what kind of data you want about the revisions (e.g. who made the revisions, when the revisions was made, etc.).

Technical: get_revisions() is a wrapper for the mwapi get() call. We are always going to want to query, get revisions, and accept the maximum amount of revisions. However, you can also specify things like the revision properties you want, and the date range you wish to see (and if you don't really care, they have default values). The values are returned in JSON and stored in a global dictionary called article_data. So even though we are getting data, this function has a void return type.

You may notice that there is a for loop that assigns a variable called page_id. We will need this value to traverse the JSON later when we parse its contents, and also for continuing the query. Speaking about continuing the query, any get request is only allowed 500 objects to be received per call. However, if there are more than 500 revisions, there will be a field in the JSON called 'continue.' If you put the value of 'continue' into your next get call, it will return you the next 500. So we have a while loop that checks for this continue value, and if its there just continually make get calls until its gone.

Storing Article Data¶

Description: We need a place to store all of this data that Wikipedia is giving us, so this one line of code below is the place we will keep it. When we ask Wikipedia for the revision data of an article, we will put the amount of edits per user, the times each revision was created, and how much time is inbetween revisions in article_data.

Technical: This is simply a global dictionary to store the data parsed by get_revisions(). It has a key for each article we query. Each article has a key for 'user_revs' (a list of tuples containing a username and how many edits they created), 'revision_times' (a list of timestamps for each edit), and 'interevent_times' (a list of floats indicating the amount of seconds between each edit).

Analyzing Burstiness: B¶

Description: Now that we have a way of getting data from Wikipedia, we need to create a way to analyze its burstiness. One of the ways of measuring burstiness is looking at the distribution of the time between events in the system (interevent times). In order to do this, we need to find their mean and standard deviation and plug them into one of the two burstiness equations.

The measure of burstiness by interevent times is defined as follows:

$$B \ \triangleq\ \frac{\sigma_{\tau}-m_{\tau}}{\sigma_{\tau}+m_{\tau}}$$

If you're not a math person, essentially we are just taking some facts about the data and plugging them into an equation. You can think of this equation as a "Burstiness-o-meter." If we take some data and put it into the "Burstiness-o-meter" it spits out a number. This output is simply a number from -1 to 1; let's call it B. The lower the number is, the less bursty it is, and the higher it is, the burstier. If it is close to 0, that means that the data is random.

Technical: We use the Numpy module's built-in functions to find the mean and standard deviation of the data. Since we already stored the interevent times in the article_data, we can just give those to the Numpy functions.

B is essentially analyzing how different the distribution of interevent times is from the Poisson Distribution. The math here is pretty simple, we're just taking the difference between the standard deviation and the mean over the sum of the standard deviation and the mean. The range of this formula is [-1, 1]. The $\lim_{\sigma\to 0} B=-1$, which makes sense: as the data deviates less and less, there are less "bursty" situations. As $B\to0$, the distribution of interevent times more closely resembles the Poisson distribution, meaning that they are random. Finally as $B\to1$, the burstier the data.

Analyzing Burstiness: M¶

Description: There is a different angle that we can measure burstiness with: memory. At first it may sound the same but it is actually completely independent of the "Burstiness-o-meter" we used before. Memory means that short interevent times are really likely to happen after short ones, and long ones occur after long ones.

In order to measure this, we will make a "Burstiness-o-meter 2.0". The way that this new measurement will calculate burstiness is different, but the numbers it can output will be the same. So we're still looking at a scale that goes from -1 to 1; let's call this output M.

This isn't a better measurement of the data's burstiness, just a different way of measuring it. Just to be clear, say some data has a $B=0.75$ and $M=0.10$ and some other data has $B=0.75$ and $M=0.65$: just because the second data set had both numbers higher doesn't mean it was "burstier," it just means it is bursty in both ways instead of just one.

Technical: Memory is easiest thought of in the context of Markov Chains. The state machine consist of two states: "Short Interevent Time" and "Long Interevent Time." If a system is very bursty, the chance of changing states would be very small; and vice versa for a system with very regular interactions.

We will need the mean and standard deviation of interevent times [1, n-1], and [2, n] (assuming a total of n interevent times).

The coefficient of burstiness due to memory is defined as

$$M \ \triangleq\ \frac{1}{n_{\tau} - 1}\sum_{i=1}^{n_{\tau}-1} \frac{(\tau_{i}-m_{1})(\tau_{i+1}-m_{2})}{\sigma_{1}\sigma_{2}}$$

Bringing it All Together¶

Let's summarize a little bit. We now have code that gets revision data from Wikipedia, extracts some interesting facts about that data, and can analyze the burstiness of it in a couple different ways. Now that we have these tools at our disposal, let's use them on a bunch of different articles! I've Googled the top 15 most edited Wikipedia articles, so they should be a good place to start (I love the fact that "List of WWE Personnel" is the 2nd most edited Wikipedia article).

Visualizing the Results¶

Now let's take a look at the data that we've collected! We're going to use a bar graph of the B and M values for each article we queried. Remember that B and M range from -1 to 1, and the closer to 1 they are the more they contributed to the data's burstiness.

Digging Deeper¶

The further right the bars are in the graph, the burstier the data is: as you can see, Wikipedia edits are pretty bursty! In fact, human activities generally tend to have a B coefficient of 0.5 to 0.8, and an M of 0 to 0.1, so these results make a lot of sense.

Now that we know that Wikpedia edits are bursty, let's dig a little deeper. It would be interesting to find out if we can correlate burstiness with other variables, such as number of users editing the page, or number of overall revisions. We already have this data from earlier, so let's add it to the graph.