Lab 3 - Collaboration Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the third of five lab notebooks that will explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab will extend the methods in the previous two labs about analyzing a single article's revision histories and analyzing the hyperlink networks around a single Wikipedia page. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

Define the name of the article you want to use for the rest of the lab.

Define helper functions

Start off by using two functions from Lab 2 and defining some additional helper functions.

Define a function to deal with redirects

Take an example article like "Laura Lane Welch", the maiden name of former First Lady Laura Bush. This article is an example of a redirect: basically a placeholder for an alternative name that points to the canonical page title. These redirects also have revision histories, but they are typically very small.

We need to write a function resolve_redirects to deal with redirects so that they don't pollute our data with unusually small collaborations.
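A minimal sketch of what resolve_redirects can look like, using the public MediaWiki API through the requests library (the lab's own implementation may differ):

import requests

def resolve_redirects(titles):
    """Map each title in titles to its canonical, redirect-resolved title."""
    resolved = {}
    # The API accepts at most 50 titles per request, joined by pipes
    for i in range(0, len(titles), 50):
        chunk = titles[i:i + 50]
        params = {'action': 'query',
                  'titles': '|'.join(chunk),
                  'redirects': 1,
                  'format': 'json'}
        reply = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
        chunk_map = {t: t for t in chunk}  # identity mapping by default
        for r in reply['query'].get('redirects', []):
            chunk_map[r['from']] = r['to']
        resolved.update(chunk_map)
    return resolved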

Retrieve the content of the page via API

Test to make sure the resolve_redirects function works

Test the function and compare the list of outlinks from the article to the pages they actually resolve to. This should produce a set of article titles that are named differently from how they're linked to in the article.
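As a sketch of such a test (get_page_outlinks is assumed to be the outlink function from Lab 2):

outlinks = get_page_outlinks(page_title)
redirect_map = resolve_redirects(outlinks)

# Keep only the outlinks whose resolved title differs from the linked title
{link: target for link, target in redirect_map.items() if link != target}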

Calculate communities within the network

Data retrieval

Retrieve the host name, username, and (encrypted) password associated with your account so you can login to the database.

Connect to the system using the credentials from above, then select the database for the English Wikipedia.
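A sketch of the connection step, assuming the PAWS environment exposes your credentials as the MYSQL_USERNAME and MYSQL_PASSWORD environment variables and that the English Wikipedia replica database is reachable as enwiki.labsdb:

import os
import pymysql

conn = pymysql.connect(host='enwiki.labsdb',
                       user=os.environ['MYSQL_USERNAME'],
                       password=os.environ['MYSQL_PASSWORD'],
                       database='enwiki_p',
                       charset='utf8')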

Define functions to retrieve data from the MySQL database

Define a function get_user_revision_counts to return, for each user, the number of edits and the timestamps of the first and last edits they made to a single article.
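A hedged sketch of get_user_revision_counts, assuming the replica's older schema in which the revision table stores rev_user_text directly:

import pandas as pd

def get_user_revision_counts(page_title, conn):
    """Return each user's edit count and first/last edit timestamps for one article."""
    query = """
        SELECT rev_user_text AS user,
               COUNT(*) AS edits,
               MIN(rev_timestamp) AS min_timestamp,
               MAX(rev_timestamp) AS max_timestamp
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE page_title = %(title)s AND page_namespace = 0
        GROUP BY rev_user_text
    """
    df = pd.read_sql_query(query, conn, params={'title': page_title.replace(' ', '_')})
    df['page'] = page_title
    return df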

Check the results to see how this function works.

Write a function get_neighbors_all_revisions to take a single page, get all its outlinks, and for each one of these outlinks get the user revision counts. This will return a large DataFrame containing every user who ever edited each article and how many times they did so.
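A sketch of get_neighbors_all_revisions built from the pieces above; it assumes get_page_outlinks from Lab 2 plus the resolve_redirects and get_user_revision_counts functions sketched earlier:

def get_neighbors_all_revisions(page_title, conn):
    """Return user revision counts for a page and every page it links to."""
    redirect_map = resolve_redirects(get_page_outlinks(page_title))
    titles = set(redirect_map.values()) | {page_title}
    return pd.concat([get_user_revision_counts(t, conn) for t in titles],
                     ignore_index=True)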

Now run the get_neighbors_all_revisions function to not only get all the page outlinks, but also all the revisions for every page in the dataset.

This step may take more than a minute, depending on the number of articles and number of revisions made to them.
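For example:

all_rev_count_df = get_neighbors_all_revisions(page_title, conn)
all_rev_count_df.head()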

Convert edgelist data into a network

We can convert this edgelist into a networkx bipartite graph in several steps. First we store the unique names of all the users and pages as references for subsequent analysis. Then we use the from_pandas_dataframe function to take the edgelist DataFrame (all_rev_count_df) and turn it into a directed graph (the order of the connections matters) with the users as sources and the page titles as targets. The nodes in the resulting graph should then be labelled as being users or pages so that we can compare them more clearly in the visualization.

However, the amount of data in the network exceeds the available memory we have in the PAWS environment (~1 GB). The code for constructing the whole network graph is below, but not in an executable form, so as to prevent a kernel crash and the loss of your progress so far.

# Convert the edgelist DataFrame into a networkx graph,
# preserving source, target, and weights
collab_g = nx.from_pandas_dataframe(all_rev_count_df, source='user', target='page',
                                    edge_attr=['edits','min_timestamp','max_timestamp'],
                                    create_using=nx.DiGraph())

# Label the nodes in the network as being users or pages using the lists from above
collab_g.add_nodes_from(collab_users, nodetype='user')
collab_g.add_nodes_from(collab_pages, nodetype='page')

# Write the graph out to file so we can visualize it in Gephi
nx.write_gexf(collab_g, 'collaboration_{0}.gexf'.format(page_title.replace(' ','_')))

Try an alternate version that excludes users with 'bot' in their username and users who made only one revision. You may get a SettingWithCopyWarning that pops up as a red box, but the step should still have worked successfully if the number of users, pages, and edges prints out at the bottom.
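A sketch of that filtering step, reusing all_rev_count_df from above; it assumes that bot accounts can be caught by matching 'bot' in the username, and it prefixes usernames with 'u:' so that user and page nodes cannot collide (the pendant-removal code further below relies on that convention):

import networkx as nx

# Drop users with 'bot' in their name, then users with only one revision
nobot_df = all_rev_count_df[~all_rev_count_df['user'].str.contains('bot', case=False, na=False)]
nobot_df = nobot_df[nobot_df['edits'] > 1].copy()

# Prefix usernames so user and page nodes cannot collide in the graph
nobot_df['user'] = 'u:' + nobot_df['user']

# Build the directed bipartite graph and label node types
collab_nobot_g = nx.from_pandas_dataframe(nobot_df, source='user', target='page',
                                          edge_attr=['edits','min_timestamp','max_timestamp'],
                                          create_using=nx.DiGraph())
collab_nobot_users = nobot_df['user'].unique()
collab_nobot_pages = nobot_df['page'].unique()
collab_nobot_g.add_nodes_from(collab_nobot_users, nodetype='user')
collab_nobot_g.add_nodes_from(collab_nobot_pages, nodetype='page')

print(len(collab_nobot_users), len(collab_nobot_pages), collab_nobot_g.number_of_edges())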

Project bipartite network into a page-page graph

Project the two-mode/bipartite graph (having users connected to pages) into a weighted one-mode graph where pages are connected to pages if they share users in common.
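A sketch of the projection, using the networkx 1.x API assumed throughout this notebook; the bipartite projection functions expect an undirected graph:

# Work on an undirected copy of the bipartite graph
undirected_g = collab_nobot_g.to_undirected()
page_nodes = [n for n, d in undirected_g.nodes_iter(data=True) if d['nodetype'] == 'page']

# Pages are tied together by edges weighted by the number of users they share
page_collaboration_g = nx.bipartite.weighted_projected_graph(undirected_g, page_nodes)
page_collaboration_g.number_of_nodes(), page_collaboration_g.number_of_edges()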

You could similarly do this for users, but it requires much more memory than we have available within PAWS. You can do some things to remove the pendants (nodes having a degree of one), but it still generates super-hairball-y graphs.

This code is not executable because it will almost certainly cause your notebook to run out of memory and crash. I provide the code in case you want to run it locally on other data.

# Compute each node's raw degree (degree centrality times n-1)
num_of_nodes = len(collab_nobot_g) - 1
collab_nobot_dc = {u: dc * num_of_nodes for u, dc in nx.degree_centrality(collab_nobot_g).items()}

# Identify low-degree user nodes and remove them from a copy of the graph
pendant_users = [node for node, degree in collab_nobot_dc.items() if 'u:' in node and degree < 3]
cg_no_pendants = collab_nobot_g.copy()
cg_no_pendants.remove_nodes_from(pendant_users)

# Project the remaining bipartite graph onto its user nodes
cg_no_pendants_users = [n for n, d in cg_no_pendants.nodes_iter(data=True) if d['nodetype'] == 'user']
user_collaboration_g = nx.bipartite.weighted_projected_graph(cg_no_pendants, cg_no_pendants_users)
user_collaboration_g.number_of_nodes(), user_collaboration_g.number_of_edges()

# Write the user-user graph out to file so we can visualize it in Gephi
nx.write_gexf(user_collaboration_g, 'cwpg_users_cg_nobot_{0}.gexf'.format(page_title.replace(' ','_')))

Extract the backbone of the graph using the method from Serrano, et al. (2009). This method retains the most statistically significant weighted edges for each node based on a cut-off value of alpha. Rather than using a global thresholding strategy (e.g., dropping all edges below a certain weight), it retains the most important edges for each node.
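As a minimal sketch of the disparity filter from Serrano, et al. (2009): each edge receives a significance score alpha_ij = (1 - p_ij)^(k - 1), where p_ij is the edge's share of its endpoint's total weight and k is that endpoint's degree, and only edges scoring below the alpha cut-off are kept. The lab's own implementation may differ in the details.

def extract_backbone(g, alpha=0.01):
    """Keep each node's statistically significant weighted edges (undirected graph)."""
    backbone = nx.Graph()
    for node in g:
        k = g.degree(node)
        if k > 1:
            total_weight = float(sum(g[node][neighbor]['weight'] for neighbor in g[node]))
            for neighbor in g[node]:
                p_ij = g[node][neighbor]['weight'] / total_weight
                alpha_ij = (1 - p_ij) ** (k - 1)  # significance under the disparity null model
                if alpha_ij < alpha:
                    backbone.add_edge(node, neighbor, weight=g[node][neighbor]['weight'])
    return backbone

backbone_g = extract_backbone(page_collaboration_g, alpha=0.01)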

Load the data for the projected backbone (0.01 threshold) collaboration graph.

Run the function on the igraph version of the backbone (0.01 threshold) collaboration network.

This may take a minute or more, since these are intensive calculations.
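As a hedged sketch of what this step can look like (the lab's own community function may differ), one way to get an igraph version of the backbone network is to round-trip it through GraphML and then run a built-in community-detection method such as multilevel (Louvain):

import igraph as ig

# Round-trip through GraphML, a format both libraries read and write
nx.write_graphml(backbone_g, 'backbone.graphml')
backbone_ig = ig.Graph.Read_GraphML('backbone.graphml')

# Weighted Louvain community detection
communities = backbone_ig.community_multilevel(weights='weight')
print(len(communities), communities.modularity)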

Compute descriptive statistics for the bipartite graph

Identify the most well-connected nodes in the bipartite graph

Compute the directed degree centralities. The network is constructed such that users contribute to articles. It should be impossible for articles to contribute to articles, or users to contribute to users, or articles to contribute to users.

Following this enforced direction, in-degree values for articles should be non-zero and reflect the number of users who contributed to them while it should be zero for users. The out-degree values for users should be non-zero and reflect the number of articles they contributed to while it should be zero for pages.

Convert the in- and out-degree dictionaries from above into a DataFrame called "cg_degree_df" for collaboration degree centrality DataFrame. Also create new columns in cg_degree_df that label whether the row is for a page or a user to help with filtering later on.
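A sketch of that conversion under the networkx 1.x API, where in_degree() and out_degree() return dictionaries keyed by node:

in_degree_d = collab_nobot_g.in_degree()
out_degree_d = collab_nobot_g.out_degree()

cg_degree_df = pd.DataFrame({'in_degree': pd.Series(in_degree_d),
                             'out_degree': pd.Series(out_degree_d)})

# Label each row as a page or a user using the nodetype attributes set earlier
nodetype_d = nx.get_node_attributes(collab_nobot_g, 'nodetype')
cg_degree_df['nodetype'] = pd.Series(nodetype_d)
cg_degree_df['is_page'] = cg_degree_df['nodetype'] == 'page'
cg_degree_df['is_user'] = cg_degree_df['nodetype'] == 'user'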

What is the average connectivity for pages? for users?

Look at the nodes with the highest in-degree: these are articles and the number of unique users who contributed to them.

Look at the nodes with the highest out-degree: these are users and the number of unique articles (in the set) they contributed to.

Are there any articles with only a single editor?
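Sketches of the four lookups above, all against cg_degree_df:

# Average connectivity for pages and for users
cg_degree_df.groupby('nodetype')[['in_degree', 'out_degree']].mean()

# Articles with the most unique contributors
cg_degree_df[cg_degree_df['is_page']]['in_degree'].nlargest(10)

# Users who contributed to the most articles in the set
cg_degree_df[cg_degree_df['is_user']]['out_degree'].nlargest(10)

# Articles with only a single editor
cg_degree_df[cg_degree_df['is_page'] & (cg_degree_df['in_degree'] == 1)]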

Identify which pages share the most editors in common in the bipartite graph

There are several different ways to calculate clustering in a bipartite network. These all capture the tendency for pages to share editors in common: values closer to 0 imply that an article's editors have contributed to few articles in common and values closer to 1 imply that an article's editors share all articles in common. This definition can be expanded to account for the sizes of the different nodes' neighborhoods, which yields the three variants ('dot', 'min', and 'max') used below.

Figure 5 in Latapy, Magnien, & Del Vecchio (2008) provides another visualization of these bipartite clustering definitions.

Create a loop that computes each of these bipartite clustering values for the pages in the bipartite collaboration network.
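A sketch of that loop using networkx's bipartite clustering function, which implements the 'dot', 'min', and 'max' definitions from Latapy, Magnien, & Del Vecchio (2008); it reuses undirected_g and page_nodes from the projection step above:

bipartite_clustering_df = pd.DataFrame(index=page_nodes)

for mode in ['dot', 'min', 'max']:
    cc_d = nx.bipartite.clustering(undirected_g, nodes=page_nodes, mode=mode)
    bipartite_clustering_df['cc_' + mode] = pd.Series(cc_d)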

Look at what articles have the highest overlap with other articles based on sharing editors in common, according to each of the three definitions.

Plot the distribution of the clustering coefficients for the pages.

Because the distributions for these clustering values are so skewed, look at the median values to describe the network as a whole.

How is the clustering coefficient correlated with the connectivity of the node?
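Sketches of the distribution, median, and degree-correlation steps above, assuming matplotlib is available:

import matplotlib.pyplot as plt

# Distribution of the clustering coefficients; log-scale the counts since they're skewed
fig, ax = plt.subplots()
for col in ['cc_dot', 'cc_min', 'cc_max']:
    ax.hist(bipartite_clustering_df[col].dropna().values, bins=50, alpha=0.5, label=col)
ax.set_yscale('log')
ax.legend()

# Medians describe these skewed distributions better than means
bipartite_clustering_df.median()

# Spearman rank correlation between clustering and connectivity
joined_df = bipartite_clustering_df.join(cg_degree_df['in_degree'])
joined_df.corr(method='spearman')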

Measure neighbor connectivity in the bipartite graph

Assortativity is a complex network property measuring whether nodes are connected to other similar nodes. A related concept is called "homophily".

Working from the degree analyses we did above, we can do a similar measure of "degree assortativity" that captures the tendency for well-connected nodes to be connected to other well-connected nodes. This is a property seen in many social networks where popular actors surround themselves with other popular actors. Conversely, in many natural systems "disassortative" patterns are observed wherein well-connected nodes are connected to poorly-connected nodes. One popular reason given for this difference is that many natural (as opposed to social) systems have evolved to be resilient and that separating well-connected nodes from poorly-connected nodes helps insulate the network from failures.

While assortativity is typically examined in the context of one-mode (e.g., page-page or user-user) networks, we can extend this definition to our bipartite collaboration network by looking at neighbor connectivity.
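A sketch of neighbor connectivity on the bipartite graph using networkx's average_neighbor_degree, computed over the undirected version of the graph and summarized separately for pages and users:

# Average degree of each node's neighbors
avg_neighbor_degree_d = nx.average_neighbor_degree(undirected_g)

cg_degree_df['avg_neighbor_degree'] = pd.Series(avg_neighbor_degree_d)
cg_degree_df.groupby('nodetype')['avg_neighbor_degree'].mean()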

Compute descriptive statistics for the projected graph

Compute the degree centrality of the nodes in the projected graph.

Compute the clustering of the nodes in the projected graph.

Plot the clustering against degree.

Compute the average neighbor degrees in the projected graph and plot the degree vs. avg. neighbor degree.
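A sketch of these last computations on the projected page-page graph (page_collaboration_g from above); clustering here is the ordinary one-mode coefficient, and pandas' matplotlib integration is assumed for the plots:

# Degree, clustering, and average neighbor degree for each page node
pg_degree_d = nx.degree(page_collaboration_g)
pg_clustering_d = nx.clustering(page_collaboration_g)
pg_neighbor_d = nx.average_neighbor_degree(page_collaboration_g)

pg_df = pd.DataFrame({'degree': pd.Series(pg_degree_d),
                      'clustering': pd.Series(pg_clustering_d),
                      'avg_neighbor_degree': pd.Series(pg_neighbor_d)})

# Clustering vs. degree
pg_df.plot(kind='scatter', x='degree', y='clustering', alpha=0.5)

# Degree vs. average neighbor degree
pg_df.plot(kind='scatter', x='degree', y='avg_neighbor_degree', alpha=0.5)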