Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.
This is the first of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will use a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.
Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.
a = 1
b = 4
a+b
5
list(range(0,5))
[0, 1, 2, 3, 4]
your_name = 'Ricarose Roque'
#print("Hello, {0}".format(your_name))
print("Hello, " + your_name)
Hello, Ricarose Roque
Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize the results.
# Makes the plots appear within the notebook
%matplotlib inline
# Two fundamental packages for doing data manipulation
import numpy as np # http://www.numpy.org/
import pandas as pd # http://pandas.pydata.org/
# Two related packages for plotting data
import matplotlib.pyplot as plt # http://matplotlib.org/
import seaborn as sb # https://stanford.edu/~mwaskom/software/seaborn/
# Two packages for accessing the MySQL server
import pymysql # http://pymysql.readthedocs.io/en/latest/
import os # https://docs.python.org/3.4/library/os.html
# Set up the environment: plots with a white background, and DataFrames that show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110
Retrieve the host name, username, and (encrypted) password associated with your account so you can login to the database.
host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']
Connect to the system using the credentials from above, then select the database for the English Wikipedia.
conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');
pd.read_sql_query('show tables',conn)
Tables_in_enwiki_p | |
---|---|
0 | abuse_filter |
1 | abuse_filter_action |
2 | abuse_filter_log |
3 | aft_article_answer |
4 | aft_article_answer_text |
5 | aft_article_feedback |
6 | aft_article_feedback_properties |
7 | aft_article_feedback_ratings_rollup |
8 | aft_article_feedback_select_rollup |
9 | aft_article_field |
10 | aft_article_field_group |
11 | aft_article_field_option |
12 | aft_article_filter_count |
13 | aft_article_revision_feedback_ratings_rollup |
14 | aft_article_revision_feedback_select_rollup |
15 | archive |
16 | archive_userindex |
17 | article_feedback |
18 | article_feedback_pages |
19 | article_feedback_properties |
20 | article_feedback_ratings |
21 | article_feedback_revisions |
22 | article_feedback_stats |
23 | article_feedback_stats_types |
24 | category |
25 | categorylinks |
26 | change_tag |
27 | ep_articles |
28 | ep_cas |
29 | ep_courses |
30 | ep_events |
31 | ep_instructors |
32 | ep_oas |
33 | ep_orgs |
34 | ep_revisions |
35 | ep_students |
36 | ep_users_per_course |
37 | externallinks |
38 | filearchive |
39 | filearchive_userindex |
40 | flaggedimages |
41 | flaggedpage_config |
42 | flaggedpage_pending |
43 | flaggedpages |
44 | flaggedrevs |
45 | flaggedrevs_promote |
46 | flaggedrevs_statistics |
47 | flaggedrevs_stats |
48 | flaggedrevs_stats2 |
49 | flaggedrevs_tracking |
50 | flaggedtemplates |
51 | geo_tags |
52 | global_block_whitelist |
53 | image |
54 | imagelinks |
55 | interwiki |
56 | ipblocks |
57 | ipblocks_ipindex |
58 | iwlinks |
59 | l10n_cache |
60 | langlinks |
61 | localisation |
62 | localisation_file_hash |
63 | logging |
64 | logging_logindex |
65 | logging_userindex |
66 | mark_as_helpful |
67 | math |
68 | module_deps |
69 | msg_resource_links |
70 | oldimage |
71 | oldimage_userindex |
72 | ores_classification |
73 | ores_model |
74 | page |
75 | page_props |
76 | page_restrictions |
77 | pagelinks |
78 | pagetriage_log |
79 | pagetriage_page |
80 | pagetriage_page_tags |
81 | pagetriage_tags |
82 | pif_edits |
83 | povwatch_log |
84 | povwatch_subscribers |
85 | protected_titles |
86 | recentchanges |
87 | recentchanges_userindex |
88 | redirect |
89 | revision |
90 | revision_userindex |
91 | site_identifiers |
92 | site_stats |
93 | sites |
94 | tag_summary |
95 | templatelinks |
96 | transcode |
97 | updatelog |
98 | updates |
99 | user |
100 | user_former_groups |
101 | user_groups |
102 | user_properties |
103 | user_properties_anon |
104 | valid_tag |
105 | watchlist_count |
106 | wbc_entity_usage |
107 | wikilove_image_log |
108 | wikilove_log |
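Before querying a table, it can help to see which columns it contains. This aside isn't part of the original lab steps, but MySQL's DESCRIBE statement works through the same connection; the "revision" table name is taken from the list above.
# Aside: inspect the columns of the "revision" table before querying it
pd.read_sql_query('DESCRIBE revision', conn)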
s = """
SELECT
*
FROM
wikilove_log
LIMIT
5
"""
pd.read_sql_query(s,conn)
wll_id | wll_timestamp | wll_sender | wll_sender_registration | wll_sender_editcount | wll_receiver | wll_receiver_registration | wll_receiver_editcount | wll_type | wll_subject | wll_message | wll_email | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | b'20110630183430' | 10796336 | b'20091016234100' | 92 | 10796336 | b'20091016234100' | 92 | b'food-pie' | b'A pie for you!' | b'here is another pie for me' | 0 |
1 | 2 | b'20110630183437' | 59944 | b'20040419234618' | 34358 | 59944 | b'20040419234618' | 34358 | b'food-baklava' | b'Some baklava for you!' | b'testing' | 0 |
2 | 3 | b'20110630183525' | 10796336 | b'20091016234100' | 93 | 10796336 | b'20091016234100' | 93 | b'food-baklava' | b'Some baklava for you!' | b'here is another wikilove!' | 0 |
3 | 4 | b'20110630183537' | 59944 | b'20040419234618' | 34359 | 59944 | b'20040419234618' | 34359 | b'makeyourown' | b'test' | b'Howdy' | 0 |
4 | 5 | b'20110630183611' | 12732325 | b'20100714203639' | 119 | 59944 | b'20040419234618' | 34360 | b'kitten' | b'A kitten for you!' | b'Hurky hurky test' | 0 |
Article page titles need to be separated by underscores "_" rather than spaces. It's easy to replace spaces with underscores using the replace
method called on the string.
page_title = 'Mitt Romney'.replace(' ','_')
page_title
'Mitt_Romney'
Write a function get_page_revisions
that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields need to be cleaned up by converting them to other data types to help with analysis in later steps.
def get_page_revisions(page_title,namespace,conn):
""" Takes a page title and returns its revision history.
page_title = a string for the page title to get its revisions
namespace = an int to get the revisions from a specific namespace
conn = a database connection
Returns:
A pandas DataFrame containing all the article revisions and page information
"""
# In case you pass a page title with spaces in it, replace the spaces with underscores
page_title = page_title.replace(' ','_')
# The MySQL query string used to retrieve the data. By line, it is
## selecting all columns
## from the "revisions" table
## joining the "page" table on it
## using the page_id and rev_page columns as keys
    ## limiting the results to entries that occur in the given namespace and page title
## and ordering the results by the timestamp
s = """
SELECT
*
FROM
revision
JOIN
page ON page.page_id = revision.rev_page
WHERE
page.page_namespace = {1} AND page.page_title = '{0}'
ORDER BY
revision.rev_timestamp ASC
""".format(page_title,namespace)
# Use the connection to run the query and return the results as a DataFrame
_df = pd.read_sql_query(s,conn)
# Some of the results have a "bytestring" format
byte_columns = ['rev_comment','rev_user_text','rev_timestamp','rev_sha1','page_content_model',
'page_title','page_restrictions','page_touched','page_links_updated']
# For each column, convert it from bytestring to a utf8 string
for col in byte_columns:
_df[col] = _df[col].str.decode('utf8')
# Several of the columns are timestamps. Convert to datetimes
_df['rev_timestamp'] = _df['rev_timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
_df['page_touched'] = _df['page_touched'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
_df['page_links_updated'] = _df['page_links_updated'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
# Return the data
return _df
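As an aside, formatting the page title directly into the query string works for this lab, but it will break on titles that contain quote characters. A safer sketch (not used in the rest of the notebook) passes the values through the params argument of read_sql_query so pymysql escapes them:
# Sketch only: the same query with %s placeholders so pymysql handles the escaping
s_param = """
SELECT
    *
FROM
    revision
JOIN
    page ON page.page_id = revision.rev_page
WHERE
    page.page_namespace = %s AND page.page_title = %s
ORDER BY
    revision.rev_timestamp ASC
"""
param_df = pd.read_sql_query(s_param, conn, params=(0, 'Mitt_Romney'))
param_df.head()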
Use the function to get the revisions for a single article and use the .head()
method to inspect the first 5 rows in the DataFrame.
rev_df = get_page_revisions('Mitt Romney',0,conn)
rev_df.head()
rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1 | rev_content_model | rev_content_format | page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random | page_touched | page_links_updated | page_latest | page_len | page_content_model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2229556 | 426208 | 0.0 | 30528 | Academic Challenger | 2004-01-10 07:37:48 | 0 | 0 | 237.0 | 0 | igni8ilffo85cqdcxtf2nffgel4ua9v | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | ||
1 | 2671292 | 426208 | 0.0 | SLC Olympics. | 0 | 18.24.0.120 | 2004-01-25 21:23:55 | 0 | 0 | 369.0 | 2229556 | 1m9i39rbo5nmu8klkh2rvtzgmi79qlp | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | |
2 | 2802157 | 426208 | 0.0 | 47982 | Xinoph | 2004-03-07 22:46:47 | 0 | 0 | 972.0 | 2671292 | 13k734g9hhg0c0gfoeajlrbgm6qwk36 | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | ||
3 | 3064041 | 426208 | 0.0 | Paragraph formatting. Info about father. | 64194 | Chrisn4255 | 2004-03-17 08:37:03 | 0 | 0 | 1139.0 | 2802157 | om843tubijlwo0i3qxyih3i0si82111 | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | |
4 | 3435894 | 426208 | 0.0 | fixing link | 32345 | Acegikmo1 | 2004-04-05 06:17:41 | 1 | 0 | 1152.0 | 3064041 | rkd11yolu9udl9lkehw8cg0bnwhuuok | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext |
Count the total number of revisions in the article.
len(rev_df)
11046
Count the total number of unique users contributing to the article.
len(rev_df['rev_user_text'].unique())
2654
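Counting unique editors doesn't show how concentrated the editing is. As an optional extension (not required for the lab), value_counts on the "rev_user_text" series lists the editors, by username or IP address, who made the most revisions:
# Optional extension: the 10 most active editors on this article
rev_df['rev_user_text'].value_counts().head(10)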
When was the first revision made to the article?
rev_df['rev_timestamp'].min()
Timestamp('2004-01-10 07:37:48')
When was the most recent revision made to the article?
rev_df['rev_timestamp'].max()
Timestamp('2016-09-24 00:37:57')
How has the size of the article changed over time?
ax = rev_df['rev_len'].plot()
ax.set_xlabel('Revisions')
ax.set_ylabel('Article size (bytes)')
Who made the largest change to the article?
# Use the idxmax() method to return the index value for the largest value in the series
loc_max_revision = rev_df['rev_len'].idxmax()
# Show the values of a single row
rev_df.loc[loc_max_revision]
rev_id                                      173926275
rev_page                                       426208
rev_text_id                                         0
rev_comment                                          
rev_user                                            0
rev_user_text                          76.197.212.135
rev_timestamp                     2007-11-26 17:47:03
rev_minor_edit                                      0
rev_deleted                                         0
rev_len                                        288466
rev_parent_id                               173917961
rev_sha1              azle9orolzo3vh81tn7d4i7zehlxvqk
rev_content_model                                None
rev_content_format                               None
page_id                                        426208
page_namespace                                      0
page_title                                Mitt_Romney
page_restrictions                                    
page_counter                                        0
page_is_redirect                                    0
page_is_new                                         0
page_random                                  0.487268
page_touched                      2016-09-24 00:37:57
page_links_updated                2016-09-24 00:38:05
page_latest                                 740891468
page_len                                       251636
page_content_model                           wikitext
Name: 4419, dtype: object
Go inspect what this largest revision was. Fair warning: this particular revision is extremely NSFW.
rev_id = rev_df.loc[loc_max_revision,'rev_id']
print('https://en.wikipedia.org/w/index.php?oldid={0}'.format(rev_id))
https://en.wikipedia.org/w/index.php?oldid=173926275
Multiple revisions can be made in a single day. We want to be able to count the number of revisions per day and see how the article's length has changed over time.
# For each timestamp extract only its date (ignoring the time of day)
# and save as a new column "rev_date"
rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())
# Compare the full "rev_timestamp" to the new "rev_date"
rev_df[['rev_timestamp','rev_date']].head()
rev_timestamp | rev_date | |
---|---|---|
0 | 2004-01-10 07:37:48 | 2004-01-10 |
1 | 2004-01-25 21:23:55 | 2004-01-25 |
2 | 2004-03-07 22:46:47 | 2004-03-07 |
3 | 2004-03-17 08:37:03 | 2004-03-17 |
4 | 2004-04-05 06:17:41 | 2004-04-05 |
What dates had the most revisions in a single day? A simple way of doing this is to count the number of times each "rev_date" occurs in the data with the value_counts
method. Let's look at the top 10, five of which fall in October 2012, in the weeks before Election Day.
rev_df['rev_date'].value_counts().head(10)
2012-10-07    77
2012-10-06    73
2012-10-13    71
2012-05-13    65
2007-12-13    58
2012-07-27    54
2012-10-05    50
2012-05-30    50
2012-10-03    50
2012-07-20    49
Name: rev_date, dtype: int64
We can use the groupby functionality to split the data into "baby" DataFrames, one for each unique value in a column. For example, if we group by date, each baby DataFrame contains all the revisions that occurred on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.
Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group
method and look at the first 5 entries.
rev_gb_date = rev_df.groupby(['rev_date'])
_date = pd.Timestamp('2012-10-07').date()
rev_gb_date.get_group(_date).head(5)
rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1 | rev_content_model | rev_content_format | page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random | page_touched | page_links_updated | page_latest | page_len | page_content_model | rev_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9915 | 516397493 | 426208 | 0.0 | /* Political positions */ ce, improve prose | 11243047 | GabeMc | 2012-10-07 00:11:21 | 0 | 0 | 215302.0 | 516387652 | sylmwj2deuw63nk04da8uv1u2bgt8t3 | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | 2012-10-07 | |
9916 | 516398755 | 426208 | 0.0 | /* Political positions */ ce, avoid redundancy... | 11243047 | GabeMc | 2012-10-07 00:22:12 | 0 | 0 | 215311.0 | 516397493 | lsh4gi37dd000s67qsulfv35uf0i16i | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | 2012-10-07 | |
9917 | 516403792 | 426208 | 0.0 | clarify that church positions were not full time | 61960 | Anythingyouwant | 2012-10-07 01:06:39 | 0 | 0 | 215338.0 | 516398755 | pe5aabxcd8lvll3wv9y5c3n8bbhrc51 | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | 2012-10-07 | |
9918 | 516405141 | 426208 | 0.0 | /* University, France mission, marriage, and c... | 61960 | Anythingyouwant | 2012-10-07 01:19:16 | 0 | 0 | 215346.0 | 516403792 | d5lo1cm9nzg2oh7ply53xj9hdq7cua0 | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | 2012-10-07 | |
9919 | 516405398 | 426208 | 0.0 | /* University, France mission, marriage, and c... | 61960 | Anythingyouwant | 2012-10-07 01:21:48 | 0 | 0 | 215357.0 | 516405141 | 2mbiudrlxffpz90zx4pi5cq33xywkpb | None | None | 426208 | 0 | Mitt_Romney | 0 | 0 | 0 | 0.487268 | 2016-09-24 00:37:57 | 2016-09-24 00:38:05 | 740891468 | 251636 | wikitext | 2012-10-07 |
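If it helps to make the "baby" DataFrame idea concrete, a groupby object can also be iterated over directly; each iteration yields the grouping key and the corresponding sub-DataFrame. This is just an illustration, not a step in the lab.
# Illustration only: the first few dates and the number of revisions in each baby DataFrame
for _date, _group in list(rev_gb_date)[:5]:
    print(_date, len(_group))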
We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.
daily_revs = rev_gb_date.agg({'rev_id':pd.Series.nunique})
daily_revs.head()
rev_id | |
---|---|
rev_date | |
2004-01-10 | 1 |
2004-01-25 | 1 |
2004-03-07 | 1 |
2004-03-17 | 1 |
2004-04-05 | 1 |
Reindex the axis to make it continuous over the date range.
# Compute the first and last days revisions were observed
_min, _max = daily_revs.index.min(), daily_revs.index.max()
# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)
# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs_reindexed = daily_revs.reindex(_ix).fillna(0)
daily_revs_reindexed.head()
rev_id | |
---|---|
2004-01-10 | 1.0 |
2004-01-11 | 0.0 |
2004-01-12 | 0.0 |
2004-01-13 | 0.0 |
2004-01-14 | 0.0 |
What was the most active day of editing?
daily_revs_reindexed['rev_id'].sort_values(ascending=False).head(5)
2012-10-07    77.0
2012-10-06    73.0
2012-10-13    71.0
2012-05-13    65.0
2007-12-13    58.0
Name: rev_id, dtype: float64
Plot the number of revisions by day.
ax = daily_revs_reindexed['rev_id'].plot(lw=.5)
# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.
smoothed_s = daily_revs_reindexed['rev_id'].rolling(60).mean()
ax = smoothed_s.plot(lw=3)
# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
Plot the cumulative sum of the number of revisions over time using the cumsum
method.
ax = daily_revs_reindexed['rev_id'].cumsum().plot(lw=2)
# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Revisions')
Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.
daily_length = rev_gb_date.agg({'rev_len':np.median})
# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(method='ffill')
daily_length_reindexed.head()
rev_len | |
---|---|
2004-01-10 | 237.0 |
2004-01-11 | 237.0 |
2004-01-12 | 237.0 |
2004-01-13 | 237.0 |
2004-01-14 | 237.0 |
Plot the reindexed series, making the y-axis be on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.
ax = daily_length_reindexed['rev_len'].plot(lw=2,logy=True)
# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Article size (bytes)')
The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff
method on the "rev_len" series, we store the resulting differences as "rev_len_diff".
Compare the first five elements of the two series to verify it's worked as intended.
# Compute the diff and store as a new series
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()
# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len | rev_len_diff | |
---|---|---|
0 | 237.0 | NaN |
1 | 369.0 | 132.0 |
2 | 972.0 | 603.0 |
3 | 1139.0 | 167.0 |
4 | 1152.0 | 13.0 |
Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare) the "rev_len" value is null, which makes the diff null as well. Forward-fill the null "rev_len" values, recompute the diffs, and set the first revision's "rev_len_diff" equal to the size of the article at that first revision.
# Set the revision lengths for the null rows to be the previous value
rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')
# Compute the diffs again
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()
# Set the value for the first rev_len_diff to be the size of the rev_len
rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']
# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len | rev_len_diff | |
---|---|---|
0 | 237.0 | 237.0 |
1 | 369.0 | 132.0 |
2 | 972.0 | 603.0 |
3 | 1139.0 | 167.0 |
4 | 1152.0 | 13.0 |
Similarly, the "rev_timestamp" series only records the time when the revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff
method again on the "rev_timestamp" series, we store the resulting differences as "ref_timestamp_diff".
The resulting values are Timedeltas, which we convert to floating-point numbers of days by dividing by a one-day np.timedelta64.
Compare the first five elements of the two series to verify it's worked as intended.
# Compute the diff and store as a new series
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()
# Convert the values of the new series to floats by dividing by 1 day time delta.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))
# Verify the differences are accurate.
rev_df[['rev_timestamp','rev_timestamp_diff']].head()
rev_timestamp | rev_timestamp_diff | |
---|---|---|
0 | 2004-01-10 07:37:48 | NaN |
1 | 2004-01-25 21:23:55 | 15.573692 |
2 | 2004-03-07 22:46:47 | 42.057546 |
3 | 2004-03-17 08:37:03 | 9.409907 |
4 | 2004-04-05 06:17:41 | 18.903218 |
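An equivalent conversion, if you find it more readable, uses the .dt accessor on the Timedelta series instead of dividing by np.timedelta64; this aside should produce the same values as the cell above.
# Aside: equivalent conversion via total seconds divided by the number of seconds in a day
(rev_df['rev_timestamp'].diff().dt.total_seconds() / (24 * 60 * 60)).head()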
Because we added new series to the "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregate functions we did before into one step by defining an "agg_dict" that applies an aggregation function over the values in a specific column. This will return a DataFrame with only a subset of columns with the aggregated values we specified.
# Groupby again
rev_gb_date = rev_df.groupby(['rev_date'])
# Define an aggregation mapping of functions to columns names
agg_dict_date = {'rev_id':pd.Series.nunique,
'rev_len':np.median,
'rev_len_diff':np.median,
'rev_timestamp_diff':np.mean
}
# Execute the aggregation functions on the groupby object
rev_agg_date = rev_gb_date.agg(agg_dict_date)
# Reindex to continuous dates
rev_agg_date = rev_agg_date.reindex(_ix)
# Fill in the resulting values of the reindexed and aggregated "rev_len" from the last value
rev_agg_date[['rev_len']] = rev_agg_date[['rev_len']].fillna(method='ffill')
# Fill in the resulting values of the reindexed and aggregated other series as 0s.
rev_agg_date[['rev_id','rev_len_diff']] = rev_agg_date[['rev_id','rev_len_diff']].fillna(0)
#rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']] = rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']].fillna(0)
# Show the last 10 values to verify it's worked as intended
rev_agg_date.tail(10)
rev_len | rev_len_diff | rev_timestamp_diff | rev_id | |
---|---|---|---|---|
2016-09-15 | 251695.0 | 0.0 | NaN | 0.0 |
2016-09-16 | 251695.0 | 0.0 | NaN | 0.0 |
2016-09-17 | 251695.0 | 0.0 | NaN | 0.0 |
2016-09-18 | 251636.0 | -59.0 | 16.369988 | 1.0 |
2016-09-19 | 251636.0 | 0.0 | NaN | 0.0 |
2016-09-20 | 251636.0 | 0.0 | NaN | 0.0 |
2016-09-21 | 251636.0 | 0.0 | NaN | 0.0 |
2016-09-22 | 251636.0 | 0.0 | NaN | 0.0 |
2016-09-23 | 251636.0 | 0.0 | NaN | 0.0 |
2016-09-24 | 251636.0 | 0.0 | 5.920498 | 1.0 |
Plot the changes in the average time since the last revision by day.
_s = rev_agg_date['rev_timestamp_diff']
ax = _s.plot(logy=True)
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
That's pretty noisy. We can apply the rolling
method over a 60-day window and get the mean to smooth out some of the noise.
_s = rev_agg_date['rev_timestamp_diff'].fillna(0).rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
Similarly, what was the median size of a change to the article over time?
_s = rev_agg_date['rev_len_diff']
ax = _s.plot(lw=.5)
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
ax.set_yscale('symlog')