In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database.
What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
# extract data
import pandas as pd
import numpy as np

df = pd.read_csv('data/adult_data.csv',
                 usecols = ['age', 'education', 'occupation', 'race', 'sex',
                            'hours-per-week', 'native-country', 'salary'])
df.head()
# explore data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   education       32561 non-null  object
 2   occupation      32561 non-null  object
 3   race            32561 non-null  object
 4   sex             32561 non-null  object
 5   hours-per-week  32561 non-null  int64
 6   native-country  32561 non-null  object
 7   salary          32561 non-null  object
dtypes: int64(2), object(6)
memory usage: 2.0+ MB
# 1. how many people of each race are represented in this dataset?
df.groupby(['race']).size().sort_values(ascending = False)
race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
dtype: int64
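The same kind of count can also be sketched with `Series.value_counts()`, which sorts in descending order by default. The rows below are made-up toy data, not the census file, and the names `toy`, `by_groupby` and `by_value_counts` are only illustrative.

```python
import pandas as pd

# Made-up toy rows (NOT the census data), just to compare the two approaches.
toy = pd.DataFrame({'race': ['White', 'Black', 'White', 'Other', 'White', 'Black']})

# Approach used in the notebook: group, count rows per group, sort descending.
by_groupby = toy.groupby('race').size().sort_values(ascending=False)

# Shorter alternative: value_counts() counts and sorts descending in one call.
by_value_counts = toy['race'].value_counts()

print(by_value_counts)
```

Both approaches should give the same counts per category; `value_counts()` is simply more compact when only one column is involved.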
# 2. average age of men
# keep only the rows where sex is 'Male', then take the arithmetic mean of age
avg_age = df.loc[df['sex'] == 'Male', 'age'].mean()
round(avg_age, 1)
# 3. percentage of people who have a Bachelor's degree
s = df.groupby(['education']).size()
# number of people with a Bachelor's degree divided by the total number of people
percentage_bachelors = (s['Bachelors'] / s.sum()) * 100
round(percentage_bachelors, 1)
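As a possible shortcut for this kind of share, `value_counts(normalize=True)` returns relative frequencies directly. This is a minimal sketch on made-up toy rows (not the census data); `toy`, `shares` and `pct_bachelors` are illustrative names.

```python
import pandas as pd

# Made-up toy rows (NOT the census data): 3 of the 8 people hold a Bachelors degree.
toy = pd.DataFrame({'education': ['Bachelors', 'HS-grad', 'Masters', 'Bachelors',
                                  'HS-grad', 'HS-grad', 'Doctorate', 'Bachelors']})

# normalize=True gives fractions that sum to 1; multiply by 100 for percentages.
shares = toy['education'].value_counts(normalize=True) * 100

pct_bachelors = round(shares['Bachelors'], 1)
print(pct_bachelors)
```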
# create two datasets for future operations: group by salary
high_salary = df[df['salary'] == '>50K']
low_salary = df[df['salary'] == '<=50K']
# create two datasets for future operations: group by education
higher = df.salary[df['education'].isin(['Bachelors', 'Doctorate', 'Masters'])]
lower = df.salary[~df['education'].isin(['Bachelors', 'Doctorate', 'Masters'])]
# 4. percentage of people with advanced education making more than 50k
advanced = round((higher[higher == '>50K'].count() / higher.count()) * 100, 1)

# 5. percentage of people without advanced education making more than 50k
without_advanced = round((lower[lower == '>50K'].count() / lower.count()) * 100, 1)

# output
display(advanced)
display(without_advanced)
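One way to express the same two percentages is to take the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses made-up toy rows (not the census data), and the names `is_advanced` and `earns_high` are assumptions introduced for illustration.

```python
import pandas as pd

# Made-up toy rows (NOT the census data).
toy = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'Doctorate', 'Bachelors',
                  'HS-grad', 'HS-grad', '11th', 'Some-college'],
    'salary':    ['>50K', '>50K', '<=50K', '<=50K',
                  '<=50K', '>50K', '<=50K', '<=50K'],
})

is_advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
earns_high = toy['salary'] == '>50K'

# The mean of a boolean Series is the fraction of True values.
advanced_pct = round(earns_high[is_advanced].mean() * 100, 1)
without_advanced_pct = round(earns_high[~is_advanced].mean() * 100, 1)

print(advanced_pct, without_advanced_pct)
```

Keeping the two masks as named variables avoids repeating the list of degrees and makes the negation (`~is_advanced`) explicit.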
# 6. min hours a person works per week
min_hours = df.loc[:,'hours-per-week'].min()
min_hours
# 7. percentage of people working the minimum number of hours who earn more than 50K
min_hours = df['hours-per-week'].min()
high_min_hours = high_salary[high_salary['hours-per-week'] == min_hours]
all_min_hours = df[df['hours-per-week'] == min_hours]
# scale to a percentage before rounding so that precision is not lost
round(len(high_min_hours) / len(all_min_hours) * 100, 1)
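For a percentage like this one, the order of rounding and scaling matters: rounding the raw ratio to one decimal and then multiplying by 100 keeps only about one significant digit. A small sketch on made-up toy rows (not the census data) shows the difference; all variable names here are illustrative.

```python
import pandas as pd

# Made-up toy rows (NOT the census data): 3 people work the minimum, 1 of them earns >50K.
toy = pd.DataFrame({
    'hours-per-week': [1, 1, 1, 40, 40],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '<=50K'],
})

min_hours = toy['hours-per-week'].min()
at_min = toy[toy['hours-per-week'] == min_hours]
ratio = (at_min['salary'] == '>50K').mean()  # fraction of high earners among min-hour workers

scaled_then_rounded = round(ratio * 100, 1)  # keeps one decimal of the percentage
rounded_then_scaled = round(ratio, 1) * 100  # loses everything past the first digit of the ratio

print(scaled_then_rounded, rounded_then_scaled)
```

With a ratio of one third, the first form gives roughly 33.3 while the second collapses to roughly 30.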
# 8. country with the highest percentage of people that earn > 50K
# count of people earning > 50K by country, divided by the total number of people by country
percentages = round((high_salary.groupby(['native-country']).size()
                     / df.groupby(['native-country']).size()) * 100.0, 1)
percentages.sort_values(ascending = False)  # get highest percentage
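When two group-size Series are divided, a country with no high earners is missing from the numerator and ends up as `NaN` after index alignment. A possible alternative, sketched here on made-up toy rows with invented country names (not the census data), is to group a boolean mask and take its mean, which yields 0 for such countries.

```python
import pandas as pd

# Made-up toy rows with invented country names (NOT the census data).
toy = pd.DataFrame({
    'native-country': ['Aland', 'Aland', 'Bland', 'Bland', 'Bland', 'Cland'],
    'salary':         ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '<=50K'],
})

# Group the boolean mask by country; the mean per group is the share of high earners.
rate = (toy['salary'] == '>50K').groupby(toy['native-country']).mean() * 100

top_country = rate.idxmax()        # label of the country with the highest share
top_pct = round(rate.max(), 1)     # the share itself, as a percentage

print(top_country, top_pct)
```

`idxmax()` and `max()` return the label and the value directly, so no sorting is needed to read off the top entry.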
# 9. most common occupation for those who earn >50K in India
high_salary[high_salary['native-country'] == 'India'].groupby(
    ['occupation']).size().sort_values(ascending = False).head(1).index
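The expression above returns an `Index` object holding a single label. If a plain string is preferred, one option is `value_counts().idxmax()`, sketched below on made-up toy rows (not the census data); `mask` and `top_occupation` are illustrative names.

```python
import pandas as pd

# Made-up toy rows (NOT the census data).
toy = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'India', 'United-States'],
    'salary':         ['>50K', '>50K', '>50K', '<=50K', '>50K'],
    'occupation':     ['Prof-specialty', 'Sales', 'Prof-specialty', 'Sales', 'Sales'],
})

# Combine both conditions into one mask, then pick the most frequent occupation.
mask = (toy['native-country'] == 'India') & (toy['salary'] == '>50K')
top_occupation = toy.loc[mask, 'occupation'].value_counts().idxmax()

print(top_occupation)
```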