import pandas as pd
import numpy as np
%matplotlib inline
data = pd.read_csv("data_1304.csv",encoding='latin-1')
data.dropna(inplace=True)
pd.to_datetime(data['_started_at'].head())
0   2018-01-10 15:55:00
1   2018-01-10 17:04:00
2   2018-01-10 15:53:00
3   2018-01-11 02:24:00
4   2018-01-11 05:14:00
Name: _started_at, dtype: datetime64[ns]
pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])  # use pd.to_numeric() to convert to number of ns
0    00:25:00
1    00:19:20
2    00:28:06
4    00:07:22
5    00:25:10
6    00:13:59
7    00:16:48
8    00:16:12
9    00:22:55
10   00:13:19
11   00:10:26
13   00:17:22
14   00:17:55
15   00:24:56
16   00:27:48
17   00:25:27
18   00:28:41
19   00:22:18
20   00:08:38
21   00:06:07
22   00:28:04
23   00:27:28
25   00:23:54
26   00:20:47
27   00:18:40
29   00:09:51
30   00:17:40
33   00:09:06
34   00:24:43
35   00:27:52
       ...
57   00:26:12
59   00:13:00
61   00:21:06
62   00:24:37
63   00:25:15
65   00:15:52
66   00:12:22
68   00:14:17
70   00:24:54
71   00:29:49
72   00:23:56
74   00:24:29
75   00:28:43
76   00:16:54
77   00:20:06
78   00:24:14
79   00:20:13
82   00:24:47
83   00:18:22
85   00:18:31
87   00:21:30
88   00:20:33
89   00:23:10
93   00:26:09
94   00:20:54
95   00:19:57
96   00:29:33
97   00:23:53
98   00:21:24
99   00:27:11
Length: 73, dtype: timedelta64[ns]
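To work with these durations numerically, we can convert them to seconds (or, as the comment above notes, to nanoseconds with pd.to_numeric()); a minimal sketch:
durations = pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])
seconds = durations.dt.total_seconds()  # float number of seconds
nanoseconds = pd.to_numeric(durations)  # int64 number of nanoseconds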
print(len(data))
data
100
  | better_0 | _unit_id | _started_at | _created_at | _trust | _worker_id | _city | age | similarity_0 | explanation_0 | asi1 |
---|---|---|---|---|---|---|---|---|---|---|---
0 | Your keyword | 4 | 01.10.18 15:55 | 01.10.18 16:20 | 0.385118 | 32 | Ernakulam | 36-50 | 6 | they are all dressed well and using computers ... | 4 |
1 | The two keywords are completely identical | 6 | 01.10.18 17:04 | 01.10.18 17:23 | 0.033270 | 13 | Kolkata | 36-50 | 6 | Almost identicalexcept the tiny spelling diffe... | 3 |
2 | Search engine query | 18 | 01.10.18 15:53 | 01.10.18 16:21 | 0.551213 | 21 | Pune | 36-50 | 6 | All the images represents the search better... | 1 |
3 | The two keywords are completely identical | 1 | 01.11.18 02:24 | 01.11.18 02:47 | 0.204184 | 82 | NaN | 19-25 | 7 | they wear casual clothes | 4 |
4 | The two keywords are completely identical | 6 | 01.11.18 05:14 | 01.11.18 05:21 | 0.708808 | 70 | Mangalagiri | 19-25 | 7 | both are similar | 5 |
5 | The two keywords are completely identical | 20 | 01.10.18 16:41 | 01.10.18 17:06 | 0.899786 | 95 | Patna | 26-35 | 7 | they both describe the same kind of people | 3 |
6 | Your keyword | 13 | 01.10.18 15:47 | 01.10.18 16:01 | 0.873825 | 37 | Ulhasnagar | 19-25 | 6 | We can see a relaxed state in that images | 5 |
7 | Search engine query | 30 | 01.10.18 15:40 | 01.10.18 15:57 | 0.264847 | 78 | Kolkata | 19-25 | 5 | YES | 3 |
8 | Your keyword | 9 | 01.11.18 04:55 | 01.11.18 05:11 | 0.512431 | 29 | Mangalagiri | 19-25 | 6 | THEY ARE THINKING | 3 |
9 | Your keyword | 7 | 01.10.18 15:39 | 01.10.18 16:02 | 0.260237 | 87 | Hyderabad | 26-35 | 6 | A person is generalized and one cannot find th... | 5 |
10 | Search engine query | 10 | 01.10.18 16:25 | 01.10.18 16:38 | 0.915093 | 91 | New Delhi | 19-25 | 4 | they are calm | 3 |
11 | Your keyword | 25 | 01.11.18 03:56 | 01.11.18 04:06 | 0.212509 | 79 | Roorkee | 19-25 | 4 | genious | 3 |
12 | The two keywords are completely identical | 17 | 01.10.18 17:01 | 01.10.18 17:26 | 0.557112 | 43 | Chennai | 26-35 | 6 | NaN | 3 |
13 | Search engine query | 10 | 01.10.18 19:57 | 01.10.18 20:15 | 0.770169 | 6 | Hyderabad | 26-35 | 6 | only 1 image | 4 |
14 | The two keywords are completely identical | 10 | 01.11.18 03:02 | 01.11.18 03:20 | 0.914456 | 86 | Roorkee | 26-35 | 7 | interested in their work | 4 |
15 | Search engine query | 26 | 01.10.18 16:49 | 01.10.18 17:14 | 0.283502 | 71 | Cochin | 19-25 | 3 | i think this is correct that calm person becau... | 3 |
16 | Search engine query | 4 | 01.10.18 20:14 | 01.10.18 20:42 | 0.373995 | 64 | Kolkata | 19-25 | 5 | YES | 4 |
17 | Your keyword | 14 | 01.10.18 17:45 | 01.10.18 18:11 | 0.035891 | 85 | Mumbai | 26-35 | 4 | images looks like taking a deep breath | 5 |
18 | Search engine query | 14 | 01.11.18 04:34 | 01.11.18 05:02 | 0.797305 | 34 | Amritsar | 26-35 | 4 | it now seems more like to give these results w... | 4 |
19 | Your keyword | 15 | 01.10.18 19:23 | 01.10.18 19:45 | 0.814008 | 11 | Bhopal | 36-50 | 4 | based on result of image | 5 |
20 | Your keyword | 30 | 01.11.18 04:06 | 01.11.18 04:15 | 0.634484 | 80 | Dehradun | 19-25 | 4 | whipping | 3 |
21 | Search engine query | 17 | 01.10.18 16:51 | 01.10.18 16:57 | 0.613763 | 98 | New Delhi | 19-25 | 4 | Yes | 2 |
22 | The two keywords are completely identical | 10 | 01.10.18 16:19 | 01.10.18 16:47 | 0.189142 | 17 | Kolkata | 36-50 | 7 | calm person and calmness same | 5 |
23 | Your keyword | 1 | 01.10.18 16:49 | 01.10.18 17:17 | 0.801677 | 20 | Bangalore | 26-35 | 5 | result suits more to this kerword | 4 |
24 | Your keyword | 27 | 01.11.18 04:23 | 01.11.18 04:47 | 0.825728 | 84 | NaN | 19-25 | 4 | engry | 2 |
25 | Search engine query | 28 | 01.10.18 15:44 | 01.10.18 16:08 | 0.935156 | 83 | Pune | 19-25 | 5 | working person uses the things that i mentioned | 4 |
26 | Search engine query | 15 | 01.10.18 16:23 | 01.10.18 16:44 | 0.851697 | 94 | Kolkata | 19-25 | 6 | yes | 3 |
27 | The two keywords are completely identical | 27 | 01.10.18 17:12 | 01.10.18 17:30 | 0.796527 | 65 | Meerut | 19-25 | 4 | anger | 2 |
28 | The two keywords are completely identical | 9 | 01.11.18 05:12 | 01.11.18 05:25 | 0.333228 | 61 | Bokaro | 19-25 | 6 | NaN | 4 |
29 | Search engine query | 16 | 01.10.18 16:11 | 01.10.18 16:21 | 0.940862 | 90 | Dehradun | 36-50 | 4 | hot air baloon | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
70 | Your keyword | 17 | 01.11.18 04:26 | 01.11.18 04:51 | 0.427894 | 53 | Mangalagiri | 19-25 | 6 | by query image i understood that person seems ... | 3 |
71 | Your keyword | 7 | 01.10.18 23:35 | 01.11.18 00:04 | 0.621600 | 28 | Kolkata | 50-80 | 7 | Everything is related with warm | 3 |
72 | Search engine query | 7 | 01.10.18 16:06 | 01.10.18 16:30 | 0.677330 | 38 | Hyderabad | 26-35 | 5 | it gives better ideas about all the image | 4 |
73 | Search engine query | 11 | 01.11.18 02:38 | 01.11.18 02:54 | 0.504001 | 56 | Chennai | 19-25 | 5 | NaN | 4 |
74 | Search engine query | 24 | 01.11.18 04:06 | 01.11.18 04:31 | 0.354657 | 67 | Mangalagiri | 19-25 | 7 | we got the same image when search in google | 5 |
75 | Search engine query | 16 | 01.10.18 19:32 | 01.10.18 20:01 | 0.090422 | 52 | Hyderabad | 26-35 | 6 | with the facial expression we can find him too... | 4 |
76 | Search engine query | 19 | 01.10.18 16:31 | 01.10.18 16:47 | 0.967048 | 59 | Pune | 19-25 | 6 | Their some people shouting at each other | 4 |
77 | Your keyword | 2 | 01.10.18 21:04 | 01.10.18 21:25 | 0.928919 | 25 | Howrah | 19-25 | 6 | very much about that | 3 |
78 | Search engine query | 24 | 01.10.18 16:36 | 01.10.18 17:00 | 0.525274 | 77 | Kolkata | 26-35 | 7 | It's the image of that | 3 |
79 | The two keywords are completely identical | 21 | 01.11.18 03:08 | 01.11.18 03:28 | 0.588048 | 31 | Unnao | 26-35 | 7 | Smart Person Bring Innovation and must have hi... | 0 |
80 | The two keywords are completely identical | 31 | 01.10.18 15:43 | 01.10.18 16:03 | 0.916103 | 41 | Nellore | 26-35 | 7 | NaN | 4 |
81 | Search engine query | 30 | 01.10.18 17:14 | 01.10.18 17:34 | 0.752203 | 88 | NaN | NaN | 6 | it is more relative. | 5 |
82 | The two keywords are completely identical | 4 | 01.10.18 15:39 | 01.10.18 16:04 | 0.625525 | 3 | Hyderabad | 36-50 | 5 | person in aggression is shouting at others | 5 |
83 | Your keyword | 30 | 01.11.18 05:56 | 01.11.18 06:15 | 0.042048 | 5 | Burdwan | 26-35 | 7 | they are all were casual dress | 5 |
84 | The two keywords are completely identical | 23 | 01.11.18 03:52 | 01.11.18 04:07 | 0.464201 | 39 | Siuri | 26-35 | 7 | NaN | 5 |
85 | Your keyword | 7 | 01.10.18 17:08 | 01.10.18 17:26 | 0.280744 | 49 | Noida | 0-18 | 3 | By nature | 5 |
86 | Search engine query | 29 | 01.10.18 16:13 | 01.10.18 16:39 | 0.278253 | 73 | NaN | 26-35 | 7 | Casual suits more than free style | 4 |
87 | Your keyword | 24 | 01.10.18 19:54 | 01.10.18 20:15 | 0.888706 | 35 | Delhi | 26-35 | 7 | because it shows that | 3 |
88 | The two keywords are completely identical | 27 | 01.11.18 04:05 | 01.11.18 04:26 | 0.058092 | 55 | Mangalagiri | 19-25 | 7 | we got the same image when search in google | 3 |
89 | Search engine query | 21 | 01.10.18 15:50 | 01.10.18 16:13 | 0.805073 | 50 | Hyderabad | 26-35 | 7 | people are working i guess working people is m... | 4 |
90 | Your keyword | 20 | 01.10.18 15:50 | 01.10.18 16:01 | 0.761000 | 96 | NaN | 26-35 | 4 | frustreted | 2 |
91 | Your keyword | 17 | 01.11.18 03:56 | 01.11.18 04:21 | 0.117168 | 9 | Erode | 26-35 | 6 | NaN | 4 |
92 | The two keywords are completely identical | 21 | 01.10.18 16:11 | 01.10.18 16:22 | 0.343565 | 93 | Pune | 26-35 | 7 | NaN | 5 |
93 | The two keywords are completely identical | 3 | 01.11.18 04:13 | 01.11.18 04:39 | 0.853188 | 45 | Mangalagiri | 19-25 | 7 | BOTH ARE SIMILAR | 5 |
94 | Your keyword | 13 | 01.10.18 16:01 | 01.10.18 16:22 | 0.325484 | 57 | Guwahati | 36-50 | 6 | They also look happy | 4 |
95 | The two keywords are completely identical | 11 | 01.11.18 05:45 | 01.11.18 06:05 | 0.988551 | 23 | Mangalagiri | 0-18 | 7 | similar | 5 |
96 | Search engine query | 8 | 01.11.18 02:23 | 01.11.18 02:52 | 0.520720 | 0 | Hyderabad | 26-35 | 5 | Since not all the images belong to science exa... | 4 |
97 | Search engine query | 15 | 01.11.18 01:56 | 01.11.18 02:20 | 0.046097 | 7 | Chennai | 26-35 | 5 | On detailed viewing smart person might be a be... | 4 |
98 | Your keyword | 3 | 01.11.18 06:49 | 01.11.18 07:11 | 0.091185 | 27 | Kolkata | 36-50 | 5 | everybody is yelling | 4 |
99 | Search engine query | 31 | 01.11.18 04:19 | 01.11.18 04:46 | 0.951531 | 16 | Mangalagiri | 19-25 | 4 | a complete act of expression works out here | 3 |
100 rows × 11 columns
data.describe()
  | _unit_id | _trust | _worker_id | similarity_0 | asi1 |
---|---|---|---|---|---
count | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
mean | 15.730000 | 0.542089 | 49.500000 | 5.640000 | 3.680000 |
std | 8.539693 | 0.301171 | 29.011492 | 1.275408 | 1.071957 |
min | 0.000000 | 0.033270 | 0.000000 | 2.000000 | 0.000000 |
25% | 9.000000 | 0.295970 | 24.750000 | 5.000000 | 3.000000 |
50% | 15.000000 | 0.571722 | 49.500000 | 6.000000 | 4.000000 |
75% | 23.000000 | 0.807306 | 74.250000 | 7.000000 | 4.000000 |
max | 31.000000 | 0.988551 | 99.000000 | 7.000000 | 5.000000 |
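Note that describe() only summarises the numeric columns by default; to include the categorical ones as well:
data.describe(include='all')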
Let's see how the answers are distributed, and how many judgments we have per unit
data.groupby('better_0').size()
better_0
Search engine query                          33
The two keywords are completely identical    32
Your keyword                                 35
dtype: int64
data.groupby('_unit_id').size().values
array([1, 3, 3, 2, 4, 1, 2, 7, 1, 2, 4, 3, 2, 4, 6, 7, 3, 5, 3, 1, 2, 4, 1, 6, 5, 3, 2, 4, 2, 1, 4, 2])
data.groupby('_unit_id').size().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0abee0d30>
Let's remove the units that have only one judgment
(data.groupby('_unit_id').size()==1).values
array([ True, False, False, False, False, True, False, False, True, True, False, True, False, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, True])
sizes = data.groupby('_unit_id').size()
a = sizes[sizes == 1].index  # take the unit ids themselves; np.where() would give positions, not labels
a
Int64Index([0, 5, 8, 9, 11, 18, 19, 28], dtype='int64', name='_unit_id')
a = list(a)
a
[0, 5, 8, 9, 11, 18, 19, 28]
data = data[~data['_unit_id'].isin(a)].copy()  # .copy() avoids a SettingWithCopyWarning when adding columns below
len(data)
63
data['time_spent'] = pd.to_datetime(data['_created_at']) - pd.to_datetime(data['_started_at'])
data.head()
  | better_0 | _unit_id | _started_at | _created_at | _trust | _worker_id | _city | age | similarity_0 | explanation_0 | asi1 | time_spent |
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Your keyword | 4 | 1/10/2018 15:55:41 | 1/10/2018 16:20:41 | 0.385118 | 32 | Ernakulam | 36-50 | 6 | they are all dressed well and using computers ... | 4 | 00:25:00 |
1 | The two keywords are completely identical | 6 | 1/10/2018 17:04:22 | 1/10/2018 17:23:42 | 0.033270 | 13 | Kolkata | 36-50 | 6 | Almost identicalexcept the tiny spelling diffe... | 3 | 00:19:20 |
4 | The two keywords are completely identical | 6 | 1/11/2018 05:14:03 | 1/11/2018 05:21:25 | 0.708808 | 70 | Mangalagiri | 19-25 | 7 | both are similar | 5 | 00:07:22 |
5 | The two keywords are completely identical | 20 | 1/10/2018 16:41:16 | 1/10/2018 17:06:26 | 0.899786 | 95 | Patna | 26-35 | 7 | they both describe the same kind of people | 3 | 00:25:10 |
6 | Your keyword | 13 | 1/10/2018 15:47:20 | 1/10/2018 16:01:19 | 0.873825 | 37 | Ulhasnagar | 19-25 | 6 | We can see a relaxed state in that images | 5 | 00:13:59 |
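With time_spent in place we can run a simple speed-based quality check; a minimal sketch, assuming a hypothetical 30-second threshold for a suspiciously fast judgment:
too_fast = data['time_spent'] < pd.Timedelta(seconds=30)  # hypothetical threshold
print(too_fast.sum(), 'suspiciously fast judgments')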
data.groupby('_unit_id')['similarity_0'].mean()
_unit_id
1     5.500000
2     4.666667
3     6.000000
4     5.750000
6     6.500000
7     5.400000
10    6.000000
13    6.333333
14    4.000000
15    5.166667
16    5.000000
17    5.666667
20    7.000000
21    7.000000
23    6.500000
24    6.200000
25    4.333333
26    4.500000
27    5.666667
30    5.333333
31    4.000000
Name: similarity_0, dtype: float64
If we are also doing a per-worker analysis, we can compute per-worker aggregates, e.g. each worker's average trust score
data.groupby('_worker_id')['_trust'].mean().values
array([0.45086904, 0.92770687, 0.62552536, 0.93464997, 0.04204837, 0.77016858, 0.04609719, 0.60929146, 0.5741613 , 0.81400838, 0.03326987, 0.91541507, 0.95153081, 0.18914215, 0.30560117, 0.80167709, 0.97441615, 0.92891881, 0.96747946, 0.09118499, 0.62159951, 0.58563959, 0.58804797, 0.38511825, 0.12424342, 0.79730475, 0.88870635, 0.87382468, 0.67732971, 0.85318828, 0.34576951, 0.28074398, 0.80507253, 0.05786407, 0.09042158, 0.42789365, 0.05809224, 0.32548398, 0.30012607, 0.03610733, 0.85113121, 0.37399525, 0.79652694, 0.62465149, 0.3546574 , 0.91410825, 0.70880836, 0.28350176, 0.91083596, 0.33243423, 0.03891988, 0.52527424, 0.26484709, 0.21250903, 0.63448413, 0.03589091, 0.91445626, 0.2602371 , 0.9408621 , 0.9150934 , 0.85169676, 0.89978581, 0.61376345])
data.groupby('_worker_id')['_trust'].mean().hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0b3eff390>
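If we wanted to filter on worker quality, a sketch with a hypothetical trust cut-off of 0.3 could look like this:
trusted = data[data['_trust'] >= 0.3]  # 0.3 is an arbitrary, hypothetical threshold
print(len(trusted), 'of', len(data), 'judgments would remain')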
We can't simply take the mean of better_0, because it is a categorical variable:
data.groupby('_unit_id')['better_0'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-20-2f584eae24bb> in <module>()
----> 1 data.groupby('_unit_id')['better_0'].mean()

/srv/paws/lib/python3.6/site-packages/pandas/core/groupby.py in mean(self, *args, **kwargs)
   1126         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1127         try:
-> 1128             return self._cython_agg_general('mean', **kwargs)
   1129         except GroupByError:
   1130             raise

/srv/paws/lib/python3.6/site-packages/pandas/core/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    925
    926         if len(output) == 0:
--> 927             raise DataError('No numeric types to aggregate')
    928
    929         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate
Let's explore this column and decide what to do
data.groupby('_unit_id')['better_0'].describe()
_unit_id | count | unique | top | freq |
---|---|---|---|---
1 | 2 | 2 | Your keyword | 1 |
2 | 3 | 2 | Your keyword | 2 |
3 | 2 | 2 | Your keyword | 1 |
4 | 4 | 3 | The two keywords are completely identical | 2 |
6 | 2 | 1 | The two keywords are completely identical | 2 |
7 | 5 | 2 | Your keyword | 3 |
10 | 4 | 2 | The two keywords are completely identical | 2 |
13 | 3 | 2 | Your keyword | 2 |
14 | 3 | 2 | Search engine query | 2 |
15 | 6 | 2 | Your keyword | 4 |
16 | 2 | 1 | Search engine query | 2 |
17 | 3 | 3 | Your keyword | 1 |
20 | 1 | 1 | The two keywords are completely identical | 1 |
21 | 2 | 2 | Search engine query | 1 |
23 | 4 | 1 | The two keywords are completely identical | 4 |
24 | 5 | 2 | Search engine query | 3 |
25 | 3 | 3 | Your keyword | 1 |
26 | 2 | 1 | Search engine query | 2 |
27 | 3 | 2 | The two keywords are completely identical | 2 |
30 | 3 | 2 | Your keyword | 2 |
31 | 1 | 1 | Search engine query | 1 |
print(data['better_0'].unique())
len(data['better_0'].unique())
['Your keyword' 'The two keywords are completely identical' 'Search engine query']
3
The majority vote of an array is simply the mode
data['better_0'].mode()
0    Search engine query
1           Your keyword
dtype: object
How is the variable distributed?
data.groupby('better_0')['better_0'].size()
better_0
Search engine query                          22
The two keywords are completely identical    19
Your keyword                                 22
Name: better_0, dtype: int64
Let's compute the majority vote for each unit
data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode())
_unit_id
1   0    The two keywords are completely identical
    1                                 Your keyword
2   0                                 Your keyword
3   0    The two keywords are completely identical
    1                                 Your keyword
4   0    The two keywords are completely identical
6   0    The two keywords are completely identical
7   0                                 Your keyword
10  0                          Search engine query
    1    The two keywords are completely identical
13  0                                 Your keyword
14  0                          Search engine query
15  0                                 Your keyword
16  0                          Search engine query
17  0                          Search engine query
    1    The two keywords are completely identical
    2                                 Your keyword
20  0    The two keywords are completely identical
21  0                          Search engine query
    1    The two keywords are completely identical
23  0    The two keywords are completely identical
24  0                          Search engine query
25  0                          Search engine query
    1    The two keywords are completely identical
    2                                 Your keyword
26  0                          Search engine query
27  0    The two keywords are completely identical
30  0                                 Your keyword
31  0                          Search engine query
Name: better_0, dtype: object
Sometimes this returns two values (a tie); in that case let's take the first one (a better approach would be to break the tie at random, as sketched below)
data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode()[0])
_unit_id
1     The two keywords are completely identical
2                                  Your keyword
3     The two keywords are completely identical
4     The two keywords are completely identical
6     The two keywords are completely identical
7                                  Your keyword
10                          Search engine query
13                                 Your keyword
14                          Search engine query
15                                 Your keyword
16                          Search engine query
17                          Search engine query
20    The two keywords are completely identical
21                          Search engine query
23    The two keywords are completely identical
24                          Search engine query
25                          Search engine query
26                          Search engine query
27    The two keywords are completely identical
30                                 Your keyword
31                          Search engine query
Name: better_0, dtype: object
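A sketch of the random tie-breaking mentioned above:
import random
data.groupby('_unit_id')['better_0'].apply(lambda x: random.choice(list(x.mode())))  # pick one of the tied modes at random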
def weighted_mean(df, weights, values):
    # df is a DataFrame containing the judgments for a single unit
    sum_values = (df[weights] * df[values]).sum()
    total_weight = df[weights].sum()
    return sum_values / total_weight
data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'similarity_0'))
_unit_id
1     5.532764
2     4.961362
3     6.806888
4     5.789938
6     6.955167
7     5.675547
10    5.739468
13    6.437989
14    4.000000
15    5.166934
16    4.175357
17    5.840521
20    7.000000
21    7.000000
23    6.465985
24    6.481138
25    4.525120
26    4.556271
27    4.706914
30    4.415340
31    4.000000
dtype: float64
data.groupby('_unit_id').apply(lambda x: (x['_trust']*x['similarity_0']).sum()/(x['_trust'].sum()))
_unit_id
1     5.532764
2     4.961362
3     6.806888
4     5.789938
6     6.955167
7     5.675547
10    5.739468
13    6.437989
14    4.000000
15    5.166934
16    4.175357
17    5.840521
20    7.000000
21    7.000000
23    6.465985
24    6.481138
25    4.525120
26    4.556271
27    4.706914
30    4.415340
31    4.000000
dtype: float64
Now we need, for each unit, to find the category with the highest total trust score
data.head()
  | better_0 | _unit_id | _started_at | _created_at | _trust | _worker_id | _city | age | similarity_0 | explanation_0 | asi1 | time_spent |
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Your keyword | 4 | 1/10/2018 15:55:41 | 1/10/2018 16:20:41 | 0.385118 | 32 | Ernakulam | 36-50 | 6 | they are all dressed well and using computers ... | 4 | 00:25:00 |
1 | The two keywords are completely identical | 6 | 1/10/2018 17:04:22 | 1/10/2018 17:23:42 | 0.033270 | 13 | Kolkata | 36-50 | 6 | Almost identicalexcept the tiny spelling diffe... | 3 | 00:19:20 |
4 | The two keywords are completely identical | 6 | 1/11/2018 05:14:03 | 1/11/2018 05:21:25 | 0.708808 | 70 | Mangalagiri | 19-25 | 7 | both are similar | 5 | 00:07:22 |
5 | The two keywords are completely identical | 20 | 1/10/2018 16:41:16 | 1/10/2018 17:06:26 | 0.899786 | 95 | Patna | 26-35 | 7 | they both describe the same kind of people | 3 | 00:25:10 |
6 | Your keyword | 13 | 1/10/2018 15:47:20 | 1/10/2018 16:01:19 | 0.873825 | 37 | Ulhasnagar | 19-25 | 6 | We can see a relaxed state in that images | 5 | 00:13:59 |
def weighted_majority(df, weights, values):
    # df is a DataFrame containing the judgments for a single unit:
    # sum the trust weights per answer and return the answer with the largest total
    best_value = df.groupby(values)[weights].sum().idxmax()  # idxmax, since 'argmax' is deprecated
    return best_value
data.groupby('_unit_id').apply(lambda x: weighted_majority(x, '_trust', 'better_0'))
_unit_id
1     The two keywords are completely identical
2                                  Your keyword
3     The two keywords are completely identical
4     The two keywords are completely identical
6     The two keywords are completely identical
7                           Search engine query
10                          Search engine query
13                                 Your keyword
14                          Search engine query
15                                 Your keyword
16                          Search engine query
17    The two keywords are completely identical
20    The two keywords are completely identical
21                          Search engine query
23    The two keywords are completely identical
24                          Search engine query
25    The two keywords are completely identical
26                          Search engine query
27    The two keywords are completely identical
30                                 Your keyword
31                          Search engine query
dtype: object
results = pd.DataFrame()
results['better'] = data.groupby('_unit_id').apply(lambda x: weighted_majority(x, '_trust', 'better_0'))
results['similarity'] = data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'similarity_0'))
results['better_code'] = results['better'].astype('category').cat.codes
results
_unit_id | better | similarity | better_code |
---|---|---|---
1 | The two keywords are completely identical | 5.532764 | 1 |
2 | Your keyword | 4.961362 | 2 |
3 | The two keywords are completely identical | 6.806888 | 1 |
4 | The two keywords are completely identical | 5.789938 | 1 |
6 | The two keywords are completely identical | 6.955167 | 1 |
7 | Search engine query | 5.675547 | 0 |
10 | Search engine query | 5.739468 | 0 |
13 | Your keyword | 6.437989 | 2 |
14 | Search engine query | 4.000000 | 0 |
15 | Your keyword | 5.166934 | 2 |
16 | Search engine query | 4.175357 | 0 |
17 | The two keywords are completely identical | 5.840521 | 1 |
20 | The two keywords are completely identical | 7.000000 | 1 |
21 | Search engine query | 7.000000 | 0 |
23 | The two keywords are completely identical | 6.465985 | 1 |
24 | Search engine query | 6.481138 | 0 |
25 | The two keywords are completely identical | 4.525120 | 1 |
26 | Search engine query | 4.556271 | 0 |
27 | The two keywords are completely identical | 4.706914 | 1 |
30 | Your keyword | 4.415340 | 2 |
31 | Search engine query | 4.000000 | 0 |
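As a sanity check, we can measure how often the trust-weighted majority agrees with the plain (unweighted) majority vote computed earlier; a minimal sketch:
plain = data.groupby('_unit_id')['better_0'].apply(lambda x: x.mode()[0])
print((results['better'] == plain).mean())  # fraction of units where the two votes agree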
Now we analyse the case in which the answers are free text
data['better_0'].unique()
array(['Your keyword', 'The two keywords are completely identical', 'Search engine query'], dtype=object)
data['explanation_0'].unique()
array(['they are all dressed well and using computers so its more like a business scenario.', 'Almost identicalexcept the tiny spelling difference.', 'both are similar', 'they both describe the same kind of people', 'We can see a relaxed state in that images', 'YES', 'A person is generalized and one cannot find the images of Einstein or kids in them.', 'they are calm', 'genious', 'only 1 image', 'interested in their work', 'i think this is correct that calm person because every one is calm in this images', 'images looks like taking a deep breath', 'it now seems more like to give these results whn we think of interested person rather than thinking and surprising', 'based on result of image', 'whipping', 'Yes', 'calm person and calmness same', 'result suits more to this kerword', 'yes', 'anger', 'hot air baloon', 'both are the same', 'same attitude of boss', 'the results are same', 'all my words are feature of Search engine query', 'They all are working in the office', 'in image person looking very casual', 'both refer to the same traits but intelligent word is more suited', 'i know', 'Because all people here look casual.', 'both are same', 'Casualness is used in both the words', 'interested person only can do Research, smart, thinging', 'Casual person is more accurate of the images.', 'i believe this is my personal theory..so i think aggressive person would be better keyword for these images', 'My keyword "happy people" and Search engine query "calm person" is almost same.', 'My answer is more specific regarding images.', 'i know need search engine when i already knew it', 'BOTH ARE SIMILAR', 'by query image i understood that person seems very angry', 'Everything is related with warm', 'it gives better ideas about all the image', 'we got the same image when search in google', 'with the facial expression we can find him too aggresive', 'very much about that', "It's the image of that", 'Smart Person Bring Innovation and must have high IQ', 'person in aggression is shouting at others', 'they are all were casual dress', 'By nature', 'because it shows that', 'people are working i guess working people is more apt', 'They also look happy', 'On detailed viewing smart person might be a better keyword.', 'everybody is yelling', 'a complete act of expression works out here'], dtype=object)
We can't use weighted majority voting here! We first need to assign a score to these values.
def compute_score(text):
    # toy scoring function: use the length of the explanation as its score
    score = len(text)
    return score
data['score'] = data['explanation_0'].apply(compute_score)
weighted_mean(data, '_trust', 'score')  # trust-weighted mean score over the whole dataset
48.263201486681275
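Explanation length is of course a crude proxy. A slightly richer (still hypothetical) score could count distinct words instead:
def compute_score_words(text):
    # hypothetical alternative: number of distinct lowercase words,
    # guarding against missing explanations
    if not isinstance(text, str):
        return 0
    return len(set(text.lower().split()))

data['score_words'] = data['explanation_0'].apply(compute_score_words)
data.groupby('_unit_id').apply(lambda x: weighted_mean(x, '_trust', 'score_words'))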