Time spent on a question (can be useful for estimating worker ability)

Let's see how many judgments we have per unit
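A minimal sketch of this count, assuming one row per judgment and a `_unit_id` column identifying the unit. The toy data and the `_worker_id` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy data: one row per judgment. `_worker_id` is an assumed column name.
df = pd.DataFrame({
    "_unit_id":   [1, 1, 1, 2, 2, 3],
    "_worker_id": [10, 11, 12, 10, 11, 12],
})

# Number of judgments (rows) per unit.
judgments_per_unit = df.groupby("_unit_id").size()
print(judgments_per_unit)
```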

Let's remove the units that have only one judgment
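One possible way to drop those units is to broadcast the per-unit count back onto every row and filter on it. The data and the `answer` column below are hypothetical.

```python
import pandas as pd

# Toy data: unit 3 has a single judgment. `answer` is a hypothetical column.
df = pd.DataFrame({
    "_unit_id": [1, 1, 1, 2, 2, 3],
    "answer":   [4, 5, 3, 2, 2, 1],
})

# Count judgments per unit and repeat that count on every row of the unit.
n_judgments = df.groupby("_unit_id")["_unit_id"].transform("count")

# Keep only the rows whose unit has more than one judgment.
df_multi = df[n_judgments > 1]
print(df_multi)
```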

  1. Create a column with time spent (use pd.to_datetime)
  2. Compute the average time per worker
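The two steps above could be sketched as follows. The column names `_started_at`, `_created_at` and `_worker_id` are assumptions about how the export is laid out, and the timestamps are made up.

```python
import pandas as pd

# Toy data; the timestamp and worker column names are assumed.
df = pd.DataFrame({
    "_worker_id":  [10, 10, 11],
    "_started_at": ["2020-01-01 10:00:00", "2020-01-01 10:05:00", "2020-01-01 11:00:00"],
    "_created_at": ["2020-01-01 10:00:30", "2020-01-01 10:06:00", "2020-01-01 11:00:20"],
})

# 1. Parse the timestamps and take the difference (a Timedelta column).
df["_started_at"] = pd.to_datetime(df["_started_at"])
df["_created_at"] = pd.to_datetime(df["_created_at"])
df["time_spent"] = df["_created_at"] - df["_started_at"]

# Convert to seconds so the values can be averaged as plain numbers.
df["time_spent_s"] = df["time_spent"].dt.total_seconds()

# 2. Average time per worker, in seconds.
avg_time_per_worker = df.groupby("_worker_id")["time_spent_s"].mean()
print(avg_time_per_worker)
```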

Basic aggregation

Quantitative variables

If we are also doing a per-worker analysis, we can also compute these values per worker
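For a numeric answer column, per-unit and per-worker aggregation might look like the sketch below. The `rating` and `_worker_id` columns are hypothetical names used only for this example.

```python
import pandas as pd

# Toy data; `rating` is a hypothetical numeric answer, `_worker_id` is assumed.
df = pd.DataFrame({
    "_unit_id":   [1, 1, 2, 2],
    "_worker_id": [10, 11, 10, 11],
    "rating":     [4, 2, 5, 3],
})

# Per-unit aggregation: one row per unit.
unit_stats = df.groupby("_unit_id")["rating"].agg(["mean", "std", "count"])

# Per-worker aggregation: one row per worker.
worker_stats = df.groupby("_worker_id")["rating"].agg(["mean", "count"])

print(unit_stats)
print(worker_stats)
```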

Categorical variables

We can't do the same for the following column, because it is a categorical variable:

Let's explore what this column is and decide what to do

The majority vote of an array is simply the mode
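As a small illustration on made-up votes, `Series.mode()` returns the most frequent value or values as a Series:

```python
import pandas as pd

# Hypothetical votes from four workers on one unit.
votes = pd.Series(["cat", "dog", "cat", "bird"])

# mode() returns a Series, because several values can share the top count.
majority = votes.mode()
print(majority.iloc[0])
```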

How is the variable distributed?
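One quick way to look at the distribution is `value_counts()`, shown here on a hypothetical `category` column:

```python
import pandas as pd

# Toy data; `category` is a hypothetical categorical answer column.
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a", "b"]})

# Absolute counts per category, most frequent first.
counts = df["category"].value_counts()

# Relative frequencies (fractions that sum to 1).
shares = df["category"].value_counts(normalize=True)

print(counts)
print(shares)
```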

Let's compute the majority voting

Sometimes this returns two values (a tie); in that case let's take the first one (a better way would be to pick one at random)
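A possible per-unit version with this tie handling is sketched below. The helper name `majority_vote` and the `category` column are hypothetical.

```python
import pandas as pd

# Toy data: unit 2 is a tie between "x" and "y". `category` is a hypothetical column.
df = pd.DataFrame({
    "_unit_id": [1, 1, 1, 2, 2],
    "category": ["a", "b", "a", "x", "y"],
})

def majority_vote(values):
    """Return the most frequent value; on a tie, the first mode (mode() sorts its result)."""
    return values.mode().iloc[0]

majority = df.groupby("_unit_id")["category"].agg(majority_vote)
print(majority)
```

To break ties at random instead, the helper could return `values.mode().sample(n=1).iloc[0]`.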

Weighted measures

Weighted mean
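A sketch of a trust-weighted mean per unit, computed as sum(weight * value) / sum(weight). The `_trust` and `rating` column names are assumptions, and the numbers are made up.

```python
import pandas as pd

# Toy data; `_trust` (worker trust score) and `rating` are assumed column names.
df = pd.DataFrame({
    "_unit_id": [1, 1, 2, 2],
    "rating":   [4.0, 2.0, 5.0, 3.0],
    "_trust":   [0.75, 0.25, 0.5, 0.5],
})

# Weight each rating by the trust score of the worker who gave it.
df["weighted_rating"] = df["rating"] * df["_trust"]

# Sum the weighted ratings and the weights per unit, then divide.
sums = df.groupby("_unit_id")[["weighted_rating", "_trust"]].sum()
weighted_mean = sums["weighted_rating"] / sums["_trust"]
print(weighted_mean)
```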

Weighted majority voting

Now we need, for each unit, to find the category with the highest trust score
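One way to do this, assuming the trust scores of the workers who chose the same category are summed, is sketched below. The `_trust` and `category` column names are assumptions.

```python
import pandas as pd

# Toy data: in unit 1, "b" has more votes but "a" has more total trust.
df = pd.DataFrame({
    "_unit_id": [1, 1, 1, 2, 2],
    "category": ["a", "b", "b", "x", "y"],
    "_trust":   [0.9, 0.3, 0.4, 0.2, 0.8],
})

# Total trust per (unit, category) pair, kept as ordinary columns.
trust_sums = df.groupby(["_unit_id", "category"], as_index=False)["_trust"].sum()

# Row label of the highest total trust within each unit.
best_rows = trust_sums.groupby("_unit_id")["_trust"].idxmax()

# Look up the winning category for each unit.
weighted_majority = trust_sums.loc[best_rows].set_index("_unit_id")["category"]
print(weighted_majority)
```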

Creating a summary table
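A summary table with one row per unit could be built with named aggregation, as in this sketch. The output column names and the `rating` and `category` inputs are hypothetical.

```python
import pandas as pd

# Toy data; `rating` and `category` are hypothetical answer columns.
df = pd.DataFrame({
    "_unit_id": [1, 1, 1, 2, 2],
    "rating":   [4, 2, 3, 5, 3],
    "category": ["a", "b", "a", "x", "x"],
})

# One row per unit: number of judgments, mean rating, majority category.
summary = df.groupby("_unit_id").agg(
    n_judgments=("rating", "count"),
    mean_rating=("rating", "mean"),
    majority_category=("category", lambda s: s.mode().iloc[0]),
)
print(summary)
```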

Free text

Now we analyse the case in which we have free text

We can't use weighted majority voting here! We first need to assign a score to these values.

Exercise

  1. Aggregate per _unit_id using the mean of time_spent (you need to apply pd.to_numeric() and divide by 1e9 to get a column in seconds).
  2. Write code that assigns 1 if the text contains any element of a list of words (list_words=['similar','same']), by looping over the list (for i in list_words) and checking membership (if i in text).
  3. Compute the weighted mean of that score.