Filtering rows of dataframe based on quartiles

Question 1

I created a dataframe using the following code. I am trying to create a new dataframe that contains only the rows from the original dataframe whose 'Hours per action' is less than Q1 or greater than Q4 for each PP.

I tried filtered_df = df[df['Hours per action'].lt(first_quartile) | df['Hours per action'].gt(fourth_quartile)], but that isn’t right. I need all rows with ‘PP’ == 1 to use the Q1 and Q4 from ‘PP’ 1, all rows with ‘PP’ == 2 to use the Q1 and Q4 from ‘PP’ 2, and so on. Thank you

import pandas as pd

# Create a list of all possible combinations of 'PP' and 'Name'
pp_employee_combinations = [(pp, name) for pp in range(1, 27) for name in ['Wyatt', 'Thom', 'Pete', 'Sue', 'Dave']]

# Create a new DataFrame with all of the possible combinations of 'PP' and 'Name'
df = pd.DataFrame(pp_employee_combinations, columns=['PP', 'Name'])

import numpy.random as rnd

# Fill in the 'Hours per action' column with random values
df['Hours per action'] = rnd.randint(10, 100, size=df.shape[0])

# Per-'PP' 25th and 75th percentiles of 'Hours per action' (each is a Series indexed by 'PP')
Q1 = df.groupby('PP')['Hours per action'].quantile(0.25)
Q4 = df.groupby('PP')['Hours per action'].quantile(0.75)

Question 2

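You could just use map and between to select the relevant rows. map looks up each row's Q1 and Q4 by its PP, and between (inclusive on both ends by default) flags the rows inside that range, so negate it with ~ to keep only the rows below Q1 or above Q4:

>>> df[~df["Hours per action"].between(df["PP"].map(Q1), df["PP"].map(Q4))]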
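
As a minimal sketch on a small made-up frame (the data and the names lo, hi and outside are assumptions for illustration, not part of the question), the per-PP bounds can be mapped onto each row and between negated with ~ to keep only the rows below Q1 or above Q4:

```python
import pandas as pd

# Tiny illustrative frame: two PP groups with five rows each (made-up values)
df = pd.DataFrame({
    "PP": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Hours per action": [10, 20, 30, 40, 50, 15, 25, 35, 45, 95],
})

# Per-group bounds, each a Series indexed by PP
Q1 = df.groupby("PP")["Hours per action"].quantile(0.25)
Q4 = df.groupby("PP")["Hours per action"].quantile(0.75)

# Look up each row's own bounds through its PP value
lo = df["PP"].map(Q1)
hi = df["PP"].map(Q4)

# between() is inclusive on both ends, so ~between() means
# strictly below Q1 or strictly above Q4
outside = df[~df["Hours per action"].between(lo, hi)]
print(outside)
```

Rows that sit exactly on a bound are dropped here, which matches the strict less-than / greater-than wording in the question.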

Question 3

You can use the transform method along with groupby to broadcast the computed Q1 and Q4 values to each row. After that, you can use boolean indexing to filter the rows accordingly:

df['Q1'] = df.groupby('PP')['Hours per action'].transform(lambda x: x.quantile(0.25))
df['Q4'] = df.groupby('PP')['Hours per action'].transform(lambda x: x.quantile(0.75))

filtered_df = df[(df['Hours per action'] < df['Q1']) | (df['Hours per action'] > df['Q4'])]

filtered_df = filtered_df.drop(columns=['Q1', 'Q4'])
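
A possible variant, sketched on a small made-up frame: keeping the broadcast bounds in standalone Series (here called q1 and q4, names chosen for illustration) instead of temporary columns gives the same filter without the final drop step.

```python
import pandas as pd

# Tiny illustrative frame: two PP groups with five rows each (made-up values)
df = pd.DataFrame({
    "PP": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Hours per action": [12, 48, 30, 77, 91, 60, 22, 85, 40, 55],
})

grouped = df.groupby("PP")["Hours per action"]

# transform() broadcasts each group's quantile back to that group's rows,
# so q1 and q4 share df's index and can be compared row by row
q1 = grouped.transform(lambda x: x.quantile(0.25))
q4 = grouped.transform(lambda x: x.quantile(0.75))

# Keep rows strictly below the group's Q1 or strictly above its Q4
filtered_df = df[(df["Hours per action"] < q1) | (df["Hours per action"] > q4)]
print(filtered_df)
```

Because q1 and q4 are never attached to df, the original frame stays unchanged and filtered_df only carries the original columns.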
