faster way to run countifs in python

Question

I previously asked the question on how to do a countifs in python across multiple data frames, just like you can do countifs on separate worksheets in Excel. somebody gave me a very creative answer:

python pandas countifs using multiple criteria AND multiple data frames

Thank you for that @AlexG--I tried it, and it worked superbly:

import pandas as pd
import numpy as np
import matplotlib as plt

#import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

#get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

#prepare an empty list to hold the results
N_exams_passed = []

#count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
N_exams_passed.append(len(exams[(exams['Student Number']==s_id) &
                         (exams['Exam Grade Date']>=s_enroll) &
                         (exams['Exam Grade Date']<=s_qual) &
                         (exams['Exam Grade']>=70)])
                      )

#add the results to the original data set
students['Exams Passed'] = N_exams_passed

HOWEVER, it only worked effectively on small data sets. When I ran the data with 100,000s of rows, it wouldn't even be done overnight. It doesn't seem very pythonic.

The SQL way you can do this in seconds is to use a correlated subquery, like this:

SELECT
   s.*,
   (SELECT COUNT(e.[Exam Grade]) 
 FROM
     exams AS e 
 WHERE
    e.[Exam Grade] >= 65 
    AND e.[Student Number] = s.[Student Number] 
    AND e.[Exam Grade Date] >= s.[Enrollment Date] 
    AND e.[Exam Grade Date] <= s.[Detail Date]) 
    AS ExamsPassed
FROM 
    students AS s;

How do I reproduce such a correlated subquery in pandas or some other pythonic way?

Here are the data frames:

 #Students
 Student Number Enroll Date Detail Date
 1              1/1/2016    2/1/2016
 1              1/1/2016    3/1/2016
 2              2/1/2016    3/1/2016
 3              3/1/2016    4/1/2016

 #Exams
 Student Number Exam Date   Exam Grade
 1              1/1/2016    50
 1              1/15/2016   80
 1              1/28/2016   90
 1              2/5/2016    100
 1              3/5/2016    80
 1              4/5/2016    40
 2              2/2/2016    85
 2              2/3/2016    10
 2              2/4/2016    100

Final data frame should look like this, with a count of 'Passed Exams' at the end:

 #FinalResult
 Student Number Enroll Date Detail Date Passed Exams
 1              1/1/2016    2/1/2016    2
 1              1/1/2016    3/1/2016    3
 2              2/1/2016    3/1/2016    2
 3              3/1/2016    4/1/2016    0

ASGM · Accepted Answer · 2016-11-04T19:36:13.773

1

If I understand the structure of your dataframes correctly, I'd suggest merging the two dataframes and then performing the task on the merged data using numpy.where.

import numpy as np

exams = exams.merge(students, on='Student Number', how='left')
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Grade'] >= 70),
    1, 0)

students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    left_on=['Student Number', 'Detail Date'],
    right_on=['Student Number', 'Detail Date'],
    how='left')
students['Passed'] = students['Passed'].fillna(0).astype('int')

Note: you'll need to make sure the date columns are properly stored as datetimes (you can use pandas.to_datetime to do this).

numpy.where creates a new array where the values are one way (1 in the example above) if the conditions you specify are met and another (0) if they aren't met.

The line exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a series in which the index is Student Number and Detail Date and the values are the counts of passed exams corresponding to that Student Number and Detail Date combination. The reset_index() makes it into a dataframe for merging.

edited Nov 04 '16 at 19:36

answered Nov 03 '16 at 16:53

ASGM

11,051
1
32
53

That merge resulted in duplication, so the counts didn't come back correctly. – Joel Underwood Nov 03 '16 at 17:13
Does the first part (the conditional counting) work and just the merge fail? I might not understand your dataframe structure correctly. In what way does the merge look wrong? – ASGM Nov 03 '16 at 17:25
The merged frame resulted in more rows than exams, so when you grouped it, you ended up with more exams in 'Passed' than there were exam records. – Joel Underwood Nov 03 '16 at 18:43
I think I've fixed the problem by adding `how='left'` to the merge arguments. Let me know if that works. – ASGM Nov 03 '16 at 20:02
No, the final result shows as 5, 5, 2 when adding that to the merge, but the actual count should be 2,4,2. The merge should be de-duplicated on 'Student Number' and 'Detail Date', and I can't figure that out. – Joel Underwood Nov 03 '16 at 20:43
Can you add a small sample of your data (with fake names, of course) that causes the error when you run it, so that I can run the same data and fix the issue? – ASGM Nov 03 '16 at 20:56
I edited the original post, adding the relevant data frames. Fake data. – Joel Underwood Nov 04 '16 at 01:31
I'm having some trouble understanding your sample data - Student Number 1 is repeated (how should we know which student to assign the results to? Do they refer to different students or the same one?) and Student Number 3 has no exams in the exams table but in your desired output has passed 1. – ASGM Nov 04 '16 at 14:57
Sorry about that, you're correct... fixed it. The students should be as it is. It is not meant to be a normalized listing, as it is unique on Student ID and Detail date. – Joel Underwood Nov 04 '16 at 17:58
1

Now your column names are different from your original example. Also, it's more helpful if you post data that can be directly copied into a script. – ASGM Nov 04 '16 at 19:26

faster way to run countifs in python

1 Answers1