generate a random number between 2 and 40 with mean 20 as a column in pandas

Question

I have a data frame as shown below

session      slot_num     appt_time
s1           1            2020-01-06 09:00:00
s1           2            2020-01-06 09:20:00
s1           3            2020-01-06 09:40:00
s1           3            2020-01-06 09:40:00
s1           4            2020-01-06 10:00:00
s1           4            2020-01-06 10:00:00
s2           1            2020-01-06 08:20:00
s2           2            2020-01-06 08:40:00
s2           2            2020-01-06 08:40:00
s2           3            2020-01-06 09:00:00
s2           4            2020-01-06 09:20:00
s2           5            2020-01-06 09:40:00
s2           5            2020-01-06 09:40:00
s2           6            2020-01-06 10:00:00
s3           1            2020-01-09 13:00:00
s3           1            2020-01-09 13:00:00
s3           2            2020-01-09 13:20:00
s3           3            2020-01-09 13:40:00

In the above I would like to add a column called service_time.

service_time should contain any random digits between 2 to 40 with mean 20 for each session.

I prefer random numbers should follow random normal distribution with mean 20, standard deviation 10, minimum 2 and maximum 40

Expected output:

session      slot_num     appt_time               service_time
s1           1            2020-01-06 09:00:00     30
s1           2            2020-01-06 09:20:00     10
s1           3            2020-01-06 09:40:00     15
s1           3            2020-01-06 09:40:00     35
s1           4            2020-01-06 10:00:00     20
s1           4            2020-01-06 10:00:00     10
s2           1            2020-01-06 08:20:00     15
s2           2            2020-01-06 08:40:00     20
s2           2            2020-01-06 08:40:00     25
s2           3            2020-01-06 09:00:00     30
s2           4            2020-01-06 09:20:00     20
s2           5            2020-01-06 09:40:00     8
s2           5            2020-01-06 09:40:00     40
s2           6            2020-01-06 10:00:00     2
s3           1            2020-01-09 13:00:00     4
s3           1            2020-01-09 13:00:00     32
s3           2            2020-01-09 13:20:00     26
s3           3            2020-01-09 13:40:00     18

Note : please note that this is the one of that random combination which follows the minimum, maximum and mean criteria mentioned above.

You will have to specify from which random distribution you want to sample. Normal? Uniform? Other? Depending on which distribution you pick, other parameters in addition to the mean may need to be specified, e.g. the standard deviation for a normal distribution. — Arne, Apr 23 '20 at 12:54
And are you sure the mean should be 20? That is not the midpoint between 2 and 40. — Arne, Apr 23 '20 at 12:57
@Arne added I want random normal with mean 20, standard deviation 10, minimum 2 and maximum 40 — Danish, Apr 23 '20 at 12:58
@jezrael I don't mind standard deviation if min =2 and max = 40 — Danish, Apr 23 '20 at 13:18

score 1 · Accepted Answer · answered Apr 23 '20 at 13:18

One possible solution with cutom function:

#https://stackoverflow.com/a/39435600/2901002
def gen_avg(n, expected_avg=20, a=2, b=40):
    while True:
        l = np.random.randint(a, b, size=n)
        avg = np.mean(l)

        if avg == expected_avg:
            return l

df['service_time'] = df.groupby('session')['session'].transform(lambda x: gen_avg(len(x)))
print (df)
   session  slot_num            appt_time  service_time
0       s1         1  2020-01-06 09:00:00            31
1       s1         2  2020-01-06 09:20:00             9
2       s1         3  2020-01-06 09:40:00            23
3       s1         3  2020-01-06 09:40:00            37
4       s1         4  2020-01-06 10:00:00             6
5       s1         4  2020-01-06 10:00:00            14
6       s2         1  2020-01-06 08:20:00            33
7       s2         2  2020-01-06 08:40:00            29
8       s2         2  2020-01-06 08:40:00            18
9       s2         3  2020-01-06 09:00:00            32
10      s2         4  2020-01-06 09:20:00             9
11      s2         5  2020-01-06 09:40:00            26
12      s2         5  2020-01-06 09:40:00            10
13      s2         6  2020-01-06 10:00:00             3
14      s3         1  2020-01-09 13:00:00            19
15      s3         1  2020-01-09 13:00:00            22
16      s3         2  2020-01-09 13:20:00             5
17      s3         3  2020-01-09 13:40:00            34

score 1 · Answer 2 · answered Apr 23 '20 at 13:38

Here's a solution with NumPy's new Generator infrastructure. See the documentation for a discussion of the differences between this and the older RandomState infrastructure.

import numpy as np
from numpy.random import default_rng

# assuming df is the name of your dataframe
n = len(df)

# set up random number generator
rng = default_rng()

# sample more than enough values
vals = rng.normal(loc=20., scale=10., size=2*n)

# filter values according to cut-off conditions
vals = vals[2 <= vals]
vals = vals[vals <= 40]

# add n random values to dataframe
df['service_time'] = vals[:n]

score 1 · Answer 3 · answered Apr 23 '20 at 16:34

The normal distribution has an unbounded range, so if you're bounding between 2 and 40 the distribution isn't normal. An alternative which is bounded, and avoids acceptance/rejection schemes, is to use the triangular distribution (see Wikipedia for details). Since the mean of a triangular distribution is (left + mode + right) / 3, with left = 2 and right = 40 you would set mode = 18 to get the desired mean of 20.

generate a random number between 2 and 40 with mean 20 as a column in pandas

3 Answers3

Linked