1

I have a data frame as shown below

session      slot_num     appt_time
s1           1            2020-01-06 09:00:00
s1           2            2020-01-06 09:20:00
s1           3            2020-01-06 09:40:00
s1           3            2020-01-06 09:40:00
s1           4            2020-01-06 10:00:00
s1           4            2020-01-06 10:00:00
s2           1            2020-01-06 08:20:00
s2           2            2020-01-06 08:40:00
s2           2            2020-01-06 08:40:00
s2           3            2020-01-06 09:00:00
s2           4            2020-01-06 09:20:00
s2           5            2020-01-06 09:40:00
s2           5            2020-01-06 09:40:00
s2           6            2020-01-06 10:00:00
s3           1            2020-01-09 13:00:00
s3           1            2020-01-09 13:00:00
s3           2            2020-01-09 13:20:00
s3           3            2020-01-09 13:40:00

In the above I would like to add a column called service_time.

service_time should contain any random digits between 2 to 40 with mean 20 for each session.

I prefer random numbers should follow random normal distribution with mean 20, standard deviation 10, minimum 2 and maximum 40

Expected output:

session      slot_num     appt_time               service_time
s1           1            2020-01-06 09:00:00     30
s1           2            2020-01-06 09:20:00     10
s1           3            2020-01-06 09:40:00     15
s1           3            2020-01-06 09:40:00     35
s1           4            2020-01-06 10:00:00     20
s1           4            2020-01-06 10:00:00     10
s2           1            2020-01-06 08:20:00     15
s2           2            2020-01-06 08:40:00     20
s2           2            2020-01-06 08:40:00     25
s2           3            2020-01-06 09:00:00     30
s2           4            2020-01-06 09:20:00     20
s2           5            2020-01-06 09:40:00     8
s2           5            2020-01-06 09:40:00     40
s2           6            2020-01-06 10:00:00     2
s3           1            2020-01-09 13:00:00     4
s3           1            2020-01-09 13:00:00     32
s3           2            2020-01-09 13:20:00     26
s3           3            2020-01-09 13:40:00     18

Note : please note that this is the one of that random combination which follows the minimum, maximum and mean criteria mentioned above.

Danish
  • 2,719
  • 17
  • 32

3 Answers3

1

One possible solution with cutom function:

#https://stackoverflow.com/a/39435600/2901002
def gen_avg(n, expected_avg=20, a=2, b=40):
    while True:
        l = np.random.randint(a, b, size=n)
        avg = np.mean(l)

        if avg == expected_avg:
            return l

df['service_time'] = df.groupby('session')['session'].transform(lambda x: gen_avg(len(x)))
print (df)
   session  slot_num            appt_time  service_time
0       s1         1  2020-01-06 09:00:00            31
1       s1         2  2020-01-06 09:20:00             9
2       s1         3  2020-01-06 09:40:00            23
3       s1         3  2020-01-06 09:40:00            37
4       s1         4  2020-01-06 10:00:00             6
5       s1         4  2020-01-06 10:00:00            14
6       s2         1  2020-01-06 08:20:00            33
7       s2         2  2020-01-06 08:40:00            29
8       s2         2  2020-01-06 08:40:00            18
9       s2         3  2020-01-06 09:00:00            32
10      s2         4  2020-01-06 09:20:00             9
11      s2         5  2020-01-06 09:40:00            26
12      s2         5  2020-01-06 09:40:00            10
13      s2         6  2020-01-06 10:00:00             3
14      s3         1  2020-01-09 13:00:00            19
15      s3         1  2020-01-09 13:00:00            22
16      s3         2  2020-01-09 13:20:00             5
17      s3         3  2020-01-09 13:40:00            34
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Here's a solution with NumPy's new Generator infrastructure. See the documentation for a discussion of the differences between this and the older RandomState infrastructure.

import numpy as np
from numpy.random import default_rng

# assuming df is the name of your dataframe
n = len(df)

# set up random number generator
rng = default_rng()

# sample more than enough values
vals = rng.normal(loc=20., scale=10., size=2*n)

# filter values according to cut-off conditions
vals = vals[2 <= vals]
vals = vals[vals <= 40]

# add n random values to dataframe
df['service_time'] = vals[:n]
Arne
  • 9,990
  • 2
  • 18
  • 28
1

The normal distribution has an unbounded range, so if you're bounding between 2 and 40 the distribution isn't normal. An alternative which is bounded, and avoids acceptance/rejection schemes, is to use the triangular distribution (see Wikipedia for details). Since the mean of a triangular distribution is (left + mode + right) / 3, with left = 2 and right = 40 you would set mode = 18 to get the desired mean of 20.

pjs
  • 18,696
  • 4
  • 27
  • 56