I am trying to understand how to generate a normal distribution if I already know specific percentiles.
A user gave a very comprehensive answer to a similar question (link here), but when I tried and tested it with my existing data, the variance was much too large.
How I did this:
x <- c(5,8,11)
PercRank <- c(2.1, 51.1, 98.8)
PercRank = 2.1 for example tells that 2.1% of the data has a value/score <= 5 (the first value of x). Similarly, PercRank = 51.1 tells that 51.1% of the data has a value/score <= 8.
I followed the method in this link. This is my code:
cum.p <- c(2.1, 51.1, 98.8)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- c(5,8,11)
freq <- 1000 # final output size that we want
# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1)
fin <- abs(max(x)) + 1
ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair
s <- sapply(2:length(ival), function(i) {
seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)
quantile(out, cum.p)
# 2% 51.1% 98.8%
# 5 8 11
c(mean(out), sd(out))
# [1] 7.834401 2.214227
All of this is from the comment (linked), and so far so good. Then I tried to check how well the generated normal distribution worked with my fitted values:
data.frame(sort(rnorm(1000, mean=mean(out), sd=sd(out))))
...
# 988 13.000904
# 989 13.028881
# 990 13.076649
...
# 1000 14.567080
I was concerned because the 988th value (e.g., 98.8% of 1000 samples) was 13.000904, while the value I had fitted for the 98.8% percentile was 11.0.
I re-generated the distribution many times and the variance was consistently larger than it needed to be.
I'm stumped. I would appreciate if anyone could show me a way to make the variance more accurate. Or, is this unavoidable?
(My first time posting here, I apologize if I broke rules - I can make it more clear if needed.)