How to find SD and Mean of a continuous variable that has individuals in one of two groups within a categorical variable

Question

Essentially, the assignment is to find the mean, SD, and number of people for a continuous variable within each category of a categorical variable, along with a p-value.

For example, we have a continuous variable BMI that holds each patient's BMI, and the assignment asks for the mean and SD of BMI within the "No diabetes" group and the "Diabetes" group of a categorical variable.

The first variable is a list of BMI values, one per patient. The second variable indicates whether the individual has diabetes: 1 and 2 are for type 1 and type 2 diabetes, and 3 is for no diabetes.

My assignment is to get the p-value, number of individuals, mean, and standard deviation of BMI for individuals with diabetes and for individuals without diabetes, while removing anyone with missing information.

I have tried:

 mean(ds$bmi[ds$diabetesI==1|ds$diabetesI==2])

However, this returns NA. My thought behind this was to see if I could get the mean for individuals with type 1 and 2 diabetes but as stated above, it did not work.

data

ds <- structure(list(bmi_list = c(23.56748874, 30.2897933, 26.79150092, 
    29.52347213, 32.60591716, 35.04961743, 21.41223797, 27.46530314, 
    28.73467206, 21.19391994, 25.59362916, 27.62345679, 34.45651021, 
    27.48650005, 31.49548668, 26.05817112, 35.83864796, 31.42131479, 
    22.49134948, 33.99585346, 23.67125363, 22.55335653, 29.41248346, 
    32.94855347, 23.2915562, 30.37962963, 23.759308, 25.2493372, 
    29.27315022, 35.26197253), diab4 = c(1L, 1L, 3L, 1L, 1L, 3L, 
    1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 3L, 1L, 1L, 1L, 
    3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L)), row.names = c(1L, 2L, 3L, 4L, 
    5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
   19L, 20L, 21L, 22L, 23L, 24L, 25L, 27L, 28L, 30L, 31L, 32L), class = 
   "data.frame")

Welcome to SO. Please make your problem [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and include the data you are using. Please also show the code you have tried that did not work. — markus, Nov 06 '18 at 19:47
@markus Thank you for the information. I tried as best I could to be more informative and to show code that's reproducible, although I imagine in this case I just do not know the functions to get the solution I want. — NewApple, Nov 06 '18 at 20:09
Can you share the output of `dput(ds[, c("bmi", "diabetesI")])`? Include that in your question. — markus, Nov 06 '18 at 20:12
Thanks for the output but this is not reproducible. I see that you included columns like `diab4` and `bmi_list` which do not seem to be relevant to solve the problem. Again, please share the output of `dput(ds[, c("bmi", "diabetesI")])` or if it is still too large then `dput(head(ds[, c("bmi", "diabetesI")], 30))`. — markus, Nov 06 '18 at 20:24
@markus I should also say the original names are bmi_list for BMI of patients and diab4 for the status of the patients (type 1,2, or no diabetes). Just wanted to make it a little more logical but I edited the document to represent the true variables. — NewApple, Nov 06 '18 at 20:37
Way better! Here is how you get the `mean` per group: `tapply(ds$bmi_list, ds$diab4, mean)`. — markus, Nov 06 '18 at 20:37
@markus Thank you for the code! I actually tried that but I got 1 2 3 and underneath it NA 29.193 NA. Is this because I didn't take care of the missing values? — NewApple, Nov 06 '18 at 20:52
@NewApple: Sort of. It's because markus didn't include na.rm=TRUE in the parameters for `mean`. — IRTFM, Nov 06 '18 at 22:30
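
To illustrate the point in the last two comments, here is a minimal sketch of `tapply` with and without `na.rm = TRUE`. The data frame `toy` is made up for illustration (it is not the asker's data); only the column names `bmi_list` and `diab4` come from the question.

```r
# Made-up toy data (hypothetical values); NA marks a missing BMI.
toy <- data.frame(
  bmi_list = c(22, 30, NA, 28, 26, 34),
  diab4    = c(1L, 1L, 1L, 2L, 3L, 3L)
)

# Without na.rm, any group that contains an NA gets an NA mean.
tapply(toy$bmi_list, toy$diab4, mean)

# Extra arguments given to tapply() are passed on to mean().
group_means <- tapply(toy$bmi_list, toy$diab4, mean, na.rm = TRUE)
group_means
```

The same pattern should work for `sd` in place of `mean`, since `sd` also accepts `na.rm`.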

score 0 · Answer 1 · answered Nov 06 '18 at 20:46

My advice is to work in stages. (1) Remove missing data, (2a) identify diabetes cases, (2b) identify non-diabetes cases, (3a) select diabetes cases, (3b) select non-diabetes cases, (4a) compute mean for diabetes cases, (4b) compute mean for non-diabetes cases.

At each step along the way, review what you have gotten so far, and convince yourself you have the right thing to do the next step. Naturally your ideas about what you have and what you need might change along the way, that's to be expected.

E.g. for (1), look at `is.na(whateverdata)`. That's a vector of flags showing whether each value is NA. Does that look right? You have several data fields, probably you need to omit a case if any field is missing. Look at `is.na` applied to each field, and look at the disjunction `|` of all of them. Does that look right? Count up the missing values via `sum`. Does that look right? Then create flags for non-missing data via `!`. Finally select the non-missing data by subscripting with the non-missing flags via `whateverdata[nonmissingflags]`.
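
Step (1) might be sketched as follows. The data frame `toy` and its values are made up for illustration; only the column names `bmi_list` and `diab4` are taken from the question, and the other variable names are hypothetical.

```r
# Made-up toy data (hypothetical values) with a missing BMI and a missing status.
toy <- data.frame(
  bmi_list = c(22, 30, NA, 28, 26, 34, 25),
  diab4    = c(1L, 1L, 1L, 2L, 3L, 3L, NA)
)

# Flags for missing values in each field; inspect each one before going on.
bmi_missing  <- is.na(toy$bmi_list)
diab_missing <- is.na(toy$diab4)

# A case is dropped if any field is missing (disjunction of the flags).
any_missing <- bmi_missing | diab_missing
sum(any_missing)          # how many cases will be dropped

# Negate to get non-missing flags, then subscript the rows with them.
nonmissing   <- !any_missing
toy_complete <- toy[nonmissing, ]
nrow(toy_complete)
```

Base R's `complete.cases(toy)` should give the same flags as `nonmissing` in one call, but building them by hand makes each stage easy to inspect.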

Similarly for (2a) and (2b), construct flags for each subset and look at them. For (3a) and (3b) select the cases using the subset flags and look at those data.
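
A minimal sketch of steps (2) and (3), assuming missing data have already been removed. The data frame `toy_complete` and its values are made up for illustration; the coding (1 and 2 for diabetes, 3 for no diabetes) follows the question.

```r
# Made-up toy data (hypothetical values), assumed to contain no missing values.
toy_complete <- data.frame(
  bmi_list = c(22, 30, 28, 26, 34),
  diab4    = c(1L, 1L, 2L, 3L, 3L)
)

# (2a) flags for diabetes cases: type 1 or type 2.
diabetes_flags <- toy_complete$diab4 == 1 | toy_complete$diab4 == 2
# (2b) flags for non-diabetes cases.
nodiabetes_flags <- toy_complete$diab4 == 3

# Check the flags: every case should fall in exactly one group.
table(diabetes_flags, nodiabetes_flags)

# (3a) and (3b) select the BMI values for each group and look at them.
diabetes_bmi   <- toy_complete$bmi_list[diabetes_flags]
nodiabetes_bmi <- toy_complete$bmi_list[nodiabetes_flags]
diabetes_bmi
nodiabetes_bmi
```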

In (4a) and (4b) just apply mean to the data you selected. But at this point you have the subsets ready for any analysis you could apply -- you can go in different directions here.
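
As a sketch of step (4), the selected subsets can be summarised as below. The two vectors hold made-up values, and `summarise_group` is a hypothetical helper. The assignment does not say which test the p-value should come from; the sketch assumes a two-sample t-test via `t.test`, which runs Welch's version by default.

```r
# Made-up BMI values for each group (hypothetical, no missing values).
diabetes_bmi   <- c(22, 30, 28, 31)
nodiabetes_bmi <- c(26, 34, 29)

# Hypothetical helper: number of individuals, mean, and standard deviation.
summarise_group <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))

diabetes_summary   <- summarise_group(diabetes_bmi)
nodiabetes_summary <- summarise_group(nodiabetes_bmi)
rbind(diabetes = diabetes_summary, no_diabetes = nodiabetes_summary)

# Assumption: compare the two group means with a two-sample t-test.
p_value <- t.test(diabetes_bmi, nodiabetes_bmi)$p.value
p_value
```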

I am having trouble with 2a and 2b, 3a and 3b. I do not understand how to set up those groups. I know that, logically, I want 2 different groups: BMI of patients with diabetes (type 1 or 2) and BMI of patients with no diabetes. — NewApple, Nov 06 '18 at 20:54
E.g. `T1Dflags <- ds$diab4 == 1`. `T2Dflags <- ds$diab4 == 2` -- does that look right? If so, then `T1Dcases <- ds$bmi_list[T1Dflags]`, likewise for T2Dcases. — Robert Dodier, Nov 06 '18 at 21:27
To get non-diabetes flags, look at `!(T1Dflags | T2Dflags)`. This is all assuming that missing data have already been omitted. — Robert Dodier, Nov 06 '18 at 21:28
