0

I have a dataframe in R with four columns. I want to calculate the total percentage of citizens below 44 by state and by year.

How do I accomplish this in R, preserving the year and state column?

I was already able to use aggregate to get the sum and preserve the year/state/age values from a bigger dataset. I just just couldn't get the sum of the whole column, but now I'm not sure where to go from here to calculate percent.

|------------||------------||------------||------------|
|    Year    ||    State   ||    Age     ||    Pop     |
|------------||------------||------------||------------|
|    2000    ||     VA     ||    <44     ||    150     |
|------------||------------||------------||------------|
|    2000    ||     VA     ||    44+     ||    350     |
|------------||------------||------------||------------|
|    2000    ||     VA     ||    Total   ||    500     |

Ideal Output:

|------------||------------||------------||------------|
|    Year    ||    State   ||    Age     ||    Pop     |
|------------||------------||------------||------------|
|    2000    ||     VA     ||    <44     ||    0.3     |
|------------||------------||------------||------------|
|    2004    ||     VA     ||    <44     ||    0.2     |
|------------||------------||------------||------------|
|    2008    ||     VA     ||    <44     ||    0.4     |

This is the last bit of code I used to get the data frame to look how it does.

demos_sub <- aggregate(demos_sub$total_citizen_pop, by=list(Year=demos_sub$year, State=demos_sub$state, Age=demos_sub$age), FUN=sum)
names(demos_sub) <- c("year","state", "age", "total_citizen_pop")
demos_sub <- demos_sub[with(demos_sub, order(year)),]
demos_sub <- demos_sub[with(demos_sub, order(state)),]

I'm just not sure where to go from here to shrink it down further and calculate percentages.

carousallie
  • 776
  • 1
  • 7
  • 25
  • Can you show us what you have tried so far and what problem you faced please ? – godot Apr 21 '19 at 22:04
  • Added; I'm just really not sure how to move forward. – carousallie Apr 21 '19 at 22:06
  • Sorry should have been more precise but I meant can you add your code? It is much easier to explain if we can see what you're stuck on... (You can use a simpler dataset to get to your point.) – godot Apr 21 '19 at 22:12
  • I added the code I last did to get to this point, but I don't even know where to go from here to move forward. I don't think aggregate will do the trick again... – carousallie Apr 21 '19 at 22:20
  • Its difficult to say if we don't have any reproducible example, maybe you are looking for `?prop.table`. – P. Paccioretti Apr 21 '19 at 22:25
  • Once you have filtered your data of all that must be removed, you should have a look at dplyr group_by & summarize function. – godot Apr 21 '19 at 23:09
  • can you add a reproducible example along with expected output. Update your question with `dput(demos_sub)`. – Ronak Shah Apr 21 '19 at 23:21
  • 1
    Possible duplicate of [Getting both column counts and proportions in the same table in R](https://stackoverflow.com/questions/9438193/getting-both-column-counts-and-proportions-in-the-same-table-in-r) – NelsonGon Apr 22 '19 at 02:56

1 Answers1

0

This is the beginning of a solution:

library(dplyr)

Year <- rep(2000, 6)
State <- c(rep("VA", 4), rep("MA", 2))
Age <- c("<44", "44+", "44+", "<44", "<44", "44+")
Pop <- c(150, 350, 500, 200, 100, 100)

df <- data.frame(State = State, Age = Age, Pop = Pop, Year= Year)

df %>% filter(Age != "Total") %>% group_by(Year, State)  %>% summarize(Pop44 = sum(Pop[Age=="<44"]) / sum(Pop))

You don't have to filter the "Total" category but it's usually not a good idea to have a "total" category (better have a column for that)

godot
  • 1,550
  • 16
  • 33
  • This is a great starting point, thank you so much! And this isn't my data, but I definitely wouldn't have put a put total category in the same column. – carousallie Apr 22 '19 at 18:13