I have several months of meteorological data at a resolution of 30 minutes. I need to gap fill some missing data. In Excel, I use a 14-day moving average to fill gaps. Specifically, if the data gap occurs at 11:00 a.m., I average ONLY the 11:00 a.m. data points from the previous 7 days and the following 7 days. This is absurdly simple to do in Excel, but I'm forbidding myself from using it. I want an R solution. There must be a simple answer, but I can't figure it out.

So far I've tried using rollmean and rollapply from zoo. I've also tried several dplyr solutions and roll_mean from RcppRoll. The closest I've come to a fix is with zoo::rollapply. I think the trick lies in defining the "width" and "align" parameters. I need the window of measurements to be 672 wide (48 data points per day, over 14 days) with 336 on each side of the missing data point, but I only want to use every 48th measurement in that series of 672 AND I want to ignore NAs. Also, I'd like to know how to set a minimum number of available data points required for the gap filling. Data points at the beginning and end of the time series should be NA.

I'm not going to provide a reproducible example from my massive dataset, but here is something to play with on a smaller scale:

data <- data.frame(
  hour   = c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
  values = c(1,3,1,3,1,3,1,3,1,NA,1,3,1,3,NA,3,1,3,1,3,1,3,1,3)
)

Let's try to calculate a 6-hour rolling mean to fill the two gaps in this data set, using only data from even hours to fill even hours, and data from odd hours to fill odd hours. Therefore the gap at 09:00 (row 10) should be 3 and the gap at 14:00 (row 15) should be 1.

I tried this, it didn't work:

data$gap_fill <- rollapply(data$values, width = 6, by = 2,
                           FUN = function(x) mean(x, na.rm = TRUE),
                           by.column = TRUE, partial = TRUE, fill = NA, align = "center")

Thanks in advance for your help!

jls
  • I think you want this: `rollapply(data$values, list(c(-2, 0, 2)), FUN = mean, na.rm = TRUE, partial = TRUE)` which averages the points at the indicated 3 offsets (2 points ago, current point, 2 points ahead). or maybe what you want is `list(c(-4, -2, 0, 2, 4))` See `?rollapply` for more info on the `width` argument. – G. Grothendieck Jan 18 '18 at 14:51
  • @G.Grothendieck, thanks for your input. Using that command gave the following: "Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument" I had more success with Nathan's example below. – jls Jan 19 '18 at 11:55
  • I tried it on the data in the question and it does not produce any errors. You will need to provide a reproducible example that demonstrates the error. – G. Grothendieck Jan 19 '18 at 14:10
  • @G.Grothendieck the error above DID occur with the reproducible example provided. I didn't run it on my original dataset because it didn't work on the example. – jls Jan 23 '18 at 12:39
  • Are you using the latest version of zoo, 1.8-1? You can check with `packageVersion("zoo")`. – G. Grothendieck Jan 23 '18 at 13:20
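The offset form of `width` suggested in the comments above can, in principle, be scaled up to the half-hourly case described in the question. The following is only a sketch under stated assumptions: the helper name `fill_same_slot` and its arguments `per_day`, `days`, and `min_obs` are hypothetical (they do not come from the thread), the series is assumed to be regular with no missing rows, and a zoo version that accepts a list of offsets for `width` is assumed.

```r
library(zoo)

# Hypothetical helper (name and arguments are illustrative, not from the thread).
# x:       numeric vector sampled at regular intervals, gaps coded as NA
# per_day: observations per day (48 for half-hourly data)
# days:    number of days to look back and ahead
# min_obs: minimum number of non-missing neighbours required to fill a gap
fill_same_slot <- function(x, per_day = 48, days = 7, min_obs = 4) {
  # Offsets that land on the same time of day: +/- 1..days whole days
  offsets <- c(-days:-1, 1:days) * per_day

  neighbour_mean <- rollapply(
    x,
    width = list(offsets),   # a list means "offsets", not a window width
    FUN = function(w) {
      if (sum(!is.na(w)) < min_obs) NA_real_ else mean(w, na.rm = TRUE)
    },
    fill = NA                # positions too close to either end stay NA
  )

  # Only replace values that were missing in the first place
  ifelse(is.na(x), neighbour_mean, x)
}

# Toy data from the question: "same slot" means same parity, so per_day = 2
values <- c(1,3,1,3,1,3,1,3,1,NA,1,3,1,3,NA,3,1,3,1,3,1,3,1,3)
filled <- fill_same_slot(values, per_day = 2, days = 1, min_obs = 1)
filled[c(10, 15)]
```

On the toy data the two gaps (rows 10 and 15) should come out as 3 and 1. For the real series the defaults (`per_day = 48`, `days = 7`) correspond to the 14-day scheme in the question. Gaps within 7 days of either end are left as NA, because `fill = NA` is used wherever an offset falls outside the data.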

1 Answer

rollapply is the right tool for the job, but your call needs some tweaking.

library(zoo)

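data$gap_fill <- rollapply(
  data    = data$values,
  width   = 5,
  FUN     = function(x) {
    # In a full 5-point window, positions 1, 3 and 5 share the
    # odd/even status of the centre point.
    same_oddity <- seq(1, length(x), by = 2)
    mean(x[same_oddity], na.rm = TRUE)
  },
  # partial is left at its default (FALSE): truncated edge windows would
  # shift positions 1, 3, 5 onto the wrong parity, and the question asks
  # for NA at the start and end of the series anyway.
  fill    = NA,
  align   = "center"
)

# Use the rolling mean only where the original value is missing
data$imputed <- ifelse(
  is.na(data$values),
  data$gap_fill,
  data$values
)

data
#    hour values gap_fill imputed
# 1     0      1       NA       1
# 2     1      3       NA       3
# 3     2      1        1       1
# 4     3      3        3       3
# 5     4      1        1       1
# 6     5      3        3       3
# 7     6      1        1       1
# 8     7      3        3       3
# 9     8      1        1       1
# 10    9     NA        3       3
# 11   10      1        1       1
# 12   11      3        3       3
# 13   12      1        1       1
# 14   13      3        3       3
# 15   14     NA        1       1
# 16   15      3        3       3
# 17   16      1        1       1
# 18   17      3        3       3
# 19   18      1        1       1
# 20   19      3        3       3
# 21   20      1        1       1
# 22   21      3        3       3
# 23   22      1       NA       1
# 24   23      3       NA       3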

What I changed:

  • The by parameter only controls which positions get a window: one is evaluated at every by-th element of data. It doesn't affect what FUN sees, so you need to subset each window inside FUN.
  • by.column only matters when data is matrix-shaped. It tells rollapply to apply the function to each column separately and return a matrix.
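A small sketch of the first point, using a made-up vector `x` (not from the question) and `FUN = length` so that the result shows how many points each window hands to FUN:

```r
library(zoo)

x <- 1:10  # illustrative data only

# One window per position where a full 3-point window fits
every_window <- rollapply(x, width = 3, FUN = length)

# by = 2 evaluates fewer windows, but each window passed to FUN
# still contains all 3 consecutive points
every_second <- rollapply(x, width = 3, FUN = length, by = 2)

every_window
every_second
```

Both results should consist solely of 3s, with `every_second` being the shorter of the two. That is why skipping every other point has to happen inside FUN rather than through `by`.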
Nathan Werth
  • Thanks, Nathan. Your solution works perfectly for the example data. Unfortunately, it doesn't work for my actual dataset (live and learn, I should have provided a reproducible example in the first place). I think the problem is in the formula itself, but after playing with it for a few hours, I can't fix it. Here is the link to my dataset, if you'd like to test your code: [link](https://drive.google.com/file/d/1HdO816BpPhAeNMUPEn_sdmkIomdV4WcW/view?usp=sharing) – jls Jan 19 '18 at 11:49
  • @jls Sadly, my work blocks Google file sharing stuff. So somebody else will have to take this up. – Nathan Werth Jan 19 '18 at 15:50