I'm trying to convert the following dates with lubridate, but the first one is parsed incorrectly: "4/2/10" becomes "0010-04-02".

Is there a way to correct this?

thanks, Vivek

library(lubridate)

data <- data.frame(initialDiagnose = c("4/2/10","14.01.2009", "9/22/2005", 
        "4/21/2010", "28.01.2010", "09.01.2009", "3/28/2005", 
        "04.01.2005", "04.01.2005", "Created on 9/17/2010", "03 01 2010"))

mdy <- mdy(data$initialDiagnose)
dmy <- dmy(data$initialDiagnose)
# Some dates are ambiguous; give mdy precedence over dmy by
# filling in only the values that mdy could not parse.
mdy[is.na(mdy)] <- dmy[is.na(mdy)]
data$initialDiagnose <- mdy
data

   initialDiagnose
1       0010-04-02
2       2009-01-14
3       2005-09-22
4       2010-04-21
5       2010-01-28
6       2009-09-01
7       2005-03-28
8       2005-04-01
9       2005-04-01
10      2010-09-17
11      2010-03-01
Vivek Kumar
  • Is it just the first value, or do you need a more general solution for a larger data set? – Bryan Goggin Jun 09 '16 at 21:18
  • lots of useful info on converting 2 digit years to 4 digit years here: http://stackoverflow.com/questions/9508747/add-correct-century-to-dates-with-year-provided-as-year-without-century-y – jalapic Jun 09 '16 at 21:27
  • If you parse it individually, it works fine; it's just that the diversity of formats is stretching `parse_date_time`'s format guessing too wide. Assuming it isn't a huge vector, just loop it and it'll work fine: `do.call(c, lapply(data$initialDiagnose, lubridate::parse_date_time, orders = c('mdy', 'dmy')))` – alistaire Jun 09 '16 at 21:50

1 Answer

I think this happens because the mdy() function prefers to match the year with %Y (the full year, with century) over %y (the two-digit year, which is expanded to 19XX or 20XX). With %Y, the "10" in "4/2/10" is read literally as the year 10.
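To see the difference between the two specifiers in isolation, here is a minimal base R sketch. It uses as.Date as a stand-in for the parser rather than lubridate itself, and the variable names are my own:

```r
# The same string parsed with the two year specifiers
x <- "4/2/10"

# %Y takes "10" literally, giving the year 10
as_full_year <- as.Date(x, format = "%m/%d/%Y")

# %y treats "10" as a two-digit year; base R maps 00-68 to 20xx
# and 69-99 to 19xx
as_short_year <- as.Date(x, format = "%m/%d/%y")

# Pull the numeric year out of each result
year_full  <- as.POSIXlt(as_full_year)$year + 1900
year_short <- as.POSIXlt(as_short_year)$year + 1900
```

Here year_full is 10 and year_short is 2010, which mirrors the "0010-04-02" result in the question.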

There is a workaround, though. Near the bottom of the help file for lubridate::parse_date_time (?parse_date_time) there is an example that passes a select_formats argument so that %y is preferred over %Y when matching the year. The relevant code from the help file:

## ** how to use `select_formats` argument **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC"   "2013-09-27 UTC"

## to give priority to %y format, define your own select_format function:

my_select <-   function(trained){
   n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
   names(trained[ which.max(n_fmts) ])
}

parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## [1] "2013-09-27 UTC" "2013-09-27 UTC"
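If it helps to see why this function ends up picking %y, here is a rough sketch of the scoring on its own, without lubridate. The trained vector below is a hypothetical stand-in: as far as I understand the help file, lubridate passes a named vector whose names are the candidate formats, but the values here are made up for illustration.

```r
# Same function as in the help file
my_select <- function(trained) {
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) +
    grepl("%y", names(trained)) * 1.5
  names(trained[which.max(n_fmts)])
}

# Hypothetical input: two candidate formats that both matched
# (the counts are invented for this example)
trained <- c("%d-%m-%Y" = 2, "%d-%m-%y" = 2)

# Each format gets one point per "%" specifier, and formats
# containing "%y" get a 1.5 bonus
scores <- nchar(gsub("[^%]", "", names(trained))) +
  grepl("%y", names(trained)) * 1.5

# The highest-scoring format is the one that gets selected
chosen <- my_select(trained)
```

Both formats score 3 for their three specifiers, and the bonus lifts the %y format to 4.5, so chosen is "%d-%m-%y".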

So, for your example, you can adapt this code and replace the `mdy <- mdy(data$initialDiagnose)` line with this:

# Define a select function that prefers %y over %Y. This is copied 
# directly from the help files
my_select <-   function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
  names(trained[ which.max(n_fmts) ])
}

# Parse as mdy dates
mdy <- parse_date_time(data$initialDiagnose, "mdy", select_formats = my_select)
# [1] "2010-04-02 UTC" NA               "2005-09-22 UTC" "2010-04-21 UTC" NA              
# [6] "2009-09-01 UTC" "2005-03-28 UTC" "2005-04-01 UTC" "2005-04-01 UTC" "2010-09-17 UTC"
#[11] "2010-03-01 UTC"

Running the remaining lines of code from your question then gives me this data frame:

   initialDiagnose
1       2010-04-02
2       2009-01-14
3       2005-09-22
4       2010-04-21
5       2010-01-28
6       2009-09-01
7       2005-03-28
8       2005-04-01
9       2005-04-01
10      2010-09-17
11      2010-03-01
ialm