Sequence of numbers by hyphen without hyphenating single occurrences

Question

I want to generate readable number sequences (e.g. 1, 2, 3, 4 = 1-4), but for a set of data where each number in the sequence must have four digits (e.g. 99 = 0099 or 1 = 0001 or 1022 = 1022) AND where there are different letters in front of each number.

I was looking at the answer to this question, which managed to do almost exactly as I want with two caveats:

If there is a stand-alone number that does not appear in a sequence, it appears twice with a hyphen in between
If there are several stand-alone numbers that do not appear in a sequence, they are not included in the result

### Create Data Set ====
## Create the data for different tags. I'm only using two unique levels here, but in my dataset I've got
## 400+ unique levels.
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))

## Combine data
my.seq1 <- c(FM, SC)

## Sort data by number in sequence
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]

### Attempt Number Sequencing ====
## Get the letters
sp.tags <- substr(my.seq1, 1, 2)

## Get the readable number sequence
my.seq2 <- lapply(split(my.seq1, sp.tags), ## Split data by the tag ID
       function(x){
  
  ## Get the run lengths as per the linked answer
  rl <- rle(c(1, pmin(diff(as.numeric(substr(x, 3, 7))), 2)))
  
  ## Generate number sequence by separator as per the linked answer
  seq2 <- paste0(x[c(1, cumsum(rl$lengths))], c("-", ",")[rl$values], collapse="")
  
  return(substr(seq2, 1, nchar(seq2)-1))
})

## Combine lists and sort elements
my.seq2 <- unlist(strsplit(do.call(c, my.seq2), ","))
my.seq2 <- my.seq2[order(substr(my.seq2, 3, 7))]
names(my.seq2) <- NULL

my.seq2
[1] "FM0001-FM0001" "SC0002-SC0004" "FM0016-FM0019" "FM0028" "SC0039"

my.seq1
[1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021"
[13] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"

The major problems with this are:

Some values are completely missing from the data set (e.g. FM0021, FM0024, FM0026)
The first number in the sequence (FM0001) appears twice with a hyphen in between

I feel like I'm getting warmer by using A5C1D2H2I1M1N2O1R2T1's answer, which uses seqToHumanReadable from R.utils: it's quite elegant AND solves both problems. However, it introduces two new ones: I can't attach the tag ID in front of each number, and I can't force the numbers to four digits (e.g. 0004 becomes 4).

library(R.utils)

lapply(split(my.seq1, sp.tags), function(x){
  return(unlist(strsplit(seqToHumanReadable(substr(x, 3, 7)), ',')))
})

$FM
[1] "1"      " 16-19" " 21"    " 24"    " 26"    " 28"   

$SC
[1] "2-4" " 10" " 12" " 14" " 33" " 36" " 39"
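On those two remaining problems: the lost tag and zero padding can in principle be put back with sprintf in base R. The following is only a minimal sketch, assuming each piece is either a single number or a "first-last" pair as in the output above; `retag` is a hypothetical helper name, not part of R.utils.

```r
# Hypothetical helper: turn a piece such as " 16-19" into "FM0016-FM0019".
# Assumes the piece holds one number or two numbers joined by a hyphen.
retag <- function(piece, tag) {
  nums <- as.integer(strsplit(trimws(piece), "-", fixed = TRUE)[[1]])
  # %04d left-pads each number with zeros to four digits
  paste(sprintf("%s%04d", tag, nums), collapse = "-")
}

pieces <- c("1", " 16-19", " 21")
vapply(pieces, retag, character(1), tag = "FM", USE.NAMES = FALSE)
# "FM0001" "FM0016-FM0019" "FM0021"
```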

Ideally the result would be:

"FM0001, SC0002-SC0004, SC0010, SC0012, SC0014, FM0016-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"

Any ideas? It's one of those things that's really simple to do by hand but would take blinking ages, and you'd think a function would exist for it but I haven't found it yet or it doesn't exist :(

score 2 · Accepted Answer · answered Oct 13 '20 at 06:54

This should do it:

# get the prefix/tag and number
# note: [A-Za-z] rather than [A-z], which also matches a few punctuation characters
tag <- gsub("(^[A-Za-z]+)(.+)", "\\1", my.seq1)
num <- gsub("([A-Za-z]+)(\\d+$)", "\\2", my.seq1)

# get a sequence id
n <- length(tag)
do_match <- c(FALSE, diff(as.numeric(num)) == 1 & tag[-1] == tag[-n])
seq_id <- cumsum(!do_match) # a sequence id

# tapply to combine the result
res <- setNames(tapply(my.seq1, seq_id, function(x)
  if(length(x) < 2)
    return(x)
  else
    paste(x[1], x[length(x)], sep = "-")), NULL)

# show the result
res
#R>  [1] "FM0001"        "SC0002-SC0004" "SC0010"        "SC0012"        "SC0014"        "FM0016-FM0019" "FM0021"       
#R>  [8] "FM0024"        "FM0026"        "FM0028"        "SC0033"        "SC0036"        "SC0039"

# compare with 
my.seq1
#R>  [1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021" "FM0024"
#R> [14] "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
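To see why the grouping works, it may help to trace the sequence id on a handful of values. This is just an illustrative sketch on a shortened vector (`x` and the substr-based parsing are assumptions for the example, not the code above): an element joins the previous run only when its number is exactly one higher and its tag is the same.

```r
# Shortened example input; tags are the first two characters, numbers the last four
x   <- c("FM0001", "SC0002", "SC0003", "SC0004", "SC0010", "FM0016", "FM0017")
tag <- substr(x, 1, 2)
num <- as.numeric(substr(x, 3, 6))
n   <- length(x)

# TRUE where an element continues the run started by its predecessor
do_match <- c(FALSE, diff(num) == 1 & tag[-1] == tag[-n])
# every FALSE starts a new run, so the cumulative count of FALSEs labels the runs
seq_id <- cumsum(!do_match)

do_match
# FALSE FALSE TRUE TRUE FALSE FALSE TRUE
seq_id
# 1 2 2 2 3 4 4
```

Note that SC0002 does not join FM0001 even though the numbers are consecutive, because the tags differ.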

Data

FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
my.seq1 <- c(FM, SC)
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
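If a single comma-separated string is wanted, as in the question, the same idea can be wrapped in a function. This is a sketch under a few assumptions: every ID is letters followed by digits, and `collapse_runs` is a made-up name. Unlike the code above, it sorts by tag first so a run of one tag is not broken when another tag's number falls in between, then puts the pieces back in numeric order.

```r
# Hypothetical wrapper: collapse tagged IDs into "first-last" runs, one string out
collapse_runs <- function(ids, sep = ", ") {
  tag <- sub("^([A-Za-z]+)\\d+$", "\\1", ids)
  num <- as.numeric(sub("^[A-Za-z]+(\\d+)$", "\\1", ids))

  # sort by tag, then number, so each tag's values are adjacent
  o <- order(tag, num)
  ids <- ids[o]; tag <- tag[o]; num <- num[o]
  n <- length(ids)
  if (n == 0L) return("")

  # TRUE where an element continues the previous element's run
  is_next <- c(FALSE, diff(num) == 1 & tag[-1] == tag[-n])
  first <- which(!is_next)         # index where each run starts
  last  <- c(first[-1] - 1L, n)    # index where each run ends

  out <- ifelse(first == last, ids[first],
                paste(ids[first], ids[last], sep = "-"))
  # restore reading order by the number that starts each run
  out <- out[order(num[first], tag[first])]
  paste(out, collapse = sep)
}

FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
collapse_runs(c(FM, SC))
#R> [1] "FM0001, SC0002-SC0004, SC0010, SC0012, SC0014, FM0016-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"
```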
