
I want to merge lines 2 and 3 of every three-line record into one line and keep line 1 (the header) unchanged. Here is an example of my text:

>chrX:147147161-147148161
ATGATGGTGATGTACAGATGGGTTTTTGG
TTATCTAATTCATGTGTTGGTCAGATCAA
>chrY:16119725-16120725
CAGCTTTGTTCCGTTGCTGGTGAGGAACT
GACTCCCTGGGTGTAGGACCCTCCGAGCC

This is what I want it to look like:

>chrX:147147161-147148161
ATGATGGTGATGTACAGATGGGTTTTTGGTTATCTAATTCATGTGTTGGTCAGATCAA
>chrY:16119725-16120725
CAGCTTTGTTCCGTTGCTGGTGAGGAACTGACTCCCTGGGTGTAGGACCCTCCGAGCC

I have tried several approaches, but none has worked so far. Here is what I have tried:

> sed '/>$/,/>$/ {//b; N; s/\n//;}' file.txt

This command did not merge my lines. I also tried this earlier:

> paste -d "" - - < txt.file

This only merges each chr header line with the sequence line that follows it, which is not what I want. Can someone give me some suggestions? Thank you!

Timur Shtatland
minhntran

4 Answers

1

Assumptions:

  • sequence lines do not contain (trailing) white space
  • sequence lines do not end with a \r (windows/dos line ending)
  • a sequence spans exactly 2 lines (not 1 line, not 3+ lines)
  • NOTE: additional logic can be added to address any invalid assumptions

A couple variations on an awk idea:

awk '
(NR%3)==2 { line2=$0; next }
(NR%3)==0 { print line2 $0; next }
1
' file.txt

####################

awk '
(NR%3)==2 { line2=$0; next }
          { print line2 $0 }
(NR%3)==0 { line2="" }
' file.txt

Both of these generate:

>chrX:147147161-147148161
ATGATGGTGATGTACAGATGGGTTTTTGGTTATCTAATTCATGTGTTGGTCAGATCAA
>chrY:16119725-16120725
CAGCTTTGTTCCGTTGCTGGTGAGGAACTGACTCCCTGGGTGTAGGACCCTCCGAGCC
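
If the first two assumptions might not hold (trailing white space or \r line endings), one possible adjustment is to strip those characters from every line before applying the same NR%3 logic. This is a minimal sketch, not tested against real data; the function name merge_clean is made up for illustration, and the input is still assumed to have exactly three lines per record:

```shell
# merge_clean is a hypothetical helper name, not part of the original answer.
# Assumes every record is exactly 3 lines: header, sequence, sequence.
merge_clean() {
    awk '
    { sub(/[ \t\r]+$/, "") }           # drop trailing spaces, tabs and \r
    (NR%3)==2 { line2=$0; next }       # hold the first sequence line
    (NR%3)==0 { print line2 $0; next } # print both sequence lines joined
    1                                  # header line: print as is
    ' "$1"
}
```

It would be called as merge_clean file.txt. Note that the header lines are cleaned too, so the output uses plain \n line endings throughout.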
markp-fuso
1

You are dealing with FASTA files, which are easy to process with awk. The following works for generic FASTA files, with one or more lines per sequence:

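awk 'BEGIN{RS=">";FS="\n";OFS=""}   # one record per ">" entry, one field per line
    (FNR==1){next}                   # skip the empty record before the first ">"
    # name is the header text; seq is everything after it with newlines removed
    {name=$1;seq=substr($0,index($0,FS));gsub(FS,OFS,seq)}
    {print RS name FS seq}' file.fasta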
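
For comparison, the same "any number of lines per sequence" behaviour can also be sketched line by line, without changing RS, by accumulating sequence lines until the next header appears. This is only an illustrative variant under the assumption that header lines start with ">" and nothing else does; the function name unwrap_fasta is hypothetical:

```shell
# unwrap_fasta is a hypothetical helper name used only for this sketch.
unwrap_fasta() {
    awk '
    /^>/ {                          # header line starts a new record
        if (seq != "") print seq    # flush the previous sequence, if any
        seq = ""
        print                       # print the header unchanged
        next
    }
    { seq = seq $0 }                # append each sequence line
    END { if (seq != "") print seq } # flush the last sequence
    ' "$1"
}
```

It would be called as unwrap_fasta file.fasta. Unlike the RS=">" version, it reads one line at a time, which some people find easier to extend with extra per-line checks.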
kvantour
1
awk '
    /^>/ {
          header=$0; lines=""; i=0
          while(i<2){
            getline
            lines=lines $0
            i++
          }
          printf "%s\n%s\n", header, lines
    }
' input

>chrX:147147161-147148161
ATGATGGTGATGTACAGATGGGTTTTTGGTTATCTAATTCATGTGTTGGTCAGATCAA
>chrY:16119725-16120725
CAGCTTTGTTCCGTTGCTGGTGAGGAACTGACTCCCTGGGTGTAGGACCCTCCGAGCC
ufopilot
0

If ed is available/acceptable.

With the v global command, something like:

printf '%s\n' 'v/^>/j' ,p Q | ed -s file.fasta

With the g global command, something like:

printf '%s\n' 'g/^[^>]/j' ,p Q | ed -s file.fasta 

  • Change Q to w if in-place editing is needed.
  • Remove the ,p to silence the output.
Jetchisel