My apologies if this is a duplicate. I have searched but have not found an answer to this.

I have thousands of DNA sequences, each ~50 bp (50 characters) long. Each has a variable sequence in the middle, ranging from 6-30 bp, flanked by two conserved sequences, one on the left and one on the right, which are ~10 bp long. My data looks like this:

ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA
ATTGCGCGA = conserved area on the left (reference)

NAAANNNANNNNNNA = variable sequence between the conserved areas

CGAAAATTTA = conserved area on the right (reference)

So far so good. I know how to extract the string between the conserved areas. However, I anticipate occasional mistakes in the conserved areas. I want a way to allow some mismatches in the conserved areas (e.g. two or three) and extract any sequence/string that lies between them and has a length of 6-30 bp.

My data looks like this:

1 ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA # it looks good
2 ATTGCGCGA NAAANNNAN  CGAAAATTTA      # it looks good
3 ATTGCGCGA NAA CGAAAATTTA             # the variable sequence is too short
4 ATASGCGCGA NAAGGNNN CGAfATTTA        # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGA NAAGGNNN CGAfjfkdfTTA    # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGA NAAGGNNN CGAfjfkdfTTA      # more than 3 mismatches at the left conserved area

I would like my output to look like this:

1 NAAANNNANNNNNNA # it looks good
2 NAAANNNAN       # it looks good
4 NAAGGNNN        # two mismatches on the left and two on the right conserved sequences

Important: my data is not separated into chunks with gaps. I added the gaps above only to make it visually understandable. The raw data looks like this:

1 ATTGCGCGANAAANNNANNNNNNACGAAAATTTA # it looks good
2 ATTGCGCGANAAANNNANCGAAAATTTA       # it looks good
3 ATTGCGCGANAACGAAAATTTA             # the variable sequence is too short
4 ATASGCGCGANAAGGNNNCGAfATTTA        # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGANAAGGNNNCGAfjfkdfTTA    # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGANAAGGNNNCGAfjfkdfTTA      # more than 3 mismatches at the left conserved area
LDT
  • Could you describe what constitutes a match/mismatch or an error? – 0x263A Sep 09 '21 at 19:16
  • The only requirement for the middle chunk is to be from 6-25bp. The mismatches refer to the left and right conserved areas. The references for the left and right areas are ATTGCGCGA and CGAAAATTTA respectively – LDT Sep 09 '21 at 19:17
  • I wrote my answer before you added "!!! Important My data is not separated in chunks with gaps..." to the question. You will have to add the split logic and calculate the distance as suggested in the answer. – balderman Sep 09 '21 at 19:19
  • Thank you balderman, I'll try to work in this direction. I am still not sure how to connect X & Z to the reference, but I'll work on it – LDT Sep 09 '21 at 19:21
  • https://stackoverflow.com/questions/46540062/fuzzy-regex-e-g-e-2-correct-usage-in-python is relevant, and part of the solution. Let me know if you need a complete solution; I can attempt one – Willem Hendriks Sep 09 '21 at 20:34
  • That's very helpful, Willem. I highly appreciate it. I am trying to find a solution in the dark of night. I would not say no to a little bit of help hahaha – LDT Sep 09 '21 at 20:37

1 Answer

Each line has 3 parts. Let's call them X, Y and Z.
You are interested in Y, but you say that X & Z sometimes come with variations.

You should use the Levenshtein distance to measure the "distance" between the actual X & Z and the reference X & Z.

If the "distance" is small enough, you can take Y.

See here for a Python library that will calculate the distance.
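Since the raw lines are not split into X, Y and Z, here is a minimal sketch of one way to combine the split with the distance check. It is an illustration under assumptions, not a definitive implementation: the function names `levenshtein`, `best_flank` and `extract_middle` are made up for this example, and the distance is computed in pure Python instead of with a library. For each end of the line it tries every prefix (or suffix) whose length is within `max_err` of the reference length, keeps the one closest to the reference, and then checks the length of whatever is left in the middle.

```python
LEFT = "ATTGCGCGA"    # reference for the left conserved area (X)
RIGHT = "CGAAAATTTA"  # reference for the right conserved area (Z)


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def best_flank(seq, ref, max_err, from_left):
    """Find the prefix (or suffix) of seq that is closest to ref.

    Returns (distance, length offset, flank length), or None if seq is too short.
    Ties on distance go to the flank whose length is closest to len(ref).
    """
    best = None
    shortest = max(0, len(ref) - max_err)
    longest = min(len(seq), len(ref) + max_err)
    for n in range(shortest, longest + 1):
        piece = seq[:n] if from_left else seq[len(seq) - n:]
        cand = (levenshtein(piece, ref), abs(n - len(ref)), n)
        if best is None or cand < best:
            best = cand
    return best


def extract_middle(seq, left=LEFT, right=RIGHT, max_err=3, min_len=6, max_len=30):
    """Return Y if both flanks are within max_err edits of the references, else None."""
    lo = best_flank(seq, left, max_err, from_left=True)
    hi = best_flank(seq, right, max_err, from_left=False)
    if lo is None or hi is None or lo[0] > max_err or hi[0] > max_err:
        return None
    middle = seq[lo[2]:len(seq) - hi[2]]
    return middle if min_len <= len(middle) <= max_len else None


raw = [
    "ATTGCGCGANAAANNNANNNNNNACGAAAATTTA",
    "ATTGCGCGANAAANNNANCGAAAATTTA",
    "ATTGCGCGANAACGAAAATTTA",
    "ATASGCGCGANAAGGNNNCGAfATTTA",
    "ATASjkCGCGANAAGGNNNCGAfjfkdfTTA",
    "ATTGCGCGANAAGGNNNCGAfjfkdfTTA",
]
for number, seq in enumerate(raw, start=1):
    middle = extract_middle(seq)
    if middle is not None:
        print(number, middle)
```

On the six raw lines from the question this keeps lines 1, 2 and 4 and drops the rest. The closest flank is chosen first and the 6-30 length check is applied afterwards; otherwise a line like number 3 could be "rescued" by trimming the conserved areas to make the middle look longer. The comparison is case-sensitive, so lowercase letters count as mismatches.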

balderman