My apologies if this is a duplicate. I have searched but have not found an answer to this.
I have thousands of DNA sequences, each ~50 bp (50 characters) long. Each one has a variable sequence in the middle, 6-30 bp long, flanked by two conserved sequences of ~10 bp each, one on the left and one on the right. My data looks like this:
ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA
ATTGCGCGA = conserved area on the left (reference)
NAAANNNANNNNNNA = random sequence between
CGAAAATTTA = conserved area on the right (reference)
So far so good: I know how to extract the string between the conserved areas when they match the references exactly. However, I expect occasional errors in the conserved areas. I want a way to allow some mismatches in each conserved area (e.g. two or three) and still extract the sequence between them, provided its length is between 6 and 30 bp.
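For reference, the exact-match case can be sketched in a few lines of Python. This is only an illustration: the names LEFT, RIGHT and extract_exact are made up for this example, and it assumes every read starts with the left conserved sequence and ends with the right one.

```python
import re

LEFT = "ATTGCGCGA"      # left conserved reference (from the example above)
RIGHT = "CGAAAATTTA"    # right conserved reference (from the example above)

# Capture 6-30 characters between the two exact flanks.
EXACT = re.compile(re.escape(LEFT) + r"(.{6,30})" + re.escape(RIGHT))


def extract_exact(read):
    """Return the variable part, or None if the flanks or the length do not fit."""
    m = EXACT.fullmatch(read)
    return m.group(1) if m else None


print(extract_exact("ATTGCGCGANAAANNNANCGAAAATTTA"))   # flanks match exactly
print(extract_exact("ATASGCGCGANAAGGNNNCGAfATTTA"))    # flanks contain errors
```

With exact flanks this returns the variable part, but any read with an error in a conserved area is rejected, which is the limitation I am trying to get around.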
Here are some example sequences, with a comment on each:
1 ATTGCGCGA NAAANNNANNNNNNA CGAAAATTTA # it looks good
2 ATTGCGCGA NAAANNNAN CGAAAATTTA # it looks good
3 ATTGCGCGA NAA CGAAAATTTA # the variable sequence is too short
4 ATASGCGCGA NAAGGNNN CGAfATTTA # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGA NAAGGNNN CGAfjfkdfTTA # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGA NAAGGNNN CGAfjfkdfTTA # more than 3 mismatches at the right conserved area
I would like my output to look like this:
1 NAAANNNANNNNNNA # it looks good
2 NAAANNNAN # it looks good
4 NAAGGNNN # two mismatches on the left and two on the right conserved sequences
!!! Important: my data is not separated into chunks by gaps. I only added the gaps above to make the structure visible. The raw data looks like this:
1 ATTGCGCGANAAANNNANNNNNNACGAAAATTTA # it looks good
2 ATTGCGCGANAAANNNANCGAAAATTTA # it looks good
3 ATTGCGCGANAACGAAAATTTA # the variable sequence is too short
4 ATASGCGCGANAAGGNNNCGAfATTTA # two mismatches on the left and two on the right conserved sequences
5 ATASjkCGCGANAAGGNNNCGAfjfkdfTTA # more than 3 mismatches at the left area and more than 3 mismatches at the right area
6 ATTGCGCGANAAGGNNNCGAfjfkdfTTA # more than 3 mismatches at the right conserved area
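One detail worth noting about sequence 4: its left flank is one character longer than the reference and its right flank is one character shorter, so "mismatch" here has to cover insertions and deletions as well as substitutions. One possible way to frame the task is therefore with edit (Levenshtein) distance. The sketch below is a rough illustration in plain Python, not a tuned solution. It assumes that each read begins with the left flank and ends with the right flank, it picks the prefix and suffix that are closest to the references, and only then applies the error and length limits. The names edit_distance, extract_fuzzy and max_err are invented for this example.

```python
LEFT = "ATTGCGCGA"      # left conserved reference (from the post)
RIGHT = "CGAAAATTTA"    # right conserved reference (from the post)


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # drop ca
                           cur[j - 1] + 1,             # drop cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def extract_fuzzy(read, max_err=3, min_len=6, max_len=30):
    """Assumption: the read starts with the left flank and ends with the right one."""
    n = len(read)
    # Score every possible prefix against LEFT and every suffix against RIGHT.
    left_err = [edit_distance(read[:i], LEFT) for i in range(n + 1)]
    right_err = [edit_distance(read[j:], RIGHT) for j in range(n + 1)]
    # Take the closest prefix and suffix (ties go to the smallest index).
    i = min(range(n + 1), key=left_err.__getitem__)
    j = min(range(n + 1), key=right_err.__getitem__)
    if left_err[i] > max_err or right_err[j] > max_err:
        return None                      # too many errors in a conserved area
    middle = read[i:j]                   # empty if the two flanks overlap
    return middle if min_len <= len(middle) <= max_len else None


reads = [
    "ATTGCGCGANAAANNNANNNNNNACGAAAATTTA",
    "ATTGCGCGANAAANNNANCGAAAATTTA",
    "ATTGCGCGANAACGAAAATTTA",
    "ATASGCGCGANAAGGNNNCGAfATTTA",
    "ATASjkCGCGANAAGGNNNCGAfjfkdfTTA",
    "ATTGCGCGANAAGGNNNCGAfjfkdfTTA",
]
for number, read in enumerate(reads, 1):
    middle = extract_fuzzy(read)
    if middle is not None:
        print(number, middle)
```

On the six raw examples this keeps sequences 1, 2 and 4 and drops the rest. Locating the flanks first and filtering on length afterwards matters: if the length limit were part of the search, sequence 3 could be "rescued" by trimming both flanks within the error budget. The cost is quadratic in the read length per read, which should be fine for thousands of ~50 bp sequences but is not optimized.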