Extracting lines of text depending on the len() of a particular column

Question

I'm trying to write a simple script to extract particular data from a VCF file, which displays variants in genome sequences.

The script needs to extract the header from the file, as well as SNVs while omitting any indels. Variants are displayed in 2 columns, the ALT and REF. Each column is separated by white space. Indels will have 2 characters in the ALT or REF, SNVs will always have 1.

What I have so far extracts the headers (which always begin with ##), but not any of the variant data.

original_file = open('/home/user/Documents/NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')

for line in original_file:
   if '##' in line:
       extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

for line in original_file:
    ref = str.split()[3]
    alt = str.split()[4]
    if len(ref) == 1 and len(alt) == 1:
        extracted_file.write(line)

original_file.close()
extracted_file.close()

Are you sure you are splitting the data correctly? Why do you use `str.split()` if your variable is called `line`? It should be `line.split()` — alec_djinn, Feb 21 '19 at 15:53
What output are you currently getting vs what output do you desire? — John, Feb 21 '19 at 15:54
I think you're right, the data probably isn't splitting correctly, I've amended the code to `line.split()`. The code runs, but I'm not getting my desired output. Ideally, the extracted file will contain the complete header, plus the lines of data not containing indels. My current output is only the header lines, which begin with #. The rest of the VCF file contains lines of plain text without the # character. These lines are missing. — umbro_tracksuit, Feb 21 '19 at 16:04
Ive uploaded the files to this google drive, the VCF file can be opened as plain text. I've also uploaded my output. `https://drive.google.com/open?id=1kzFZOxliWmbCcezsmMfNt0EBUdcfdPud` — umbro_tracksuit, Feb 21 '19 at 16:11

Adirmola · Accepted Answer · 2019-02-21T16:37:49.203

original_file = open('NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')
i=0

for line in original_file:
    if '##' in line:
        extracted_file.write(line)
    else:
        ref = line.split('  ')[3]
        alt = line.split('  ')[4]
        if len(ref) == 1 and len(alt) == 1:
            extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

original_file.close()
extracted_file.close()

There Was two issues:

The second loop never get executed because you already get to the end of the VCF file in the first loop. You can see here how to start over new loop on the same text file.
You didn't separate the line correctly it is tab delimited.

So i set the code to execute with only one loop and placed tab as the split parameter.

Awesome, that is perfect! Thanks a lot! – umbro_tracksuit Feb 21 '19 at 16:38 — umbro_tracksuit, Feb 21 '19 at 16:38
I'd do `re.split(r'\t+', line)` with `import re`. – Stefan Falk Feb 21 '19 at 16:40 — Stefan Falk, Feb 21 '19 at 16:40

score 0 · Answer 2 · answered Feb 27 '19 at 16:32

The answer given by Adirmola is fine, but you can improve the code quality by applying a few modifications:

# Use "with" context managers to open files.
# The closing will be automatic, even in case of problems.
with open("NA12878.vcf", "r") as vcf_file, \
        open("NA12878_SNV.txt", "w") as snv_file:
    for line in vcf_file:
        # Since you have specific knowledge of the format
        # you can be more specific when identifying header lines
        if line[:2] == "##":
            snv_file.write(line)
        else:
            # You can avoid doing the splitting twice
            # with list unpacking (using _ for ignored fields)
            # https://www.python.org/dev/peps/pep-3132/
            [_, _, _, ref, alt, *_] = line.split("\t")  # "\t" represents a tab
            if len(ref) == 1 and len(alt) == 1:
                snv_file.write(line)

I tested this with python 3.6 on your file, and I end up with 554 SNVs. Some of the syntax used here (especially for list unpacking) may not work with older python versions.

Extracting lines of text depending on the len() of a particular column

2 Answers2