0

I'm trying to write a simple script to extract particular data from a VCF file, which displays variants in genome sequences.

The script needs to extract the header from the file, as well as SNVs while omitting any indels. Variants are displayed in 2 columns, the ALT and REF. Each column is separated by white space. Indels will have 2 characters in the ALT or REF, SNVs will always have 1.

What I have so far extracts the headers (which always begin with ##), but not any of the variant data.

original_file = open('/home/user/Documents/NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')

for line in original_file:
   if '##' in line:
       extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

for line in original_file:
    ref = str.split()[3]
    alt = str.split()[4]
    if len(ref) == 1 and len(alt) == 1:
        extracted_file.write(line)

original_file.close()
extracted_file.close()
Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
  • 1
    Are you sure you are splitting the data correctly? Why do you use `str.split()` if your variable is called `line`? It should be `line.split()` – alec_djinn Feb 21 '19 at 15:53
  • What output are you currently getting vs what output do you desire? – John Feb 21 '19 at 15:54
  • I think you're right, the data probably isn't splitting correctly, I've amended the code to `line.split()`. The code runs, but I'm not getting my desired output. Ideally, the extracted file will contain the complete header, plus the lines of data not containing indels. My current output is only the header lines, which begin with #. The rest of the VCF file contains lines of plain text without the # character. These lines are missing. – umbro_tracksuit Feb 21 '19 at 16:04
  • can you attached your file here? – Adirmola Feb 21 '19 at 16:05
  • Ive uploaded the files to this google drive, the VCF file can be opened as plain text. I've also uploaded my output. `https://drive.google.com/open?id=1kzFZOxliWmbCcezsmMfNt0EBUdcfdPud` – umbro_tracksuit Feb 21 '19 at 16:11

2 Answers2

1
original_file = open('NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')
i=0

for line in original_file:
    if '##' in line:
        extracted_file.write(line)
    else:
        ref = line.split('  ')[3]
        alt = line.split('  ')[4]
        if len(ref) == 1 and len(alt) == 1:
            extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

original_file.close()
extracted_file.close()

There Was two issues:

  1. The second loop never get executed because you already get to the end of the VCF file in the first loop. You can see here how to start over new loop on the same text file.
  2. You didn't separate the line correctly it is tab delimited.

So i set the code to execute with only one loop and placed tab as the split parameter.

Adirmola
  • 783
  • 5
  • 15
0

The answer given by Adirmola is fine, but you can improve the code quality by applying a few modifications:

# Use "with" context managers to open files.
# The closing will be automatic, even in case of problems.
with open("NA12878.vcf", "r") as vcf_file, \
        open("NA12878_SNV.txt", "w") as snv_file:
    for line in vcf_file:
        # Since you have specific knowledge of the format
        # you can be more specific when identifying header lines
        if line[:2] == "##":
            snv_file.write(line)
        else:
            # You can avoid doing the splitting twice
            # with list unpacking (using _ for ignored fields)
            # https://www.python.org/dev/peps/pep-3132/
            [_, _, _, ref, alt, *_] = line.split("\t")  # "\t" represents a tab
            if len(ref) == 1 and len(alt) == 1:
                snv_file.write(line)

I tested this with python 3.6 on your file, and I end up with 554 SNVs. Some of the syntax used here (especially for list unpacking) may not work with older python versions.

bli
  • 7,549
  • 7
  • 48
  • 94