Get sequences from a file and store them into a list in python

Question

Here is the code (I took it from this discussion, Translation DNA to Protein, but here I'm using an RNA file instead of a DNA file):

from itertools import takewhile

def translate_rna(sequence, d, stop_codons=('UAA', 'UGA', 'UAG')):
    # d is assumed to be the codon table (dict: codon -> amino acid)
    start = sequence.find('AUG')

    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]

    # Split it into triplets
    codons = [trimmed_sequence[i:i + 3] for i in range(0, len(trimmed_sequence), 3)]

    # Take all codons until first stop codon
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)

    # Translate and join into string
    protein_sequence = ''.join([d[codon] for codon in coding_sequence])

    # This line assumes there is always stop codon in the sequence
    return "{0}".format(protein_sequence)

Calling the translate_rna function:

# Read the whole file into one string, dropping the line breaks
with open("to_rna", "r") as f:
    sequence = ''.join(line.strip() for line in f)

translate_rna(sequence, d)

My to_rna file looks like:

CCGCCCCUCUGCCCCAGUCACUGAGCCGCCGCCGAGGAUUCAGCAGCCUCCCCCUUGAGCCCCCUCGCUU
CCCGACGUUCCGUUCCCCCCUGCCCGCCUUCUCCCGCCACCGCCGCCGCCGCCUUCCGCAGGCCGUUUCC
ACCGAGGAAAAGGAAUCGUAUCGUAUGUCCGCUAUCCAG.........

The function translates only the first protein (from the first AUG to the first stop codon).

I think the problem is in this line:

# Take all codons until first stop codon
coding_sequence  =  takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)

My question is: how can I tell Python (after finding the first AUG and storing its coding sequence in a list) to search for the next AUG in the RNA file and store its coding sequence in the next position?

As a result, I want to have a list like this:

['here_is_the_1st_coding_sequence', 'here_is_the_2nd_coding_sequence', ...]

PS: This is homework, so I can't use Biopython.

EDIT:

A simple way to describe the problem:

From this code:

from itertools import takewhile

lst = ['N', 'A', 'B', 'Z', 'C', 'A', 'V', 'V', 'Z', 'X']
ch = ''.join(lst)

stop = 'Z'
start = ch.find('A')

seq = takewhile(lambda x: x not in stop, ch)

I want to get this:

['AB', 'AVV']

EDIT 2:

For instance, from this string:

UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA

I should get as result:

['AUGCGCCGC', 'AUGGUUCCC', 'AUG']

So... do you know Python? What have you tried? I'm not in the business of writing people's code for them (especially when it's homework), but I'll certainly help when you get stuck *after trying*. — Bob Dylan, Nov 17 '15 at 16:51
I can't get `sequence.find` to understand that it should move to the next `AUG` after the stop codon. I tried a lot of things (I can paste the code here), such as replacing `takewhile` with a _for loop_ and defining a function that returns whether a codon is a stop codon or not (I'm just unable to jump to the next `AUG` codon!) (this is only a small part of the homework) – Bilal, Nov 17 '15 at 17:44

R Nar · Answer 1 · 2015-11-17T18:34:23.197

1

Looking at your basic code (I couldn't quite follow your main code), it looks like you just want to take the substring that starts at the first occurrence of one string and split it on every occurrence of another string. If that is wrong, please tell me and I can update accordingly.

To achieve this, Python has a built-in str.split(sub), which splits a string at every occurrence of sub, and str.index(sub), which returns the index of the first occurrence of sub. Example:

>>> ch = 'NABZCAVZX'
>>> ch[ch.index('A'):].split('Z')
['AB', 'CAV', 'X']

You can also specify substrings that are longer than one character:

>>> ch = 'NACBABQZCVEZTZCGE'
>>> ch[ch.index('AB'):].split('ZC')
['ABQ', 'VEZT', 'GE']
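
If every piece also has to begin at the start marker (as in the question's simplified example), splitting alone is not enough, because pieces such as 'CAV' keep the characters in front of the marker. A minimal sketch of one alternative, assuming single-character markers 'A' and 'Z' and using re.findall instead of split:

```python
import re

ch = 'NABZCAVZX'
# 'A' starts a piece, [^Z]* takes everything up to (not including) the next 'Z',
# and the lookahead (?=Z) requires that a 'Z' actually follows.
pieces = re.findall(r'A[^Z]*(?=Z)', ch)
print(pieces)  # ['AB', 'AV']
```

Anything between a 'Z' and the next 'A' is skipped, and a trailing piece with no closing 'Z' is dropped.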

Using multiple delimiters:

>>> import re
>>> stop_codons = ['UAA','UGA','UAG']
>>> delim = re.compile('|'.join(stop_codons))
>>> ch = 'CCHAUAABEGTAUAAVEGTUGAVKEGUAABEGEUGABRLVBUAGCGGA'
>>> delim.split(ch)
['CCHA', 'BEGTA', 'VEGT', 'VKEG', 'BEGE', 'BRLVB', 'CGGA']

Note that there is no order preference to the split, i.e. if there is a UGA ahead of a UAA, it will still split on the UGA. I am not sure if that's what you want, but that's how it works.
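
The split above matches a stop codon at any offset and does not make each piece start with AUG. A rough sketch of a loop that does both, assuming codons are read in frame from each AUG and that the search for the next AUG resumes right after the stop codon. The name find_coding_sequences is made up for illustration, and keeping a final sequence that never reaches a stop codon is my assumption, not something the question specifies:

```python
STOP_CODONS = ('UAA', 'UGA', 'UAG')


def find_coding_sequences(sequence, start_codon='AUG', stop_codons=STOP_CODONS):
    """Return each stretch from a start codon up to (not including)
    the next in-frame stop codon. Hypothetical helper for illustration."""
    results = []
    pos = 0
    while True:
        # Jump to the next start codon at or after pos
        start = sequence.find(start_codon, pos)
        if start == -1:
            break
        # Walk forward three letters at a time until a stop codon appears
        end = start
        found_stop = False
        while end + 3 <= len(sequence):
            if sequence[end:end + 3] in stop_codons:
                found_stop = True
                break
            end += 3
        results.append(sequence[start:end])
        if not found_stop:
            # Assumption: keep the unterminated tail (complete codons only)
            break
        # Resume the search right after the stop codon
        pos = end + 3
    return results


print(find_coding_sequences('UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA'))
# ['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
```

Each element of the returned list starts with AUG and excludes the stop codon, so it could then be cut into triplets and looked up in the codon table one sequence at a time.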

edited Nov 17 '15 at 18:34

answered Nov 17 '15 at 17:41

R Nar


Thank you ! that's helping a lot, but here i have multiple delimiters: `UAA` and `UGA` and `UAG` – Bilal Nov 17 '15 at 18:25
ah, you will have to use regex for that then. I will edit. – R Nar Nov 17 '15 at 18:25
sorryyy ! i was mistaken in the simple example :/ – Bilal Nov 17 '15 at 18:33
dont worry about it, check the edit for multiple delimiters. – R Nar Nov 17 '15 at 18:34
There is another condition, _every element on the list should start with `AUG`_ (this is the part where i was mistaken on the example :/ ) – Bilal Nov 17 '15 at 18:40
hahaha you just keep adding on new things dont ya. I will reconsider and see if there is a better way with all this in mind, give me a few minutes – R Nar Nov 17 '15 at 18:51
@Bilal just to clarify, do you want to include the `AUG` in your ending sequence? it would be really great if you gave some sort of expected output for your real code instead of just the example – R Nar Nov 17 '15 at 18:56
That's right, i want to include `AUG` but not the `stop_codon` (i added an example) – Bilal Nov 17 '15 at 19:09
