I have tried many approaches, but none of them produces the result I need.

Input:

condor  t   airline airline
eight   n   0   flightnumber
nine    n   0   flightnumber
five    n   0   flightnumber
hallo   t   0   sentence
turn    t   com turn_heading
left    t   0   direction
heading t   com turn_heading
three   n   0   degree_absolute
two     n   0   degree_absolute
zero    n   0   degree_absolute

Expected Output:

<s> <callsign> <airline> condor </airline> <flightnumber> eight nine five </flightnumber> </callsign> hallo <command="turn_heading"> turn <direction> left </direction> heading <degree_absolute> three two zero </degree_absolute> </command> </s>

Every time I read in the contents, the tabs get in the way of tokenizing the strings, whether I read them as a list or as strings. This is what I get when I try to strip the tabs:

['condor\tt\tairline\tairline\n', 'eight\tn\t \tflightnumber\n', 'nine\tn\t \tflightnumber\n', 'five\tn\t \tflightnumber\n', 'hallo\tt\t \tsentence\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'three\tn\t \tdegree_absolute\n', 'two\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n', '\n', 'aeh\tt\t \tsentence\n', 'two\tn\t \tflightnumber\n', 'eight\tn\t \tflightnumber\n', 'november\tt\tflightnumber\tflightnumber\n', 'hallo\tt\t \tsentence\n', 'reduce\tt\tcom\treduce\n', 'two\tn\t \tspeed\n', 'two\tn\t \tspeed\n', 'zero\tn\t \tspeed\n', 'knots\tt\t \treduce\n', '\n', 'condor\tt\tairline\tairline\n', 'eight\tn\t \tflightnumber\n', 'nine\tn\t \tflightnumber\n', 'five\tn\t \tflightnumber\n', 'descend\tt\tcom\tdescend\n', 'three\tn\t \taltitude\n', 'thousand\tn\t \taltitude\n', 'feet\tt\t \tdescend\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'two\tn\t \tdegree_absolute\n', 'six\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n', 'cleared\tt\tcom\tcleared_ils\n', 'ils\tt\t \tcleared_ils\n', 'runway\tt\t \tcleared_ils\n', 'two\tn\t \trunway\n', 'three\tn\t \trunway\n', 'left\tt\t \trunway\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'two\tn\t \tdegree_absolute\n', 'five\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n']

How can I strip the tabs, tokenize the lines, and convert them to the markup format shown above?

The code I have been using to remove control characters:

import string
with open('input.txt', 'r') as file1:
    lines = str(list(file1))
    print lines.translate(string.maketrans("\n\t\r", "   "))
Harish Prasanna

1 Answer

This is very easy if you use the csv module:

>>> import csv
>>> f = ["condor\tt\tairline\tairline", 
         "eight\tn\t0\tflightnumber",
         "nine\tn\t0\tflightnumber",
         "turn\tt\tcom\tturn_heading",
         "left\tt\t0\tdirection"] # fake 'file' for testing
>>> list(csv.DictReader(f, delimiter="\t"))
[{'condor': 'eight', 't': 'n', 'airline': 'flightnumber'}, 
 {'condor': 'nine', 't': 'n', 'airline': 'flightnumber'},
 {'condor': 'turn', 't': 't', 'airline': 'turn_heading'}, 
 {'condor': 'left', 't': 't', 'airline': 'direction'}]

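Note that I pass delimiter='\t' to mark the input as tab-delimited (rather than the default comma-delimited), and use DictReader to turn each line into a dictionary {fieldname: value, ...}. Because no fieldnames argument is given, DictReader takes the first row as the field names, which is why 'condor', 't' and 'airline' appear as the keys above.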
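Since the sample input has no header row, one option is to pass the column names explicitly so that the first line is kept as data. The sketch below assumes the names word, kind, tag and label, which are my own and do not come from the file. It also takes the open file (or any iterable of lines) directly: a single string such as str(list(file1)) would be iterated one character at a time.

```python
import csv

# Hypothetical column names; the input file itself has no header row.
FIELDNAMES = ["word", "kind", "tag", "label"]


def read_rows(lines):
    """Parse tab-separated lines into one dict per non-blank line.

    `lines` must be an iterable of lines (an open file or a list of
    strings), not one big string.
    """
    reader = csv.DictReader(lines, fieldnames=FIELDNAMES, delimiter="\t")
    # DictReader skips completely blank lines on its own.
    return list(reader)


# Usage with the file from the question:
#     with open("input.txt") as f:
#         rows = read_rows(f)
```

Each row then looks like {'word': 'condor', 'kind': 't', 'tag': 'airline', 'label': 'airline'}. Note that blank lines are dropped here, so if they separate utterances in your file you would need to split on them before parsing.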

You can then process those dictionaries into any format you want.
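As a rough illustration of that last step, here is a sketch that turns one utterance into the markup from the question. The nesting rules are assumptions inferred from the single expected output, not a specification: "sentence" words are emitted bare, airline and flightnumber are wrapped in <callsign>, a row whose third column is "com" opens a command named by its last column, and words labelled with the open command's name stay bare inside it. The helper names are hypothetical.

```python
import csv

# Assumption: these labels together make up a <callsign> block.
CALLSIGN_LABELS = {"airline", "flightnumber"}


def tag(name):
    """Return the (opening, closing) text for a plain tag."""
    return ("<%s>" % name, "</%s>" % name)


def command(name):
    """Return the (opening, closing) text for a command tag."""
    return ('<command="%s">' % name, "</command>")


def to_markup(rows):
    """Convert the 4-column rows of ONE utterance into a markup string."""
    out = ["<s>"]
    stack = []  # currently open (opening, closing) pairs, outermost first
    for word, _kind, flag, label in rows:
        in_command = bool(stack) and stack[0][1] == "</command>"
        # Decide which tags should be open around this word.
        if label == "sentence":
            path = []
        elif label in CALLSIGN_LABELS:
            path = [tag("callsign"), tag(label)]
        elif flag == "com" or (in_command and stack[0] == command(label)):
            path = [command(label)]
        elif in_command:
            path = [stack[0], tag(label)]
        else:
            path = [tag(label)]
        # Keep the shared outer tags, close the rest, open what is missing.
        keep = 0
        while keep < min(len(stack), len(path)) and stack[keep] == path[keep]:
            keep += 1
        while len(stack) > keep:
            out.append(stack.pop()[1])
        for pair in path[keep:]:
            stack.append(pair)
            out.append(pair[0])
        out.append(word)
    while stack:  # close whatever is still open at the end
        out.append(stack.pop()[1])
    out.append("</s>")
    return " ".join(out)


sample = [
    "condor\tt\tairline\tairline",
    "eight\tn\t0\tflightnumber",
    "nine\tn\t0\tflightnumber",
    "five\tn\t0\tflightnumber",
    "hallo\tt\t0\tsentence",
    "turn\tt\tcom\tturn_heading",
    "left\tt\t0\tdirection",
    "heading\tt\tcom\tturn_heading",
    "three\tn\t0\tdegree_absolute",
    "two\tn\t0\tdegree_absolute",
    "zero\tn\t0\tdegree_absolute",
]
print(to_markup(csv.reader(sample, delimiter="\t")))
```

For the sample rows this prints the expected output from the question. It handles one utterance per call, so a file with blank separator lines would have to be split into utterances first, and the rules may need adjusting for input that the single example does not cover.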

jonrsharpe
  • This seems to work really well in the shell, but it fails when I read the input from a file. Do I have to do some kind of datatype conversion to make this work? – Harish Prasanna Jun 22 '14 at 12:06
  • What do you mean *"fails to work"*? What happens instead - errors, unexpected results? Are you opening the file first (e.g. `with open(filename) as f: reader = csv.DictReader(f, ...)`)? You could try the [`Sniffer`](https://docs.python.org/2/library/csv.html#csv.Sniffer) to determine the appropriate dialect for the file. – jonrsharpe Jun 22 '14 at 12:08
  • Wow. It just ripped apart all the characters and delimited them with commas. Oops, sorry. I meant to say that I did not get the desired output; instead, every character was mapped individually, like so => [{'[': "'"}, {'[': 'c'}, {'[': 'o'}, {'[': 'n'}, {'[': 'd'}, {'[': 'o'}, {'[': 'r'},... – Harish Prasanna Jun 22 '14 at 12:10
  • Yes, this gave me some perspective. I will post the final program when I am done. @jonrsharpe – Harish Prasanna Jun 22 '14 at 12:19