
This question has been asked here in one form or another, but not quite what I'm looking for. This is the situation: I already have one file, file_a, and I'm creating another file, file_b. file_a is always bigger than file_b. There will be a number of duplicate lines in file_b (and hence in file_a as well), but each file will also have some unique lines. What I want to do is copy/merge only the unique lines from file_a into file_b and then sort the lines, so that file_b becomes the most up-to-date file with all the unique entries. Neither of the original files will be more than 10 MB in size. What's the most efficient (and fastest) way to do that?
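In outline, what I'm after is something like this (only a sketch: both sorted() and list.sort(key=...) need v2.4+, so on v2.3 it would have to be a cmp function or decorate-sort-undecorate; the tab-separated layout is taken from my sample data below):

```python
import time

def merge_unique_sorted(lines_a, lines_b):
    # Keep one copy of every line from either file, then sort by the
    # next-to-last tab-separated field (a '%d/%m/%y %H:%M:%S' stamp).
    def stamp(line):
        return time.mktime(time.strptime(line.split('\t')[-2],
                                         '%d/%m/%y %H:%M:%S'))
    merged = list(set(lines_a) | set(lines_b))
    merged.sort(key=stamp)
    return merged
```

On v2.3 the last two lines would become a sort over (stamp, line) tuples instead of a key= argument.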

I was thinking of something like this, which does the merging all right.

#!/usr/bin/env python

import os, time, sys

# Convert Date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file,c_file]

m_file = "merged.file"

F = open(m_file,'w')  # open once; opening with 'w' inside the loop would truncate the first file's output
for x in range(len(n_file)):
    P = open(n_file[x],"r")
    output = P.readlines()
    P.close()

    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
F.close()

But it:

  • doesn't keep only the unique lines
  • isn't sorted (by the next-to-last field)
  • introduces a third file, m_file

Just a side note (long story short): I can't use sorted() here, as I'm on v2.3, unfortunately. The input files look like this:

On 23/03/11 00:40:03
JobID   Group.User          Ctime   Wtime   Status  QDate               CDate
===================================================================================
430792  group_atlas.pltatl16    0   32  4   02/03/11 21:52:38   02/03/11 22:02:15
430793  group_atlas.atlas084    30  472 4   02/03/11 21:57:43   02/03/11 22:09:35
430794  group_atlas.atlas084    12  181 4   02/03/11 22:02:37   02/03/11 22:05:42
430796  group_atlas.atlas084    8   185 4   02/03/11 22:02:38   02/03/11 22:05:46

I tried to use cmp() to sort by the next-to-last field, but I think it fails because of the first three (header) lines of the input files.
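To illustrate: the header lines carry no timestamp in the next-to-last field, so any timestamp-based comparison blows up on them unless they are skipped first (a sketch; the fixed count of 3 is an assumption based on my sample above):

```python
def data_lines(fileobj, header_lines=3):
    # Throw away the fixed-size header, then yield only the data rows.
    for _ in range(header_lines):
        fileobj.readline()
    for line in fileobj:
        yield line
```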

Can anyone please help? Cheers!!!


Update 1:

For future reference, as suggested by Jakob, here is the complete script. It worked just fine.

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    #
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()

It took about 2m:47s to finish for 145244 lines.

[testac1@serv07 ~]$ ./uniq-merge.py 
17:19:21
No. of lines:  145244
17:22:08

thanks!!


Update 2:

Hi eyquem, this is the error message I get when I run your scripts.

From the first script:

[testac1@serv07 ~]$ ./uniq-merge_2.py 
  File "./uniq-merge_2.py", line 44
    fm.writelines( '\n'.join(v)+'\n' for k,v in output )
                                       ^
SyntaxError: invalid syntax

From the second script:

[testac1@serv07 ~]$ ./uniq-merge_3.py 
  File "./uniq-merge_3.py", line 24
    output = sett(line.rstrip() for line in fa)
                                  ^
SyntaxError: invalid syntax

Cheers!!


Update 3:

The previous one wasn't sorting the list at all. Thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob's version: I converted the set returned by app(path1, path2) into a list, myList, and then applied sort( lambda ... ) to myList to sort the merged file by the next-to-last field. This is the final script.

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()

    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert set into to list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()

There is no speed boost at all; it took 2m:50s to process the same 145244 lines. If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!


Update 4:

Just for future reference, this is a modified version of eyquem's script, which works much better and faster than the previous ones.

#!/usr/bin/env python

import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file ):

    # RegEx for the date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines,pat = pat):
        # match only the next-to-last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1):  break
            else:  head.append(line1) # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)

    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line),line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file,'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head[0]+head[1])))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


c_f = "03_a"
o_f = "03_b"

sorting_merge(o_f, c_f, 'outfile.txt')

This version is much faster - 6.99 sec. for 145244 lines, compared to the 2m:47s of the previous one using lambda a, b: cmp(). Thanks to eyquem for all his support. Cheers!!
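For the record, the speed-up comes from computing each line's sort key once, instead of once per comparison as a cmp() function does. On v2.3, where sort() has no key= argument, that is the decorate-sort-undecorate pattern the script above uses; a minimal sketch (assuming the same tab-separated layout):

```python
import time

def sort_by_stamp(lines):
    # Decorate every line with its precomputed epoch key, sort the
    # (key, line) tuples, then strip the keys off again.
    fmt = '%d/%m/%y %H:%M:%S'
    decorated = [(time.mktime(time.strptime(line.split('\t')[-2], fmt)), line)
                 for line in lines]
    decorated.sort()
    return [line for _, line in decorated]
```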

MacUsers
    Python 2.3 provides `set` type in [`sets` module](http://docs.python.org/release/2.3.5/lib/module-sets.html). – Michal Chruszcz Mar 23 '11 at 15:06
  • humm........ I thought set() was introduced in v2.5; need to figure out how to use that though. cheers!! – MacUsers Mar 23 '11 at 15:12
  • What are you intending to do with the first 3 lines? Do you keep them or throw them away? – wds Mar 23 '11 at 15:32
  • @wds: Oh yes, I should have mentioned - those three lines should be added to the resultant file. the last two lines as-it-is but with the current time stamp as the 1st line. Cheers!! – MacUsers Mar 23 '11 at 16:07
  • @MacUsers _"with the current time stamp as the 1st line"_ What do you mean ? The time at the moment at which the files are treated ? Or the time unchanged ? Or with the time transformed by toEpoch() ? – eyquem Mar 23 '11 at 17:01
  • @MacUsers _"three lines should be added to the resultant file"_ That is to say at the END of the resultant file ? Is it sure that these 3 first lines are always 3 ? They are rather annoying for the code I thought. In fact, the most important is: is it sure that after the 3 (or 4 or 12 ..;) first lines not of the desired format, all the following lines are homogeneously of the same format ? – eyquem Mar 23 '11 at 17:02
  • @MacUsers There is a bug with your code. If one of the file ends without newline , that is to say without '\n' or '\r\n' at the end, the last line of this file is anywhere in one of the set, since there is no order in a set. Consequently, a line without newline at its end is written somewhere in the merging file, not necessarily at the end. I tried your code and one of my file was like that -> two lines were merged in only one – eyquem Mar 23 '11 at 19:56
  • @MacUsers That is not an update. You only changed `sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]` in two instructions `myList = list(app(o_file, c_file))` and `sp_lines = [ line.split('\t') for line in myList ]` : the interest is zero. There is still the same bug with this code of Update 3: two lines not separated by '\n' in the resulting file. And it is 3 times longer to execute than my code. – eyquem Mar 24 '11 at 16:50

4 Answers


Maybe something along these lines?

from sets import Set as set

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()

    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))

    return file1.union(file2)

EDIT: Forgot about with :$

Jakob Bowyer
  • @Jakob: is `with` supported in v2.3 or it's a v2.5 onward thing? Cheers!! – MacUsers Mar 23 '11 at 16:11
  • OH Woops :$ Ill change it now. – Jakob Bowyer Mar 23 '11 at 16:12
  • @Jakob: thanks! funny though: in the document, it says `with` is introduced to make script cleaner. I see it is now, without `with`. lol! Testing now. cheers!! – MacUsers Mar 23 '11 at 16:16
  • I didn't write the whole thing just the parts to provide the union of both files. – Jakob Bowyer Mar 23 '11 at 16:18
  • @Jakob: Many thanks! it worked just fine. It took about 2m.47s. for 145244 lines. Is there any scope for a performance boost. Updated my original post with your suggestion. cheers!! – MacUsers Mar 23 '11 at 17:46
  • @MacUsers @Jakob Fine code. But the lines aren't sorted, are they ? – eyquem Mar 23 '11 at 18:29
  • @eyquem Nope. But I did write that this code only did a union for the two files. @MacUsers Do you not have access to pypy or python 3 or python 2.6? – Jakob Bowyer Mar 23 '11 at 19:16
  • @Jakob: I do have access to v2.6 but not on this particular machine. Unfortunately, I have to go with this mess for couple more more months. Cheers!! – MacUsers Mar 23 '11 at 20:11
  • @eyquem: You are right, lines are not sorted. `sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )` should have done that job, isn't it? Thanks! – MacUsers Mar 23 '11 at 21:08
  • @MacUsers I saw your code in which you sort. I'am studying my own code to modify it. I had written it believing that you couldn't use set() – eyquem Mar 23 '11 at 21:39
  • @eyquem: It appears that I was wrong assuming set() wasn't in v2.3 but Jakob's code, using set(), works just fine in v2.3. So, the I can use set(). I've modified my original post to remove the confusion. Cheers!! – MacUsers Mar 23 '11 at 21:50

EDIT 2

My previous codes have problems with output = sett(line.rstrip() for line in fa) and output.sort(key=kl) : neither generator expressions nor the key= argument to sort() exist in Python 2.3.

Moreover, they have some complications.

So I examined Jakob Bowyer's choice of reading the files directly into a set() in his code.

Congratulations Jakob! (and Michal Chruszcz, by the way): set() is unbeatable, it's faster than reading one line at a time.

Then I abandoned my idea of reading the files line by line.

.

But I kept my idea of avoiding a sort with a cmp() function because, as described in the doc:

s.sort([cmpfunc=None])

The sort() method takes an optional argument specifying a comparison function of two arguments (list items) (...) Note that this slows the sorting process down considerably

http://docs.python.org/release/2.3/lib/typesseq-mutable.html

Then, I managed to obtain a list of tuples (t,line) in which the t is

time.mktime(time.strptime(<1st date-and-hour in line>, '%d/%m/%y %H:%M:%S'))

by the instruction

output = [ (kl(line),line.rstrip()) for line in output]

.

I tested two codes. The first one, in which the 1st date-and-hour in the line is extracted with a regex:

def kl(line,pat = pat):
    return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

output = [ (kl(line),line.rstrip()) for line in output if line.rstrip()]

output.sort()

And a second code in which kl() is:

def kl(line,pat = pat):
    return time.mktime(time.strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S'))

.

The results are

Times of execution:

0.03598 seconds for the first code with regex

0.03580 seconds for the second code with split('\t')

that is to say, practically the same.

This algorithm is faster than code using a cmp() function:

code in which the set of lines output isn't transformed into a list of tuples by

output = [ (kl(line),line.rstrip()) for line in output]

but only into a list of the lines (already duplicate-free) and sorted with a mycmp() function (see the doc):

def mycmp(a,b):
    return cmp(time.mktime(time.strptime(a.split('\t')[-2],'%d/%m/%y %H:%M:%S')),
               time.mktime(time.strptime(b.split('\t')[-2],'%d/%m/%y %H:%M:%S')))

output = [ line.rstrip() for line in output] # not list(output) , to avoid the problem of newline of the last line of each file
output.sort(mycmp)

for line in output:
    fm.write(line+'\n')

has an execution time of

0.11574 seconds

.

The code:

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)') 

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    fa = open(o_file)
    fa.readline() # first line is skipped
    while True:
        line1 = fa.readline()
        mat1  = pat.search(line1)
        if not mat1: head.append(line1) # line1 is here a line of the header
        else: break # the loop ends on the first line1 not being a line of the heading
    output = sett( fa )
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        if pat.search(line1):  break
    output = output.union(sett( fb ))
    fb.close()

    output = [ (kl(line),line.rstrip()) for line in output]
    output.sort()

    fm = open(m_file,'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


te = time.clock()
sorting_merge('ytre.txt','tataye.txt','merged.file.txt')
print time.clock()-te

This time, I hope it will run correctly; the only thing left to do is wait for the execution times on real files much bigger than the ones I tested the codes on.

.

EDIT 3

pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '(?=[ \t]+'
                 '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '|'
                 '[ \t]+aborted/deleted)')

.

EDIT 4

#!/usr/bin/env python

import os, time, sys, re
from sets import Set

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+'
                     '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '|'
                     '[ \t]+aborted/deleted)')

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    head = []
    output = Set()

    fa = open(o_file)
    fa.readline() # first line is skipped
    for line1 in fa:
        if pat.search(line1):  break # first line after the heading
        else:  head.append(line1) # line of the header
    for line in fa:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    for line1 in fb:
        if pat.search(line1):  break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    if '' in output:  output.remove('')
    output = [ (kl(line),line) for line in output]
    output.sort()

    fm = open(m_file,'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line+'\n')
    fm.close()

te = time.clock()
sorting_merge('A.txt','B.txt','C.txt')
print time.clock()-te
eyquem
  • @eyquem: Thanks for the new one. I'm gonna test with my files and see the result. Cheers!! – MacUsers Mar 24 '11 at 16:34
  • @eyquem: looks like it worked but introduced a new problem with output file. Whatever the files I use as input files, in the output file there is only two lines - first row is just fine and the rest of the files in a single line as 2nd row. – MacUsers Mar 24 '11 at 16:47
  • @MacUsers Replace `fm.write(line)` by `fm.write(line+'\n')` . Isn't it evident ? I did the modification in one of the two codes I used (one code with regex, the other code with `split('\t')` ) but I forgot to correct the other code, which I posted. – eyquem Mar 24 '11 at 16:54
  • Also, the script fails to run if any of the file has any blank line(s) in it. it fails on the line: `return time.mktime(time.strptime((pat.search(line).group()), ....))` in *def kl(line,pat = pat)*, saying: `AttributeError: 'NoneType' object has no attribute 'group'`. – MacUsers Mar 24 '11 at 16:58
  • @MacUsers Are there indeed blank lines ? I thought that the exemple of file you posted in your question was representative – eyquem Mar 24 '11 at 17:01
  • @eyquem: I don't think there are any blank lines in my original files. I made two small files with only 20 lines each out of the originals for testing. The script ran okay with the test files but failed to run using the actual ones. I intentionally added blank lines in one of the test files and ran the script - I get exactly the same error as before, hence the comment. – MacUsers Mar 24 '11 at 17:07
  • @MacUsers You're right, you're right, it's blank lines, indeed. Replace `output = [ (kl(line),line.rstrip()) for line in output]` by the following instruction `output = [ (kl(line),line.rstrip()) for line in output if line.rstrip()]` – eyquem Mar 24 '11 at 17:12
  • @eyquem: It still doesn't work for my original files. the test files and the original ones should be identical as I just deleted the rest of the lines keeping only first 20. I'm trying to figure out what what's your script is not working with the original ones. It worked just fine on the test ones. – MacUsers Mar 24 '11 at 17:31
  • @MacUsers _"It doesn't work"_ has never been an amount of information sufficient to begin to have an idea of what does'nt work . Give the error message, by Jove ! – eyquem Mar 24 '11 at 18:35
  • @eyquem: I already told you the error message: `AttributeError: 'NoneType' object has no attribute 'group'` and it's coming from the line: `return time.mktime(time.strptime((pat.search(line).group()), ....))`. I'll have a look now if it's really matching the next to last field. Cheers!! – MacUsers Mar 24 '11 at 19:20
  • @eyquem: Found the "reason" for not working - it's very strange though. In the original files, there are some lines like this: `441417 group_camont.camon023 929 345621 3 08/03/11 19:41:54 aborted/deleted`. If there are more than three such a line in the merged output, the script fails with the above error. As soon as the number of such lines down to three, the the script is happy. I think, it shouldn't really matter if the script is picking up on the 2nd last field only. Very strange though. – MacUsers Mar 24 '11 at 19:38
  • @MacUsers It's easy to correct. See my EDIT 3 in the answer. I suppose that the only date-hour in these particular lines is of the same type as the others, in first "position". But I realize, suddenly: this error happens only with the code with regex, doesn't ? Because the spliting according `'\t'` does not depend of the values in the fields . That 's why I was speaking of verification with the part ``(?:....)`` – eyquem Mar 24 '11 at 19:49
  • @eyquem: I really hate to say "it still doesn't work" but I'm still out of luck. I'm not that strong with python regex, but did you give a try yourself with the modified regex? One thing I'd still ask: Why does the last field (i.e. where that aborted/deleted appears time to time) matter at all? The actual files are auto generated by the batch-system; it's aborted/deleted today, maybe something else tomorrow (depends on the JobStatus). I don't think this is the way it should be done. The 2nd last field [-2] is always the same and the regex should only match that field. My 2 cents though!! – MacUsers Mar 24 '11 at 20:29
  • @MacUsers No I didn't try. I don't know if the lines with mention "aborted/deleted" matter. I wrote that the new RE corrects because there is no more lines for which the regex doesn't match. With this new RE, these particular lines match too, and they are kept in the new merging file. If you don't want to keep them, then the code must be adapted. It's you who knows I proposed a solution to stop th error, that's all. By the way, the story about the fact that 3 such lines don't provole an error is a strange one. If I had the files , I would examine that precisely – eyquem Mar 24 '11 at 20:57
  • @MacUsers So, what do you want to do of the lines with "aborted/deleted" . This is the point now. – eyquem Mar 24 '11 at 20:58
  • @eyquem: I probably didn't able to make you understand properly. The lines with "aborted/deleted" will stay as well. The date/time field(s) only matters just because the merged list should be sorted based on the first date/time (i.e. next to the last) field. Other wise those fields got nothing to do with anything. Let me see, if I can figure out any thing from here. – MacUsers Mar 24 '11 at 21:19
  • @MacUsers Excuse me, I had done an error in the RE: forgot ``[ \t]+`` before ``aborted/deleted`` . I have just corrected in the EDIT – eyquem Mar 24 '11 at 21:22
  • @MacUsers With my RE (corrected) a line as ``441417 group_camont.camon023 929 345621 3 08/03/11 19:41:54 aborted/deleted`` will be kept and be sorted according to ``08/03/11 19:41:54`` as the others. The only difference is that its last field is not a second date-hour but ``aborted/deleted`` – eyquem Mar 24 '11 at 21:25
  • @MacUsers Note that the code with the spliting according to ``\t`` should work without the slightest hitch because the date-hour before ``aborted/deleted`` is the -2 field. That's why I prefer to use regex: that allows unexpected cases to be detected because they produce an error. Without a regex, you wouldn' have noticed that there are such lines with ``aborted/deleted`` – eyquem Mar 24 '11 at 21:31
  • @MacUsers I corrected ``time.mktime(time.strptime(( 2nd date-and-hour in line ,'%d/%m/%y %H:%M:%S'))`` to ``time.mktime(time.strptime(( 1st date-and-hour in line ,'%d/%m/%y %H:%M:%S'))`` and also ``2nd date-and-hour in line is computed thanks to a regex`` to ``1st date-and-hour in line is computed thanks to a regex`` . I hope this error of designation didn't lead you to misunderstand my code. By the way, what are the news concerning execution on your original files ? Let me know, please. – eyquem Mar 25 '11 at 10:04
  • @eyquem: Where is your corrected time.mktime()? I've fixed most of the thing as per my requirements but I just discovered that there is another fundamental flaw in the script - the very first line (after the `====` line) is being removed from list during the sett() operation. Post you back in a while. Cheers!! – MacUsers Mar 25 '11 at 12:54
  • @eyquem: I've to agree that I don't understand the logic of your script. After more careful inspection, I found the lines before the first occurrence of duplicate line, are not being added to the list/sett(). I can send you two sample files for you try with your script and you will understand lot better what I mean. If you interested, let me know how can I do that. Cheers!! – MacUsers Mar 25 '11 at 14:33
  • @MacUsers What a mess I had done. I think I got the wrong file when I posted the update 3. At one moment, I had 9 codes in a repertory and I deleted 7 of them. I bet I made a mistake. I'm confused. See the new code in update 4 and say me if it works well now. You can send me what you want at eyguem@gmail.com . TAKE ATTENTION to the fact it is ey **g** uem , with a **g** because _eyquem@gmail.com_ is _NOT MINE_, it was already to someone else when I needed an email address – eyquem Mar 25 '11 at 18:14
  • @eyquem: I think the problem is coming from these code block: `while True: line1 = rmHead(fb).readline() if pat.search(line1): break` it breaks upon reading the first line which appears to match your requested pattern pat() i.e. it returns a non NULL object therefore if pat.search(line1) is True. So, it's doing exactly what it's told to do. I'll try your update 4 in a short while. Cheers!! – MacUsers Mar 25 '11 at 20:15
  • @MacUsers Update 3's code was incomplete. Try the update 4 and we'll see after. IMO there are no more problems. – eyquem Mar 25 '11 at 20:31
  • @eyquem: Yes, it worked this time for the test files. But it actually failed, as the `aborted/deleted` is matched statically and this field varies based on the actual Jobstatus, for the real file. But, the good news is: taking bits and pieces from you 2nd and 4th edit, now I have the script, which works exactly the way I wanted it and it's much faster than using camp(), which was used previously. Thanks very much for you help. I've updated my original script with the new version. I've sent the details in your gmail as well. Cheers!! – MacUsers Mar 26 '11 at 15:19
  • @MacUsers Ah, at last. I suppose that you took the RE without the `(?:....)` part, that is to say only `'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'` to avoid the fails on the minority of lines that fail. If there are only a delimited set of jobstatus strings that can happen in the field -1 , it is possible to insert these diverse possible strings in the RE; but maybe it isn't of greatest interest, if you are sure that the field -2 **always** contains a date-and-hour – eyquem Mar 26 '11 at 15:39
  • @MacUsers Did you compare the execution times of the code with a regex and the code with splitting by '\t' ? I would be also very interested to know the execution time of this code without `cmp()` , compared to the 2mn50 of the previous one . I was strongly convinced that not using cmp() would increase the speed. I was right – eyquem Mar 26 '11 at 15:46
  • @eyquem: You were absolutely right - execution time decreases dramatically - 6.99 sec. compare to 2m:47s using cmp(), for the same 145244 lines. Looks like I need to pay more attention to learn python RegEx. cheers!! – MacUsers Mar 26 '11 at 16:03
  • @MacUsers Time decreasing of 96 % !!!!! I didn't think it could be so much. Was expecting around - 60 %. It isn't the use of regex that allows that, it's the fact that the sorting is done without `cmp()` function. It was in the doc: **this slows the sorting process down considerably;** With `split('\t')` , decrease of execution time will be also very important provided `cmp()` isn't employed. Though it would be interesting to know if code with `split('\t')` would do - 99 % or -95% or -80 %. .- I wrote you in email, I have to much things to add, and it begins to be hard to read and write here. – eyquem Mar 26 '11 at 16:41

I wrote this new code, taking advantage of a set. It is faster than my previous code and, it seems, than your code.

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    # Convert Date/time to epoch
    def toEpoch(dt):
        dt_ptrn = '%d/%m/%y %H:%M:%S'
        return int(time.mktime(time.strptime(dt, dt_ptrn)))

    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)'
                     '[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d') 

    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(('',line1.rstrip()))
        else:
            break
    output = sett((toEpoch(pat.search(line).group(1)) , line.rstrip())
                 for line in fa)
    output.add((toEpoch(mat1.group(1)) , line1.rstrip()))
    fa.close()


    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1:  break
    for line in fb:
        output.add((toEpoch(pat.search(line).group(1)) , line.rstrip()))
    output.add((toEpoch(mat1.group(1)) , line1.rstrip()))
    fb.close()

    output = list(output)
    output.sort()
    output[0:0] = head
    output[0:0] = [('',time.strftime('On %d/%m/%y %H:%M:%S'))]

    fm = open(m_file,'w')
    fm.writelines( line+'\n' for t,line in output)
    fm.close()



te = time.clock()
sorting_merge('ytr.txt','tatay.txt','merged.file.txt')
print time.clock()-te

Note that this code puts a heading in the merged file.

.

EDIT

Aaaaaah... I got it... :-))

Execution time divided by 3!

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)') 

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    fa = open(o_file)
    head = []
    fa.readline()
    while True:
        line1 = fa.readline()
        mat1 = pat.search(line1)
        if not mat1:
            head.append(line1.rstrip())
        else:
            break
    output = sett(line.rstrip() for line in fa)
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        mat1 = pat.search(line1)
        if mat1:  break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    output = list(output)
    output.sort(key=kl)
    output[0:0] = [time.strftime('On %d/%m/%y %H:%M:%S')] + head

    fm = open(m_file,'w')
    fm.writelines( line+'\n' for line in output)
    fm.close()

te = time.clock()
sorting_merge('ytre.txt','tataye.txt','merged.file.txt')
print time.clock()-te
eyquem
  • @eyquem: On the first run, I get SyntaxError on this line `fm.writelines( '\n'.join(v)+'\n' for k,v in output)` [line #44], where for is used. Trying to understand the error now. cheers!! – MacUsers Mar 23 '11 at 22:21
  • @MacUsers Throw away this code: it was one with a complicated algorithm using a defaultdict because I believed set() couldn't be used. I will post another one just in few minutes – eyquem Mar 23 '11 at 22:30
  • Just a thought: I used to use this simple perl RegEx `perl -ne '$H{$_}++ or print'` (like cat | perl -ne '$H{$_}++ or print') to print only the unique lines from a file. I don't think there is anything similar available in Python. My python knowledge is very limited. Cheers!! – MacUsers Mar 23 '11 at 22:39
  • @MacUsers Perl regexes are very powerful. I don't know them, and since they are more powerful, they aren't easy for me to translate into pythonic REs -.- At the moment I can't manage to use Set from the sets module under Python 2.7; I'd like to understand why. I want to give you code based on Set from the sets module so that you can run it as is. – eyquem Mar 23 '11 at 22:48
  • @eyquem: Sure, no problem. Looking forward to your new set() based code. Many thanks for helping me out. Cheers!! – MacUsers Mar 23 '11 at 23:06
  • @MacUsers I posted a second code with set. By the way: why the hell did you write int(time.mktime...etc) ? No need of int ! Cheers !!! – eyquem Mar 23 '11 at 23:21
  • @eyquem: I'm still getting SyntaxError on the line `fm.writelines( line+'\n' for line in output )` - don't understand what's going on. – MacUsers Mar 23 '11 at 23:54
  • @MacUsers Give the error message , please. Do writelines exist in Python 2.3 ? – eyquem Mar 24 '11 at 00:02
  • @MacUsers writelines do exist in Python 2.3 : (http://docs.python.org/release/2.3/lib/bltin-file-objects.html#l2h-243) – eyquem Mar 24 '11 at 00:05
  • @eyquem: Sorry for the delay. I've updated my original post (as Update2) with the error message. writelines() exists in 2.3. I tried replacing it with write() as well but get the same error. – MacUsers Mar 24 '11 at 06:49
  • @MacUsers Note that when using a generator between parentheses, its own parens can be omitted; for example `li.extend((x for x in some_list[12:565:4] if x not in range(5,1005,13))` can be written `li.extend(x for x in some_list[12:565:4] if x not in range(5,1005,13)` . So, in `fm.writelines(line+'\n' for line in output)` , we have in fact a generator `(line+'\n' for line in output)` . – eyquem Mar 24 '11 at 07:13
  • @MacUsers I like the instruction `fm.writelines((line+'\n' for line in output))` (I emphasize the presence of the generator) because it's different from `fm.writelines([line+'\n' for line in output])`, in which there is not a generator but a list. `(line+'\n' for line in output)` being a generator, it seems to me that the writing to the file on disk is then executed one line at a time; that is to say, it doesn't need the preparatory creation of a list, which would mean additional execution time – eyquem Mar 24 '11 at 07:18
  • @eyquem: Is it really possible to write this `li.extend(x for x in some_list[12:565:4] if x not in range(5,1005,13)` without closing brace? I know `(line+'\n' for line in output)` should work (I've used that before) but not working here. So, any idea what might be the problem here? Cheers!! – MacUsers Mar 24 '11 at 07:22
  • @MacUsers The problem is that it seems that writelines() of version 2.3 doesn't in fact accept a generator. I had understood the contrary after reading the documentation of Python 2.3: _"writelines( sequence) The sequence can be any iterable object producing strings"_ It is written: _"any ITERABLE"_ so I thought a generator was right. Seems I was wrong to think so. – eyquem Mar 24 '11 at 07:23
  • @MacUsers No; Sure it is ``li.extend(x for x in some_list[12:565:4] if x not in range(5,1005,13) )`` – eyquem Mar 24 '11 at 07:23
  • @MacUsers You should test in a simpler and apart code if **writelines()** accept a generator or not in the 2.3 version, to be sure that there is no weird effect due to other parts of my code – eyquem Mar 24 '11 at 07:26
  • @eyquem : I've used this: `sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]` in my script without any problem (not with writelins() though). I'm trying to figure out what might be the problem this time. Cheers!! – MacUsers Mar 24 '11 at 07:37
  • @MacUsers There is the same problem with sett(). So try to replace the instruction ``output = sett(line.rstrip() for line in fa)`` by ``for line in fa:`` and ``output = sett(line.rstrip())`` and do the same with ``fm.writelines(line+'\n' for line in output)`` , replace this one line by a for-loop with ``fm.write(line+'\n')`` – eyquem Mar 24 '11 at 07:43
  • @MacUsers I wrote an error; You must replace ``output = sett(line.rstrip() for line in fa)`` by ``output = sett()`` and ``for line in fa:`` and ``output.add(line.rstrip())`` !! – eyquem Mar 24 '11 at 08:02
  • @MacUsers pat matches the first date-and-time; that is to say, in ``430783 group_atlas.atlas074 32 472 4 02/03/82 21:57:43 02/02/11 09:09:09`` , it matches **02/03/82 21:57:43** . It is caught by the first part ``'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'`` – eyquem Mar 24 '11 at 08:05
  • @MacUsers the second part, beginning with ``(?=`` , is there to confirm that the first date-and-time matched isn't the only one in the line, as it is in the first line of each file. This part doesn't capture the portion it matches; it only verifies that the second date-and-time is indeed where it must be, after the first one – eyquem Mar 24 '11 at 08:08
  • @eyquem: I think there is another problem - not exactly with your code but with v2.3 python - sort() doesn't take any keyword arguments in v2.3, like the `output.sort(key=kl)` you have used. Trying out the rest of the thing as you suggested. Do you think you could update your original [2nd] script with these changes as well? cheers!! – MacUsers Mar 24 '11 at 10:50
  • @MacUsers Then do ``def mycmp(a,b,pat = pat):`` and ``return cmp(time.mktime(time.strptime((pat.search(a).group()),'%d/%m/%y %H:%M:%S')), time.mktime(time.strptime((pat.search(b).group()),'%d/%m/%y %H:%M:%S')))`` as in the doc, with ``output.sort(mycmp)`` . But alas, the time is no longer divided by 3. They warned indeed: _" Note that this slows the sorting process down considerably;"_ :( – eyquem Mar 24 '11 at 11:08
  • @MacUsers I tried the following, as explained in the doc: ``tmplist = [(kl(x), x) for x in output]`` and ``tmplist.sort()`` and ``output = [x for (key, x) in tmplist]``. The execution time isn't divided by 3, but by 2, it's better than no division – eyquem Mar 24 '11 at 11:15
  • @eyquem: I can't thank you enough for the time you have spent helping me. Is it possible for you to update your original script with all the changes you came up with? cheers!! – MacUsers Mar 24 '11 at 11:43
  • @MacUsers Yes, I will post an update. But I'm studying the problem globally. I mean, the codes I wrote until now are codes in which files are read with ``for line in fa:`` , that is to say one line at a time, because it is lighter on memory. But that leads to an annoying problem with the last line, depending on whether it has a trailing newline or not. So I want to change the manner of reading, and I wonder what the difference will be with the code of Jakob. I'll come back in 20 minutes with news, I hope. – eyquem Mar 24 '11 at 12:23
  • @eyquem @Jakob: Based on Jakob's idea, I've come up with a version, which works just fine for me. Now it's sorting the file in proper order as well. Uploaded the new script as `Update 3`, for future reference. Cheers!! – MacUsers Mar 24 '11 at 16:23
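Re the Perl one-liner `perl -ne '$H{$_}++ or print'` mentioned in the comments above: the same first-occurrence filter does exist in Python as a plain seen-set idiom. A minimal sketch (the function name is mine; order-preserving, unlike collecting the whole file into a `set`):

```python
def first_occurrences(lines):
    """Yield each line only the first time it appears, preserving order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

# list(first_occurrences(['a', 'b', 'a', 'c', 'b']))  ->  ['a', 'b', 'c']
```

It works on any iterable of lines, including an open file object, so it can replace the Perl filter in a pipeline.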

My last code, I hope.

Because I found a killer version.

First, I created two files, "xxA.txt" and "yyB.txt", each with 30000 lines like these:

430559  group_atlas.atlas084    12  181 4       04/03/10 01:38:02   02/03/11 22:05:42
430502  group_atlas.atlas084    12  181 4       23/01/10 21:45:05   02/03/11 22:05:42
430544  group_atlas.atlas084    12  181 4       17/06/11 12:58:10   02/03/11 22:05:42
430566  group_atlas.atlas084    12  181 4       25/03/10 23:55:22   02/03/11 22:05:42

with the following code:

"create AB.py"

from random import choice

n = tuple( str(x) for x in xrange(500,600))
days = ('01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16',
        '17','18','19','20','21','22','23','24','25','26','27','28')
# not '29','30,'31' to avoid problems with strptime() on last days of february
months = days[0:12]
hours = days[0:23]
ms = ['00','01','02','03','04','05','06','07','08','09'] + [str(x) for x in xrange(10,60)]

repeat = 30000

with open('xxA.txt','w') as f:
    # 430794  group_atlas.atlas084    12  181 4     02/03/11 22:02:37   02/03/11 22:05:42
    ch = ('On 23/03/11 00:40:03\n'
          'JobID   Group.User          Ctime   Wtime   Status  QDate               CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line  = '430%s  group_atlas.atlas084    12  181 4   \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
                (choice(n),
                 choice(days),choice(months),choice(('10','11')),
                 choice(hours),choice(ms),choice(ms))
        f.write(line)


with open('yyB.txt','w') as f:
    # 430794  group_atlas.atlas084    12  181 4     02/03/11 22:02:37   02/03/11 22:05:42
    ch = ('On 25/03/11 13:45:24\n'
          'JobID   Group.User          Ctime   Wtime   Status  QDate               CDate\n'
          '===================================================================================\n')
    f.write(ch)
    for i in xrange(repeat):
        line  = '430%s  group_atlas.atlas084    12  181 4   \t%s/%s/%s %s:%s:%s\t02/03/11 22:05:42\n' %\
                (choice(n),
                 choice(days),choice(months),choice(('10','11')),
                 choice(hours),choice(ms),choice(ms))
        f.write(line)

with open('xxA.txt') as g:
    print 'readlines of xxA.txt :',len(g.readlines())
    g.seek(0,0)
    print 'set of xxA.txt :',len(set(g))

with open('yyB.txt') as g:
    print 'readlines of yyB.txt :',len(g.readlines())
    g.seek(0,0)
    print 'set of yyB.txt :',len(set(g))

Then I ran these 3 programs:

"merging regex.py"

#!/usr/bin/env python

from time import clock,mktime,strptime,strftime
from sets import Set
import re

infunc = []

def sorting_merge(o_file, c_file, m_file ):
    infunc.append(clock()) #infunc[0]
    pat = re.compile('([0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)')
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='':  break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output:  output.remove('')

    infunc.append(clock()) #infunc[4]
    output = [ (mktime(strptime(pat.search(line).group(),'%d/%m/%y %H:%M:%S')),line)
               for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]

    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]



c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedr.txt')
t2 = clock()
print 'merging regex'
print 'total time of execution :',t2-t1
print '              launching :',infunc[1] - t1
print '            preparation :',infunc[1] - infunc[0]
print '    reading of 1st file :',infunc[2] - infunc[1]
print '    reading of 2nd file :',infunc[3] - infunc[2]
print '      output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print '      sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]

"merging split.py"

#!/usr/bin/env python

from time import clock,mktime,strptime,strftime
from sets import Set

infunc = []

def sorting_merge(o_file, c_file, m_file ):
    infunc.append(clock()) #infunc[0]
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='':  break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output:  output.remove('')

    infunc.append(clock()) #infunc[4]
    output = [ (mktime(strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S')),line)
               for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]

    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]



c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergeds.txt')
t2 = clock()
print 'merging split'
print 'total time of execution :',t2-t1
print '              launching :',infunc[1] - t1
print '            preparation :',infunc[1] - infunc[0]
print '    reading of 1st file :',infunc[2] - infunc[1]
print '    reading of 2nd file :',infunc[3] - infunc[2]
print '      output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print '      sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]

"merging killer"

#!/usr/bin/env python

from time import clock,strftime
from sets import Set
import re

infunc = []

def sorting_merge(o_file, c_file, m_file ):
    infunc.append(clock()) #infunc[0]
    patk = re.compile('([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')
    output = Set()

    def rmHead(filename, a_set):
        f_n = open(filename, 'r')
        f_n.readline()
        head = []
        for line in f_n:
            head.append(line) # line of the header
            if line.strip('= \r\n')=='':  break
        for line in f_n:
            a_set.add(line.rstrip())
        f_n.close()
        return head

    infunc.append(clock()) #infunc[1]
    head = rmHead(o_file, output)
    infunc.append(clock()) #infunc[2]
    head = rmHead(c_file, output)
    infunc.append(clock()) #infunc[3]
    if '' in output:  output.remove('')

    infunc.append(clock()) #infunc[4]
    output = [ (patk.search(line).group(3,2,1,4),line)for line in output ]
    infunc.append(clock()) #infunc[5]
    output.sort()
    infunc.append(clock()) #infunc[6]

    fm = open(m_file,'w')
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()
    infunc.append(clock()) #infunc[7]



c_f = "xxA.txt"
o_f = "yyB.txt"

t1 = clock()
sorting_merge(o_f, c_f, 'zz_mergedk.txt')
t2 = clock()
print 'merging killer'
print 'total time of execution :',t2-t1
print '              launching :',infunc[1] - t1
print '            preparation :',infunc[1] - infunc[0]
print '    reading of 1st file :',infunc[2] - infunc[1]
print '    reading of 2nd file :',infunc[3] - infunc[2]
print '      output.remove(\'\') :',infunc[4] - infunc[3]
print 'creation of list output :',infunc[5] - infunc[4]
print '      sorting of output :',infunc[6] - infunc[5]
print 'writing of merging file :',infunc[7] - infunc[6]
print 'closing of the function :',t2-infunc[7]

results

merging regex
total time of execution : 14.2816595405
              launching : 0.00169211450059
            preparation : 0.00168093989599
    reading of 1st file : 0.163582242995
    reading of 2nd file : 0.141301478261
      output.remove('') : 2.37460347614e-05
creation of list output : 13.4460212122
      sorting of output : 0.216363532237
writing of merging file : 0.232923737514
closing of the function : 0.0797514767938

merging split
total time of execution : 13.7824474898
              launching : 4.10666718815e-05
            preparation : 2.70984161395e-05
    reading of 1st file : 0.154349784679
    reading of 2nd file : 0.136050810927
      output.remove('') : 2.06730184981e-05
creation of list output : 12.9691854691
      sorting of output : 0.218704332534
writing of merging file : 0.225259076223
closing of the function : 0.0788362766776

merging killer
total time of execution : 2.14315311024
              launching : 0.00206199391263
            preparation : 0.00205026057781
    reading of 1st file : 0.158711791582
    reading of 2nd file : 0.138976601775
      output.remove('') : 2.37460347614e-05
creation of list output : 0.621466415424
      sorting of output : 0.823161602941
writing of merging file : 0.227701565422
closing of the function : 0.171049393149

In the killer program, sorting the output takes 4 times longer, but the creation of the output list is 21 times faster. Globally, the execution time is reduced by at least 85 %.
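The speedup comes entirely from the sort key: instead of parsing every date with strptime/mktime, the killer version sorts on the regex groups rearranged to (year, month, day, time). Because every field is zero-padded and all years here fall in the same century, lexicographic order on that tuple equals chronological order. A minimal modern-Python sketch of the trick (in 2.3, the same effect is obtained by sorting `(key, line)` tuples as in the code above):

```python
import re

# dd / mm / yy  HH:MM:SS, captured as four groups, as in "merging killer"
patk = re.compile(r'([0123]\d)/([01]\d)/(\d{2}) ([012]\d:[0-6]\d:[0-6]\d)')

def date_key(line):
    # group(3, 2, 1, 4) reorders the groups to (yy, mm, dd, 'HH:MM:SS');
    # plain string comparison on this tuple sorts chronologically,
    # no strptime/mktime call needed.
    return patk.search(line).group(3, 2, 1, 4)

lines = [
    '430544  group_atlas.atlas084  12  181  4  17/06/11 12:58:10  02/03/11 22:05:42',
    '430559  group_atlas.atlas084  12  181  4  04/03/10 01:38:02  02/03/11 22:05:42',
]
lines.sort(key=date_key)   # the 2010 line now comes first
```

The caveat is the century: '99' would sort after '11', so this only works while all dates share the same two-digit-year prefix ordering, which holds for the 10/11 data generated above.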

eyquem