How to find single entries in a txt file?

Question

I have a txt file with 12 columns. Some lines are duplicated and some are not. As an example, I copied the first 4 columns of my data.

0       0       chr12   48548073  
0       0       chr13   80612840
2       0       chrX    4000600 
2       0       chrX    31882528 
3       0       chrX    3468481 
4       0       chrX    31882726
4       0       chr3    75007624

Based on the first column, you can see that every entry is duplicated except '3'. I would like to print only the single entries, in this case '3'.

The expected output is:

3       0       chrX    3468481

Is there a quick way of doing this with awk or perl? I can only think of using a for loop in perl, but given that I have around 1.5M entries it will probably take some time.

Always 12 columns? Is the comparison based on just the first column or on the whole row? – fedorqui, Jul 10 '13 at 12:03
It is always 12 columns, and yes, the comparison should be based on just the 1st column. But I would like to print all the columns once it finds the single entries. – user1007742, Jul 10 '13 at 12:07

score 4 · Accepted Answer · answered Jul 10 '13 at 12:07

Try this awk one-liner:

awk '{a[$1]++;b[$1]=$0}END{for(x in a)if(a[x]==1)print b[x]}' file
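
For anyone who wants to see what the one-liner does step by step, here is a commented sketch of the same counting idea. It is a variant, not Kent's exact code: the one-liner keeps the last line seen for every key and walks the array with for (x in a), whose order is unspecified in awk, whereas this version drops the stored line as soon as a key repeats and prints the survivors in input order. The function name single_entries and the arrays count, line and order are made up for the illustration.

```shell
# single_entries is a hypothetical helper name; it reads the files given
# as arguments, or standard input when there are none.
single_entries() {
  awk '
    {
      count[$1]++                  # occurrences of this first-column value
      if (count[$1] == 1) {
        order[++n] = $1            # remember keys in first-seen order
        line[$1] = $0              # keep the whole line in case it stays unique
      } else {
        delete line[$1]            # a duplicate will never be printed
      }
    }
    END {
      for (i = 1; i <= n; i++)
        if (count[order[i]] == 1)
          print line[order[i]]
    }
  ' "$@"
}

# Demo on the sample rows from the question; only the row whose
# first column is 3 is printed.
single_entries <<'EOF'
0       0       chr12   48548073
0       0       chr13   80612840
2       0       chrX    4000600
2       0       chrX    31882528
3       0       chrX    3468481
4       0       chrX    31882726
4       0       chr3    75007624
EOF
```

Like the one-liner, this makes a single pass over the file and does not need the input to be sorted; the price is one array entry per distinct first-column value.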

answered Jul 10 '13 at 12:07

Kent


@JS웃 no i'm not, my machine is fast. ;) – Kent Jul 10 '13 at 12:09
@user1007742 since you have 1.5M records to analyze, I'd be interested to hear of any comparisons you are able to make between the `awk` and/or `uniq` approaches and the proper `perl` script by @Hunter McMillen or my `perl` one-liner further below. I suspect @Hunter may have the fastest approach. – G. Cito Jul 13 '13 at 04:47

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

Here is another way:

uniq -uw8 inputFile

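-w8 compares only the first 8 characters of each line (here, your first column plus its padding) for uniqueness. Note that -w is a GNU extension.
-u prints only lines that are not repeated. uniq compares adjacent lines only, so the input must already be grouped by the first column (for example, sorted), as it is in the sample.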

Test:

$ cat file
0       0       chr12   48548073  
0       0       chr13   80612840
2       0       chrX    4000600 
2       0       chrX    31882528 
3       0       chrX    3468481 
4       0       chrX    31882726
4       0       chr3    75007624

$ uniq -uw8 file
3       0       chrX    3468481
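
If uniq -w is not available, or the first column is not exactly 8 characters wide, the same adjacent-lines logic can be sketched in awk by comparing the whole first field instead of a character count. This is an illustration under the same assumption uniq makes, namely that lines sharing a first column sit next to each other; the function name uniq_by_first_field is hypothetical.

```shell
# uniq_by_first_field is a hypothetical helper name. It assumes the input
# is already grouped by the first field and reads files or standard input.
uniq_by_first_field() {
  awk '
    NR == 1 { key = $1 "" }        # the first group starts on line 1
    ($1 "") != key {               # first field changed: a group just ended
      if (n == 1) print prev       # print it if it held exactly one line
      key = $1 ""                  # appending "" forces string comparison
      n = 0
    }
    { n++; prev = $0 }             # count this line and remember it
    END { if (n == 1) print prev } # handle the last group
  ' "$@"
}

# Demo: rows grouped by first column, separated by single spaces.
printf '%s\n' '0 0 chr12' '0 0 chr13' '3 0 chrX' '4 0 chrX' '4 0 chr3' |
  uniq_by_first_field
```

Because it only remembers the previous line and a counter, memory use stays constant no matter how many records the file has.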

score 2 · Answer 3 · answered Jul 10 '13 at 13:11

Not a one-liner but this small Perl script accomplishes the same task:

#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

# get filehandle
open( my $fh, '<', 'test.txt' ) or die "Cannot open test.txt: $!";

# all lines from your file
my %line_map; 

while( my $line = <$fh> ) { # read a line

   my $key;
   my @values;

   # split on whitespace
   ($key, @values) = split(/\s+/, $line);

   # delete a line if it already exists in the map
   # (caveat: a key seen an odd number of times, e.g. three, ends up back in the map)
   if( exists $line_map{$key} ) {
       delete $line_map{$key};
   } 
   else { # mark a line to show that it has been seen
      $line_map{$key} = join("\t", @values);
   }
}

# now the map should only contain non-duplicates
for my $k ( keys %line_map ) {
   print "$k\t", $line_map{$k}, "\n"; 
}

answered Jul 10 '13 at 13:11

Hunter McMillen


Nice, is it slow when reading in a large file as a %hash? Could it benefit from `Tie::File`? I concocted a `perl` one-liner and it seems to work. See my second answer. – G. Cito Jul 13 '13 at 04:55
I think this will only work correctly if the non unique lines show up in pairs. I believe replacing `delete $line_map{$key};` with `$line_map{$key} = undef;` plus adding `next unless defined $line_map{$k}` to the beginning of the `for` loop would work though. – Brad Gilbert Aug 21 '13 at 14:24
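
The idea in this comment, keeping a marker for a repeated key instead of forgetting the key entirely, can be sketched as follows. To stay runnable from the shell it is written in awk rather than Perl, so it is a translation of the suggestion and not the script above; the function name and the state values "once" and "dup" are made up for the illustration.

```shell
# single_entries_marked is a hypothetical helper name; it reads files or
# standard input.
single_entries_marked() {
  awk '
    !($1 in state) {               # first time this key is seen
      state[$1] = "once"
      line[$1] = $0
      next
    }
    state[$1] == "once" {          # second time: mark it and drop the line
      state[$1] = "dup"
      delete line[$1]
    }
    # third and later occurrences match neither rule, so the key stays
    # marked as a duplicate instead of being added again
    END { for (k in line) print line[k] }
  ' "$@"
}

# Demo: key a appears three times, key b once.
printf '%s\n' 'a 1' 'a 2' 'a 3' 'b 1' | single_entries_marked
```

With a plain delete and no marker, the third a line would be stored again and printed as if it were unique. The for (k in line) loop does not guarantee any particular output order.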

score 1 · Answer 4 · answered Jul 10 '13 at 20:33

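Can't format properly for a comment. @JS웃 might be relying on GNU uniq ... this seems to work in BSD derived versions:

grep ^`cut -d" " -f1 file.txt | uniq -u` file.txt

Note that this only works when exactly one key is unique: with several, the backticks expand to several words and grep treats the extra ones as file names. The unanchored pattern ^3 would also match a key such as 31.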

There simply must be a shorter perl answer :-)

answered Jul 10 '13 at 20:33

G. Cito


score 0 · Answer 5 · edited May 23 '17 at 11:49

I knew there must be a perl one-liner response. Here it is - not heavily tested so caveat emptor ;-)

perl -anE 'push @AoA,[@F]; $S{$_}++ for @F[0];}{for $i (0..$#AoA) {for $j (grep {$S{$_}==1} keys %S) {say "@{$AoA[$i]}" if @{$AoA[$i]}[0]==$j}}' data.txt

The disadvantage of this approach is that it outputs the data in a slightly modified format (this is easy enough to fix, I think), and it uses two for loops and a "butterfly operator" (!!). It also uses grep(), which introduces an implicit loop (i.e. one that runs even though you don't code it up yourself). Because that grep() sits inside the outer loop, it rescans the keys of %S once per input line, so it may be slow with 1.5 million records. I would like to see it compared to awk and uniq, though.

On the plus side, it uses no modules and should run on Windows and OSX. It works when there are several dozen similar records with a unique first column, and it doesn't require the input to be sorted prior to checking for unique lines. The solution is mostly cribbed from the one-liner examples near the end of Effective Perl Programming by Joseph Hall, Josh McAdams, and brian d foy (a great book; when the smart match ~~ and given/when dust settles I hope a new edition appears).

Here's how (I think) it works:

Since we're using -a, we get the @F array for free, so we use it instead of splitting the line ourselves.
Since we're using -n, we're inside a while () {} loop, so we push a copy of @F onto @AoA as a reference to an anonymous array (the [] acts as an "anonymous array constructor"). That way each line's fields hang around and we can refer to them later.
We use the $seen{$_}++ idiom (with %S instead of %seen), from the book mentioned above and described so well by @Axeman here on SO, to count in the %S hash how many times each first-column value ($F[0], written as the slice @F[0] in the code) appears.
We use a "butterfly" }{ to close the implicit while loop and open a separate block that runs after all input has been read. There, the outer for loop walks the indices $i of @AoA (one anonymous array per line), and the inner loop, for $j (grep {$S{$_}==1} keys %S), puts each key whose count in %S equals 1 into $j in turn.
Finally, inside the inner loop, we print any line whose first element equals $j. We do that with the test (@{$AoA[$i]}[0]==$j).

awk in the hands of @Kent is a bit more pithy. If anyone has suggestions on how to shorten or document my "line noise" (and I never say that about perl!) please add constructive comments!

Thanks for reading.

Just translating the awk solution seems simpler: `perl -lanE '$c{$F[0]}++; $l{$F[0]} = $_; END {say $l{$_} for grep {$c{$_} == 1} keys %c}' file` (the -l chomps each line so that say does not print an extra blank line) – Prakash K, Sep 13 '13 at 20:17
Good one. I was so concerned with not getting bit by the case of `uniq` lines having to be in pairs (requiring them to be `sort`-ed first) that I guess overthought this a bit. Cheers. — G. Cito, Sep 15 '13 at 16:25
