I knew there must be a perl one-liner response. Here it is, not heavily tested, so caveat emptor ;-)
perl -anE 'push @AoA,[@F]; $S{$_}++ for @F[0];}{for $i (0..$#AoA) {for $j (grep {$S{$_}==1} keys %S) {say "@{$AoA[$i]}" if @{$AoA[$i]}[0]==$j}}' data.txt
The disadvantage of this approach is that it outputs the data in a slightly modified format (this is easy enough to fix, I think) and it uses two for loops and a "butterfly operator" (!!). It also uses grep(), which introduces an implicit loop (i.e. one that the code runs even though you don't code up a loop yourself), so it may be slow with 1.5 million records. I would like to see it compared to awk and uniq though.
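If the nested loops do turn out to be slow, one possible alternative is to keep the original line next to its first field and filter with a single hash lookup per row. This is only a sketch under assumptions: unique_first_column() is a name I made up for illustration, the sample data is invented, and I haven't timed it against 1.5 million records. Because it prints the stored line rather than the re-joined fields, it should also leave the format untouched.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

# Return the lines whose first whitespace-separated field occurs exactly once.
# unique_first_column() is a hypothetical helper, not part of the one-liner.
sub unique_first_column {
    my @lines = @_;
    my (%count, @rows);
    for my $line (@lines) {
        my @fields = split ' ', $line;       # same splitting rule that -a uses
        next unless @fields;                 # skip blank lines
        $count{ $fields[0] }++;              # count occurrences of the first field
        push @rows, [ $fields[0], $line ];   # remember the key and the untouched line
    }
    # One hash lookup per row instead of an inner loop over all the keys.
    return map { $_->[1] } grep { $count{ $_->[0] } == 1 } @rows;
}

# Made-up sample data, for illustration only.
my @sample = ("1 a x", "2 b y", "1 c z", "3 d w");
say for unique_first_column(@sample);
```

With the sample above, only the lines starting with 2 and 3 are printed, in input order.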
On the plus side it uses no modules and should run on Windows and OSX. It works when there are several dozen similar records with a unique first column and doesn't require the input to be sorted prior to checking for unique lines. The solution is mostly cribbed from the one-liner examples near the end of Effective Perl Programming by Joseph Hall, Josh McAdams, and brian d foy (a great book; when the smart match ~~ and given/when dust settles I hope a new edition appears).
Here's how (I think) it works:
- since we're using -a we get the @F array for free, so we use it instead of splitting each line ourselves
- since we're using -n we're inside a while() {} loop, so on each line we push a copy of @F onto @AoA as a reference to an anonymous array (the [] acts as an "anonymous array constructor"). That way each line's fields hang around after @F is overwritten by the next line, and we can refer to them later.
- use the $seen{$_}++ idiom (we use $S instead of $seen) from the book mentioned above, described so well by @Axeman here on SO, to look at the first field of each line (@F[0], a one-element slice that holds the same value as $F[0]) and set/increment the entry in our %S hash keyed by that value, so %S ends up counting how many lines share each first-column value.
- use a "butterfly" }{ to break out of the while loop; then, in a separate block that runs once after all input has been read, we use two for loops. The outer loop walks the outer array by index $i (each element is itself an anonymous array, one for each line). The inner loop, for $j (grep {$S{$_}==1} keys %S), uses grep to pick the keys of the %S hash we created previously whose count equals 1, and places each of those keys in $j in turn.
- finally, inside those loops, we print any anonymous array whose first element equals the current $j. We do that with the condition at the end: if @{$AoA[$i]}[0]==$j.
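To document the "line noise", here is roughly what the one-liner does when written out as a script. Treat it as a sketch, not a definitive expansion: filter_rows() is a name I invented, the sample data is made up, and I swapped the == comparison for eq so that non-numeric first columns compare as strings (the same way hash keys do). @AoA and %S are the same variables as in the one-liner.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

# Expanded sketch of the one-liner's logic. filter_rows() is a hypothetical
# name; @AoA and %S mirror the variables used in the one-liner.
sub filter_rows {
    my @lines = @_;
    my (@AoA, %S);

    # What -n and -a give us for free: loop over the lines and split each into @F.
    for my $line (@lines) {
        my @F = split ' ', $line;
        push @AoA, [@F];    # [@F] copies the fields into a new anonymous array
        $S{ $F[0] }++;      # count how often each first field is seen
    }

    # Everything after the "butterfly" }{ runs once, after the loop has finished.
    my @out;
    for my $i (0 .. $#AoA) {
        for my $j (grep { $S{$_} == 1 } keys %S) {
            # eq instead of ==, an assumption so non-numeric keys match correctly
            push @out, "@{ $AoA[$i] }" if $AoA[$i][0] eq $j;
        }
    }
    return @out;
}

# Made-up sample data, for illustration only.
say for filter_rows("1 a   x", "2 b y", "1 c z", "3 d w");
```

The rows come out in input order because the outer loop walks @AoA. Interpolating "@{ $AoA[$i] }" joins the fields with single spaces, which is where the slightly modified output format comes from.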
awk in the hands of @Kent is a bit more pithy. If anyone has suggestions on how to shorten or document my "line noise" (and I never say that about perl!) please add constructive comments!
Thanks for reading.