Remove duplicates using regex

Question

Input:

OUT :abc123: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)
OUT :abc123 : : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)
OUT bcd111: : Succeeded.

I want to list only the hosts that have matching "Warning" lines, each host once.

Output:

abc123 
abc1234
bcd111

I tried the regex below, but it matched all of them:

([\w]+)\s+:\s+:\s+Warning

Is it possible to avoid duplicates using regex?

Probably better to iterate over the lines and populate a hash. — arco444, Oct 13 '14 at 12:20

score 3 · Answer 1 · answered Oct 13 '14 at 12:21

When you hear "unique" in Perl, think "hash":

#!/usr/bin/perl
use warnings;
use strict;

my %uniq;
while (<>) {
    /:?(\S+?)[:\s]+Warning/ and $uniq{$1} = 1;
}

print "$_\n" for keys %uniq;

BTW, your input and regex don't lead to the output you indicated. I changed the regex, but I'm not sure your input sample is correct. Is the placement of colons really so wild?
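
One thing the script above leaves open is order: keys %uniq returns the hosts in no particular order. If the hosts should come out in the order they first appear, a minimal sketch could look like the following. It reuses the same regex; the helper name unique_hosts and the inlined sample lines (shortened from the question) are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: return each host that has a Warning line,
# once, in the order it first appears.
sub unique_hosts {
    my (%seen, @hosts);
    for my $line (@_) {
        # Same regex as in the script above; $1 is the host name.
        next unless $line =~ /:?(\S+?)[:\s]+Warning/;
        # $seen{$1}++ is false only the first time a host shows up.
        push @hosts, $1 unless $seen{$1}++;
    }
    return @hosts;
}

# Sample lines adapted from the question (messages shortened).
my @sample = (
    'OUT :abc123: : Warning: /var/tmp/abc123.fw is older than it should be',
    'OUT :abc123 : : Warning: /var/tmp/abc123.fw.schedule is older than it should be',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT bcd111: : Warning: /var/tmp/abc123.fw.schedule is older than it should be',
    'OUT bcd111: : Succeeded.',
);

print "$_\n" for unique_hosts(@sample);
```

The hash still does the deduplication; the array only remembers the order in which new keys arrived.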

score 1 · Answer 2 · answered Oct 13 '14 at 12:37

OUT\s*:?([^:]*):(?=.*?\bWarning\b)(?:(?!OUT).)*(?!.*?\1[:\s]*Warning)

You can try this. See the demo and grab the capture group.

http://regex101.com/r/sK8oK9/12

Answered by vks

score 0 · Answer 3 · answered Oct 13 '14 at 12:29

You can use this perl one-liner. It prints the host for every Warning line, so repeated hosts still appear:

perl -lane 'if (/\bWarning\b/) { $F[1] =~ s/\W+//g; print $F[1] }' file
abc123
abc123
abc1234
abc1234
abc1234
bcd111

Answered by anubhava

score 0 · Answer 4 · answered Oct 13 '14 at 14:58

Use this pattern with the g and s modifiers:

OUT\s*:?([^:]+):\s*:\s*Warning(?!.*?\1\s*:\s*:\s*Warning)
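
As a rough sketch of how the pattern might be applied from Perl, the following reads the whole input as one string and collects the captures. The helper name last_warning_hosts and the inlined sample text (shortened from the question) are assumptions for illustration; the pattern itself is unchanged.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: apply the pattern above to the whole input at once.
sub last_warning_hosts {
    my ($text) = @_;
    my @hosts;
    # /g walks through every match; /s lets "." in the lookahead cross
    # newlines, so a match is rejected when the same host has another
    # Warning further down. Only the last Warning per host survives.
    while ($text =~ /OUT\s*:?([^:]+):\s*:\s*Warning(?!.*?\1\s*:\s*:\s*Warning)/gs) {
        (my $host = $1) =~ s/\s+$//;   # the capture may carry trailing spaces
        push @hosts, $host;
    }
    return @hosts;
}

# Sample text adapted from the question (messages shortened).
my $text = join "\n",
    'OUT :abc123: : Warning: /var/tmp/abc123.fw old',
    'OUT :abc123 : : Warning: /var/tmp/abc123.fw.sched old',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old',
    'OUT bcd111: : Succeeded.';

print "$_\n" for last_warning_hosts($text);
```

Because the negative lookahead may rescan the rest of the input for every candidate match, this is probably best kept for small inputs; the hash approach reads each line only once.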

Answered by alpha bravo

score 0 · Answer 5 · edited May 23 '17 at 12:11

This is more of a supplement/complement to @choroba's response above since he nailed it with "when you hear 'unique' think 'hash'". You should accept @choroba's answer :-)

Here I simplified the regex part of your question into a call to grep in order to focus on uniqueness, changed the data in your file a bit (so it could fit here) and saved it as dups.log:

# dups.log 
OUT :abc123: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT :abc123: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Succeeded.

This one-liner gives the output below (the order of the lines can vary, since hash keys are unordered):

perl -E '++$seen{$_} for grep{/Warning/} <>; print keys %seen' dups.log

OUT :abc123: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT :abc123: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT abc1234: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)

This is pretty much the same output you'd get with uniq dups.log | grep Warning (uniq only collapses adjacent duplicate lines, which is all this file contains). It works because perl uses each matching line as a hash key: the first time it sees a line it adds the key, and every time after that it increments the key's value (with ++$seen{$_}). For perl, "same key" here means a line that is a duplicate. Try printing values %seen or using -MDDP and p %seen to get a sense of what is going on.

To get your output, @choroba's regex adds the capture (instead of the whole line) to the hash:

perl -nE '/:?(\S+?)[:\s]+Warning/ && ++$seen{$1} }{ say for keys %seen' dups.log

but, just as with the whole-line method above, the hash holds only one copy of each key (here the captured host) and ++ merely increments its value, so in the end you get "unique" keys à la uniq in the %seen hash.
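
To watch the counting happen without installing Data::Printer, a small sketch along these lines prints every key next to its value. The helper name warning_counts is hypothetical, and the contents of dups.log are inlined here only so the example is self-contained.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: count the Warning lines seen for each host.
sub warning_counts {
    my %seen;
    for my $line (@_) {
        # The first match for a host creates its key with value 1;
        # every later match only increments that value.
        $seen{$1}++ if $line =~ /:?(\S+?)[:\s]+Warning/;
    }
    return %seen;
}

# The lines of dups.log, inlined for illustration.
my @sample = (
    'OUT :abc123: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)',
    'OUT :abc123: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT abc1234: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)',
    'OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)',
    'OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)',
    'OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)',
    'OUT bcd111: : Succeeded.',
);

my %seen = warning_counts(@sample);
print "$_ => $seen{$_}\n" for sort keys %seen;
```

With this data abc123 is counted twice and abc1234 and bcd111 three times each, yet every host is stored as a single key, which is the whole trick.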

It's a neat perl trick you never forget :-)

References:

The SO question has some good explanations of the perl idiom for uniq using a hash as per @choroba.
This is touched on in perlfaq4 which describes the %seen{} hash trick.
Perlmaven shows how to make your own "home made" uniq using this approach.
...
