
I need your help solving a problem in bash. I'm just starting to use it, and I need to extract only the words (in the second column) that are common to all the files in a folder. I understood how it works with only two files, but not with all of them. This is the beginning of my script:

for file in *
do
    awk '{print $2}' $file | sort -n > ord/$file
done

The above lines worked for extracting and sorting the second column, but now I don't know how I can find only the words that appear in all the files.

tripleee
alb_alb
  • As an aside, you should use double quotes around variables which refer to file names. See also http://shellcheck.net/ – tripleee Feb 15 '19 at 10:14

4 Answers


Extracting Lines Common For All Files

The following recursive functions extract the lines common to all files. An advantage is that we don't have to sort anything.

intersect() { f="$1"; if shift; then grep -Fxf "$f" | intersect "$@"; else cat; fi; }
common() { f="$1"; shift; intersect "$@" < "$f"; }
common *

The trick here is to intersect files recursively. If we understand files as mathematical sets of lines the question boils down to »Given sets a, b, …, n, how to compute a ∩ b ∩ … ∩ n«.

We can compute the intersection a ∩ b with the command grep -Fxf a b which is the same as cat b | grep -Fxf a or cat a | grep -Fxf b (useless use of cat only for better readability). The order of a and b does not matter.

To compute the intersection a ∩ b ∩ c we can compute (a ∩ b) ∩ c. How to compute (a ∩ b) is already known (see above), so we apply the same approach to the result of (a ∩ b): cat a | grep -Fxf b | grep -Fxf c. Alternatively, you can replace the entire grep command by common a b from moreutils.
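To make this concrete, here is a quick sanity check in a throwaway directory; the file names a, b, c and their contents are made up for illustration:

```shell
# Scratch directory with made-up data
cd "$(mktemp -d)"
printf '%s\n' apple banana cherry > a
printf '%s\n' banana cherry date  > b
printf '%s\n' cherry banana elder > c

# a ∩ b ∩ c, computed as (a ∩ b) ∩ c:
# keep the lines of a that occur in b, then keep those that also occur in c
cat a | grep -Fxf b | grep -Fxf c
# → banana
# → cherry
```

The result preserves the line order of a, since a is the stream being filtered.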

How to proceed from there should be clear.

Use 2nd Column Instead Of Whole Lines

To use only the 2nd column instead of whole lines we can either work on modified copies

for f in *; do
    awk '{ print $2 }' "$f" > "$f-col2"
done
common *-col2

… or adapt the function

mask() { awk '{ print $2 }' "$@"; }
intersect() { f="$1"; if shift; then grep -Fxf <(mask "$f") | intersect "$@"; else cat; fi; }
common() { f="$1"; shift; mask "$f" | intersect "$@"; }
common *
Socowi

Here's a simple Awk script to print all values of $2 which are present in all the files.

awk '# Count number of files; no lines were seen in this file yet
    FNR==1 { file++; delete b }
    # If not already seen in this file, add one to count
    # and mark as seen in this file as a side effect
    !b[$2]++ { a[$2]++ }
    # In the end, print all values which occurred in all files
    END { for (k in a) if (a[k]==file) print k }' *

This will examine all files in the current directory. You can replace the wildcard at the end with whatever will match the set of files you want to examine.

With comments removed, this can be a one-liner, though let's not cheat too much. Here's a two-liner:

awk 'FNR==1 { file++; delete b }  !b[$2]++ { a[$2]++ }
    END { for (k in a) if (a[k]==file) print k }' *
tripleee
    Due to the comments this seems way longer than it actually is. Great one-liner! – Socowi Feb 15 '19 at 10:28
  • Thanks! Your code worked! Can you please explain why I should use the double quotes? – alb_alb Feb 15 '19 at 10:47
  • If you refer to my comment under the question, see https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable/27701642 and again, paste your script into the box at http://shellcheck.net/ to get basically the same diagnostic, along with a convenient "apply all" button which fixes your script automagically. – tripleee Feb 15 '19 at 10:49
  • I see! Thank you – alb_alb Feb 15 '19 at 11:00
  • In the extreme case, this might become a bit problematic if you are processing large amounts of data since you store every single possible value of `$2` in `a`. You might want to do something like: `!b[$2]++ { if (++a[$2] != file) delete a[$2] }`. But as I said, extreme case! – kvantour Feb 15 '19 at 11:42
  • Yeh, my original attack plan was to delete from `a` and see what's left when we are done, but I got distracted and figured this was good enough. – tripleee Feb 15 '19 at 11:45
  • also there might be a maximum array size (which I am not aware of) – kvantour Feb 15 '19 at 11:45

Try something like this:

$ FILES=`ls -1 *`
$ COUNT=`grep -c ^ <<<"$FILES"`
$ for FILE in $FILES; do awk '{ print $2}' $FILE | sort -u; done | \
     sort | uniq -c | grep " $COUNT "

Breaking this apart, we first get the list of files into FILES, and then count how many there are into COUNT - this is used at the end of the process.

Then we get the words in the second column in each file and use "sort -u" to return just one of each.

We do this in a loop for all files, and then count the number of times each word appears. This uses "uniq -c", which prefixes each word with its count. So if the word "pepper" appears in 7 files, the loop outputs "pepper" once for each of those 7 files, and "uniq -c" outputs " 7 pepper" (there is always whitespace at the start). If the total number of files was 7, then we now know every file had at least one instance of the word "pepper" in the second column.

We know that the number of files is in COUNT. So we just search for the "uniq -c" output that has " 7 " (with spaces either side).
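Here is the same pipeline run against two invented sample files in a scratch directory; only "pepper" occurs in the second column of both files, so it is the only line that survives the final grep:

```shell
# Scratch directory with made-up data
cd "$(mktemp -d)"
printf 'a pepper\nb salt\n'  > f1
printf 'c pepper\nd sugar\n' > f2

FILES=`ls -1 *`
COUNT=`grep -c ^ <<<"$FILES"`
# Deduplicate per file, then count across files and keep words seen COUNT times
for FILE in $FILES; do awk '{ print $2}' $FILE | sort -u; done | \
    sort | uniq -c | grep " $COUNT "
```

This prints "pepper" prefixed by the count 2 (the exact leading whitespace depends on uniq's column padding).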

jezzaaaa
    Overall approach sounds good. You may want to replace `for FILE in $FILE` with `for file in *` so that you don't have to rely on `ls` and word splitting. Also `sort | uniq` could be `sort -u`. `sort -n` could be just `sort`. And the last `grep` could avoid false positives with `grep "^ *$count "`. – Socowi Feb 15 '19 at 09:27
    Also don't use upper case for your private variables. – tripleee Feb 15 '19 at 10:17
  • I'm sorry but I can't understand the output of your code. If I run it the output is 35, which is the number of files in my folder. Did I make any mistakes? – alb_alb Feb 15 '19 at 10:26
  • @Alberto - Perhaps you did, or could be your environment. Works for me. Try taking off the | grep " $COUNT " at the end, and see what you get. For me, I get a list of all the words prefixed by the number found. – jezzaaaa Feb 15 '19 at 10:58
  • @tripleee Why should I not use uppercase for private variables? – jezzaaaa Feb 15 '19 at 11:00
  • https://unix.stackexchange.com/questions/42847/are-there-naming-conventions-for-variables-in-shell-scripts – tripleee Feb 15 '19 at 11:01
  • @Socowi Thanks for the hints. I've adjusted the sort/uniq accordingly. The false-positives you hint at are probably impossible, given that awk splits on space (assuming default FS). If I did "for FILE in *" I'd have to count the files anyway, so I figured I'd avoid reading the directory twice. This also avoids the possibility of an extra file being added during the execution, which could affect the results. I suppose I could increment COUNT inside the loop... – jezzaaaa Feb 15 '19 at 11:10
  • @triplee - Thanks for the link. I think this is a style thing, and for me, upper-case variable names make the variables stand out from the operators and make the code more easily grokked. However, I never mentally distinguish between environment variables and local variables, which is probably poor practice. So I'm going to reconsider my bash style from now on. – jezzaaaa Feb 15 '19 at 11:15
    Even some popular tutorials erroneously teach uppercase variables, so you are by no means alone. Then we get questions from people who tried to use `PATH` or `LD_PRELOAD` as private variables and get confused when stuff breaks. – tripleee Feb 15 '19 at 11:17

Here is another awk one:

awk '(NR==FNR){a[$2]++; next}
     (FNR==1) { for(i in a) if (a[i]==0) delete a[i]; else a[i]=0; }
     ($2 in a) {a[$2]++}
     END { for(i in a) if (a[i]!=0) print i }' f1 f2 f3 f4 ...

This works in the following way: we keep track of an array a which holds all entries still in common. When a value is seen in a file, we increment its count in the array. Each time a new file is read, we check which values are still zero and delete them from the array:

  • (NR==FNR){a[$2]++; next}: the first file is read. Initialise the array a with all its values.
  • (FNR==1) { for(i in a) if (a[i]==0) delete a[i]; else a[i]=0; }: If we enter a new file (FNR==1), then check all entries in the array a. If the value is still 0, it implies that we did not encounter the key of array a in the previous file, so delete it. Otherwise, reset it to zero to start the next cycle.
  • ($2 in a) {a[$2]++}: here we process each line of the file. If the entry is in the array a, increment it. This means that all values which are not common will still have a value 0, others a value higher.
  • END { for(i in a) if (a[i]!=0) print i }: at the end of all the processing, print whatever is left.
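For example, with three invented files in a scratch directory, only the value present in the second column of all of them survives:

```shell
# Scratch directory with made-up data
cd "$(mktemp -d)"
printf 'x cat\ny dog\n'  > f1
printf 'x dog\ny bird\n' > f2
printf 'z dog\nz cat\n'  > f3

# "dog" appears in column 2 of every file; "cat" misses f2, "bird" misses f1 and f3
awk '(NR==FNR){a[$2]++; next}
     (FNR==1) { for(i in a) if (a[i]==0) delete a[i]; else a[i]=0; }
     ($2 in a) {a[$2]++}
     END { for(i in a) if (a[i]!=0) print i }' f1 f2 f3
# → dog
```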
kvantour