1

Trying to process a CSV file using AWK, however I have met a problem that many of my cells in my row already contain comma ,, meaning I can not separate field using awk -F,.

CSV FILE

Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"

The ... represents rows/columns containing [, ,, ', "

What I have tried:
awk -F, '{print $X}'
This give me following output:

'Element2']
"['192.168.x.101'

What I want to accomplish:

host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
oguz ismail
  • 1
  • 16
  • 47
  • 69
N. J
  • 398
  • 2
  • 13
  • How many columns does `,...,` represent? Also, do any columns have `"`, `[`, or `'` before that contining the required IP address? – Dave Pritlove Nov 24 '22 at 11:30
  • Thanks. I will post an approach that will work but you may have to vary the column number to get the part you need. – Dave Pritlove Nov 24 '22 at 11:44
  • 1
    Your CSV is invalid: a field that contains comma(s) shall be quoted (see how the `ADDRESSES` fields are) – Fravadona Nov 24 '22 at 12:30
  • 1
    Do you REALLY have some fields that are inside double quotes, e.g. `"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"` and some that aren't e.g. `['Element1', 'Element2']`? Also, do you REALLY have some sub-fields that are inside single quotes, e.g. `'192.168.x.100'` and some that aren't e.g. the last `fe80:XX` on the same line? If you do then you should fix whatever is generating that mess, if you don't then please fix your example to show more accurate sample input/output. – Ed Morton Nov 24 '22 at 23:26

4 Answers4

1

I'd recommend a proper CSV parser to do the job, then use awk to do the regex, e.g.

$ ruby -r 'csv' -ne 'lines=$_
  CSV.parse(lines) do |i| 
    i.each do |j| 
      printf("%s ", j)
    end
    puts ""
  end' file | 
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
  /^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
0

Modern versions of awk allow a record to split at more than one field separator. Thus, each line can be split at both commas and single quote marks to isolate the data you need.

To use ' as a field separator along with , requires the former to be escaped and it can be quite tricky to then combine the two. The simplest way I've found after a few trials is to use the shell F switch with a regular expression including the escaped ' and ,. It's messy as you have to close the first single quote before escaping the required one and re-opening a single-quotted command: -F'[,'\''=]' (I generally prefer setting field separators within the awk procedure but this one defeated me).

This edited version works to isolate the field (change $35 to suit by trail-and error):

awk -F'[,'\'']' 'NR>1{print $1" "$35}' data.csv

I tested the above on the following test file:

data.csv:
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.xxx.yy', 'fe80:XX','192.168.x.100', fe80:XX]"
host3,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.xxx.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host4,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['xxx.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"

output:

host1 192.168.x.99
host2 192.168.xxx.yy
host3 192.xxx.x.99
host4 xxx.168.x.99
Dave Pritlove
  • 2,601
  • 3
  • 15
  • 14
  • Apologies, I inadvertantly used $0 instead of $1 in the original post, edited to correct. – Dave Pritlove Nov 24 '22 at 11:50
  • Should the `'` inside the `FS` statement be escaped? – N. J Nov 24 '22 at 11:51
  • No but I just realised also that the OR `|` is probably not needed as the square brackets mean OR already. And the quotes should probably be regex boundaries `//`. Really sorry, rushing and don't have time to make a test file but I think setting two delimerters will get what you need. – Dave Pritlove Nov 24 '22 at 11:53
  • Unfortunately @DavePritlove, I'm just as close as with my original try: `awk -F, '{print $X}'` – N. J Nov 24 '22 at 11:59
  • OK, I have to dash but will have a look again later. I'm sure that setting two separators is the way to go as it will isolate both your output parts, it's just getting the regex right for `'`OR`'`, you might try deleting by BEGIN block and setting your shell argument to `-F ' | ,` with and without bounding quotes or parenthesis around `'|,` . – Dave Pritlove Nov 24 '22 at 12:05
  • You can't have a `'`, escaped or not, inside a `'`-delimited string (including scripts) in shell, and `FS=/[,']/` (with `/`s) is invalid syntax for assigning a value to FS. – Ed Morton Nov 24 '22 at 17:58
  • @N.J I finally have it working and have revised my answer. Setting two separators turns out to be tricky when one needs escaping but I used the simplest option from my trials which was within the shell switch. My test file added plenty of early padding with both commas and single quotes so you will have to find the second field number by trial and error – Dave Pritlove Nov 24 '22 at 18:09
  • @EdMorton Yes, it turned out to be tricky when I got a chance to tweak it. I've revised my answer using the F switch with a closed quote before the escaped `'` and a reopening quote. Horrible but works. I'd much prefer to set the pair of separators in BEGIN but couldn't find a way that wasn't horribly messy. Suggestions welcome. – Dave Pritlove Nov 24 '22 at 18:11
  • @DavePritlove , each of your row contains the same amount of data, this is not the case for me... – N. J Nov 24 '22 at 20:36
  • `-F"[,']"` or `BEGIN{FS="[,\047]"} ...`, see http://awk.freeshell.org/PrintASingleQuote – Ed Morton Nov 24 '22 at 21:20
  • 1
    @EdMorton thanks, I like both of those. – Dave Pritlove Nov 24 '22 at 21:43
  • @N.J perhaps I should have phrased "how many columns does `,...,` represent?" in my initial comment more explicitly (assuming you are now saying you have varying column numbers before the IP field, the length of data within a field is irrelevant). If you're still seeking ideas, I'd either add a real snippet from your file to your question or state very clearly that the required data can be in different columns in different rows. – Dave Pritlove Nov 24 '22 at 21:49
0

Using awk:

awk -F",\"|\"$" 'NR>1 { \
gsub(/\047|[\[\]]/,""); \
split($2,a,", "); \
split($1,h,","); \
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]}}' src.csv

Output:

host1 192.168.x.100
host1 192.168.x.99
host2 192.168.x.101

Details:

-F",\"|\"$" (split on ," or " at end of record (will remove trailing double quote and each record will be split into two fields.

gsub(/\047|[\[\]]/,""); (sanitize by removing both single quotes and/or brackets)

split($2,a,", "); (split second field into array a on comma-space)

split($1,h,","); (split first field into array h on comma.

for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n] Iterate over array a and only print output if array item starts with a number

j_b
  • 1,975
  • 3
  • 8
  • 14
0

Using GNU awk for FPAT:

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    n = split($NF,a,/\047/)
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

To see how FPAT is splitting the input into comma-separated fields and then split() is splitting the last field into '-separated subfields, add some prints, e.g.:

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
    print "============="
    print
    for ( i=1; i<=NF; i++ ) {
        print i, "<" $i ">"
    }
    n = split($NF,a,/\047/)
    for ( i=1; i<=n; i++ ) {
        print "\t" NF "." i, "<" a[i] ">"
    }
    for ( i=2; i<=n; i+=4 ) {
        print $1, a[i]
    }
}

$ awk -f tst.awk file
=============
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
1 <host1>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element1', 'Element2']>
6 <"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]">
        6.1 <"[>
        6.2 <192.168.x.99>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <,>
        6.6 <192.168.x.100>
        6.7 <, fe80:XX]">
host1 192.168.x.99
host1 192.168.x.100
=============
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
1 <host2>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element3']>
6 <"['192.168.x.101', 'fe80:XX']">
        6.1 <"[>
        6.2 <192.168.x.101>
        6.3 <, >
        6.4 <fe80:XX>
        6.5 <]">
host2 192.168.x.101

See What's the most robust way to efficiently parse CSV using awk? for more information on FPAT and parsing CSVs with awk.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185