
I am using Ruby for this. FreeLing (an NLP tool) has a shallow parser that returns a string like the one below for the text "I just read the book, the grasshopper lies heavy" when I run a shallow parsing command.

a = <<EOT
S_[
  sn-chunk_[
    +(I i PRP -)
  ]
  adv_[
    +(just just RB -)
  ]
  vb-chunk_[
    +(read read VB -)
  ]
  sn-chunk_[
    (the the DT -)
    +n-chunk_[
      (book book NN -)
      +n-chunk_[
        +(The_Grasshopper_Lies_Heavy the_grasshopper_lies_heavy NP -)
      ]
    ]
  ]
  st-brk_[
    +(. . Fp -)
  ]
]

EOT

I want to get the following array from this:

["I", "just", "read", "the book The Grasshopper Lies Heavy","."]

(I want to merge the words that fall under the same subtree into a single array element.)

So far, I have written this much:

b = a.gsub(/.*\[/,'[').gsub(/.*\+?\((\w+|.) .*/,'\1').gsub(/\n| /,"").gsub("_"," ")

which returns

[[I][just][read][the[book[The Grasshopper Lies Heavy]]][.]]

So, how can I get the desired array?

fedorqui
B A
  • Are you sure the API you are using cannot output the token list? According to the docs, try `--outf token` if you are using the command line. – Wiktor Stribiżew Nov 08 '16 at 12:42
  • @WiktorStribiżew It doesn't. 'shallow' is one of the result options; if I use the other option, called 'tagged', it just tags each word/named entity separately and I don't get the tree for "the book The Grasshopper Lies Heavy". – B A Nov 08 '16 at 12:49
  • With the regex approach, how can you differentiate between a real `_` and the one introduced by the FreeLing tree builder? You are now removing all underscores. Or will there be no underscores in the output tree? – Wiktor Stribiżew Nov 08 '16 at 12:57
  • @WiktorStribiżew There won't be. But I think that is a minor problem; I can use \w_\w or something like that. The main problem is converting the output to a consumable array. – B A Nov 08 '16 at 13:04
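
On the underscore point from the comments, here is a minimal sketch of the "\w_\w or something like that" idea: replace only underscores that sit between two alphanumeric characters, so the `_` in chunk labels such as `sn-chunk_[` is left alone. The helper name `split_multiword` is made up for illustration, and this still cannot tell a joined multiword apart from a word that genuinely contains an underscore in the input text.

```ruby
# Hypothetical helper: turn the underscores that join a multiword form
# back into spaces, leaving any other underscore untouched.
def split_multiword(form)
  # Lookbehind/lookahead keep the neighbouring characters out of the match,
  # so consecutive joins like "A_B_C" are all replaced in one pass.
  form.gsub(/(?<=[[:alnum:]])_(?=[[:alnum:]])/, ' ')
end

puts split_multiword("The_Grasshopper_Lies_Heavy")
puts split_multiword("sn-chunk_[")
```

The first call prints the title with spaces; the second prints the chunk label unchanged, because its underscore is followed by `[` rather than a letter or digit.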

2 Answers

2

From your solution so far:

result = a.gsub(/.*\[/,'[').gsub(/.*\+?\((\w+|.) .*/,'\1').gsub(/\n| /,"").gsub("_"," ")
result.split('][').map { |s| s.gsub(/\[|\]/, ' ').strip }
# => ["I", "just", "read", "the book The Grasshopper Lies Heavy", "."]
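
Splitting on `][` works for the sample, but it would also split inside a chunk that contains two sibling sub-chunks. As an alternative, here is a rough sketch that walks the output line by line and tracks nesting depth, collecting every word under each chunk directly below `S`. It assumes the layout shown in the question (one chunk opener, leaf, or `]` per line) and that underscores in a word form only come from multiword joining, as stated in the comments. The method name `chunk_phrases` is made up for illustration.

```ruby
# Hypothetical helper: group the word forms of a FreeLing shallow parse
# by the top-level chunk (direct child of S) they belong to.
def chunk_phrases(tree_text)
  phrases = []
  depth = 0
  tree_text.each_line do |raw|
    line = raw.strip
    if line.end_with?('_[')
      # A chunk opens, e.g. "S_[", "sn-chunk_[" or "+n-chunk_[".
      depth += 1
      phrases << [] if depth == 2   # depth 1 is S; depth 2 starts a new phrase
    elsif line == ']'
      depth -= 1
    elsif (m = line.match(/\A\+?\((\S+) /))
      # A leaf such as "+(read read VB -)"; the first field is the word form.
      word = m[1].tr('_', ' ')
      if depth >= 2
        phrases.last << word
      else
        phrases << [word]           # a leaf directly under S gets its own entry
      end
    end
  end
  phrases.map { |words| words.join(' ') }
end

sample = <<~EOT
  S_[
    sn-chunk_[
      +(I i PRP -)
    ]
    adv_[
      +(just just RB -)
    ]
    vb-chunk_[
      +(read read VB -)
    ]
    sn-chunk_[
      (the the DT -)
      +n-chunk_[
        (book book NN -)
        +n-chunk_[
          +(The_Grasshopper_Lies_Heavy the_grasshopper_lies_heavy NP -)
        ]
      ]
    ]
    st-brk_[
      +(. . Fp -)
    ]
  ]
EOT

p chunk_phrases(sample)
# => ["I", "just", "read", "the book The Grasshopper Lies Heavy", "."]
```

Because it counts depth instead of matching bracket pairs with a regex, nested and sibling sub-chunks all end up in the phrase of their top-level chunk.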
Jagdeep Singh
0

If you call FreeLing from Ruby via the API, you can get the tree and traverse it at will.

If you are using the output of the command-line program and loading it into Ruby as a string, it may be easier to call it with the option "--output conll", which produces a tabular format that is easier to deal with.

Lluís Padró
  • I just tried that, but it didn't work. This works: ` "#{analyzer_path} -f #{cfg_path} --inpf plain --outf shallow" `. This doesn't: `"#{analyzer_path} -f #{cfg_path} --inpf plain --outf shallow --output conll"` – B A Nov 30 '16 at 11:48
  • The option "--output conll" only works in version 4.0 or newer, and it seems you are using 3.x. – Lluís Padró Feb 14 '17 at 14:53