Getting used to regex here.

I have a file with the following structure:

word1 word2 word3 word4 word5 "word6" "word7"
word1 word2 word3 word4 word5 "word6" "word7"
word1 word2 word3 word4 word5 "word6" "word7"
...

which I want to capture into:

arr[0] = word1
arr[1] = word2
arr[2] = word3
arr[3] = word4
arr[4] = word5
arr[5] = word6
arr[6] = word7

My regex is: (?m)(.* )(.* )(.* )(.* )(.* )(".*") (".*")

Now I'm sure there is a more elegant way to write this where I don't have to repeat the same sequence multiple times.

My understanding is something like this should work?

(?:(.* )*|(".*")*)

I believe (?:(.* )|(".*")) means match EITHER .* or ".*", and the * at the end of (.* ) and (".*"), forming (.* )* and (".*")*, means match 0 or more times. This should do the same thing as my working regex, no?

Thoughts?

EDIT: After reading everything, I was simply trying to shorten my regex by capturing based on (.*) or \"(.*)\" without specifying the number of times the capturing will occur, which is not possible. Thank you!

The correct regex: (?m)(.*) (.*) (.*) (.*) (.*) \"(.*)\" \"(.*)\"

Shi Zhang
  • Why don't you just use the built-in String.split() function? So, String[] arr = lineInput.split(" "); – khriskooper Aug 10 '17 at 15:13
  • Do you **need** to capture each word? Or do you just want to match them? Because if you want to capture them, you need to write each capture group specifically – Gawil Aug 10 '17 at 15:42
  • What is a word for you? What characters are allowed? – Toto Aug 10 '17 at 15:52
  • @khriskooper because I need to split by a space OR " " and I know I could do something to make it work but I want to get better at regex – Shi Zhang Aug 10 '17 at 17:48
  • @Gawil match as in figuring out whether they exist? and capture as in get the value? unsure of the terminology – Shi Zhang Aug 10 '17 at 17:49
  • @Toto word as in either words (alphanumeric) or filepaths (UNIX and windows) so I just used `.*` – Shi Zhang Aug 10 '17 at 17:50
  • @ShiZhang please, put the word definition into the question. – Gangnus Aug 11 '17 at 10:41

1 Answer

  1. A group repeated with * or + still captures only once: it keeps the text from its last iteration. Alas, that means each capture group has to be written out separately.
  2. Whitespace is matched by \s.
  3. (.*)\s(.*)\s(.*)\s(.*)\s(.*)\s"(.*)"\s"(.*)"

is enough. For your task the " must stay OUTSIDE the groups. Your original regex does NOT do what you want: it captures the quotes into arr[5] and arr[6], and the trailing spaces into arr[0] to arr[4].
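As a rough illustration of points 1 and 3, here is a minimal Java sketch. The class and method names (`CaptureGroupsDemo`, `fixedGroups`, `repeatedGroup`) are made up for this example, and the `(?:(\w+)\s*)+` pattern is only an assumed stand-in for "a repeated group", not something taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupsDemo {
    // The seven-group regex from point 3, written as a Java string literal.
    static final Pattern FIXED = Pattern.compile(
        "(.*)\\s(.*)\\s(.*)\\s(.*)\\s(.*)\\s\"(.*)\"\\s\"(.*)\"");

    // Hypothetical repeated group: matches many words but has a single capture slot.
    static final Pattern REPEATED = Pattern.compile("(?:(\\w+)\\s*)+");

    // Returns the captured words, or null if the whole line does not match.
    static String[] fixedGroups(String line) {
        Matcher m = FIXED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] arr = new String[m.groupCount()];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = m.group(i + 1); // regex groups are 1-based, the array is 0-based
        }
        return arr;
    }

    // Returns whatever group 1 holds after the repeated group has matched the line.
    static String repeatedGroup(String line) {
        Matcher m = REPEATED.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "word1 word2 word3 word4 word5 \"word6\" \"word7\"";
        System.out.println(String.join(",", fixedGroups(line)));
        // Only the last iteration survives in the repeated group.
        System.out.println(repeatedGroup("word1 word2 word3")); // prints word3
    }
}
```

The repeated pattern matches the whole line, yet group 1 ends up holding only the final word, which is why the fixed version spells out all seven groups.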

If you want to read the words regardless of whether they are in "" or not, and allow any number of spaces between them, then:

[\s"]*(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]*

Admittedly, this is a simplified variant: written this way, it cannot check that "" are present on both sides of a word.
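A hedged Java sketch of the quote-agnostic regex above. The class name `QuoteAgnosticWords` and the helper `words` are hypothetical; the pattern is assembled from its repeated pieces only to keep the string literal readable, and should be equivalent to the regex as written. Note that `\w+` will not match file paths, so this only fits plain alphanumeric words.

```java
import java.util.Collections;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAgnosticWords {
    // Seven (\w+) groups joined by [\s"]+, with optional [\s"]* at both ends.
    static final Pattern LINE = Pattern.compile(
        "[\\s\"]*"
        + String.join("[\\s\"]+", Collections.nCopies(7, "(\\w+)"))
        + "[\\s\"]*");

    // Returns the seven words, or null if the line does not contain exactly seven.
    static String[] words(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] arr = new String[m.groupCount()];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = m.group(i + 1);
        }
        return arr;
    }

    public static void main(String[] args) {
        // Extra spaces and quotes in unexpected places are all treated as separators.
        String line = "word1   word2 \"word3\" word4 word5 \"word6\"  \"word7\"";
        System.out.println(String.join(",", words(line)));
    }
}
```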

If you really want to take an arbitrary number of words, use the split() function: split on whitespace (\\s+), and after that trim any leftover " and/or spaces from the elements.
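One way the split-then-trim approach might look in Java. This is a sketch, not the only way to do it: `SplitAndTrim` and `words` are illustrative names, and it assumes quoted values never contain spaces (a quoted path with a space in it would be cut in two).

```java
public class SplitAndTrim {
    // Splits a line on runs of whitespace, then strips leading/trailing quotes.
    static String[] words(String line) {
        String[] parts = line.trim().split("\\s+");
        for (int i = 0; i < parts.length; i++) {
            // Remove any " characters at the start or end of the element.
            parts[i] = parts[i].replaceAll("^\"+|\"+$", "");
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "word1 word2 word3 word4 word5 \"word6\" \"word7\"";
        for (String w : words(line)) {
            System.out.println(w);
        }
    }
}
```

Unlike the fixed-group regex, this works for any number of words per line, at the cost of not validating the line's shape.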

It is impossible to split lines into an arbitrary number of groups by regex only, without split() or something similar.
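As one reading of "something similar", a short pattern can be applied repeatedly with `Matcher.find()`, collecting one token per match instead of one group per word. The sketch below is an assumption about what that could look like, not something the answer spells out: the token pattern `"([^"]*)"|(\S+)` and the names `FindLoopTokens` and `tokens` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindLoopTokens {
    // Either a quoted token (group 1, quotes excluded) or a bare run of non-spaces (group 2).
    static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Collects every token on the line, however many there are.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("word1 word2 \"two words\" \"word7\""));
    }
}
```

Each call to `find()` is a separate match, so the "last capture wins" limitation never comes into play; the loop, not the regex, supplies the repetition.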

Gangnus
  • you wrote **It is impossible to split lines into arbitrary number of groups by regex only, without split() or something similar.** does that mean what I was trying to accomplish with `(?:(.* )*|(".*")*)` is not possible? – Shi Zhang Aug 10 '17 at 17:53
  • after reading everything, including the bottom answer, I was simply trying to **shorten** my regex by capturing based on `(.*)` or `\"(.*)\"` **without specifying the number of times the capturing will occur** which is not possible. thank you! – Shi Zhang Aug 10 '17 at 17:59
  • @ShiZhang Please, distinguish matching and capturing - matching is about finding a piece in the line that corresponds to regex. It can use undefined or defined repeaters. Capturing is taking all pieces that correspond to regex groups. When you use repeater for a group #2, for example, it captures many found pieces one after another into the same result #2. Naturally, only the last one remains there. – Gangnus Aug 11 '17 at 08:36
  • @ShiZhang So, it IS possible to use repeaters for groups, only they won't work as you expected. -Sigh- For me it was a great disappointment, too. – Gangnus Aug 11 '17 at 08:40