I have the file names below and am using the regex below to match them:

File Names:

  1. 1234 12345678 TEST DOCUMENT December 20, 2018.pdf
  2. 1234 12345678 TESTDOCUMENT December 20, 2018.pdf

The regex I am using to match the file names is:

(\d+)\s(\d+)\s(\w+\s?\w+)

It works for the first file, but for the second file it also matches the month, December, because "TESTDOCUMENT December" is likewise two words separated by a space.

How can I write a regex that matches only up to "1234 12345678 TEST DOCUMENT" in both cases, with or without a space between TEST and DOCUMENT?

Expected Result:

  1. 1234 12345678 TEST DOCUMENT
  2. 1234 12345678 TESTDOCUMENT

Not this for the 2nd file: 1234 12345678 TESTDOCUMENT December

Ulysse BN
Bhanuchandar Challa
  • How are you expected to distinguish between those two cases? Will it always say "TEST DOCUMENT" or might it be some other name? – Robert Harvey Dec 21 '18 at 15:54
  • It could be any text. Possible combination is two words with a space – Bhanuchandar Challa Dec 21 '18 at 15:55
  • Then I don't see how you can tell the difference between the two. There aren't any distinguishing characteristics, unless you have something else like fixed columns. – Robert Harvey Dec 21 '18 at 15:56
  • After TEST DOCUMENT, it is always a month in format January-December – Bhanuchandar Challa Dec 21 '18 at 15:56
  • Ok. Then part of your regex needs to match on all twelve month names, something like [this](https://stackoverflow.com/questions/2655476/regex-to-match-month-name-followed-by-year). That will give you your ending demarcation. – Robert Harvey Dec 21 '18 at 15:57

4 Answers

Given that you said

After TEST DOCUMENT, it is always a month in format January-December

You can use a negative lookahead to ensure that you don't match the month:

(\d+)\s(\d+)\s(\w+\s?(?!Jan|Feb|Mar|...|Dec)\w+)
                     ^^^^^^^^^^^^^^^^^^^^^^^...

This ensures that the second word doesn't start with a month name.
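
As a rough illustration, the negative lookahead can be tried in Python as below. The full list of three-letter month prefixes and the helper name `prefix` are assumptions made for this sketch; they are not part of the answer above.

```python
import re

# Assumption: the month alternatives are spelled out as three-letter prefixes.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Two numbers, then one or two words; the negative lookahead rejects a
# second word that begins with a month prefix.
PATTERN = re.compile(r"(\d+)\s(\d+)\s(\w+\s?(?!" + MONTHS + r")\w+)")


def prefix(filename):
    """Return the matched leading part of the file name, or None (hypothetical helper)."""
    m = PATTERN.match(filename)
    return m.group(0) if m else None


for name in (
    "1234 12345678 TEST DOCUMENT December 20, 2018.pdf",
    "1234 12345678 TESTDOCUMENT December 20, 2018.pdf",
):
    print(prefix(name))
```

One limitation of this approach: a title whose second word itself begins with one of those prefixes (for example "Decision") would be cut short to its first word, and the comparison is case-sensitive.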

iBug

Another option is to match the "datelike" format at the end and capture what is before in a capturing group:

(\d+)\s(\d+)\s(.*?)\s\w+\s\d{1,2},\s\d{4}\.pdf$


As @iBug points out, if you only want to match word characters or spaces, you could replace (.*?) with ([\w ]+).
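
A minimal Python sketch of the same idea follows, using a variant that also consumes the month word before the day so that it stays out of the third group. The `[A-Za-z]+` for the month and the helper name `split_name` are assumptions for illustration.

```python
import re

# Anchor on the date-like tail "<Month> <day>, <year>.pdf" and lazily
# capture whatever sits between the two numbers and that tail.
DATE_TAIL = re.compile(
    r"(\d+)\s(\d+)\s(.*?)\s[A-Za-z]+\s\d{1,2},\s\d{4}\.pdf$"
)


def split_name(filename):
    """Return (first number, second number, title), or None (hypothetical helper)."""
    m = DATE_TAIL.match(filename)
    return m.groups() if m else None


print(split_name("1234 12345678 TEST DOCUMENT December 20, 2018.pdf"))
print(split_name("1234 12345678 TESTDOCUMENT December 20, 2018.pdf"))
```

Because the pattern is anchored on the date rather than on a word count, it does not need to know whether the title is one word or two.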

The fourth bird

Just make sure to always match the part with the date, for example:

(\d+)\s(\d+)\s(\w+\s?\w+)\s\w+\s\d+

That would be enough.
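
A short Python check of this pattern is sketched below. Since the full match now includes the month and day, the wanted prefix is taken by slicing up to the end of the third group; the helper name `wanted_prefix` is hypothetical.

```python
import re

# Same pattern as above: the trailing \s\w+\s\d+ forces the month and day
# to be matched outside the third group.
PATTERN = re.compile(r"(\d+)\s(\d+)\s(\w+\s?\w+)\s\w+\s\d+")


def wanted_prefix(filename):
    """Return the text up to the end of group 3, or None (hypothetical helper)."""
    m = PATTERN.match(filename)
    return filename[: m.end(3)] if m else None


print(wanted_prefix("1234 12345678 TEST DOCUMENT December 20, 2018.pdf"))
print(wanted_prefix("1234 12345678 TESTDOCUMENT December 20, 2018.pdf"))
```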

aleon

You can select everything from the start of the line up to the point that is followed by a white space and the name of a month, using a lookahead (?=...). Here it is for November and December:

^.*(?= December| November)

Be careful with the case of the month names (capitalized, upper case, etc.). Also consider whether you have localized data, with month names in different languages.
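
Extended to all twelve months, the lookahead might look like the Python sketch below. The English month list, the use of `re.IGNORECASE` to cope with different letter cases, and the helper name `text_before_month` are assumptions for this example; localized month names are not handled.

```python
import re

# Assumption: English month names only.
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)

# Everything from the start of the string that is followed by a space and
# a month name; IGNORECASE also accepts "december" or "DECEMBER".
BEFORE_MONTH = re.compile(
    r"^.*(?= (?:" + "|".join(MONTH_NAMES) + r")\b)",
    re.IGNORECASE,
)


def text_before_month(filename):
    """Return the text preceding the month name, or None (hypothetical helper)."""
    m = BEFORE_MONTH.match(filename)
    return m.group(0) if m else None


print(text_before_month("1234 12345678 TEST DOCUMENT December 20, 2018.pdf"))
print(text_before_month("1234 12345678 TESTDOCUMENT december 20, 2018.pdf"))
```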

wi2ard