
I have strings that have dates in different formats. For example,

sample_str_1 = 'this amendment of lease, made and entered as of the  10th day of august, 2016,   by and between john doe and jane smith'

Also, another string that has the date in it as,

sample_str_2 = 'this agreement, made and entered as of May 1, 2016, between john doe and jane smith'

In order to extract just the date from the first string, I did something like this,

match = re.findall(r'\S+d{4}\s+', sample_str_1)

This gives an empty list.

For the second string, I used the same method as for the first string and also got an empty list.

I also tried the datefinder module, and it gave me output like this:

import datefinder
match = datefinder.find_dates(sample_str_1)

for m in match:
    print(m)

>> 2016-08-01 00:00:00

The output above is wrong; it should be 2016-08-10 00:00:00.

I tried another way, based on an older post:

match = re.findall(r'\d{2}(?:january|february|march|april|may|june|july|august|september|october|november|december)\d{4}',sample_str_1)

This again gave me an empty list.

How can I extract dates like that from a string? Is there a generic method to extract dates that have text and digits? Any help would be appreciated.

user9431057
  • Maybe you should look at the [dateparser](https://pypi.python.org/pypi/dateparser) package. Reinventing the wheel here doesn't make much sense... – ctwheels Mar 01 '18 at 21:21
  • @ctwheels That didn't work. I used `date_parse = DateDataParser().get_date_data(sample_str_1)` and I got `{'date_obj': None, 'locale': None, 'period': 'day'}` – user9431057 Mar 01 '18 at 21:43
  • Do you only need to match the specific phrases `[day]st/nd/rd/th day of [month], [year]` and `[month] [day], [year]`? There are many other ways to format a date. – CAustin Mar 01 '18 at 22:17
  • You have only two formats of date `10th day of august, 2016` and `May 1, 2016`? – Srdjan M. Mar 01 '18 at 22:21
  • @CAustin yes, that is one format and string 2 has a different format. – user9431057 Mar 01 '18 at 22:24
  • @S.Jovan yes, for now, I have those two formats and if I can extract them, that would be awesome. – user9431057 Mar 01 '18 at 22:25

1 Answer


Regex: (?:(\d{1,2})(?:th|nd|rd).* ([a-z]{3})[a-z]*|([a-z]{3})[a-z]* (\d{1,2})),\s*(\d{4})
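To see what the five capture groups hold, here is a small check against the two sample strings from the question. It is only a sketch: the variable names `groups_1` and `groups_2` are illustrative and not part of the answer. The day-first alternative fills groups 1 and 2, the month-first alternative fills groups 3 and 4, and group 5 is always the year.

```python
import re

regex = re.compile(
    r'(?:(\d{1,2})(?:th|nd|rd).* ([a-z]{3})[a-z]*|([a-z]{3})[a-z]* (\d{1,2})),\s*(\d{4})',
    re.I,
)

sample_str_1 = 'this amendment of lease, made and entered as of the  10th day of august, 2016,   by and between john doe and jane smith'
sample_str_2 = 'this agreement, made and entered as of May 1, 2016, between john doe and jane smith'

# findall returns one tuple per match, with '' for groups that did not participate.
groups_1 = regex.findall(sample_str_1)
groups_2 = regex.findall(sample_str_2)

print(groups_1)  # [('10', 'aug', '', '', '2016')]
print(groups_2)  # [('', '', 'May', '1', '2016')]
```

An empty first group therefore signals the month-first format, which is what the `x[0] == ''` test in the code relies on.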

Python code:

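import datetime
import re

# Assumes `text` holds the sample strings from the question, one per line.
# Groups 1-2: day, month (day-first format); groups 3-4: month, day (month-first format); group 5: year.
regex = re.compile(r'(?:(\d{1,2})(?:th|nd|rd).* ([a-z]{3})[a-z]*|([a-z]{3})[a-z]* (\d{1,2})),\s*(\d{4})', re.I)

for x in regex.findall(text):
    if x[0] == '':
        # Month-first match: the non-empty groups are already month, day, year.
        date = '-'.join(filter(None, x))
    else:
        # Day-first match: reorder to month-day-year.
        date = '%s-%s-%s' % (x[1], x[0], x[4])

    print(datetime.datetime.strptime(date, '%b-%d-%Y').date())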

Output:

2016-08-10
2016-05-01
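The pattern above lists `th`, `nd` and `rd` but not `st`, and its greedy `.*` can stretch across unrelated text. A stricter variant could look like the sketch below. It is not the answer's method: `PATTERNS` and `extract_dates` are hypothetical names, the two patterns cover only the two formats from the question, and the `%b` month lookup assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical patterns covering only the two formats from the question.
PATTERNS = [
    # "10th day of august, 2016" (ordinal suffix optional)
    re.compile(r"\b(?P<day>\d{1,2})(?:st|nd|rd|th)?\s+day\s+of\s+"
               r"(?P<month>[a-z]+),\s*(?P<year>\d{4})", re.I),
    # "May 1, 2016"
    re.compile(r"\b(?P<month>[a-z]+)\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})", re.I),
]


def extract_dates(text):
    """Return the dates found in text, in order of appearance."""
    found = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Normalise to e.g. "aug-10-2016"; %b assumes English month names.
            stamp = "%s-%s-%s" % (m["month"][:3], m["day"], m["year"])
            try:
                parsed = datetime.strptime(stamp, "%b-%d-%Y").date()
            except ValueError:
                continue  # not a real month name, or an impossible day
            found.append((m.start(), parsed))
    return [d for _, d in sorted(found)]


sample_str_1 = 'this amendment of lease, made and entered as of the  10th day of august, 2016,   by and between john doe and jane smith'
sample_str_2 = 'this agreement, made and entered as of May 1, 2016, between john doe and jane smith'

print(extract_dates(sample_str_1))
print(extract_dates(sample_str_2))
```

Letting `strptime` validate the month means a match such as `lease 10, 2016` is discarded rather than reported as a date.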


Srdjan M.
  • this works great. What can I do if I have `2nd`, `3rd` etc. I tried to add `(?:(\d{1,2})th|nd|rd.* (..` it prints blank. How can I add that? (As I am a new user, I can not upvote yet, as you deserve one) – user9431057 Mar 01 '18 at 22:41
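A likely reason the attempt in the comment prints blanks: `|` has the lowest precedence in a regex, so without a group around the suffixes, `(\d{1,2})th|nd|rd` reads as three separate alternatives (`(\d{1,2})th`, `nd`, `rd`), and the last two never capture a day. The minimal sketch below uses a shortened pattern and an illustrative string, not the full regex from the answer, to show the difference.

```python
import re

text = "the 2nd day of august, 2016"

# Ungrouped: the alternatives are "(\d{1,2})th", "nd" and "rd".
ungrouped = re.compile(r"(\d{1,2})th|nd|rd", re.I)
# Grouped: the suffix alternatives are confined to a non-capturing group.
grouped = re.compile(r"(\d{1,2})(?:st|nd|rd|th)", re.I)

print(ungrouped.findall(text))  # [''] : "nd" matched, but group 1 is empty
print(grouped.findall(text))    # ['2']
```

Keeping the suffixes inside `(?:...)`, as in `(?:st|nd|rd|th)`, lets the day digits be captured for every ordinal form.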