0

I'm trying to analyze a large column of data that contains 12 digit numbers as seen below:

0    802112134267
1    300949934377
2    300999934377
3    222589009836
4    950279219923
Name: number, dtype: object

I want to grab any number that has 3 or more repeated characters. Row 2 contains 4 '9's and row 3 contains 3 '2's. I would want to return:

0    None
1    None
2    300999934377
3    222589009836
4    None
Name: number, dtype: object

Or just a truncated/filtered dataframe/series would suffice.

The regex that I think solves this is: '(\d)\1{2,}'

However, I haven't been able to successfully apply this regex to the series.

regex = re.compile('(\d)\1{2,}')
s.apply(lambda x: np.nan if regex.search(x) == None else x)

returns all NaN.

s.str.extract('(\d)\1{2,}', expand=True)

returns all NaN.

s.str.contains('(\d)\1{2,}')

returns all False.

Any help would be appreciated. I've tried searching the forum and haven't found any good examples that have worked.

Thanks

cs95
  • 379,657
  • 97
  • 704
  • 746
Matt
  • 93
  • 1
  • 5

1 Answers1

2

@WiktorStribiżew's regex is correct. In addition to his answer, I'd like to provide you with a simpler way to filter out your column.

You can use the df.where function to filter out your rows:

In [1524]: df['Col1'] = df.where(df['Col1'].str.contains(r'(\d)\1{2,2}'), other=None)

In [1525]: df['Col1']
Out[1525]: 
0            None
1            None
2    300999934377
3    222589009836
4            None
Name: Col1, dtype: object

df['Col1'].str.contains returns a boolean mask which is then applied to the dataframe using df.where. other=None specifies that None should be put in place of elements that do not match your pattern.

You can skip the regex compilation, and you don't need a lambda anymore.

cs95
  • 379,657
  • 97
  • 704
  • 746