2

I am working on a Ruby program that scrapes email addresses, so a single regex passed to .match(/some regex/) can only be part of the solution. There is no perfect regex for this problem in any language.

Either the expression accepts too many strings, resulting in false-positive matches, or valid results are excluded. I am using a regex for email "validation" (actually email "suspicion" is a more apt term) that casts a "wide net".

This strategy allows me to maximize positive results by storing the suspected addresses in an array and iterating through to deal with edge cases. This question revolves around one particular edge case.

Take for example the string:

desktop_variety_top@728x90

The logic to handle strings like this example would be to purge any string that contains no periods between the @ and the end of the string.

So we might be looking at something like:

def purge_edge_case(array)
  array.reject! { |s| s.<first_condition>? && s.<second_condition>? }
end

Figuring out the two string-based conditions is where I'm currently stuck.

HMLDude
  • Possible duplicate of [What is the best/easy way to validate an email address in Ruby?](http://stackoverflow.com/questions/4776907/what-is-the-best-easy-way-to-validate-an-email-address-in-ruby) – user000001 May 06 '17 at 16:56
  • I don't think so. There are many regular expressions to match email addresses written in all the major programming languages. The problem is that none of them is perfect. So the "net" is invariably cast either too wide or too narrow. The optimal solution, in scraping applications (which is what I am working on), is to cast the net wide and then whittle down the list through a series of steps. This question represents one such step. – HMLDude May 06 '17 at 17:20
  • I'm a bit lost. What is a 'conditional regex statement'? Second, why are you showing 2 conditions for testing for periods? Lastly, there are no all seeing solutions, as you mentioned, so what makes you think you are going to create one? – grail May 06 '17 at 17:23
  • I think you need one regex - `/@[^@.]+\z/` matching any string that has no dot in between the last `@` and the end of the string. – Wiktor Stribiżew May 06 '17 at 17:29
  • Ruby does support conditional regex statements: http://rubular.com/r/qyxnL8RQpQ – Casimir et Hippolyte May 06 '17 at 17:37
  • @CasimiretHippolyte Thanks for your example! It's good to know that Ruby supports conditional regex of the form (this || that). To satisfy my question however, it would have to support (this, excluding specified subset). – HMLDude May 06 '17 at 20:52
  • @WiktorStribiżew I believe that your solution would work. That is a very elegant way of handling the problem! – HMLDude May 06 '17 at 21:19
  • @pweslow: Yes, it just works as you described. Use if you like it :) – Wiktor Stribiżew May 06 '17 at 21:45

2 Answers

2

There is no need for regex here:

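test = input.split('@')
# Exactly one `@`, and the part after it contains a dot
# that is neither its first nor its last character.
test.size == 2 \
  && !test.last.start_with?('.') \
  && !test.last.end_with?('.') \
  && test.last.include?('.')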

Or, less strict, exactly as you requested:

test.size == 2 && test.last[/\./] # at least one dot after `@`
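A minimal sketch of how the less strict check could be plugged into the `purge_edge_case` method from the question, assuming the scraped candidates are plain strings. The helper name `dotted_domain?` and the sample data are hypothetical, and `split('@', -1)` is used instead of a bare `split('@')` so that a trailing `@` is not silently dropped:

```ruby
# Hypothetical helper: true when the string has exactly one "@"
# and at least one dot somewhere after it.
def dotted_domain?(candidate)
  # The -1 limit keeps trailing empty fields, so "a@b.c@" yields 3 parts.
  parts = candidate.split('@', -1)
  parts.size == 2 && parts.last.include?('.')
end

def purge_edge_case(array)
  # select returns a new array holding only the plausible addresses.
  array.select { |s| dotted_domain?(s) }
end

candidates = ['desktop_variety_top@728x90', 'user@example.com']
kept = purge_edge_case(candidates)
```

Note that this sketch returns a new array rather than mutating its argument, which avoids the `nil` that `reject!` returns when nothing is removed.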
Aleksei Matiushkin
0

Here is the completed method that solves the problem:

def purge_edge_case(array)
    array.reject! { |s| s.match(/@.*/).to_s != nil && s.match(/@.*/).to_s.match(/\./) == nil }
end
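For comparison, the same purge could be sketched with the single regex suggested in the comments on the question, `/@[^@.]+\z/`. This is only an illustration, assuming Ruby 2.4 or later for `String#match?`; the constant name `NO_DOT_AFTER_AT` is hypothetical:

```ruby
# Matches when only non-"@", non-"." characters follow the last "@",
# i.e. there is no dot between the last "@" and the end of the string.
NO_DOT_AFTER_AT = /@[^@.]+\z/

def purge_edge_case(array)
  # reject (without the bang) always returns an array;
  # reject! returns nil when no element was removed.
  array.reject { |s| s.match?(NO_DOT_AFTER_AT) }
end

suspects = ['desktop_variety_top@728x90', 'user@example.com']
cleaned = purge_edge_case(suspects)
```

Because the pattern requires at least one character after the `@`, a string ending in a bare `@` is not purged by this step and would need its own check.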
HMLDude
  • How on earth could that have been upvoted? `to_s != nil` is nonsense; the whole answer is a perfect example of code smell and bad practice. Flagged for mod attention. – Aleksei Matiushkin May 06 '17 at 19:21
  • @mudasobwa: Just curious: why mod attention? – Eric Duminil May 06 '17 at 20:41
  • @mudasobwa I am sure that there are cleaner ways to write the code. However to declare it as "nonsense" is nonsense! The code is in fact valid Ruby, and not only does it run (without error I might add), but it also solves the issue I raised in my question. – HMLDude May 06 '17 at 20:54
  • @EricDuminil the question “How would I use regexp to detect an email” and a clumsy answer from the OP receive 2 and 1 upvotes respectively. – Aleksei Matiushkin May 07 '17 at 04:53
  • @mudasobwa I up voted the question to give more visibility to your answer. – Eric Duminil May 07 '17 at 08:50
  • @pweslow the nonsense part was for the comparison between to_s and nil. `to_s` returns a string. Can a string be nil? – Eric Duminil May 07 '17 at 08:53
  • @EricDuminil As I understand it, Stack Overflow is supposed to be a question and answer forum for professional and enthusiast programmers. If a piece of code does not make sense, then a constructive response would be to point out the error and possibly offer an alternative. A senior member of the community adopting an elitist attitude does not encourage those trying to learn; rather, it creates barriers between those with experience in a particular topic and those trying to learn. – HMLDude May 07 '17 at 18:15