0

I would like to match every punctuation character in a string except /, #, $ and a dot that is not followed by a space (.net, for example).

I have done this so far.

(?!['\/#$])\p{P}

Now I need to handle the dot with no space after.

If anyone has an idea...

I am using Java.
For example, I need to replace every punctuation character with "" (the empty string), except for a dot with no space after it:

.net, asp.net hello. world, c#

becomes

.net asp.net hello world c#

thomas
  • 1
    Please provide sample inputs and tag the appropriate programming language (if any). – Jan Feb 19 '20 at 14:11
  • post edited with example – thomas Feb 19 '20 at 14:16
  • Well, it's hard to tell with just that single example. But since a dot is a special character, you will *a)* need to escape it with a backslash and *b)* check the following character, possibly with a negative lookahead. So something like `\.(?!\S)` would work just for the dot. – JvdV Feb 19 '20 at 14:21
  • thx; would you have a regex that would work with the example : .net, asp.net hello. world, c# -> .net asp.net hello world c# – thomas Feb 19 '20 at 14:36
  • this regex (?!['\/#$])\p{P} matches every punctuation character except ' / # $. Now I want to add an exception: every dot with a letter or number right after it. – thomas Feb 19 '20 at 14:41

3 Answers

1

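I would use the following:

[\]\[!"%&'()*+,:;<=>?@\\^_`{|}~-]|\.(?![a-zA-Z0-9])

The character class [\]\[!"%&'()*+,:;<=>?@\\^_`{|}~-] matches every ASCII punctuation character (the set Java documents for \p{Punct}, which is somewhat broader than Unicode \p{P}) except /, #, $ and ., and the other alternative matches a dot that is not followed by a letter or a digit. The [ is escaped because Java would otherwise read it as the start of a nested character class.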

Note that it's tempting to use \b here, but that's a bad idea because \w includes _ in addition to [a-zA-Z0-9].
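As a rough sketch of how the explicit character class could be applied in Java: the class name `ExplicitClassStripper` and the method `strip` are made up for illustration, and the `[` inside the class is written escaped on the assumption that `java.util.regex` would otherwise treat it as a nested class.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class ExplicitClassStripper {
    // Explicit ASCII punctuation class without '/', '#', '$' and '.',
    // or a dot that is not followed by an ASCII letter or digit.
    // '[' is escaped so that Java does not open a nested character class.
    private static final Pattern PUNCT = Pattern.compile(
            "[\\]\\[!\"%&'()*+,:;<=>?@\\\\^_`{|}~-]|\\.(?![a-zA-Z0-9])");

    static String strip(String input) {
        // Replace every match with the empty string.
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: .net asp.net hello world c#
        System.out.println(strip(".net, asp.net hello. world, c#"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which `String.replaceAll` would do.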

If you want to keep using \p{P}, you could use the following, but expect worse performance:

(?![/#$]|\.[a-zA-Z0-9])\p{P}
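A minimal sketch of the lookahead variant in Java, for comparison. The names `LookaheadStripper` and `strip` are hypothetical; only the pattern itself comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class LookaheadStripper {
    // Match a \p{P} character unless it is '/', '#', '$',
    // or a dot immediately followed by an ASCII letter or digit.
    private static final Pattern PUNCT =
            Pattern.compile("(?![/#$]|\\.[a-zA-Z0-9])\\p{P}");

    static String strip(String input) {
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: .net asp.net hello world c#
        System.out.println(strip(".net, asp.net hello. world, c#"));
    }
}
```

The lookahead is evaluated at every position of the input before \p{P} is tried, which is where the extra cost comes from.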

The following would also work and might be as efficient as my first suggestion, but it relies on a lesser-known syntax which, if I'm not mistaken, is specific to Java regexes:

[\p{P}&&[^/#$.]]|\.(?![a-zA-Z0-9])
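The character class intersection (`&&`) can be tried out with a short Java sketch like the one below. The names `IntersectionStripper` and `strip` are placeholders chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class IntersectionStripper {
    // [\p{P}&&[^/#$.]] is the intersection of \p{P} with "anything but / # $ .",
    // so it matches punctuation other than those four characters.
    // The second alternative handles a dot not followed by a letter or digit.
    private static final Pattern PUNCT =
            Pattern.compile("[\\p{P}&&[^/#$.]]|\\.(?![a-zA-Z0-9])");

    static String strip(String input) {
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: .net asp.net hello world c#
        System.out.println(strip(".net, asp.net hello. world, c#"));
    }
}
```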
Aaron
  • thanks, all of your answers work, difficult now to know which to choose :) – thomas Feb 19 '20 at 15:11
  • Personally I would avoid the second one because it's inefficient (it tests both alternatives of the lookahead for each character of the input string before even checking whether it's "punctuation"), and the third because it uses an obscure feature that is sure to make future maintainers scratch their heads. The first one ain't pretty, but it's basic and it works – Aaron Feb 19 '20 at 16:00
  • Also it avoids the use of a "Punctuation" class which I personally find unhelpfully named: I would expect it to match localized punctuation such as `¿` and `¡`, which it doesn't, and I've never heard `#`, `=` or `>` being called punctuation outside of this class's definition. – Aaron Feb 19 '20 at 16:07
1

You may add an alternative into your negative lookahead:

(?![/#$]|\.(?!\s))\p{P}
        ^^^^^^^^^

See the regex demo.

Details

  • (?![/#$]|\.(?!\s)) - fail the match if immediately to the right, there is /, # or $, or a . not followed with a whitespace char
  • \p{P} - any punctuation proper char
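The nested lookahead described above can be sketched in Java as follows. The names `NestedLookaheadStripper` and `strip` are invented for this example. Note that this pattern keeps any dot that is not followed by whitespace, so a dot at the very end of the string, or one followed by another punctuation character, is left in place.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class NestedLookaheadStripper {
    // Match a \p{P} character unless it is '/', '#', '$',
    // or a dot that is NOT followed by a whitespace character.
    private static final Pattern PUNCT =
            Pattern.compile("(?![/#$]|\\.(?!\\s))\\p{P}");

    static String strip(String input) {
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: .net asp.net hello world c#
        System.out.println(strip(".net, asp.net hello. world, c#"));
        // prints: The end.   (the final dot has no whitespace after it)
        System.out.println(strip("The end."));
    }
}
```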
Wiktor Stribiżew
1

This regex appears to satisfy your use case:

(?!\.\w)(?!['\/#$])\p{P}

You may need to make changes (such as to the \p{P}) for use in Java; see "Regular Expressions on Punctuation".

https://regex101.com/r/xtBfYt/1

rbonestell
  • 1
    Note that `\w` includes `_` in addition to `[a-zA-Z0-9]`, so this will not match the `.` in `._` as it should. – Aaron Feb 19 '20 at 16:03
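The difference raised in the comment can be checked with a small Java sketch that runs the `\w` lookahead and an explicit `[a-zA-Z0-9]` lookahead on the same input. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical comparison helper; all names are illustrative only.
class WordClassComparison {
    // Lookahead based on \w, which also accepts '_' after the dot.
    private static final Pattern WITH_WORD_CLASS =
            Pattern.compile("(?!\\.\\w)(?!['\\/#$])\\p{P}");

    // Lookahead based on an explicit ASCII letter/digit class.
    private static final Pattern WITH_EXPLICIT_CLASS =
            Pattern.compile("(?![/#$]|\\.[a-zA-Z0-9])\\p{P}");

    static String stripWithWordClass(String input) {
        return WITH_WORD_CLASS.matcher(input).replaceAll("");
    }

    static String stripWithExplicitClass(String input) {
        return WITH_EXPLICIT_CLASS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The dot before '_' survives with \w: prints a.b
        System.out.println(stripWithWordClass("a._b"));
        // The dot is removed with [a-zA-Z0-9]: prints ab
        System.out.println(stripWithExplicitClass("a._b"));
    }
}
```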