I'm currently using this RegEx to match every link in a string:

let regex = /\b(https?:\/\/)?(([a-z\d]+([-_][a-z\d]+)*)\.)+\w{2,32}(:[\d]{2,5})?([-a-z0-9@:%_+.~#?&/=]*)\b/gi;

These links match without any problem, and this is intended.

https://gle.co.ulk.co.com:443/?key=test&5345@%20#arr/do
https://google.coshe
google.comls
google.co

These links will match too, but the last / (trailing slash) is not included in the matches, which is not intended:

https://gle.co.ulk.co.com:443/?key=test&5345@%20#arr/do/
https://google.coshe/
google.comls/
google.co/

I've tried using these match groups instead of ([-a-zA-Z0-9@:%_+.~#?&/=]*), without any success:

([-a-zA-Z0-9@:%_+.~#?&/=]*)\/?
([-a-zA-Z0-9@:%_+.~#?&/=]*\/?)
(([-a-zA-Z0-9@:%_+.~#?&/=]\/?)*)
Lorenzo Lapucci
  • Have you tried `/\b(https?:\/\/)?(([a-z\d]+([-_][a-z\d]+)*)\.)+\w{2,32}(:[\d]{2,5})?([-a-z0-9@:%_+.~#?&/=]*)\b\/?/gi`? – giuseppedeponte Nov 10 '19 at 17:03
  • @giuseppedeponte I actually didn't, because I was seeing `\b` as the end of the whole match. It is working, thank you very much – Lorenzo Lapucci Nov 10 '19 at 17:17
  • It must be because `\b` only matches at the boundary between a word character and a non-word character, and `/` is a non-word character, so there is no boundary after a trailing slash that is followed by whitespace or the end of the string. You can explicitly anchor the start/end of the whole string like this instead: `/^(https?:\/\/)?(([a-z\d]+([-_][a-z\d]+)*)\.)+\w{2,32}(:[\d]{2,5})?([-a-z0-9@:%_+.~#?&/=]*)$/gi` (add the `m` flag if you have multiple lines in one string) – giuseppedeponte Nov 10 '19 at 17:26
  • I wanted to note that I don't want to match the start or the end of the string, this is why I used `\b` – Lorenzo Lapucci Nov 10 '19 at 18:38
  • You've set it to case insensitive. That means it will match e.g. `HTTP` at the start. You probably didn't mean to do that. Also, you can change `[-a-z0-9@:%_+.~#?&/=]` to `[-\w@:%_+.~#?&/=]`. – David Knipe Nov 12 '19 at 17:53
  • @DavidKnipe yes, the case-insensitive protocol is intentional too, as this is user-generated content. Also, thanks for the simplification – Lorenzo Lapucci Nov 12 '19 at 19:16

1 Answer

Using a boundary assertion can be problematic for URLs.
This includes `\b`.
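
To make the `\b` problem concrete, here is a small sketch comparing the pattern from the question with the variant suggested in the comments. The variable names and the sample string are made up for illustration. After a trailing `/` that is followed by a space, there is no word boundary, so the engine backtracks and gives the slash up.

```javascript
// Pattern from the question: ends in \b.
const withBoundary = /\b(https?:\/\/)?(([a-z\d]+([-_][a-z\d]+)*)\.)+\w{2,32}(:[\d]{2,5})?([-a-z0-9@:%_+.~#?&/=]*)\b/gi;

// Variant from the comments: an optional slash is allowed after the final \b.
const withSlash = /\b(https?:\/\/)?(([a-z\d]+([-_][a-z\d]+)*)\.)+\w{2,32}(:[\d]{2,5})?([-a-z0-9@:%_+.~#?&/=]*)\b\/?/gi;

// Hypothetical sample input containing two links with trailing slashes.
const sample = "see google.co/ and https://google.coshe/ here";

// "/" followed by " " is non-word next to non-word, so \b cannot match there
// and the path group backtracks to exclude the slash.
const dropped = sample.match(withBoundary);
console.log(dropped); // [ 'google.co', 'https://google.coshe' ]

// With \/? placed after \b, the slash is consumed once the boundary has matched.
const kept = sample.match(withSlash);
console.log(kept); // [ 'google.co/', 'https://google.coshe/' ]
```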

A relaxed way of matching URLs is to be a bit more specific about the form
of the domain part, and to use a generalized directory / parameter structure at the end.

This is the relaxed version with no boundary requirement:

/((?:https?:\/\/)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/gi

https://regex101.com/r/UxcFCO/1


This is the relaxed version with a whitespace boundary requirement.
The URL is in group 1.

/(?:^|\s)((?:https?:\/\/)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/gi

https://regex101.com/r/NpqgXX/1
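
As a rough usage sketch, the two patterns above can be built from one shared core and applied with `matchAll`, reading the URL from group 1 in both cases. The names `core`, `loose`, `bounded`, and `extractUrls`, as well as the sample text, are assumptions made for this example.

```javascript
// Shared core: the capturing group used by both patterns above.
const core = /((?:https?:\/\/)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/.source;

// No boundary: a URL may start anywhere in the text.
const loose = new RegExp(core, "gi");
// Whitespace boundary: a URL must start at the beginning of the text or after whitespace.
const bounded = new RegExp("(?:^|\\s)" + core, "gi");

// Hypothetical helper: collect group 1 (the URL) from every match.
function extractUrls(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const text = "Visit https://google.coshe/ or (google.co/ today";

const looseUrls = extractUrls(loose, text);
console.log(looseUrls); // [ 'https://google.coshe/', 'google.co/' ]

// "(google.co/" does not start right after whitespace, so it is skipped here.
const boundedUrls = extractUrls(bounded, text);
console.log(boundedUrls); // [ 'https://google.coshe/' ]
```

Because the path part is `\S*`, the trailing slash is kept, but so is any other non-whitespace character that directly follows the path, such as a closing parenthesis or a period.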