javascript – regex expression for string.split() that splits a string on urls – Stack Overflow

Posted by: admin, February 20, 2020

Questions:

I have a regular expression for finding URLs in a string, but when I use it with String.prototype.split() the result contains undefined entries.

const regex = /(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/g;
const str = "yay http://www.google.com woo www.google.com haha google.com";

console.log(str.match(regex));
// [ 'http://www.google.com', 'www.google.com', 'google.com' ]

console.log(str.split(regex));
// [ 'yay ','http://w',undefined,undefined,'',' woo ',undefined,undefined,'www.','',' haha ',undefined,undefined,undefined,'','' ]

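After some research, it appears that this has to do with capturing groups: split() includes the text captured by each group in the result array, and a group that did not participate in the match contributes undefined. I added ?: to every capturing group (the parts wrapped in parentheses) to make them non-capturing, and that removed the undefined entries.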

const reg2 = /(?:http(?:s)?:\/\/.)?(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?&//=]*)/g;

const str = "yay http://www.google.com woo www.google.com haha google.com";

console.log(str.split(reg2));
// [ 'yay ', ' woo ', ' haha ', '' ]

But now the URLs are omitted from the result. I am looking to return:

[ 'yay ', 'http://www.google.com', ' woo ', 'www.google.com', ' haha ', 'google.com' ]

Answers:

You might be able to just split on whitespace here:

var str = "yay http://www.google.com woo www.google.com haha google.com";
var parts = str.split(/\s+/);
console.log(parts);

If the leading/trailing whitespace is really required here, then try matching on the pattern:

<URL>|\s*\S+\s*

This would match either a URL or a run of non-whitespace characters together with any surrounding whitespace. Consider:

var str = "yay http://www.google.com woo www.google.com haha google.com";
console.log(str.match(/(?:http(s)?:\/\/.)?(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?&//=]*)|\s*\S+\s*/g));

This uses an alternation trick: the regex first tries to consume a URL, with no surrounding whitespace. If that fails, the fallback pattern \s*\S+\s* matches any other word along with its leading/trailing whitespace.
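
If you would rather keep using split(), another option is to rely on the capturing-group behavior instead of fighting it. split() places whatever each capturing group matched into the result array, so wrapping the entire non-capturing pattern in one outer group should put each full URL between the surrounding pieces. Below is a minimal sketch of that idea. It assumes the non-capturing pattern from the question; the names urlPattern and splitKeepingUrls are made up for illustration, and the filter step is only there to drop the empty string that split() leaves when the input starts or ends with a URL.

```javascript
// Hypothetical names; the pattern is the non-capturing one from the question,
// wrapped in a single outer capturing group.
const urlPattern = /((?:http(?:s)?:\/\/.)?(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?&//=]*))/g;

function splitKeepingUrls(text) {
  // split() inserts the text captured by the outer group between the pieces.
  // filter() removes the empty string produced when the text begins or ends with a URL.
  return text.split(urlPattern).filter((part) => part !== "");
}

const str = "yay http://www.google.com woo www.google.com haha google.com";
console.log(splitKeepingUrls(str));
// [ 'yay ', 'http://www.google.com', ' woo ', 'www.google.com', ' haha ', 'google.com' ]
```

Because there is exactly one capturing group and it always participates in the match, no undefined values end up in the array. Note that the filter also removes empty strings between two URLs that sit directly next to each other, which may or may not be what you want.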