Home » Javascript » How can I strip all punctuation from a string in JavaScript using regex?

How can I strip all punctuation from a string in JavaScript using regex?

Posted by: admin November 30, 2017 Leave a comment

Questions:

If I have a string with any type of non-alphanumeric character in it:

"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"

How would I get a no-punctuation version of it in JavaScript:

"This is an example of a string with punctuation"
Answers:

If you want to remove specific punctuation from a string, it will probably be best to explicitly remove exactly what you want like

replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"")

Doing the above still doesn’t return the string as you have specified it. If you want to remove any extra spaces that were left over from removing crazy punctuation, then you are going to want to do something like

replace(/\s{2,}/g," ");

My full example:

var s = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var punctuationless = s.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
var finalString = punctuationless.replace(/\s{2,}/g," ");

Results of running code in firebug console:

alt text

Questions:
Answers:
str = str.replace(/[^\w\s]|_/g, "")
         .replace(/\s+/g, " ");

Removes everything except alphanumeric characters and whitespace, then collapses multiple adjacent characters to single spaces.

Detailed explanation:

  1. \w is any digit, letter, or underscore.
  2. \s is any whitespace.
  3. [^\w\s] is anything that’s not a digit, letter, whitespace, or underscore.
  4. [^\w\s]|_ is the same as #3 except with the underscores added back in.
Questions:
Answers:

Here are the standard punctuation characters for US-ASCII: !"#$%&'()*+,-./:;<=>[email protected][\]^_`{|}~

For Unicode punctuation (such as curly quotes, em-dashes, etc), you can easily match on specific block ranges. The General Punctuation block is \u2000-\u206F, and the Supplemental Punctuation block is \u2E00-\u2E7F.

Put together, and properly escaped, you get the following RegExp:

/[\u2000-\u206F\u2E00-\u2E7F\'!"#$%&()*+,\-.\/:;<=>[email protected]\[\]^_`{|}~]/

That should match pretty much any punctuation you encounter. So, to answer the original question:

var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\'!"#$%&()*+,\-.\/:;<=>[email protected]\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;
var str = "This, -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
str.replace(punctRE, '').replace(spaceRE, ' ');

>> "This is an example of a string with punctuation"

US-ASCII source: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#posix

Unicode source: http://kourge.net/projects/regexp-unicode-block

Questions:
Answers:

For en-US ( American English ) strings this should suffice:

"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation".replace( /[^a-zA-Z ]/g, '').replace( /\s\s+/g, ' ' )

Be aware that if you support UTF-8 and characters like chinese/russian and all, this will replace them as well, so you really have to specify what you want.

Questions:
Answers:

I ran across the same issue, this solution did the trick and was very readable:

var sentence = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newSen = sentence.match(/[^_\W]+/g).join(' ');
console.log(newSen);

Result:

"This is an example of a string with punctuation"

The trick was to create a negated set. This means that it matches anything that is not within the set i.e. [^abc] – not a, b or c

\W is any non-word, so [^\W]+ will negate anything that is not a word char.

By adding in the _ (underscore) you can negate that as well.

Make it apply globally /g, then you can run any string through it and clear out the punctuation:

/[^_\W]+/g

Nice and clean 😉

Questions:
Answers:

/[^A-Za-z0-9\s]/g should match all punctuation but keep the spaces.
So you can use .replace(/\s{2,}/g, " ") to replace extra spaces if you need to do so. You can test the regex in http://rubular.com/

.replace(/[^A-Za-z0-9\s]/g,"").replace(/\s{2,}/g, " ")

Update: Will only work if the input is ANSI English.

Questions:
Answers:

In a Unicode-aware language, the Unicode Punctuation character property is \p{P} — which you can usually abbreviate \pP and sometimes expand to \p{Punctuation} for readability.

Are you using a Perl Compatible Regular Expression library?

Questions:
Answers:

I’ll just put it here for others.

Match all punctuation chars for for all languages:

Constructed from Unicode punctuation category and added some common keyboard symbols like $ and brackets and \-=_

http://www.fileformat.info/info/unicode/category/Po/list.htm

basic replace:

".test'da, te\"xt".replace(/[\-=_!"#%&'*{},.\/:;?\(\)\[\]@\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g,"")
"testda text"

added \s as space

".da'fla, te\"te".split(/[\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)

added ^ to invert patternt to match not punctuation but the words them selves

".test';the, te\"xt".match(/[^\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)

for language like Hebrew maybe to remove ” ‘ the single and the double quote. and do more thinking on it.

using this script:

step 1: select in Firefox holding control a column of U+1234 numbers and copy it, do not copy U+12456 they replace English

step 2 (i did in chrome)find some textarea and paste it into it then rightclick and click inspect. then you can access the selected element with $0.

var x=$0.value
var z=x.replace(/U\+/g,"").split(/[\r\n]+/).map(function(a){return parseInt(a,16)})
var ret=[];z.forEach(function(a,k){if(z[k-1]===a-1 && z[k+1]===a+1) { if(ret[ret.length-1]!="-")ret.push("-");} else {  var c=a.toString(16); var prefix=c.length<3?"\u0000":c.length<5?"\u0000":"\u000000"; var uu=prefix.substring(0,prefix.length-c.length)+c; ret.push(c.length<3?String.fromCharCode(a):uu)}});ret.join("")

step 3 copied over the first letters the ascii as separate chars not ranges because someone might add or remove individual chars

Questions:
Answers:

As per Wikipedia’s list of punctuations I had to build the following regex which detects punctuations :

[\.’'\[\](){}⟨⟩:,،、‒–—―…!.‹›«»‐\-?‘’“”'";/⁄·\&*@\•^†‡°”¡¿※#№÷׺ª%‰+−=‱¶′″‴§~_|‖¦©℗®℠™¤₳฿₵¢₡₢$₫₯֏₠€ƒ₣₲₴₭₺₾ℳ₥₦₧₱₰£៛₽₹₨₪৳₸₮₩¥]

Questions:
Answers:

If you want to retain only alphabets and spaces, you can do:

str.replace(/[^a-zA-Z ]+/g, '').replace('/ {2,}/',' ')

Questions:
Answers:

if you are using lodash

_.words('This, is : my - test,line:').join(' ')

This Example

_.words('"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"').join(' ')

Leave a Reply

Your email address will not be published. Required fields are marked *