
Regex to remove non letters

Posted by: admin December 31, 2017

Questions:

I’m trying to remove non-letters from a string. Would this do it:

c = o.replace(o.gsub!(/\W+/, ''))
Answers:

Just gsub! is sufficient:

o.gsub!(/\W+/, '')

Note that gsub! modifies the original object o in place. Also, if o does not contain any non-word characters, gsub! returns nil, so using its return value as the modified string is unreliable.

You probably want this instead:

c = o.gsub(/\W+/, '')
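
The difference between the two methods is easy to check in a short script. This is a minimal sketch: the helper names strip_non_word and strip_non_word! are made up for illustration, and only String#gsub and String#gsub! are actual Ruby methods.

```ruby
# Hypothetical helper names; only String#gsub and String#gsub! are real Ruby API.

# gsub returns a new string and never returns nil.
def strip_non_word(str)
  str.gsub(/\W+/, '')
end

# gsub! mutates str in place and returns nil when nothing was replaced.
def strip_non_word!(str)
  str.gsub!(/\W+/, '')
end

p strip_non_word("abc123")        # a new string, even though nothing changed
p strip_non_word!("abc123".dup)   # nil, because there was nothing to replace
p strip_non_word!("a b!".dup)     # the modified string
```

The .dup calls are only there so the sketch also runs when string literals are frozen.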

Answers:

Remove anything that is not a letter:

> " sd  190i.2912390123.aaabbcd".gsub(/[^a-zA-Z]/, '')
"sdiaaabbcd"

EDIT: As ikegami points out, this doesn’t account for accented characters, umlauts, and similar non-ASCII letters. The right fix depends on what exactly you mean by “not a letter” and on what your input will look like.
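
To make the caveat concrete, here is a small sketch that compares the ASCII-only class with the Unicode letter property \p{L}. The helper names and the sample text are assumptions for illustration, and \p{L} is only one possible definition of "letter".

```ruby
# Hypothetical helper names; the sample text is illustrative only.
# Non-ASCII characters are written as \u escapes so the example does not
# depend on how the source file is encoded or normalized.

# Keeps only ASCII letters, as in the answer above.
def ascii_letters_only(str)
  str.gsub(/[^a-zA-Z]/, '')
end

# Keeps every character in the Unicode "Letter" category.
def unicode_letters_only(str)
  str.gsub(/[^\p{L}]/, '')
end

sample = "M\u00FCller-Stra\u00DFe 12"   # "Müller-Straße 12"
puts ascii_letters_only(sample)     # the umlaut and the sharp s are dropped
puts unicode_letters_only(sample)   # the umlaut and the sharp s are kept
```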

Answers:

That will work in most cases, except when o contains no non-word characters to begin with: gsub! then returns nil, and o.replace(nil) raises a TypeError.
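
A short sketch of both outcomes, wrapping the line from the question in a hypothetical method (question_approach is not part of the original code) so the failure can be observed without aborting the script:

```ruby
# Hypothetical wrapper around the line from the question.
def question_approach(o)
  o.replace(o.gsub!(/\W+/, ''))
rescue TypeError => e
  # Reached when gsub! returned nil and String#replace rejected it.
  "TypeError: #{e.message}"
end

puts question_approach("a b c".dup)  # gsub! changed the string, so this works
puts question_approach("abc".dup)    # gsub! returned nil, so replace raised
```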

If you just want a replaced string, it can be simpler:

c = o.gsub(/\W+/, '')

Answers:

Using \W or \w to select or delete only letters won’t work. \w matches A-Z, a-z, 0-9, and “_”:

irb(main):002:0> characters = (' ' .. "\x7e").to_a.join('')
=> " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
irb(main):003:0> characters.gsub(/\W+/, '')
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

So, stripping using \W preserves digits and underscores.

If you want to match letters, use /[A-Za-z]+/, the POSIX character class [:alpha:] (i.e. /[[:alpha:]]+/), or the Unicode property /\p{ALPHA}/.

The last form is a Unicode property: it covers 'A'..'Z' and 'a'..'z' in ASCII and extends to letters in other scripts when the string is Unicode, so if you have multibyte characters you should probably use it. In current Ruby versions, [[:alpha:]] is Unicode-aware as well.
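
The three forms can be compared side by side. This is a minimal sketch, assuming a UTF-8 string; letter_runs is a hypothetical helper and the sample text is made up.

```ruby
# Hypothetical helper: returns the runs of letters each pattern finds in str.
def letter_runs(str)
  patterns = {
    ascii: /[A-Za-z]+/,       # ASCII letters only
    posix: /[[:alpha:]]+/,    # POSIX bracket expression
    property: /\p{Alpha}+/    # Unicode property
  }
  patterns.transform_values { |re| str.scan(re) }
end

sample = "na\u00EFve caf\u00E9 42"   # "naïve café 42"
letter_runs(sample).each { |name, runs| puts "#{name}: #{runs.inspect}" }
```

On this input the ASCII class splits words at the accented letters, while the other two keep each word whole.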

Answers:

Keep in mind that Ruby considers the underscore _ to be a word character. So if you want to keep underscores (and digits) as well, this should do it:

string.gsub!(/\W+/, '')

Otherwise, you need to do this:

string.gsub!(/[^a-zA-Z]/, '')

Answers:

Use Regexp.union to combine the allowed characters into a single pattern:

allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
cleanstring = dirty_string.chars.select {|c| c =~ allowed}.join("")
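
For comparison, the same whitelist can also be written as one negated character class and passed to gsub. The sketch below is an assumed alternative, not part of the answer above; both method names and the sample string are made up.

```ruby
# The approach from the answer: keep each character that matches the union.
def clean_with_union(dirty_string)
  allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
  dirty_string.chars.select { |c| c =~ allowed }.join
end

# Assumed alternative: the same whitelist as a single negated character class.
def clean_with_gsub(dirty_string)
  dirty_string.gsub(/[^a-zA-Z0-9 \-:().]/, '')
end

sample = "Hi! (ok) a-b: 1.5 #x"
puts clean_with_union(sample)
puts clean_with_gsub(sample)
```

Regexp.union escapes the string arguments for you, which helps when the allowed characters come from data rather than being typed into a pattern by hand.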

Answers:

I don’t see what that o.replace is in there for. If you have a string:

string = 't = 4 6 ^'

And you do:

string.gsub!(/\W+/, '')

You get:

t46

If you want to get rid of the digits too, you can do:

string.gsub!(/\W+|\d+/, '')

And you get:

t
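
One caveat with this pattern: the underscore counts as a word character, so /\W+|\d+/ leaves it in place. A minimal sketch, with made-up helper names and input, that contrasts it with an explicit letter class:

```ruby
# Hypothetical helpers for comparing the two patterns.

# Removes non-word characters and digits; underscores are word characters.
def strip_non_word_and_digits(str)
  str.gsub(/\W+|\d+/, '')
end

# Removes everything except ASCII letters.
def letters_only(str)
  str.gsub(/[^a-zA-Z]/, '')
end

sample = "t_1 = 4 6 ^"
puts strip_non_word_and_digits(sample)  # the underscore survives
puts letters_only(sample)               # only ASCII letters remain
```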