Home » Mysql » How to remove all non-alpha numeric characters from a string?

How to remove all non-alpha numeric characters from a string?

Posted by: admin November 1, 2017 Leave a comment

Questions:

I’m working on a routine that compares strings, but for better efficiency I need to remove all characters that are not letters or numbers.

I’m using multiple REPLACE functions now, but maybe there is a faster and nicer solution ?

Answers:

None of these answers worked for me. I had to create my own function called alphanum which stripped the chars for me:

DROP FUNCTION IF EXISTS alphanum; 
DELIMITER | 
CREATE FUNCTION alphanum( str CHAR(32) ) RETURNS CHAR(16) 
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret CHAR(32) DEFAULT ''; 
  DECLARE c CHAR(1); 
  SET len = CHAR_LENGTH( str ); 
  REPEAT 
    BEGIN 
      SET c = MID( str, i, 1 ); 
      IF c REGEXP '[[:alnum:]]' THEN 
        SET ret=CONCAT(ret,c); 
      END IF; 
      SET i = i + 1; 
    END; 
  UNTIL i > len END REPEAT; 
  RETURN ret; 
END | 
DELIMITER ; 

Now I can do:

select 'This works finally!', alphanum('This works finally!');

and I get:

+---------------------+---------------------------------+
| This works finally! | alphanum('This works finally!') |
+---------------------+---------------------------------+
| This works finally! | Thisworksfinally                |
+---------------------+---------------------------------+
1 row in set (0.00 sec)

Hurray!

Questions:
Answers:

From a performance point of view,
(and on the assumption that you read more than you write)

I think the best way would be to pre calculate and store a stripped version of the column,
This way you do the transform less.

You can then put an index on the new column and get the database to do the work for you.

Questions:
Answers:
SELECT teststring REGEXP '[[:alnum:]]+';

SELECT * FROM testtable WHERE test REGEXP '[[:alnum:]]+'; 

See: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
Scroll down to the section that says: [:character_class:]

If you want to manipulate strings the fastest way will be to use a str_udf, see:
https://github.com/hholzgra/mysql-udf-regexp

Questions:
Answers:

I have wrote this UDF, but I am trimming special chars just at the beginning, and convert to lower ase. But you could update this function

DELIMITER //

DROP FUNCTION IF EXISTS DELETE_DOUBLE_SPACES//

CREATE FUNCTION DELETE_DOUBLE_SPACES ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
    DECLARE result VARCHAR(250);
    SET result = REPLACE( title, '  ', ' ' );
    WHILE (result <> title) DO 
        SET title = result;
        SET result = REPLACE( title, '  ', ' ' );
    END WHILE;
    RETURN result;
END//

DROP FUNCTION IF EXISTS LFILTER//

CREATE FUNCTION LFILTER ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
    WHILE (1=1) DO
        IF(  ASCII(title) BETWEEN ASCII('a') AND ASCII('z')
            OR ASCII(title) BETWEEN ASCII('A') AND ASCII('Z')
            OR ASCII(title) BETWEEN ASCII('0') AND ASCII('9')
        ) THEN
            SET title = LOWER( title );
            SET title = REPLACE(
                REPLACE(
                    REPLACE(
                        title,
                        CHAR(10), ' '
                    ),
                    CHAR(13), ' '
                ) ,
                CHAR(9), ' '
            );
            SET title = DELETE_DOUBLE_SPACES( title );
            RETURN title;
        ELSE
            SET title = SUBSTRING( title, 2 );          
        END IF;
    END WHILE;
END//
DELIMITER ;

SELECT LFILTER(' [email protected]#$%^&*()_+1a    b');

Also you could use regular expressions, but will need to install MySql extension

Questions:
Answers:

The fastest way I was able to find (and using ) is with convert().

from Doc. CONVERT() with USING is used to convert data between different character sets.

Example:

convert(string USING ascii)

In your case the right character set will be self defined

NOTE from Doc. The USING form of CONVERT() is available as of 4.1.0.

Questions:
Answers:

Be careful, characters like ’ or » are considered as alpha by MySQL.
It better to use something like :

IF c BETWEEN ‘a’ AND ‘z’ OR c BETWEEN ‘A’ AND ‘Z’ OR c BETWEEN ‘0’ AND
‘9’ OR c = ‘-‘ THEN

Questions:
Answers:

Based on the answer by Ryan Shillington, modified to work with strings longer than 255 characters and preserving spaces from the original string.

FYI there is lower(str) in the end.

I used this to compare strings:

DROP FUNCTION IF EXISTS spacealphanum;
DELIMITER $$
CREATE FUNCTION `spacealphanum`( str TEXT ) RETURNS TEXT CHARSET utf8
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret TEXT DEFAULT ''; 
  DECLARE c CHAR(1); 
  SET len = CHAR_LENGTH( str ); 
  REPEAT 
    BEGIN 
      SET c = MID( str, i, 1 ); 
      IF c REGEXP '[[:alnum:]]' THEN 
        SET ret=CONCAT(ret,c); 
      ELSEIF  c = ' ' THEN
          SET ret=CONCAT(ret," ");
      END IF; 
      SET i = i + 1; 
    END; 
  UNTIL i > len END REPEAT; 
  SET ret = lower(ret);
  RETURN ret; 
  END $$
  DELIMITER ;

Questions:
Answers:

Straight and battletested solution for latin and cyrillic characters:

DELIMITER //

CREATE FUNCTION `remove_non_numeric_and_letters`(input TEXT)
  RETURNS TEXT
  BEGIN
    DECLARE output TEXT DEFAULT '';
    DECLARE iterator INT DEFAULT 1;
    WHILE iterator < (LENGTH(input) + 1) DO
      IF SUBSTRING(input, iterator, 1) IN
         ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю', 'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я')
      THEN
        SET output = CONCAT(output, SUBSTRING(input, iterator, 1));
      END IF;
      SET iterator = iterator + 1;
    END WHILE;
    RETURN output;
  END //

DELIMITER ;

Usage:

-- outputs "hello12356"
SELECT remove_non_numeric_and_letters('hello - 12356-привет ""]')

Questions:
Answers:

I had a similar problem with trying to match last names in our database that were slightly different. For example, sometimes people entered the same person’s name as “McDonald” and also as “Mc Donald”, or “St John” and “St. John”.

Instead of trying to convert the Mysql data, I solved the problem by creating a function (in PHP) that would take a string and create an alpha-only regular expression:

function alpha_only_regex($str) {
    $alpha_only = str_split(preg_replace('/[^A-Z]/i', '', $str));
    return '^[^a-zA-Z]*'.implode('[^a-zA-Z]*', $alpha_only).'[^a-zA-Z]*$';
}

Now I can search the database with a query like this:

$lastname_regex = alpha_only_regex($lastname);
$query = "SELECT * FROM my_table WHERE lastname REGEXP '$lastname_regex';

Questions:
Answers:

So far, the only alternative approach less complicated than the other answers here is to determine the full set of special characters of the column, i.e. all the special characters that are in use in that column at the moment, and then do a sequential replace of all those characters, e.g.

update pages set slug = lower(replace(replace(replace(replace(name, ' ', ''), '-', ''), '.', ''), '&', '')); # replacing just space, -, ., & only

.

This is only advisable on a known set of data, otherwise it’s
trivial for some special characters to slip past with a
blacklist approach instead of a whitelist approach.

Obviously, the simplest way is to pre-validate the data outside of sql due to the lack of robust built-in whitelisting (e.g. via a regex replace).

Questions:
Answers:

Probably a silly suggestion compared to others:

if(!preg_match("/^[a-zA-Z0-9]$/",$string)){
    $sortedString=preg_replace("/^[a-zA-Z0-9]+$/","",$string);
}

Questions:
Answers:

I needed to get only alphabetic characters of a string in a procedure, and did:

SET @source = "whatever you want";
SET @target = '';
SET @i = 1;
SET @len = LENGTH(@source);
WHILE @i <= @len DO
    SET @char = SUBSTRING(@source, @i, 1);
    IF ((ORD(@char) >= 65 && ORD(@char) <= 90) || (ORD(@char) >= 97 && ORD(@char) <= 122)) THEN
        SET @target = CONCAT(@target, @char);
    END IF;
    SET @i = @i + 1;
END WHILE;

Questions:
Answers:

I tried a few solutions but at the end used replace. My data set is part numbers and I fairly know what to expect. But just for sanity, I used PHP to build the long query:

$dirty = array(' ', '-', '.', ',', ':', '?', '/', '!', '&', '@');
$query = 'part_no';
foreach ($dirty as $dirt) {
    $query = "replace($query,'$dirt','')";
}
echo $query;

This outputs something I used to get a headache from:

replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(part_no,' ',''),'-',''),'.',''),',',''),':',''),'?',''),'/',''),'!',''),'&',''),'@','')

Questions:
Answers:

This can be done with a regular expression replacer function I posted in another answer. (It may not be the most efficient solution possible and might look overkill for the job in hand – but is proving to be a bit of a “Swiss army knife” that may come in useful for other reasons.)

It can be seen in action removing all non-alphanumeric characters here: SQL Fiddle demo.

SQL (excluding function code for brevity):

SELECT txt,
       reg_replace(txt,
                   '[^a-zA-Z0-9]+',
                   '',
                   TRUE,
                   0,
                   0
                   ) AS `reg_replaced`
FROM test;

Questions:
Answers:

if you are using php then….

try{
$con = new PDO ("mysql:host=localhost;dbname=dbasename","root","");
}
catch(PDOException $e){
echo "error".$e-getMessage();   
}

$select = $con->prepare("SELECT * FROM table");
$select->setFetchMode(PDO::FETCH_ASSOC);
$select->execute();

while($data=$select->fetch()){ 

$id = $data['id'];
$column = $data['column'];
$column = preg_replace("/[^a-zA-Z0-9]+/", " ", $column); //remove all special characters

$update = $con->prepare("UPDATE table SET column=:column WHERE id='$id'");
$update->bindParam(':column', $column );
$update->execute();

// echo $column."<br>";
} 

Questions:
Answers:

the alphanum function (self answered) have a bug, but I don’t know why.
For text “cas synt ls 75W140 1L” return “cassyntls75W1401”, “L” from the end is missing some how.

Now I use

delimiter //
DROP FUNCTION IF EXISTS alphanum //
CREATE FUNCTION alphanum(prm_strInput varchar(255))
RETURNS VARCHAR(255)
DETERMINISTIC
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE v_char VARCHAR(1);
  DECLARE v_parseStr VARCHAR(255) DEFAULT ' ';
WHILE (i <= LENGTH(prm_strInput) )  DO
  SET v_char = SUBSTR(prm_strInput,i,1);
  IF v_char REGEXP  '^[A-Za-z0-9]+$' THEN 
        SET v_parseStr = CONCAT(v_parseStr,v_char);  
  END IF;
  SET i = i + 1;
END WHILE;
RETURN trim(v_parseStr);
END
//

(found on google)