Home » Mysql » How can I find non-ASCII characters in MySQL?

How can I find non-ASCII characters in MySQL?

Posted by: admin October 29, 2017 Leave a comment

Questions:

I’m working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?

Answers:

It depends exactly what you’re defining as “ASCII”, but I would suggest trying a variant of a query like this:

SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9]';

That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:

SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';

The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.

Questions:
Answers:

MySQL provides comprehensive character set management that can help with this kind of problem.

SELECT whatever
  FROM tableName 
 WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)

The CONVERT(col USING charset) function will turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.

See this for more discussion. http://dev.mysql.com/doc/refman/5.7/en/charset-repertoire.html

You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won’t render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)

Questions:
Answers:

You can define ASCII as all characters that have a decimal value of 0 – 127 (0x00 – 0x7F) and find columns with non-ASCII characters using the following query

SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';

This was the most comprehensive query I could come up with.

Questions:
Answers:

This is probably what you’re looking for:

select * from TABLE where COLUMN regexp '[^ -~]';

It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).

Questions:
Answers:

One missing character from everyone’s examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:

select * from TABLE where COLUMN like '%
select * from TABLE where COLUMN like '%\0%';
%';

Questions:
Answers:

Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:

SELECT * FROM `table` WHERE NOT `field` REGEXP  "[\x00-\xFF]|^$";

It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike @Ollie’s answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)

It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:

SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP  "[\x00-\xFF]";

It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.

Note that if your default character set is something bizarre where 0x00-0xFF don’t map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!

Questions:
Answers:

Try Using this query for searching special character records

SELECT *
FROM tableName
WHERE fieldName REGEXP '[^[email protected]:. \'\-`,\&]'