I’m wondering if there is a “best” choice for collation in MySQL for a general website where you aren’t 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.
In the past I have set PHP to output in “UTF-8”, but which collation does this match in MySQL? I’m thinking it’s one of the UTF-8 ones, but I have used
The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.
utf8_general_ci is somewhat faster than
utf8_unicode_ci, but less accurate (for sorting). The specific language utf8 encoding (such as
utf8_swedish_ci) contain additional language rules that make them the most accurate to sort for those languages. Most of the time I use
utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.
You can read more on specific unicode character sets on the MySQL manual – http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
Actually, you probably want to use
utf8_general_cisorts by stripping away all accents and sorting as if it were ASCII
utf8_unicode_ciuses the Unicode sort order, so it sorts correctly in more languages
However, if you are only using this to store English text, these shouldn’t differ.
Be very, very aware of this problem that can occur when using
MySQL will not distinguish between some characters in select statements, if the
utf8_general_ci collation is used. This can lead to very nasty bugs – especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.
This problem exposes itself at the very least in early 5.x versions – I’m not sure if this behaviour as changed later.
I’m no DBA, but to avoid this problem, I always go with
utf8-bin instead of a case-insensitive one.
The script below describes the problem by example.
-- first, create a sandbox to play in CREATE DATABASE `sandbox`; use `sandbox`; -- next, make sure that your client connection is of the same -- character/collate type as the one we're going to test next: charset utf8 collate utf8_general_ci -- now, create the table and fill it with values CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) ) CHARACTER SET utf8 COLLATE utf8_general_ci; INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe'); -- (verify) SELECT * FROM `test`; -- now, expose the problem/bug: SELECT * FROM test WHERE `value` = 'value'; -- -- Note that we get BOTH keys here! MySQLs UTF8 collates that are -- case insensitive (ending with _ci) do not distinguish between -- both values! -- -- collate 'utf8_bin' doesn't have this problem, as I'll show next: -- -- first, reset the client connection charset/collate type charset utf8 collate utf8_bin -- next, convert the values that we've previously inserted in the table ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin; -- now, re-check for the bug SELECT * FROM test WHERE `value` = 'value'; -- -- Note that we get just one key now, as you'd expect. -- -- This problem appears to be specific to utf8. Next, I'll try to -- do the same with the 'latin1' charset: -- -- first, reset the client connection charset/collate type charset latin1 collate latin1_general_ci -- next, convert the values that we've previously inserted -- in the table ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci; -- now, re-check for the bug SELECT * FROM test WHERE `value` = 'value'; -- -- Again, only one key is returned (expected). This shows -- that the problem with utf8/utf8_generic_ci isn't present -- in latin1/latin1_general_ci -- -- To complete the example, I'll check with the binary collate -- of latin1 as well: -- first, reset the client connection charset/collate type charset latin1 collate latin1_bin -- next, convert the values that we've previously inserted in the table ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin; -- now, re-check for the bug SELECT * FROM test WHERE `value` = 'value'; -- -- Again, only one key is returned (expected). -- -- Finally, I'll re-introduce the problem in the exact same -- way (for any sceptics out there): -- first, reset the client connection charset/collate type charset utf8 collate utf8_generic_ci -- next, convert the values that we've previously inserted in the table ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; -- now, re-check for the problem/bug SELECT * FROM test WHERE `value` = 'value'; -- -- Two keys. -- DROP DATABASE sandbox;
It is best to use character set
utf8mb4 with the collation
The character set,
utf8, only supports a small amount of UTF-8 code points, about 6% of possible characters.
utf8 only supports the Basic Multilingual Plane (BMP). There 16 other planes. Each plane contains 65,536 characters.
utf8mb4 supports all 17 planes.
MySQL will truncate 4 byte UTF-8 characters resulting in corrupted data.
utf8mb4 character set was introduced in MySQL 5.5.3 on 2010-03-24.
Some of the required changes to use the new character set are not trivial:
- Changes may need to be made in your application database adapter.
- Changes will need to be made to my.cnf, including setting the character set, the collation and switching innodb_file_format to Barracuda
- SQL CREATE statements may need to include:
- DYNAMIC is required for indexes on VARCHAR(192) and larger.
NOTE: Switching to
Antelope, may require restarting the MySQL service more than once.
innodb_file_format_max does not change until after the MySQL service has been restarted to:
innodb_file_format = barracuda.
MySQL uses the old
Antelope InnoDB file format.
Barracuda supports dynamic row formats, which you will need if you do not want to hit the SQL errors for creating indexes and keys after you switch to the charset:
- #1709 – Index column size too large. The maximum column size is 767 bytes.
- #1071 – Specified key was too long; max key length is 767 bytes
The following scenario has been tested on MySQL 5.6.17:
By default, MySQL is configured like this:
SHOW VARIABLES; innodb_large_prefix = OFF innodb_file_format = Antelope
Stop your MySQL service and add the options to your existing my.cnf:
[client] default-character-set= utf8mb4 [mysqld] explicit_defaults_for_timestamp = true innodb_large_prefix = true innodb_file_format = barracuda innodb_file_format_max = barracuda innodb_file_per_table = true # Character collation character_set_server=utf8mb4 collation_server=utf8mb4_unicode_ci
Example SQL CREATE statement:
CREATE TABLE Contacts ( id INT AUTO_INCREMENT NOT NULL, ownerId INT DEFAULT NULL, created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, contact VARCHAR(640) NOT NULL, prefix VARCHAR(128) NOT NULL, first VARCHAR(128) NOT NULL, middle VARCHAR(128) NOT NULL, last VARCHAR(128) NOT NULL, suffix VARCHAR(128) NOT NULL, notes MEDIUMTEXT NOT NULL, INDEX IDX_CA367725E05EFD25 (ownerId), INDEX created (created), INDEX modified_idx (modified), INDEX contact_idx (contact), PRIMARY KEY(id) ) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
- You can see error #1709 generated for
INDEX contact_idx (contact)if
ROW_FORMAT=DYNAMICis removed from the CREATE statement.
NOTE: Changing the index to limit to the first 128 characters on
contacteliminates the requirement for using Barracuda with
INDEX contact_idx (contact(128)),
Also note: when it says the size of the field is
VARCHAR(128), that is not 128 bytes. You can use have 128, 4 byte characters or 128, 1 byte characters.
INSERT statement should contain the 4 byte ‘poo’ character in the 2 row:
INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES (1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''), (2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '', ''), (3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '123💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩💩', '', '');
You can see the amount of space used by the
mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`; +--------------------+---------------------+ | BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) | +--------------------+---------------------+ | 1024 | 128 | -- All characters are ASCII | 4096 | 128 | -- All characters are 4 bytes | 4024 | 128 | -- 3 characters are ASCII, 125 are 4 bytes +--------------------+---------------------+
In your database adapter, you may want to set the charset and collation for your connection:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
In PHP, this would be set for:
Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.
Example from the documentation for charset unicode:
utf8_general_cialso is satisfactory
for both German and French, except
that ‘ß’ is equal to ‘s’, and not to
‘ss’. If this is acceptable for your
application, then you should use
utf8_general_cibecause it is faster.
it is more accurate.
So – it depends on your expected user base and on how much you need correct sorting. For an English user base,
utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.
Essentially, it depends on how you think of a string.
I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database should be concerned, a string is still just a string. A string is a number of UTF-8 characters. A character has a binary representation so why does it need to know the language you’re using? Usually, people will be constructing databases for systems with the scope for multilingual sites. This is the whole point of using UTF-8 as a character set. I’m a bit of a pureist but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language related rules should be done at a much higher level than the DBMS.
In my books “value” should never in a million years be equal to “valúe”.
If I want to store a text field and do a case insensitive search, I will use MYSQL string functions with PHP functions such as LOWER() and the php function strtolower().
For UTF-8 textual information, you should use
utf8_bin: compare strings by the
binary value of each character in
utf8_general_ci: compare strings
using general language rules and
using case-insensitive comparisons
a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.
The accepted answer fairly definitively suggests using utf8_unicode_ci, and whilst for new projects that’s great, I wanted to relate my recent contrary experience just in case it saves anyone some time.
Because utf8_general_ci is the default collation for Unicode in MySQL, if you want to use utf8_unicode_ci then you end up having to specify it in a lot of places.
For example, all client connections not only have a default charset (makes sense to me) but also a default collation (i.e. the collation will always default to utf8_general_ci for unicode).
Likely, if you use utf8_unicode_ci for your fields, your scripts that connect to the database will need to be updated to mention the desired collation explicitly — otherwise queries using text strings can fail when your connection is using the default collation.
The upshot is that when converting an existing system of any size to Unicode/utf8, you may end up being forced to use utf8_general_ci because of the way MySQL handles defaults.
For the case highlighted by Guus, I would strongly suggest using either utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).
If the field is intended to be searched, as opposed to matched for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, one will losely match (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language specific versions, like utf8_german_ci where the lose matching is more suitable for the language specified.[Edit – nearly 6 years later]
I no longer recommend the “utf8” character set on MySQL, and instead recommend the “utf8mb4” character set. They match almost entirely, but allow for a little (lot) more unicode characters.
Realistically, MySQL should have updated the “utf8” character set and respective collations to match the “utf8” specification, but instead, a separate character set and respective collations as to not impact storage designation for those already using their incomplete “utf8” character set.
I found these collation charts helpful. http://collation-charts.org/mysql60/. I’m no sure which is the used utf8_general_ci though.
For example here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html
In your database upload file, add the followin line before any line:
SET NAMES utf8;
And your problem should be solved.