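I'm wondering whether there is a "best" choice of collation in MySQL for a general website where you aren't 100% sure what will be entered. I understand that all the encodings should be the same everywhere: MySQL, Apache, the HTML, and anything inside PHP.
In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I assume it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.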
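The main differences are sorting accuracy (when comparing characters in a given language) and performance. The only special one is utf8_bin, which compares characters by their binary values.
utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional language rules that make them the most accurate for sorting in those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to a small performance gain), unless I have a good reason to prefer a specific language.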
You can read more about the specific Unicode character sets in the MySQL manual (http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html).
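As a rough illustration of the choices described above, here is a minimal sketch of listing the collations a server offers and declaring one as a default. The database name site_demo, the table articles, and its columns are hypothetical names chosen for the example; on newer servers the utf8 character set may be reported as utf8mb3.

```sql
-- List the utf8-family collations this server knows about.
SHOW COLLATION LIKE 'utf8%';

-- Hypothetical database whose tables default to utf8_unicode_ci.
CREATE DATABASE site_demo CHARACTER SET utf8 COLLATE utf8_unicode_ci;
USE site_demo;

-- A single column can override the default, e.g. to get Swedish sorting rules.
CREATE TABLE articles (
  title    VARCHAR(100),
  title_sv VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_swedish_ci
);

-- Inspect the collation each column ended up with.
SHOW FULL COLUMNS FROM articles;

-- Clean up the hypothetical database.
DROP DATABASE site_demo;
```

Setting the default at the database level and overriding only where a language-specific order matters keeps most columns on one collation, which avoids mixed-collation comparisons later.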
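Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.
utf8_general_ci sorts, roughly speaking, by stripping away all accents and sorting as if the text were ASCII.
utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages.
However, if you are only using this to store English text, the two shouldn't differ.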
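One documented place where the two collations disagree is the German sharp s: the MySQL manual gives ß = s for utf8_general_ci and ß = ss for utf8_unicode_ci. The sketch below compares the same word under both collations. The database collation_compare and the table words are hypothetical names, and the script assumes the client really sends UTF-8 bytes.

```sql
-- Hypothetical sandbox for the comparison.
CREATE DATABASE collation_compare CHARACTER SET utf8;
USE collation_compare;

-- Assumption: the client terminal or driver sends UTF-8.
SET NAMES utf8;

-- The same text stored under each of the two collations.
CREATE TABLE words (
  w_general VARCHAR(16) CHARACTER SET utf8 COLLATE utf8_general_ci,
  w_unicode VARCHAR(16) CHARACTER SET utf8 COLLATE utf8_unicode_ci
);
INSERT INTO words VALUES ('Straße', 'Straße');

-- Each comparison uses the collation of the column on its left-hand side.
-- Going by the manual's rules, general_match should be 0 and unicode_match 1.
SELECT w_general = 'Strasse' AS general_match,
       w_unicode = 'Strasse' AS unicode_match
FROM words;

DROP DATABASE collation_compare;
```

For plain English text neither rule comes into play, which is why the two collations should behave the same there.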
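Be very, very aware of this problem that can occur when using utf8_general_ci.
If the utf8_general_ci collation is used, MySQL will not distinguish between some characters in select statements. This can lead to very nasty bugs, especially where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username that matches an administrator account.
This problem shows up at least in early 5.x versions - I'm not sure whether the behaviour changed later.
I'm no DBA, but to avoid this problem I always go with utf8_bin instead of a case-insensitive collation.
The script below describes the problem by example.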
-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;
-- next, make sure that your client connection is of the same
-- character/collate type as the one we're going to test next:
SET NAMES utf8 COLLATE utf8_general_ci;
-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');
-- (verify)
SELECT * FROM `test`;
-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get BOTH keys here! MySQL's utf8 collations that are
-- case insensitive (ending with _ci) do not distinguish between
-- the two values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--
-- first, reset the client connection charset/collate type
SET NAMES utf8 COLLATE utf8_bin;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to
-- do the same with the 'latin1' charset:
--
-- first, reset the client connection charset/collate type
SET NAMES latin1 COLLATE latin1_general_ci;
-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected). This shows
-- that the problem with utf8/utf8_general_ci isn't present
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:
-- first, reset the client connection charset/collate type
SET NAMES latin1 COLLATE latin1_bin;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same
-- way (for any sceptics out there):
-- first, reset the client connection charset/collate type
SET NAMES utf8 COLLATE utf8_general_ci;
-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';
--
-- Two keys.
--
DROP DATABASE sandbox;
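If converting a whole table to utf8_bin is not an option, one possible alternative is to force a binary comparison in just the query that needs it. This is only a sketch under assumptions: the database collation_sketch and the table accounts are hypothetical, and the client is assumed to send UTF-8. No expected row counts are given here because I have not run it against every server version.

```sql
-- Hypothetical sandbox, separate from the script above.
CREATE DATABASE collation_sketch;
USE collation_sketch;

-- Assumption: the client terminal or driver sends UTF-8.
SET NAMES utf8;

-- The table keeps the accent-insensitive collation by default.
CREATE TABLE accounts (
  name VARCHAR(16)
) CHARACTER SET utf8 COLLATE utf8_general_ci;
INSERT INTO accounts VALUES ('admin'), ('ádmin');

-- Default comparison: uses the column's utf8_general_ci collation.
SELECT name FROM accounts WHERE name = 'admin';

-- Per-query override: an explicit COLLATE on the column takes precedence,
-- so this comparison is done with utf8_bin instead.
SELECT name FROM accounts WHERE name COLLATE utf8_bin = 'admin';

DROP DATABASE collation_sketch;
```

The COLLATE clause is applied to the column rather than to the literal because the column is known to be utf8, whereas the literal takes whatever character set the connection uses. Note that wrapping the column this way may prevent an index on name from being used for that query.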