What is the best collation to use for MySQL with PHP?

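I'm wondering whether there is a "best" collation in MySQL for a general website where you aren't 100% sure what will be entered. I understand that all the encodings should be the same: MySQL, Apache, the HTML, and anything inside PHP.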

In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I assume it is one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.

Answers

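The main differences are sorting accuracy (when comparing characters in a given language) and performance. The only special one is utf8_bin, which compares characters by their binary values.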

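utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional language rules that make them the most accurate for sorting text in those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to a small performance improvement), unless I have a good reason to prefer a specific language.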

You can read more about the specific Unicode character sets in the MySQL manual (http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html).
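As a rough illustration of the choice described above, a collation can be set as the default for a database and overridden per column. This is only a sketch: the names `example_site`, `articles`, `title`, and `title_sv` are hypothetical, and it assumes a MySQL version that still accepts the `utf8` collation names used in this thread.

```sql
-- Hypothetical database and table names, for illustration only.
CREATE DATABASE `example_site`
    CHARACTER SET utf8 COLLATE utf8_unicode_ci;
USE `example_site`;

-- Columns inherit the database/table default unless they override it.
CREATE TABLE `articles` (
    `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `title` VARCHAR(255) NOT NULL,
    -- A single column can use a language-specific collation instead.
    `title_sv` VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_swedish_ci
);

-- The Collation column of the output shows what each column ended up with.
SHOW FULL COLUMNS FROM `articles`;
```

Setting the default at the database level keeps new tables consistent, while the per-column override is there for the "good reason to prefer a specific language" case.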

Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

  • utf8_general_ci sorts by stripping away all accents and sorting as if the text were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using it to store English text, the two should not differ.
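One way to see how the collations differ is to compare the same literals under each of them. The queries below are a minimal sketch, assuming a UTF-8 client connection and a MySQL version that accepts these `utf8` collation names; the sample strings are my own, not from the thread.

```sql
-- Make the client connection UTF-8 so the literals are read correctly.
SET NAMES utf8;

-- Compare an accented and an unaccented spelling under each collation.
-- Each column is 1 if the strings compare equal, 0 otherwise.
SELECT
    'résumé' = 'resume' COLLATE utf8_general_ci AS general_ci_equal,
    'résumé' = 'resume' COLLATE utf8_unicode_ci AS unicode_ci_equal,
    'résumé' = 'resume' COLLATE utf8_bin        AS bin_equal;

-- The German sharp s is a case where the two _ci collations disagree.
SELECT
    'ß' = 'ss' COLLATE utf8_general_ci AS general_ci_equal,
    'ß' = 'ss' COLLATE utf8_unicode_ci AS unicode_ci_equal;
```

The MySQL manual describes ß as equal to "ss" under utf8_unicode_ci but equal to "s" under utf8_general_ci, which is the kind of difference the second query is meant to surface.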

Be aware of the following problem, which can occur when using utf8_general_ci.

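With the utf8_general_ci collation, MySQL does not distinguish between some characters in select statements. This can lead to very nasty bugs, especially where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow a malicious user to create a username that matches an administrator account.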

This problem shows up at least in early 5.x versions. I'm not sure whether this behaviour changed later.

I'm no DBA, but to avoid this problem I always use utf8_bin instead of a case-insensitive collation.

The script below demonstrates the problem by example.

-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;

-- next, make sure that your client connection is of the same 
-- character/collate type as the one we're going to test next:
SET NAMES utf8 COLLATE utf8_general_ci;

-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
    CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');

-- (verify)
SELECT * FROM `test`;

-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get BOTH keys here! MySQL's utf8 collations that are
-- case insensitive (ending with _ci) also ignore accents, so they do
-- not distinguish between 'value' and 'valúe'!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--

-- first, reset the client connection charset/collate type
SET NAMES utf8 COLLATE utf8_bin;

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to 
-- do the same with the 'latin1' charset:
--

-- first, reset the client connection charset/collate type
SET NAMES latin1 COLLATE latin1_general_ci;

-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected). This shows 
-- that the problem with utf8/utf8_general_ci isn't present 
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:

-- first, reset the client connection charset/collate type
SET NAMES latin1 COLLATE latin1_bin;

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same 
-- way (for any sceptics out there):

-- first, reset the client connection charset/collate type
SET NAMES utf8 COLLATE utf8_general_ci;

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Two keys.
--

DROP DATABASE sandbox;
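
If the username concern above applies to your application, one possible approach is to use utf8_bin only on the column that must be compared exactly, and keep a friendlier collation for display text. The following is a sketch under assumptions: the table `users` and its columns are hypothetical, and it should be run in a scratch database of your choosing.

```sql
-- Hypothetical table and column names, for illustration only.
-- Assumes a scratch database has already been selected with USE.
CREATE TABLE `users` (
    `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -- Binary collation: 'admin', 'Admin' and 'ádmin' are distinct values.
    `username` VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    -- Display text can keep a language-aware collation for sorting.
    `display_name` VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
    UNIQUE KEY `uniq_username` (`username`)
);

INSERT INTO `users` (`username`) VALUES ('admin');

-- Accepted as a separate row under utf8_bin. With a _ci collation on the
-- column, this would be expected to collide with 'admin' on the unique key.
INSERT INTO `users` (`username`) VALUES ('ádmin');
```

Note the trade-off: a binary collation also makes usernames case sensitive, so an application may still want to normalise names (for example to lower case) before storing them.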