12.3.6 字符串文字的字符集和排序规则
Every character string literal has a character set and a collation.
For the simple statement SELECT '
, the string has the connection default character set and collation defined by the string
'character_set_connection
and collation_connection
system variables.
A character string literal may have an optional character set introducer and COLLATE
clause, to designate it as a string that uses a particular character set and collation:
[_charset_name]'string' [COLLATE collation_name]
The _
expression is formally called an introducer. It tells the parser, “the string that follows uses character set charset_name
charset_name
.” An introducer does not change the string to the introducer character set like CONVERT()
would do. It does not change the string value, although padding may occur. The introducer is just a signal. See Section 12.3.8, “Character Set Introducers”.
Examples:
SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci;
Character set introducers and the COLLATE
clause are implemented according to standard SQL specifications.
MySQL determines the character set and collation of a character string literal in the following manner:
-
If both
_charset_name
andCOLLATE
are specified, character setcollation_name
charset_name
and collationcollation_name
are used.collation_name
must be a permitted collation forcharset_name
. -
If
_charset_name
is specified butCOLLATE
is not specified, character setcharset_name
and its default collation are used. To see the default collation for each character set, use theSHOW CHARACTER SET
statement or query theINFORMATION_SCHEMA
CHARACTER_SETS
table. -
If
_charset_name
is not specified butCOLLATE
is specified, the connection default character set given by thecollation_name
character_set_connection
system variable and collationcollation_name
are used.collation_name
must be a permitted collation for the connection default character set. -
Otherwise (neither
_charset_name
norCOLLATE
is specified), the connection default character set and collation given by thecollation_name
character_set_connection
andcollation_connection
system variables are used.
Examples:
-
A nonbinary string with
latin1
character set andlatin1_german1_ci
collation:SELECT _latin1'Müller' COLLATE latin1_german1_ci;
-
A nonbinary string with
utf8mb4
character set and its default collation (that is,utf8mb4_0900_ai_ci
):SELECT _utf8mb4'Müller';
-
A binary string with
binary
character set and its default collation (that is,binary
):SELECT _binary'Müller';
-
A nonbinary string with the connection default character set and
utf8mb4_0900_ai_ci
collation (fails if the connection character set is notutf8mb4
):SELECT 'Müller' COLLATE utf8mb4_0900_ai_ci;
-
A string with the connection default character set and collation:
SELECT 'Müller';
An introducer indicates the character set for the following string, but does not change how the parser performs escape processing within the string. Escapes are always interpreted by the parser according to the character set given by character_set_connection
.
The following examples show that escape processing occurs using character_set_connection
even in the presence of an introducer. The examples use SET NAMES
(which changes character_set_connection
, as discussed in Section 12.4, “Connection Character Sets and Collations”), and display the resulting strings using the HEX()
function so that the exact string contents can be seen.
Example 1:
mysql> SET NAMES latin1;
mysql> SELECT HEX('à\n'), HEX(_sjis'à\n');
+------------+-----------------+
| HEX('à\n') | HEX(_sjis'à\n') |
+------------+-----------------+
| E00A | E00A |
+------------+-----------------+
Here, à
(hexadecimal value E0
) is followed by \n
, the escape sequence for newline. The escape sequence is interpreted using the character_set_connection
value of latin1
to produce a literal newline (hexadecimal value 0A
). This happens even for the second string. That is, the _sjis
introducer does not affect the parser's escape processing.
Example 2:
mysql> SET NAMES sjis;
mysql> SELECT HEX('à\n'), HEX(_latin1'à\n');
+------------+-------------------+
| HEX('à\n') | HEX(_latin1'à\n') |
+------------+-------------------+
| E05C6E | E05C6E |
+------------+-------------------+
Here, character_set_connection
is sjis
, a character set in which the sequence of à
followed by \
(hexadecimal values 05
and 5C
) is a valid multibyte character. Hence, the first two bytes of the string are interpreted as a single sjis
character, and the \
is not interpreted as an escape character. The following n
(hexadecimal value 6E
) is not interpreted as part of an escape sequence. This is true even for the second string; the _latin1
introducer does not affect escape processing.