Team LiB
Previous Section Next Section

Categories of Data Values

MySQL knows about several general categories in which data values can be represented. These include numbers, string values, temporal values such as dates and times, spatial values, and the NULL value.

Numeric Values

Numbers are values such as 48 or 193.62. MySQL understands numbers specified as integers (which have no fractional part) and floating-point or fixed-point values (which may have a fractional part). Integers can be specified in decimal or hexadecimal format.

An integer consists of a sequence of digits with no decimal point. In numeric contexts, an integer can be specified as a hexadecimal constant and is treated as a 64-bit integer. For example, 0x10 is 16 decimal. Hexadecimal values are treated as strings by default, so their syntax is given in the next section, "String Values."

A floating-point or fixed-point number consists of a sequence of digits, a decimal point, and another sequence of digits. The sequence of digits before or after the decimal point may be empty, but not both.

MySQL understands scientific notation. This is indicated by immediately following an integer or floating-point number with 'e' or 'E', a sign character ('+' or '-'), and an integer exponent. 1.34E+12 and 43.27e-1 are legal numbers in scientific notation. The number 1.34E12 is also legal even though it is missing an optional sign character before the exponent.

Hexadecimal numbers cannot be used in scientific notation; the 'e' that begins the exponent part is also a legal hex digit and thus would be ambiguous.

Any number can be preceded by a plus or minus sign character ('+' or '-'), to indicate a positive or negative value.

As of MySQL 5.0.3, bit-field values can be written as b'val', where val consists of one or more binary digits (0 or 1). For example, b'1001' is 9 decimal. This notation coincides with the introduction of the BIT data type, but bit-field values can be used more generally in other contexts.

mysql> SELECT b'101010' + 0;
+---------------+
| b'101010' + 0 |
+---------------+
|            42 |
+---------------+
mysql> SELECT CHAR(b'1100001');
+------------------+
| CHAR(b'1100001') |
+------------------+
| a                |
+------------------+

String Values

Strings are values such as 'Madison, Wisconsin', 'patient shows improvement', or even '12345' (which looks like a number, but isn't). Usually, you can use either single or double quotes to surround a string value, but there are two reasons to stick with single quotes:

  • The SQL standard specifies single quotes, so statements that use single-quoted strings are more portable to other database engines.

  • If the ANSI_QUOTES SQL mode is enabled, it treats the double quote as an identifier quoting character, not as a string quoting character. This means that a double-quoted value must refer to something like a database or table.

For the examples that use the double quote as a string quoting character in the discussion that follows, assume that ANSI_QUOTES mode is not enabled.

MySQL recognizes several escape sequences within strings that indicate special characters, as shown in Table 3.1. Each sequence begins with a backslash character ('\') to signify a temporary escape from the usual rules for character interpretation. Note that a NUL byte is not the same as the SQL NULL value; NUL is a zero-valued byte, whereas NULL in SQL signifies the absence of a value.

Table 3.1. String Escape Sequences

Sequence

Meaning

\0

NUL (zero-valued byte)

\'

Single quote

\"

Double quote

\b

Backspace

\n

Newline (linefeed)

\r

Carriage return

\t

Tab

\\

Single backslash

\Z

Ctrl-Z (Windows EOF character)


The escape sequences shown in the table are case sensitive, and any character not listed in the table is interpreted as itself if preceded by a backslash. For example, \t is a tab, but \T is an ordinary 'T' character.

The table shows how to escape single or double quotes using backslash sequences, but you actually have several options for including quote characters within string values:

  • Double the quote character if the string itself is quoted using the same character:

    'I can''t'
    "He said, ""I told you so."""
    

  • Quote the string with the other quote character. In this case, you do not double the quote characters within the string:

    "I can't"
    'He said, "I told you so."'
    

  • Escape the quote character with a backslash; this works regardless of the quote characters used to quote the string:

    'I can\'t'
    "I can\'t"
    "He said, \"I told you so.\""
    'He said, \"I told you so.\"'
    

To turn off the special meaning of backslash and treat it as an ordinary character, enable the NO_BACKSLASH_ESCAPES SQL mode, which is available as of MySQL 5.0.2.

As an alternative to using quotes for writing string values, you can use two forms of hexadecimal notation. The first consists of '0x' followed by one or more hexadecimal digits ('0' through '9' and 'a' through 'f'). For example, 0x0a is 10 decimal, and 0xffff is 65535 decimal. The non-decimal hex digits ('a' through 'f') can be specified in uppercase or lowercase, but the leading '0x' cannot be given as '0X'. That is, 0x0a and 0x0A are legal hexadecimal values, but 0X0a and 0X0A are not. In string contexts, pairs of hexadecimal digits are interpreted as 8-bit numeric byte values in the range from 0 to 255, and the result is used as a string. In numeric contexts, a hexadecimal constant is treated as a number. The following statement illustrates the interpretation of a hex constant in each type of context:

mysql> SELECT 0x61626364, 0x61626364+0;
+------------+--------------+
| 0x61626364 | 0x61626364+0 |
+------------+--------------+
| abcd       |   1633837924 |
+------------+--------------+

If a hexadecimal value written using 0x notation has an odd number of hex digits, MySQL treats it as though the value has a leading zero. For example, 0xa is treated as 0x0a.

String values may also be specified using the standard SQL notation X'val', where val consists of pairs of hexadecimal digits. As with 0x notation, such values are interpreted as strings, but may be used as numbers in a numeric context:

mysql> SELECT X'61626364', X'61626364'+0;
+-------------+---------------+
| X'61626364' | X'61626364'+0 |
+-------------+---------------+
| abcd        |    1633837924 |
+-------------+---------------+

Unlike 0x notation, the leading 'X' is not case sensitive:

mysql> SELECT X'61', x'61';
+-------+-------+
| X'61' | x'61' |
+-------+-------+
| a     | a     |
+-------+-------+

Properties of Binary and Non-Binary Strings

String values fall into two general categories, binary and non-binary:

  • A binary string is a sequence of bytes. These bytes are interpreted without respect to any concept of character set. A binary string has no special comparison or sorting properties. Comparisons are done byte by byte based on numeric byte values. Trailing spaces are significant in comparisons.

  • A non-binary string is a sequence of characters. It is associated with a character set, which determines the allowable characters that may be used and how MySQL interprets the string contents. Character sets have one or more collating (sorting) orders. The particular collation used for a string determines the ordering of characters in the character set, which affects comparison operations. Trailing spaces are not significant in comparisons. The default character set and collation are latin1 and latin1_swedish_ci.

Character units vary in their storage requirements. A single-byte character set such as latin1 uses one byte per character, but there also are multi-byte character sets in which some or all characters require more than one byte. For example, both of the Unicode character sets available in MySQL are multi-byte. ucs2 is a double-byte character set in which each character requires two bytes. utf8 is a variable-length multi-byte character set with characters that take from one to three bytes.

To find out which character sets and collations are available in your server as it currently is configured, use these two statements:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | ISO 8859-1 West European    | latin1_swedish_ci   |      1 |
...
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
...
+----------+-----------------------------+---------------------+--------+
mysql> SHOW COLLATION;
+----------------------+----------+-----+---------+----------+---------+
| Collation            | Charset  | Id  | Default | Compiled | Sortlen |
+----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci      | big5     |   1 | Yes     | Yes      |       1 |
| big5_bin             | big5     |  84 |         | Yes      |       1 |
...
| latin1_german1_ci    | latin1   |   5 |         |          |       0 |
| latin1_swedish_ci    | latin1   |   8 | Yes     | Yes      |       1 |
| latin1_danish_ci     | latin1   |  15 |         |          |       0 |
| latin1_german2_ci    | latin1   |  31 |         | Yes      |       2 |
| latin1_bin           | latin1   |  47 |         | Yes      |       1 |
| latin1_general_ci    | latin1   |  48 |         |          |       0 |
| latin1_general_cs    | latin1   |  49 |         |          |       0 |
| latin1_spanish_ci    | latin1   |  94 |         |          |       0 |
...
+----------------------+----------+-----+---------+----------+---------+

As shown by the output from SHOW COLLATION, each collation is specific to a given character set, but a given character set might have several collations. Collation names usually consist of a character set name, a language, and an additional suffix. For example, utf8_icelandic_ci is a collation for the utf8 Unicode character set in which comparisons follow Icelandic sorting rules and characters are compared in case-insensitive fashion. Collation suffixes have the following meanings:

  • _ci indicates a case-insensitive collation.

  • cs indicates a case-sensitive collation.

  • _bin indicates a binary collation. That is, comparisons are based on character code values without reference to any language. For this reason, _bin collation names do not include any language name. Examples: latin1_bin and utf8_bin.

The sorting properties for binary and non-binary strings differ as follows:

  • Binary strings are processed byte by byte in comparisons based solely on the numeric value of each byte. One implication of this property is that binary values appear to be case sensitive, but that actually is a side effect of the fact that uppercase and lowercase versions of a character have different numeric byte values. There isn't really any notion of lettercase for binary strings. Lettercase is a function of collation, which applies only to character (non-binary) strings.

  • Non-binary strings are processed character by character in comparisons, and the relative value of each character is determined by the collating sequence that is used for the character set. For most collations, uppercase and lowercase versions of a given letter have the same collating value, so non-binary string comparisons typically are not case sensitive. However, that is not true for case-sensitive or binary collations.

Because collations are used for comparison and sorting, they affect many operations:

  • Comparisons operators: <, <=, =, <>, >=, >, and LIKE.

  • Sorting: ORDER BY, MIN(), and MAX().

  • Grouping: GROUP BY and DISTINCT.

To determine the character set or collation of a string, you can use the CHARSET() and COLLATION() functions.

Quoted string literals are interpreted according to the current server settings. The default character set and collation are latin1 and latin1_swedish_ci:

mysql> SELECT CHARSET('abcd'), COLLATION('abcd');
+-----------------+-------------------+
| CHARSET('abcd') | COLLATION('abcd') |
+-----------------+-------------------+
| latin1          | latin1_swedish_ci |
+-----------------+-------------------+

MySQL treats hexadecimal constants as binary strings by default:

mysql> SELECT CHARSET(0x0123), COLLATION(0x123);
+-----------------+------------------+
| CHARSET(0x0123) | COLLATION(0x123) |
+-----------------+------------------+
| binary          | binary           |
+-----------------+------------------+

Two forms of notation can be used to force a string literal to be interpreted with a given character set:

  • A string constant can be designated for interpretation with a given character set using the following notation, where charset is the name of a supported character set:

    _charset str
    

    The _charset notation is called a "character set introducer." The string can be written as a quoted string or as a hexadecimal value. The following examples show how to cause strings to be interpreted in the latin2 and utf8 character sets:

    _latin2 'abc'
    _latin2 0x616263
    _latin2 X'616263'
    _utf8 'def'
    _utf8 0x646566
    _utf8 X'646566'
    

    For quoted strings, whitespace is optional between the introducer and the following string. For hexadecimal values, whitespace is required.

  • The notation N'str' is equivalent to _utf8'str'. N must be followed immediately by a quoted literal string with no intervening whitespace.

Introducer notation works for literal quoted strings or hexadecimal constants, but not for string expressions or column values. However, any string or string expression can be used to produce a string in a designated character set using the CONVERT() function:

CONVERT(str USING charset);

Introducers and CONVERT() are not the same. An introducer does not change the string value; it merely modifies how the string is interpreted. CONVERT() takes a string argument and produces a new string in the desired character set. To see the difference between introducers and CONVERT(), consider the following two statements that refer to the ucs2 double-byte character set:

mysql> SET @s1 = _ucs2 'ABCD';
mysql> SET @s2 = CONVERT('ABCD' USING ucs2);

Assume that the default character set is latin1 (a single-byte character set). The first statement interprets each pair of characters in the string 'ABCD' as a single double-byte ucs2 character, resulting in a two-character ucs2 string. The second statement converts each character of the string 'ABCD' to the corresponding ucs2 character, resulting in a four-character ucs2 string.

What is the "length" of each string? It depends. If you measure with CHAR_LENGTH(), you get the length in characters. If you measure with LENGTH(), you get the length in bytes:

mysql> SELECT CHAR_LENGTH(@s1), LENGTH(@s1), CHAR_LENGTH(@s2), LENGTH(@s2);
+------------------+-------------+------------------+-------------+
| CHAR_LENGTH(@s1) | LENGTH(@s1) | CHAR_LENGTH(@s2) | LENGTH(@s2) |
+------------------+-------------+------------------+-------------+
|                2 |           4 |                4 |           8 |
+------------------+-------------+------------------+-------------+

Here is a somewhat subtle point: A binary string is not the same thing as a non-binary string that has a binary collation. The binary string has no character set. It is interpreted with byte semantics and comparisons use single-byte numeric codes. A non-binary string with a binary collation has character semantics and comparisons use numeric character values that might be based on multiple bytes per character.

Here's one way to see the difference between binary and non-binary strings with regard to lettercase. Create a binary string and a non-binary string that has a binary collation, and then pass each string to the UPPER() function:

mysql> SET @s1 = BINARY 'abcd';
mysql> SET @s2 = _latin1 'abcd' COLLATE latin1_bin;
mysql> SELECT UPPER(@s1), UPPER(@s2);
+------------+------------+
| UPPER(@s1) | UPPER(@s2) |
+------------+------------+
| abcd       | ABCD       |
+------------+------------+

Why doesn't UPPER() convert the binary string to uppercase? This occurs because it has no character set, so there is no way to know which byte values correspond to uppercase or lowercase characters. To use a binary string with functions such as UPPER() and LOWER(), you must first convert it to a non-binary string:

mysql> SELECT @s1,  UPPER(CONVERT(@s1 USING latin1));
+------+----------------------------------+
| @s1  | UPPER(CONVERT(@s1 USING latin1)) |
+------+----------------------------------+
| abcd | ABCD                             |
+------+----------------------------------+

Character SetRelated System Variables

The server maintains several system variables that are involved in various aspects of character set support. Six of these variables refer to character sets and three refer to collations. Each of the collation variables is linked to a corresponding character set variable.

  • character_set_system indicates the character set used for storing identifiers. This is always utf8.

  • character_set_server and collation_server indicate the server's default character set and collation.

  • character_set_database and collation_database indicate the character set and collation of the default database. These are read-only and set automatically by the server whenever you select a default database. If there is no default database, they're set to the server's default character set and collation. These variables come into play when you create a table but specify no explicit character set or collation. In this case, the table defaults are taken from the database defaults.

  • The remaining variables influence how communication occurs between the client and the server:

    • character_set_client indicates the character set used for SQL statements that the client sends to the server.

    • character_set_results indicates the character set used for results that the server returns to the client. "Results" include data values and also metadata such as column names.

    • character_set_connection is used by the server. When it receives a statement string from the client, it converts the string from character_set_client to character_set_connection and works with the statement in the latter character set. (There is an exception: Any literal string in the statement that is preceded by a character set introducer is interpreted using the character set indicated by the introducer.) collation_connection is used for comparisons between literal strings within statement strings.

Very likely you'll find that most character set and collation variables are set to the same value by default. For example, the following output indicates that client/server communication takes place using the latin1 character set:

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
mysql> SHOW VARIABLES LIKE 'collation\_%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

A client that wants to talk to the server using another character set can change the communication-related variables. For example, if you want to use utf8, change three variables:

mysql> SET character_set_client = utf8;
mysql> SET character_set_results = utf8;
mysql> SET character_set_connection = utf8;

However, it's more convenient to use a SET NAMES statement for this purpose. The following statement is equivalent to the preceding three SET statements:

mysql> SET NAMES 'utf8';

One restriction on setting the communication character set is that you cannot use ucs2.

Many client programs support a --default-character-set option that produces the same effect as a SET NAMES statement by informing the server of the desired communication character set.

For variables that come in pairs (a character set variable and a collation variable), the members of the pair are linked in the following ways:

  • Setting the character set variable also sets the associated collation variable to the default collation for the character set.

  • Setting the collation variable also sets the associated character set variable to the character set implied by the first part of the collation name.

For example, setting character_set_connection to utf8 sets collation_connection to utf8_general_ci. Setting collation_connection to latin1_spanish_ci sets character_set_connection to latin1.

Date and Time (Temporal) Values

Dates and times are values such as '2005-06-17' or '12:30:43'. MySQL also understands combined date/time values, such as '2005-06-17 12:30:43'. Take special note of the fact that MySQL represents dates in year-month-day order. This often surprises newcomers to MySQL, although this is standard SQL format (also known as "ISO 8601" format). You can display date values any way you like using the DATE_FORMAT() function, but the default display format lists the year first, and input values must be specified with the year first.

Spatial Values

MySQL 4.1 and up supports spatial values, although currently only for MyISAM tables. This capability allows representation of values such as points, lines, and polygons. For example, the following statement uses the text representation of a point value with X and Y coordinates of (10, 20) to create a POINT and assigns the result to a user-defined variable:

SET @pt = POINTFROMTEXT('POINT(10 20)');

The NULL Value

NULL is something of a "typeless" value. Generally, it's used to mean "no value," "unknown value," "missing value," "out of range," "not applicable," "none of the above," and so forth. You can insert NULL values into tables, retrieve them from tables, and test whether a value is NULL. However, you cannot perform arithmetic on NULL values; if you try, the result is NULL. Also, many functions return NULL if you invoke them with a NULL argument.

    Team LiB
    Previous Section Next Section