When you get down to it, computers both store data and execute operations on that data using a simple language: 1’s and 0’s. For basic arithmetic or logic this works well as is, but a higher level of abstraction is needed to efficiently accomplish things like text manipulation. Character sets do this by assigning a value to each character so that users can see an “A” rather than a number or a series of 1’s and 0’s.
Modern computers and programs do a good job handling character sets behind the scenes, but it is beneficial to understand their inner workings so related issues are less mysterious and easier to address. This is a discussion of charsets: issues with content management and migration, path handling, file handling, and language support.
In the 1960’s, the American Standards Association established the ASCII character set, which contained encodings for the English alphabet in upper and lower case, numbers, punctuation, and some other commonly used characters. ASCII used 7 bits of data to describe each character, meaning there were 128 total characters (values 0 through 127) that could be represented. In the 70’s, new systems were produced which could handle 8 bits. This meant that values 128 through 255 were available to represent additional characters. No standard was created for this upper range, so different vendors and countries encoded non-English characters in conflicting ways.
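As a quick illustration of the arithmetic above, here is a minimal Python sketch. The variable names are only for illustration; `ord`, `chr`, and `format` are Python built-ins.

```python
# Number of distinct values a 7-bit and an 8-bit code can represent.
seven_bit_values = 2 ** 7   # 128 values, numbered 0 through 127
eight_bit_values = 2 ** 8   # 256 values, numbered 0 through 255

# A character set is a mapping between characters and numbers.
code = ord("A")              # the number assigned to "A" (65)
bits = format(code, "07b")   # the same number written as 7 binary digits
char = chr(code)             # and back from the number to the character

print(seven_bit_values, eight_bit_values, code, bits, char)
```

The value 65 and its binary form match the row for uppercase A in the table that follows.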
Here is a table showing the 7-bit ASCII characters along with one common 8-bit extension (the 128-159 range shown here matches Windows-1252):
DEC | OCT | HEX | BIN | Symbol | HTML | Name | Description |
7-bit Characters (0-127): | |||||||
0 | 000 | 00 | 00000000 | NUL | � | Null char | |
1 | 001 | 01 | 00000001 | SOH |  | Start of Heading | |
2 | 002 | 02 | 00000010 | STX |  | Start of Text | |
3 | 003 | 03 | 00000011 | ETX |  | End of Text | |
4 | 004 | 04 | 00000100 | EOT |  | End of Transmission | |
5 | 005 | 05 | 00000101 | ENQ |  | Enquiry | |
6 | 006 | 06 | 00000110 | ACK |  | Acknowledgment | |
7 | 007 | 07 | 00000111 | BEL |  | Bell | |
8 | 010 | 08 | 00001000 | BS |  | Back Space | |
9 | 011 | 09 | 00001001 | HT | 	 | Horizontal Tab | |
10 | 012 | 0A | 00001010 | LF | &#10; | Line Feed | |
11 | 013 | 0B | 00001011 | VT |  | Vertical Tab | |
12 | 014 | 0C | 00001100 | FF |  | Form Feed | |
13 | 015 | 0D | 00001101 | CR | &#13; | Carriage Return | |
14 | 016 | 0E | 00001110 | SO |  | Shift Out / X-On | |
15 | 017 | 0F | 00001111 | SI |  | Shift In / X-Off | |
16 | 020 | 10 | 00010000 | DLE |  | Data Line Escape | |
17 | 021 | 11 | 00010001 | DC1 |  | Device Control 1 (oft. XON) | |
18 | 022 | 12 | 00010010 | DC2 |  | Device Control 2 | |
19 | 023 | 13 | 00010011 | DC3 |  | Device Control 3 (oft. XOFF) | |
20 | 024 | 14 | 00010100 | DC4 |  | Device Control 4 | |
21 | 025 | 15 | 00010101 | NAK |  | Negative Acknowledgement | |
22 | 026 | 16 | 00010110 | SYN |  | Synchronous Idle | |
23 | 027 | 17 | 00010111 | ETB |  | End of Transmit Block | |
24 | 030 | 18 | 00011000 | CAN |  | Cancel | |
25 | 031 | 19 | 00011001 | EM |  | End of Medium | |
26 | 032 | 1A | 00011010 | SUB |  | Substitute | |
27 | 033 | 1B | 00011011 | ESC |  | Escape | |
28 | 034 | 1C | 00011100 | FS |  | File Separator | |
29 | 035 | 1D | 00011101 | GS |  | Group Separator | |
30 | 036 | 1E | 00011110 | RS |  | Record Separator | |
31 | 037 | 1F | 00011111 | US |  | Unit Separator | |
32 | 040 | 20 | 00100000 |   | Space | ||
33 | 041 | 21 | 00100001 | ! | ! | Exclamation mark | |
34 | 042 | 22 | 00100010 | " | " | " | Double quotes (or speech marks) |
35 | 043 | 23 | 00100011 | # | # | Number | |
36 | 044 | 24 | 00100100 | $ | $ | Dollar | |
37 | 045 | 25 | 00100101 | % | % | Percent sign | |
38 | 046 | 26 | 00100110 | & | & | & | Ampersand |
39 | 047 | 27 | 00100111 | ' | ' | Single quote | |
40 | 050 | 28 | 00101000 | ( | ( | Open parenthesis (or open bracket) | |
41 | 051 | 29 | 00101001 | ) | ) | Close parenthesis (or close bracket) | |
42 | 052 | 2A | 00101010 | * | * | Asterisk | |
43 | 053 | 2B | 00101011 | + | + | Plus | |
44 | 054 | 2C | 00101100 | , | , | Comma | |
45 | 055 | 2D | 00101101 | - | - | Hyphen | |
46 | 056 | 2E | 00101110 | . | . | Period, dot or full stop | |
47 | 057 | 2F | 00101111 | / | / | Slash or divide | |
48 | 060 | 30 | 00110000 | 0 | 0 | Zero | |
49 | 061 | 31 | 00110001 | 1 | 1 | One | |
50 | 062 | 32 | 00110010 | 2 | 2 | Two | |
51 | 063 | 33 | 00110011 | 3 | 3 | Three | |
52 | 064 | 34 | 00110100 | 4 | 4 | Four | |
53 | 065 | 35 | 00110101 | 5 | 5 | Five | |
54 | 066 | 36 | 00110110 | 6 | 6 | Six | |
55 | 067 | 37 | 00110111 | 7 | 7 | Seven | |
56 | 070 | 38 | 00111000 | 8 | 8 | Eight | |
57 | 071 | 39 | 00111001 | 9 | 9 | Nine | |
58 | 072 | 3A | 00111010 | : | : | Colon | |
59 | 073 | 3B | 00111011 | ; | ; | Semicolon | |
60 | 074 | 3C | 00111100 | < | < | < | Less than (or open angled bracket) |
61 | 075 | 3D | 00111101 | = | = | Equals | |
62 | 076 | 3E | 00111110 | > | > | > | Greater than (or close angled bracket) |
63 | 077 | 3F | 00111111 | ? | ? | Question mark | |
64 | 100 | 40 | 01000000 | @ | @ | At symbol | |
65 | 101 | 41 | 01000001 | A | A | Uppercase A | |
66 | 102 | 42 | 01000010 | B | B | Uppercase B | |
67 | 103 | 43 | 01000011 | C | C | Uppercase C | |
68 | 104 | 44 | 01000100 | D | D | Uppercase D | |
69 | 105 | 45 | 01000101 | E | E | Uppercase E | |
70 | 106 | 46 | 01000110 | F | F | Uppercase F | |
71 | 107 | 47 | 01000111 | G | G | Uppercase G | |
72 | 110 | 48 | 01001000 | H | H | Uppercase H | |
73 | 111 | 49 | 01001001 | I | I | Uppercase I | |
74 | 112 | 4A | 01001010 | J | J | Uppercase J | |
75 | 113 | 4B | 01001011 | K | K | Uppercase K | |
76 | 114 | 4C | 01001100 | L | L | Uppercase L | |
77 | 115 | 4D | 01001101 | M | M | Uppercase M | |
78 | 116 | 4E | 01001110 | N | N | Uppercase N | |
79 | 117 | 4F | 01001111 | O | O | Uppercase O | |
80 | 120 | 50 | 01010000 | P | P | Uppercase P | |
81 | 121 | 51 | 01010001 | Q | Q | Uppercase Q | |
82 | 122 | 52 | 01010010 | R | R | Uppercase R | |
83 | 123 | 53 | 01010011 | S | S | Uppercase S | |
84 | 124 | 54 | 01010100 | T | T | Uppercase T | |
85 | 125 | 55 | 01010101 | U | U | Uppercase U | |
86 | 126 | 56 | 01010110 | V | V | Uppercase V | |
87 | 127 | 57 | 01010111 | W | W | Uppercase W | |
88 | 130 | 58 | 01011000 | X | X | Uppercase X | |
89 | 131 | 59 | 01011001 | Y | Y | Uppercase Y | |
90 | 132 | 5A | 01011010 | Z | Z | Uppercase Z | |
91 | 133 | 5B | 01011011 | [ | [ | Opening bracket | |
92 | 134 | 5C | 01011100 | \ | \ | Backslash | |
93 | 135 | 5D | 01011101 | ] | ] | Closing bracket | |
94 | 136 | 5E | 01011110 | ^ | ^ | Caret - circumflex | |
95 | 137 | 5F | 01011111 | _ | _ | Underscore | |
96 | 140 | 60 | 01100000 | ` | ` | Grave accent | |
97 | 141 | 61 | 01100001 | a | a | Lowercase a | |
98 | 142 | 62 | 01100010 | b | b | Lowercase b | |
99 | 143 | 63 | 01100011 | c | c | Lowercase c | |
100 | 144 | 64 | 01100100 | d | d | Lowercase d | |
101 | 145 | 65 | 01100101 | e | e | Lowercase e | |
102 | 146 | 66 | 01100110 | f | f | Lowercase f | |
103 | 147 | 67 | 01100111 | g | g | Lowercase g | |
104 | 150 | 68 | 01101000 | h | h | Lowercase h | |
105 | 151 | 69 | 01101001 | i | i | Lowercase i | |
106 | 152 | 6A | 01101010 | j | j | Lowercase j | |
107 | 153 | 6B | 01101011 | k | k | Lowercase k | |
108 | 154 | 6C | 01101100 | l | l | Lowercase l | |
109 | 155 | 6D | 01101101 | m | m | Lowercase m | |
110 | 156 | 6E | 01101110 | n | n | Lowercase n | |
111 | 157 | 6F | 01101111 | o | o | Lowercase o | |
112 | 160 | 70 | 01110000 | p | p | Lowercase p | |
113 | 161 | 71 | 01110001 | q | q | Lowercase q | |
114 | 162 | 72 | 01110010 | r | r | Lowercase r | |
115 | 163 | 73 | 01110011 | s | s | Lowercase s | |
116 | 164 | 74 | 01110100 | t | t | Lowercase t | |
117 | 165 | 75 | 01110101 | u | u | Lowercase u | |
118 | 166 | 76 | 01110110 | v | v | Lowercase v | |
119 | 167 | 77 | 01110111 | w | w | Lowercase w | |
120 | 170 | 78 | 01111000 | x | x | Lowercase x | |
121 | 171 | 79 | 01111001 | y | y | Lowercase y | |
122 | 172 | 7A | 01111010 | z | z | Lowercase z | |
123 | 173 | 7B | 01111011 | { | { | Opening brace | |
124 | 174 | 7C | 01111100 | | | | | Vertical bar | |
125 | 175 | 7D | 01111101 | } | } | Closing brace | |
126 | 176 | 7E | 01111110 | ~ | ~ | Equivalency sign - tilde | |
127 | 177 | 7F | 01111111 |  | Delete | ||
8-bit Characters (128-255): | |||||||
128 | 200 | 80 | 10000000 | € | € | € | Euro sign |
129 | 201 | 81 | 10000001 | ||||
130 | 202 | 82 | 10000010 | ‚ | ‚ | ‚ | Single low-9 quotation mark |
131 | 203 | 83 | 10000011 | ƒ | ƒ | ƒ | Latin small letter f with hook |
132 | 204 | 84 | 10000100 | „ | „ | „ | Double low-9 quotation mark |
133 | 205 | 85 | 10000101 | … | … | … | Horizontal ellipsis |
134 | 206 | 86 | 10000110 | † | † | † | Dagger |
135 | 207 | 87 | 10000111 | ‡ | ‡ | ‡ | Double dagger |
136 | 210 | 88 | 10001000 | ˆ | ˆ | ˆ | Modifier letter circumflex accent |
137 | 211 | 89 | 10001001 | ‰ | ‰ | ‰ | Per mille sign |
138 | 212 | 8A | 10001010 | Š | Š | Š | Latin capital letter S with caron |
139 | 213 | 8B | 10001011 | ‹ | ‹ | ‹ | Single left-pointing angle quotation |
140 | 214 | 8C | 10001100 | Œ | Œ | Œ | Latin capital ligature OE |
141 | 215 | 8D | 10001101 | ||||
142 | 216 | 8E | 10001110 | Ž | Ž | Latin capital letter Z with caron | |
143 | 217 | 8F | 10001111 | ||||
144 | 220 | 90 | 10010000 | ||||
145 | 221 | 91 | 10010001 | ‘ | ‘ | ‘ | Left single quotation mark |
146 | 222 | 92 | 10010010 | ’ | ’ | ’ | Right single quotation mark |
147 | 223 | 93 | 10010011 | “ | “ | “ | Left double quotation mark |
148 | 224 | 94 | 10010100 | ” | ” | ” | Right double quotation mark |
149 | 225 | 95 | 10010101 | • | • | • | Bullet |
150 | 226 | 96 | 10010110 | – | – | – | En dash |
151 | 227 | 97 | 10010111 | — | — | — | Em dash |
152 | 230 | 98 | 10011000 | ˜ | ˜ | ˜ | Small tilde |
153 | 231 | 99 | 10011001 | ™ | ™ | ™ | Trade mark sign |
154 | 232 | 9A | 10011010 | š | š | š | Latin small letter S with caron |
155 | 233 | 9B | 10011011 | › | › | › | Single right-pointing angle quotation mark |
156 | 234 | 9C | 10011100 | œ | œ | œ | Latin small ligature oe |
157 | 235 | 9D | 10011101 | ||||
158 | 236 | 9E | 10011110 | ž | ž | Latin small letter z with caron | |
159 | 237 | 9F | 10011111 | Ÿ | Ÿ | Ÿ | Latin capital letter Y with diaeresis |
160 | 240 | A0 | 10100000 |   | | Non-breaking space | |
161 | 241 | A1 | 10100001 | ¡ | ¡ | ¡ | Inverted exclamation mark |
162 | 242 | A2 | 10100010 | ¢ | ¢ | ¢ | Cent sign |
163 | 243 | A3 | 10100011 | £ | £ | £ | Pound sign |
164 | 244 | A4 | 10100100 | ¤ | ¤ | ¤ | Currency sign |
165 | 245 | A5 | 10100101 | ¥ | ¥ | ¥ | Yen sign |
166 | 246 | A6 | 10100110 | ¦ | ¦ | ¦ | Pipe, Broken vertical bar |
167 | 247 | A7 | 10100111 | § | § | § | Section sign |
168 | 250 | A8 | 10101000 | ¨ | ¨ | ¨ | Spacing diaeresis - umlaut |
169 | 251 | A9 | 10101001 | © | © | © | Copyright sign |
170 | 252 | AA | 10101010 | ª | ª | ª | Feminine ordinal indicator |
171 | 253 | AB | 10101011 | « | « | « | Left double angle quotes |
172 | 254 | AC | 10101100 | ¬ | ¬ | ¬ | Not sign |
173 | 255 | AD | 10101101 | | ­ | ­ | Soft hyphen |
174 | 256 | AE | 10101110 | ® | ® | ® | Registered trade mark sign |
175 | 257 | AF | 10101111 | ¯ | ¯ | ¯ | Spacing macron - overline |
176 | 260 | B0 | 10110000 | ° | ° | ° | Degree sign |
177 | 261 | B1 | 10110001 | ± | ± | ± | Plus-or-minus sign |
178 | 262 | B2 | 10110010 | ² | ² | ² | Superscript two - squared |
179 | 263 | B3 | 10110011 | ³ | ³ | ³ | Superscript three - cubed |
180 | 264 | B4 | 10110100 | ´ | ´ | ´ | Acute accent - spacing acute |
181 | 265 | B5 | 10110101 | µ | µ | µ | Micro sign |
182 | 266 | B6 | 10110110 | ¶ | ¶ | ¶ | Pilcrow sign - paragraph sign |
183 | 267 | B7 | 10110111 | · | · | · | Middle dot - Georgian comma |
184 | 270 | B8 | 10111000 | ¸ | ¸ | ¸ | Spacing cedilla |
185 | 271 | B9 | 10111001 | ¹ | ¹ | ¹ | Superscript one |
186 | 272 | BA | 10111010 | º | º | º | Masculine ordinal indicator |
187 | 273 | BB | 10111011 | » | » | » | Right double angle quotes |
188 | 274 | BC | 10111100 | ¼ | ¼ | ¼ | Fraction one quarter |
189 | 275 | BD | 10111101 | ½ | ½ | ½ | Fraction one half |
190 | 276 | BE | 10111110 | ¾ | ¾ | ¾ | Fraction three quarters |
191 | 277 | BF | 10111111 | ¿ | ¿ | ¿ | Inverted question mark |
192 | 300 | C0 | 11000000 | À | À | À | Latin capital letter A with grave |
193 | 301 | C1 | 11000001 | Á | Á | Á | Latin capital letter A with acute |
194 | 302 | C2 | 11000010 | Â | Â | Â | Latin capital letter A with circumflex |
195 | 303 | C3 | 11000011 | Ã | Ã | Ã | Latin capital letter A with tilde |
196 | 304 | C4 | 11000100 | Ä | Ä | Ä | Latin capital letter A with diaeresis |
197 | 305 | C5 | 11000101 | Å | Å | Å | Latin capital letter A with ring above |
198 | 306 | C6 | 11000110 | Æ | Æ | Æ | Latin capital letter AE |
199 | 307 | C7 | 11000111 | Ç | Ç | Ç | Latin capital letter C with cedilla |
200 | 310 | C8 | 11001000 | È | È | È | Latin capital letter E with grave |
201 | 311 | C9 | 11001001 | É | É | É | Latin capital letter E with acute |
202 | 312 | CA | 11001010 | Ê | Ê | Ê | Latin capital letter E with circumflex |
203 | 313 | CB | 11001011 | Ë | Ë | Ë | Latin capital letter E with diaeresis |
204 | 314 | CC | 11001100 | Ì | Ì | Ì | Latin capital letter I with grave |
205 | 315 | CD | 11001101 | Í | Í | Í | Latin capital letter I with acute |
206 | 316 | CE | 11001110 | Î | Î | Î | Latin capital letter I with circumflex |
207 | 317 | CF | 11001111 | Ï | Ï | Ï | Latin capital letter I with diaeresis |
208 | 320 | D0 | 11010000 | Ð | Ð | Ð | Latin capital letter ETH |
209 | 321 | D1 | 11010001 | Ñ | Ñ | Ñ | Latin capital letter N with tilde |
210 | 322 | D2 | 11010010 | Ò | Ò | Ò | Latin capital letter O with grave |
211 | 323 | D3 | 11010011 | Ó | Ó | Ó | Latin capital letter O with acute |
212 | 324 | D4 | 11010100 | Ô | Ô | Ô | Latin capital letter O with circumflex |
213 | 325 | D5 | 11010101 | Õ | Õ | Õ | Latin capital letter O with tilde |
214 | 326 | D6 | 11010110 | Ö | Ö | Ö | Latin capital letter O with diaeresis |
215 | 327 | D7 | 11010111 | × | × | × | Multiplication sign |
216 | 330 | D8 | 11011000 | Ø | Ø | Ø | Latin capital letter O with slash |
217 | 331 | D9 | 11011001 | Ù | Ù | Ù | Latin capital letter U with grave |
218 | 332 | DA | 11011010 | Ú | Ú | Ú | Latin capital letter U with acute |
219 | 333 | DB | 11011011 | Û | Û | Û | Latin capital letter U with circumflex |
220 | 334 | DC | 11011100 | Ü | Ü | Ü | Latin capital letter U with diaeresis |
221 | 335 | DD | 11011101 | Ý | Ý | Ý | Latin capital letter Y with acute |
222 | 336 | DE | 11011110 | Þ | Þ | Þ | Latin capital letter THORN |
223 | 337 | DF | 11011111 | ß | ß | ß | Latin small letter sharp s - ess-zed |
224 | 340 | E0 | 11100000 | à | à | à | Latin small letter a with grave |
225 | 341 | E1 | 11100001 | á | á | á | Latin small letter a with acute |
226 | 342 | E2 | 11100010 | â | â | â | Latin small letter a with circumflex |
227 | 343 | E3 | 11100011 | ã | ã | ã | Latin small letter a with tilde |
228 | 344 | E4 | 11100100 | ä | ä | ä | Latin small letter a with diaeresis |
229 | 345 | E5 | 11100101 | å | å | å | Latin small letter a with ring above |
230 | 346 | E6 | 11100110 | æ | æ | æ | Latin small letter ae |
231 | 347 | E7 | 11100111 | ç | ç | ç | Latin small letter c with cedilla |
232 | 350 | E8 | 11101000 | è | è | è | Latin small letter e with grave |
233 | 351 | E9 | 11101001 | é | é | é | Latin small letter e with acute |
234 | 352 | EA | 11101010 | ê | ê | ê | Latin small letter e with circumflex |
235 | 353 | EB | 11101011 | ë | ë | ë | Latin small letter e with diaeresis |
236 | 354 | EC | 11101100 | ì | ì | ì | Latin small letter i with grave |
237 | 355 | ED | 11101101 | í | í | í | Latin small letter i with acute |
238 | 356 | EE | 11101110 | î | î | î | Latin small letter i with circumflex |
239 | 357 | EF | 11101111 | ï | ï | ï | Latin small letter i with diaeresis |
240 | 360 | F0 | 11110000 | ð | ð | ð | Latin small letter eth |
241 | 361 | F1 | 11110001 | ñ | ñ | ñ | Latin small letter n with tilde |
242 | 362 | F2 | 11110010 | ò | ò | ò | Latin small letter o with grave |
243 | 363 | F3 | 11110011 | ó | ó | ó | Latin small letter o with acute |
244 | 364 | F4 | 11110100 | ô | ô | ô | Latin small letter o with circumflex |
245 | 365 | F5 | 11110101 | õ | õ | õ | Latin small letter o with tilde |
246 | 366 | F6 | 11110110 | ö | ö | ö | Latin small letter o with diaeresis |
247 | 367 | F7 | 11110111 | ÷ | ÷ | ÷ | Division sign |
248 | 370 | F8 | 11111000 | ø | ø | ø | Latin small letter o with slash |
249 | 371 | F9 | 11111001 | ù | ù | ù | Latin small letter u with grave |
250 | 372 | FA | 11111010 | ú | ú | ú | Latin small letter u with acute |
251 | 373 | FB | 11111011 | û | û | û | Latin small letter u with circumflex |
252 | 374 | FC | 11111100 | ü | ü | ü | Latin small letter u with diaeresis |
253 | 375 | FD | 11111101 | ý | ý | ý | Latin small letter y with acute |
254 | 376 | FE | 11111110 | þ | þ | þ | Latin small letter thorn |
255 | 377 | FF | 11111111 | ÿ | ÿ | ÿ | Latin small letter y with diaeresis |
Starting in the late 1980’s, a group of 8-bit character sets called “ISO/IEC 8859” was created to standardize non-English characters. ISO-8859-1, also known as Latin-1, is a slight extension of ASCII and can still be seen in modern databases. However, you had to know which of these sets to use in order to view a particular document, and it was not possible to include content spanning multiple charsets in the same document.
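To see why you had to know which set a document used, here is a small Python sketch that decodes the same single byte with three of the ISO 8859 sets. The byte value 0xE9 and the three sets are arbitrary choices for illustration.

```python
# One byte, three different ISO 8859 character sets.
raw = bytes([0xE9])

decoded = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}

for name, text in decoded.items():
    # Latin-1 gives a Latin letter, 8859-5 a Cyrillic one, 8859-7 a Greek one.
    print(name, text, f"U+{ord(text):04X}")
```

The byte itself carries no hint about which set was intended, which is why mixing languages in a single document was not workable with these sets.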
Eventually, a charset called Unicode was established, which is capable of encoding characters across all languages with plenty of space for additional characters to be added as needed; its code space runs from U+0000 to U+10FFFF, just over 1.1 million code points. Unicode provides some backwards compatibility since its first 128 character encodings match ASCII. Although Unicode solved the problem of not being able to account for characters across all languages, web standards and practices had already been established around 8-bit bytes. So UTF-8 was established as a way to store and transmit Unicode as a sequence of bytes. It is backwards compatible with ASCII and works by using one to four bytes per character:
- Code points 0 through 127 (the ASCII range) take a single byte whose high bit is 0
- All other code points take two, three, or four bytes; the high bits of the first byte (110, 1110, or 11110) say how many bytes are in the sequence
- Every remaining byte in the sequence starts with the bits 10 and carries six bits of the code point
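UTF-8's byte layout can be observed directly. The following Python sketch encodes four sample characters (chosen only for illustration) and prints each resulting byte in binary, so the leading bits of every byte are visible.

```python
# Sample characters that need 1, 2, 3, and 4 bytes in UTF-8.
samples = ["A", "é", "幸", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show every byte as 8 binary digits to expose the leading marker bits.
    bits = " ".join(format(b, "08b") for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Note that the bytes for "A" are identical in ASCII and UTF-8, which is the backwards compatibility mentioned above.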
Due to its wide adoption, support across browsers, and MySQL support, UTF-8 is usually a good choice for Drupal.
Database handling of charsets
When importing data into a DB, the backwards compatibility of having the first 128 characters match across multiple sets can be a curse rather than a blessing: content made up only of ASCII-compatible characters looks correct even when the charsets are mismatched, so the problem can go unnoticed. Let’s say you’re doing an import and view a particular page that doesn’t contain any of the ASCII-compatible characters, like "幸せな魚". If the page that should contain "幸せな魚" instead shows something along the lines of "|.^&Q!", then most likely your UTF-8 content was imported into an ISO-8859-1 DB. This problem is usually pretty obvious since the characters look nothing alike. If you do run into this problem, the easiest solution is to change the DB’s charset and re-run the import. If a complete re-import isn't an option, you can dump the corrupted data, re-import it into a UTF-8 DB, and then either correct bad characters manually as they’re found or use a script to identify and replace bad entries.
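The mismatch described above can be reproduced outside a database. This Python sketch is only an illustration of the mechanism, not a migration script: it reads UTF-8 bytes as ISO-8859-1 and then reverses the mistake, which works only as long as no bytes were lost or altered along the way.

```python
original = "幸せな魚"

# What gets stored: the UTF-8 bytes of the text.
utf8_bytes = original.encode("utf-8")

# What you see when those bytes are interpreted as ISO-8859-1:
# every byte becomes its own (wrong) character.
garbled = utf8_bytes.decode("iso-8859-1")

# A repair script can reverse the mistake if the bytes survived intact.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled), repaired == original)
```

Four characters become twelve because each Japanese character here takes three bytes in UTF-8.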
Charsets in URLs
Characters are everywhere we go, including browser URLs. While you might not normally think of it, URLs like http://mediacurrent.com are in fact made up of English letters. The specification for URLs is very limited: they must consist only of English letters, numbers, and these characters: $-_.+!*'() (the characters &%+,/:;=?@ are reserved and have special meaning).
So what about people who want to use URLs that aren't in the English character set? For page and asset names, non-ASCII and reserved characters must be encoded in order to be included within a URL. This is done with a “%” followed by the two-digit hexadecimal value of a byte; in current practice the character is first encoded as UTF-8, and each resulting byte is written this way. For example "幸せな魚" would be "%E5%B9%B8%E3%81%9B%E3%81%AA%E9%AD%9A". Be careful that slashes used to define the path are not also encoded, although Apache can be configured to handle this with the “AllowEncodedSlashes” directive if you have access to modify your server configuration.
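As a sketch of how this encoding looks in practice, Python's standard `urllib.parse` module provides `quote` and `unquote`. The `/blog/` path below is a made-up example.

```python
from urllib.parse import quote, unquote

name = "幸せな魚"

# quote() encodes the text as UTF-8 and writes each byte as %XX.
encoded = quote(name)
decoded = unquote(encoded)

# By default "/" is treated as safe, so path separators are left alone.
path = quote("/blog/幸せな魚")

# Passing safe="" encodes the slash as well, which is the case to watch for.
encoded_slash = quote("a/b", safe="")

print(encoded, decoded, path, encoded_slash)
```

The encoded form of the name is the same string given as the example above.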
Internationalized Domain Names use a somewhat different set of rules.
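For a rough idea of what those rules look like, Python ships an `idna` codec that converts a hostname to its ASCII form. This is only a sketch: the built-in codec implements the older IDNA 2003 rules, and the hostname is a made-up example.

```python
hostname = "bücher.example"

# Each non-ASCII label is converted to an ASCII form starting with "xn--".
ascii_form = hostname.encode("idna")

# Decoding restores the original Unicode hostname.
round_trip = ascii_form.decode("idna")

print(ascii_form, round_trip)
```

Note that the result is not percent-encoded; domain names use this separate ASCII-compatible form instead.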
Charset issues in browser content
It's common these days for sites to just mark all of their content as UTF-8 and not really have to worry about it. What happens if you are maintaining a site that specifies a different character set? In this case you'll want to check both the HTTP headers being sent and the meta tags in your HTML, since these can differ.
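A check like this can be scripted. The sketch below is a simplified illustration: the function names, the sample header, and the sample page are hypothetical, and the regular expression only covers the common forms of the meta tag rather than full HTML parsing.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html):
    """Return the charset declared in a <meta> tag, if any (simplified)."""
    match = re.search(r"<meta[^>]+charset=[\"']?([\w-]+)", html, re.IGNORECASE)
    return match.group(1).lower() if match else None


# Hypothetical response: the header and the markup disagree.
header = "text/html; charset=ISO-8859-1"
page = '<html><head><meta charset="utf-8"></head><body></body></html>'

mismatch = header_charset(header) != meta_charset(page)
print(header_charset(header), meta_charset(page), mismatch)
```

When the two values differ, it is worth aligning them with each other and with the encoding the content is actually stored in.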
Drupal Language support
We've gone over databases and HTML, but we usually use a content management system or a web framework to pull these two items together. Since we are a Drupal shop, we'll use Drupal 7 for our example. Drupal has multilingual support out of the box using the Locale module, as well as a host of supporting third-party modules. It allows you to have multiple translations of a single piece of content available at the same time. For content type-specific control, the Entity Translation module can be used, which introduces a "Translate" tab.
Some key points to remember when using Drupal and multilingual:
- Presenting translated content can be context-based or users can be provided with an option for viewing content in a different language.
- The Location module can be used to determine a user’s geographic location to auto-select or suggest a language.
- Content can be translated manually or by a computer. Manually translated content tends to have better accuracy but costs more.
- The TranslateThis Button module uses JavaScript to do automated translations and supports many languages.
- Drupal allows translated content to either share the same path or to have completely different URLs.
- Use the Transliteration module to replace any non-standard characters for all file uploads.
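To give a feel for what replacing non-standard characters in a file name involves, here is a minimal Python sketch. It is not how the Transliteration module works internally; the function name and the replacement rules are assumptions made for illustration.

```python
import re
import unicodedata


def ascii_filename(name):
    """Reduce a file name to ASCII letters, digits, dots, dashes, underscores."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop everything that has no ASCII representation (including the marks).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace any remaining run of unexpected characters with an underscore.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_only)


cleaned = ascii_filename("Crème Brûlée.pdf")
print(cleaned)
```

This simple approach drops characters that have no decomposition into ASCII (for example Japanese text) instead of transliterating them, which is one reason a dedicated module is the better choice on a real site.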
Ultimately, character sets, although somewhat complex, make life easier in a shrinking global community. We’ve gotten to the point where you can have a system that automatically handles these issues so users can focus on content. Still, having a basic understanding can help save troubleshooting time when something goes wrong.