Unicode
What is Unicode?
The goal of Unicode is to be able to represent every character in every language. Using Unicode makes it easier to internationalize files and applications. Unicode encoding schemes include UTF-8, UTF-16 and UTF-32. The numbers give the size of one code unit in bits, not the size of a character: UTF-8 stores a character in one to four bytes, UTF-16 in two or four bytes, and UTF-32 always in four bytes. Unicode is defined by the Unicode Consortium. An example of how Unicode works looks like this:
tweedleburg:~ # echo hello wörld>file
tweedleburg:~ # cat file
hello wörld
tweedleburg:~ # hexdump -c file
0000000   h   e   l   l   o       w 303 266   r   l   d  \n
000000d
tweedleburg:~ # hexdump -C file
00000000  68 65 6c 6c 6f 20 77 c3  b6 72 6c 64 0a           |hello w..rld.|
0000000d
As you can see, the character ö has been stored as two bytes (c3 b6 in hexadecimal, shown as octal 303 266 by hexdump -c), and every other character as one byte. This is UTF-8 encoding.
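How the two bytes c3 b6 map back to a single character can be worked out with a little bit arithmetic. The following is a minimal sketch in plain shell arithmetic, not a general UTF-8 decoder: it only handles the two-byte case, and the variable names lead, cont, code_point and decoded are chosen for illustration.

```shell
# The two bytes hexdump showed for ö.
lead=0xC3    # binary 110 00011: the 110 prefix marks the start of a two-byte sequence
cont=0xB6    # binary 10 110110: the 10 prefix marks a continuation byte

# Keep the 5 payload bits of the lead byte and the 6 payload bits of the
# continuation byte, then join them into one number (two-byte sequences only).
code_point=$(( (($lead & 0x1F) << 6) | ($cont & 0x3F) ))

# Print the result in the usual U+XXXX notation.
decoded=$(printf 'U+%04X' "$code_point")
echo "$decoded"
```

This prints U+00F6, which is the Unicode code point of ö.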
You can experiment with how Unicode text is stored using the Unicode text editor yudit. With this editor, you can choose whether to save a text as UTF-8, UTF-16 or the like. If you open the above file and store it as UTF-16, every character in it is stored as two bytes (16 bits). For little endian, it looks like this:
tweedleburg:~ # yudit file
tweedleburg:~ # hexdump -C file
00000000  68 00 65 00 6c 00 6c 00  6f 00 20 00 77 00 f6 00  |h.e.l.l.o. .w...|
00000010  72 00 6c 00 64 00 0a 00                           |r.l.d...|
00000018
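The same conversion can be tried without an editor. The sketch below assumes that an iconv command supporting the encoding names UTF-16LE and UTF-32LE is installed, which is common on Linux systems but not guaranteed. The helper names to_hex and byte_count are made up for this example. It also shows that a character outside the Basic Multilingual Plane, here U+1F600, needs four bytes in UTF-16 rather than two.

```shell
# Convert UTF-8 on stdin to the encoding named in $1 and print the bytes as hex.
to_hex() {
    iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

# Number of bytes the UTF-8 input on stdin occupies in the encoding named in $1.
byte_count() {
    iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

# The sample text, written with octal escapes (\303\266 is ö in UTF-8) so the
# result does not depend on the encoding of the terminal or of this script.
utf16_hex=$(printf 'hello w\303\266rld\n' | to_hex UTF-16LE)
echo "$utf16_hex"

# U+1F600 is four bytes in UTF-8 (\360\237\230\200); count it in UTF-16 and UTF-32.
emoji_utf16=$(printf '\360\237\230\200' | byte_count UTF-16LE)
emoji_utf32=$(printf '\360\237\230\200' | byte_count UTF-32LE)
echo "U+1F600: UTF-16LE $emoji_utf16 bytes, UTF-32LE $emoji_utf32 bytes"
```

The first line of output contains the same 24 bytes as the hexdump above, written without spaces. The explicit LE suffix is used so that iconv does not prepend a byte order mark.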
See also
- Internationalization
- utf-8
- utf-16
- ISO 10646
- Unicode (en.wikipedia.org)
- Unicode.org (www.unicode.org)
- An introduction to Unicode