Unicode

From LQWiki

What is Unicode?

The goal of Unicode is to represent every character of every language. Using Unicode makes it easier to internationalize files and applications. Unicode encoding schemes include UTF-8, UTF-16, and UTF-32. The number in each name is the size of a code unit, not of a character: UTF-8 stores a character in one to four 8-bit bytes, UTF-16 in one or two 16-bit units, and UTF-32 always in a single 32-bit unit. Unicode is defined by the Unicode Consortium. The following example shows how a text containing a non-ASCII character is stored:

tweedleburg:~ # echo hello wörld>file
tweedleburg:~ # cat file
hello wörld
tweedleburg:~ # hexdump -c file
0000000   h   e   l   l   o       w 303 266   r   l   d  \n
000000d
tweedleburg:~ # hexdump -C file
00000000  68 65 6c 6c 6f 20 77 c3  b6 72 6c 64 0a           |hello w..rld.|
0000000d

As you can see, the character ö has been stored as two bytes (c3 b6 in hexadecimal, 303 266 in octal), and every other character as one byte. This is the UTF-8 encoding.
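The byte counts can also be checked without looking at a dump. The following is a minimal sketch, assuming iconv, wc and tr are installed; count_bytes is a hypothetical helper name, not a standard command. It writes ö with the octal escapes \303\266 seen in the hexdump -c output above, so the result does not depend on the terminal's locale.

```shell
# Hypothetical helper: read UTF-8 text on stdin, convert it to the
# encoding named in $1, and print how many bytes the result occupies.
count_bytes() {
    iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

# Compare an ASCII letter with ö (\303\266 in UTF-8) in three encodings.
for enc in UTF-8 UTF-16LE UTF-32LE; do
    printf '%s: h=%s o-umlaut=%s\n' "$enc" \
        "$(printf 'h' | count_bytes "$enc")" \
        "$(printf '\303\266' | count_bytes "$enc")"
done
```

On a system where these tools behave as assumed, this should report 1 and 2 bytes for UTF-8, 2 and 2 for UTF-16LE, and 4 and 4 for UTF-32LE. The whole line 'hello wörld' plus newline comes to 13 bytes in UTF-8, matching the final offset 0000000d in the dump.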

You can explore how Unicode text is stored with the Unicode text editor yudit. This editor lets you choose whether to save a text as UTF-8, UTF-16, or another encoding. If you open the file above and save it as UTF-16, every character of this text is stored as two bytes (16 bits); characters outside the Basic Multilingual Plane would take four bytes. In little-endian byte order, the result looks like this:

tweedleburg:~ # yudit file
tweedleburg:~ # hexdump -C file
00000000  68 00 65 00 6c 00 6c 00  6f 00 20 00 77 00 f6 00  |h.e.l.l.o. .w...|
00000010  72 00 6c 00 64 00 0a 00                           |r.l.d...|
00000018
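The same conversion can be reproduced on the command line without a graphical editor. This is a rough sketch rather than a description of what yudit does internally; it assumes iconv, od, tr and sed are available, and to_utf16le_hex is a hypothetical helper name. Naming the target UTF-16LE explicitly fixes the byte order and, in common iconv implementations, writes no byte order mark, which matches the dump above.

```shell
# Hypothetical helper: convert UTF-8 on stdin to UTF-16LE and print the
# resulting bytes as space-separated hex pairs on a single line.
to_utf16le_hex() {
    iconv -f UTF-8 -t UTF-16LE | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# \303\266 is the UTF-8 form of ö; in UTF-16LE it becomes f6 00.
printf 'hello w\303\266rld\n' | to_utf16le_hex
```

The output should list the 24 bytes 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 f6 00 72 00 6c 00 64 00 0a 00. Piping the iconv output into hexdump -C instead of od gives the same layout as the dump above.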

See also