UTF-8

UTF-8 is an elegant Unicode encoding. Every ASCII character, which fits into 7 bits, is written as that same single byte, so its first bit is 0. As soon as the first bit is 1, UTF-8 uses further bytes (two to four in total) to work out which character is represented. Let's use the editor yudit to create an example: save a text file u.txt in the Unicode format UTF-8, writing only an "A" into it.
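
If you do not have yudit at hand, the same file can be created directly from the shell (assuming a terminal running in a UTF-8 locale):

printf 'A' > u.txt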

cat u.txt
ls -l u.txt
hexdump u.txt

This shows you that the file consists of one byte, an "A": character 41h (65 decimal, 01000001 binary). Now save a single Ä instead. With ls -l you see that your text, consisting of one character, needs two bytes. With cat you see that your Ä has been saved correctly. With hexdump you see the file displayed as "84c3": plain hexdump groups bytes into little-endian 16-bit words, so the actual bytes in the file are c3 84 (hexdump -C shows them in file order). Both bytes have their first bit set.
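
That both bytes have their first bit set is no accident: in UTF-8 the first byte announces how many bytes a sequence occupies, and every continuation byte starts with the bits 10:

0xxxxxxx                              1 byte, ASCII (U+0000 to U+007F)
110xxxxx 10xxxxxx                     2 bytes (U+0080 to U+07FF)
1110xxxx 10xxxxxx 10xxxxxx            3 bytes (U+0800 to U+FFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   4 bytes (U+10000 to U+10FFFF)

Ä is code point U+00C4, binary 11000100; padded to the eleven available bits of the two-byte form it becomes 00011 000100, and filling those bits into 110xxxxx 10xxxxxx gives 11000011 10000100, i.e. c3 84. You can reproduce this step from the shell as well (a sketch assuming a UTF-8 locale; the hexdump -C output is shown approximately):

$ printf 'Ä' > u.txt
$ hexdump -C u.txt
00000000  c3 84                                             |..|
00000002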

As a test, use the following lines to output all 256 possible byte values (only the first 128 of them are ASCII; note that the first octal digit only runs to 3, since byte values end at octal 377):

for l in $(seq 0 1 3); do for i in $(seq 0 1 7); do for n in $(seq 0 1 7); do \
echo -en "\0${l}${i}${n}  "; done; done; done; echo
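
On a UTF-8 terminal, most values above octal 177 (0x7F) will show up as replacement characters, because a lone byte with its first bit set is not a valid UTF-8 sequence; that is exactly the rule described above made visible. The same sweep can also be written with printf (a sketch assuming bash, whose builtin printf understands octal escapes in the format string; echo -e is less portable across shells):

for b in $(seq 0 255); do printf "\\$(printf '%03o' "$b")  "; done; echo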

See also