Unicode

From LQWiki

What is Unicode?

Unicode is a system for representing characters and symbols in binary[1]. To that extent, it is similar to other encoding schemes like ASCII. However, the goal of Unicode is to be able to uniquely represent every character in every language. Use of Unicode promotes easy internationalization of files and applications.

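Unicode encoding schemes include UTF-8, UTF-16 and UTF-32. The numbers give the size of one code unit in bits, not the size of a character: UTF-8 uses one to four 8-bit units per character, UTF-16 uses one or two 16-bit units, and UTF-32 always uses a single 32-bit unit.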
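The number in an encoding's name is the width of one code unit, and a character may need more than one unit. The following is a minimal sketch in Python that counts the bytes a few sample characters occupy in each encoding. The sample characters and the helper name encoded_lengths are chosen here for illustration only.

```python
# Sample characters, written as escapes so the source file stays pure ASCII:
# U+0054 (T), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji).
samples = ["T", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32.

    The big-endian variants are used so that Python does not prepend
    a byte order mark, which would inflate the counts.
    """
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only UTF-32 needs the same number of bytes for every character. The emoji lies outside the Basic Multilingual Plane, so UTF-16 needs two 16-bit units (a surrogate pair) for it.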

What is the Unicode Consortium?

From the Unicode website: The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. The consortium is supported financially solely through membership dues. Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world who support the Unicode Standard and wish to assist in its extension and implementation.

How can this influence my work?

Any text you see may be encoded as plain ASCII or in one of the Unicode encodings, and you cannot tell which just by looking at it on screen. In the text

scorpio:~ # cat unicode.txt
This is UTF-16

"This" can turn out to consist of four bytes (as in ASCII or UTF-8) or of eight (as in UTF-16). Find out like this:

scorpio:~ # hexdump unicode.txt
0000000 5400 6800 6900 7300 2000 6900 7300 2000
0000010 5500 5400 4600 2d00 3100 3600 0a00
000001e

As you can see, there is a null byte (00) between the "T" (hex code 54) and the "h" (hex code 68), so this text is UTF-16, as it states. Note that hexdump without options prints 16-bit words in the host's byte order, so the two bytes of each pair may appear swapped relative to their order in the file; hexdump -C lists the bytes in file order. This matters when you read a text byte-wise. You can experiment with how Unicode texts look using the Unicode editor yudit, which lets you choose whether to save a text as UTF-8, UTF-16 or the like.
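The null-byte observation can be turned into a rough check. The sketch below is a heuristic, not a reliable detector: it assumes the text consists only of ASCII-range characters and carries no byte order mark, and the function name guess_utf16 is hypothetical.

```python
def guess_utf16(data: bytes):
    """Guess whether data is UTF-16 text made up of ASCII-range characters.

    Returns "utf-16-be", "utf-16-le", or None if neither pattern fits.
    Heuristic only: it looks for a null byte in every second position.
    """
    if len(data) < 2 or len(data) % 2:
        return None
    half = len(data) // 2
    even_nuls = data[0::2].count(0)  # null bytes at offsets 0, 2, 4, ...
    odd_nuls = data[1::2].count(0)   # null bytes at offsets 1, 3, 5, ...
    if even_nuls == half and odd_nuls == 0:
        return "utf-16-be"  # high (zero) byte comes first
    if odd_nuls == half and even_nuls == 0:
        return "utf-16-le"  # low byte comes first
    return None

text = "This is UTF-16\n"
for enc in ("ascii", "utf-16-be", "utf-16-le"):
    raw = text.encode(enc)
    first = " ".join(f"{b:02x}" for b in raw[:8])
    print(f"{enc:10} {len(raw):2} bytes  {first}  guess={guess_utf16(raw)}")
```

The ASCII version is 15 bytes long and both UTF-16 versions are 30 bytes, matching the final offset 0x1e in the hexdump output.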

Footnotes


[1] Just to be absolutely clear, Unicode is not a set of fonts, nor is it a software package. It's simply a way of representing a given character as a string of 1's and 0's.

Or, to clarify further: Unicode defines a set of characters and assigns a number (a code point) to each one. The actual 1's and 0's on the wire vary with the encoding, specifically with whether the wire format is big-endian or little-endian and whether a transform such as UTF-8 is in use.
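The difference between a character's number and its bytes on the wire can be seen with a short Python sketch. The choice of "T" and the variable names are for illustration only.

```python
ch = "T"
code_point = ord(ch)  # the number Unicode assigns to "T": 84, i.e. 0x54

# The same code point serialized under several encodings.
encodings = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
wire = {enc: ch.encode(enc) for enc in encodings}

print(f"U+{code_point:04X}")
for enc, raw in wire.items():
    print(f"{enc:10} {raw.hex()}")
    # Decoding with the matching encoding always gives the character back.
    assert raw.decode(enc) == ch
```

The code point is 0x54 in every case, while the byte sequences are 54, 0054, 5400, 00000054 and 54000000 respectively.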
