
    Character Encoding

    Unicode defines standards for the representation and processing of written text, supporting the worldwide interchange, processing, and display of text in the diverse languages and technical disciplines of the modern world. The current version of the Unicode Standard defines more than 107,000 characters covering 90 scripts.

    Unicode assigns a numeric value, or code point, to every character and symbol used in its supported writing scripts. Character encoding standards not only identify each character and its code point, but also define how that value is represented in bits, for example in the UTF-8 encoding form.
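The distinction between a code point and its encoded bytes can be sketched in Python (the characters here are illustrative samples, not part of the standard's own examples):

```python
# Code points are abstract numeric values; an encoding form such as
# UTF-8 determines the concrete byte representation of each one.
for ch in "Aé€":
    code_point = ord(ch)             # the Unicode code point
    utf8_bytes = ch.encode("utf-8")  # its UTF-8 byte sequence
    print(f"U+{code_point:04X} -> {utf8_bytes.hex()}")
```

Note that "A" (U+0041) needs a single UTF-8 byte, while "€" (U+20AC) needs three: the code point stays the same regardless of which encoding form carries it.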

    The four levels of the Unicode Character Encoding Model can be summarized as:

    1. ACR (Abstract Character Repertoire): the set of characters to be encoded, for example, some alphabet or symbol set
    2. CCS (Coded Character Set): a mapping from an abstract character repertoire to a set of nonnegative integers
    3. CEF (Character Encoding Form): a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
    4. CES (Character Encoding Scheme): a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes
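The levels above can be made concrete with a small Python sketch for a single character (the character "é" is chosen here purely as an illustration):

```python
# Illustrating the encoding-model levels for the character "é".
ch = "é"

# CCS: the coded character set assigns "é" the code point U+00E9.
code_point = ord(ch)
assert code_point == 0x00E9

# CEF: in UTF-16 this code point maps to one 16-bit code unit; in
# UTF-8 it maps to a sequence of two 8-bit code units.
utf8_units = list(ch.encode("utf-8"))   # [0xC3, 0xA9]

# CES: the encoding scheme serializes code units to bytes, fixing the
# byte order where units are wider than a byte (UTF-16BE vs UTF-16LE).
assert ch.encode("utf-16-be") == b"\x00\xe9"
assert ch.encode("utf-16-le") == b"\xe9\x00"
```

The ACR level has no code equivalent: it is simply the agreed-upon set of characters before any numbers are assigned.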

    Unicode Standard follows a set of fundamental principles for Character Encoding:

    •Universal repertoire 
    •Logical order 
    •Characters, not glyphs 
    •Dynamic composition 
    •Plain text 

    The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which is roughly a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.
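These size differences are easy to observe in Python; the sample characters below are illustrative, chosen to cover each UTF-8 length from one to four bytes:

```python
# Byte lengths of representative characters in each encoding form.
for ch in ("A", "é", "€", "𝄞"):
    cp = ord(ch)
    u8 = len(ch.encode("utf-8"))       # 1 to 4 bytes
    u16 = len(ch.encode("utf-16-be"))  # 2 or 4 bytes (1 or 2 code units)
    u32 = len(ch.encode("utf-32-be"))  # always 4 bytes (1 code unit)
    print(f"U+{cp:04X}: UTF-8={u8} bytes, UTF-16={u16 // 2} units, UTF-32={u32 // 4} unit(s)")
```

Note that "𝄞" (U+1D11E) lies above U+FFFF, so UTF-16 must represent it with two code units (a surrogate pair), while UTF-32 still uses exactly one.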

    There are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. In addition, there are compression transformations such as the one described in the Unicode Technical Standard #6: A Standard Compression Scheme for Unicode.

    Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
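Both properties of a UTF noted above, reversibility and the exclusion of surrogate code points, can be checked directly in Python (the sample string is illustrative):

```python
# A UTF is a reversible mapping: encoding then decoding recovers the
# original code points, for any of the standard encoding forms.
text = "Unicode: é € 𝄞"
for form in ("utf-8", "utf-16", "utf-32"):
    assert text.encode(form).decode(form) == text

# Surrogate code points (U+D800..U+DFFF) are excluded from the
# mapping; attempting to encode a lone surrogate is an error.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError:
    pass  # expected: lone surrogates are not valid in any UTF
```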