Notes on character encodings

  • Binary Coded Decimal (BCD)
    • encodes Latin script
    • 6 bits
    • A-Z, 0-9 (not a-z)
    • never standardized
  • Extended Binary Coded Decimal Interchange Code (EBCDIC)
    • 8 bits
    • mainly used by IBM
    • a-z, A-Z, 0-9 (discontinuous)
  • American Standard Code for Information Interchange (ASCII)
    • encodes Latin script
    • an ANSI standard
    • 7 bits -> 128 symbols (0x00 - 0x7F)
      • 95 printable: 0-9, A-Z, a-z, punctuation, specials
        • 0x20 space
        • 0x30 0
        • 0x41 A
        • 0x61 a
        • corresponds mostly to the keys on a modern keyboard
      • 33 non-printable
        • 0x00 NUL (\0)
          • used at the end of null-terminated strings in C
        • 0x07 BEL (\a)
        • 0x08 BS backspace (\b)
        • 0x09 HT horizontal tab (\t)
        • 0x0A LF line feed (\n)
        • 0x0D CR carriage return (\r)
        • mostly obsolete nowadays
      • sorting: ASCIIbetical order
    • remaining (8th) bit of a byte sometimes used as a parity bit
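
A small Python sketch of the ASCII facts above: the listed code values, ASCIIbetical sorting, and a parity bit in the 8th position. The helper name `with_even_parity` and the choice of even (rather than odd) parity are assumptions made for illustration; the notes only say that the spare bit was sometimes used for parity.

```python
# Code values from the list above, checked with ord()/chr().
assert ord(" ") == 0x20
assert ord("0") == 0x30
assert ord("A") == 0x41
assert ord("a") == 0x61
assert "\n" == chr(0x0A) and "\r" == chr(0x0D)

# ASCIIbetical order compares code values, so digits sort before
# upper case letters, which sort before lower case letters.
words = ["banana", "Cherry", "apple", "42"]
ascii_sorted = sorted(words)
print(ascii_sorted)  # ['42', 'Cherry', 'apple', 'banana']


def with_even_parity(code: int) -> int:
    """Set bit 7 so that the byte has an even number of 1 bits.

    Even parity is an assumption for this sketch; odd parity was also used.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    return code | 0x80 if ones % 2 else code


print(hex(with_even_parity(ord("A"))))  # 0x41 (two 1 bits, unchanged)
print(hex(with_even_parity(ord("C"))))  # 0xc3 (three 1 bits, bit 7 set)
```
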
  • intermezzo: code pages (CP)
    • sets of characters plus encodings
    • defined by a unique number
    • used within big companies (IBM, MS, SAP, Oracle, ...)
    • within Microsoft
      • OEM code pages: old, from DOS era
      • ANSI code pages: used in Windows until shift to Unicode
        • misnomer: not an ANSI standard
        • new name: Windows code pages
  • OEM code page 437
    • used on original IBM PC running DOS
    • 8 bit ASCII extension (0x00 - 0xFF)
    • replaces almost all control characters with new symbols
    • adds 128 new characters (mostly symbols)
    • still supported in modern Windows with alt-codes (e.g. alt+numpad1 = ☺)
      • uses decimal codes, not hexadecimal
  • latin1 = ISO-8859-1
    • 8 bit ASCII extension (0x00 - 0xFF)
    • adds 128 new characters (mostly letters with diacritics)
  • windows-1252
    • ANSI code page 1252
    • 8 bit latin1 extension
    • replaces some control characters from latin1 with new printables
    • can be accessed with alt codes by typing numpad0 as prefix
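
To see how the three 8-bit encodings above disagree on bytes 0x80-0xFF, the same byte values can be decoded with Python's built-in codecs `cp437`, `latin-1` and `cp1252`. This is only a sketch; note that Python's `cp437` codec keeps 0x00-0x1F as control characters, so it does not reproduce the ☺-style glyphs shown by the original hardware.

```python
# ASCII range (0x00-0x7F) decodes identically everywhere.
for codec in ("cp437", "latin-1", "cp1252"):
    assert b"A".decode(codec) == "A"

# 0xE9: a Greek letter in CP437, a letter with diacritic in the other two.
assert bytes([0xE9]).decode("cp437") == "\u0398"    # Θ
assert bytes([0xE9]).decode("latin-1") == "\u00E9"  # é
assert bytes([0xE9]).decode("cp1252") == "\u00E9"   # é

# CP437 has é as well, but at a different byte value.
assert bytes([0x82]).decode("cp437") == "\u00E9"

# 0x80: a control character in latin1, a printable in windows-1252.
assert bytes([0x80]).decode("latin-1") == "\u0080"
assert bytes([0x80]).decode("cp1252") == "\u20AC"   # €

# Python's cp437 codec leaves the low control range untouched.
assert bytes([0x01]).decode("cp437") == "\x01"

# Decoding with the wrong code page silently produces the wrong text.
mojibake = "café".encode("cp1252").decode("cp437")
print(mojibake)  # cafΘ
```
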
  • Unicode ~= ISO/IEC 10646
    • covers almost all scripts, both historic and contemporary
      • Arabic, Cyrillic, Greek, Hebrew, Latin, Chinese, Japanese, Braille, Emoji, ...
    • supports bidirectional text (left-to-right and right-to-left)
    • codespace: a range of integers from 0x000000 to 0x10FFFF available for encoding characters
      • 1,114,112 values (17 x 65,536), which fit in 21 bits
      • divided into 17 planes of 16 bits each: 0x__0000 - 0x__FFFF
      • first plane (0x0000-0xFFFF) = Basic Multilingual Plane (BMP)
        • room for 65,536 code points, including most characters in common use
    • code point: any value in the Unicode codespace
      • notation: e.g. U+0041 for A (0x41)
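
The codespace arithmetic and the U+ notation can be checked in a few lines of Python. The function name `describe` is made up for this sketch; `ord` and `chr` are the built-ins that convert between a character and its code point.

```python
MAX_CODE_POINT = 0x10FFFF

# 17 planes of 0x10000 (65,536) code points each.
assert MAX_CODE_POINT + 1 == 17 * 0x10000 == 1_114_112
# The largest code point needs 21 bits.
assert MAX_CODE_POINT.bit_length() == 21


def describe(ch: str) -> str:
    """Return the U+ notation and the plane number of a single code point."""
    cp = ord(ch)
    plane = cp >> 16          # the upper bits select the plane
    return f"U+{cp:04X} (plane {plane})"


print(describe("A"))           # U+0041 (plane 0)
print(describe("\u20AC"))      # U+20AC (plane 0), the euro sign
print(describe("\U0001F600"))  # U+1F600 (plane 1), an emoji outside the BMP

# Values beyond the codespace are rejected.
try:
    chr(MAX_CODE_POINT + 1)
except ValueError:
    print("0x110000 is not a code point")
```
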
    • 1 code point != 1 character
      • combinations possible (e.g. é as e + combining acute accent U+0301, although é also exists as the single code point U+00E9)
      • calculating string length is not trivial
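
The é example can be reproduced in Python, where `len` counts code points. The sketch below also uses `unicodedata.normalize`, which is not covered in these notes, as one possible way to make the two spellings comparable.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both render as the same character, but they are different strings.
assert len(precomposed) == 1
assert len(decomposed) == 2
assert precomposed != decomposed

# Normalization converts between the composed (NFC) and
# decomposed (NFD) forms, after which comparison works as expected.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# The "length" of a string therefore depends on what is counted.
word = "cafe\u0301"
print(len(word))                                 # 5 code points
print(len(unicodedata.normalize("NFC", word)))   # 4 code points
print(len(word.encode("utf-8")))                 # 6 bytes in UTF-8
```
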
    • Unicode Transformation Format (UTF): map code points to bytes
      • each can represent the full Unicode range
      • UTF-32: fixed length, 32 bit representation of a code point
        • 1:1 mapping between code point and 32 bit code unit value
        • simple but space inefficient
          • 11 wasted bits if all code points are relevant
          • rarely used for transmission
          • usually only the first plane is relevant
      • UTF-16: variable length, 16-32 bits
        • almost all relevant code points are encoded with 16 bits
        • used by Windows, .NET, Java, JavaScript
        • successor to Universal Coded Character Set 2 (UCS-2)
          • fixed width, 16 bits
          • limited to 65k code points
      • UTF-8: variable length, 8-32 bits
        • backward compatible with ASCII, but not with latin1 (bytes 0x80-0xFF are encoded differently)
        • single byte is used to represent ASCII
          • leading 0 to pad to 8 bits
        • multi byte encodings
          • indicate number of bytes in first byte
            • e.g. 1110xxxx indicates 3 bytes
          • other bytes start with 10xxxxxx
            • no confusion possible with single byte encodings
        • used on internet, unix
        • code unit is a single byte, so byte order is not an issue (unlike the 16 and 32 bit code units of UTF-16 and UTF-32)
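
The UTF-8 byte layout described above can be sketched as a hand-written encoder and compared against Python's built-in `str.encode`. The function name `utf8_encode` is made up for this illustration, and the 2-byte (110xxxxx) and 4-byte (11110xxx) lead byte patterns are filled in by analogy with the 3-byte example from the notes. The check for the range 0xD800-0xDFFF is there because those code points are reserved for UTF-16 and cannot be encoded.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative, not optimized)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"cannot encode {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: plain ASCII, leading 0 pads to 8 bits
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Same result as the built-in encoder for 1, 2, 3 and 4 byte cases.
for ch in ("A", "\u00E9", "\u20AC", "\U0001F600"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

# Encoded size of one code point in each UTF (byte order fixed, no BOM).
for ch in ("A", "\u20AC", "\U0001F600"):
    sizes = [len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")]
    print(f"U+{ord(ch):04X}", sizes)
# U+0041 [1, 2, 4]
# U+20AC [3, 2, 4]
# U+1F600 [4, 4, 4]
```
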
    • Byte Order Mark (BOM)
      • U+FEFF code point
      • purpose: distinguish between endianness
        • big-endian: most significant byte first (assumed for UTF-16/UTF-32 when no BOM or other indication is present; also the byte order of networking protocols)
        • little-endian: least significant byte first (e.g. Windows, x86/ARM architectures)
        • byte order, not bit order
        • only relevant when writing structures of more than 1 byte at once
      • reader may reverse bytes to align with own endianness
      • optionally inserted at start of document
      • if not parsed as BOM: corresponds to a zero width no-break space (ZWNBSP)
      • byte order of the encoded BOM signals the byte order of the rest of the text
        • 00 00 FE FF -> UTF-32, BE
        • FF FE 00 00 -> UTF-32, LE
        • FE FF -> UTF-16, BE
        • FF FE -> UTF-16, LE
        • EF BB BF -> UTF-8
          • not used for endianness since UTF-8 is written byte by byte and there can be no confusion
          • to distinguish from other similar encodings
      • typical for plaintext that has no other way to specify encoding
      • may break ASCII compatibility
      • alternative: heuristics (e.g. examine patterns of null bytes)
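
A minimal sketch of BOM sniffing based on the byte patterns listed above. The function name `sniff_bom` is hypothetical; the BOM constants come from Python's standard `codecs` module. The UTF-32 patterns have to be tested before the UTF-16 ones, because FF FE 00 00 also starts with FF FE. As a consequence, UTF-16 LE text whose first character after the BOM is U+0000 would be misread as UTF-32 LE.

```python
import codecs

# Order matters: longer patterns first (UTF-32 LE begins with the UTF-16 LE BOM).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# The BOM is simply U+FEFF written in the respective encoding.
assert "\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE
assert "\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))  # utf-16-le hi

# A UTF-8 BOM in front of pure ASCII text means the file is no longer pure ASCII.
assert "hi".encode("utf-8-sig") != "hi".encode("ascii")
```
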