Notes on character encodings

vhsven

2020-03-15

Binary Coded Decimal (BCD)
- encodes Latin script
- 6 bits
- A-Z, 0-9 (not a-z)
- never standardized
Extended Binary Coded Decimal Interchange Code (EBCDIC)
- 8 bits
- mainly used by IBM
- a-z, A-Z, 0-9 (discontinuous)
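
The discontinuity of the letter ranges can be made visible with a small Python sketch. It assumes Python's built-in `cp037` codec (EBCDIC US/Canada) as a representative EBCDIC variant; the helper name `code_ranges` is made up for this illustration.

```python
import string

def code_ranges(letters, encoding):
    """Collapse the byte values of `letters` into contiguous (first, last) runs."""
    codes = [ch.encode(encoding)[0] for ch in letters]
    runs = [[codes[0], codes[0]]]
    for code in codes[1:]:
        if code == runs[-1][1] + 1:
            runs[-1][1] = code          # extends the current run
        else:
            runs.append([code, code])   # gap: start a new run
    return [(hex(first), hex(last)) for first, last in runs]

print(code_ranges(string.ascii_lowercase, "cp037"))  # EBCDIC: three separate runs
print(code_ranges(string.ascii_lowercase, "ascii"))  # ASCII: one run
```

In this code page a-i, j-r and s-z each form their own run, so a check like `'a' <= c <= 'z'` also accepts non-letters.
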
American Standard Code for Information Interchange (ASCII)
- encodes Latin script
- an ANSI standard
- 7 bits -> 128 symbols (0x00 - 0x7F)
  - 95 printable: 0-9, A-Z, a-z, punctuation, specials
    - 0x20 space
    - 0x30 0
    - 0x41 A
    - 0x61 a
    - corresponds mostly to keys on modern keyboard
  - 33 non-printable
    - 0x00 NUL (\0)
      - used at the end of null-terminated strings in C
    - 0x07 BEL (\a)
    - 0x08 BS backspace (\b)
    - 0x09 HT horizontal tab (\t)
    - 0x0A LF line feed (\n)
    - 0x0D CR carriage return (\r)
    - most of the others are obsolete nowadays
  - sorting: ASCIIbetical order
- remaining (8th) bit of a byte sometimes used as parity check
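
A few of the ASCII properties listed above can be checked directly in Python. This is only a sketch: `even_parity` is a hypothetical helper, and even parity is just one of the possible conventions for the spare bit.

```python
def even_parity(code):
    """Return `code` with bit 7 set so that the number of 1 bits is even."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    return code | ((ones % 2) << 7)

# Code values from the list above.
assert ord(" ") == 0x20 and ord("0") == 0x30
assert ord("A") == 0x41 and ord("a") == 0x61

# Upper and lower case letters differ only in bit 0x20.
print(chr(ord("a") ^ 0x20))

# ASCIIbetical order: digits < upper case < lower case.
print(sorted(["banana", "Cherry", "apple", "42"]))

# 0x41 has two 1 bits (unchanged), 0x43 has three (parity bit gets set).
print(hex(even_parity(0x41)), hex(even_parity(0x43)))
```

Python compares strings by code point, which for pure ASCII input is exactly ASCIIbetical order.
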
intermezzo: code pages (CP)
- sets of characters plus encodings
- defined by a unique number
- used within big companies (IBM, MS, SAP, Oracle, ...)
- within Microsoft
  - OEM code pages: old, from DOS era
  - ANSI code pages: used in Windows until shift to Unicode
    - misnomer: not an ANSI standard
    - new name: Windows code pages
OEM code page 437
- used on original IBM PC running DOS
- 8 bit ASCII extension (0x00 - 0xFF)
- replaces almost all control characters with new symbols
- adds 128 new characters (mostly symbols)
- still supported in modern Windows with alt-codes (e.g. alt+numpad1 = ☺)
  - uses decimal codes, not hexadecimal
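
The decimal alt-codes map straight onto byte values of the code page, which can be sketched with Python's built-in `cp437` codec. The helper `alt_code` is hypothetical and restricted to the upper half (128-255), because Python's codec keeps 0x00-0x1F as control characters and therefore does not reproduce glyphs such as ☺ for alt+numpad1.

```python
def alt_code(decimal):
    """Character produced by Alt+<decimal> under code page 437 (128-255 only)."""
    if not 128 <= decimal <= 255:
        raise ValueError("this sketch only covers the upper half of CP437")
    return bytes([decimal]).decode("cp437")

print(alt_code(130))  # decimal 130 = 0x82
print(alt_code(201) + alt_code(205) + alt_code(187))  # box-drawing symbols
```
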
latin1 = ISO-8859-1
- 8 bit ASCII extension (0x00 - 0xFF)
- adds 128 new characters (mostly letters with diacritics)
windows-1252
- ANSI code page 1252
- 8 bit latin1 extension
- replaces some control characters from latin1 with new printables
- can be accessed with alt codes by typing numpad0 as prefix
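
To see where latin1 and windows-1252 disagree, every byte value can be decoded with both. The sketch below uses Python's `latin-1` and `cp1252` codecs; `differing_bytes` is a name invented for this example, and `errors="replace"` is needed because a handful of bytes are unassigned in Python's `cp1252` table.

```python
def differing_bytes():
    """Byte values that decode differently in latin1 and windows-1252."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        cp1252 = raw.decode("cp1252", errors="replace")
        if latin1 != cp1252:
            diffs.append(value)
    return diffs

diffs = differing_bytes()
print(hex(diffs[0]), hex(diffs[-1]), len(diffs))

# 0x80 is an invisible control character in latin1 but a printable in windows-1252.
print(repr(b"\x80".decode("latin-1")), b"\x80".decode("cp1252"))

# Letters with diacritics are shared by both.
print(b"\xe9".decode("latin-1") == b"\xe9".decode("cp1252"))
```

The two only differ in the range 0x80-0x9F, which is why text labelled latin1 is often decoded as windows-1252 in practice.
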
unicode ~= ISO 10646
- covers almost all scripts, both historic and contemporary
  - Arabic, Cyrillic, Greek, Hebrew, Latin, Chinese, Japanese, Braille, Emoji, ...
- supports bidirectional text (left-to-right and right-to-left)
- codespace: a range of integers from 0x000000 to 0x10FFFF available for encoding characters
  - 1,114,112 values (about 1.1M; needs 21 bits)
  - divided into 17 planes of 2^16 (65,536) code points each: 0x__0000 - 0x__FFFF
  - first plane (0x0000-0xFFFF) = Basic Multilingual Plane (BMP)
    - contains the most commonly used characters (room for 65,536 code points)
- code point: any value in the unicode codespace
  - notation: e.g. U+0041 for A (0x41)
- 1 code point != 1 character
  - combinations possible (e.g. é as e + combining acute accent U+0301, although é as single code point U+00E9 also exists)
  - calculating string length is not trivial
- Unicode Transformation Format (UTF): map code points to bytes
  - each can represent the full unicode range
  - UTF-32: fixed length, 32 bit representation of a code point
    - each 32-bit code unit holds the code point value directly (1:1 mapping)
    - simple but space inefficient
      - 11 wasted bits if all code points are relevant
      - rarely used for transmission
      - usually only the first plane is relevant
  - UTF-16: variable length, 16-32 bits
    - almost all relevant code points (the BMP) are encoded with 16 bits; the rest use a surrogate pair of two 16-bit units
    - used by Windows, .NET, Java, JavaScript
    - successor to Universal Coded Character Set 2 (UCS-2)
      - fixed width, 16 bits
      - limited to 65k code points
  - UTF-8: variable length, 8-32 bits
    - compatible with ASCII, but not latin-1
    - single byte is used to represent ASCII
      - leading 0 to pad to 8 bits
    - multi byte encodings
      - indicate number of bytes in first byte
        
        e.g. 1110xxxx indicates 3 bytes
      - other bytes start with 10xxxxxx
        
        no confusion possible with single byte encodings
    - used on internet, unix
    - code unit is a single byte, so byte order is not an issue (unlike the 16- and 32-bit code units of UTF-16 and UTF-32)
- Byte Order Mark (BOM)
  - U+FEFF code point
  - purpose: distinguish between endianness
    - big-endian: most significant byte first (assumed when no BOM is present; also used by networking protocols)
    - little-endian: least significant byte first (e.g. Windows, x86, ARM in most configurations)
    - byte order, not bit order
    - only relevant when writing structures of more than 1 byte at once
  - reader may reverse bytes to align with own endianness
  - optionally inserted at start of document
  - if not parsed as BOM: corresponds to zero width no-break space
  - byte order in which the BOM appears signals the encoding and endianness of the rest of the text
    - 00 00 FE FF -> UTF-32, BE
    - FF FE 00 00 -> UTF-32, LE
    - FE FF -> UTF-16, BE
    - FF FE -> UTF-16, LE
    - EF BB BF -> UTF-8
      - not used for endianness since UTF-8 is written byte by byte and there can be no confusion
      - serves only as a signature to distinguish UTF-8 from other byte-oriented encodings (e.g. 8-bit code pages)
  - typical for plaintext that has no other way to specify encoding
  - may break ASCII compatibility: a UTF-8 BOM puts three non-ASCII bytes in front of otherwise pure ASCII text
  - alternative: heuristics (e.g. examine patterns of null bytes)
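
The Unicode notes above can be tried out with a short Python sketch. It covers three points: code points versus visible characters, the byte length of the same text in each UTF, and BOM sniffing using the signatures listed above. `utf_sizes`, `sniff_bom` and the UTF-8 fallback are assumptions made for this illustration, not a standard API.

```python
import codecs
import unicodedata

# 1 code point != 1 character: two spellings of é.
composed = "\u00e9"      # single code point U+00E9
decomposed = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))  # len() counts code points
print(unicodedata.normalize("NFC", decomposed) == composed)

def utf_sizes(text):
    """Byte length of `text` per UTF (explicit byte order, so no BOM is added)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for sample in ("A", "\u20ac", "\U0001F600"):  # A, euro sign, emoji outside the BMP
    print(f"U+{ord(sample):04X}", utf_sizes(sample))

# UTF-8: lead byte 1110xxxx announces 3 bytes, the other bytes start with 10.
print([f"{byte:08b}" for byte in "\u20ac".encode("utf-8")])

# UTF-32 LE must be tested before UTF-16 LE: FF FE 00 00 starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data, default="utf-8"):
    """Return (encoding, payload without BOM); `default` if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return default, data

encoding, payload = sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
print(encoding, payload.decode(encoding))
```

One caveat of this approach: UTF-16 LE text that starts with U+0000 right after the BOM is indistinguishable from a UTF-32 LE BOM, so the sniffing stays a heuristic.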