- Binary Coded Decimal (BCD)
- encodes Latin script
- 6 bits
- A-Z, 0-9 (not a-z)
- never standardized
- Extended Binary Coded Decimal Interchange Code (EBCDIC)
- 8 bits
- mainly used by IBM
- a-z, A-Z, 0-9 (discontinuous)
- American Standard Code for Information Interchange (ASCII)
- encodes Latin script
- an ANSI standard
- 7 bits -> 128 symbols (0x00 - 0x7F)
- 95 printable: 0-9, A-Z, a-z, punctuation, specials
- 0x20 space
- 0x30 0
- 0x41 A
- 0x61 a
- corresponds mostly to keys on modern keyboard
- 33 non-printable
- 0x00 NUL (\0)
- used at the end of null-terminated strings in C
- 0x07 BEL (\a)
- 0x08 BS backspace (\b)
- 0x09 HT horizontal tab (\t)
- 0x0A LF line feed (\n)
- 0x0D CR carriage return (\r)
- mostly obsolete nowadays
- sorting: ASCIIbetical order
- eighth bit of a byte sometimes used as a parity bit
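The ASCII values, the ASCIIbetical ordering and the parity bit listed above can be checked with a short Python sketch. The helper name `with_even_parity` and the choice of even parity are assumptions made for illustration; the notes do not say which parity convention is meant.

```python
# Code values of a few ASCII characters (ord returns the numeric code).
for ch in [" ", "0", "A", "a"]:
    print(f"{ch!r}: 0x{ord(ch):02X}")

# ASCIIbetical order: sorting by code value puts digits before uppercase
# letters, and every uppercase letter before every lowercase letter.
words = ["banana", "Cherry", "apple", "42"]
asciibetical = sorted(words)
print(asciibetical)  # ['42', 'Cherry', 'apple', 'banana']


def with_even_parity(code: int) -> int:
    """Hypothetical helper: set bit 7 so the byte has an even number of 1 bits."""
    if not 0 <= code < 0x80:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") % 2
    return code | (parity << 7)


parity_a = with_even_parity(ord("A"))  # 0x41 has two 1 bits, stays 0x41
parity_c = with_even_parity(ord("C"))  # 0x43 has three 1 bits, becomes 0xC3
print(hex(parity_a), hex(parity_c))
```

Note how "Cherry" sorts before "apple": 0x43 (C) is smaller than 0x61 (a).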
- intermezzo: code pages (CP)
- sets of characters plus encodings
- defined by a unique number
- used within big companies (IBM, MS, SAP, Oracle, ...)
- within Microsoft
- OEM code pages: old, from DOS era
- ANSI code pages: used in Windows until shift to Unicode
- misnomer: not an ANSI standard
- new name: Windows code pages
- OEM code page 437
- used on original IBM PC running DOS
- 8 bit ASCII extension (0x00 - 0xFF)
- replaces almost all control characters with new symbols
- adds 128 new characters (mostly symbols)
- still supported in modern Windows with alt-codes (e.g. alt+numpad1 = ☺)
- uses decimal codes, not hexadecimal
- latin1 = ISO-8859-1
- 8 bit ASCII extension (0x00 - 0xFF)
- adds 128 new characters (mostly letters with diacritics)
- windows-1252
- ANSI code page 1252
- 8 bit latin1 extension
- replaces some control characters from latin1 with new printables
- can be accessed with alt codes by typing numpad0 as prefix
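To see how the three code pages above differ, the same bytes can be decoded with each of them. This is a minimal sketch using Python's built-in codecs `cp437`, `latin-1` and `cp1252`; the sample bytes are chosen for illustration only.

```python
# One ASCII byte (0x41) and two bytes from the upper half (0x80, 0xE9).
raw = bytes([0x41, 0x80, 0xE9])

# Decode the identical bytes under each code page.
decoded = {name: raw.decode(name) for name in ("cp437", "latin-1", "cp1252")}

for name, text in decoded.items():
    print(f"{name:8} {text!r}")
```

The ASCII byte decodes to A everywhere. Byte 0x80 is a symbol-era letter (Ç) in code page 437, a control character in latin1, and the euro sign in windows-1252, which is one of the control characters it replaces with a printable. Byte 0xE9 is é in both latin1 and windows-1252, but Θ in code page 437.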
- Unicode ~= ISO/IEC 10646
- covers almost all scripts, both historic and contemporary
- Arabic, Cyrillic, Greek, Hebrew, Latin, Chinese, Japanese, Braille, Emoji, ...
- supports bidirectional text (left-to-right and right-to-left)
- codespace: a range of integers from 0x000000 to 0x10FFFF available for encoding characters
- 1,114,112 values (about 1.1M; 21 bits needed)
- divided into 17 planes of 2^16 (65,536) code points each: 0x__0000 - 0x__FFFF
- first plane (0x0000-0xFFFF) = Basic Multilingual Plane (BMP)
- contains 65k most relevant characters
- code point: any value in the unicode codespace
- notation: e.g. U+0041 for A (0x41)
- 1 code point != 1 character
- combinations possible (e.g. é as e + combining acute accent U+0301, although é as single code point U+00E9 also exists)
- calculating string length is not trivial
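The following Python sketch illustrates the codespace notes above: the plane of a code point, the U+ notation, and why string length is not trivial. The helpers `plane` and `notation` are hypothetical names introduced here. Normalization via `unicodedata.normalize` is not covered in these notes; it is shown only as one common way to make the two spellings of é compare equal.

```python
import unicodedata


def plane(code_point: int) -> int:
    """Plane number (0..16): everything above the low 16 bits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return code_point >> 16


def notation(ch: str) -> str:
    """U+ notation with at least four hexadecimal digits."""
    return f"U+{ord(ch):04X}"


print(notation("A"), plane(ord("A")))   # U+0041 0 (BMP)
print(plane(0x1F600), plane(0x10FFFF))  # 1 16

composed = "\u00E9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# Assumption: NFC normalization as one way to unify both spellings.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)           # True
```

Python's `len` counts code points, so the two visually identical strings report different lengths.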
- Unicode Transformation Format (UTF): map code points to bytes
- each can represent the full unicode range
- UTF-32: fixed length, 32 bit representation of a code point
- code point value is stored directly as a 32-bit integer
- simple but space inefficient
- at least 11 bits wasted per code point (21 of 32 bits suffice), even if all code points are relevant
- rarely used for transmission
- usually only the first plane is relevant
- UTF-16: variable length, 16 or 32 bits
- almost all relevant code points are encoded with 16 bits
- used by Windows, .NET, Java, JavaScript
- successor to Universal Coded Character Set 2 (UCS-2)
- fixed width, 16 bits
- limited to 65k code points
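As a rough illustration of the size trade-off between UTF-32 and UTF-16, the sketch below encodes three sample characters with Python's built-in `utf-32-be` and `utf-16-be` codecs (the big-endian variants are used so that no byte order mark is prepended). The sample characters and the `sizes` dictionary are choices made for this example.

```python
samples = {
    "A": "A",                 # U+0041, ASCII
    "euro": "\u20AC",         # U+20AC, inside the BMP
    "emoji": "\U0001F600",    # U+1F600, outside the BMP
}

sizes = {}
for label, ch in samples.items():
    utf32 = ch.encode("utf-32-be")
    utf16 = ch.encode("utf-16-be")
    sizes[label] = (len(utf32), len(utf16))
    print(f"{label:6} UTF-32: {utf32.hex(' ')}   UTF-16: {utf16.hex(' ')}")
```

UTF-32 always spends 4 bytes. UTF-16 spends 2 bytes for the BMP characters and 4 bytes for the emoji, which a fixed-width 16-bit scheme such as UCS-2 cannot represent at all.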
- UTF-8: variable length, 8-32 bits
- compatible with ASCII, but not latin-1
- single byte is used to represent ASCII
- leading 0 to pad to 8 bits
- multi byte encodings
- indicate number of bytes in first byte
- e.g. 1110xxxx indicates 3 bytes
- other bytes start with 10xxxxxx
- no confusion possible with single byte encodings
- used on internet, unix
- code unit is a single byte, unlike the 16- and 32-bit code units of UTF-16 and UTF-32
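The UTF-8 bit layout described above (a length prefix in the first byte, 10xxxxxx in every following byte) can be written out by hand. This is a minimal sketch, not a production encoder: `utf8_encode` is a hypothetical helper, and unlike a conforming encoder it does not reject the reserved range 0xD800-0xDFFF. The results are compared against Python's built-in `utf-8` codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoding of one code point (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if cp < 0x80:          # 0xxxxxxx: plain ASCII, one byte
        return bytes([cp])
    if cp < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
    # Cross-check against the built-in codec.
    assert encoded == ch.encode("utf-8")

# Compatible with ASCII, but not with latin1: é is one byte (E9) in
# latin1 and two bytes (C3 A9) in UTF-8.
latin1_e_acute = "\u00E9".encode("latin-1")
utf8_e_acute = utf8_encode(0xE9)
```

Because every byte of a multi-byte sequence has its top bit set, none of them can be mistaken for a single-byte ASCII character.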
- Byte Order Mark (BOM)
- U+FEFF code point
- purpose: distinguish between endianness
- big-endian: most significant byte first (default, networking protocols)
- little-endian: least significant byte first (e.g. Windows, x86/ARM architectures)
- byte order, not bit order
- only relevant when writing structures of more than 1 byte at once
- reader may reverse bytes to align with own endianness
- optionally inserted at start of document
- if not parsed as BOM: interpreted as a zero width no-break space
- endianness of code point signals endianness of rest of text
- 00 00 FE FF -> UTF-32, BE
- FF FE 00 00 -> UTF-32, LE
- FE FF -> UTF-16, BE
- FF FE -> UTF-16, LE
- EF BB BF -> UTF-8
- not used for endianness since UTF-8 is written byte by byte and there can be no confusion
- used instead as a signature to distinguish UTF-8 from other similar encodings
- typical for plaintext that has no other way to specify encoding
- may break ASCII compatibility
- alternative: heuristics (e.g. examine patterns of null bytes)
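The BOM table above translates into a small sniffing routine. This is a sketch under stated assumptions: `sniff_bom` and `decode_with_bom` are hypothetical helper names, and falling back to UTF-8 when no BOM is present is a choice made for this example rather than a rule. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 must be tested before UTF-16, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode accordingly."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


print(sniff_bom(b"\xff\xfeA\x00"))             # ('utf-16-le', 2)
print(decode_with_bom(b"\xef\xbb\xbfhello"))   # hello
print(decode_with_bom("hello".encode("utf-16")))  # hello
```

One caveat of this ordering: UTF-16 LE text whose first character after the BOM is U+0000 would be misread as UTF-32 LE, which is where the heuristics mentioned above come in.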