Understanding Character Encoding: UTF-8, ASCII, Unicode

A clear explanation of character encoding - ASCII, Unicode, UTF-8, UTF-16, and why it matters for developers.

July 1, 2024 · 7 min read

Why Character Encoding Matters

Every character a computer stores or transmits must be represented as a number, and a character encoding defines that mapping. When text is written with one encoding and read with another, the result is the dreaded mojibake (garbled text like "â€™" instead of "’").
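As a small illustration of how mojibake arises, the sketch below encodes a right single quotation mark as UTF-8 and then decodes the bytes as Windows-1252. The choice of Windows-1252 as the "wrong" encoding is an assumption made for the example; other mismatched pairs produce different garbage.

```python
# A right single quotation mark (U+2019), common in "smart quote" text.
original = "\u2019"

# Written to disk or sent over the network as UTF-8: three bytes.
utf8_bytes = original.encode("utf-8")

# Read back by software that wrongly assumes Windows-1252:
# each byte becomes its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# If nothing was lost, reversing the mistake recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The repair step only works while the garbled string still contains every original byte; once text has been saved with replacement characters, the information is gone.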

ASCII (1963)

ASCII (American Standard Code for Information Interchange) maps 128 characters (0-127) to 7-bit codes:

  • A = 65, B = 66, ... Z = 90
  • a = 97, b = 98, ... z = 122
  • 0 = 48, 1 = 49, ... 9 = 57

ASCII covers only unaccented English letters, digits, basic punctuation, and control codes, so it is not sufficient for global text.
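The mapping above can be checked with Python's built-in `ord` and `chr`. This is a minimal sketch; the sample strings are chosen only for illustration.

```python
# ord() gives the numeric code of a character, chr() goes the other way.
codes = [ord(ch) for ch in "AZaz09"]
print(codes)
print(chr(66))

# Pure-English text fits in ASCII: one byte per character, all below 128.
ascii_bytes = "Hello".encode("ascii")
print(list(ascii_bytes))

# Anything outside the 128 ASCII characters cannot be encoded.
ascii_failed = False
try:
    "caf\u00e9".encode("ascii")  # "café"
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```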

Unicode

Unicode is a universal character set that assigns a unique number, called a "code point", to each character across the world's writing systems:

  • 'A' = U+0041
  • '€' = U+20AC
  • '中' = U+4E2D
  • '😀' = U+1F600

Unicode defines what characters exist; encodings (UTF-8, UTF-16, etc.) define how to store them.
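To make the distinction concrete, the sketch below looks up the code point of each sample character and then stores it with two different encodings. The helper name `describe` is hypothetical, and big-endian UTF-16 is used only so that no byte order mark appears in the output.

```python
def describe(ch: str) -> dict:
    """Return the code point of one character and its bytes in two encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",        # what the character is
        "utf-8": ch.encode("utf-8").hex(),        # one way to store it
        "utf-16": ch.encode("utf-16-be").hex(),   # another way to store it
    }

for ch in ["A", "\u20ac", "\u4e2d", "\U0001F600"]:  # A, €, 中, 😀
    print(describe(ch))
```

The code point stays the same in every case; only the stored bytes change with the encoding.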

UTF-8 (The Web Standard)

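UTF-8 encodes each Unicode code point using 1-4 bytes:

  • ASCII characters (U+0000 to U+007F): 1 byte, so any ASCII file is already valid UTF-8
  • Latin letters with diacritics, Greek, Cyrillic, Hebrew, Arabic (U+0080 to U+07FF): 2 bytes
  • Chinese, Japanese, Korean, and symbols such as '€' (U+0800 to U+FFFF): 3 bytes
  • Most emoji, rare scripts (U+10000 to U+10FFFF): 4 bytes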
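The byte counts in the list follow from how UTF-8 packs the bits of a code point into lead and continuation bytes. The function below is a hand-written sketch of that packing for illustration only; the name `utf8_encode` is hypothetical, and real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare against Python's built-in encoder for one character per length class.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:  # A, é, 中, 😀
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", len(encoded), encoded == ch.encode("utf-8"))
```

Because every continuation byte starts with the bits `10`, a decoder can find character boundaries from any position in the stream, which is part of why UTF-8 is robust in practice.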

UTF-8 is used by more than 98% of websites. Always use UTF-8 for web content, and declare it near the top of each HTML page:

<meta charset="UTF-8">
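The declaration only helps if the file's bytes really are UTF-8. As a hedged sketch, the snippet below writes a tiny page with an explicit encoding and reads it back; the file name `index.html` and the page content are made up for the example, and a temporary directory stands in for a real web root.

```python
import tempfile
from pathlib import Path

html = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    '<head><meta charset="UTF-8"><title>Demo</title></head>\n'
    "<body>\u20ac \u4e2d \U0001F600</body>\n"  # € 中 😀
    "</html>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    # Pass the encoding explicitly instead of relying on the platform default.
    page.write_text(html, encoding="utf-8")
    raw = page.read_bytes()
    round_trip = page.read_text(encoding="utf-8")

print(round_trip == html)
```

Omitting the `encoding` argument makes Python fall back to a platform-dependent default on some systems, which is a common source of mismatched pages.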

HTML Entities

When you can't use a character directly in HTML, use entities:

&amp;  → &
&lt;   → <
&gt;   → >
&copy; → ©
&#128512; → 😀
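Entities can also be produced and decoded programmatically. The sketch below uses the `html` module from Python's standard library; the sample strings are assumptions chosen to match the table above.

```python
import html

# Escape characters that have special meaning in HTML markup.
escaped = html.escape("<b>Fish & Chips</b>")
print(escaped)

# Decode named and numeric entities back into characters.
decoded = html.unescape("&amp; &lt; &gt; &copy; &#128512;")
print(decoded)
```

Note that `html.escape` only touches the markup-significant characters (`&`, `<`, `>`, and quotes by default); characters such as © or emoji are left as-is, which is fine when the page is served as UTF-8.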

Use our HTML Entity Encoder/Decoder and Unicode Escape tool.