Understanding Character Encoding: UTF-8, ASCII, Unicode

Why Character Encoding Matters

Every text character stored or transmitted by a computer must be mapped to a number. Character encoding defines this mapping. Getting it wrong causes the dreaded mojibake: garbled text such as "â€™" appearing where a curly apostrophe "’" should be.
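As a minimal sketch of how mojibake arises, the Python snippet below encodes a curly apostrophe (U+2019) as UTF-8 and then decodes the same bytes with the wrong encoding. Windows-1252 is assumed as the wrong decoder here purely for illustration; other mismatches produce different garbage.

```python
# A right single quotation mark (U+2019), as often produced by word processors.
original = "\u2019"

# Correct round trip: encode as UTF-8, decode as UTF-8.
utf8_bytes = original.encode("utf-8")   # three bytes: E2 80 99
round_trip = utf8_bytes.decode("utf-8")

# Mojibake: the same bytes decoded with the wrong encoding (Windows-1252).
garbled = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "))  # e2 80 99
print(garbled)              # â€™
```

The bytes never changed; only the interpretation did, which is why mojibake can usually be avoided by agreeing on one encoding at every step.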

ASCII (1963)

ASCII (American Standard Code for Information Interchange) maps 128 characters (0-127) to 7-bit codes:

A = 65, B = 66, ... Z = 90
a = 97, b = 98, ... z = 122
0 = 48, 1 = 49, ... 9 = 57

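ASCII covers only unaccented English letters, digits, basic punctuation, and control codes. It cannot represent accented letters or non-Latin scripts, so it is not sufficient for global text.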
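The mapping above can be checked in Python with the built-in ord() and chr(). This is a small sketch; the variable names and the sample word "café" are chosen only for illustration.

```python
# ord() gives a character's numeric code; chr() goes the other way.
code_A = ord("A")    # 65
char_97 = chr(97)    # 'a'

# Pure-ASCII text encodes to exactly one byte per character.
ascii_bytes = "Hello".encode("ascii")

# A character outside the 0-127 range cannot be encoded as ASCII.
try:
    "café".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(code_A, char_97, ascii_bytes, ascii_ok)
```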

Unicode

Unicode is a universal character set that aims to assign a unique "code point" to every character in every writing system. Code points are written as U+ followed by a hexadecimal number:

'A' = U+0041
'€' = U+20AC
'中' = U+4E2D
'😀' = U+1F600

Unicode defines what characters exist; encodings (UTF-8, UTF-16, etc.) define how to store them.
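To illustrate the split between code points and encodings, the sketch below prints code points in U+XXXX notation and then encodes one character two different ways. The helper name code_point_label is hypothetical, introduced only for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# One code point per character, independent of any encoding.
labels = {ch: code_point_label(ch) for ch in "A€中😀"}

# The same code point (U+20AC) stored under two different encodings.
euro_utf8 = "€".encode("utf-8")       # 3 bytes
euro_utf16 = "€".encode("utf-16-be")  # 2 bytes

print(labels)
print(euro_utf8.hex(" "), "|", euro_utf16.hex(" "))
```

The code point stays U+20AC in both cases; only the stored byte sequence differs.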

UTF-8 (The Web Standard)

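UTF-8 encodes each Unicode code point using 1-4 bytes, depending on its value:

ASCII characters (U+0000-U+007F): 1 byte - backward compatible!
Accented Latin, Greek, Cyrillic, Hebrew, Arabic (U+0080-U+07FF): 2 bytes
Most Chinese, Japanese, Korean, plus symbols such as € (U+0800-U+FFFF): 3 bytes
Emoji, rare scripts (U+10000-U+10FFFF): 4 bytes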
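The byte counts above follow from how UTF-8 packs a code point's bits into lead and continuation bytes. The function below is a hand-written sketch of that packing for illustration only; utf8_encode is a hypothetical name, and real code should simply call str.encode("utf-8").

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

samples = "Aé€中😀"
byte_counts = {ch: len(utf8_encode(ord(ch))) for ch in samples}
print(byte_counts)
```

For these samples the hand-written encoder agrees with Python's built-in encoder, which is a quick way to sanity-check the bit layout.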

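UTF-8 is used by more than 98% of websites. Always use UTF-8 for web content, and declare it near the top of the page's <head> so the browser knows how to decode the bytes that follow: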

<meta charset="UTF-8">
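The meta tag only declares the encoding; the file itself still has to be saved as UTF-8. As a hedged sketch, the snippet below writes a small page with an explicit encoding and reads it back. The file name index.html and the page content are assumptions made up for the example.

```python
import tempfile
from pathlib import Path

html_page = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="UTF-8"><title>Demo</title></head>\n'
    "<body>€ 中 😀</body></html>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    # Pass the encoding explicitly instead of relying on the platform default.
    page.write_text(html_page, encoding="utf-8")
    raw = page.read_bytes()
    read_back = page.read_text(encoding="utf-8")

print(read_back == html_page)
```

Passing encoding="utf-8" on both write and read keeps the bytes on disk consistent with what the page declares.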

HTML Entities

When a character has special meaning in HTML (such as & or <) or is hard to type, write it as a named entity or a numeric character reference instead:

&amp;  → &
&lt;   → <
&gt;   → >
&copy; → ©
&#128512; → 😀
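In Python, the standard library's html module handles this conversion in both directions. The sketch below uses html.escape and html.unescape; the sample strings are made up for illustration.

```python
import html

# Escape the characters that have special meaning in HTML markup.
escaped = html.escape("a < b & c > d", quote=False)

# Decode named entities and numeric character references back to text.
unescaped = html.unescape("&copy; &#128512;")

# A numeric reference is just the code point written in decimal.
numeric_ref = f"&#{ord('😀')};"

print(escaped)      # a &lt; b &amp; c &gt; d
print(unescaped)    # © 😀
print(numeric_ref)  # &#128512;
```

With quote=False only &, < and > are escaped; the default (quote=True) also escapes quotation marks, which matters when the text goes inside an attribute value.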

Use our HTML Entity Encoder/Decoder and Unicode Escape tool.