Unicode to UTF-8
Convert any text, emoji, or symbol into its raw UTF-8 bytes. See the hex sequence, binary representation, and byte count per character, all in your browser.
Unicode is a list. UTF-8 is the format.
These two terms are often used interchangeably, but they describe different things. Unicode is a specification: a large, continuously updated table that assigns a unique number (a code point) to every character it covers, across the world's writing systems, plus emoji, symbols, and control characters. There are currently over 149,000 assigned characters.
UTF-8 is an encoding: the algorithm that takes those abstract numbers and turns them into actual bytes that processors, storage drives, and network packets can work with. A file doesn't store "the letter A"; it stores the byte 0x41. UTF-8 is the rulebook for that translation.
The key design decision that made UTF-8 dominant is variable-width encoding. Instead of giving every character the same number of bytes, UTF-8 uses 1 byte for ASCII characters, 2 for accented Latin letters and scripts such as Greek, Cyrillic, Hebrew, and Arabic, 3 for the rest of the Basic Multilingual Plane (most CJK characters and many symbols), and 4 for code points above U+FFFF, which includes most emoji. English text stays compact; the full range of Unicode stays reachable.
How specific characters encode
The table below shows five representative characters spanning the full 1 to 4 byte range. Enter any of these into the tool above to see the full binary breakdown.
| Char | Code Point | Bytes | Hex | Note |
|---|---|---|---|---|
| A | U+0041 | 1 | 41 | ASCII, unchanged since 1963 |
| é | U+00E9 | 2 | C3 A9 | Latin-1 Supplement, common in French and Spanish |
| € | U+20AC | 3 | E2 82 AC | Euro sign, not in Latin-1 |
| 中 | U+4E2D | 3 | E4 B8 AD | CJK: most everyday Chinese characters are 3 bytes; rarer ones above U+FFFF take 4 |
| 🌍 | U+1F30D | 4 | F0 9F 8C 8D | Emoji: 4 bytes, like every code point above U+FFFF |
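The rows above can be reproduced with a few lines of Python. This is a minimal sketch, not the tool's own code; `utf8_row` is a hypothetical helper name, and the characters are written as escapes so the snippet survives any copy-paste encoding issues.

```python
def utf8_row(char: str) -> tuple[str, int, str]:
    """Return (code point label, byte count, spaced hex) for one character."""
    encoded = char.encode("utf-8")
    code_point = f"U+{ord(char):04X}"
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    return code_point, len(encoded), hex_bytes


# A, e-acute, euro sign, CJK "middle", globe emoji
for ch in ["A", "\u00e9", "\u20ac", "\u4e2d", "\U0001F30D"]:
    print(*utf8_row(ch))
```

Each printed line holds the code point, the byte count, and the hex bytes, in the same order as the table columns.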
The variable-width scheme, explained
UTF-8's variable width works because the leading bits of the first byte encode how many bytes follow. Parsers use this to walk through a byte stream without a separator or length header: each character announces its own size.
0xxxxxxx
A → 01000001
The leading 0 signals a single-byte character. Identical to ASCII.
110xxxxx 10xxxxxx
é → 11000011 10101001
Leading 110 means two bytes total. The second byte always starts with 10.
1110xxxx 10xxxxxx 10xxxxxx
€ → 11100010 10000010 10101100
Leading 1110 means three bytes. Covers most CJK characters and symbols.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
🌍 → 11110000 10011111 10001100 10001101
Leading 11110 means four bytes. Most emoji live in this range, along with every other code point above U+FFFF.
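The leading-bit rules can be turned into a small parser. The sketch below is illustrative only: `sequence_length` and `split_characters` are hypothetical names, and the code assumes well-formed input (it does not check that continuation bytes start with 10, nor reject overlong forms, which a real decoder must do).

```python
def sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError(f"0x{lead_byte:02X} is not a valid lead byte")


def split_characters(data: bytes) -> list[bytes]:
    """Cut a UTF-8 byte string into one byte sequence per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


stream = "A\u00e9\u20ac\U0001F30D".encode("utf-8")
print([chunk.hex(" ") for chunk in split_characters(stream)])
```

No separator is needed between characters: the parser reads one lead byte, learns the length, and jumps straight to the next lead byte.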
Real situations where byte-level encoding matters
Debugging MySQL emoji corruption
MySQL's legacy utf8 charset only handles up to 3 bytes per character, a limit that predates most emoji. Storing a four-byte emoji in a utf8 column silently truncates it or replaces it with ????. Use this tool to confirm a character is four bytes, then fix the column with: ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Constructing raw network payloads
Cryptographic input preparation
HTTP header and URL encoding issues
HTTP header values are effectively limited to ASCII, so non-ASCII characters have to be encoded (typically as %XX percent-encoded UTF-8 bytes). URLs follow the same rule. If a multi-byte character is being mangled in a header or URL, checking the raw UTF-8 hex sequence here tells you exactly what the percent-encoded form should be.
Validating text field byte limits
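A field capped at a number of bytes can reject text that looks short when counted in characters. As a rough sketch of a byte-limit check in Python (both function names are hypothetical, and the limits are arbitrary examples):

```python
def fits_byte_limit(text: str, max_bytes: int) -> bool:
    """True if the UTF-8 encoding of text fits in max_bytes."""
    return len(text.encode("utf-8")) <= max_bytes


def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 form fits in max_bytes without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # The slice may end mid-character; errors="ignore" drops that partial tail.
    return clipped.decode("utf-8", errors="ignore")


print(fits_byte_limit("\U0001F30D", 3))           # one emoji, four bytes
print(truncate_to_bytes("ab\u20ac", 4).encode())  # the euro sign does not fit
```

Slicing the bytes and then decoding leniently is one simple way to avoid leaving half a character at the end; whether silent truncation is acceptable at all depends on the application.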
The BOM: what it is and why to avoid it
The Byte Order Mark is the three-byte sequence EF BB BF that some editors (notably older versions of Notepad on Windows) write at the very beginning of a UTF-8 file. It was borrowed from UTF-16, where byte order genuinely matters because the encoding has two variants (big-endian and little-endian). In UTF-16, the BOM tells a reader which variant it is looking at.
UTF-8 doesn't have a byte order problem. Every UTF-8 byte stream is identical regardless of the host machine's endianness. So the BOM carries no byte-order information in UTF-8; it's just three extra bytes at the start of the file.
The Unicode Standard says a BOM is neither required nor recommended for UTF-8. In practice, BOMs cause real problems: PHP sends the BOM bytes as output because they come before the <?php tag, which can break later header calls; Unix shell scripts fail because the shebang is no longer the first bytes of the file; and CSV files can show a phantom character in the top-left cell in some spreadsheet apps and parsers.
Quick rule of thumb
If you control the toolchain: save UTF-8 without BOM. If you're receiving files from Windows users or legacy systems, strip the BOM before processing. Most modern editors (VS Code, JetBrains IDEs) default to UTF-8 without BOM and let you check or change it in the file encoding settings.
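Stripping a BOM before processing takes only a few lines. A minimal Python sketch, assuming the file content is already in memory as bytes (`strip_utf8_bom` is a hypothetical helper name):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = b"\xef\xbb\xbf#!/bin/sh\n"
print(strip_utf8_bom(raw))       # bytes now start with the shebang
print(raw.decode("utf-8-sig"))   # the utf-8-sig codec also skips a leading BOM
```

When decoding to text anyway, the `utf-8-sig` codec is usually the simpler choice, since it accepts input with or without a BOM.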
UTF-8 vs UTF-16 vs UTF-32
All three are valid Unicode encodings; they just make different trade-offs between space efficiency, complexity, and compatibility.
| | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Bytes per ASCII char | 1 | 2 | 4 |
| Bytes per emoji (e.g. U+1F30D) | 4 | 4 | 4 |
| ASCII compatible | Yes | No | No |
| Byte order issues | None | Yes (BOM needed) | Yes (BOM needed) |
| Used by | Web, Linux, macOS, JSON | Windows API, Java, JS internals | Rare: databases, some compilers |
| Web share | ~98% | ~1% | <0.1% |
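The byte counts in the table are easy to check. A small Python sketch (`encoded_sizes` is a hypothetical name; the little-endian codec variants are used only so that Python does not prepend a BOM and inflate the counts):

```python
def encoded_sizes(char: str) -> dict[str, int]:
    """Byte count of one character in each encoding, BOM excluded."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


print(encoded_sizes("A"))            # ASCII letter
print(encoded_sizes("\u20ac"))       # euro sign
print(encoded_sizes("\U0001F30D"))   # globe emoji
```

The euro sign shows the one case where UTF-16 is smaller than UTF-8: characters between U+0800 and U+FFFF take 3 bytes in UTF-8 but 2 in UTF-16.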
Practical notes for working with UTF-8
Byte length ≠ character length
String.length in JavaScript returns UTF-16 code units, not characters. A single emoji can have .length of 2. For byte-accurate sizing, use TextEncoder and check the resulting Uint8Array's byteLength.
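The same three-way distinction can be seen from Python. This sketch (with a hypothetical `text_lengths` helper) reports the code point count, the UTF-16 code unit count that JavaScript's `.length` would give, and the UTF-8 byte count:

```python
def text_lengths(text: str) -> dict[str, int]:
    """Length of text measured three different ways."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; -le avoids a BOM in the count.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(text_lengths("caf\u00e9"))
print(text_lengths("\U0001F30D"))
```

None of these counts is the number of characters a user perceives; sequences such as flags or emoji with skin-tone modifiers span several code points.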
MySQL utf8 is not real UTF-8
MySQL named its 3-byte charset 'utf8', which is technically incorrect. The actual full-Unicode charset is 'utf8mb4'. Always use utf8mb4 for any column that might receive user input.
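To find out whether a given string would be damaged by a 3-byte column, it is enough to look for code points above U+FFFF, since those are exactly the ones that need four bytes. A minimal sketch with hypothetical helper names:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character needs 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def four_byte_characters(text: str) -> list[str]:
    """Code point labels of every character that needs 4 bytes."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF]


print(needs_utf8mb4("caf\u00e9 \u20ac \u4e2d"))
print(needs_utf8mb4("hello \U0001F30D"))
print(four_byte_characters("hello \U0001F30D"))
```

A check like this is useful for auditing existing data before a migration; it is not a substitute for switching the column to utf8mb4.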
JSON is always UTF-8
The JSON specification (RFC 8259) requires UTF-8 for JSON exchanged between systems that are not part of a closed ecosystem. If you're producing JSON for others to consume, produce UTF-8. Other encodings will break many parsers.
HTTP Content-Type charset
Always declare charset=utf-8 in your Content-Type header for HTML and API responses. Without it, browsers may guess the encoding, and guess wrong, especially for pages with non-ASCII content.
len() in Python counts code points, not bytes
len('🌍') returns 1 in Python 3, because Python strings are sequences of Unicode code points. To get the byte count: len('🌍'.encode('utf-8')) returns 4.
Normalisation matters for comparisons
Some characters have multiple Unicode representations. 'é' can be a single code point (U+00E9) or two code points ('e' + combining acute accent U+0301). These look identical but have different byte sequences. Normalise both strings to the same form (for example NFC) before comparing them: unicodedata.normalize() in Python, String.prototype.normalize() in JavaScript.
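In Python, for instance, normalisation lives in the standard-library `unicodedata` module. The sketch below compares the two spellings of the same accented letter; `same_text` is a hypothetical helper, and NFC is chosen as the target form only as an example.

```python
import unicodedata

composed = "\u00e9"      # single code point: e with acute
decomposed = "e\u0301"   # 'e' followed by combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)               # False
print(composed.encode("utf-8").hex(" "))    # c3 a9
print(decomposed.encode("utf-8").hex(" "))  # 65 cc 81
print(same_text(composed, decomposed))      # True
```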
Common questions
What's the difference between Unicode and UTF-8?
Unicode is a standard that assigns a unique number, called a code point, to every character and symbol it covers. U+0041 is the code point for 'A'. U+1F30D is the code point for 🌍. UTF-8 is the encoding: the algorithm that converts those abstract numbers into actual bytes that a CPU, hard drive, or network packet can handle. You can think of Unicode as a global character dictionary and UTF-8 as the printing format.
Why do some characters use more bytes than others?
UTF-8 is a variable-width encoding. ASCII characters (basic Latin letters, digits, punctuation) encode to exactly one byte, which is why UTF-8 is fully backward-compatible with 1960s ASCII. Characters in most European and Middle Eastern scripts use two bytes. Most CJK characters (Chinese, Japanese, Korean) and symbols like € use three bytes. Most emoji use four bytes. The encoding scheme embeds the byte count in the leading bits of the first byte so parsers know how many bytes belong to each character.
Why does MySQL corrupt emojis into question marks?
MySQL's legacy 'utf8' charset only supports up to three bytes per character, which predates most emoji. When you try to store a four-byte emoji in a utf8 column, MySQL either silently truncates it or replaces it with '????'. The fix is to set the column (and ideally the database and connection) charset to 'utf8mb4', which is the correct implementation of full Unicode support. If you're migrating an existing table, ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci handles it.
What is a Byte Order Mark (BOM) and should I use one?
A BOM is the byte sequence EF BB BF placed at the very start of a UTF-8 file to signal the encoding to applications that might otherwise have to guess. The Unicode Standard says a BOM is neither required nor recommended for UTF-8. Unlike UTF-16 (where byte order genuinely matters), UTF-8 doesn't have an endianness problem, so the BOM is unnecessary. In practice, it breaks PHP scripts that send headers, confuses Unix shell scripts, and can cause unexpected characters to appear in CSV files read by tools that don't expect it.
Is UTF-8 the only way to encode Unicode?
No. UTF-16 encodes every character in two or four bytes and is used internally by JavaScript strings and the Windows API. UTF-32 uses a fixed four bytes per character, making random access fast but storage expensive. UTF-8 has won the web: over 98% of websites use it. Its combination of ASCII compatibility, space efficiency for Latin-script text, and full Unicode coverage is hard to beat.
How does this tool perform the conversion?
It uses the browser's built-in TextEncoder API with the 'utf-8' encoding label, which serializes the input string into a Uint8Array of raw bytes. The tool then formats those bytes as hex (base-16) and binary (base-2) for inspection. No server is involved: the conversion runs entirely in your browser's JavaScript engine.
Runs entirely in your browser. All encoding is performed using the browser's native TextEncoder API. No text you enter is sent to any server or logged anywhere.