Text to Code Points
Paste any text, emoji, or symbol to see the exact Unicode code points behind it, including surrogate pairs, ZWJ sequences, and invisible characters that don't show up in your editor.
What are Unicode code points?
Every character you can type (every letter, punctuation mark, emoji, and symbol) has a unique number in the Unicode standard. That number is its code point, written as U+XXXX in hexadecimal, using four to six hex digits.
Unicode is maintained by the Unicode Consortium and, as of version 15, maps over 149,000 characters across 161 scripts. The code space runs from U+0000 to U+10FFFF, organized into 17 planes of 65,536 code points each.
The first plane, the Basic Multilingual Plane (U+0000–U+FFFF), covers Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, Korean, and most punctuation. Emoji and historic scripts largely live in the Supplementary Planes (U+10000 and above), which is why they behave differently in programming languages that predate them.
Reading the output: a practical example
Paste 👨‍👩‍👧 into the tool above. Instead of one entry, you'll see this sequence unpacked:
| Code Point | Character | Name | Notes |
|---|---|---|---|
| U+1F468 | 👨 | Man | Supplementary Plane, emoji |
| U+200D | (invisible) | Zero-Width Joiner | Invisible combiner |
| U+1F469 | 👩 | Woman | Supplementary Plane, emoji |
| U+200D | (invisible) | Zero-Width Joiner | Invisible combiner |
| U+1F467 | 👧 | Girl | Supplementary Plane, emoji |
The browser renders that entire five-part sequence as a single glyph. If you call .length on it in JavaScript, you'll get 8, not 1 and not 5: each of the three emoji takes two UTF-16 code units and each of the two joiners takes one. Understanding this is essential for correct emoji-aware string handling.
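To reproduce the table above yourself, here is a minimal sketch. The helper name `toCodePoints` is made up for illustration and is not necessarily how this tool is implemented; the emoji is written with escapes so the joiners stay visible in source.

```javascript
// Hypothetical helper: format each code point of a string as U+XXXX.
function toCodePoints(str) {
  // Spreading a string iterates by code point, so surrogate pairs stay together.
  return [...str].map((ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Man, ZWJ, woman, ZWJ, girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(toCodePoints(family).join(" "));
// U+1F468 U+200D U+1F469 U+200D U+1F467
console.log(family.length); // 8 (UTF-16 code units)
```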
When developers actually need this
Knowing a character's code point isn't trivia. It surfaces as a real need in a handful of recurring debugging and security scenarios.
Regex character class ranges
When filtering input by script (blocking Cyrillic spam in a comment field, allowing only CJK characters in a Japanese form), you need the exact hex range for the regex.
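As a rough sketch of both approaches in JavaScript: an explicit hex range, and Unicode property escapes, which let the regex engine look up the script for you. The variable names are illustrative, and the set of scripts accepted for a "Japanese form" is an assumption you would adjust to your own requirements.

```javascript
// Explicit hex range: the basic Cyrillic block only (U+0400 to U+04FF).
const cyrillicBlock = /[\u0400-\u04FF]/;

// The `u` flag turns on \p{...} Unicode property escapes (ES2018 and later).
const containsCyrillic = /\p{Script=Cyrillic}/u;

// Han, Hiragana, and Katakana, plus U+30FC (the long-vowel mark,
// which Unicode assigns to the "Common" script rather than Katakana).
const japaneseOnly =
  /^[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\u30FC]+$/u;

const privet = "\u043F\u0440\u0438\u0432\u0435\u0442"; // "privet" in Cyrillic letters

console.log(cyrillicBlock.test(privet));                    // true
console.log(containsCyrillic.test(privet));                 // true
console.log(containsCyrillic.test("hello"));                // false
console.log(japaneseOnly.test("\u65E5\u672C\u8A9E"));       // true (three kanji)
console.log(japaneseOnly.test("\u30B3\u30FC\u30D2\u30FC")); // true (katakana with long-vowel marks)
console.log(japaneseOnly.test("abc"));                      // false
```

The property-escape form also matches Cyrillic letters that live outside the basic block, which a single hex range would miss.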
Font glyph debugging
If a character renders as a box (□) or question mark in your app, the font file likely doesn't have a glyph for that code point. Extract the U+ value and check it against the font's character map in a tool like FontForge or the OS character viewer.
Homoglyph / spoofing detection
Two strings can look identical but be completely different characters. The Latin a (U+0061) and Cyrillic а (U+0430) are visually indistinguishable in most fonts. Attackers exploit this in phishing domains. Paste suspicious text here to expose what's actually inside it.
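One simple heuristic is to flag strings that mix scripts. The sketch below is illustrative only: `scriptsIn` is a hypothetical helper that checks just three scripts, and real spoofing detection needs a much fuller confusables list.

```javascript
// Hypothetical helper: report which of a few scripts appear in a string.
function scriptsIn(str) {
  const checks = {
    Latin: /\p{Script=Latin}/u,
    Cyrillic: /\p{Script=Cyrillic}/u,
    Greek: /\p{Script=Greek}/u,
  };
  return Object.keys(checks).filter((name) => checks[name].test(str));
}

const genuine = "example";
const spoofed = "ex\u0430mple"; // the "a" is Cyrillic U+0430, not Latin U+0061

console.log(genuine === spoofed);          // false
console.log(scriptsIn(genuine).join(",")); // Latin
console.log(scriptsIn(spoofed).join(",")); // Latin,Cyrillic
```

A string that reports more than one script is not automatically malicious, but in a domain name or username it is usually worth a closer look.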
Invisible character bugs
Zero-width spaces (U+200B), soft hyphens (U+00AD), and directional marks (U+200E / U+200F) are copied from web pages, PDFs, and rich text editors without anyone noticing. They break string equality checks and cause subtle parsing failures. This tool makes them visible.
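If you want to do the same check in code, a minimal sketch might look like this. It covers only the four characters named above; the helper names are made up for illustration. ZWJ (U+200D) is deliberately left out, since stripping it would break emoji sequences.

```javascript
// Zero-width space, soft hyphen, left-to-right mark, right-to-left mark.
const INVISIBLES = /[\u200B\u00AD\u200E\u200F]/g;

// Hypothetical helper: replace each invisible character with a visible <U+XXXX> tag.
function revealInvisibles(str) {
  return str.replace(INVISIBLES, (ch) =>
    "<U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0") + ">"
  );
}

// Hypothetical helper: remove the invisible characters entirely.
function stripInvisibles(str) {
  return str.replace(INVISIBLES, "");
}

const pasted = "user\u200Bname"; // looks like "username" on screen

console.log(pasted === "username");                  // false
console.log(revealInvisibles(pasted));               // user<U+200B>name
console.log(stripInvisibles(pasted) === "username"); // true
```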
Emoji-aware string length
Enforcing a "140 character" limit on user-generated content? A naive .length call counts most emoji as 2 characters each, and more for ZWJ sequences. Use the code point count from this tool's output to understand what a correct character counter needs to handle.
HTML entities & CSS content
HTML entities use decimal (&#9731;) or hex (&#x2603;) code points for special characters. CSS content properties use the hex value directly. Knowing the U+ value is the starting point for both.
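Going from a character to those three forms is mechanical. Here is a small sketch using the snowman (U+2603); `toEscapes` is a hypothetical helper name.

```javascript
// Hypothetical helper: build HTML and CSS escapes for the first code point of a string.
function toEscapes(ch) {
  const cp = ch.codePointAt(0);
  const hex = cp.toString(16).toUpperCase();
  return {
    htmlDecimal: "&#" + cp + ";",
    htmlHex: "&#x" + hex + ";",
    // Backslash plus hex digits, for use inside a CSS string: content: "\2603";
    // If the next character is itself a hex digit, add a space after the escape.
    css: "\\" + hex,
  };
}

const snowman = toEscapes("\u2603");
console.log(snowman.htmlDecimal); // &#9731;
console.log(snowman.htmlHex);     // &#x2603;
console.log(snowman.css);         // \2603
```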
Why .length lies to you
JavaScript's .length counts UTF-16 code units, not characters. For anything outside the Basic Multilingual Plane, those are not the same number.
For reliable character counting, use [...str].length to count code points, or Intl.Segmenter to count grapheme clusters (what a user perceives as one character, so a ZWJ sequence counts once). The code point count shown by this tool corresponds to [...str].length.
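Here are the three counts side by side for the family emoji, as a quick sketch. It assumes a runtime that ships Intl.Segmenter (current browsers and recent Node.js versions do).

```javascript
// Man, ZWJ, woman, ZWJ, girl.
const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// UTF-16 code units: what .length reports.
const codeUnits = familyEmoji.length;

// Code points: spreading iterates one code point at a time.
const codePoints = [...familyEmoji].length;

// Grapheme clusters: what a reader sees as one character.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(familyEmoji)].length;

console.log(codeUnits, codePoints, graphemes); // 8 5 1
```

Which count is "correct" depends on what the limit is for: storage budgets usually care about code units or bytes, while a user-facing character counter usually wants graphemes.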
A quick map of Unicode planes
If your extracted code point is U+10000 or higher (five or six hex digits), it's in a Supplementary Plane, which is why it behaves differently in legacy string handling.
| Plane | Range | What lives here |
|---|---|---|
| BMP (Plane 0) | U+0000–U+FFFF | Latin, Greek, Cyrillic, Arabic, CJK, most punctuation |
| SMP (Plane 1) | U+10000–U+1FFFF | Emoji, historic scripts (Linear B, Gothic, Cuneiform) |
| SIP (Plane 2) | U+20000–U+2FFFF | CJK extensions B–F, rare Chinese/Japanese/Korean characters |
| TIP (Plane 3) | U+30000–U+3FFFF | CJK extensions G–H (extremely rare) |
| Planes 4–13 | U+40000–U+DFFFF | Unassigned (reserved for future use) |
| SSP (Plane 14) | U+E0000–U+EFFFF | Tag characters and supplementary variation selectors |
| SPUA-A/B (15–16) | U+F0000–U+10FFFF | Private Use Areas, app-specific glyphs |
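Because every plane spans exactly 0x10000 code points, the plane number is just the code point shifted right by 16 bits. A minimal sketch, with `planeOf` as an illustrative helper name:

```javascript
// Hypothetical helper: return the plane number (0 to 16) for a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a Unicode code point: " + codePoint);
  }
  // Each plane holds 0x10000 (65,536) code points.
  return codePoint >> 16;
}

console.log(planeOf(0x0041));   // 0  (BMP: Latin capital A)
console.log(planeOf(0x1F680));  // 1  (SMP: rocket emoji)
console.log(planeOf(0x20000));  // 2  (SIP: first code point of the plane)
console.log(planeOf(0x10FFFF)); // 16 (last code point of Plane 16)
```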
Tips for getting reliable results
Frequently asked questions
What is a Unicode code point?
A code point is the unique number assigned to every character in the Unicode standard: letters, punctuation, currency symbols, mathematical operators, emoji, ancient scripts, you name it. They're written in hexadecimal with a U+ prefix: the letter A is U+0041, the copyright symbol © is U+00A9, and the rocket emoji 🚀 is U+1F680. The number itself doesn't change across languages, operating systems, or programming environments. That universality is the whole point of Unicode.
Why do emojis have longer code points than regular letters?
The original Unicode design allocated code points from U+0000 to U+FFFF, the Basic Multilingual Plane. That covers most scripts humans use today. When emojis and rare historic characters were added later, they needed to go into Supplementary Planes, which start at U+10000 and go up to U+10FFFF. So a rocket emoji at U+1F680 simply lives higher in the numbering system than the letter A at U+0041. It's purely about when each character was added and which plane it was assigned to.
What is a surrogate pair, and why does it matter in JavaScript?
JavaScript strings are stored in memory as UTF-16, which uses 16-bit code units. A 16-bit number can hold 65,536 values: enough for the Basic Multilingual Plane, but not for Supplementary Plane characters like most emoji. To handle those, UTF-16 uses surrogate pairs: two 16-bit code units that together encode one character. This is why '🚀'.length returns 2 in JavaScript, not 1. The string is one character but two code units. This tool uses the modern ES6 for...of iterator, which understands surrogate pairs and extracts the true single code point (U+1F680) rather than the two halves.
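The arithmetic behind a surrogate pair is short enough to sketch: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. `toSurrogatePair` below is an illustrative helper, not part of any library.

```javascript
// Hypothetical helper: compute the UTF-16 surrogate pair for a supplementary code point.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Only code points above U+FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000;    // a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const rocket = "\u{1F680}";
const [high, low] = toSurrogatePair(0x1F680);

console.log(high.toString(16), low.toString(16)); // d83d de80
console.log(rocket.charCodeAt(0) === high);       // true
console.log(rocket.charCodeAt(1) === low);        // true
console.log(rocket.length);                       // 2
console.log(rocket.codePointAt(0).toString(16));  // 1f680
```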
What is a ZWJ sequence?
ZWJ stands for Zero-Width Joiner (U+200D). It's an invisible character used to combine multiple separate emoji into one visual rendering. The family emoji 👨‍👩‍👧 is actually five separate code points: the man emoji, a ZWJ, the woman emoji, another ZWJ, and the girl emoji. The browser reads that sequence and renders a single combined graphic. Paste it into this tool and you'll see it unpack into each component, which is useful for understanding why string length checks on emoji sequences are so often wrong.
When would a developer actually need these code points?
A few common situations: writing a regex to block or allow a specific Unicode range (like filtering Cyrillic characters with /[\u0400-\u04FF]/); debugging why a character isn't rendering in a custom font (you check whether the font file has a glyph for that specific U+ value); investigating why a string comparison is failing (invisible characters like soft hyphens, zero-width spaces, or directional marks cause subtle bugs); and encoding special characters in HTML entities or CSS content properties.
Are there characters that look identical but have different code points?
Yes, and this is one of the sneakiest bugs in internationalized applications. Homoglyphs are characters from different scripts that look visually identical or nearly identical to the human eye. The Latin letter 'a' (U+0061) and the Cyrillic 'а' (U+0430) are visually indistinguishable in most fonts. Attackers use this to register deceptive domain names or bypass content filters. Pasting suspicious text into this tool will reveal if any characters are not what they appear to be.