Text to Code Points

Paste any text, emoji, or symbol to see the exact Unicode code points behind it, including surrogate pairs, ZWJ sequences, and invisible characters that don't show up in your editor.

What are Unicode code points?

Every character you can type (every letter, punctuation mark, emoji, and symbol) has a unique number in the Unicode standard. That number is its code point, written as U+XXXX in hexadecimal.
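To make the U+XXXX notation concrete, here is a minimal sketch of how a code point could be formatted in JavaScript. The helper name `formatCodePoint` is an assumption for illustration, not part of any standard API; `codePointAt`, `toString(16)`, and `padStart` are built in.

```javascript
// Hypothetical helper: format a numeric code point in U+XXXX notation.
function formatCodePoint(codePoint) {
  // Pad to at least four hex digits; supplementary code points use five or six.
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(formatCodePoint("A".codePointAt(0)));         // U+0041
console.log(formatCodePoint("\u00A9".codePointAt(0)));    // U+00A9
console.log(formatCodePoint("\u{1F680}".codePointAt(0))); // U+1F680
```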

Unicode is maintained by the Unicode Consortium and, as of Unicode 15, defines over 149,000 characters across 161 scripts. Code points range from U+0000 to U+10FFFF, organized into 17 planes of 65,536 code points each.

The first plane, the Basic Multilingual Plane (U+0000–U+FFFF), covers Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, Korean, and most punctuation. Emoji and historic scripts largely live in the Supplementary Planes (U+10000 and above), which is why they behave differently in programming languages that predate them.

Reading the output: a practical example

Paste 👨‍👩‍👧 into the tool above. Instead of one entry, you'll see this sequence unpacked:

Code Point | Character | Name | Notes
U+1F468 | 👨 | Man | Supplementary Plane, emoji
U+200D | (invisible) | Zero-Width Joiner | Invisible combiner
U+1F469 | 👩 | Woman | Supplementary Plane, emoji
U+200D | (invisible) | Zero-Width Joiner | Invisible combiner
U+1F467 | 👧 | Girl | Supplementary Plane, emoji

The browser renders that entire five-part sequence as a single glyph. If you call .length on it in JavaScript, you'll get 8, not 1 and not 5: each of the three emoji takes two UTF-16 code units, and each joiner takes one. Understanding this is essential for correct emoji-aware string handling.
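The unpacking described above can be reproduced in a few lines. This is a sketch of the general technique, not necessarily how the tool itself is implemented; the emoji is written with escapes so the snippet survives copy-paste.

```javascript
// The family emoji: man, ZWJ, woman, ZWJ, girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// Array.from iterates a string by code point, so surrogate pairs stay together.
const codePoints = Array.from(family, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints.join(" ")); // U+1F468 U+200D U+1F469 U+200D U+1F467
console.log(family.length);        // 8 (UTF-16 code units)
console.log(codePoints.length);    // 5 (code points)
```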

When developers actually need this

Knowing a character's code point isn't trivia. It surfaces as a real need in a handful of recurring debugging and security scenarios.

Regex character class ranges

When filtering input by script (blocking Cyrillic spam in a comment field, allowing only CJK characters in a Japanese form), you need the exact hex range for the regex.

/[\u0400-\u04FF]/
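A short sketch of that range in use, next to the script-property alternative available in modern engines. The variable names are illustrative; note that the block range and the script property do not match exactly the same set of characters.

```javascript
// Block range: matches any code point in the Cyrillic block, U+0400 to U+04FF.
const cyrillicBlock = /[\u0400-\u04FF]/;

// Script property: also covers Cyrillic letters outside that block, such as
// the Cyrillic Supplement (U+0500 to U+052F). Requires the `u` flag.
const cyrillicScript = /\p{Script=Cyrillic}/u;

console.log(cyrillicBlock.test("\u043F\u0440\u0438\u0432\u0435\u0442")); // true
console.log(cyrillicBlock.test("hello"));   // false
console.log(cyrillicBlock.test("\u0501"));  // false (outside the block)
console.log(cyrillicScript.test("\u0501")); // true
```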
Font glyph debugging

If a character renders as a box □ or question mark in your app, the font file likely doesn't have a glyph for that code point. Extract the U+ value and check it against the font's character map in a tool like FontForge or the OS character viewer.

Homoglyph / spoofing detection

Two strings can look identical but be completely different characters. The Latin a (U+0061) and Cyrillic а (U+0430) are visually indistinguishable in most fonts. Attackers exploit this in phishing domains. Paste suspicious text here to expose what's actually inside it.
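One rough way to flag this programmatically is to check whether a single word mixes scripts. The sketch below is a heuristic under stated assumptions: `scriptsUsed` is a hypothetical helper, it only knows three scripts, and plenty of legitimate text mixes scripts, so treat a hit as a prompt to look closer rather than proof of spoofing.

```javascript
// Hypothetical helper: report which of a few scripts appear in a string.
function scriptsUsed(text) {
  const scripts = {
    Latin: /\p{Script=Latin}/u,
    Cyrillic: /\p{Script=Cyrillic}/u,
    Greek: /\p{Script=Greek}/u,
  };
  return Object.keys(scripts).filter((name) => scripts[name].test(text));
}

const genuine = "paypal";
const spoofed = "p\u0430ypal"; // second letter is Cyrillic U+0430, not Latin a

console.log(genuine === spoofed);             // false
console.log(scriptsUsed(genuine).join(", ")); // Latin
console.log(scriptsUsed(spoofed).join(", ")); // Latin, Cyrillic
```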

Invisible character bugs

Zero-width spaces (U+200B), soft hyphens (U+00AD), and directional marks (U+200E / U+200F) are copied from web pages, PDFs, and rich text editors without anyone noticing. They break string equality checks and cause subtle parsing failures. This tool makes them visible.
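If you want to catch these in code, one possible approach is to scan for the Unicode "Format" general category (Cf), which includes all of the characters named above. The helper name `findInvisible` is an assumption; also note that Cf contains the zero-width joiner too, so legitimate emoji sequences will be reported.

```javascript
// Hypothetical helper: list format characters (general category Cf) in a string.
function findInvisible(text) {
  const found = [];
  for (const match of text.matchAll(/\p{Cf}/gu)) {
    const hex = match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    // match.index is an offset in UTF-16 code units.
    found.push({ index: match.index, codePoint: "U+" + hex });
  }
  return found;
}

const pasted = "pass\u200Bword"; // contains a zero-width space

console.log(pasted === "password");         // false
console.log(findInvisible(pasted).length);  // 1
console.log(findInvisible(pasted)[0]);      // index 4, code point U+200B
```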

Emoji-aware string length

Enforcing a "140-character" limit on user-generated content? A naive .length call counts most emoji as 2 characters each (or more for ZWJ sequences). Use the code point count from this tool's output to understand what a correct character counter needs to handle.
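As a sketch of what an emoji-aware limit might look like, the snippet below truncates by grapheme cluster using the built-in `Intl.Segmenter`. The helper name `truncateGraphemes` is hypothetical, and whether your limit should count graphemes, code points, or bytes depends on your storage and display requirements.

```javascript
// Hypothetical helper: keep at most `limit` user-perceived characters.
function truncateGraphemes(text, limit) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === limit) break;
    result += segment;
    count += 1;
  }
  return result;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const message = "Hi " + family + "!";

console.log(message.length);                                   // 12 code units
console.log(truncateGraphemes(message, 4) === "Hi " + family); // true
// A naive slice cuts through the first surrogate pair instead:
console.log(message.slice(0, 4).charCodeAt(3).toString(16));   // d83d (lone high surrogate)
```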

HTML entities & CSS content

HTML entities use decimal (&#9731;) or hex (&#x2603;) code points for special characters. CSS content properties use the hex value directly. Knowing the U+ value is the starting point for both.
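Going from a character to those escape forms is mechanical. Here is a minimal sketch, using the snowman (U+2603) as the example; `escapesFor` is a hypothetical helper name. In a stylesheet, the CSS form would appear as `content: "\2603";`.

```javascript
// Hypothetical helper: the three common escape forms for a single character.
function escapesFor(char) {
  const codePoint = char.codePointAt(0);
  const hex = codePoint.toString(16).toUpperCase();
  return {
    htmlDecimal: "&#" + codePoint + ";",
    htmlHex: "&#x" + hex + ";",
    cssContent: "\\" + hex, // a single backslash followed by the hex digits
  };
}

const snowman = escapesFor("\u2603");
console.log(snowman.htmlDecimal); // &#9731;
console.log(snowman.htmlHex);     // &#x2603;
console.log(snowman.cssContent);  // \2603
```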

Why .length lies to you

JavaScript's .length counts UTF-16 code units, not characters. For anything outside the Basic Multilingual Plane, those are not the same number.

> "A".length → 1
Latin letter: 1 code unit, 1 code point ✓
> "©".length → 1
BMP symbol: 1 code unit, 1 code point ✓
> "🚀".length → 2
Supplementary Plane emoji: 2 code units, 1 code point ✗
> "👨‍👩‍👧".length → 8
ZWJ family: 8 code units, 5 code points, 1 visible glyph ✗
> [..."👨‍👩‍👧"].length → 5
ES6 spread: counts code points correctly ✓

For reliable character counting, use [...str].length to count code points, or Intl.Segmenter to count grapheme clusters (user-perceived characters, which treats a ZWJ sequence as one). The code point count shown by this tool corresponds to [...str].length.
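The three ways of counting can be compared side by side. This is a sketch with a hypothetical helper name, `measure`; the grapheme count relies on the runtime's `Intl.Segmenter` support.

```javascript
// Hypothetical helper: compare the three ways of measuring a string.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                          // UTF-16 code units
    codePoints: [...text].length,                    // Unicode code points
    graphemes: [...segmenter.segment(text)].length,  // user-perceived characters
  };
}

const rocket = measure("\u{1F680}");
const family = measure("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}");

console.log(rocket.codeUnits, rocket.codePoints, rocket.graphemes); // 2 1 1
console.log(family.codeUnits, family.codePoints, family.graphemes); // 8 5 1
```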

A quick map of Unicode planes

If your extracted code point is U+10000 or higher (five or six hex digits), it's in a Supplementary Plane, which is why it behaves differently in legacy string handling.

Plane | Range | What lives here
BMP (Plane 0) | U+0000–U+FFFF | Latin, Greek, Cyrillic, Arabic, CJK, most punctuation
SMP (Plane 1) | U+10000–U+1FFFF | Emoji, historic scripts (Linear B, Gothic, Cuneiform)
SIP (Plane 2) | U+20000–U+2FFFF | CJK Extensions B–F, rare Chinese/Japanese/Korean characters
TIP (Plane 3) | U+30000–U+3FFFF | CJK Extensions G–H (extremely rare)
Planes 4–13 | U+40000–U+DFFFF | Unassigned (reserved for future use)
SSP (Plane 14) | U+E0000–U+EFFFF | Language tags and variation selectors
SPUA-A/B (Planes 15–16) | U+F0000–U+10FFFF | Private Use Areas: app-specific glyphs
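Because every plane spans exactly 0x10000 code points, the plane number falls out of a single shift. A minimal sketch, with `planeOf` as a hypothetical helper name:

```javascript
// Hypothetical helper: each plane holds 0x10000 (65,536) code points,
// so the plane number is the code point shifted right by 16 bits.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("Not a valid Unicode code point: " + codePoint);
  }
  return codePoint >> 16;
}

console.log(planeOf(0x0041));   // 0  (BMP)
console.log(planeOf(0x1f680));  // 1  (SMP)
console.log(planeOf(0x20000));  // 2  (SIP)
console.log(planeOf(0xe0100));  // 14 (SSP)
console.log(planeOf(0x10ffff)); // 16 (SPUA-B)
```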

Tips for getting reliable results

1. Watch for normalization differences. The letter é can be represented two ways: as a single precomposed character (U+00E9) or as the letter e (U+0065) followed by a combining accent (U+0301). Both look identical. If two strings aren't comparing equal despite looking the same, Unicode normalization (NFC vs NFD) is likely the cause. Paste both into this tool to see which form you have.
2. Check for invisible characters from copy-paste. Text copied from PDFs, Word documents, or certain websites often includes invisible Unicode characters. A zero-width space (U+200B) or soft hyphen (U+00AD) will not show up in your editor or browser, but will break string comparisons and regex matches. Run suspected text through this tool before trusting it.
3. Skin tone modifiers are separate code points. The thumbs up emoji 👍 followed by a skin tone modifier (U+1F3FB through U+1F3FF) renders as a skin-toned variant. That's two code points: the base emoji and the modifier. This tool will show both, which is helpful when implementing emoji pickers or filtering.
4. Variation selectors change rendering, not identity. Some characters have text and emoji variants. The number sign # followed by U+FE0F (variation selector-16) and U+20E3 (combining enclosing keycap) renders as the keycap emoji #️⃣. The underlying character is the same, but the variation selector tells the renderer to use the emoji style. Pasting emoji-style punctuation here will reveal whether a variation selector is present.
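The normalization tip is the easiest of these to check in code. The sketch below uses the built-in `String.prototype.normalize`; picking NFC as the comparison form is an assumption, and either form works as long as both sides use the same one.

```javascript
const precomposed = "\u00E9"; // one code point: e with acute
const decomposed = "e\u0301"; // two code points: e + combining acute accent

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed);               // false
console.log(sameText(precomposed, decomposed));        // true
console.log([...decomposed.normalize("NFC")].length);  // 1
console.log([...precomposed.normalize("NFD")].length); // 2
```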

Frequently asked questions

What is a Unicode code point?

A code point is the unique number assigned to every character in the Unicode standard: letters, punctuation, currency symbols, mathematical operators, emoji, ancient scripts, you name it. They're written in hexadecimal with a U+ prefix: the letter A is U+0041, the copyright symbol © is U+00A9, and the rocket emoji 🚀 is U+1F680. The number itself doesn't change across languages, operating systems, or programming environments; that universality is the whole point of Unicode.

Why do emojis have longer code points than regular letters?

The original Unicode design allocated code points from U+0000 to U+FFFF, the Basic Multilingual Plane. That covers most scripts humans use today. When emojis and rare historic characters were added later, they needed to go into Supplementary Planes, which start at U+10000 and go up to U+10FFFF. So a rocket emoji at U+1F680 simply lives higher in the numbering system than the letter A at U+0041. It's purely about when each character was added and which plane it was assigned to.

What is a surrogate pair, and why does it matter in JavaScript?

JavaScript exposes strings as sequences of UTF-16 code units, each 16 bits wide. A 16-bit number can hold 65,536 values: enough for the Basic Multilingual Plane, but not for Supplementary Plane characters like most emoji. To handle those, UTF-16 uses surrogate pairs: two 16-bit code units that together encode one character. This is why '🚀'.length returns 2 in JavaScript, not 1. The string is one character but two code units. This tool uses the modern ES6 for...of iterator, which understands surrogate pairs and extracts the true single code point (U+1F680) rather than the two halves.
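The arithmetic behind a surrogate pair is short enough to write out. This sketch follows the standard UTF-16 encoding rule; the helper name `toSurrogatePair` is hypothetical.

```javascript
// Hypothetical helper: split a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("Only U+10000 to U+10FFFF are encoded as surrogate pairs");
  }
  const offset = codePoint - 0x10000;          // a 20-bit value
  const highUnit = 0xd800 + (offset >> 10);    // top 10 bits
  const lowUnit = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [highUnit, lowUnit];
}

const [rocketHigh, rocketLow] = toSurrogatePair(0x1f680);
console.log(rocketHigh.toString(16), rocketLow.toString(16));            // d83d de80
console.log(String.fromCharCode(rocketHigh, rocketLow) === "\u{1F680}"); // true
```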

What is a ZWJ sequence?

ZWJ stands for Zero-Width Joiner (U+200D). It's an invisible character used to combine multiple separate emoji into one visual rendering. The family emoji 👨‍👩‍👧 is actually five separate code points: the man emoji, a ZWJ, the woman emoji, another ZWJ, and the girl emoji. The browser reads that sequence and renders a single combined graphic. Paste it into this tool and you'll see it unpack into each component, which is useful for understanding why string length checks on emoji sequences are so often wrong.
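Going the other direction, you can assemble a ZWJ sequence from its parts with `String.fromCodePoint`. A small sketch; whether the result renders as one glyph depends on the font and platform.

```javascript
const ZWJ = 0x200d;

// Man, ZWJ, woman, ZWJ, girl.
const family = String.fromCodePoint(0x1f468, ZWJ, 0x1f469, ZWJ, 0x1f467);

// Splitting on the joiner recovers the individual emoji.
const members = family.split("\u200D");

console.log([...family].length); // 5 code points
console.log(members.length);     // 3 emoji
console.log(members.map((m) => m.codePointAt(0).toString(16)).join(" ")); // 1f468 1f469 1f467
```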

When would a developer actually need these code points?

A few common situations: writing a regex to block or allow a specific Unicode range (like filtering Cyrillic characters with /[\u0400-\u04FF]/); debugging why a character isn't rendering in a custom font (you check whether the font file has a glyph for that specific U+ value); investigating why a string comparison is failing (invisible characters like soft hyphens, zero-width spaces, or directional marks cause subtle bugs); and encoding special characters in HTML entities or CSS content properties.

Are there characters that look identical but have different code points?

Yes, and this is one of the sneakiest bugs in internationalized applications. Homoglyphs are characters from different scripts that look visually identical or nearly identical to the human eye. The Latin letter 'a' (U+0061) and the Cyrillic 'а' (U+0430) are visually indistinguishable in most fonts. Attackers use this to register deceptive domain names or bypass content filters. Pasting suspicious text into this tool will reveal if any characters are not what they appear to be.
