title: "Understanding emoji"
tags: []
slug: 4d28b2a7
date: 2020-05-09 16:23:24
summary: "What emojis are (characters, not images), encoding basics, why JS string length varies, and practical notes."

I saw a quiz in a chat group: what is '✈️'.length? That stumped me, so I dug in and took notes.
What is emoji?
Emoji (English: emoji; Japanese: 絵文字/えもじ, emoji) are ideograms used in web pages and chat messages.
- Emojis are characters, not images.
- The same emoji looks different on Twitter/iOS/Windows because code points are standard, but rendering depends on platform fonts.
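Since emojis are ordinary characters, their code points can be inspected like any other character's. The following is a minimal sketch; `codePoints` is a hypothetical helper name, not something from the original notes.

```javascript
// Return the code points of a string as "U+XXXX" labels.
// (codePoints is an illustrative helper, not a built-in.)
function codePoints(str) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...str].map(
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(codePoints('A'));         // ['U+0041']
console.log(codePoints('\u26F7'));    // skier: ['U+26F7']
console.log(codePoints('\u{1F602}')); // face with tears of joy: ['U+1F602']
```

The platform font only decides how each of these code points is drawn; the numbers themselves are the same everywhere.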
Character encoding
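Before working out '✈️'.length, here is a quick refresher on character encodings.
The names that come up most often are ASCII (an international encoding), Unicode (an international character set), UTF-8 (an international encoding), GB2312 (a Chinese national standard encoding), and ISO-8859-1 (an international encoding).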
We used to run into encoding issues (archives, web pages, uploads/downloads) more often; less so today as standards converged.
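- ASCII only covers English characters.
- Unicode covers characters and symbols from writing systems around the world and assigns a binary code to each one.
- Unicode is only a character set: it specifies the code for each symbol, but not how that code should be stored.
- UTF-8 is one implementation of Unicode: a variable‑length encoding using 1–4 bytes per code point.
- Unicode code points range from U+0000 to U+10FFFF, and each code point represents one Unicode character. The code points are divided into 17 planes: the first plane (U+0000 to U+FFFF) is the Basic Multilingual Plane (BMP), and the other 16 planes (U+10000 to U+10FFFF) are the Supplementary Planes.
- GB2312 is a Chinese national standard, but almost every site uses UTF-8 today and GB2312 is essentially obsolete.
- ISO-8859-1 is not a Unicode encoding, because it encodes only a subset of symbols rather than all of them; it is also rarely seen today.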
- JavaScript engines internally use UCS‑2/UTF‑16.
- Java strings use UTF-16 internally (a char is one 16-bit code unit).
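The plane layout and the variable-length rules above can be sketched as a few small functions. This is a sketch under the standard UTF-8 and UTF-16 rules; the function names (`planeOf`, `utf8Bytes`, `utf16Units`, `toSurrogates`) are hypothetical helpers introduced here for illustration.

```javascript
// Each plane holds 0x10000 code points, so the plane is the code point divided by 0x10000.
function planeOf(cp) {
  return Math.floor(cp / 0x10000); // 0 = BMP, 1..16 = supplementary planes
}

// Number of bytes UTF-8 needs for one code point.
function utf8Bytes(cp) {
  if (cp <= 0x7f) return 1;
  if (cp <= 0x7ff) return 2;
  if (cp <= 0xffff) return 3;
  return 4;
}

// Number of 16-bit code units UTF-16 needs for one code point.
function utf16Units(cp) {
  return cp <= 0xffff ? 1 : 2;
}

// Split a supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogates(cp) {
  const v = cp - 0x10000;
  return [0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff)];
}

console.log(planeOf(0x26f7), utf8Bytes(0x26f7), utf16Units(0x26f7));    // 0 3 1
console.log(planeOf(0x1f602), utf8Bytes(0x1f602), utf16Units(0x1f602)); // 1 4 2
console.log(toSurrogates(0x1f602).map((u) => u.toString(16)));          // ['d83d', 'de02']
```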
The figure below gives a rough view of how these encodings differ; going forward, UTF should be the main one in use.

Emoji “length”
With the encoding background in place, back to the original question about length.
- In JS, '123'.length === 3; length counts code units, not bytes.
- Code points, bytes, and JS string length are not equivalent.
- Different emojis correspond to different code unit lengths.
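To make the difference between the three measures concrete, here is a small sketch that reports all of them for one string. `measure` is a hypothetical helper; `TextEncoder` is the standard encoding API available in browsers and Node.js and always produces UTF-8.

```javascript
// Report three different "lengths" of the same string.
// (measure is an illustrative helper, not a built-in.)
function measure(str) {
  return {
    codeUnits: str.length,                           // UTF-16 code units (what .length counts)
    codePoints: [...str].length,                     // Unicode code points
    utf8Bytes: new TextEncoder().encode(str).length, // bytes when stored as UTF-8
  };
}

console.log(measure('123'));       // { codeUnits: 3, codePoints: 3, utf8Bytes: 3 }
console.log(measure('\u4E2D'));    // 中: { codeUnits: 1, codePoints: 1, utf8Bytes: 3 }
console.log(measure('\u{1F602}')); // 😂: { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
```

The three numbers only agree for ASCII text, which is why '123'.length looks unsurprising while emoji lengths do not.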
Unicode and emoji
- BMP emojis may be 1 code unit.
- BMP multi‑code‑point sequences may be 2 code units.
- Supplementary plane single code points are typically 2 code units in UTF‑16.
- Multi‑code‑point sequences vary.
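For example:
// Length of each emoji in UTF-16 code units.
// Invisible characters (U+FE0F, U+200D) are written as escapes so they are not lost in copy/paste.
console.log('⛷'.length); // 1
console.log('😂'.length); // 2
console.log('1\uFE0F\u20E3'.length); // 3: digit 1 + variation selector + combining keycap
console.log('👨\u200D👨\u200D👦'.length); // 8: three emojis joined by two zero-width joiners
console.log('👨\u200D👩\u200D👧\u200D👦'.length); // 11: four emojis joined by three zero-width joiners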
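The lengths above become easier to predict once each string is split into its code points. The sketch below does that and also answers the opening quiz; `breakdown` is a hypothetical helper, and the invisible characters are written as escapes on purpose.

```javascript
// List each code point of a string with the number of UTF-16 code units it occupies.
// (breakdown is an illustrative helper, not a built-in.)
function breakdown(str) {
  return [...str].map((ch) => ({
    codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase(),
    codeUnits: ch.length,
  }));
}

// Airplane (U+2708, in the BMP) followed by variation selector-16 (U+FE0F).
const airplane = '\u2708\uFE0F';
console.log(airplane.length);     // 2
console.log(breakdown(airplane)); // U+2708 (1 unit), U+FE0F (1 unit)

// Family: man, woman, girl, boy, joined by zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(family.length);            // 11 = 4 emojis * 2 units + 3 joiners * 1 unit
console.log(breakdown(family).length); // 7 code points
```

So '✈️'.length is 2 when the variation selector is present: one BMP code point plus one selector, each a single code unit.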
Encodings you’ll see
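System locale: run locale in a terminal.
Developers will also notice that IDEs record a file encoding for every source file, and that databases have a database encoding as well.
When a browser loads a web page, the HTML, CSS, and JS each have a corresponding encoding too.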
Terminal text/emoji relies on encoding and fonts. If you see tofu/garbage, your font may lack glyphs — switch fonts.
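Besides missing glyphs, garbage can also come from reading bytes with the wrong encoding, which is why the file, page, and database encodings above matter. A minimal sketch, assuming a runtime whose `TextDecoder` supports the `windows-1252` label (browsers do, and so do Node.js builds with full ICU):

```javascript
// Encode one character as UTF-8, then decode the same bytes two different ways.
const bytes = new TextEncoder().encode('\u00E9'); // é becomes the two bytes 0xC3 0xA9

const asUtf8 = new TextDecoder('utf-8').decode(bytes);          // decoded correctly
const asLatin1 = new TextDecoder('windows-1252').decode(bytes); // each byte read as its own character

console.log(Array.from(bytes, (b) => b.toString(16))); // ['c3', 'a9']
console.log(asUtf8);   // é
console.log(asLatin1); // Ã©
```

The bytes never changed; only the assumed encoding did, which is the usual cause of mojibake in files and web pages.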

Final Thoughts
Even understanding encodings and emoji, it’s still nontrivial to predict code unit length at a glance. 😭

