
title: “Understanding emoji” tags: [] slug: 4d28b2a7 date: 2020-05-09 16:23:24 summary: “What emojis are (characters, not images), encoding basics, why JS string length varies, and practical notes.”

I saw a quiz in a chat group: what is '✈️'.length? That stumped me, so I dug in and took notes.

What is emoji?

Emoji (Japanese: 絵文字/えもじ, emoji) are pictographic symbols used on web pages and in chat.

  • Emojis are characters, not images.
  • The same emoji looks different on Twitter/iOS/Windows because code points are standard, but rendering depends on platform fonts.
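To make "characters, not images" concrete, here is a minimal sketch that round-trips between a code point number and the emoji it identifies. It only uses the standard `String.fromCodePoint` and `codePointAt`; the variable name `emoji` is just for illustration.

```javascript
// Build the character from its Unicode code point (U+1F602, "face with tears of joy").
const emoji = String.fromCodePoint(0x1F602);

console.log(emoji);                              // 😂 (how it looks depends on your font)
console.log(emoji.codePointAt(0).toString(16));  // 1f602
console.log(emoji === '\u{1F602}');              // true: same character, just a different way to write it
```

The number is what the standard fixes; the picture you get for it is up to the platform font.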

Character encoding

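Before working out '✈️'.length, a quick refresher on encodings.

The names that usually come up are ASCII (an international encoding), Unicode (an international character set), UTF-8 (an international encoding), GB2312 (a Chinese national standard encoding), and ISO-8859-1 (an international encoding).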

We used to run into encoding issues (archives, web pages, uploads/downloads) more often; less so today as standards converged.

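  • ASCII only covers English characters.
  • Unicode covers characters and symbols from writing systems around the world and assigns each one a number (its code point).
  • Unicode is only a character set: it defines the number for each symbol, but not how that number should be stored as bytes.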
  • UTF-8 is one implementation of Unicode, a variable‑length encoding using 1–4 bytes per code point.
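  • Unicode code points range from U+0000 to U+10FFFF, and each code point identifies one Unicode character. They are divided into 17 planes: the first plane (U+0000 ~ U+FFFF) is the Basic Multilingual Plane (BMP); the other 16 planes (U+10000 ~ U+10FFFF) are the Supplementary Planes.
  • GB2312 is a Chinese national standard, but almost every site uses UTF-8 today, so GB2312 is largely obsolete on the web.
  • ISO-8859-1 is not a Unicode encoding: it only covers a small subset of symbols, and it is rarely seen today.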
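  • JavaScript strings are sequences of UTF‑16 code units (historically described as UCS‑2); that is what length and indexing operate on.
  • Java's String and char are likewise based on UTF-16 code units.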
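As a rough sketch of the "1–4 bytes per code point" and "17 planes" points above, the snippet below measures UTF-8 size with the standard `TextEncoder` and derives the plane from the code point. The helper names `utf8ByteLength` and `planeOf` are made up for this example.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: number of bytes the string occupies in UTF-8.
function utf8ByteLength(str) {
  return encoder.encode(str).length;
}

// Hypothetical helper: plane of the first code point (0 = BMP, 1..16 = supplementary).
function planeOf(str) {
  return str.codePointAt(0) >> 16;
}

console.log(utf8ByteLength('A'), planeOf('A'));                  // 1 0
console.log(utf8ByteLength('\u00E9'), planeOf('\u00E9'));        // 2 0  (é)
console.log(utf8ByteLength('\u4E2D'), planeOf('\u4E2D'));        // 3 0  (中)
console.log(utf8ByteLength('\u{1F602}'), planeOf('\u{1F602}'));  // 4 1  (😂)
```

Shifting right by 16 works because each plane holds exactly 0x10000 code points.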

The figure below gives a quick view of how these encodings differ; going forward it should mostly be UTF.

Emoji “length”

With encodings covered, back to the length question from the start.

  • In JS, '123'.length === 3: length counts UTF‑16 code units, not bytes.
  • Code points, bytes, and JS string length are not equivalent.
  • Different emojis correspond to different code unit lengths.
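Here is a small sketch that puts the three measurements side by side for the same string. `measure` is a hypothetical helper; it relies only on `length`, string iteration (which walks code points), and `TextEncoder`.

```javascript
// Hypothetical helper: report three different "lengths" of one string.
function measure(str) {
  return {
    codeUnits: str.length,                            // UTF-16 code units (what JS calls length)
    codePoints: [...str].length,                      // string iteration yields whole code points
    utf8Bytes: new TextEncoder().encode(str).length,  // size when stored as UTF-8
  };
}

console.log(measure('123'));        // { codeUnits: 3, codePoints: 3, utf8Bytes: 3 }
console.log(measure('\u4E2D'));     // { codeUnits: 1, codePoints: 1, utf8Bytes: 3 }  (中)
console.log(measure('\u{1F602}'));  // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }  (😂)
```

Only for plain ASCII do all three numbers happen to agree.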

Unicode and emoji

  • BMP emojis may be 1 code unit.
  • BMP multi‑code‑point sequences may be 2 code units.
  • Supplementary plane single code points are typically 2 code units in UTF‑16.
  • Multi‑code‑point sequences vary.
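The reason a supplementary-plane code point takes 2 code units is that UTF-16 splits it into a surrogate pair. A minimal sketch of that arithmetic follows; `toSurrogatePair` is a name I made up, and the constants are the standard UTF-16 ones.

```javascript
// Hypothetical helper: split a supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F602);
console.log(high.toString(16), low.toString(16));            // d83d de02
console.log(String.fromCharCode(high, low) === '\u{1F602}'); // true (😂)
```

BMP code points fit in 16 bits directly, which is why they need only one code unit.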

For example:


// Number of UTF-16 code units in each emoji
console.log('⛷'.length); // 1
console.log('😂'.length); // 2
console.log('1️⃣'.length); // 3
console.log('👨‍👨‍👦'.length); // 8
console.log('👨‍👩‍👧‍👦'.length); // 11
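To see where numbers like these come from, and to answer the opening quiz, the sketch below splits a string into code points and shows how many code units each one takes. It assumes the quiz's airplane is U+2708 followed by the variation selector U+FE0F (the usual emoji form); I write the sequences with escapes so the invisible characters are explicit. `breakdown` is a hypothetical helper.

```javascript
// Hypothetical helper: list each code point and how many UTF-16 code units it needs.
function breakdown(str) {
  return [...str].map((ch) => ({
    codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    codeUnits: ch.length,
  }));
}

// Assumption: airplane (U+2708) + variation selector-16 (U+FE0F, requests emoji presentation).
const airplane = '\u2708\uFE0F';
console.log(airplane.length);       // 2
console.log(breakdown(airplane));   // U+2708 and U+FE0F, one code unit each

// man + zero-width joiner + man + zero-width joiner + boy
const family = '\u{1F468}\u200D\u{1F468}\u200D\u{1F466}';
console.log(family.length);         // 8 = 2 + 1 + 2 + 1 + 2
```

So under that assumption the quiz answer is 2: one BMP symbol plus one invisible selector.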

Encodings you’ll see

  • System locale: run locale in a terminal.

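  • If you write code, you will notice that the IDE and each source file have a file encoding, and the database has its own database encoding too.

  • Web pages viewed in a browser (HTML, CSS, JS) each have a corresponding encoding as well.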

  • Terminal text/emoji relies on both encoding and fonts. If you see tofu boxes or garbage, your font may lack the glyphs; try switching fonts.
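Garbage text usually means bytes written in one encoding were read back in another. A minimal Node.js-only sketch of that mismatch (it uses Node's global `Buffer`, which browsers do not have; the sample word is my own):

```javascript
// Node.js only: Buffer is a Node global.
// "café" with é written as the escape U+00E9, encoded to UTF-8 bytes.
const bytes = Buffer.from('caf\u00E9', 'utf8');

console.log(bytes.length);              // 5: the é takes two bytes in UTF-8
console.log(bytes.toString('utf8'));    // café
console.log(bytes.toString('latin1'));  // cafÃ© (each UTF-8 byte misread as one Latin-1 character)
```

Missing glyphs (tofu) are a different problem: there the decoding is right, but the font has nothing to draw.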

Final Thoughts

Even understanding encodings and emoji, it’s still nontrivial to predict code unit length at a glance. 😭


Alan H
Authors
Developer and gadget enthusiast who likes tinkering, sharing, and open source.