Typing emoji with a Pico keypad

Sunday, 22 January 2023

I got myself a Pimoroni RGB keypad, a keypad with 16 coloured buttons controlled by a Raspberry Pi Pico. So the first thing I wanted to do was code it to output emoji, because I am a very professional person. However, this was not as simple a task as I had hoped.

Raspberry Pico

A Pico is a board with the Raspberry Pi Foundation's own RP2040 microcontroller (the big chip) and 2MB of flash memory (the square chip to the right of it), which is partitioned into controller firmware and mass storage. The former can run MicroPython (or its derivative CircuitPython), a bare-metal Python interpreter, with the code it runs stored in the latter. It is less powerful than a Raspberry Pi, which runs a full OS, but it is smaller and in most applications you don't need an OS. The ESP32 works similarly to the Pico, whereas on an Arduino you don't generally touch the bootloader and upload Arduino-flavoured C, and on an STM32 Nucleo you go full-on bare-metal coding. A Pico is simple but not hard-core to control, which is nice.

A funny thing about the Pico is that it has some Rubber Ducky traits to it: one flashes the firmware, albeit by pressing a boot selection button as opposed to cracking open a USB memory stick and shorting two pins; the main memory appears as a regular USB memory stick; and with the hid module it is also seen as a keyboard...

One detail to note is that the Python is a slimmed-down version: several standard library modules are missing, packages are copied into the lib folder in lieu of installation, and so forth.

Pimoroni RGB Keypad

The Pimoroni RGB Keypad is a 4x4 matrix of buttons with RGB LEDs. The buttons are read via an I2C I/O expander, and the LEDs are APA102s driven over SPI. The Python library controlling the keypad is pmk, while adafruit_hid (HID is Human Interface Device) lets the Pico pose as a keyboard.

The problems

There are two issues to outputting emoji:

  • emoji are not part of the Basic Multilingual Plane (BMP) of Unicode,
  • Mac does not support Unicode input on a standard GB/US keyboard layout.

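When you type a character, the keyboard sends a code to the computer, which then looks up the character in a table. This code is neither ASCII nor Unicode: it is a USB HID usage ID that identifies the key, and the keyboard layout chosen on the computer decides which character it becomes. For example, the code for the letter a is 0x61 in ASCII, but on a US keyboard layout the key is sent as 0x04.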
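As a small illustration of that mismatch, the sketch below maps lowercase letters to their HID usage IDs, which run consecutively from 0x04 for a to 0x1D for z. The helper name hid_usage_id is made up for this example; on the Pico the adafruit_hid Keycode class holds these constants.

```python
def hid_usage_id(letter: str) -> int:
    """Return the USB HID keyboard usage ID of a lowercase letter (hypothetical helper)."""
    if len(letter) != 1 or not "a" <= letter <= "z":
        raise ValueError("expected a single letter a-z")
    # the usage IDs of the letters are consecutive: a=0x04 ... z=0x1D
    return 0x04 + ord(letter) - ord("a")


for letter in "az":
    # ASCII code on the left, key code on the right
    print(letter, hex(ord(letter)), hex(hid_usage_id(letter)))
```

The two columns never line up, which is why the Pico cannot simply send the character itself down the wire.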

The notation 0xnn is hexadecimal, where 0x is the prefix and nn is a hex number between 00 and FF. Were I to want to write a binary number I would use 0bnn, where nn is a binary number between 00 and 11. Hexadecimal, base 16, is easier to read than binary, base 2, and each hex digit encodes 4 bits (a nibble), so a byte, being 8 bits, is 2 hex digits. For Unicode specifically, there is also the notation U+nnnn, where nnnn is a hex number between 0000 and FFFF for the BMP (code points beyond it take five or six digits, up to U+10FFFF). In Python a four-digit code point is expressed as \unnnn in a string, and a longer one as \Unnnnnnnn, padded to eight digits.
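A quick sketch of these notations in plain Python, nothing Pico-specific. Note that \u takes exactly four hex digits, while \U takes exactly eight and is the one needed for anything beyond U+FFFF.

```python
# the same number written three ways
assert 0xE1 == 0b11100001 == 225

# one hex digit is one nibble, so a byte is always two hex digits
byte_as_hex = f"{225:02x}"
byte_as_bits = f"{225:08b}"

# \u takes exactly 4 hex digits, \U takes exactly 8
a_acute = "\u00e1"
smiley = "\U0001f60a"

print(byte_as_hex, byte_as_bits, a_acute, smiley)
```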

The BMP is the first 65,536 code points of Unicode, which means that 2 bytes are enough to represent one of its characters. The rest of Unicode is called the Supplementary Planes, and their characters take 4 bytes in UTF-16.
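Since each plane holds 65,536 code points, the plane of a character is simply its code point shifted right by 16 bits. A minimal sketch (the function name plane is my own choice):

```python
def plane(char: str) -> int:
    """Index of the Unicode plane holding the character; 0 is the BMP."""
    return ord(char) >> 16


for char in ("\u00e1", "\U0001f60a", "\U0003106c"):
    where = "BMP" if plane(char) == 0 else "supplementary"
    print(f"U+{ord(char):04X} is in plane {plane(char)} ({where})")
```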

Unicode keyboard on a Mac

Out of the box, macOS supports only the keys on a standard US keyboard. On a Windows machine AltGr+number will output Unicode characters, but this is not the case with the default Mac keyboard layout. Instead, one needs to add an alternative keyboard layout to the Mac in System Preferences > Keyboard > Input Sources > Other > Unicode Hex Input. One can switch between keyboard layouts with control+option+space. Once this is done one can type ⌥ + four Unicode hex digits. For example, alt + 00E1 gives the letter á, an acute-accented a.
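To make the keystrokes concrete, here is a small sketch that spells out which keys have to be tapped while ⌥ is held. The function name hex_to_key_names is made up for this illustration; the key names follow the attribute names of the adafruit_hid Keycode class, where the digits are spelled out as words.

```python
DIGIT_NAMES = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
               "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]


def hex_to_key_names(hexed: str) -> list:
    """Translate a string of hex digits into key names (hypothetical helper)."""
    names = []
    for digit in hexed.lower():
        if digit not in "0123456789abcdef":
            raise ValueError(f"not a hex digit: {digit!r}")
        # digits are words in Keycode, letters are plain capitals
        names.append(DIGIT_NAMES[int(digit)] if digit.isdigit() else digit.upper())
    return names


print(hex_to_key_names(f"{ord('á'):04x}"))  # ['ZERO', 'ZERO', 'E', 'ONE']
```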

In the regular layout alt + e is the dead key ´, which waits for the next letter to be pressed and modifies it. In Unicode there are actually two ways to write á: the other is the letter a followed by the combining acute accent ´ at U+0301.
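The two spellings look identical but are different strings. A short sketch with the standard unicodedata module shows how normalisation converts one into the other:

```python
import unicodedata

precomposed = "\u00e1"  # á as a single code point
combined = "a\u0301"    # a followed by the combining acute accent

# the strings differ, even though they render the same
print(precomposed == combined, len(precomposed), len(combined))

# NFC composes, NFD decomposes
nfc = unicodedata.normalize("NFC", combined)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == combined)
```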

On Windows the AltGr key is combined with a decimal number (not hex), so this step is not needed.

Unicode digression

The Basic Multilingual Plane covers a fair number of characters, but not all of them. Not all CJK (Chinese, Japanese, Korean) characters are in the BMP, only some 27,000 of them. That is still plenty: Japanese school children learn 2,136 kanji. There are differences between the Japanese, Chinese and Korean characters. For example, the traditional Kangxi character for an East-Asian dragon is 龍 (U+9F8D), while its descendants are the Japanese character 竜 (U+7ADC) and the simplified Chinese character 龙 (U+9F99). All of these fit in 2 bytes. The most complicated character in Japanese (a joke like hippopotomonstrosesquippedaliophobia) is 𱁬 (U+3106C), which needs 4 bytes in UTF-16 as it is not in the BMP; however, the mad character for the snaking flight-movement of an East-Asian dragon (wingless) is 龘 (U+9F98), which is in the BMP. Korean script, Hangul, is syllabic and dragon happens to be a single 2-byte character, 용 (U+C6A9), but in one of the Japanese syllabic scripts (hiragana) it's りゅう, which is 3 characters. A western dragon in Japanese is written in the other syllabic script (katakana) as ドラゴン, so 2*4=8 bytes are needed, while the word 'dragon' is 6 ASCII characters/bytes. So stroke count does not correlate with memory footprint!
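These footprints are easy to check. The sketch below counts characters and encoded bytes for the dragons mentioned above; the helper name sizes is my own, and the byte counts are for UTF-16 without a byte-order mark and for UTF-8.

```python
def sizes(word: str) -> tuple:
    """Return (characters, UTF-16 bytes, UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-16-le")), len(word.encode("utf-8"))


dragons = {
    "traditional": "\u9f8d",
    "hangul": "\uc6a9",
    "hiragana": "りゅう",
    "katakana": "ドラゴン",
    "taito": "\U0003106c",
    "english": "dragon",
}

for label, word in dragons.items():
    print(label, word, sizes(word))
```

Note that 'dragon' is only 6 bytes in ASCII or UTF-8; in UTF-16 every one of its letters takes 2 bytes.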

In summary, 65,536 is a crazy amount of characters. In Unicode 15.0 there are 149,186 characters.

Funny factoid: neither Tolkien's Elvish script (Tengwar) nor the Klingon script (pIqaD in tlhIngan Hol) is in Unicode: they were in the former ConScript Unicode Registry area, but now emoji have taken the place of Tengwar, while the warriors of Qo'noS are holding out.

Why is this important? When typing, copy-pasting or saving a non-BMP character you often get weird gibberish. Say 😊 (U+1F60A) will become ὠ (U+1F60) (curiously 'uh?' is both my reaction and the sound of an aspirated omega). This is because only the first 2 bytes, i.e. four hex digits, are being read. This happens with the keypad.
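One plausible way to reproduce that gibberish, assuming the receiving side consumes only four hex digits of the code point and drops the rest:

```python
smiley = "\U0001f60a"                      # the intended emoji
hex_digits = f"{ord(smiley):x}"            # five hex digits, one too many
truncated = chr(int(hex_digits[:4], 16))   # only the first four are consumed

print(hex_digits, "->", hex_digits[:4], "->", truncated)
```

The four digits that survive, 1f60, happen to be the code point of the Greek omega with a breathing mark.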

Surrogate pairs

To circumvent this one can encode them as surrogate pairs, starting with a high surrogate. Wiktionary describes the latter as: 'A code point in the range U+D800 through U+DBFF (the High Surrogates and High Private Use Surrogates blocks), used in UTF-16 to encode the high 10 bits of the 20-bit offset above U+FFFF of the code point belonging to a supplementary character.'

That verbiage basically tells us it's an uninteresting technical hack, wherein the surrogate character � acts as a kind of modifier for the next 2 bytes. So 😀 (U+1F600) becomes � (U+D83D) � (U+DE00).

Herein I am using U+FFFD for a placeholder, as this glyph may appear when displaying an actual encoding error and is different than □ (U+25A1), which is a missing representation in the font used.
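The definition quoted above boils down to a subtraction and a bit split. Here is a minimal sketch of that arithmetic in plain Python (the function name to_surrogates is my own):

```python
def to_surrogates(char: str) -> tuple:
    """Split a supplementary character into its (high, low) surrogate code points."""
    code_point = ord(char)
    if code_point <= 0xFFFF:
        raise ValueError("BMP characters need no surrogates")
    offset = code_point - 0x10000      # the 20-bit offset above U+FFFF
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates("\U0001f600")
print(f"U+{high:04X} U+{low:04X}")  # U+D83D U+DE00
```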

To convert a non-BMP character to a surrogate pair one can use the utf-16 encoding and str.encode. As a warm-up, let's look at the BMP character á (U+00E1):

>>> ord('á')
225
>>> hex(ord('á'))
'0xe1'
>>> 'á'  #: string
'á'
>>> 'á'.encode('utf-8')  #: bytes
b'\xc3\xa1'
>>> 'á'.encode('utf-16')  #: bytes
b'\xff\xfe\xe1\x00'
>>> 'á'.encode('utf-16-le') #: bytes
b'\xe1\x00'

'LE' stands for little-endian, which is a byte order named after a massive pointless dispute the Lilliputians have in Gulliver's Travels over which end of an egg should be cracked; herein the egg is a multi-byte number and the question is whether its least significant byte goes first. For more on the pettiness of the discussion of how to read a byte see this. x86 and ARM machines are little-endian, although ARM can do both. And there you thought this Unicode discussion could not get any more pedantic.

Were we to have more characters in the decoded string, the characters would still run left to right in the utf-16 and utf-16-le encodings; only the bytes within each 2-byte unit are swapped. Okay, there's the right-to-left mark, but let's not get into that.
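A tiny sketch of both points in plain Python: the byte order within one 16-bit unit flips between little- and big-endian, while the order of the characters in a string stays put.

```python
value = 0xD83D

little = value.to_bytes(2, "little")  # least significant byte first
big = value.to_bytes(2, "big")        # most significant byte first
print(little.hex(), big.hex())        # 3dd8 d83d

# the characters keep their order; only the bytes inside each unit swap
le_text = "ab".encode("utf-16-le")
be_text = "ab".encode("utf-16-be")
print(le_text, be_text)
```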

So now back to our non-BMP character:

>>> ord('😊')
128522
>>> hex(ord('😊'))
'0x1f60a'
>>> '😊'.encode('utf-16')
b'\xff\xfe=\xd8\n\xde'
>>> '😊'.encode('utf-16-le')
b'\x3d\xd8\x0a\xde'  # or  b'=\xd8\n\xde'

So in the interest of sanity let's assume magic and skip ahead to converting the last one to a surrogate pair:

>>> le16_emoji: bytes = '😊'.encode('utf-16-le')
>>> len(le16_emoji)
4
>>> f"U+{int.from_bytes(le16_emoji[:2], 'little'):0>4x} U+{int.from_bytes(le16_emoji[2:], 'little'):0>4x}"
'U+d83d U+de0a'

Typing ⌥ + d83dde0a in the Unicode Hex Input layout will give you 😊.
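As a sanity check before flashing anything, the hex sequence can be decoded back into the emoji on a regular machine. A minimal sketch (the function name hexseq2emoji is my own); since the hex digits are written most significant byte first, the big-endian codec is the one that reads them back:

```python
def hexseq2emoji(hexsequence: str) -> str:
    """Decode a string of UTF-16 hex digits, e.g. 'd83dde0a', back into text."""
    return bytes.fromhex(hexsequence).decode("utf-16-be")


print(hexseq2emoji("d83dde0a"))  # 😊
```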

Colours are also stored as RGB hex values, so #ff0000 is red, #00ff00 is green, #0000ff is blue. One byte per channel. Colour theory is a whole other fun numerical mad-hatter tea party and is touched upon in a past post, ggplot colours in Python.
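Splitting such a hex colour into its three channels is a matter of slicing two hex digits at a time. A minimal sketch in plain Python (the function name hex_to_rgb is my own):

```python
def hex_to_rgb(color_hex: str) -> tuple:
    """Convert '#rrggbb' into a tuple of three integers between 0 and 255."""
    digits = color_hex.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {color_hex!r}")
    # two hex digits, i.e. one byte, per channel
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#ff0000"), hex_to_rgb("#A020F0"))
```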

So on a real machine one can pre-make the surrogate pairs and store them in a lookup table. Doing it on the fly on the Pico does not work (I am unsure why), but here is a simple script to build the table:

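# define what is wanted
emoji_settings = [(0, '😀', 'green'),
                  (1, '🤩', 'yellow'),
                  (2, '🤣', 'blue'),
                  (3, '😭', 'red'),
                  # new row
                  (4, '👾', 'purple'),
                  (5, '🪲', 'lime'),
                  (6, '🤖', 'cerulean'),
                  # 7
                  (8, '🤦', 'coral'),
                  (9, '🤷', 'sage'),
                  ]

colors = dict(green='#00FF00', yellow='#FFFF00', blue='#0000FF',
              red='#FF0000', purple='#A020F0', lime='#32CD32',
              cerulean='#2a52be', coral='#FF7F50', sage='#B2AC88')

def emoji2hexseq(emoji: str) -> str:
    """Hex digits of the UTF-16 surrogate pair of a non-BMP character"""
    encoded = emoji.encode('utf-16-le')
    return f"{int.from_bytes(encoded[:2], 'little'):0>4x}"+\
           f"{int.from_bytes(encoded[2:], 'little'):0>4x}"

# one dictionary per key, ready to paste into code.py
rows = []
for i, emoji, color_name in emoji_settings:
    rows.append(dict(index=i,
                     emoji=emoji,
                     color_name=color_name,
                     color_hex=colors[color_name],
                     hexsequence=emoji2hexseq(emoji)))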

Once that is done, I load the following in my code.py script on the pico:

# define settings
emoji_settings = [{'index': 0,
                  'emoji': '😀',
                  'color_name': 'green',
                  'color_hex': '#00FF00',
                  'hexsequence': 'd83dde00'},
                 {'index': 1,
                  'emoji': '🤩',
                  'color_name': 'yellow',
                  'color_hex': '#FFFF00',
                  'hexsequence': 'd83edd29'},
                 {'index': 2,
                  'emoji': '🤣',
                  'color_name': 'blue',
                  'color_hex': '#0000FF',
                  'hexsequence': 'd83edd23'},
                 {'index': 3,
                  'emoji': '😭',
                  'color_name': 'red',
                  'color_hex': '#FF0000',
                  'hexsequence': 'd83dde2d'},
                 {'index': 4,
                  'emoji': '👾',
                  'color_name': 'purple',
                  'color_hex': '#A020F0',
                  'hexsequence': 'd83ddc7e'},
                 {'index': 5,
                  'emoji': '🪲',
                  'color_name': 'lime',
                  'color_hex': '#32CD32',
                  'hexsequence': 'd83edeb2'},
                 {'index': 6,
                  'emoji': '🤖',
                  'color_name': 'cerulean',
                  'color_hex': '#2a52be',
                  'hexsequence': 'd83edd16'},
                 {'index': 8,
                  'emoji': '🤦',
                  'color_name': 'coral',
                  'color_hex': '#FF7F50',
                  'hexsequence': 'd83edd26'},
                 {'index': 9,
                  'emoji': '🤷',
                  'color_name': 'sage',
                  'color_hex': '#B2AC88',
                  'hexsequence': 'd83edd37'}]

import usb_hid

# from circuitpython_typing import List
from adafruit_hid.keyboard import Keyboard
from adafruit_hid.keycode import Keycode
from pmk import PMK, Key, hsv_to_rgb
from pmk.platform.rgbkeypadbase import (
    RGBKeypadBase as Hardware,
)  # for Pico RGB Keypad Base

# this is not needed, as I am circumventing this
from adafruit_hid.keyboard_layout_base import KeyboardLayoutBase
import time

# import logging
debug = print
# debug = lambda *args, **kwargs: None

pmk = PMK(Hardware())
keys: list = pmk.keys  #: List[Key]
keyboard = Keyboard(usb_hid.devices)

def switch_keyboard():
    """Switch the keyboard layout
    This assumes only one real layout and one unicode layout.
    """
    keyboard.press(Keycode.CONTROL, Keycode.OPTION, Keycode.SPACEBAR)
    keyboard.release_all()  # let go, or the combination stays held down
    # time.sleep(0.1)
    debug("Switched keyboard")

def type_letter(letter: str) -> int:
    """The numbers of the Keycode enum are words"""
    if letter.isdigit():
        letter = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
                  "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"][int(letter)]
    code = getattr(Keycode, letter.upper())
    return code

def type_unicode(char: str):
    """The high surrogate conversion does not work in the pico"""
    if ord(char) <= 0xFFFF:
        hexed = f"{ord(char):0>4x}"
    else:
        debug("Non-basic multilingual plane character... High surrogate")
        hexed = (
            f"{int.from_bytes(char.encode('utf-16-le')[:2], 'little'):0>4x}"
            + f"{int.from_bytes(char.encode('utf-16-le')[2:], 'little'):0>4x}"
        )
    debug(f"{char} -> 0x{hexed}")
    type_unicode_sequence(hexed)

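def type_unicode_sequence(hexed: str):
    switch_keyboard()  # to the Unicode Hex Input layout
    keyboard.press(Keycode.OPTION)  # held while the hex digits are typed
    for letter in hexed:
        code = type_letter(letter)
        keyboard.press(code)
        keyboard.release(code)
        # time.sleep(0.05)
    keyboard.release_all()
    switch_keyboard()  # back to the regular layout

def test():
    pass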

# ---------------------------------------------------------------------------
# ## Set colours

k2c = {row['index']: row['color_hex'] for row in emoji_settings}

def set_color(key: Key):
    """
    Set the colours of the keys as Tuple[int, int, int],
    whereas for sanity they are rgb hexes.
    Uses ``k2c``, which is a dict of key index to color hex derived from ``emoji_settings``
    """
    color = k2c.get(_ROTATED[key.number], None)
    rgb = (
        [int(color[1 + i : 3 + i], 16) for i in range(0, 6, 2)] if color else (0, 0, 0)
    )
    key.set_led(*rgb)

# ---------------------------------------------------------------------------
# ## Set actions

k2e = {row['index']: row['hexsequence'] for row in emoji_settings}

key: Key
for key in keys:
    set_color(key)

    @pmk.on_press(key)
    def press_handler(key: Key):
        debug(f"pressed #{key.number}")
        key.set_led(0, 0, 0)
        hexed = k2e.get(_ROTATED[key.number], None)
        if hexed:
            type_unicode_sequence(hexed)

    @pmk.on_release(key)
    def release_handler(key: Key):
        debug(f"released #{key.number}")
        set_color(key)

debug("Loaded successfully")

# ---------------------------------------------------------------------------
while True:
    pmk.update()

Three things are painful in that snippet.

  1. There is no type hinting via the typing module. CircuitPython does not have the typing module in its standard library; there is a library for it, but it does not work as expected.

  2. There is no logging module, hence the debug function, which can be print or lambda *args, **kwargs: None.

  3. My use of British spelling for comments and docstrings, while American spelling in code. There is no PEP order from Guido banning British spelling, but due to dependencies etc. I find it easier to use American spelling in code.

About the print business, when plugged in the stdout is sent via the serial connection (USB). On a Unix machine this will be /dev/tty.usbmodem* or /dev/ttyACM* or /dev/ttyUSB*. The tty stands for teletype, which is a device that can be used to send and receive text, also called a controlling terminal, which is a different thing than a terminal emulator. The mu editor can be used to view the output (serial button), but /dev/tty.usbmodem14101 can also be used in a Jupyter notebook via:

!screen /dev/tty.usbmodem14101

I will admit that it is a bit crude, but it works... A better solution would be using cell magic, ipython_widgets and threading. Another time: I need to figure out what emoji to add to my keypad...
