July | 2010 | not Nuke's blog

While in the employment of PSU, I was charged with implementing an online license system. Part of that implementation involved the user entering in data that was printed, either in a textbook or from a webpage, and at some point we were using an SHA-1 digest, which meant that this involved 160 bits, or 20 bytes, of data.

From http://en.wikipedia.org/wiki/SHA-1, we have the SHA1 hash of “The quick brown fox jumps over the lazy dog” represented as 2fd4e1c6 7a2d28fc ed849ee1 bb76e739 1b93eb12, which, as it’s using hex which uses 4 bits per digit, is 40 keystrokes.

While I didn’t really consider this option at the time, it turns out that using A-Z, a-z, 0-9, and two more characters, we could get 6-bits per keystroke for 27 characters, but that would involve a lot of confusion and hitting the Shift key a lot, which if you count as a keystroke, would bring up the keystroke count to (27*(1+(26/64))=almost 38 actual keystrokes, if you only needed the Shift key for the uppercase alphabet. That number could be decreased a little by removing some of the uppercase alphabet and using more non-alphanumeric symbols, but it also would be slightly less efficient to code and is a lot less elegant.

The solution was to use a 5-bit alphabet that was not case-sensitive. Using A-Z and 0-9 gives us 36 characters, and for a 5-bit alphabet we would only need 32 different characters. Using 5 bits per keystroke, we would also have 32 characters, a 20% improvement over the original 40. My solution was to drop the numbers 0 and 1, and the letters o and i, and to always display these codes that the user needed to enter in uppercase letters. None of these 32 characters look so similar that they would be mistyped, and they are all available on American and European keyboard layouts: 23456789ABCDEFGHJKLMNPQRSTUVWXYZ. Furthermore, two of the blocks of characters appear in the ASCII table in 3-bit blocks: 2-9 and A-H, making it easy and fast to decode using one bit test, one integer comparison, and one integer subtraction per 5-bit digit, plus whatever bit shifting would be necessary to line it up as part of a larger value.

Representing the two most significant bits by column, and the 3 least significant bits by row, we get the following, color-coded for each grouping of characters that are adjacent in both ASCII and my 5-bit encoding:

5-bit Code
	msb=0		msb=1
	00	01	10	11
000	2	A	J	S
001	3	B	K	T
010	4	C	L	U
011	5	D	M	V
100	6	E	N	W
101	7	F	P	X
110	8	G	Q	Y
111	9	H	R	Z

The fact that my initials are the first characters each of the letter-based columns is purely coincidental.

Example, using the previous sample data:

2fd4e1c67a 2d28fced84 9ee1bb76e7 391b93eb12 (grouped as 4 40-bit groupings)

2fd4e1c67a = 0010 1111 1101 0100 1110 0001 1100 0110 0111 1010 (grouped in 4-bits)

00101 11111 01010 01110 00011 10001 10011 11010 = 7ZCG5KMU(grouped in 5-bits)

2d28fced84 = 00101 10100 10100 01111 11001 11011 01100 00100 = 7NNHTVE6

9ee1bb76e7 = 10011 11011 10000 11011 10110 11101 10111 00111 = MVJVQXR9

391b93eb12 = 00111 00100 01101 11001 00111 11010 11000 10010 = 96FT9USL

So, instead of

2fd4e 1c67a 2d28f ced84 9ee1b b76e7 391b9 3eb12

we have

7ZCG 5KMU7NNH TVE6 MVJV QXR9 96FT 9USL

with no confusion over the “is it an 0 or an O, a 1 or an l or an I?” issue, fewer keystrokes than some other systems (hex),nearly maximized data per keystroke.

not Nuke's blog

Where all sorts of things happen!

Monthly Archives: July 2010

5-bit code