This is a small proof-of-concept implementation of a phonetic representation for the SHA1 strings used in Monotone (though it may be useful in other contexts where small-to-medium-sized chunks of binary data need to be communicated, remembered, and so on). It is called "BibbleBabble", because I was reading about "BubbleBabble" and, well, I'm not sure why it's called BibbleBabble, actually.

The basic technique is to encode each 16-bit hex quad as a 5-letter nonsense word in Latin orthography. The universe of words is chosen to maximize phonetic simplicity, with an eye towards making it maximally useful across language populations, maximally unambiguous when pronounced, simple to implement, and reasonably easy to read, type, and remember.

Oh, and it should be possible to do completion; a proper prefix of a BibbleBabble representation corresponds to a proper prefix of a binary/hex representation, modulo partial bits at the end.

Examples
========

3724d40d1200ac2103662df4cc205672adf800b2
  kafun-imama-dogef-akuma-bemos-heban-ideme-mumel-amege-bahil
fd4bfebcc4ee56b6cab877dba637aec7261604d3
  upeho-useni-esafu-munus-ibeke-suvam-adiwe-amufo-gikos-bifum
f17240a7ed842e9244e6ca5982e4fb9b13e6f640
  uboba-kumot-ovonu-hehal-lelus-ibabe-tuvon-unafe-dudes-uhana
4aa922f44c39d1ff62e3ecb09f2ede36eec3c606
  luduk-gazon-luvuk-ikizu-pafum-ovani-zisos-obipe-owofo-esufu

Contents of this directory
==========================

README - this file.
bibblebabble.py - a Python module implementing the BibbleBabble encoding.
bibblefilter.py - a Python program (using bibblebabble.py) that copies stdin to stdout, converting any hex strings it sees to BibbleBabble. Seems to have some bugs; I haven't investigated fully. Try piping monotone through it, though, to get an idea of what it looks like.
bibble-words - all words from my /usr/share/dict/words that are valid BibbleBabble words. (Actually, "uvula" isn't, because it would correspond to a number > 65535.) Makes for amusing reading.
The fact that "semen", "penis", and "nudes" all appear is a bit worrisome, but their frequency of occurrence is rather low (once in 65536 quads, in each case...), and probably there are words in other languages that would be just as objectionable; trying to eliminate them all is hopeless.

Design and Implementation
=========================

Each 16-bit hex quad corresponds to a unique 5-letter quasi-word; longer strings of hex (like full 160-bit SHA1 hashes) are split into quads, each quad is converted to BibbleBabble, and the resulting words are joined by hyphens.

The 5-letter quasi-words each follow one of two templates: CVCVC or VCVCV (where C=consonant and V=vowel). The vowels are always one of "aeiou". The consonants fall into three categories:

  word-initial: one of "bdfghklmnpstvwz"
  intervocalic: one of "bdfghklmnpsvwz"
  word-final:   one of "fklmnpst"

Taking the letters in the order given, we can interpret any CVCVC or VCVCV word as a number in mixed base, with the first letter as the most significant digit; this is exactly what we do. CVCVC words represent the corresponding number when interpreted in this way; VCVCV words represent the corresponding number + 42000. (There are 15*5*14*5*8 = 42000 CVCVC words available -- the effect is to first map the CVCVC words to the first 42000 numbers, and then append the VCVCV words after that.)

There are a number of competing design constraints that inform these choices. The word structures are chosen to be as simple as possible to maximize cross-linguistic usability -- in particular, there are no diphthongs, no consonant clusters, no geminates, the syllable structures are simple, and the consonants are chosen in a principled way to maximize articulatory distinctiveness. (For instance, there is only one glide, only one liquid, etc.) In addition, "t" is eliminated from the intervocalic set because in English intervocalic "t" and "d" are both reduced to a flap and become indistinguishable; "d" is retained because flaps sound more like d's than t's.
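
To make the mixed-base mapping described above concrete, here is a minimal sketch in Python. It is not the code from bibblebabble.py; the alphabets and the 42000 offset come from this README, while every function and variable name below is made up for illustration. It assumes the input hex string is a whole number of quads.

```python
# Illustrative sketch only; names are hypothetical, not taken from bibblebabble.py.
VOWELS = "aeiou"
INITIAL = "bdfghklmnpstvwz"  # word-initial consonants (15)
MEDIAL = "bdfghklmnpsvwz"    # intervocalic consonants (14)
FINAL = "fklmnpst"           # word-final consonants (8)

# Each template lists the alphabet for every letter position, most significant first.
CVCVC = (INITIAL, VOWELS, MEDIAL, VOWELS, FINAL)
VCVCV = (VOWELS, MEDIAL, VOWELS, MEDIAL, VOWELS)
CVCVC_COUNT = 15 * 5 * 14 * 5 * 8  # 42000 CVCVC words


def _to_word(n, template):
    """Write n as mixed-base digits, peeling off the least significant first."""
    letters = []
    for alphabet in reversed(template):
        n, digit = divmod(n, len(alphabet))
        letters.append(alphabet[digit])
    return "".join(reversed(letters))


def _from_word(word, template):
    """Read a word as a mixed-base number; raises ValueError on a bad letter."""
    n = 0
    for letter, alphabet in zip(word, template):
        n = n * len(alphabet) + alphabet.index(letter)
    return n


def encode_quad(n):
    """Map a 16-bit number to its 5-letter word."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("quad out of range: %r" % (n,))
    if n < CVCVC_COUNT:
        return _to_word(n, CVCVC)
    return _to_word(n - CVCVC_COUNT, VCVCV)


def decode_word(word):
    """Map a 5-letter word back to its 16-bit number."""
    if len(word) != 5:
        raise ValueError("expected a 5-letter word: %r" % (word,))
    if word[0] in VOWELS:
        n = _from_word(word, VCVCV) + CVCVC_COUNT
    else:
        n = _from_word(word, CVCVC)
    if n > 0xFFFF:
        # Fits the template but is past the 16-bit range (e.g. "uvula").
        raise ValueError("word out of range: %r" % (word,))
    return n


def encode_hex(hexstr):
    """Split a hex string into quads and join the resulting words with hyphens."""
    if len(hexstr) % 4:
        raise ValueError("hex string must be a whole number of quads")
    quads = (int(hexstr[i:i + 4], 16) for i in range(0, len(hexstr), 4))
    return "-".join(encode_quad(q) for q in quads)
```

With these definitions, encode_hex("3724d40d1200ac2103662df4cc205672adf800b2") reproduces the first line of the Examples section, and decode_word("uvula") is rejected because it would map to a number above 65535.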
The highly-reduced set of word-final consonants is to avoid ambiguities arising from word-final devoicing.

This part of the design is admittedly quite ad hoc; it's basically the best I could come up with after a bit of brainstorming with my girlfriend. Both of us are linguists, so it's at least plausible, but inventing substitution codings is hardly our specialty, so I'd love to hear suggestions for improvement. Unfortunately, it's not clear what sort of principled approach one could take to the problem of choosing the best alphabet...

The alphabets are reduced as much as possible; each BibbleBabble word contains ~16.02 bits of information (42000 CVCVC words plus 24500 VCVCV words gives 66500 possible words, of which 65536 are used). The use of mixed base instead of something more complicated serves, firstly, to simplify implementation and description, and secondly, to preserve the correspondence between initial substrings of BibbleBabble strings and binary/hex strings, so globbed matching, completion, etc. are possible.

Those extra 0.02 bits could be used to avoid words (e.g., "penis") that some might find objectionable; on the other hand, this would require much more finicky code, would only matter in the rare cases where such words do appear, and would only help with English words. I'm sure there are words meeting the above criteria that are naughty in other languages; avoiding them all would be more or less impossible.

Alternatives
============

BubbleBabble is a similar system used to represent ssh2 key fingerprints. It doesn't seem to be documented anywhere; CPAN has a Digest::BubbleBabble module, but the code isn't terribly readable. BubbleBabble seems to have a theoretical entropy of about 17 bits per 5-letter word, but somehow needs 11 such words to code a 160-bit hash (?); I don't understand this. The words themselves are extremely odd; many output strings are entirely unpronounceable. ("puxix"? "sakyv"? Those both occur in my ssh public key fingerprint...)
Perl source code:
  http://search.cpan.org/src/BTROTT/Digest-BubbleBabble-0.01/BubbleBabble.pm

http://tothink.com/mnemonic/ takes an interesting approach, representing 32-bit numbers as triples of English words. The dictionary is chosen to maximize memorability and phonetic distinctness; the apparent goal is to be able to memorize and read out things like crypto keys, and have software correct mishearings and misspellings when possible. The problems are the English-specificity ("inch-calypso-ibiza" is unlikely to be terribly memorable or even pronounceable to a non-English speaker!), the requirement of carrying around a dictionary, and the complicated mapping between the numbers and words. (3 words <-> 32 bits means each word has ~10.7 bits of entropy; in BibbleBabble each word encodes exactly 16 bits, which is convenient for giving 1-2 word SHA1 prefixes and the like.) The above page also cites and criticizes some attempts at similar systems.

RFC 1751 is similar to the above system, but probably even worse for our purposes (the basic unit is 6 words <-> 64 bits).