This is a small proof-of-concept implementation of a phonetic
representation for the SHA1 strings used in Monotone (though
it may be useful in other contexts where small-to-medium-sized
chunks of binary data need to be communicated, remembered, and
so on).  It is called "BibbleBabble", because I was reading
about "BubbleBabble" and, well, I'm not sure why it's called
BibbleBabble, actually.

The basic technique is to encode each 16-bit hex quad as a
5-letter nonsense word in Latin orthography.  The universe of
words is chosen to maximize phonetic simplicity with an eye
towards making it maximally useful across language populations,
maximally unambiguous when pronounced, simple to implement,
and reasonably easy to read, type, and remember.  Oh, and it
should be possible to do completion; a proper prefix of a
BibbleBabble representation corresponds to a proper prefix of
a binary/hex representation, modulo partial bits at the end.

Examples
========
  
  3724d40d1200ac2103662df4cc205672adf800b2
  kafun-imama-dogef-akuma-bemos-heban-ideme-mumel-amege-bahil

  fd4bfebcc4ee56b6cab877dba637aec7261604d3
  upeho-useni-esafu-munus-ibeke-suvam-adiwe-amufo-gikos-bifum

  f17240a7ed842e9244e6ca5982e4fb9b13e6f640
  uboba-kumot-ovonu-hehal-lelus-ibabe-tuvon-unafe-dudes-uhana

  4aa922f44c39d1ff62e3ecb09f2ede36eec3c606
  luduk-gazon-luvuk-ikizu-pafum-ovani-zisos-obipe-owofo-esufu


Contents of this directory
==========================

  README - this file.
  bibblebabble.py - a Python module implementing the
                    BibbleBabble encoding.
  bibblefilter.py - a Python program (using bibblebabble.py)
                    that copies stdin to stdout, converting
                    any hex strings it sees to bibblebabble.
                    Seems to have some bugs; I haven't
                    investigated fully.  Try piping monotone
                    through it, though, to get an idea of
                    what it looks like.
  bibble-words - all words from my /usr/share/dict/words that
                 are valid BibbleBabble words.  (Actually,
                 "uvula" isn't, because it would correspond
                 to a number > 65535.)  Makes for amusing
                 reading.  The fact that "semen", "penis",
                 and "nudes" all appear is a bit worrisome, but
                 their frequency of occurrence is rather low
                 (once in 65536 quads, in each case...), and
                 probably there are words in other languages
                 that would be just as objectionable; trying
                 to eliminate them all is hopeless.

Design and Implementation
=========================

Each 16-bit hex quad corresponds to a unique 5-letter quasi-word;
longer strings of hex (like full 160-bit SHA1 hashes) are split
into quads, each quad is converted to BibbleBabble, and the
resulting words are joined by hyphens.

The 5-letter quasi-words each follow one of two templates: CVCVC
or VCVCV (where C=consonant and V=vowel).

The vowels are always one of "aeiou".  The consonants fall into
three categories:
  word-initial: one of "bdfghklmnpstvwz"
  intervocalic: one of "bdfghklmnpsvwz"
  word-final: one of "fklmnpst"
Taking the letters in the order given, we can interpret any
CVCVC or VCVCV word as a number in mixed-base; this is exactly
what we do.  CVCVC words represent the corresponding number when
interpreted in this way; VCVCV words represent the corresponding
number + 42000.  (There are 42000 CVCVC words available -- the
effect is to first map the CVCVC words to the first 42000 numbers,
and then append the VCVCV words after that.)
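
As a rough sketch of the scheme just described (this is not the
code in bibblebabble.py; the function and constant names below
are made up for illustration, and the digit order, most
significant letter first, is inferred from the examples above):

```python
# Illustrative sketch only; names are hypothetical, not from bibblebabble.py.
VOWELS = "aeiou"
INITIAL = "bdfghklmnpstvwz"   # word-initial consonants
MEDIAL = "bdfghklmnpsvwz"     # intervocalic consonants
FINAL = "fklmnpst"            # word-final consonants

CVCVC = (INITIAL, VOWELS, MEDIAL, VOWELS, FINAL)
VCVCV = (VOWELS, MEDIAL, VOWELS, MEDIAL, VOWELS)
OFFSET = 42000  # number of CVCVC words: 15 * 5 * 14 * 5 * 8


def encode_quad(n):
    """Encode an integer in 0..65535 as a 5-letter word."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("quad out of range")
    if n < OFFSET:
        template = CVCVC
    else:
        template = VCVCV
        n -= OFFSET
    letters = []
    # Peel off mixed-base digits, least significant (last letter) first.
    for alphabet in reversed(template):
        n, digit = divmod(n, len(alphabet))
        letters.append(alphabet[digit])
    return "".join(reversed(letters))


def decode_word(word):
    """Decode a 5-letter word to an integer (may exceed 65535)."""
    if len(word) != 5:
        raise ValueError("word must have 5 letters")
    is_vcvcv = word[0] in VOWELS
    template = VCVCV if is_vcvcv else CVCVC
    n = 0
    for letter, alphabet in zip(word, template):
        # str.index raises ValueError for a letter outside the alphabet.
        n = n * len(alphabet) + alphabet.index(letter)
    return n + OFFSET if is_vcvcv else n


def encode_hex(hex_string):
    """Encode a hex string whose length is a multiple of 4."""
    quads = [hex_string[i:i + 4] for i in range(0, len(hex_string), 4)]
    return "-".join(encode_quad(int(q, 16)) for q in quads)
```

With these definitions, encode_hex() applied to the first hash
in the Examples section gives the word string shown there, and
decode_word("uvula") comes out above 65535, which is why that
word is not a valid encoding.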

There are a number of competing design constraints that inform
these choices.  The word structures are chosen to be as simple
as possible to maximize cross-linguistic usability -- in
particular, there are no diphthongs, no consonant clusters,
no geminates, the syllable structures are simple, and the
consonants are chosen in a principled way to maximize
articulatory distinctiveness.  (For instance, there is
only one glide, only one liquid, etc.)  In addition, the "t"
is eliminated from the intervocalic consonants because in English
intervocalic "t" and "d" are both reduced to a flap, and
indistinguishable; "d" is retained because flaps sound more
like d's than t's.  The highly-reduced set of word-final
consonants is to avoid ambiguities arising from word-final
devoicing.

This part of the design is admittedly quite ad hoc; it's
basically the best I could come up with after a bit of
brainstorming with my girlfriend.  Both of us are linguists, so
it's at least plausible, but inventing substitution codings
is hardly our specialty, so I'd love to hear suggestions for
improvement.  Unfortunately, it's not clear what sort of
principled approach one could take to the problem of choosing
the best alphabet...

The alphabets are reduced as much as possible; there are
42000 + 24500 = 66500 possible words, so each BibbleBabble
word can carry ~16.02 bits of information.  The use of mixed
base instead of something more complicated serves, first, to
simplify implementation and description and, second, to preserve
the correspondence between initial substrings of BibbleBabble
strings and binary/hex strings, so globbed matching, completion,
etc. are possible.
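
To illustrate the completion property: because the letters act
as mixed-base digits, every partial word covers one contiguous
range of 16-bit values.  The sketch below computes that range;
quad_range() and the constant names are hypothetical helpers
for this README, not part of bibblebabble.py.

```python
# Illustrative sketch only; names are hypothetical, not from bibblebabble.py.
VOWELS = "aeiou"
INITIAL = "bdfghklmnpstvwz"   # word-initial consonants
MEDIAL = "bdfghklmnpsvwz"     # intervocalic consonants
FINAL = "fklmnpst"            # word-final consonants

CVCVC = (INITIAL, VOWELS, MEDIAL, VOWELS, FINAL)
VCVCV = (VOWELS, MEDIAL, VOWELS, MEDIAL, VOWELS)
OFFSET = 42000  # VCVCV words start after the 42000 CVCVC words


def quad_range(partial):
    """Return (low, high) for quads whose word starts with `partial`.

    Returns None when no value in 0..65535 matches.
    """
    if not 1 <= len(partial) <= 5:
        raise ValueError("expected 1 to 5 letters")
    is_vcvcv = partial[0] in VOWELS
    template = VCVCV if is_vcvcv else CVCVC
    low = high = 0
    for position, alphabet in enumerate(template):
        if position < len(partial):
            # Known letter: both bounds use its digit value.
            digit_low = digit_high = alphabet.index(partial[position])
        else:
            # Unknown letter: smallest and largest possible digits.
            digit_low, digit_high = 0, len(alphabet) - 1
        low = low * len(alphabet) + digit_low
        high = high * len(alphabet) + digit_high
    if is_vcvcv:
        low, high = low + OFFSET, high + OFFSET
    if low > 0xFFFF:
        return None
    return low, min(high, 0xFFFF)


print(quad_range("k"))      # (14000, 16799)
print(quad_range("kafun"))  # (14116, 14116), i.e. 0x3724
```

Note that the range is contiguous in value but need not line
up with a whole number of hex digits; that is the "partial
bits" caveat.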

Those extra 0.02 bits could be used to avoid words (e.g., "penis")
that some might find objectionable; on the other hand, this
would require much more finicky code, would only matter in the
rare cases where such words do appear, and would only help with
English words.  I'm sure there are words meeting the above
criteria that are naughty in other languages; avoiding them all
would be more or less impossible.
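
The figures above can be checked with a few lines of Python.
This is only a back-of-the-envelope calculation from the three
consonant lists and the vowel list; the variable names are
made up for the example.

```python
import math

VOWELS = "aeiou"
INITIAL = "bdfghklmnpstvwz"   # word-initial consonants
MEDIAL = "bdfghklmnpsvwz"     # intervocalic consonants
FINAL = "fklmnpst"            # word-final consonants

# Count the words available under each template.
cvcvc_words = len(INITIAL) * len(VOWELS) * len(MEDIAL) * len(VOWELS) * len(FINAL)
vcvcv_words = len(VOWELS) * len(MEDIAL) * len(VOWELS) * len(MEDIAL) * len(VOWELS)
total_words = cvcvc_words + vcvcv_words

bits_per_word = math.log2(total_words)
unused_words = total_words - 2 ** 16  # words that never encode a quad

print(cvcvc_words, vcvcv_words, total_words)  # 42000 24500 66500
print(round(bits_per_word, 2))                # 16.02
print(unused_words)                           # 964
```

So the "extra 0.02 bits" amount to 964 spare VCVCV words at
the top of the range.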

Alternatives
============
  BubbleBabble is a similar system used to represent ssh2 key
    fingerprints.  It doesn't seem to be documented anywhere; CPAN
    has a Digest::BubbleBabble module, but the code isn't terribly
    readable.  BubbleBabble seems to have a theoretical entropy
    of about 17 bits per 5-letter word, but somehow needs 11 such
    words to code a 160-bit hash (?); I don't understand this.
    The words themselves are extremely odd; many output strings
    are entirely unpronounceable.  ("puxix"?  "sakyv"?  Those both
    occur in my ssh public key fingerprint...)
  Perl source code:
     http://search.cpan.org/src/BTROTT/Digest-BubbleBabble-0.01/BubbleBabble.pm

  http://tothink.com/mnemonic/ takes an interesting approach,
  representing 32-bit numbers as triples of English words.  The
  dictionary is chosen to maximize memorability and phonetic
  distinctness; the apparent goal is to be able to memorize and
  read out things like crypto keys, and have software correct
  mishearings and misspellings when possible.
  The problems are the English-specificity ("inch-calypso-ibiza" is
  unlikely to be terribly memorable or even pronounceable to a
  non-English speaker!), the requirement of carrying around a
  dictionary, and the complicated mapping between the numbers and
  words.  (3 words <-> 32 bits means each word has ~10.7 bits of
  entropy; in BibbleBabble each word encodes exactly 16 bits,
  which is convenient for giving 1-2 word SHA1 prefixes
  and the like.)

  The above page also cites and criticizes some attempts at
  similar systems.

  RFC 1751 is similar to the above system, but probably even worse
  for our purposes (the basic unit is 6 words <-> 64 bits).