I'd prefer it if we could use native interfaces for that. The main
application I would see for sound within Emacs would be
WYHIWYG-editing of LilyPond files and other sound-describing programs.
Starting one process for each bit of auditory feedback seems like
overkill.
One would rather want to keep a device/socket/pipe open, and ALSA
appears to be the most basic access method with a free future: you can
pretty much rely on its presence on current GNU/Linux systems.
I don't think it offers network transparency, but there is no
network-transparent variant one could rely on as the main access
method anyway.
So in order of urgency, one would probably implement:
a) play through a pipe and a command line app started once
b) play through native ALSA
Case a) should cover most systems once properly configured, as long
as there is some "play" utility that does not get into a tizzy when a
pipe underrun occurs. A change in sample rate or format would require
killing and restarting the process.
Case b) would be more efficient where ALSA is available.
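
To make case a) a bit more concrete, here is a rough sketch in Python
rather than Emacs Lisp, just to show the shape of it. The names
PipePlayer, make_command and aplay_command are made up for
illustration, and the aplay invocation is only one possible "play"
utility that I assume is installed; any tool that reads raw samples
from stdin would do. The player is started once, kept open, and only
restarted when the sample rate or channel count changes.

```python
import subprocess


class PipePlayer:
    """Keep one player process alive and feed it raw PCM through a pipe.

    Hypothetical helper: make_command(rate, channels) must return the
    argv list of a utility that reads raw samples from stdin.
    """

    def __init__(self, make_command):
        self.make_command = make_command
        self.proc = None
        self.fmt = None

    def _ensure(self, rate, channels):
        fmt = (rate, channels)
        alive = self.proc is not None and self.proc.poll() is None
        if alive and self.fmt == fmt:
            return  # reuse the process that is already running
        # Format changed (or the player died): shut down and start anew.
        self.close()
        self.proc = subprocess.Popen(
            self.make_command(rate, channels), stdin=subprocess.PIPE
        )
        self.fmt = fmt

    def play(self, pcm, rate=44100, channels=1):
        """Write raw PCM bytes to the (possibly freshly started) player."""
        self._ensure(rate, channels)
        self.proc.stdin.write(pcm)
        self.proc.stdin.flush()

    def close(self):
        """Close the pipe and wait for the player to exit."""
        if self.proc is None:
            return
        try:
            self.proc.stdin.close()
        except OSError:
            pass  # the player had already gone away
        self.proc.wait()
        self.proc = None
        self.fmt = None


def aplay_command(rate, channels):
    # Assumption: ALSA's aplay is installed; it reads stdin when no
    # file name is given. Sample format is fixed to signed 16-bit LE.
    return ["aplay", "-q", "-t", "raw", "-f", "S16_LE",
            "-r", str(rate), "-c", str(channels)]
```

With that, PipePlayer(aplay_command).play(samples) would start aplay
on the first call and reuse it afterwards. Whether a given utility
copes gracefully with the pipe running dry between calls is exactly
the "tizzy" question above and would need to be checked per utility.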