I'm not sure I follow. Apart from the references to the back-end TTS, I don't see anything in your proposed package that the other packages don't already do. I can see that Emacspeak may be too sophisticated, with a lot of additional functionality you don't need, such as session information, but its primary purpose is still to turn buffers into speech. Emacspeak is also somewhat 'invasive', so I can appreciate that it may be overkill for what you want. This is why I suggest you look at speechd.el, which does not modify Emacs behaviour to the extent Emacspeak does and essentially just takes buffer contents and sends them to speech-dispatcher.
I would not discount speech-dispatcher as a possible backend. I find current versions stable and reliable. Having written a number of speech services in the past, I know first hand how complex it is to implement a reliable and functional interface. There are a lot of hidden details which only become obvious when you start to use the system. For example, when reading a large buffer, you need to be able to pause speech and restart from the position you paused at, while at the same time ensuring that text is sent in sufficiently large chunks for the TTS engine to turn it into good quality speech (chunk size is often critical to this). You also need to handle various Unicode characters and filter out text which you don't want spoken. Consider a horizontal line of -----: do you want it to speak "horizontal line", say 'dash' 80 times, or say nothing, and what is the 'cutoff' point? Implementing a proof of concept is fairly trivial, but implementing something robust and usable is much, much harder.
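To make a couple of those details concrete, here is a rough sketch in Python. It is not how speechd.el, Emacspeak or speech-dispatcher actually do it; the names chunk_text, filter_rules and RULE_CUTOFF, the cutoff of 4 and the default chunk size are all made up for illustration. It splits text at sentence ends into chunks of a minimum size, keeps the buffer offset of each chunk so that speech can resume from where it was paused, and drops long runs of a repeated punctuation character.

```python
import re

# Hypothetical cutoff: a run of the same punctuation character at least this
# long is treated as decoration (e.g. a horizontal rule) and not spoken.
RULE_CUTOFF = 4

_RULE = re.compile(r"([-=_*#~])\1{%d,}" % (RULE_CUTOFF - 1))
_SENTENCE_END = re.compile(r"[.!?]\s+|\n\s*\n")


def filter_rules(text):
    """Replace long runs of one repeated punctuation character with a space."""
    return _RULE.sub(" ", text)


def chunk_text(text, start=0, min_chars=200):
    """Yield (offset, chunk) pairs covering text[start:].

    Chunks end at a sentence or paragraph break and are at least min_chars
    long, except possibly the last one. Offsets index into the unfiltered
    text, so a paused reader can resume with start=<offset of that chunk>.
    """
    for match in _SENTENCE_END.finditer(text, start):
        end = match.end()
        if end - start >= min_chars:
            yield start, text[start:end]
            start = end
    if start < len(text):
        yield start, text[start:]


def speakable_chunks(text, start=0, min_chars=200):
    """Yield (offset, filtered_chunk) pairs, skipping chunks with nothing to say."""
    for offset, chunk in chunk_text(text, start, min_chars):
        spoken = filter_rules(chunk).strip()
        if spoken:
            yield offset, spoken
```

Filtering happens per chunk, after the split, so the offsets always refer to the original buffer text. To resume after a pause, remember the offset of the chunk that was being spoken and call speakable_chunks(text, start=that_offset). Even this leaves out most of the hard parts, such as abbreviations that look like sentence ends, or deciding when a rule should be announced rather than skipped.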
Tim