I wanted to point out that the source code does not use any SSE intrinsics to vectorize the encoding process. I've found that a small but decent speed gain can be achieved by including the immintrin.h header and then compiling with GCC's auto-vectorization and link-time optimization (LTO) enabled. I also profiled a run with the --best option and used the results to optimize the program further. On a 134 MB file (the contents of the MS Reserved Partition, copied out with dd), encoding took 57 seconds with these optimizations versus 69 seconds without, about 17% less time. I would recommend looking into adding the intrinsics header so that GCC can automatically optimize the build for whichever CPU is in use. Including a header meant for a later CPU will not add intrinsics to the program that the target CPU cannot handle.
The speed increase does come at a cost: the final binary grew by about 4 KB.
I know that you like to keep the code simple, but just adding #include <immintrin.h> to the headers will allow for auto-vectorization without requiring any further changes to the existing code.
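To illustrate what I mean by leaving the work to the compiler, here is a minimal sketch of the kind of plain C loop that GCC can usually vectorize on its own. It is not code from this project: the function name count_equal_bytes and the build line in the comment are my own assumptions. In a build like this the -O3 and -march=native flags are what ask GCC to vectorize for the build machine's CPU, so they need to be there alongside the header.

```c
/* Hypothetical example, not taken from this project's source.
 * One possible build line (an assumption, adjust to taste):
 *   gcc -O3 -march=native -flto -fopt-info-vec -c example.c
 * -fopt-info-vec asks GCC to report which loops it vectorized.
 */
#include <stddef.h>
#include <stdint.h>

/* The header suggested above; guarded so the file still builds on non-x86 targets. */
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Count the positions at which two byte buffers of length n hold the same value.
 * The loop body is branch-free and has no early exit, which leaves the compiler
 * free to process several bytes per iteration with SIMD instructions. */
size_t count_equal_bytes(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(a[i] == b[i]);
    }
    return count;
}
```

No intrinsics are written by hand here, so the source stays as simple as it is now. Whether a given loop in the encoder actually gets vectorized is something the -fopt-info-vec report would have to confirm.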