A small update on this. I have a working implementation of the "halted state" mechanism for waiting until all pending flushes have completed. However, the way I get back to the cpus.c loop (the while(1) in qemu_tcg_cpu_thread_fn) is a bit convoluted. For the TLB ops, which always end the TB, a simple cpu_exit() is enough to get back to the main loop. I think we could also use cpu_loop_exit() in this case, though that would make the code a bit more complicated, since the PC would need some adjustment.
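To make the idea concrete, here is a toy, single-threaded model of the "halted state" described above. It is only a sketch and not QEMU code: ToyCPU, toy_request_flush_wait(), toy_flush_done() and toy_outer_loop_step() are hypothetical names, and the real loop would presumably sleep on a condition variable rather than return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy vCPU; none of these names exist in QEMU. */
typedef struct ToyCPU {
    int pending_flushes;  /* flush requests issued but not yet completed */
    bool exit_request;    /* "leave the TB loop at the next TB boundary" */
    bool halted;          /* outer loop must not run guest code while set */
    int tbs_executed;     /* how many TBs the toy loop has run */
} ToyCPU;

/* Called by a helper after it has queued nr_requests flushes elsewhere. */
static void toy_request_flush_wait(ToyCPU *cpu, int nr_requests)
{
    cpu->pending_flushes += nr_requests;
    cpu->halted = true;
    cpu->exit_request = true;  /* assumption: plays the role of cpu_exit() */
}

/* Called once for every flush that has been completed. */
static void toy_flush_done(ToyCPU *cpu)
{
    assert(cpu->pending_flushes > 0);
    if (--cpu->pending_flushes == 0) {
        cpu->halted = false;   /* last flush done: the vCPU may resume */
    }
}

/* One iteration of the outer loop; returns true if a TB was executed. */
static bool toy_outer_loop_step(ToyCPU *cpu)
{
    if (cpu->halted) {
        return false;          /* real code would block here instead */
    }
    cpu->exit_request = false;
    cpu->tbs_executed++;
    return true;
}
```

The point of the sketch is that the helper only sets flags and returns normally; the actual waiting happens in the outer loop once the current TB has ended.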
I then wanted to apply the same "halted state" to the LoadLink helper, since it too might issue flush requests. Here we cannot just call cpu_loop_exit(), because the guest code would miss the returned value. Forcing the LDREX instruction to also end the TB through an empty 'is_jmp' condition did the trick, which once again allows the use of cpu_exit(). Is there a better solution?
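To illustrate the LoadLink problem described above, here is a toy model in plain C. It assumes that cpu_loop_exit() leaves the helper through a non-local jump back to the execution loop, while cpu_exit() only raises a flag and lets the helper return. ToyCPU2, toy_ldlink_helper() and toy_run_ldlink() are hypothetical names, and 0x1234 stands in for the loaded value.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy vCPU; none of these names exist in QEMU. */
typedef struct ToyCPU2 {
    jmp_buf jmp_env;     /* where the toy execution loop can be re-entered */
    uint32_t reg;        /* destination register of the toy LDREX */
    bool exit_request;   /* flag checked at the end of the TB */
} ToyCPU2;

/* Toy LoadLink helper: loads a value and asks to leave the TB loop. */
static uint32_t toy_ldlink_helper(ToyCPU2 *cpu, bool use_longjmp)
{
    uint32_t loaded = 0x1234;          /* stand-in for the guest memory load */

    if (use_longjmp) {
        /* Analogue of cpu_loop_exit(): control never returns to the caller,
         * so 'loaded' is never handed back. */
        longjmp(cpu->jmp_env, 1);
    }
    /* Analogue of cpu_exit(): only set a flag and return normally. */
    cpu->exit_request = true;
    return loaded;
}

/* Toy "generated code" for LDREX: writes the helper result to the register. */
static uint32_t toy_run_ldlink(ToyCPU2 *cpu, bool use_longjmp)
{
    cpu->reg = 0;
    cpu->exit_request = false;
    if (setjmp(cpu->jmp_env) == 0) {
        cpu->reg = toy_ldlink_helper(cpu, use_longjmp);
    }
    /* After a longjmp we land here with the register write skipped. */
    return cpu->reg;
}
```

With use_longjmp set, the register keeps its old value because the assignment is skipped; with the flag-based exit, the value is written and the loop can still be left once the TB ends, which is why ending the TB right after LDREX makes the flag approach work.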