I see two ways to fix this: keep the cache locked while apply_reg_state runs
(fix #1), or make a local copy of 'rs' and apply that after unlocking (fix #2).
When I profile my executables, apply_reg_state consumes the most cycles, so
holding the lock across it would serialize threads on the hottest path.
I therefore prefer fix #2, even though the extra copy makes it more expensive
for single-threaded programs.
Executing apply_reg_state with the lock held is a problem only for
UNW_CACHE_GLOBAL. How does the performance of UNW_CACHE_PER_THREAD compare in
your tests?
I'm inclined to apply the more conservative fix #1 until we have more data on
the cost of the memcpy versus using UNW_CACHE_PER_THREAD.
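
For reference, here is a minimal sketch of how the two fixes differ in locking scope. It is not the libunwind code: reg_state_t, rs_cache_t, apply_stub, step_fix1 and step_fix2 are hypothetical stand-ins, and the cache is reduced to a single entry guarded by a pthread mutex.

```c
#include <pthread.h>
#include <string.h>

#define REG_COUNT 32

/* Hypothetical stand-in for a cached register-state entry ('rs'). */
typedef struct {
    long regs[REG_COUNT];
} reg_state_t;

/* Hypothetical one-entry cache shared between threads. */
typedef struct {
    pthread_mutex_t lock;
    reg_state_t entry;   /* may be overwritten by another thread once unlocked */
} rs_cache_t;

/* Stand-in for apply_reg_state: only reads the entry and sums it. */
static long apply_stub(const reg_state_t *rs)
{
    long sum = 0;
    for (size_t i = 0; i < REG_COUNT; i++)
        sum += rs->regs[i];
    return sum;
}

/* Fix #1: the lock stays held for the whole apply step, so the entry
 * cannot change underneath us, but other threads wait for the duration. */
static long step_fix1(rs_cache_t *c)
{
    pthread_mutex_lock(&c->lock);
    long result = apply_stub(&c->entry);
    pthread_mutex_unlock(&c->lock);
    return result;
}

/* Fix #2: copy the entry while locked, then apply the private copy
 * after unlocking. Costs one memcpy per call, shortens the critical section. */
static long step_fix2(rs_cache_t *c)
{
    reg_state_t local;
    pthread_mutex_lock(&c->lock);
    memcpy(&local, &c->entry, sizeof local);
    pthread_mutex_unlock(&c->lock);
    return apply_stub(&local);
}
```

Both variants return the same result for the same entry; the difference is only how long the lock is held versus the added copy, which is the cost that still needs measuring.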
-Arun