[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado gree
From: |
Alex Bennée |
Subject: |
Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green |
Date: |
Wed, 13 Dec 2023 10:57:39 +0000 |
User-agent: |
mu4e 1.11.26; emacs 29.1 |
Alex Bennée <alex.bennee@linaro.org> writes:
> As I'm a glutton for punishment I thought I'd have a go at fixing the
> slowly growing number of record/replay bugs. The two fixes are:
>
> replay: stop us hanging in rr_wait_io_event
> chardev: force write all when recording replay logs
>
> I think we are beyond 8.2 material but it would be nice to get this
> functionality stable again. We have a growing number of bugs under the
> icount label on gitlab:
>
> https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount
>
> Changes
> -------
>
> v2
>
> Apart from addressing tidy ups and tags I've been investigating the
> failures in replay_linux.py which are the more exhaustive tests which
> boot the kernel and user-space. The "fix":
>
> replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
>
> triggers around the time of the hang in the logs and despite the
> rather hairy EXCP->INT transitions around cpu_exec_loop() I think
> points to a genuine problem. I added the tracing to cputlb to verify
> the page tables are the same and started detecting divergence between
> record and replay a lot earlier on that when the replay_sync_error()
> catches things. I see patterns like this:
>
> 1878 tlb_fill 0x4770c000/1 1 2
> tlb_fill 0x4770c000/1 1 2
> 1879 tlb_fill 0x4770d000/1 1 2
> tlb_fill 0x4770d000/1 1 2
> 1880 tlb_fill 0x59000/1 0 2
> tlb_fill 0x59000/1 0 2
> 1881 >
> tlb_fill 0x476dd116/1 0 2
> 1882 tlb_fill 0x4770e000/1 1 2
> tlb_fill 0x4770e000/1 1 2
> 1883 tlb_fill 0x476dd527/1 0 2 |
> tlb_fill 0x476dfb17/1 0 2
> 1884 >
> tlb_fill 0x476de0fd/1 0 2
> 1885 >
> tlb_fill 0x476dce2e/1 0 2
> 1886 tlb_fill 0x4770f000/1 1 2
> tlb_fill 0x4770f000/1 1 2
> 1887 tlb_fill 0x476df939/1 0 2 <
> 1888 tlb_fill 0x47710000/1 1 2
> tlb_fill 0x47710000/1 1 2
> 1889 tlb_fill 0x47711000/1 1 2
> tlb_fill 0x47711000/1 1 2
>
> These don't seem to affect the overall program flow but are concerning
> because the memory access patterns should be the same. My
> investigations with rr seem to indicate the difference is due to
> behaviour of the victim_tlb_cache which again AFAICT should be
> deterministic.
>
> Anyway I can't spend any time debugging it this week so I thought I'd
> post the current state in case anyone is curious enough to want to go
> diving into record/replay.
Just for the actual failure report, on replay:
[ 144.918551] audit: type=1130 audit(1702415469.960:45): pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-tmpfiles-setup comm="
systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'
[ 145.579135] audit: type=1305 audit(1702415470.620:46): op=set
audit_enabled=1 old=1 auid=4294967295 ses=4294967295
subj=system_u:system_r:auditd_t:s0 res=1
qemu-system-aarch64: Exception not in log (insn total 30592471098/1463 left,
event 2848747 is EVENT_INSTRUCTION)
fish: Job 1, './qemu-system-aarch64 -display …' terminated by signal SIGABRT
(Abort)
I've made some more fixes to the replay_dump script and the next event
expected is a CLOCK_WARP followed by some more instructions before an
exception. However there are a lot of exceptions/interrupts firing off
in the stream:
2848736:EVENT_EXCEPTION(2) no data
2848737:EVENT_INSTRUCTION(0) + 835 -> 30592463170
2848738:EVENT_EXCEPTION(2) no data
2848739:EVENT_INSTRUCTION(0) + 1133 -> 30592464303
2848740:EVENT_EXCEPTION(2) no data
2848741:EVENT_INSTRUCTION(0) + 4739 -> 30592469042
2848742:EVENT_EXCEPTION(2) no data
2848743:EVENT_INSTRUCTION(0) + 865 -> 30592469907
2848744:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848745:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848746:EVENT_INTERRUPT(1)
2848747:EVENT_INSTRUCTION(0) + 2654 -> 30592472561
we are here...
2848748:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848749:EVENT_INSTRUCTION(0) + 184 -> 30592472745
2848750:EVENT_EXCEPTION(2) no data
2848751:EVENT_INSTRUCTION(0) + 832 -> 30592473577
2848752:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848753:EVENT_INSTRUCTION(0) + 6 -> 30592473583
2848754:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848755:EVENT_INSTRUCTION(0) + 2960 -> 30592476543
2848756:EVENT_EXCEPTION(2) no data
2848757:EVENT_INSTRUCTION(0) + 1565 -> 30592478108
2848758:EVENT_EXCEPTION(2) no data
2848759:EVENT_INSTRUCTION(0) + 2996 -> 30592481104
2848760:EVENT_EXCEPTION(2) no data
2848761:EVENT_INSTRUCTION(0) + 832 -> 30592481936
2848762:EVENT_ASYNC(9)
REPLAY_ASYNC_EVENT_NET(6) id 0 flags 0 305 bytes
That said given the complexity of the exception/interrupt processing
back and forth I can see there is potential for things getting confused.
Of course I still haven't explained why there is a divergence in the tlb
processing so early on despite seemingly having no effect on the replay
until this very late stage.
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
- [PATCH v2 14/16] accel/tcg: define tlb_fill as a trace point, (continued)
- [PATCH v2 14/16] accel/tcg: define tlb_fill as a trace point, Alex Bennée, 2023/12/11
- [PATCH v2 15/16] accel/tcg: add trace_tlb_resize trace point, Alex Bennée, 2023/12/11
- [PATCH v2 16/16] replay: report sync error when no exception in log (!DEBUG INVESTIGATION), Alex Bennée, 2023/12/11
- [PATCH v2 13/16] tests/avocado: remove skips from replay_kernel, Alex Bennée, 2023/12/11
- [PATCH v2 12/16] chardev: force write all when recording replay logs, Alex Bennée, 2023/12/11
- Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green,
Alex Bennée <=