[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2
From: |
Ludovic Courtès |
Subject: |
bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2 |
Date: |
Wed, 09 May 2018 09:17:35 +0200 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) |
Hi Mark,
Mark H Weaver <address@hidden> skribis:
> address@hidden (Ludovic Courtès) writes:
> [...]
>> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
>> #0 0x00007f7019db0d79 in scm_is_pair (x=<error reading variable: ERROR:
>> Cannot access memory at address 0x0>0x0) at ../libguile/pairs.h:159
>> #1 scm_ilength (sx=<optimized out>) at list.c:190
> [...]
>> What this means is that Thread 1 gets NULL instead of a list as its
>> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>>
>> How can arguments on the VM stack be zeroed?
>
> I doubt that's what happened, because I expect that each VM stack is
> dedicated to a single hardware thread. In theory, if a single VM stack
> is used by one thread, and then later used by another thread,
> thread-safety issues on the VM stack could arise in the absense of
> proper thread synchronization.
>
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.
It could be this, but this particular case is an “embarrassingly
parallel” program where threads work on independent data sets without
any inter-thread communication whatsoever.
What you describe could nevertheless be happening at a lower level,
within libguile, though it’s not clear to me where that could be.
>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.
I thought about it because that’s one way part of the VM stack could be
zeroed out.
>> Then I thought vp->sp might be out-of-sync compared to the local
>> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
>> mark a few items on the tip of the stack. So I did this:
>>
>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 9509cd643..1136b2271 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -151,7 +151,8 @@
>> code, or otherwise push anything on the stack, you will need to
>> CACHE_SP afterwards to restore the possibly-changed stack pointer. */
>>
>> -#define SYNC_IP() vp->ip = (ip)
>> +#define SYNC_IP() \
>> + do { vp->ip = (ip); vp->sp = (sp); } while (0)
>
> I don't see how a change like this could be useful for any thread safety
> problem.
I witnessed situations where the local ‘sp’ seemed to be different from
‘vp->sp’, though it’s hard to tell because I’m unsure where gcc stores
‘sp’. Here’s an example:
--8<---------------cut here---------------start------------->8---
(gdb) frame
#16 0x00007fabf30af2ca in vm_regular_engine (thread=0x24e6000, vp=0x22de6c0,
registers=0x0, resume=40) at vm-engine.c:785
785 ret = scm_apply_subr (sp, FRAME_LOCALS_COUNT ());
(gdb) p vp->sp
$5 = (union scm_vm_stack_element *) 0x7fabec158718
(gdb) p (union scm_vm_stack_element *) $r13
$6 = (union scm_vm_stack_element *) 0x7fabec158e30
(gdb) p $6 - $5
$7 = 227
(gdb) p vp->fp
$8 = (union scm_vm_stack_element *) 0x7fabec158730
(gdb) p vp->stack_top
$9 = (union scm_vm_stack_element *) 0x7fabec159000
(gdb) p vp->stack_bottom
$10 = (union scm_vm_stack_element *) 0x7fabec158000
(gdb) p vp->sp_min_since_gc
$11 = (union scm_vm_stack_element *) 0x7fabec158620
(gdb) info registers
rax 0x1 1
rbx 0xa 10
rcx 0x28 40
rdx 0x0 0
rsi 0x23f1920 37689632
rdi 0x24e6000 38690816
rbp 0x22de6c0 0x22de6c0
rsp 0x7fabcce18660 0x7fabcce18660
r8 0x1 1
r9 0x1 1
r10 0x100 256
r11 0x23f1920 37689632
r12 0x7fabf330b8c0 140376496191680
r13 0x7fabec158e30 140376376970800
r14 0x7fabf30c6d7c 140376493813116
r15 0x7fabf0fa7f28 140376459083560
rip 0x7fabf30af2ca 0x7fabf30af2ca <vm_regular_engine+14058>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
--8<---------------cut here---------------end--------------->8---
My hypothesis was that such a bug could lead heap elements to be
reclaimed too early. This is more likely to happen in a multi-threaded
context because one thread could be allocating memory and triggering a
GC while another thread is invoking a subr with an out-of-sync ‘vp->sp’.
Does that make sense?
> For now, I would suggest avoiding multi-threaded code in Guix, or at
> least to avoid loading any Scheme code from multiple threads.
>
> How about using multiple processes instead?
We could do that, but with my Guile maintainer hat on (a hat I don’t
wear that often as you might have noticed ;-)) I think it would be nice
to fix the issue.
Thanks,
Ludo’.