poke-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] Add io space MMAP for memory mapping devices and files


From: Jose E. Marchesi
Subject: Re: [PATCH v2] Add io space MMAP for memory mapping devices and files
Date: Tue, 09 Jan 2024 16:47:32 +0100
User-agent: Gnus/5.13 (Gnus v5.13)

> Hi Jose,
>
> "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Di, 09. Jan 15:01:
>> 
>> > Hi Jose,
>> >
>> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 19:55:
>> >> 
>> >> > Hi Jose,
>> >> >
>> >> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 18:06:
>> >> >> 
>> >> >> > Hi Jose,
>> >> >> >
>> >> >> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 15:08:
>> >> >> >> 
>> >> >> >> > Hi Jose,
>> >> >> >> >
>> >> >> >> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 00:03:
>> >> >> >> >> 
>> >> >> >> >> > Hi again,
>> >> >> >> >> >
>> >> >> >> >> > Andreas Klinger <ak@it-klinger.de> schrieb am Sa, 06. Jan 
>> >> >> >> >> > 20:42:
>> >> >> >> >> >> Hi Jose,
>> >> >> >> >> >> 
>> >> >> >> >> >> "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Sa, 06. Jan 
>> >> >> >> >> >> 20:14:
>> >> >> >> >> >> > > I think that the approach of patch v2 could be optimized 
>> >> >> >> >> >> > > by a .set alignment
>> >> >> >> >> >> > > variable for those guys who are aware of such hardware 
>> >> >> >> >> >> > > constraints on their
>> >> >> >> >> >> > > architecture and want to optimize this. But I would set 
>> >> >> >> >> >> > > the default to a value
>> >> >> >> >> >> > > which works for all, which is the size of the address bus.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > I think it is perfectly ok to have IOD specific 
>> >> >> >> >> >> > configuration
>> >> >> >> >> >> > parameters, sure.
>> >> >> >> >> >> > 
>> >> >> >> >> >> > Let's make it a Poke variable that the user can customize, 
>> >> >> >> >> >> > plus a
>> >> >> >> >> >> > setting in the registry so the .set dot-command works on 
>> >> >> >> >> >> > it.  I suppose
>> >> >> >> >> >> > it shall have the name mmap in it, like pk_mmap_alignment.  
>> >> >> >> >> >> > WDYT?
>> >> >> >> >> >> 
>> >> >> >> >> >> great. It makes sense to name it mmap as it only applies to 
>> >> >> >> >> >> mmap io spaces.
>> >> >> >> >> > [...]
>> >> >> >> >> >
>> >> >> >> >> > After thinking about the problem again I came to a different 
>> >> >> >> >> > solution which is
>> >> >> >> >> > much simpler, doesn't need an additional .set variable and 
>> >> >> >> >> > avoids the signal
>> >> >> >> >> > handler as well as it is copying with the address bus size if 
>> >> >> >> >> > possible, thus
>> >> >> >> >> > avoiding performance bottlenecks. The read function is as 
>> >> >> >> >> > follows:
>> >> >> >> >> >
>> >> >> >> >> > static int
>> >> >> >> >> > ios_dev_mmap_pread (void *iod, void *buf, size_t count, 
>> >> >> >> >> > ios_dev_off offset)
>> >> >> >> >> > {
>> >> >> >> >> >   struct ios_dev_mmap *dev_map = iod;
>> >> >> >> >> >   int align = sizeof(void*);
>> >> >> >> >> >   char *m = buf;
>> >> >> >> >> >
>> >> >> >> >> >   if (offset > dev_map->size || count > dev_map->size - offset)
>> >> >> >> >> >     return IOD_EOF;
>> >> >> >> >> >
>> >> >> >> >> >   /* copy unaligned bytes */
>> >> >> >> >> >   while (count && offset % align)
>> >> >> >> >> >     {
>> >> >> >> >> >       memcpy (m, dev_map->addr + offset, 1);
>> >> >> >> >> >       count--;
>> >> >> >> >> >       offset++;
>> >> >> >> >> >       m++;
>> >> >> >> >> >     }
>> >> >> >> >> >
>> >> >> >> >> >   /* copy with the address bus size */
>> >> >> >> >> >   while (count >= align)
>> >> >> >> >> >     {
>> >> >> >> >> >       memcpy (m, dev_map->addr + offset, align);
>> >> >> >> >> 
>> >> >> >> >> This is still copying a byte at a time, isn't it?
>> >> >> >> >
>> >> >> >> > There is no general answer to this question. It depends on what 
>> >> >> >> > the C library
>> >> >> >> > and the compiler are making out of it. Thus it's also dependent 
>> >> >> >> > on the
>> >> >> >> > architecture and the value of align.
>> >> >> >> 
>> >> >> >> Actually, memcpy is guaranteed to work regardless of any alignment
>> >> >> >> requirement restrictions imposed by the architecture targetted by 
>> >> >> >> the
>> >> >> >> compiler, so the general answer is that you can always assume that
>> >> >> >> memcpy is equivalent to copying bytes by bytes.
>> >> >> >> 
>> >> >> >> The compiler will only optimize a memcpy (foo, bar, 8) into a long 
>> >> >> >> word
>> >> >> >> move if it knows that foo and bar are properly aligned according to 
>> >> >> >> the
>> >> >> >> rules of the target.
>> >> >> >
>> >> >> > You're right.
>> >> >> >
>> >> >> > But I'm still wondering why I get the SIGBUS (on zynqmp) when doing 
>> >> >> > a memcpy of
>> >> >> > 2 bytes from an odd address which is mmaped, like:
>> >> >> >
>> >> >> > memcpy(some-variable, odd-address, 2);
>> >> >> 
>> >> >> I have no idea.  I guess that's up to the driver providing the mmapped
>> >> >> contents?  Is the driver documenting the rules on how to access the
>> >> >> contents?
>> >> >
>> >> > I wrote the driver and I'm doing a usual memory mapping.
>> >> 
>> >> Ok then it would be very strange to get SIGBUS when accessing a single
>> >> byte.
>> >> 
>> >> >
>> >> > I wrote a function which demonstrates the difference if memcpy is 
>> >> > called with a
>> >> > variable a (for alignment) or with a fixed value:
>> >> >
>> >> > #include <string.h>
>> >> >
>> >> > int mycopy(void)
>> >> > {
>> >> >         int a = 4;
>> >> >         unsigned int x;
>> >> >         unsigned char y[4];
>> >> >
>> >> >         memcpy(&x, &y[0], 4);
>> >> >
>> >> >         memcpy(&x, &y[0], a);
>> >> >
>> >> >         return 0;
>> >> > }
>> >> >
>> >> > In the objdump we can see that when called with the fixed value of 4 
>> >> > bytes to be
>> >> > copied there are just 2 instructions and the function memcpy isn't 
>> >> > called at
>> >> > all. So the compiler resolves the memcpy without the function call.
>> >> 
>> >> That is to be expected, since the compiler knows the addresses of both x
>> >> and y automatics, and therefore their alignment.
>> >
>> > see disassemble below.
>> >
>> >> > I think this could really be optimized in the code snipped I provided 
>> >> > earlier to
>> >> > avoid byte by byte copying.
>> >> 
>> >> In the code snippet earlier the address of the buffer where the stuff is
>> >> copied is passed as an argument to th function, so I don't see how the
>> >> compiler could optimize it that way.
>> >> 
>> >> Have you tried to disassemble the pread function in the mmap IOD to see
>> >> what the calls to memcpy are compiled to?
>> >
>> > Here is the objdump of libpoke created for bbb:
>> >
>> > 0009c170 <ios_dev_mmap_pread>:
>> > {
>> >    9c170:  e92d4030        push    {r4, r5, lr}
>> >    9c174:  e59d300c        ldr     r3, [sp, #12]
>> >    9c178:  e59de010        ldr     lr, [sp, #16]
>> >   if (offset > dev_map->size || count > dev_map->size - offset)
>> >    9c17c:  e5904028        ldr     r4, [r0, #40]   ; 0x28
>> >    9c180:  e590c02c        ldr     ip, [r0, #44]   ; 0x2c
>> >    9c184:  e1540003        cmp     r4, r3
>> >    9c188:  e0dc500e        sbcs    r5, ip, lr
>> >    9c18c:  3a000024        bcc     9c224 <ios_dev_mmap_pread+0xb4>
>> >    9c190:  e0544003        subs    r4, r4, r3
>> >    9c194:  e0ccc00e        sbc     ip, ip, lr
>> >    9c198:  e1540002        cmp     r4, r2
>> >    9c19c:  e2dcc000        sbcs    ip, ip, #0
>> >    9c1a0:  2a000007        bcs     9c1c4 <ios_dev_mmap_pread+0x54>
>> >     return IOD_EOF;
>> >    9c1a4:  e3e00004        mvn     r0, #4
>> > }
>> >    9c1a8:  e8bd8030        pop     {r4, r5, pc}
>> >       memcpy (m, dev_map->addr + offset, 1);
>> >    9c1ac:  e590c030        ldr     ip, [r0, #48]   ; 0x30
>> >    9c1b0:  e7dcc003        ldrb    ip, [ip, r3]
>> >    9c1b4:  e4c1c001        strb    ip, [r1], #1
>> >       count--;
>> >    9c1b8:  e2422001        sub     r2, r2, #1
>> >       offset++;
>> >    9c1bc:  e2933001        adds    r3, r3, #1
>> >    9c1c0:  e2aee000        adc     lr, lr, #0
>> >   while (count && offset % align)
>> >    9c1c4:  e3520000        cmp     r2, #0
>> >    9c1c8:  0a000008        beq     9c1f0 <ios_dev_mmap_pread+0x80>
>> >    9c1cc:  e3130003        tst     r3, #3
>> >    9c1d0:  1afffff5        bne     9c1ac <ios_dev_mmap_pread+0x3c>
>> >    9c1d4:  ea000005        b       9c1f0 <ios_dev_mmap_pread+0x80>
>> >       memcpy (m, dev_map->addr + offset, align);
>> >    9c1d8:  e590c030        ldr     ip, [r0, #48]   ; 0x30
>> >    9c1dc:  e79cc003        ldr     ip, [ip, r3]
>> >    9c1e0:  e481c004        str     ip, [r1], #4
>> >       count -= align;
>> >    9c1e4:  e2422004        sub     r2, r2, #4
>> >       offset += align;
>> >    9c1e8:  e2933004        adds    r3, r3, #4
>> >    9c1ec:  e2aee000        adc     lr, lr, #0
>> >   while (count >= align)
>> >    9c1f0:  e3520003        cmp     r2, #3
>> >    9c1f4:  8afffff7        bhi     9c1d8 <ios_dev_mmap_pread+0x68>
>> >    9c1f8:  ea000005        b       9c214 <ios_dev_mmap_pread+0xa4>
>> >       memcpy (m, dev_map->addr + offset, 1);
>> >    9c1fc:  e590c030        ldr     ip, [r0, #48]   ; 0x30
>> >    9c200:  e7dcc003        ldrb    ip, [ip, r3]
>> >    9c204:  e4c1c001        strb    ip, [r1], #1
>> >       count--;
>> >    9c208:  e2422001        sub     r2, r2, #1
>> >       offset++;
>> >    9c20c:  e2933001        adds    r3, r3, #1
>> >    9c210:  e2aee000        adc     lr, lr, #0
>> >   while (count)
>> >    9c214:  e3520000        cmp     r2, #0
>> >    9c218:  1afffff7        bne     9c1fc <ios_dev_mmap_pread+0x8c>
>> >   return IOD_OK;
>> >    9c21c:  e3a00000        mov     r0, #0
>> >    9c220:  e8bd8030        pop     {r4, r5, pc}
>> >     return IOD_EOF;
>> >    9c224:  e3e00004        mvn     r0, #4
>> >    9c228:  e8bd8030        pop     {r4, r5, pc}
>> >
>> >
>> > The glibc function memcpy is never called but the compiler builtin is 
>> > used. This
>> > makes it clear where the SIGBUS comes from.
>> 
>> So the builtin expands to ldr an dstr aarch64 instructions, which AFAIK
>> both support unaligned loads and stores.  
>
> The disassemble is from beaglebone black which is arm. Because I'm travelling
> these days I don't have access to the other aarch64 system.
>
>> So these should not result in a SIGBUS when accessing normal memory and the
>> compiler is free to use them regardless of the alignment of the pointers.  So
>> the question is: why do they trigger a SIGBUS while accessing an address 
>> space
>> backed by your driver?
>
> My driver is not mapping normal memory but io memory. Whenever we access it
> there's an access to the bus which needs to be properly aligned and otherwise
> throws a SIGBUS.

Ok, so this clarifies it.  Thanks for explaining.

If this is a typical usage of the mmap IOD (and it is, thats why you are
writing it :)) and it is expected that these mmaped memory areas have
more stringent alignment requirements, then you cannot use memcpy in the
implementation of pread/pwrite.  You will have to use aligned
read/writes under the hood so the IOD will never do unaligned accessed,
no matter what base address and byte count you get from the IOS layer.

Are these IO memory alignment requirements homogeneous among different
targets in the Linux kernel?

If yes, then a generic IOD pread/write implementation may be enough,
using these requirements.

If not, then we will indeed have to 1) allow the user to configure the
alignment schema to be used by the mmap IOD and 2) be ready to catch
SIGBUS and raise the proper Poke exception (E_ios_alignment for example)
if the chosen schema doesn't match what the mmapped area requires.

Hope this makes sense..

>
>
>> >
>> >  
>> >> >
>> >> > Here is the objdump:
>> >> >
>> >> > mycopy.o:     file format elf32-littlearm
>> >> >
>> >> > Disassembly of section .text:
>> >> >
>> >> > 00000000 <mycopy>:
>> >> > #include <string.h>
>> >> >
>> >> > int mycopy(void)
>> >> > {
>> >> >    0:   e92d4800        push    {fp, lr}
>> >> >    4:   e28db004        add     fp, sp, #4
>> >> >    8:   e24dd010        sub     sp, sp, #16
>> >> >    c:   e59f2070        ldr     r2, [pc, #112]  ; 84 <mycopy+0x84>
>> >> >   10:   e08f2002        add     r2, pc, r2
>> >> >   14:   e59f306c        ldr     r3, [pc, #108]  ; 88 <mycopy+0x88>
>> >> >   18:   e7923003        ldr     r3, [r2, r3]
>> >> >   1c:   e5933000        ldr     r3, [r3]
>> >> >   20:   e50b3008        str     r3, [fp, #-8]
>> >> >   24:   e3a03000        mov     r3, #0
>> >> >         int a = 4;
>> >> >   28:   e3a03004        mov     r3, #4
>> >> >   2c:   e50b3010        str     r3, [fp, #-16]
>> >> >   30:   e51b300c        ldr     r3, [fp, #-12]
>> >> >         unsigned int x;
>> >> >         unsigned char y[4];
>> >> >
>> >> >         memcpy(&x, &y[0], 4);
>> >> >   34:   e50b3014        str     r3, [fp, #-20]  ; 0xffffffec
>> >> >
>> >> >         memcpy(&x, &y[0], a);
>> >> >   38:   e51b2010        ldr     r2, [fp, #-16]
>> >> >   3c:   e24b100c        sub     r1, fp, #12
>> >> >   40:   e24b3014        sub     r3, fp, #20
>> >> >   44:   e1a00003        mov     r0, r3
>> >> >   48:   ebfffffe        bl      0 <memcpy>
>> >> >
>> >> >         return 0;
>> >> >   4c:   e3a03000        mov     r3, #0
>> >> >   [...]
>> >> >
>> >> >> >> > But yes I need to confess that when looking into the glibc for 
>> >> >> >> > example this code
>> >> >> >> > snippet could be optimized to allow larger amounts of data to be 
>> >> >> >> > passed to the
>> >> >> >> > memcpy function so that it's up to the C library and compiler to 
>> >> >> >> > optimize it the
>> >> >> >> > best.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> 
>> >> >> >> >> >       count -= align;
>> >> >> >> >> >       offset += align;
>> >> >> >> >> >       m += align;
>> >> >> >> >> >     }
>> >> >> >> >> >
>> >> >> >> >> >   /* copy remaining unaligned bytes */
>> >> >> >> >> >   while (count)
>> >> >> >> >> >     {
>> >> >> >> >> >       memcpy (m, dev_map->addr + offset, 1);
>> >> >> >> >> >       count--;
>> >> >> >> >> >       offset++;
>> >> >> >> >> >       m++;
>> >> >> >> >> >     }
>> >> >> >> >> >
>> >> >> >> >> >   return IOD_OK;
>> >> >> >> >> > }
>> >> >> >> >> >
>> >> >> >> >> > The write function is quite similar implemented in the same 
>> >> >> >> >> > manner.
>> >> >> >> >> >
>> >> >> >> >> > WDYT?
>> >> >> >> >> >
>> >> >> >> >> > Andreas
>> >> >> >> >
>> >> >> >> > Best regards,
>> >> >> >> >
>> >> >> >> > Andreas



reply via email to

[Prev in Thread] Current Thread [Next in Thread]