poke-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] Add io space MMAP for memory mapping devices and files


From: Andreas Klinger
Subject: Re: [PATCH v2] Add io space MMAP for memory mapping devices and files
Date: Tue, 9 Jan 2024 14:37:54 +0100

Hi Jose,

"Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 19:55:
> 
> > Hi Jose,
> >
> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 18:06:
> >> 
> >> > Hi Jose,
> >> >
> >> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 15:08:
> >> >> 
> >> >> > Hi Jose,
> >> >> >
> >> >> > "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Mo, 08. Jan 00:03:
> >> >> >> 
> >> >> >> > Hi again,
> >> >> >> >
> >> >> >> > Andreas Klinger <ak@it-klinger.de> schrieb am Sa, 06. Jan 20:42:
> >> >> >> >> Hi Jose,
> >> >> >> >> 
> >> >> >> >> "Jose E. Marchesi" <jemarch@gnu.org> schrieb am Sa, 06. Jan 20:14:
> >> >> >> >> > > I think that the approach of patch v2 could be optimized by a 
> >> >> >> >> > > .set alignment
> >> >> >> >> > > variable for those guys who are aware of such hardware 
> >> >> >> >> > > constraints on their
> >> >> >> >> > > architecture and want to optimize this. But I would set the 
> >> >> >> >> > > default to a value
> >> >> >> >> > > which works for all, which is the size of the address bus.
> >> >> >> >> > 
> >> >> >> >> > I think it is perfectly ok to have IOD specific configuration
> >> >> >> >> > parameters, sure.
> >> >> >> >> > 
> >> >> >> >> > Let's make it a Poke variable that the user can customize, plus 
> >> >> >> >> > a
> >> >> >> >> > setting in the registry so the .set dot-command works on it.  I 
> >> >> >> >> > suppose
> >> >> >> >> > it shall have the name mmap in it, like pk_mmap_alignment.  
> >> >> >> >> > WDYT?
> >> >> >> >> 
> >> >> >> >> great. It makes sense to name it mmap as it only applies to mmap 
> >> >> >> >> io spaces.
> >> >> >> > [...]
> >> >> >> >
> >> >> >> > After thinking about the problem again I came to a different 
> >> >> >> > solution which is
> >> >> >> > much simpler, doesn't need an additional .set variable and avoids 
> >> >> >> > the signal
> >> >> >> > handler as well as it is copying with the address bus size if 
> >> >> >> > possible, thus
> >> >> >> > avoiding performance bottlenecks. The read function is as follows:
> >> >> >> >
> >> >> >> > static int
> >> >> >> > ios_dev_mmap_pread (void *iod, void *buf, size_t count, 
> >> >> >> > ios_dev_off offset)
> >> >> >> > {
> >> >> >> >   struct ios_dev_mmap *dev_map = iod;
> >> >> >> >   int align = sizeof(void*);
> >> >> >> >   char *m = buf;
> >> >> >> >
> >> >> >> >   if (offset > dev_map->size || count > dev_map->size - offset)
> >> >> >> >     return IOD_EOF;
> >> >> >> >
> >> >> >> >   /* copy unaligned bytes */
> >> >> >> >   while (count && offset % align)
> >> >> >> >     {
> >> >> >> >       memcpy (m, dev_map->addr + offset, 1);
> >> >> >> >       count--;
> >> >> >> >       offset++;
> >> >> >> >       m++;
> >> >> >> >     }
> >> >> >> >
> >> >> >> >   /* copy with the address bus size */
> >> >> >> >   while (count >= align)
> >> >> >> >     {
> >> >> >> >       memcpy (m, dev_map->addr + offset, align);
> >> >> >> 
> >> >> >> This is still copying a byte at a time, isn't it?
> >> >> >
> >> >> > There is no general answer to this question. It depends on what the C 
> >> >> > library
> >> >> > and the compiler are making out of it. Thus it's also dependent on the
> >> >> > architecture and the value of align.
> >> >> 
> >> >> Actually, memcpy is guaranteed to work regardless of any alignment
> >> >> requirement restrictions imposed by the architecture targetted by the
> >> >> compiler, so the general answer is that you can always assume that
> >> >> memcpy is equivalent to copying bytes by bytes.
> >> >> 
> >> >> The compiler will only optimize a memcpy (foo, bar, 8) into a long word
> >> >> move if it knows that foo and bar are properly aligned according to the
> >> >> rules of the target.
> >> >
> >> > You're right.
> >> >
> >> > But I'm still wondering why I get the SIGBUS (on zynqmp) when doing a 
> >> > memcpy of
> >> > 2 bytes from an odd address which is mmaped, like:
> >> >
> >> > memcpy(some-variable, odd-address, 2);
> >> 
> >> I have no idea.  I guess that's up to the driver providing the mmapped
> >> contents?  Is the driver documenting the rules on how to access the
> >> contents?
> >
> > I wrote the driver and I'm doing a usual memory mapping.
> 
> Ok then it would be very strange to get SIGBUS when accessing a single
> byte.
> 
> >
> > I wrote a function which demonstrates the difference if memcpy is called 
> > with a
> > variable a (for alignment) or with a fixed value:
> >
> > #include <string.h>
> >
> > int mycopy(void)
> > {
> >     int a = 4;
> >     unsigned int x;
> >     unsigned char y[4];
> >
> >     memcpy(&x, &y[0], 4);
> >
> >         memcpy(&x, &y[0], a);
> >
> >     return 0;
> > }
> >
> > In the objdump we can see that when called with the fixed value of 4 bytes 
> > to be
> > copied there are just 2 instructions and the function memcpy isn't called at
> > all. So the compiler resolves the memcpy without the function call.
> 
> That is to be expected, since the compiler knows the addresses of both x
> and y automatics, and therefore their alignment.

see disassemble below.

> > I think this could really be optimized in the code snipped I provided 
> > earlier to
> > avoid byte by byte copying.
> 
> In the code snippet earlier the address of the buffer where the stuff is
> copied is passed as an argument to th function, so I don't see how the
> compiler could optimize it that way.
> 
> Have you tried to disassemble the pread function in the mmap IOD to see
> what the calls to memcpy are compiled to?

Here is the objdump of libpoke created for bbb:

0009c170 <ios_dev_mmap_pread>:
{
   9c170:       e92d4030        push    {r4, r5, lr}
   9c174:       e59d300c        ldr     r3, [sp, #12]
   9c178:       e59de010        ldr     lr, [sp, #16]
  if (offset > dev_map->size || count > dev_map->size - offset)
   9c17c:       e5904028        ldr     r4, [r0, #40]   ; 0x28
   9c180:       e590c02c        ldr     ip, [r0, #44]   ; 0x2c
   9c184:       e1540003        cmp     r4, r3
   9c188:       e0dc500e        sbcs    r5, ip, lr
   9c18c:       3a000024        bcc     9c224 <ios_dev_mmap_pread+0xb4>
   9c190:       e0544003        subs    r4, r4, r3
   9c194:       e0ccc00e        sbc     ip, ip, lr
   9c198:       e1540002        cmp     r4, r2
   9c19c:       e2dcc000        sbcs    ip, ip, #0
   9c1a0:       2a000007        bcs     9c1c4 <ios_dev_mmap_pread+0x54>
    return IOD_EOF;
   9c1a4:       e3e00004        mvn     r0, #4
}
   9c1a8:       e8bd8030        pop     {r4, r5, pc}
      memcpy (m, dev_map->addr + offset, 1);
   9c1ac:       e590c030        ldr     ip, [r0, #48]   ; 0x30
   9c1b0:       e7dcc003        ldrb    ip, [ip, r3]
   9c1b4:       e4c1c001        strb    ip, [r1], #1
      count--;
   9c1b8:       e2422001        sub     r2, r2, #1
      offset++;
   9c1bc:       e2933001        adds    r3, r3, #1
   9c1c0:       e2aee000        adc     lr, lr, #0
  while (count && offset % align)
   9c1c4:       e3520000        cmp     r2, #0
   9c1c8:       0a000008        beq     9c1f0 <ios_dev_mmap_pread+0x80>
   9c1cc:       e3130003        tst     r3, #3
   9c1d0:       1afffff5        bne     9c1ac <ios_dev_mmap_pread+0x3c>
   9c1d4:       ea000005        b       9c1f0 <ios_dev_mmap_pread+0x80>
      memcpy (m, dev_map->addr + offset, align);
   9c1d8:       e590c030        ldr     ip, [r0, #48]   ; 0x30
   9c1dc:       e79cc003        ldr     ip, [ip, r3]
   9c1e0:       e481c004        str     ip, [r1], #4
      count -= align;
   9c1e4:       e2422004        sub     r2, r2, #4
      offset += align;
   9c1e8:       e2933004        adds    r3, r3, #4
   9c1ec:       e2aee000        adc     lr, lr, #0
  while (count >= align)
   9c1f0:       e3520003        cmp     r2, #3
   9c1f4:       8afffff7        bhi     9c1d8 <ios_dev_mmap_pread+0x68>
   9c1f8:       ea000005        b       9c214 <ios_dev_mmap_pread+0xa4>
      memcpy (m, dev_map->addr + offset, 1);
   9c1fc:       e590c030        ldr     ip, [r0, #48]   ; 0x30
   9c200:       e7dcc003        ldrb    ip, [ip, r3]
   9c204:       e4c1c001        strb    ip, [r1], #1
      count--;
   9c208:       e2422001        sub     r2, r2, #1
      offset++;
   9c20c:       e2933001        adds    r3, r3, #1
   9c210:       e2aee000        adc     lr, lr, #0
  while (count)
   9c214:       e3520000        cmp     r2, #0
   9c218:       1afffff7        bne     9c1fc <ios_dev_mmap_pread+0x8c>
  return IOD_OK;
   9c21c:       e3a00000        mov     r0, #0
   9c220:       e8bd8030        pop     {r4, r5, pc}
    return IOD_EOF;
   9c224:       e3e00004        mvn     r0, #4
   9c228:       e8bd8030        pop     {r4, r5, pc}


The glibc function memcpy is never called but the compiler builtin is used. This
makes it clear where the SIGBUS comes from.

 
> >
> > Here is the objdump:
> >
> > mycopy.o:     file format elf32-littlearm
> >
> > Disassembly of section .text:
> >
> > 00000000 <mycopy>:
> > #include <string.h>
> >
> > int mycopy(void)
> > {
> >    0:       e92d4800        push    {fp, lr}
> >    4:       e28db004        add     fp, sp, #4
> >    8:       e24dd010        sub     sp, sp, #16
> >    c:       e59f2070        ldr     r2, [pc, #112]  ; 84 <mycopy+0x84>
> >   10:       e08f2002        add     r2, pc, r2
> >   14:       e59f306c        ldr     r3, [pc, #108]  ; 88 <mycopy+0x88>
> >   18:       e7923003        ldr     r3, [r2, r3]
> >   1c:       e5933000        ldr     r3, [r3]
> >   20:       e50b3008        str     r3, [fp, #-8]
> >   24:       e3a03000        mov     r3, #0
> >     int a = 4;
> >   28:       e3a03004        mov     r3, #4
> >   2c:       e50b3010        str     r3, [fp, #-16]
> >   30:       e51b300c        ldr     r3, [fp, #-12]
> >     unsigned int x;
> >     unsigned char y[4];
> >
> >     memcpy(&x, &y[0], 4);
> >   34:       e50b3014        str     r3, [fp, #-20]  ; 0xffffffec
> >
> >     memcpy(&x, &y[0], a);
> >   38:       e51b2010        ldr     r2, [fp, #-16]
> >   3c:       e24b100c        sub     r1, fp, #12
> >   40:       e24b3014        sub     r3, fp, #20
> >   44:       e1a00003        mov     r0, r3
> >   48:       ebfffffe        bl      0 <memcpy>
> >
> >     return 0;
> >   4c:       e3a03000        mov     r3, #0
> >   [...]
> >
> >> >> > But yes I need to confess that when looking into the glibc for 
> >> >> > example this code
> >> >> > snippet could be optimized to allow larger amounts of data to be 
> >> >> > passed to the
> >> >> > memcpy function so that it's up to the C library and compiler to 
> >> >> > optimize it the
> >> >> > best.
> >> >> >
> >> >> >
> >> >> >> 
> >> >> >> >       count -= align;
> >> >> >> >       offset += align;
> >> >> >> >       m += align;
> >> >> >> >     }
> >> >> >> >
> >> >> >> >   /* copy remaining unaligned bytes */
> >> >> >> >   while (count)
> >> >> >> >     {
> >> >> >> >       memcpy (m, dev_map->addr + offset, 1);
> >> >> >> >       count--;
> >> >> >> >       offset++;
> >> >> >> >       m++;
> >> >> >> >     }
> >> >> >> >
> >> >> >> >   return IOD_OK;
> >> >> >> > }
> >> >> >> >
> >> >> >> > The write function is quite similar implemented in the same manner.
> >> >> >> >
> >> >> >> > WDYT?
> >> >> >> >
> >> >> >> > Andreas
> >> >> >
> >> >> > Best regards,
> >> >> >
> >> >> > Andreas

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]