
From: Markus Armbruster
Subject: Re: [Qemu-arm] [Qemu-devel] [RFC PATCH] hw/arm/virt: use variable size of flash device to save memory
Date: Thu, 11 Apr 2019 09:15:57 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Xiang Zheng <address@hidden> writes:

> Hi Kevin,
> On 2019/4/9 16:28, Kevin Wolf wrote:
>>> On 09.04.2019 at 08:01, Markus Armbruster wrote:
>>> László's last sentence below is "This really needs the attention of the
>>> block people."  Cc'ing some.
>>> Laszlo Ersek <address@hidden> writes:
>>>> On 04/08/19 15:43, Xiang Zheng wrote:
>>>>> On 2019/4/3 23:35, Laszlo Ersek wrote:
>>>>>>> I thought about your comments and wrote the following patch (just for test)
>>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to
>>>>>>> work fine. So why not use a file mapping to read or write a pflash device?
>>>>>> Honestly I can't answer "why not do this". Maybe we should.
>>>>>> Regarding "why not do this *exactly as shown below*" -- probably because
>>>>>> then we'd have updates to the same underlying regular file via both
>>>>>> mmap() and write(), and the interactions between those are really tricky
>>>>>> (= best avoided).
>>>>>> One of my favorite questions over the years used to be posting a matrix
>>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the
>>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom,
>>>>>> "is my understanding correct?" I've never ever received an answer to
>>>>>> that. :)
>>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a
>>>>> block backend via mmap() and write().
>>>> Maybe. I think that would be a first in QEMU, and you'd likely have to
>>>> describe and follow a semi-formal model between fd actions and mmap()
>>>> actions.
>>>> Here's the (unconfirmed) table I referred to earlier.
>>>> +-------------+-----------------------------------------------------+
>>>> | change made | change visible via                                  |
>>>> | through     +-----------------+-------------+---------------------+
>>>> |             | MAP_SHARED      | MAP_PRIVATE | read()              |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
>>>> |             |                 |             | MS_ASYNC, or normal |
>>>> |             |                 |             | system activity     |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | MAP_PRIVATE | no              | no          | no                  |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> | write()     | depends on      | unspecified | yes                 |
>>>> |             | MS_INVALIDATE,  |             |                     |
>>>> |             | or the system's |             |                     |
>>>> |             | read/write      |             |                     |
>>>> |             | consistency     |             |                     |
>>>> +-------------+-----------------+-------------+---------------------+
>>>> Presumably:
>>>> - a write() to some offset range of a regular file should be visible in
>>>> a MAP_SHARED mapping of that range after a matching msync(...,
>>>> MS_INVALIDATE) call;
>>>> - changes through a MAP_SHARED mapping to a regular file should be
>>>> visible via read() after a matching msync(..., MS_SYNC) call returns.
>>>> I sent this table first in 2010 to the Austin Group mailing list, and
>>>> received no comments. Then another person (on the same list) asked
>>>> basically the same questions in 2013, to which I responded with the
>>>> above assumptions / interpretations -- and received no comments from
>>>> third parties again.
>>>> But, I'm really out of my depth here -- we're not even discussing
>>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how
>>>> they would fit into the qemu block layer. And I have no idea.
>> There is no infrastructure in the block layer for mmapping image files.
>>>>>> Also... although we don't really use them in practice, some QCOW2
>>>>>> features for pflash block backends are -- or would be -- nice. (Not the
>>>>>> internal snapshots or the live RAM dumps, of course.)
>>>>> Regarding sparse files, can we read the actual backend data size for the
>>>>> read-only pflash memory via the following steps?
>>>>> 1) Create a sparse file -- QEMU_EFI-test.raw:
>>>>>    dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0
>>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:
>>>>>    dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc
>>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the
>>>>> below patch applied:
>>>>> ---8>---
>>>>> diff --git a/hw/block/block.c b/hw/block/block.c
>>>>> index bf56c76..ad18d5e 100644
>>>>> --- a/hw/block/block.c
>>>>> +++ b/hw/block/block.c
>>>>> @@ -30,7 +30,7 @@
>>>>>  bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>>                                   Error **errp)
>>>>>  {
>>>>> -    int64_t blk_len;
>>>>> +    int64_t blk_len, actual_len;
>>>>>      int ret;
>>>>>      blk_len = blk_getlength(blk);
>>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
>>>>>       * block device and read only on demand.
>>>>>       */
>>>>>      assert(size <= BDRV_REQUEST_MAX_BYTES);
>>>>> -    ret = blk_pread(blk, 0, buf, size);
>>>>> +    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
>>>>> +    ret = blk_pread(blk, 0, buf,
>>>>> +            (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
>>>>>      if (ret < 0) {
>>>>>          error_setg_errno(errp, -ret, "can't read block backend");
>>>>>          return false;
>>>> This hunk looks dubious for a general helper function. It seems to
>>>> assume that the holes are punched at the end of the file.
>>>> Sorry, this discussion is making me uncomfortable. I don't want to
>>>> ignore your questions, but I have no idea about the right answers. This
>>>> really needs the attention of the block people.
>> Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even
>> implemented in all block drivers, and when it is implemented it isn't
>> necessarily reliable (e.g. return 0 for block devices). It's fine for
>> 'qemu-img info', but that's it.
>> But even if you actually get the correct allocated file size, that's the
>> disk space used on the host filesystem, so it doesn't include any holes,
>> but it does include image format metadata, the data could possibly be
>> compressed etc. In other words, for a guest device it's completely
>> meaningless.
>> I'm not even sure why you would want to do this even in your special
>> case of a raw image that is sparse only at the end. Is it an
>> optimisation to avoid reading zeros? This ends up basically being
>> memset() for sparse blocks, so very quick. And I think you do want to
>> zero out the remaining part of the buffer anyway. So it looks to me as
>> if it were complicating the code for no use.
> The primary target of this discussion is to save the memory allocated for the
> UEFI firmware device. On the virtual machine, we split it into two flash
> devices[1] -- one read-only and one read-write, each 64MB in size. The qemu
> command line is:
>    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \
> Regarding the read-only one, which is created from QEMU_EFI.fd: the original
> QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB of
> 'zeros':
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
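As an aside, the zero-stuffing step above can be done sparsely on the host with
truncate(1), which pads with a hole instead of written zeros.  This saves host
disk space only, not guest RAM; the sketch below uses a tiny stand-in payload
instead of the real QEMU_EFI.fd:

```shell
# Sparse variant of the stuffing step: pad to 64MB with a hole rather
# than written zeros.  A tiny stand-in payload replaces the real 2MB
# QEMU_EFI.fd here.
set -e
dir=$(mktemp -d)
printf 'firmware' > "$dir/QEMU_EFI.fd"
cp "$dir/QEMU_EFI.fd" "$dir/QEMU_EFI-pflash.raw"
truncate -s 64M "$dir/QEMU_EFI-pflash.raw"
stat -c '%s' "$dir/QEMU_EFI-pflash.raw"   # apparent size: 67108864
du -B1 "$dir/QEMU_EFI-pflash.raw"         # allocated bytes: far below 64MB
rm -r "$dir"
```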
> For some historical reasons such as compatibility and extensibility[2], we
> restrict both their size to 64MB -- both UEFI and qemu have cold hard
> constants.

These reasons aren't historical.  But they're valid, and that's all that
matters :)

> This will consume a large amount of memory when running multiple VM
> simultaneously (each guest needs 128MB).


> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]:
> 1) Getting flash memory size from a machine type or property.
> 2) Map the read-only part so it is shared among guests; Map the read-write
> part normally.
> The first idea is called explicit configuration, which may break migration.

For any configuration that can break migration, the solution is to have
identical configuration on source and destination.

> For the second idea, the only way I can imagine is using a file mapping to
> read or write pflash devices so that we can allocate memory on demand. So I
> tried to fit mmap() actions into the block layer[4]. However, I am not sure
> whether mapping the image file can work well within the block layer.

It adds an odd special case to the block layer.  But then flash devices
are an odd user of the block layer.

Normal users use the block layer as a *block* layer: to read and write blocks.

Our flash devices don't do that.  They are *memory* devices, not *block*
devices.  Pretty much the only thing the two have in common is
persistence of data.  The block layer provides persistence, and I figure
that's why it got pressed into service here.

What the flash devices do is keep the complete image contents in memory,
with changes written through to disk.

This is pretty much exactly what mmap() does, so using it suggests itself.

It's very much *not* what the block layer does.  mmap() is not even
possible for anything but a file backend.

With mmap(), memory gets faulted in on demand, and sharing read-only
memory with other VMs should be feasible.  The question is how to use it
here.  Is there a sane way to use it without rewriting the flash devices
not to use the block layer?
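For illustration only, the read-only half boils down to the mechanism below: a
MAP_SHARED, PROT_READ mapping whose pages live in the page cache and are thus
shared between processes mapping the same file.  The function name is made up;
this is a sketch of the mechanism, not QEMU's pflash code:

```c
/* Sketch of the sharing mechanism, not QEMU code: map a read-only
 * firmware image MAP_SHARED.  Pages fault in on demand and sit in the
 * page cache, so every process mapping the same file shares one
 * physical copy, and untouched pages of a 64MB image cost no RAM. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_firmware_readonly(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *len = st.st_size;
    void *p = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after close() */
    return p == MAP_FAILED ? NULL : p;
}
```

The read-write varstore half would need MAP_SHARED with PROT_WRITE plus an
msync() policy, which is exactly where the interaction table earlier in the
thread becomes relevant.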

> On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
> files which may be nice for pflash block backends. But I am not familiar with
> it.

Offhand, two QCOW2 features come to mind:

1. Sparse storage even when the storage substrate can't do sparse.  I
consider this one largely irrelevant.  Any decent file system can do
sparse.  Archival and network transfer can compress instead.

2. Snapshots to go with disk snapshots.  *Live* snapshots (the ones that
include memory) already have the flash contents in the memory part, so
this seems only relevant for "dead" snapshots.

Anything else?

> Till now I still have no idea how to get rid of this problem (more properly,
> it's an optimization).
> [1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
> [2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
> [3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
> [4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html
