Re: [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
From: Peter Xu
Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
Date: Tue, 30 Jan 2024 18:32:16 +0800
On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Monday, January 29, 2024 6:43 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > Compression
> >
> > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > Hi,
> >
> > Hi, Yuan,
> >
> > I have a few comments and questions. Many of them may be pure
> > questions as I don't know enough about these new technologies.
> >
> > >
> > > I am writing to submit a code change aimed at enhancing live migration
> > > acceleration by leveraging the compression capability of the Intel
> > > In-Memory Analytics Accelerator (IAA).
> > >
> > > The implementation of the IAA (de)compression code is based on Intel
> > > Query Processing Library (QPL), an open-source software project
> > > designed for IAA high-level software programming.
> > > https://github.com/intel/qpl
> > >
> > > In the last version, there was some discussion about whether to
> > > introduce a new compression algorithm for IAA. Because the IAA
> > > hardware compression algorithm is based on deflate, and QPL already
> > > supports Zlib, in this version I implemented IAA as an accelerator
> > > for the Zlib compression method. However, for several reasons, QPL
> > > is currently not compatible with the existing Zlib method: data
> > > compressed by Zlib cannot be decompressed by QPL, and vice versa.
> > >
> > > I have some concerns about the existing Zlib compression
> > > 1. Would you consider supporting multi-stream compression within one
> > > channel? This may reduce the compression ratio, but it allows the
> > > hardware to process the streams concurrently. We can have each stream
> > > process multiple pages to limit the loss of compression ratio. For
> > > example, 128 pages could be divided into 16 streams for independent
> > > compression. I will provide early performance data in the next
> > > version (v4).
> >
> > I think Juan used to ask a similar question: how much can this help if
> > multifd can already achieve some form of concurrency over the pages?
>
>
> > Couldn't the user specify more multifd channels if they want to grant
> > more cpu resource for comp/decomp purposes?
> >
> > IOW, how many concurrent channels can QPL provide? What is the
> > suggested number of concurrent channels there?
>
> From the QPL software side, there is no limit on the number of concurrent
> compression and decompression tasks.
> From the IAA hardware side, one IAA physical device can process two
> compression tasks or eight decompression tasks concurrently. There are up
> to 8 IAA devices on an Intel SPR server; the number varies according to
> the customer's product selection and deployment.
>
> Regarding the required number of concurrent channels, I think this may
> not be a bottleneck.
> Please allow me to introduce a little more here.
>
> 1. If the compression design is based on the Zlib/Deflate/Gzip streaming
> mode, then we indeed need more channels to maintain concurrency, because
> each multifd packet (which contains 128 independent pages) is compressed
> page by page; those 128 pages are not processed concurrently. The
> concurrency comes only from running multiple multifd channels.
Right. However, since you said there are at most 8 IAA devices, would it
also mean n_multifd_threads=8 can be a good enough scenario to achieve
proper concurrency, no matter the size of the data chunk for one
compression request?
Maybe you meant each device can still process concurrent compression
requests, so the real capability of concurrency can be much larger than 8?
>
> 2. Through testing, we prefer concurrent processing at 4K-page
> granularity rather than multifd-packet granularity, which means the 128
> pages belonging to a packet can be compressed/decompressed concurrently.
> Even one channel can then utilize all the resources of IAA. But this is
> not compatible with the existing zlib method.
> The code is similar to the following
>
>     for (int i = 0; i < num_pages; i++) {
>         job[i]->input_data = pages[i];
>         /* non-blocking submit of the compression/decompression task */
>         submit_job(job[i]);
>     }
>     for (int i = 0; i < num_pages; i++) {
>         /* busy polling; in the future, this part and the data sending
>          * will be turned into a pipeline */
>         wait_job(job[i]);
>     }
Right, if more concurrency is wanted, you can use this async model; I think
Juan used to suggest such and I agree it will also work. It can be done on
top of the basic functionality merged.
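[For illustration, a minimal sketch of this per-page asynchronous model.
The qpl_get_job_size()/qpl_init_job()/qpl_submit_job()/qpl_wait_job()/
qpl_fini_job() calls and the job fields are from the public QPL C API;
the helper's shape and buffer layout are invented for the example, and
error handling is omitted, so treat this as a sketch rather than the
actual patch code:]

    #include <stdlib.h>
    #include "qpl/qpl.h"

    /* Compress num_pages independent 4K pages concurrently: submit one
     * QPL job per page without blocking, then poll for completions. */
    static void compress_pages_async(uint8_t **pages, uint32_t page_size,
                                     uint8_t **out, uint32_t out_size,
                                     uint32_t *out_len, int num_pages)
    {
        uint32_t job_size;
        qpl_job *job[num_pages];

        qpl_get_job_size(qpl_path_hardware, &job_size);

        for (int i = 0; i < num_pages; i++) {
            job[i] = malloc(job_size);
            qpl_init_job(qpl_path_hardware, job[i]);
            job[i]->op            = qpl_op_compress;
            job[i]->level         = qpl_default_level;
            job[i]->next_in_ptr   = pages[i];
            job[i]->available_in  = page_size;
            job[i]->next_out_ptr  = out[i];
            job[i]->available_out = out_size;
            job[i]->flags         = QPL_FLAG_FIRST | QPL_FLAG_LAST |
                                    QPL_FLAG_DYNAMIC_HUFFMAN |
                                    QPL_FLAG_OMIT_VERIFY;
            qpl_submit_job(job[i]);   /* non-blocking submission */
        }
        for (int i = 0; i < num_pages; i++) {
            qpl_wait_job(job[i]);     /* busy-poll until the job is done */
            out_len[i] = job[i]->total_out;
            qpl_fini_job(job[i]);
            free(job[i]);
        }
    }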
>
> 3. Currently, the patches we provide to the community are based on
> streaming compression, in order to be compatible with the current zlib
> method. However, we found that there are still many problems with this,
> so in the next version we plan to provide an independent QPL/IAA
> acceleration function, as described above.
> The compatibility issues include the following
> 1. QPL currently does not support the z_sync_flush operation
> 2. The IAA comp/decomp window is fixed at 4K, while the default zlib
> window size is 32K, and the window size must be the same on both the
> comp and decomp sides.
> 3. At the same time, I researched the QAT compression scheme. QATzip
> currently does not support zlib, nor does it support z_sync_flush; its
> window size is 32K.
>
> In general, I think it is a good suggestion to make the accelerator
> compatible with standard compression algorithms, but also to let the
> accelerator run independently, thus avoiding some of the accelerator's
> compatibility and performance problems. For example, we can add an
> "accel" option to the compression method, and then the user must specify
> the same accelerator via the compression accelerator parameter on both
> the source and destination ends (just like specifying the same
> compression algorithm).
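[A side note on the window-size and raw-deflate points above: with stock
zlib, both are controlled by the windowBits argument of the standard
deflateInit2()/inflateInit2() calls, and the two sides must agree. A
minimal sketch, with error handling omitted:]

    #include <zlib.h>

    static void init_streams(z_stream *c, z_stream *d)
    {
        /* windowBits 12 => 4K history window (2^12), matching IAA's
         * fixed window; zlib's default is 15 (32K). A negative value,
         * e.g. -12, selects raw deflate without the zlib header and
         * trailer. */
        deflateInit2(c, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     12 /* windowBits */, 8 /* memLevel */,
                     Z_DEFAULT_STRATEGY);

        /* The decompressor must use a window at least as large as the
         * compressor's, or inflate() fails mid-stream. */
        inflateInit2(d, 12);
    }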
>
> > >
> > > 2. Would you consider using QPL/IAA as an independent compression
> > > algorithm instead of an accelerator? In this way, we can better
> > > utilize the hardware performance and some features, such as IAA's
> > > canned mode, in which a Huffman table can be dynamically generated
> > > from statistics of the data to improve the compression ratio.
> >
> > Maybe one more knob will work? If it's not compatible with the deflate
> > algo maybe it should never be the default. IOW, the accelerators may be
> > extended into this (based on what you already proposed):
> >
> > - auto ("qpl" first, "none" second; never "qpl-optimized")
> > - none (old zlib)
> > - qpl (qpl compatible)
> > - qpl-optimized (qpl incompatible)
> >
> > Then "auto"/"none"/"qpl" will always be compatible; only the last isn't,
> > and the user can select it explicitly, but only on both sides of QEMU.
> Yes, this is what I want; I need a mode in which QPL is not compatible
> with zlib. From my current point of view, if zlib chooses raw deflate
> mode, then QAT will be compatible with the current community's zlib
> solution. So my suggestion is as follows
>
> Compression method parameter
> - none
> - zlib
> - zstd
> - accel (both QEMU sides need to explicitly select the same accelerator
> via the "Compression accelerator parameter" below)
Can we avoid naming it as "accel"? It's too generic, IMHO.
If it's a special algorithm that only applies to QPL, can we just call it
"qpl" here? Then...
>
> Compression accelerator parameter
> - auto
> - none
> - qpl (qpl will not support zlib/zstd; it will report an error when
> zlib/zstd is selected)
> - qat (it can provide acceleration of zlib/zstd)
Here IMHO we don't need qpl then, because the "qpl" compression method can
enforce a hardware accelerator. In summary, not sure whether this works:
Compression methods: none, zlib, zstd, qpl (describes all the algorithms
that might be used; again, qpl enforces HW support).
Compression accelerators: auto, none, qat (only applies when zlib/zstd is
chosen above).
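[To make the proposed split concrete, a migration setup could then look
like the following, mirroring the migrate_set_parameter commands quoted
later in this mail; the knob names are still tentative and must match on
both the source and destination sides:]

    # qpl as a standalone compression method (enforces HW support):
    migrate_set_parameter multifd-compression qpl

    # or a standard algorithm, optionally offloaded to an accelerator:
    migrate_set_parameter multifd-compression zlib
    migrate_set_parameter multifd-compression-accel qat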
>
> > > Test condition:
> > > 1. Host CPUs are based on Sapphire Rapids, with the frequency locked
> > > to 3.4GHz
> > > 2. VM type: 16 vCPUs and 64G memory
> > > 3. The Idle workload means no workload is running in the VM
> > > 4. The Redis workload means YCSB workloadb + a Redis Server are
> > > running in the VM; about 20G or more memory will be used.
> > > 5. Source side migration configuration commands
> > >    a. migrate_set_capability multifd on
> > >    b. migrate_set_parameter multifd-channels 2/4/8
> > >    c. migrate_set_parameter downtime-limit 300
> > >    d. migrate_set_parameter multifd-compression zlib
> > >    e. migrate_set_parameter multifd-compression-accel none/qpl
> > >    f. migrate_set_parameter max-bandwidth 100G
> > > 6. Destination side migration configuration commands
> > >    a. migrate_set_capability multifd on
> > >    b. migrate_set_parameter multifd-channels 2/4/8
> > >    c. migrate_set_parameter multifd-compression zlib
> > >    d. migrate_set_parameter multifd-compression-accel none/qpl
> > >    e. migrate_set_parameter max-bandwidth 100G
> >
> > How is zlib-level set up? The default (1)?
> Yes, we use the default level 1.
>
> > Btw, it seems both zlib/zstd levels cannot even be configured right
> > now.. probably overlooked in migrate_params_apply().
> Ok, I will check this.
Thanks. If you plan to post a patch, please attach:
Reported-by: Xiaohui Li <xiaohli@redhat.com>
As that was reported by our QE team.
Maybe you can already add a unit test (migration-test.c, under tests/)
which should expose this issue, by setting z*-level to a non-1 value,
querying it back, and asserting that the value did change.
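[A rough sketch of such a test, assuming the helpers that already exist
in tests/qtest/migration-test.c (test_migrate_start(),
migrate_set_parameter_int(), test_migrate_end()); illustrative only:]

    static void test_multifd_compression_level(void)
    {
        MigrateStart args = {};
        QTestState *from, *to;

        if (test_migrate_start(&from, &to, "defer", &args)) {
            return;
        }
        /*
         * migrate_set_parameter_int() reads the parameter back and
         * asserts on the value, so a level silently dropped by
         * migrate_params_apply() would fail right here.
         */
        migrate_set_parameter_int(from, "multifd-zlib-level", 2);
        migrate_set_parameter_int(from, "multifd-zstd-level", 2);

        test_migrate_end(from, to, false);
    }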
>
> > > Early migration results; each result is the average of three tests
> > > +--------+-------------+--------+--------+---------+----------+
> > > |        | The number  | total  |downtime| network | pages    |
> > > |        | of channels |time(ms)| (ms)   |bandwidth| per      |
> > > |        | and mode    |        |        | (mbps)  | second   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > > |        +-------------+--------+--------+---------+----------+
> > > | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > > |workload+-------------+--------+--------+---------+----------+
> > > |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > > +--------+-------------+--------+--------+---------+----------+
> >
> > The numbers are slightly confusing to me. If IAA can send 3x more
> > pages per second, shouldn't the total migration time be 1/3 of the
> > other if the guest is idle? But the total times seem to be pretty
> > close no matter the number of channels. Maybe I missed something?
>
> This data is the information read from "info migrate" after the live
> migration status changes to "completed".
> I think it is the maximum throughput reached while the expected downtime
> and the available network bandwidth are met.
> When the vCPUs are idle, live migration does not run at maximum
> throughput for very long.
>
> > > +--------+-------------+--------+--------+---------+----------+
> > > |        | The number  | total  |downtime| network | pages    |
> > > |        | of channels |time(ms)| (ms)   |bandwidth| per      |
> > > |        | and mode    |        |        | (mbps)  | second   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > > |        +-------------+--------+--------+---------+----------+
> > > | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > > |workload+-------------+--------+--------+---------+----------+
> > > |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > > |        +-------------+--------+--------+---------+----------+
> > > |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > > +--------+-------------+--------+--------+---------+----------+
> >
> > The redis results look much more favorable for IAA compared to the
> > idle tests. Does it mean that IAA works less well with zero pages in
> > general (assuming they'll be the majority in the idle test)?
> Both the Idle and Redis data are not the best performance for IAA, since
> they are based on multifd packet streaming compression.
> In the idle case, most pages are indeed zero pages; compressing zero
> pages is not as good as merely detecting them, so the compression
> advantage is not reflected.
>
> > From the manual, I see that IAA also supports encryption/decryption.
> > Would it be able to accelerate TLS?
> On Sapphire Rapids (SPR)/Emerald Rapids (EMR) Xeon servers, IAA cannot
> support encryption/decryption. This feature may be available in future
> generations.
> For TLS acceleration, QAT supports this function on SPR/EMR and has
> successful cases in some scenarios.
> https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html
>
> > How should one consider IAA over QAT? What is the major difference? I
> > see that IAA requires IOMMU scalable mode; why? Is it because the IAA
> > HW is something attached to the PCIe bus (I assume QAT is the same)?
>
> Regarding the difference between using IAA or QAT for compression
> 1. IAA is more suitable for 4K compression, while QAT is suitable for
> large-block data compression. This is determined by the deflate window
> size. QAT can support more compression levels; IAA hardware supports one
> compression level.
> 2. From the perspective of throughput, one IAA device supports a
> compression throughput of 4GB/s and a decompression throughput of
> 30GB/s. One QAT device supports a compression or decompression
> throughput of 20GB/s.
> 3. Depending on the product type selected by the customer and the
> deployment, the resources available for live migration will also differ.
>
> Regarding the IOMMU scalable mode
> 1. The current IAA software stack requires Shared Virtual Memory (SVM)
> technology, and SVM depends on IOMMU scalable mode.
> 2. Both IAA and QAT support the PCIe PASID capability, so IAA can
> support a shared work queue.
> https://docs.kernel.org/next/x86/sva.html
Thanks for all this information. I'm personally still curious why Intel
provides two new technologies serving similar purposes in roughly the same
time window.
Could you put much of this information into a doc file? It can be
docs/devel/migration/QPL.rst.
Also, we may want a unit test to cover the new stuff once the whole design
settles. It may cover all supported modes, but for sure we can skip the
hw-accelerated use cases.
--
Peter Xu