[PATCH 0/5] *** Implement using Intel QAT to offload ZLIB
From: Bryan Zhang
Subject: [PATCH 0/5] *** Implement using Intel QAT to offload ZLIB
Date: Sun, 31 Dec 2023 20:57:59 +0000
* Overview:
This patchset implements using Intel's QAT accelerator to offload ZLIB
compression and decompression in the multifd live migration path.
* Background:
Intel's 4th generation Xeon processors support Intel's QuickAssist Technology
(QAT), a hardware accelerator for cryptography and compression operations.
Intel has also released a software library, QATzip, that interacts with QAT and
exposes an API for QAT-accelerated ZLIB compression and decompression.
This patchset introduces a new multifd compression method, `qatzip`, which uses
QATzip to perform ZLIB compression and decompression.
* Implementation:
The bulk of this patchset is in `migration/multifd-qatzip.c`, which mirrors the
other compression implementation files, `migration/multifd-zlib.c` and
`migration/multifd-zstd.c`, by providing an implementation of the multifd
send/recv methods using the API exposed by QATzip. This is fairly
straightforward, as the multifd setup/prepare/teardown methods align closely
with QATzip's methods for initialization/(de)compression/teardown.
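For illustration, the shape of such a callback table can be sketched as below. This is a simplified stand-in, not QEMU's actual `MultiFDMethods` definition, and the QATzip session calls (`qzInit()`/`qzSetupSession()`/`qzCompress()`/`qzTeardownSession()`) are only indicated in comments:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for QEMU's MultiFDSendParams. */
typedef struct {
    void *compress_ctx;     /* stands in for a QATzip session handle */
    int pages_prepared;
} SendParams;

/* Shape loosely mirrors QEMU's MultiFDMethods callback table. */
typedef struct {
    int (*send_setup)(SendParams *p);
    int (*send_prepare)(SendParams *p);
    void (*send_cleanup)(SendParams *p);
} MultiFDMethodsSketch;

static int ctx_token;

static int qatzip_send_setup(SendParams *p)
{
    /* real code: qzInit() + qzSetupSession(), allocate buffers */
    p->compress_ctx = &ctx_token;
    return 0;
}

static int qatzip_send_prepare(SendParams *p)
{
    /* real code: gather the page batch and call qzCompress() once */
    p->pages_prepared++;
    return 0;
}

static void qatzip_send_cleanup(SendParams *p)
{
    /* real code: qzTeardownSession() + qzClose(), free buffers */
    p->compress_ctx = NULL;
}

static const MultiFDMethodsSketch qatzip_ops = {
    .send_setup   = qatzip_send_setup,
    .send_prepare = qatzip_send_prepare,
    .send_cleanup = qatzip_send_cleanup,
};
```

The receive side follows the same pattern with setup/recv/cleanup callbacks wrapping `qzDecompress()`.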
The only major divergence from the other compression methods is that we use a
non-streaming compression/decompression API, rather than streaming each page to
the compression layer one at a time. This does not require any major code
changes: by the time we call into the compression layer, we already have a
batch of pages, so it is easy to copy them into a contiguous buffer. This
decision is purely performance-based, as our initial QAT benchmark testing
showed that QATzip's non-streaming API outperformed the streaming API.
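Conceptually, the gather step ahead of the single non-streaming call looks like the following sketch (function name and the fixed 4 KiB page size are illustrative, not the patch's actual code):

```c
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* illustrative; real code uses the configured page size */

/* Copy a batch of (possibly scattered) guest pages into one contiguous
 * buffer, so that a single non-streaming call (qzCompress() in the real
 * patch) can compress the whole batch at once instead of streaming the
 * pages individually. Returns the number of bytes gathered. */
static size_t gather_pages(unsigned char *dst, unsigned char *const pages[],
                           size_t num_pages)
{
    for (size_t i = 0; i < num_pages; i++) {
        memcpy(dst + i * PAGE_SIZE, pages[i], PAGE_SIZE);
    }
    return num_pages * PAGE_SIZE;
}
```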
* Performance:
** Setup:
We use two Intel 4th generation Xeon servers for testing.
Architecture: x86_64
CPU(s): 192
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8457C
Stepping: 8
CPU MHz: 2538.624
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
Each server has two QAT devices, and the network bandwidth between the two
servers is 1Gbps.
We perform multifd live migration over TCP using a VM with 64GB memory. We
prepared the machine's memory by powering it on, allocating a large amount of
memory (63GB) as a single buffer, and filling the buffer with the repeated
contents of the Silesia corpus[0]. This is in lieu of a more realistic memory
snapshot, which proved troublesome to acquire.
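As a reference point, selecting this method from the HMP monitor would look roughly like the following (a sketch based on the existing multifd capability and `multifd-compression` parameter; the `qatzip` value is what this patchset adds):

```
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 4
(qemu) migrate_set_parameter multifd-compression qatzip
(qemu) migrate -d tcp:<dest-ip>:<port>
```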
We measured CPU usage by sampling the output of `top` once per second during
live migration and averaging the samples. This is admittedly imprecise, but we
believe it fairly reflects the relative CPU cost of the different compression
methods.
We present the latency, throughput, and CPU usage results for all of the
compression methods, with varying numbers of multifd threads (4, 8, and 16).
[0] The Silesia corpus can be accessed here:
https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
** Results:
4 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |111.256        |916.03          | 29.08%  | 51.90%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |193.033        |562.16          |297.36%  |237.84%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |112.449        |920.67          |234.39%  |157.57%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.014        |933.41          |  9.50%  | 25.28%  |
|---------------|---------------|----------------|---------|---------|
8 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |111.349        |915.20          | 29.13%  | 59.63%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |149.378        |726.64          |516.24%  |400.46%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |111.942        |925.85          |345.75%  |170.74%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.417        |933.34          |  8.38%  | 27.72%  |
|---------------|---------------|----------------|---------|---------|
16 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |112.035        |908.96          | 29.93%  | 63.83%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |118.730        |912.94          |914.14%  |621.59%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |112.167        |924.78          |384.81%  |171.54%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.728        |932.08          |  9.31%  | 29.89%  |
|---------------|---------------|----------------|---------|---------|
** Observations:
Latency: In our test setting, live migration is mostly network-constrained, so
compression performs relatively well in general. `qatzip` particularly shows a
significant improvement over `zlib` with limited threads. With 4 multifd
threads, `qatzip` shows a ~42% decrease in latency over `zlib`. In all
scenarios, `qatzip` shows comparable performance with `zstd`.
Throughput: In all scenarios, every method except `zlib` saturates most of the
1 Gbps link. `zlib` appears to be CPU-bound at 4 and 8 threads, but reaches
throughput comparable to the other methods at 16 threads.
CPU usage: In all scenarios, `qatzip` consumes a fraction of the CPU usage that
`zlib` and `zstd` use. In the most limited case, with 4 multifd threads,
`qatzip`'s sender CPU usage is ~10% that of `zlib`, and ~12% that of `zstd`,
and its receiver CPU usage is ~22% that of `zlib`, and ~33% that of `zstd`. The
magnitude of these savings increases as we increase to 8 and 16 threads.
* Future work:
- Comparing QAT offloading against other compression methods in environments
that are not as network-constrained.
- Combining compression offloading with offloading using other Intel
accelerators (e.g. using Intel's Data Streaming Accelerator to offload zero
page checking, which is part of another related patchset currently under
discussion, and to offload `memcpy()` operations on the receiver side).
- Reworking multifd logic to pipeline live migration work to improve device
saturation.
* Testing:
This patchset adds an integration test for the new `qatzip` multifd compression
method.
* Patchset:
This patchset was generated on top of commit 7425b627.
Bryan Zhang (5):
meson: Introduce 'qatzip' feature to the build system.
migration: Add compression level parameter for QATzip
migration: Introduce unimplemented 'qatzip' compression method
migration: Implement 'qatzip' methods using QAT
migration: Add integration test for 'qatzip' compression method
hw/core/qdev-properties-system.c | 6 +-
meson.build | 10 +
meson_options.txt | 2 +
migration/meson.build | 1 +
migration/migration-hmp-cmds.c | 4 +
migration/multifd-qatzip.c | 369 +++++++++++++++++++++++++++++++
migration/multifd.h | 1 +
migration/options.c | 27 +++
migration/options.h | 1 +
qapi/migration.json | 24 +-
scripts/meson-buildoptions.sh | 3 +
tests/qtest/meson.build | 4 +
tests/qtest/migration-test.c | 37 ++++
13 files changed, 486 insertions(+), 3 deletions(-)
create mode 100644 migration/multifd-qatzip.c
--
2.30.2