[PATCH 0/5] *** Implement using Intel QAT to offload ZLIB
From: Bryan Zhang
Subject: [PATCH 0/5] *** Implement using Intel QAT to offload ZLIB
Date: Sun, 31 Dec 2023 20:57:59 +0000
* Overview:
This patchset implements using Intel's QAT accelerator to offload ZLIB
compression and decompression in the multifd live migration path.
* Background:
Intel's 4th generation Xeon processors support Intel's QuickAssist Technology
(QAT), a hardware accelerator for cryptography and compression operations.
Intel has also released a software library, QATzip, that interacts with QAT and
exposes an API for QAT-accelerated ZLIB compression and decompression.
This patchset introduces a new multifd compression method, `qatzip`, which uses
QATzip to perform ZLIB compression and decompression.
* Implementation:
The bulk of this patchset is in `migration/multifd-qatzip.c`, which mirrors the
other compression implementation files, `migration/multifd-zlib.c` and
`migration/multifd-zstd.c`, by providing an implementation of the multifd
send/recv methods using the API exposed by QATzip. This is fairly
straightforward, as the multifd setup/prepare/teardown methods align closely
with QATzip's methods for initialization/(de)compression/teardown.
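For illustration, the shape of such a callback table can be sketched as below. This is a simplified stand-in, not QEMU's actual `MultiFDMethods` definition, and the QATzip session calls (`qzInit()`/`qzSetupSession()`/`qzCompress()`/`qzTeardownSession()`) are only indicated in comments:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for QEMU's MultiFDSendParams. */
typedef struct {
    void *compress_ctx;     /* stands in for a QATzip session handle */
    int pages_prepared;
} SendParams;

/* Shape loosely mirrors QEMU's MultiFDMethods callback table. */
typedef struct {
    int (*send_setup)(SendParams *p);
    int (*send_prepare)(SendParams *p);
    void (*send_cleanup)(SendParams *p);
} MultiFDMethodsSketch;

static int ctx_token;

static int qatzip_send_setup(SendParams *p)
{
    /* real code: qzInit() + qzSetupSession(), allocate buffers */
    p->compress_ctx = &ctx_token;
    return 0;
}

static int qatzip_send_prepare(SendParams *p)
{
    /* real code: gather the page batch and call qzCompress() once */
    p->pages_prepared++;
    return 0;
}

static void qatzip_send_cleanup(SendParams *p)
{
    /* real code: qzTeardownSession() + qzClose(), free buffers */
    p->compress_ctx = NULL;
}

static const MultiFDMethodsSketch qatzip_ops = {
    .send_setup   = qatzip_send_setup,
    .send_prepare = qatzip_send_prepare,
    .send_cleanup = qatzip_send_cleanup,
};
```

The receive side follows the same pattern with setup/recv/cleanup callbacks wrapping `qzDecompress()`.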
The only major divergence from the other compression methods is that we use a
non-streaming compression/decompression API, rather than streaming each page to
the compression layer one at a time. This does not require any major code
changes: by the time we call into the compression layer, we already have a
batch of pages, so it is easy to copy them into a contiguous buffer. This
decision is purely performance-based, as our initial QAT benchmark testing
showed that QATzip's non-streaming API outperformed the streaming API.
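Conceptually, the gather step ahead of the single non-streaming call looks like the following sketch (function name and the fixed 4 KiB page size are illustrative, not the patch's actual code):

```c
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* illustrative; real code uses the configured page size */

/* Copy a batch of (possibly scattered) guest pages into one contiguous
 * buffer, so that a single non-streaming call (qzCompress() in the real
 * patch) can compress the whole batch at once instead of streaming the
 * pages individually. Returns the number of bytes gathered. */
static size_t gather_pages(unsigned char *dst, unsigned char *const pages[],
                           size_t num_pages)
{
    for (size_t i = 0; i < num_pages; i++) {
        memcpy(dst + i * PAGE_SIZE, pages[i], PAGE_SIZE);
    }
    return num_pages * PAGE_SIZE;
}
```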
* Performance:
** Setup:
We use two Intel 4th generation Xeon servers for testing.
Architecture: x86_64
CPU(s): 192
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8457C
Stepping: 8
CPU MHz: 2538.624
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
Each server has two QAT devices, and the network bandwidth between the two
servers is 1Gbps.
We perform multifd live migration over TCP using a VM with 64GB memory. We
prepared the machine's memory by powering it on, allocating a large amount of
memory (63GB) as a single buffer, and filling the buffer with the repeated
contents of the Silesia corpus[0]. This is in lieu of a more realistic memory
snapshot, which proved troublesome to acquire.
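As a reference point, selecting this method from the HMP monitor would look roughly like the following (a sketch based on the existing multifd capability and `multifd-compression` parameter; the `qatzip` value is what this patchset adds):

```
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 4
(qemu) migrate_set_parameter multifd-compression qatzip
(qemu) migrate -d tcp:<dest-ip>:<port>
```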
We measured CPU usage by sampling the output of `top` once per second during
live migration and averaging the samples. This is admittedly imprecise, but we
believe it fairly reflects the relative CPU cost of the different compression
methods.
We present the latency, throughput, and CPU usage results for all of the
compression methods, with varying numbers of multifd threads (4, 8, and 16).
[0] The Silesia corpus can be accessed here:
https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
** Results:
4 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |111.256        |916.03          | 29.08%  | 51.90%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |193.033        |562.16          |297.36%  |237.84%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |112.449        |920.67          |234.39%  |157.57%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.014        |933.41          |  9.50%  | 25.28%  |
|---------------|---------------|----------------|---------|---------|
8 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |111.349        |915.20          | 29.13%  | 59.63%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |149.378        |726.64          |516.24%  |400.46%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |111.942        |925.85          |345.75%  |170.74%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.417        |933.34          |  8.38%  | 27.72%  |
|---------------|---------------|----------------|---------|---------|
16 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |112.035        |908.96          | 29.93%  | 63.83%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |118.730        |912.94          |914.14%  |621.59%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |112.167        |924.78          |384.81%  |171.54%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.728        |932.08          |  9.31%  | 29.89%  |
|---------------|---------------|----------------|---------|---------|
** Observations:
Latency: In our test setting, live migration is mostly network-constrained, so
compression performs relatively well in general. `qatzip` particularly shows a
significant improvement over `zlib` with limited threads. With 4 multifd
threads, `qatzip` shows a ~42% decrease in latency over `zlib`. In all
scenarios, `qatzip` shows comparable performance with `zstd`.
Throughput: In all scenarios, every method except `zlib` saturates most of the
1 Gbps link. `zlib` appears to be CPU-bound at 4 and 8 threads, but reaches
throughput comparable to the other methods at 16 threads.
CPU usage: In all scenarios, `qatzip` consumes a fraction of the CPU usage that
`zlib` and `zstd` use. In the most limited case, with 4 multifd threads,
`qatzip`'s sender CPU usage is ~10% that of `zlib`, and ~12% that of `zstd`,
and its receiver CPU usage is ~22% that of `zlib`, and ~33% that of `zstd`. The
magnitude of these savings increases as we increase to 8 and 16 threads.
* Future work:
- Comparing QAT offloading against other compression methods in environments
that are not as network-constrained.
- Combining compression offloading with offloading using other Intel
accelerators (e.g. using Intel's Data Streaming Accelerator to offload zero
page checking, which is part of another related patchset currently under
discussion, and to offload `memcpy()` operations on the receiver side).
- Reworking multifd logic to pipeline live migration work to improve device
saturation.
* Testing:
This patchset adds an integration test for the new `qatzip` multifd compression
method.
* Patchset:
This patchset was generated on top of commit 7425b627.
Bryan Zhang (5):
meson: Introduce 'qatzip' feature to the build system.
migration: Add compression level parameter for QATzip
migration: Introduce unimplemented 'qatzip' compression method
migration: Implement 'qatzip' methods using QAT
migration: Add integration test for 'qatzip' compression method
hw/core/qdev-properties-system.c | 6 +-
meson.build | 10 +
meson_options.txt | 2 +
migration/meson.build | 1 +
migration/migration-hmp-cmds.c | 4 +
migration/multifd-qatzip.c | 369 +++++++++++++++++++++++++++++++
migration/multifd.h | 1 +
migration/options.c | 27 +++
migration/options.h | 1 +
qapi/migration.json | 24 +-
scripts/meson-buildoptions.sh | 3 +
tests/qtest/meson.build | 4 +
tests/qtest/migration-test.c | 37 ++++
13 files changed, 486 insertions(+), 3 deletions(-)
create mode 100644 migration/multifd-qatzip.c
--
2.30.2