qemu-devel

Re: [Qemu-devel] [PATCH v2] XBRLE page delta compression for live migrat


From: Shribman, Aidan
Subject: Re: [Qemu-devel] [PATCH v2] XBRLE page delta compression for live migration of large memory apps
Date: Tue, 2 Aug 2011 15:45:51 +0200

> From: Stefan Hajnoczi [mailto:address@hidden 
> Sent: Thursday, July 07, 2011 11:23 AM
> To: Shribman, Aidan
> Cc: address@hidden; Anthony Liguori
> Subject: Re: [PATCH v2] XBRLE page delta compression for live migration of large memory apps
>  
> Any thoughts on reducing the overhead and making xbrle on by default?

XBRLE was replaced by XBZRLE, which now runs word-wise and only attempts RLE on
zero sequences. Compared to word-wise XBRLE, it gives a more compact encoding,
roughly 30% smaller. Compared to XOR+LZO or XOR+Snappy, the encoded size is
roughly 30% larger, but XBZRLE is 2x-5x faster, making it well suited to the
fast in-line encoding that live migration requires. XBZRLE demonstrated
sustained encoding speeds of about 1.6-2.2 GB/s on a single core on a 64-bit
Linux 2.6.35 kernel.
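
To illustrate the idea of XOR followed by zero-run encoding, here is a minimal
sketch. It is not the wire format used by the patch: the record layout (one
header word holding a zero-run length and a literal count, followed by the
literal XOR words), the 4 KiB page of 32-bit words, and the names
sketch_xor_zrle_encode() and sketch_xor_zrle_decode() are all assumptions made
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024 /* assumption: 4 KiB page viewed as 32-bit words */

/*
 * Hypothetical record layout, for illustration only:
 *   header word = (zero_run_words << 16) | literal_words
 *   followed by literal_words non-zero XOR words.
 * Returns the number of words written, or -1 if out is too small.
 */
static int sketch_xor_zrle_encode(const uint32_t *old_page,
                                  const uint32_t *new_page,
                                  uint32_t *out, size_t out_words)
{
    size_t i = 0, o = 0;

    assert(old_page && new_page && out);
    while (i < PAGE_WORDS) {
        size_t zrun = 0, nlit = 0, hdr;

        /* Count unchanged words: their XOR is zero, so only the run length is kept. */
        while (i < PAGE_WORDS && (old_page[i] ^ new_page[i]) == 0) {
            zrun++;
            i++;
        }
        if (o >= out_words) {
            return -1;
        }
        hdr = o++;
        /* Changed words are emitted verbatim as XOR deltas (no RLE on them). */
        while (i < PAGE_WORDS && (old_page[i] ^ new_page[i]) != 0) {
            if (o >= out_words) {
                return -1;
            }
            out[o++] = old_page[i] ^ new_page[i];
            nlit++;
            i++;
        }
        out[hdr] = (uint32_t)((zrun << 16) | nlit);
    }
    return (int)o;
}

/* Applies the delta in place to a copy of the old page; 0 on success, -1 on bad input. */
static int sketch_xor_zrle_decode(const uint32_t *in, size_t in_words,
                                  uint32_t *page)
{
    size_t i = 0, p = 0;

    while (p < in_words) {
        size_t zrun = in[p] >> 16, nlit = in[p] & 0xffff;

        p++;
        if (zrun + nlit > PAGE_WORDS - i || nlit > in_words - p) {
            return -1;
        }
        i += zrun;              /* skip unchanged words */
        while (nlit--) {
            page[i++] ^= in[p++];
        }
    }
    return i == PAGE_WORDS ? 0 : -1;
}
```

With this layout an unchanged page encodes to a single header word, and the
receiver needs its own copy of the previous page to apply the delta.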

For now I would not switch XBZRLE on by default, as it affects the network
serialization format and would make the patched QEMU unable to interoperate,
by default, with older QEMU versions.

Full benchmark results are given below for several scenarios. Each scenario is
defined by the step in bytes between consecutive changed memory areas (e.g.
1111 for the SPARSE scenario) and the length in bytes of each changed area
(e.g. 12 for the SPARSE scenario).
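
As a rough sketch of how such a scenario could be produced (the benchmark
harness itself is not shown in this mail), the helper below modifies a buffer
at a fixed step and length. The name sketch_apply_diff(), the choice to measure
the step from the start of one changed area to the start of the next, and the
bit-flip used as the "change" are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: change `len` bytes every `step` bytes in buf.
 * Assumes step is the distance between the starts of changed areas.
 * Returns the number of bytes changed.
 */
static size_t sketch_apply_diff(uint8_t *buf, size_t size,
                                size_t step, size_t len)
{
    size_t changed = 0;

    assert(buf && step > 0 && len <= step);
    for (size_t off = 0; off < size; off += step) {
        /* The last area is truncated at the end of the buffer. */
        for (size_t i = 0; i < len && off + i < size; i++) {
            buf[off + i] ^= 0xff; /* flipping all bits guarantees a change */
            changed++;
        }
    }
    return changed;
}
```

For example, calling it on a 4096-byte buffer with step 1111 and len 12 (the
SPARSE parameters) changes four areas of 12 bytes each.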

==========================================================
Scenario SPARSE with diff segment of step 1111 len 12
==========================================================
xblzo: ENC{2.06s  997 MB/s 2.82%} DEC{1.40s 1462 MB/s 100.00%} .. ok
xbsnappy: ENC{1.82s 1122 MB/s 6.14%} DEC{1.67s 1225 MB/s 100.00%} .. ok
xbrle: ENC{9.28s  221 MB/s 3.08%} DEC{3.25s  630 MB/s 100.00%} .. ok
xbzrle: ENC{0.96s 2142 MB/s 3.55%} DEC{0.73s 2817 MB/s 100.00%} .. ok

==========================================================
Scenario MEDIUM with diff segment of step 701 len 33
==========================================================
xblzo: ENC{2.50s  820 MB/s 6.34%} DEC{1.37s 1492 MB/s 100.00%} .. ok
xbsnappy: ENC{2.25s  912 MB/s 9.27%} DEC{1.72s 1189 MB/s 100.00%} .. ok
xbrle: ENC{9.35s  219 MB/s 10.31%} DEC{3.36s  610 MB/s 100.00%} .. ok
xbzrle: ENC{1.03s 1994 MB/s 8.37%} DEC{0.73s 2809 MB/s 100.00%} .. ok

==========================================================
Scenario DENSE with diff segment of step 203 len 41
==========================================================
xblzo: ENC{4.08s  502 MB/s 21.37%} DEC{1.83s 1116 MB/s 100.00%} .. ok
xbsnappy: ENC{4.80s  426 MB/s 22.80%} DEC{2.15s  953 MB/s 100.00%} .. ok
xbrle: ENC{9.65s  212 MB/s 41.44%} DEC{3.70s  553 MB/s 100.00%} .. ok
xbzrle: ENC{1.23s 1666 MB/s 31.92%} DEC{0.84s 2441 MB/s 100.00%} .. ok

==========================================================
Scenario VERY-DENSE with diff segment of step 121 len 43
==========================================================
xblzo: ENC{5.59s  366 MB/s 32.29%} DEC{2.36s  866 MB/s 100.00%} .. ok
xbsnappy: ENC{6.74s  304 MB/s 33.46%} DEC{2.69s  762 MB/s 100.00%} .. ok
xbrle: ENC{9.84s  208 MB/s 72.78%} DEC{4.22s  486 MB/s 100.00%} .. ok
xbzrle: ENC{1.18s 1730 MB/s 54.92%} DEC{0.94s 2167 MB/s 100.00%} .. ok
> 
> > Work is based on research results published VEE 2011: Evaluation of Delta
> > Compression Techniques for Efficient Live Migration of Large Virtual Machines
> > by Benoit, Svard, Tordsson and Elmroth.
> 
> I will read your paper.  Did you try unconditionally applying a cheap
> compression algorithm like the one Google recently published?  That way you
> just compress everything and don't need to keep the cache around:
> 
> http://code.google.com/p/snappy/
> http://www.hypertable.org/doxygen/bmz_8h.html
> 

As Google Snappy's performance is 0.3-1.1 GB/s per core on a 64-bit machine, it
is much less suitable than XBZRLE delta encoding. In cases of limited
bandwidth it would be beneficial to use Snappy to compress full page content,
and this could be considered in the future.

> > +static int save_xbrle_page(QEMUFile *f, uint8_t *current_data,
> > +        ram_addr_t current_addr, RAMBlock *block, ram_addr_t offset, int cont)
> > +{
> > +    int cache_location = -1, slot = -1, encoded_len = 0, bytes_sent = 0;
> > +    XBRLEHeader hdr = {0};
> > +    CacheItem *it;
> > +    uint8_t *xor_buf = NULL, *xbrle_buf = NULL;
> > +
> > +    /* get location */
> > +    slot = cache_is_cached(current_addr);
> > +    if (slot == -1) {
> > +        acct_info.xbrle_cache_miss++;
> > +        goto done;
> > +    }
> > +    cache_location = cache_get_cache_pos(current_addr);
> > +
> > +    /* abort if page changed too much */
> > +    it = cache_item_get(cache_location, slot);
> > +
> > +    /* XOR encoding */
> > +    xor_buf = (uint8_t *) qemu_mallocz(TARGET_PAGE_SIZE);
> 
> Zeroing unnecessary here.

replaced qemu_mallocz() with qemu_malloc()

> 
> > +    xor_encode(xor_buf, it->it_data, current_data);
> > +
> > +    /* XBRLE (XOR+RLE) encoding (if we can ensure a 1/3 ratio) */
> > +    xbrle_buf = (uint8_t *) qemu_mallocz(TARGET_PAGE_SIZE);
> 
> Why TARGET_PAGE_SIZE when the actual size is TARGET_PAGE_SIZE/3?
> 
> Zeroing unnecessary here.

replaced qemu_mallocz() with qemu_malloc()

> 
> > +    encoded_len = rle_encode(xor_buf, TARGET_PAGE_SIZE, xbrle_buf,
> > +            TARGET_PAGE_SIZE/3);
> > +
> > +    if (encoded_len < 0) {
> > +        DPRINTF("XBRLE encoding oeverflow - sending uncompressed\n");
> 
> s/oeverflow/overflow/

corrected

> 
> > +        acct_info.xbrle_overflow++;
> > +        goto done;
> > +    }
> > +
> > +    hdr.xh_len = encoded_len;
> > +    hdr.xh_flags |= ENCODING_FLAG_XBRLE;
> > +
> > +    /* Send XBRLE compressed page */
> > +    save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_XBRLE);
> > +    qemu_put_buffer(f, (uint8_t *) &hdr, sizeof(hdr));
> > +    qemu_put_buffer(f, xbrle_buf, encoded_len);
> > +    acct_info.xbrle_pages++;
> > +    bytes_sent = encoded_len + sizeof(hdr);
> > +    acct_info.xbrle_bytes += bytes_sent;
> > +
> > +done:
> > +    qemu_free(xor_buf);
> > +    qemu_free(xbrle_buf);
> > +    return bytes_sent;
> > +}
> > 
> >  static int is_dup_page(uint8_t *page, uint8_t ch)
> >  {
> > @@ -107,7 +486,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
> >  static RAMBlock *last_block;
> >  static ram_addr_t last_offset;
> > 
> > -static int ram_save_block(QEMUFile *f)
> > +static int ram_save_block(QEMUFile *f, int stage)
> >  {
> >      RAMBlock *block = last_block;
> >      ram_addr_t offset = last_offset;
> > @@ -128,28 +507,27 @@ static int ram_save_block(QEMUFile *f)
> >                                              current_addr + TARGET_PAGE_SIZE,
> >                                              MIGRATION_DIRTY_FLAG);
> > 
> > -            p = block->host + offset;
> > +            p = qemu_mallocz(TARGET_PAGE_SIZE);
> 
> Where is p freed when use_xbrle is off?

corrected in PATCH v3

> 
> You should not introduce overhead in the case where use_xbrle is off.  Please
> make sure the malloc/memcpy only happens if the page is added to the cache.
> 
> > +static int load_xbrle(QEMUFile *f, ram_addr_t addr, void *host)
> > +{
> > +    int ret, rc = -1;
> > +    uint8_t *prev_page, *xor_buf, *xbrle_buf;
> > +    XBRLEHeader hdr = {0};
> > +
> > +    /* extract RLE header */
> > +    qemu_get_buffer(f, (uint8_t *) &hdr, sizeof(hdr));
> > +    if (!(hdr.xh_flags & ENCODING_FLAG_XBRLE)) {
> > +        fprintf(stderr, "Failed to load XBRLE page - wrong compression!\n");
> > +        goto done;
> > +    }
> > +
> > +    if (hdr.xh_len > TARGET_PAGE_SIZE) {
> > +        fprintf(stderr, "Failed to load XBRLE page - len overflow!\n");
> > +        goto done;
> > +    }
> > +
> > +    /* load data and decode */
> > +    xbrle_buf = (uint8_t *) qemu_mallocz(TARGET_PAGE_SIZE);
> > +    qemu_get_buffer(f, xbrle_buf, hdr.xh_len);
> 
> Why allocate TARGET_PAGE_SIZE instead of hdr.xh_len and why zero it when
> qemu_get_buffer() will overwrite it?

corrected in PATCH v3

> 
> > +
> > +    /* decode RLE */
> > +    xor_buf = (uint8_t *) qemu_mallocz(TARGET_PAGE_SIZE);
> 
> Again there is no need to zero the buffer.
> 

corrected in PATCH v3

Aidan

