
From: zhanghailiang
Subject: Re: [Qemu-devel] [RFC PATCH v4 00/28] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service
Date: Fri, 24 Apr 2015 16:52:10 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 2015/4/22 19:18, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
Hi,

ping ...

I will get to look at this again; but not until after next week.


OK, thanks for your reply. :)

The main blocking bugs for COLO have been solved,

I've got the v3 set running, but the biggest problems I hit are
with the packet comparison module; I've seen a panic which I think is

What's the panic log?

in colo_send_checkpoint_req that I think is due to the use of
GFP_KERNEL to allocate the netlink message and I think it can schedule
there.  I tried making that a GFP_ATOMIC, but I'm hitting other
problems with:

kcolo_thread, no conn, schedule out


Er, it is OK to get these messages if you have enabled debugging.
If there is no network connection to the VM, or a checkpoint request is in
progress, there is no need to compare any network packets, so we just
schedule out the kcolo_thread.
Is it just these messages being printed, or are there other problems?
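
The condition described above can be sketched in plain C. This is only an
illustration of the decision, not code from the colo-proxy module: the struct,
its fields, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state; names are illustrative, not taken from colo-proxy. */
struct compare_state {
    unsigned int active_conns;    /* connections currently tracked for the VM */
    bool checkpoint_in_progress;  /* a checkpoint request is being handled */
};

/*
 * Returns true when the comparison thread has nothing to compare and
 * should yield: either no connection exists, or a checkpoint is under way.
 */
static bool should_schedule_out(const struct compare_state *s)
{
    assert(s != NULL);
    return s->active_conns == 0 || s->checkpoint_in_progress;
}
```

Under this reading, the "no conn, schedule out" debug message corresponds to
the first branch and is expected whenever the guest has no open connections.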

that I've not had time to look into yet.

So I only get about a 50% success rate of starting COLO.

This is really strange. Yes, we sometimes come across problems like kernel
panics in our tests, but not so often.
Can you describe the problems in detail?

I see there is stuff in the TODO of the colo-proxy that
seems to say the netlink stuff should change, maybe you're already fixing
that?


Yes, we are trying to replace the current netlink in COLO with the nfnetlink
interface.
We hope to merge the code in the next version.

We have also finished some new features and optimizations for COLO. (If you
are interested in these, we can send them to you in private ;))

For ease of review, it is better to keep it simple for now, so we will not add
too much new code to this framework patch set before it has been fully
reviewed.

I'd like to see those; but I don't want to take code privately.
It's OK to post extra stuff as a separate set.


Hmm, that is really a good idea; maybe we should also add a branch
with all the optimizations and new features on GitHub.

COLO is a totally new feature which is still at an early stage. We hope to
speed up development, so your comments and feedback are warmly welcomed. :)

Yes, it's getting there though; I don't think anyone else has
got this close to getting a full FT set working with disk and networking.


Thanks,
zhanghailiang


On 2015/3/26 13:29, zhanghailiang wrote:
This is the 4th version of COLO. It contains only the COLO framework part,
which includes VM checkpoint, failover, the proxy API, and the block
replication API; it does not include block replication itself.
The block part has been sent by wencongyang:
[RFC PATCH COLO v2 00/13] Block replication for continuous checkpoints

Compared with the last version, there are not many optimizations or new
functions. The main reason is that there is a known issue that is still
unsolved: we found some dirty pages whose bits were never set in the
corresponding bitmap, and this triggers strange problems in the VM.
We hope to resolve it before adding more code.

You can get the newest integrated qemu colo patches from github:
https://github.com/coloft/qemu/commits/colo-v1.1

For how to test COLO, please refer to the following link:
http://wiki.qemu.org/Features/COLO

Please review and test.

Known issue still unsolved:
(1) Some pages are dirtied without the corresponding dirty-bitmap bits being set.
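
To illustrate why a missed bit matters, here is a toy dirty-bitmap sketch in
plain C. It is not QEMU's implementation; the page count, struct, and function
names are assumptions made for the example. Only pages whose bit is set are
collected for the next checkpoint, so a write that bypasses mark_dirty() leaves
the secondary with stale contents for that page.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES        64   /* toy guest size, in pages (assumption) */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS        ((NPAGES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical structure: one bit per guest page. */
struct toy_ram {
    unsigned long dirty[NWORDS];
};

/* Every write path must call this, or the page is never transferred. */
static void mark_dirty(struct toy_ram *r, size_t page)
{
    assert(page < NPAGES);
    r->dirty[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

/* Reports whether the page was dirty and clears its bit. */
static bool test_and_clear_dirty(struct toy_ram *r, size_t page)
{
    unsigned long mask = 1UL << (page % BITS_PER_WORD);
    unsigned long *word = &r->dirty[page / BITS_PER_WORD];
    bool was_dirty = (*word & mask) != 0;

    *word &= ~mask;
    return was_dirty;
}

/* Gathers the pages to send at a checkpoint; returns how many were found. */
static size_t collect_dirty(struct toy_ram *r, size_t *out, size_t max)
{
    size_t n = 0;

    for (size_t page = 0; page < NPAGES && n < max; page++) {
        if (test_and_clear_dirty(r, page)) {
            out[n++] = page;
        }
    }
    return n;
}
```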

Previously posted RFC patch series:
http://lists.nongnu.org/archive/html/qemu-devel/2014-06/msg05567.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg04459.html
https://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04771.html

TODO list:
1 Optimize the checkpoint process to shorten the time it takes:
   (Partly done; the patch is not included in this series)
    1) Separate the RAM and device save/load processes to reduce the amount
       of extra memory used during a checkpoint
    2) Live migrate part of the dirty pages to the slave during sleep time.
2 Add more debug/stat info
   (Partly done; the patch is not included in this series)
   including checkpoint count, proxy discompare count, downtime,
   number of live migrated pages, total sent pages, etc.
3 Strengthen failover
4 Optimize the proxy part, including the proxy script.
5 The capability of continuous FT

v4:
- New block replication scheme (use image-fleecing for the secondary side)
- Address some comments from Eric Blake and Dave
- Add command colo-set-checkpoint-period to set the period of the periodic
   checkpoint
- Add a delay (100ms) between consecutive checkpoint requests to ensure the VM
   runs for at least 100ms since the last pause.
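
The 100ms minimum run time in the last item can be expressed as a small helper.
This is a minimal sketch under assumed names (MIN_RUN_MS, checkpoint_wait_ms,
and the millisecond timestamps are hypothetical), not the code from the series.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_RUN_MS 100  /* minimum guest run time between checkpoints */

/*
 * Given the current time and the time the VM last resumed after a
 * checkpoint pause, return how long a new checkpoint request should
 * still be delayed. Returns 0 once the VM has run for MIN_RUN_MS.
 */
static int64_t checkpoint_wait_ms(int64_t now_ms, int64_t last_resume_ms)
{
    int64_t ran_ms = now_ms - last_resume_ms;

    assert(ran_ms >= 0);
    return ran_ms >= MIN_RUN_MS ? 0 : MIN_RUN_MS - ran_ms;
}
```

For example, a request arriving 30ms after the VM resumed would be held for a
further 70ms, while one arriving after 100ms or more proceeds immediately.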

v3:
- use proxy instead of colo agent to compare network packets
- add block replication
- Optimize failover disposal
- handle shutdown

v2:
- use QEMUSizedBuffer/QEMUFile as COLO buffer
- colo support is enabled by default
- add nic replication support
- addressed comments from Eric Blake and Dr. David Alan Gilbert

v1:
- implement the frame of colo

Wen Congyang (1):
   COLO: Add block replication into colo process

zhanghailiang (27):
   configure: Add parameter for configure to enable/disable COLO support
   migration: Introduce capability 'colo' to migration
   COLO: migrate colo related info to slave
   migration: Integrate COLO checkpoint process into migration
   migration: Integrate COLO checkpoint process into loadvm
   COLO: Implement colo checkpoint protocol
   COLO: Add a new RunState RUN_STATE_COLO
   QEMUSizedBuffer: Introduce two help functions for qsb
   COLO: Save VM state to slave when do checkpoint
   COLO RAM: Load PVM's dirty page into SVM's RAM cache temporarily
   COLO VMstate: Load VM state into qsb before restore it
   arch_init: Start to trace dirty pages of SVM
   COLO RAM: Flush cached RAM into SVM's memory
   COLO failover: Introduce a new command to trigger a failover
   COLO failover: Implement COLO master/slave failover work
   COLO failover: Don't do failover during loading VM's state
   COLO: Add new command parameter 'colo_nicname' 'colo_script' for net
   COLO NIC: Init/remove colo nic devices when add/cleanup tap devices
   COLO NIC: Implement colo nic device interface configure()
   COLO NIC : Implement colo nic init/destroy function
   COLO NIC: Some init work related with proxy module
   COLO: Do checkpoint according to the result of net packets comparing
   COLO: Improve checkpoint efficiency by do additional periodic
     checkpoint
   COLO: Add colo-set-checkpoint-period command
   COLO NIC: Implement NIC checkpoint and failover
   COLO: Disable qdev hotplug when VM is in COLO mode
   COLO: Implement shutdown checkpoint

  arch_init.c                            | 199 +++++++-
  configure                              |  14 +
  hmp-commands.hx                        |  30 ++
  hmp.c                                  |  14 +
  hmp.h                                  |   2 +
  include/exec/cpu-all.h                 |   1 +
  include/migration/migration-colo.h     |  58 +++
  include/migration/migration-failover.h |  22 +
  include/migration/migration.h          |   3 +
  include/migration/qemu-file.h          |   3 +-
  include/net/colo-nic.h                 |  25 +
  include/net/net.h                      |   4 +
  include/sysemu/sysemu.h                |   3 +
  migration/Makefile.objs                |   2 +
  migration/colo-comm.c                  |  80 ++++
  migration/colo-failover.c              |  48 ++
  migration/colo.c                       | 809 +++++++++++++++++++++++++++++++++
  migration/migration.c                  |  60 ++-
  migration/qemu-file-buf.c              |  58 +++
  net/Makefile.objs                      |   1 +
  net/colo-nic.c                         | 438 ++++++++++++++++++
  net/tap.c                              |  45 +-
  qapi-schema.json                       |  42 +-
  qemu-options.hx                        |  10 +-
  qmp-commands.hx                        |  41 ++
  savevm.c                               |   2 +-
  scripts/colo-proxy-script.sh           |  97 ++++
  stubs/Makefile.objs                    |   1 +
  stubs/migration-colo.c                 |  58 +++
  vl.c                                   |  36 +-
  30 files changed, 2178 insertions(+), 28 deletions(-)
  create mode 100644 include/migration/migration-colo.h
  create mode 100644 include/migration/migration-failover.h
  create mode 100644 include/net/colo-nic.h
  create mode 100644 migration/colo-comm.c
  create mode 100644 migration/colo-failover.c
  create mode 100644 migration/colo.c
  create mode 100644 net/colo-nic.c
  create mode 100755 scripts/colo-proxy-script.sh
  create mode 100644 stubs/migration-colo.c



--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
