From: Xavier Hernandez
Subject: Re: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Date: Mon, 10 Feb 2014 11:17:12 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hi Niels,

On 10/02/14 11:05, Niels de Vos wrote:
On Tue, Feb 04, 2014 at 10:07:22AM +0100, Xavier Hernandez wrote:
Hi,

Currently, inodelk() and entrylk() are being used to make sure that
changes happen synchronously on all bricks, avoiding data/metadata
corruption when multiple clients modify the same inode concurrently.
So far so good; however, I think this introduces a significant
overhead to avoid a situation that will happen very rarely. It also
limits the benefit of client-side caches.

I propose to implement a new translator that uses a MESI-like
protocol (the protocol used to maintain cache coherency between the
local caches of CPU cores). This translator would add virtually zero
overhead when no more than one client is accessing the same inode,
and an overhead comparable to the current implementation when there
is contention.

Another advantage of this protocol is that it would make it possible
to implement much more aggressive caching mechanisms on the client
side, improving overall performance without losing any current
features.

At a high level this is how it could work:

Each client tracks the state of each inode it uses (M - Modified, E
- Exclusive, S - Shared, I - Invalid). All inodes will start in the
invalid state. When a client needs to write to an inode, it asks all
bricks for exclusive access. Once granted, the inode will be in the
exclusive state and any read/write operation can be performed locally
on the client side, because it knows that nobody else will be
modifying the inode. If the inode is successfully written (in the
local cache), the state will change to modified. Eventually the
changes will be sent to the bricks in the background and the state
will go back to exclusive, or to invalid if the inode is not needed
anymore.
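
Just to illustrate the idea (this is not real code; the helper
functions are only placeholders, not existing GlusterFS APIs), the
client side could track something like this:

#include <stddef.h>
#include <sys/types.h>

typedef enum {
    CACHE_INVALID,    /* no valid local copy                      */
    CACHE_SHARED,     /* read-only copy, other readers may exist  */
    CACHE_EXCLUSIVE,  /* sole owner, local copy is clean          */
    CACHE_MODIFIED    /* sole owner, local copy has dirty data    */
} cache_state_t;

typedef struct {
    cache_state_t state;
    /* ... cached data, dirty ranges, ... */
} cached_inode_t;

/* Placeholder helpers that the new translator would provide. */
int  request_exclusive (cached_inode_t *ci);
void store_locally (cached_inode_t *ci, const void *buf, size_t size,
                    off_t off);
void queue_background_flush (cached_inode_t *ci);

/* Write path: obtain exclusive access once, then serve writes locally. */
static int
cached_write (cached_inode_t *ci, const void *buf, size_t size, off_t off)
{
    if (ci->state != CACHE_EXCLUSIVE && ci->state != CACHE_MODIFIED) {
        if (request_exclusive (ci) != 0)  /* ask all bricks for E */
            return -1;
        ci->state = CACHE_EXCLUSIVE;
    }

    store_locally (ci, buf, size, off);   /* write into the local cache */
    ci->state = CACHE_MODIFIED;           /* E -> M                     */

    queue_background_flush (ci);          /* later: M -> E (or I)       */
    return 0;
}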

Now, if another client needs to read or write the same inode, it
will send a request to all bricks. If the inode is in the exclusive
or modified state on one of the clients, the bricks will notify the
current owner of the inode to flush all pending changes. Once that
completes, the new client will be granted exclusive (if it's a write
request) or shared (if it's a read request) access to the inode. The
former owner will leave the inode in the invalid state (if it's a
write request) or the shared state (if it's a read request).

Multiple clients can read a shared inode simultaneously; however, if
one client needs exclusive access to the inode, all other clients
will need to set the inode's state to invalid before exclusive access
is granted.
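
Again just a sketch of the brick side (reusing cache_state_t from the
snippet above; the notify_*() and add_sharer() calls are placeholders
for whatever callback mechanism the translator would use to reach the
clients):

typedef enum { REQ_READ, REQ_WRITE } req_type_t;

typedef struct client client_t;            /* opaque client handle */

typedef struct {
    client_t     *owner;                   /* holder of E/M, if any */
    cache_state_t owner_state;
    /* ... list of sharers, ... */
} brick_inode_t;

void notify_flush (client_t *c, brick_inode_t *bi);
void notify_invalidate_all (brick_inode_t *bi, client_t *except);
void add_sharer (brick_inode_t *bi, client_t *c);

static void
grant_access (brick_inode_t *bi, client_t *requester, req_type_t type)
{
    /* If another client holds the inode in E or M, make it flush first. */
    if (bi->owner && bi->owner != requester &&
        (bi->owner_state == CACHE_EXCLUSIVE ||
         bi->owner_state == CACHE_MODIFIED)) {
        notify_flush (bi->owner, bi);      /* write back pending changes */
        bi->owner_state = (type == REQ_WRITE) ? CACHE_INVALID
                                              : CACHE_SHARED;
    }

    if (type == REQ_WRITE) {
        /* Every other copy must become invalid before E is granted. */
        notify_invalidate_all (bi, requester);
        bi->owner       = requester;
        bi->owner_state = CACHE_EXCLUSIVE;
    } else {
        add_sharer (bi, requester);        /* multiple readers are fine */
    }
}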

The only synchronization point needed is to make sure that all
bricks agree on the inode's state and on which client owns it. This
can be achieved without locking, using a method similar to what I
implemented in the DFC translator.
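
I won't describe the DFC mechanism here, but just as an illustration
of what "without locking" could mean (this is only an assumption, not
how DFC actually works): each brick could record the owner with an
atomic compare-and-swap, and the client would proceed only when all
bricks gave it the same answer, backing off and retrying otherwise.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t owner_id;   /* 0 means "no owner" */
} brick_owner_t;

/* Returns 1 if 'client_id' became (or already was) the owner, 0 if some
 * other client owns the inode.  No lock is ever taken on the brick. */
static int
try_acquire_owner (brick_owner_t *o, uint64_t client_id)
{
    uint64_t expected = 0;

    if (atomic_compare_exchange_strong (&o->owner_id, &expected, client_id))
        return 1;

    return expected == client_id;
}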

Besides the lock-less architecture, the main advantage is that much
more aggressive caching strategies can be implemented very close to
the end user, considerably increasing the throughput of the file
system. Special care has to be taken with things that can fail on
background writes (basically brick space and user access rights).
Those should be handled appropriately on the client side to guarantee
that the writes will eventually succeed.
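
For example (again with placeholder helpers, and reusing
cached_inode_t from the first snippet), the client could refuse to
complete a write locally unless both conditions are known to hold:

int check_write_access (cached_inode_t *ci);           /* placeholder */
int reserve_space (cached_inode_t *ci, size_t size);   /* placeholder */

static int
can_cache_write (cached_inode_t *ci, size_t size)
{
    if (!check_write_access (ci))    /* would the bricks allow this write? */
        return 0;

    if (!reserve_space (ci, size))   /* pre-reserve the space on the bricks */
        return 0;

    return 1;                        /* safe to complete the write locally */
}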

Of course this is only a high-level overview. A deeper analysis
should be done to see what to do in each special case.

What do you think?
This sounds very much like "delegations and callbacks" in NFSv4. It is
an optional feature that servers do not need to support, and that some
clients cannot easily support (think of firewalls blocking callbacks).
The RFC for NFSv4 documents the feature pretty well:
- http://tools.ietf.org/html/rfc3530#section-9.2
I didn't know about it, but it really seems very similar to the idea I had. I'll read it in more detail.

Thanks!!!

Xavi

I'd surely be interested in seeing something similar for the GlusterFS
protocol; it definitely improved performance for certain workloads on
NFS.

Niels



