[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Gnu-arch-users] A Truly Transactional Filesystem (API description)
From: |
Thomas Lord |
Subject: |
[Gnu-arch-users] A Truly Transactional Filesystem (API description) |
Date: |
Thu, 12 Jan 2006 17:22:30 -0800 |
Building a file system on top `vumnd' (the page-oriented, persistent
transactional memory system I presented in my last message) is a
well-understood problem. An inode layer can be done in a few
thousand line of code and a `vu_' system call layer with a few
thousand more. I hope to get such a thing working soon, if I can.
This note is about *extra* functionality that can be built into
a file system at the same time.
* Constant Files, Symbolic Links, and Devices.
The new file system call `int conx (int fd)' turns inodes which are
not directories into *constant inodes*.
The link count of a constant inode can change but no other aspect of
the inode can ever again change, regardless of the node's
permissions. There is no inverse to the `conx' operation.
* Constant Directories
Constant directories work a little bit differently. If `conx'
is applied to a directory, all of the contents of the directory
must themselves be constants (except `..'). The directory
is made constant and its device number (normally 0) is changed.
Cross device links are permitted into and and out of constant
directories. If a link is made to a node in a constant tree
the resulting link may or may not be to the same device and
inode as found in the constant tree.
* Cheap Device Cloning
The new file system call `clone (...)' makes a space and time
efficient copy of a constant directory. The copy has a new device
number (and possibly a new inode) for all of its subdirectories.
All corresponding non-directory nodes in the two trees have the same
device and inode number. The effect cloning has on link counts of
non-directories is not specified.
Note that a clone is always itself a constant tree and is always
a clone of a constant tree.
* Transaction Devices
The new file system call `dbase (...)' makes a space and
time efficient copy of a constant directory, again with a new
device number.
Unlike a clone, a transaction copy can be modified in ordinary
ways by file system calls -- none of the files or directories is
initially constant.
Also unlike a clone, in a copy created by `dbase', even
non-directory nodes have a new device number.
`conx' works on files and directories on transaction nodes in the
usual way.
The new call `relink (dir_fd, name)' replaces the link for a given
name and directory, where the named node is constant and resides on
the transaction tree, with a link to an equivalent constant node
that does not reside on the transaction device. For a given
constant node on the transaction device, `relink' always returns the
same result.
The new call `revert (dir_fd, name)' replaces the named subtree of a
transaction tree with either a link (to an original file) or to a
new clone (to the original directory).
* Expanded `rename' Capabilities
`rename' can overwrite a link to a constant directory even
if that directory is not empty.
`soft_rename' acts like `rename' except that it never overwrites
an existing link.
* Example Application: Revision Control Whole-file Storage
The above features have many uses in revision control. An easy
example to see is storage management for tree snapshots:
Suppose that we have a constant tree representing an immediate
ancestor revision and want to `commit', creating a new constant
tree representing a new revision. We can:
1. `dbase' (make a transactional copy) of the ancestor
revision. This is O(1) in time and space.
2. In the transactional copy, modify changed files
and directories. This performs about the same
as modifying regular files.
3. In the transactional copy, `revert' all of the largest
unmodified subtrees. This is O(1) for each such subtree,
O(N) for N largest unmodified subtrees.
4. From leafs towards the root, `relink' all entries
of each *modified* directory and `conx' that directory.
When the top-level of the transactional copy is made constant
by `conx', the result is a constant tree for the new revision,
sharing unchanged state with the tree for the previous revision.
`commit' time and space in this system is proportionate to the size
of the file and directory modifications being committed.
* Example Application: A Persistent MRU Delta Cache
Storing whole trees optimizes for some important access patterns but
pessimizes others. For example, clients sometimes need optimized
access to arbitrary whole-tree deltas -- diffs between two
revisions.
Reasonable performance characteristics can be obtained from a system
that computes diffs between revisions "on-demand" but also caches
recently used deltas (perhaps weighted by the expense of recomputing
those deltas).
Using our new features we can represent the cache as a pair: a
constant directory (for cache readers) and a transactional copy
of that constant directory (for cache writers).
Cache writers update the transaction copy, make it constant, rename
it to replace the reader's copy of the cache, and create a new
transaction copy for the next write transaction.
Cache readers open the root directory of the constant copy of the
cache and access cache contents relative to that directory. Thus,
readers are isolated from concurrent writes.
* Other Applications
Revision control is hardly the only application.
* Implementation
I've already hinted at large parts of it (`vudev' and `vumnd' in
the previous two messages). With luck and support I'll show how
to implement these features by Sept. (support: address@hidden via
paypal, thanks).
* Project Plans
I plan to leverage my strengths and avoid my weaknesses. So:
With some encouragement, I'll start publishing code snapshots.
With greater encouragement, I'll publish more code snapshots
and more detailed descriptions and documentation.
While I welcome helpful kibbitzing including bug reports I make
no promises to respond in any way. When it gets to the point that
people want to start using the result in projects of their own,
someone *else* will have to start maintaining fork for that purpose.
-t
- [Gnu-arch-users] A Truly Transactional Filesystem (API description),
Thomas Lord <=