[Gnu-arch-users] user space file systems
From: Thomas Lord
Subject: [Gnu-arch-users] user space file systems
Date: Tue, 10 Jan 2006 10:18:52 -0800
Building a user space file system is a good idea for lots of
reasons. But how?
Don't use an RDBMS, persistent hash table, or other database
middleware as the back end. Sure, their support for transactions helps
a little and they do offer portable APIs to native storage.
Unfortunately they also come with a lot of code and baggage for
functionality that isn't really needed, and the APIs they provide aren't a
particularly natural target language for implementing a POSIX-style
file system.
Why not make a portable library whose API resembles an idealized,
simplified raw disk? That should be easy to simply rewrite for
every platform one wants to port to; it should give good
performance; it is a very natural target for writing a file system
implementation. (And if you *must* use a database -- build this
API on top of that.)
Here's such an API, one that can be implemented in about 610 lines of
code on POSIX. I can't imagine it would be any harder on Windows
using native calls there.
* The API
/* Pages are 1KB
*/
#define c_vudev_page_size_bits ((unsigned int)10)
#define c_vudev_page_size ((size_t)1 << c_vudev_page_size_bits)
/* There are 2^32 pages -- theoretical 4TB capacity.
*/
#define c_vudev_page_addr_bits ((unsigned int)32)
#define c_vudev_max_page_addr ((t_vudev_page_addr)0xffffffff)
typedef t_uint32 t_vudev_page_addr;
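To make the arithmetic behind those constants concrete, here is a minimal sketch of converting byte offsets to page addresses. The helper names (`page_of_offset', `offset_in_page', `pages_spanned', `capacity_bytes') are made up for illustration and are not part of the API, and `uint32_t' stands in for `t_uint32' on the assumption that it is a 32-bit unsigned type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: t_uint32 is a 32-bit unsigned integer; uint32_t stands in. */
typedef uint32_t t_vudev_page_addr;

#define c_vudev_page_size_bits ((unsigned int)10)
#define c_vudev_page_size ((size_t)1 << c_vudev_page_size_bits)
#define c_vudev_page_addr_bits ((unsigned int)32)

/* Hypothetical helper: address of the page containing a byte offset. */
static inline t_vudev_page_addr page_of_offset (uint64_t byte_offset)
{
  /* The page number must fit in 32 bits. */
  assert ((byte_offset >> c_vudev_page_size_bits) <= 0xffffffffu);
  return (t_vudev_page_addr)(byte_offset >> c_vudev_page_size_bits);
}

/* Hypothetical helper: position of a byte offset within its page. */
static inline size_t offset_in_page (uint64_t byte_offset)
{
  return (size_t)(byte_offset & (c_vudev_page_size - 1));
}

/* Hypothetical helper: how many contiguous pages a byte range touches. */
static inline uint64_t pages_spanned (uint64_t byte_offset, uint64_t n_bytes)
{
  uint64_t first, last;

  if (n_bytes == 0)
    return 0;
  first = byte_offset >> c_vudev_page_size_bits;
  last = (byte_offset + n_bytes - 1) >> c_vudev_page_size_bits;
  return last - first + 1;
}

/* 2^32 pages of 2^10 bytes each: 2^42 bytes. */
static inline uint64_t capacity_bytes (void)
{
  return ((uint64_t)1 << c_vudev_page_addr_bits) << c_vudev_page_size_bits;
}
```

A file system layered on top would presumably use something like `pages_spanned' to decide how many pages to request when it allocates a chunk for a byte range.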
/* Client programs "connect" to virtual disks.
*/
typedef <unspecified-pointer-type> t_vudev_connection;
/* A "chunk" is a (virtual) DMA area for transfers between
* the client and the raw virtual disk.
*/
typedef <unspecified-pointer-type> t_vudev_chunk;
The location of a virtual disk image (e.g., a file on a native
system where that file contains the complete virtual file system)
is specified by a URI:
* int vudev_create_device
(const t_uchar ** const err,
const t_uchar * const uri,
t_vudev_page_addr const n_control_pages);
Create and initialize a new virtual disk.
`n_control_pages' is a performance hint. The implementation
should try to make access to pages `0..(n_control_pages - 1)'
as fast as possible.
* t_vudev_connection vudev_connect (const t_uchar ** const err,
const t_uchar * const uri);
Connect to an existing virtual disk.
* t_vudev_connection vudev_dup (const t_uchar ** const err,
t_vudev_connection conn);
Copy a connection. (May return the argument connection in
which case connections are reference counted.)
* int vudev_disconnect (const t_uchar ** const err,
t_vudev_connection conn);
Terminate a connection.
* int vudev_write_lock (const t_uchar ** const err,
t_vudev_connection const conn);
* int vudev_write_unlock (const t_uchar ** const err,
t_vudev_connection const conn,
t_uchar * const control_pages);
* t_uchar * vudev_read_lock (const t_uchar ** const err,
t_vudev_connection const conn);
* int vudev_read_unlock (const t_uchar ** const err,
t_vudev_connection const conn,
t_uchar * const control_pages);
Begin/end a write/read transaction.
Transactions may not be nested, promoted, or demoted.
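One way to read the rule about nesting, promotion, and demotion is as a three-state machine per connection. The sketch below is a toy model only: the `toy_' names, the `struct toy_conn' type, and the 0 / -1 return convention are assumptions for illustration, and the real lock calls take an `err' argument and control pages that are left out here.

```c
#include <assert.h>

/* Hypothetical per-connection transaction state. */
enum toy_txn_state { TOY_IDLE, TOY_READING, TOY_WRITING };

struct toy_conn
{
  enum toy_txn_state state;
};

/* A read transaction may only begin from the idle state:
 * no nesting, and no demotion of a write transaction.
 */
static inline int toy_read_lock (struct toy_conn * c)
{
  if (c->state != TOY_IDLE)
    return -1;
  c->state = TOY_READING;
  return 0;
}

/* A write transaction may only begin from the idle state:
 * no nesting, and no promotion of a read transaction.
 */
static inline int toy_write_lock (struct toy_conn * c)
{
  if (c->state != TOY_IDLE)
    return -1;
  c->state = TOY_WRITING;
  return 0;
}

/* Each unlock must match the kind of transaction in progress. */
static inline int toy_read_unlock (struct toy_conn * c)
{
  if (c->state != TOY_READING)
    return -1;
  c->state = TOY_IDLE;
  return 0;
}

static inline int toy_write_unlock (struct toy_conn * c)
{
  if (c->state != TOY_WRITING)
    return -1;
  c->state = TOY_IDLE;
  return 0;
}
```

Under this reading, a client that holds a read transaction and wants to write has to end the read transaction first and then begin a separate write transaction.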
* t_vudev_chunk vudev_chunk (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_page_addr addr,
t_vudev_page_addr n_pages);
* t_uchar * vudev_chunk_data (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_chunk const chunk);
* t_vudev_page_addr vudev_chunk_addr (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_chunk const chunk);
* t_vudev_page_addr vudev_chunk_n_pages (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_chunk const chunk);
* int vudev_chunk_dirty (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_chunk const chunk);
* int vudev_chunk_stale (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_chunk const chunk);
* int vudev_free_chunk (const t_uchar ** const err,
t_vudev_connection const conn,
t_vudev_chunk const chunk);
Actual I/O is performed by modifying buffers which may or may
not be active DMA areas. A chunk is a handle for a buffer
for an arbitrary choice of contiguous pages. Multiple
overlapping chunks may concurrently exist.
If a chunk is allocated during a read transaction its
initial data is consistent with the state of the disk
during that transaction. If a chunk is left over from a
previous transaction, its data may be invalid unless
the chunk is passed to `vudev_chunk_stale'.
Writing is accomplished by modifying chunk data and, after
the modifications, calling `vudev_chunk_dirty'.
When no longer needed, `vudev_free_chunk' releases a chunk.
Clients should assume that there is a performance penalty
for concurrently overlapping chunks and for very large
chunks.
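The chunk life cycle described above can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the `toy_' names, the 16-page array standing in for the disk, the omission of the `err' and `conn' arguments, and in particular the reading of `vudev_chunk_stale' as "refresh this buffer from the disk" and of `vudev_chunk_dirty' as an immediate copy back (a real implementation may defer the write).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { TOY_PAGE_SIZE = 1024, TOY_N_PAGES = 16 };

/* Hypothetical in-memory stand-in for the virtual disk (zero-filled). */
static unsigned char toy_disk[TOY_N_PAGES * TOY_PAGE_SIZE];

struct toy_chunk
{
  size_t addr;            /* first page covered */
  size_t n_pages;         /* number of contiguous pages */
  unsigned char * data;   /* private buffer for those pages */
};

/* Models vudev_chunk: allocate a buffer and fill it from the disk. */
static struct toy_chunk * toy_chunk (size_t addr, size_t n_pages)
{
  struct toy_chunk * c;

  if (n_pages == 0 || addr >= TOY_N_PAGES || n_pages > TOY_N_PAGES - addr)
    return NULL;
  c = malloc (sizeof *c);
  if (c == NULL)
    return NULL;
  c->data = malloc (n_pages * TOY_PAGE_SIZE);
  if (c->data == NULL)
    {
      free (c);
      return NULL;
    }
  c->addr = addr;
  c->n_pages = n_pages;
  memcpy (c->data, toy_disk + addr * TOY_PAGE_SIZE, n_pages * TOY_PAGE_SIZE);
  return c;
}

/* Models vudev_chunk_dirty: publish the buffer to the disk. */
static void toy_chunk_dirty (const struct toy_chunk * c)
{
  memcpy (toy_disk + c->addr * TOY_PAGE_SIZE, c->data,
          c->n_pages * TOY_PAGE_SIZE);
}

/* Models vudev_chunk_stale (assumed meaning): refresh from the disk. */
static void toy_chunk_stale (struct toy_chunk * c)
{
  memcpy (c->data, toy_disk + c->addr * TOY_PAGE_SIZE,
          c->n_pages * TOY_PAGE_SIZE);
}

/* Models vudev_free_chunk. */
static void toy_free_chunk (struct toy_chunk * c)
{
  free (c->data);
  free (c);
}

/* Two overlapping chunks on a fresh disk. Returns 0 if the second
 * chunk sees the first chunk's write only after being refreshed.
 */
static int toy_overlap_demo (void)
{
  struct toy_chunk * a = toy_chunk (2, 2);   /* pages 2..3 */
  struct toy_chunk * b = toy_chunk (3, 1);   /* page 3, overlaps a */
  int before, after;

  if (a == NULL || b == NULL)
    {
      if (a != NULL) toy_free_chunk (a);
      if (b != NULL) toy_free_chunk (b);
      return -1;
    }
  a->data[TOY_PAGE_SIZE] = 0x5a;   /* first byte of page 3, via a */
  toy_chunk_dirty (a);
  before = b->data[0];             /* b's buffer predates the write */
  toy_chunk_stale (b);
  after = b->data[0];              /* now reflects the write */
  toy_free_chunk (a);
  toy_free_chunk (b);
  return (before == 0 && after == 0x5a) ? 0 : -1;
}
```

The copying in this model also hints at where the penalty for overlapping and very large chunks could come from, though an actual implementation might map pages instead of copying them.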
* int vudev_sync (const t_uchar ** const err,
t_vudev_connection const conn);
Wait until all chunks marked dirty have reached stable
storage.
The transactional semantics of this API are relatively weak:
The contents of chunk data are undefined except during a read or
write transaction.
If a chunk was not allocated during the current transaction its
contents are invalid for reading until the chunk is passed to
`vudev_chunk_stale'.
If a chunk is modified during a write transaction then the
chunk *must* be passed to `vudev_chunk_dirty' after the
modifications but before the end of the transaction.
Concurrent writes to a page produce undefined results.
Data written into a chunk may reach stable storage at any time after
it is first written but before the next call to `vudev_sync'
completes. Data may reach stable storage in any order consistent
with the `vudev_sync' constraint.
A crash *during* `vudev_sync' leaves the contents of all
pages written since the previous `vudev_sync' undefined.
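A small model of what these durability rules let a client rely on after a crash: any page marked dirty since the last *completed* sync cannot be trusted, and every other page can. The `toy_' names and the fixed 16-page disk are assumptions for illustration; this tracks only which pages are in doubt, not their contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { TOY_SYNC_N_PAGES = 16 };

/* in_doubt[p] is true if page p was marked dirty after the last
 * completed sync (hypothetical bookkeeping, not part of the API).
 */
static bool in_doubt[TOY_SYNC_N_PAGES];

/* Models marking a chunk covering one page dirty. */
static inline void toy_mark_dirty (unsigned int page)
{
  assert (page < TOY_SYNC_N_PAGES);
  in_doubt[page] = true;
}

/* Models a sync call that has returned successfully:
 * everything written before it is now on stable storage.
 */
static inline void toy_sync_completed (void)
{
  memset (in_doubt, 0, sizeof in_doubt);
}

/* After a crash, only pages not written since the last completed
 * sync are known to hold their pre-crash contents.
 */
static inline bool toy_page_trustworthy_after_crash (unsigned int page)
{
  assert (page < TOY_SYNC_N_PAGES);
  return !in_doubt[page];
}
```

A file system built on this API would presumably have to arrange its own recovery (a journal, shadow pages, or similar) around these guarantees, since the API itself promises nothing about the in-doubt pages.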