[Gzz-commits] manuscripts/storm short-paper.rst
From: Benja Fallenstein
Subject: [Gzz-commits] manuscripts/storm short-paper.rst
Date: Tue, 20 May 2003 14:48:12 -0400
CVSROOT: /cvsroot/gzz
Module name: manuscripts
Changes by: Benja Fallenstein <address@hidden> 03/05/20 14:48:12
Modified files:
storm : short-paper.rst
Log message:
shorten short paper
CVSWeb URLs:
http://savannah.gnu.org/cgi-bin/viewcvs/gzz/manuscripts/storm/short-paper.rst.diff?tr1=1.1&tr2=1.2&r1=text&r2=text
Patches:
Index: manuscripts/storm/short-paper.rst
diff -u manuscripts/storm/short-paper.rst:1.1
manuscripts/storm/short-paper.rst:1.2
--- manuscripts/storm/short-paper.rst:1.1 Tue May 20 10:18:35 2003
+++ manuscripts/storm/short-paper.rst Tue May 20 14:48:12 2003
@@ -2,22 +2,32 @@
Storm: Supporting data mobility through location-independent identifiers
========================================================================
+.. Main point of this paper:
+ Location-independent identifiers support data mobility;
+ DHT allows location-independent identifiers
+
Abstract
========
-In this paper, we define data mobility as a collective term for the
-movement of documents between computers, different locations
-on one computer and movement of content between documents.
-We identify dangling links and alternative versions as major
-obstacles for the free movement of data. This paper presents the Storm
-(STORage Module) design as one possible solution to these problems.
-Storm uses location-independent globally unique
-identifiers, append-and-delete-only storage and peer-to-peer networking to
-resolve problems raised by data mobility. Moreover, we discuss some
-specific use scenarios related to ad hoc networks, unreliable network
-connections and mobile computing, in which the need for data mobility
-is obvious. Our current prototype implementation works on a single system;
-peer-to-peer networking is in an early prototype stage.
+- data mobility
+- problems
+- location-independent identifiers such as hashes
+- resolvable through DHT
+- our implementation (Storm) is beginning to be deployed
+
+.. In this paper, we define data mobility as a collective term for the
+ movement of documents between computers, different locations
+ on one computer and movement of content between documents.
+ We identify dangling links and alternative versions as major
+ obstacles for the free movement of data. This paper presents the Storm
+ (STORage Module) design as one possible solution to these problems.
+ Storm uses location-independent globally unique
+ identifiers, append-and-delete-only storage and peer-to-peer networking to
+ resolve problems raised by data mobility. Moreover, we discuss some
+ specific use scenarios related to ad hoc networks, unreliable network
+ connections and mobile computing, in which the need for data mobility
+ is obvious. Our current prototype implementation works on a single system;
+ peer-to-peer networking is in an early prototype stage.
.. raw:: latex
@@ -51,326 +61,13 @@
This, we believe, may be the most important result of peer-to-peer
research with regard to hypermedia.
-In this paper, we examine how location-independent identifiers can
-support *data mobility*. Documents often move quite freely
-between computers: they are sent as
-e-mail attachments, carried around on disks, published on the web, moved
-between desktop and laptop systems, downloaded for off-line reading or
-copied between computers in a LAN. We use 'data mobility' as a collective
-term for the movement of documents between computers (or folders!),
-and movement of content between documents (through copy&paste) [#]_.
-
-.. [#] While the physical mobility of e.g. notebooks may affect
- data mobility (for example due to caching for off-line access),
- data mobility is neither the same as, nor limited to the physical
- movement of devices.
-
-We address two issues raised by data mobility:
-Dangling links and keeping track of alternative versions.
-Resolvable location-independent identifiers
-make these issues much easier to deal with, since data
-can be identified wherever it is moved [#]_.
-Current systems dealing with these issues
-often do not deal well with many forms of data mobility.
-
-.. [#] It might be more appropriate to speak about *resources*
- and *references* instead of *documents* and *links*, but
- in the spirit of [kappe95scalable]_, we stick with
- the simpler terms for explanation purposes.
-
-*Dangling links* are an issue when documents are moved
-between servers; when no network connection is available,
-but there is a local copy (e.g. on a laptop or dialup system);
-or when the publisher removes a document permanently,
-but there are still copies (e.g. in a public archive such as
-[waybackmachine]_). Dangling links are also an issue
-when a document and a link to it are received independently,
-for example as attachments to independent emails,
-or when a link is sent by mail and the document is available
-from the local intranet. When two people meet e.g. on the train,
-they should be able to form an ad-hoc network and follow links
-to documents stored on either one's computer [thompson01coincidence]_.
-Furthermore, when a document is split to parts, links to
-the elements in the parts that are then in new documents should not break.
-
-Advanced hypermedia systems such as Microcosm and Hyper-G
-address dangling links through a notification system
-[hill94extending-andalso-kappe95scalable]_:
-When a document is moved, a message is sent to servers storing links to it.
-Hyper-G uses an efficient protocol for delivering such notifications
-on the public Internet.
-
-Location-independent identifiers for documents
-make such a system unnecessary; a structured peer-to-peer lookup system
-can find documents wherever they are moved.
-This kind of system also works for data not publicized on the Internet.
-For example, if one email has a document attached to it, and another email
-links to this document, an index of locally stored documents
-by permanent identifier allows the system to follow the link.
-This would be difficult to realize through a
-notification mechanism.
-
-*Tracking alternative versions*, on the other hand,
-is an issue when documents are modified
-on several independent, unconnected systems, for example
-when a user takes a document home from work on a floppy disk;
-when they keep the same set of documents on their desktop and laptop,
-modifying them on each; when two people collaborate on a document,
-sending each other versions of the document by email;
-when someone downloads a document, modifies it, and publishes
-the modified version,
-or when a group of people collaborate on a set of documents,
-synchronizing irregularly with a central server (as in CVS [cvs]_),
-a network of servers (as in Lotus Notes) or directly with each other
-(as in Groove [groovesurl]_). In each of these cases, a user should be able
-to work on the version at hand and then either merge it with others
-or fork to a different branch, as well as rollback the current changes
-or look at differences between versions *without network connectivity*.
-
-The main contribution of this paper is the Storm (for *STORage Module*)
-design,
-a hypermedia system built to use the emerging
-peer-to-peer data lookup technologies to enhance data mobility
-by dealing with versioning and dangling links.
-
-Storm is a library
-for storing and retrieving data as *blocks*, immutable
-byte sequences identified by cryptographic content hashes
-[lukka02guids]_.
-We address the mobility of documents by block storage
-and versioning.
-Fig. [ref-storm_layers]_ provides an overview of Storm's components.
-
-.. uml:: storm_layers
- :caption: Components of the Storm model
-
- package Blocks
-
- package Indexing
- use Blocks
-
- package XuStorage
- use Indexing
- use Blocks
-
- package Pointers
- use Indexing
- use Blocks
-
- package Diffs
- use Indexing
- use Blocks
-
- ---
- Blocks.c = (0,0);
- vertically(55, foo, Blocks, Indexing);
-
- dx = 80; dy = 60;
- XuStorage.c = Indexing.c + (-dx, -dy);
- z1 = Indexing.c + (dx, -dy);
- Pointers.c = z1 + (15, 25);
- Diffs.c = z1 - (15, 15);
-
-Additionally, we hope to
-provide an input to the ongoing discussion about peer-to-peer
-hypermedia systems
-[thompson01coincidence-andalso-bouvin02open-andalso-p2p-hypertext-panel-andalso-lukka02guids]_.
-
-This paper is structured as follows. In the next section, we describe
-related work. In section 3, we give an overview of the xanalogical
-storage model.
-In section 4, we introduce the basic storage unit of our
-system, i.e. file-like blocks identified by cryptographic hashes.
-In section 5,
-we discuss application-specific reverse indexing of blocks by their
-content, essential for many applications. In section 6, we present
-techniques for efficient versioned storage of mutable data on top of blocks.
-In section 7, we report on implementation experience and future directions.
-Section 8 concludes the paper.
+- location-dependent identifiers cause broken links
+- alternative versions on independent systems hard to synchronize
-Related Work
-============
+- creating a location-independent namespace, resolve through DHT
-Dangling links and alternative versions
----------------------------------------
-The dangling link problem has received a lot of attention
-in hypermedia research (e.g. [davis98referential]_). As examples,
-we examine the ways
-in which HTTP, Microcosm [fountain90microcosm]_ and Hyper-G [andrews95hyperg]_
-deal with the problem.
-
-In HTTP, servers are able to notify a client that a document
-has been moved, and redirect it accordingly [rfc2068]_. However,
-this is not required, and there are no facilities for
-updating a link automatically when its target is moved.
-The HTTP protocol includes a "LINK" request
-for creating a relationship between a set of URIs,
-but this feature has never been commonly implemented [reich-davis99-ohp]_.
-
-In Microcosm, hypermedia functionality is implemented
-using *filters*, which react to arbitrary messages
-(such as 'find links to this anchor') generated by
-a client application. Filters are processes on the local system
-or on a remote host [hill94extending]_. When
-a document is moved or deleted, a message is sent
-to the filters. Linkbases implemented as filters can
-update their links accordingly. A client selects a set
-of remote filters to use. Only links stored by one
-of these filters can be found by the client.
-
-.. [HymEbook?]
-
-.. Microcosm systems can independently choose
- whether to import filters from other systems, and whether
- to host and export own filters; thus, a system can act
- as both a client and server at the same time,
- for example in a workgroup.
-
-In Hyper-G, documents are bound to servers, and a link
-between documents on different servers is stored by both servers
-[kappe95scalable]_. This ensures that all links from and to a document
-can always be found, but requires the cooperation
-of both parties. Hyper-G employs a scalable protocol
-for notifying servers when a document has been moved or removed.
-A server hosting links to this document can then ask
-the link's author to change the link, or at least the link
-can be removed automatically. The *p-flood* algorithm
-employed by Hyper-G guarantees that a message
-is delivered to all interested servers, but requires that each
-interested server keeps a list of all the others.
-
-These approaches share the assumption that it is not possible
-to resolve a location-independent identifier. Otherwise,
-it would not be necessary to update links when a document
-is moved, nor would either of the servers storing two given documents
-need to know the links between them;
-knowing only a document's location-independent identifier,
-it would be possible to find both the document and links to it,
-no matter which peer in the network they are stored on.
-
-In a similar vein,
-version control systems like CVS or RCS [tichy85rcs]_ generally assume
-a central server hosting a repository. The WebDAV/DeltaV protocols,
-designed for interoperability between version control systems, inherit
-this assumption [rfc2518-andalso-rfc3253]_.
-On the other hand, Arch [arch]_ places all repositories
-into a global namespace and allows independent developers
-to branch and merge overlapping repositories without any central control.
-
-Lotus Notes, a popular database sharing and collaboration tool,
-uses both location-dependent and location-independent
-identifiers [lotus-notes-c-api]_. However, partly due to the age
-of the system, Lotus Notes is limited to a client-server architecture.
-Groove [groovesurl]_ is an improved design based on Lotus Notes,
-employing strong security mechanisms and using peer-to-peer functionality
-as the basis of communication channels among a limited number of participants.
-
-.. [ref HTML version format proposal] Alternate versions important for
- authoring process [search refs]. (Note: Keeping track of versions
- structure is also \*hyper*media. Refs?) (WebDAV!)
-
-.. review: http://citeseer.nj.nec.com/griffiths99contentspec.html ?
- couldn't find a relevant angle, as it's a storage /protocol/. hm
- same prob. with http://citeseer.nj.nec.com/millard98reworking.html
- http://citeseer.nj.nec.com/227358.html that are about OHProtocol.
-
-
-Peer-to-peer systems
---------------------
-
-.. [ref: iris: http://iris.lcs.mit.edu/].
-
-During the last few years, there has been a lot of research
-related to peer-to-peer resource discovery, both in academia
-and in the industry [p2pworkinggroup]_.
-There are two main approaches: broadcasting
-[gnutellaurl-andalso-ripeanu02mappinggnutella-andalso-kazaaurl]_,
-and distributed hashtables (DHTs)
-[stoica01chord-andalso-ratnasamy01can-andalso-zhao01tapestry-andalso-rowston01pastry-andalso-maymounkov02kademlia-andalso-malkhi02viceroy]_.
-Broadcasting systems
-forward queries to all systems reachable in a given number of hops
-(time-to-live). DHTs store (key,value) pairs which can be found given
-the key; a DHT assigns each peer a subset of all possible keys, and
-routes queries for a given key to the peer responsible for it.
-Before a pair can be found, it must be *inserted* in the DHT
-by sending it to the peer responsible for the key. Both approaches
-use an application-level overlay network for routing.
-
-While broadcasting systems' performance can be worse than linear,
-DHTs' performance
-usually has log-like bounds in the number of peers
-for *all* internal operations [#]_. This scalability is
-what makes global searches feasible in DHTs. In broadcasting approaches,
-on the other hand, scalability is achieved by forwarding queries
-only to a limited subset of the peers (bounded by the time-to-live),
-which means that searches in these systems are not truly global.
-
-.. [broadcasting's message population can grow as fast as O(n^2) -Hermanni]
-
-.. [#] It's not clear whether *all* proposed DHT designs can preserve
- log-like properties when participants are heterogeneous and they
- join and leave the system in a dynamic manner.
-
-A DHT has a *key space*, for example the points on a circle.
-The keys in (key,value) pairs are mapped to points in the key space
-through a hash function. Independently, each peer is assigned
-a point in the space. The DHT defines a distance metric
-between points in the key space (e.g. numeric, XOR); the peer
-responsible for a hashtable key, then, is the one that is *closest*
-to it in the key space, according to the distance metric.
-A DHT peer is roughly analogous to a hashtable bucket.
-Queries are routed in the overlay network, each hop bringing
-them closer to their destination in key space, until they reach
-the responsible peer. A common API that can be supported
-by current and future DHTs is proposed in [zhao03api]_.
-
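The key-space routing just described can be illustrated with a minimal sketch (Python; the names are hypothetical and this is not Storm code). It maps keys and peers into the key space with a hash function and picks the responsible peer as the one closest to the key, here under an XOR distance metric as in Kademlia:

```python
import hashlib

def key_point(key: bytes) -> int:
    # Map a hashtable key to a point in the key space via a hash function.
    return int.from_bytes(hashlib.sha1(key).digest(), "big")

def responsible_peer(peers, key: bytes) -> int:
    # The peer *closest* to the key under the distance metric
    # (XOR here) is responsible for the (key, value) pair.
    point = key_point(key)
    return min(peers, key=lambda p: p ^ point)

# Independently, each peer is assigned a point in the key space.
peers = [key_point(name.encode()) for name in ("peer-a", "peer-b", "peer-c")]
owner = responsible_peer(peers, b"some document id")
assert owner in peers
```

A query for a given key would be routed hop by hop through the overlay network toward `responsible_peer(...)`; the sketch only shows the final closeness test.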
-.. Recently, a few DHT-like systems have been developed which employ
- a key space similarly to a DHT, but in which queries are routed
- to (key,value) pairs [bonsma02swan-andalso-AspnesS2003]_: A peer
- occupies several positions in the key space, one for each
- (key,value) pair. In such a system, the indirection of placing
- close keys in the custody of a 'hashtable bucket' peer is removed
- at the cost of each peer maintaining one node in the overlay network
- for each (key,value) pair it publishes.
-
-The basic definition of a distributed hashtable does not indicate
-how large the keys and values used may be. Intuitively, we expect keys
-to be small, maybe a few hundred bytes at most; however, there are different
-approaches to the size of values. Consider a file-sharing application:
-If the keys are keywords from the titles of shared files, are the values
-the files-- or the addresses of peers from which the files may be
-downloaded? Iyer et al. [iyer02squirrel]_ call the former approach
-a *home-store* and the latter a *directory* scheme (they call the peer
-responsible for a hashtable item its 'home node,' thus 'home-store').
-The choice between the schemes affects the scalability and reliability
-of the network.
-
-CFS [dabek01widearea]_ and PAST [rowstron01storage]_
-are scalable storage systems using the home node approach,
-based on the Chord [stoica01chord]_ and
-Pastry [rowston01pastry]_ DHTs, respectively.
-Freenet [freenet-ieee]_ is a system for anonymous reading
-and publication.
-
-Recently there has been some interest in peer-to-peer hypermedia.
-Thompson and de Roure [thompson01coincidence]_ examine the discovery
-of documents and links available at and relating to
-a user's physical location. An example would be
-a linkbase constructed from links made available by different
-participants of a meeting [thompson00weaving]_.
-Bouvin [bouvin02open]_ focuses on the scalability and ease of publishing
-in peer-to-peer systems, examining ways in which p2p can serve
-as a basis for Open Hypermedia. Our own work has been
-in implementing Xanalogical storage [lukka02guids]_.
-
-At the Hypertext'02 panel on peer-to-peer hypertext [p2p-hypertext-panel]_,
-there was a lively discussion on whether the probabilistic access
-to documents offered by peers joining and leaving the network
-would be tolerable for hypermedia publishing. For many documents,
-the answer is probably no; however, for personal links,
-comments, and notes about documents, probabilistic access may be acceptable,
-especially when seen as a trade-off against
-having to set up a webspace account before publication.
-
-In the end, some peers will necessarily be more equal than others:
-Published data will be hosted on servers
-which are permanently on-line, but are otherwise ordinary peers
-in the indexing overlay network.
-
-
Storm block storage
===================
@@ -522,434 +219,10 @@
for the experimental Gzz system, a platform explicitly developed
to overcome the limitations of traditional file-based applications.
+- versioning, pointers
-Implementation
---------------
+- web integration
-Storm blocks are MIME messages [borenstein92mime]_, i.e., objects with
-a header and body as used in Internet mail or HTTP.
-This allows them to carry any metadata that can be carried
-in a MIME header, most importantly a content type.
-
-Collections of existing Storm blocks are called *pools*. Pools provide
-the following interface for injecting and obtaining data::
-
- add(bytes) -> id
- getIds() -> list
- get(id) -> block
-
-and the following methods for moving blocks between pools::
-
- add(block)
- delete(id)
-
-Implementations may store blocks in RAM, in individual files,
-in a Zip archive, in a database, in a p2p network,
-or through other means.
-We have implemented the first three (using hexadecimal
-representations of the block ids for file names).
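As an illustration, a minimal in-memory pool might look like the following sketch (Python; the class and method names are hypothetical, not the actual Storm API). The key property is that a block's id is the hexadecimal content hash, so identical content always yields the same location-independent identifier:

```python
import hashlib

class RamPool:
    """Hypothetical in-memory sketch of the Storm pool interface.
    Blocks are immutable byte sequences identified by content hash."""

    def __init__(self):
        self._blocks = {}

    def add(self, data: bytes) -> str:
        # The id is the hexadecimal SHA-1 of the block's content.
        block_id = hashlib.sha1(data).hexdigest()
        self._blocks[block_id] = data
        return block_id

    def get_ids(self):
        return list(self._blocks)

    def get(self, block_id: str) -> bytes:
        return self._blocks[block_id]

    def delete(self, block_id: str):
        # Storage is append-and-delete-only: blocks are never
        # modified in place, only added or removed whole.
        del self._blocks[block_id]

pool = RamPool()
bid = pool.add(b"hello, storm")
assert pool.get(bid) == b"hello, storm"
```

A file-based pool would differ only in where the bytes live, e.g. using the same hexadecimal ids as file names.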
-
-Many existing peer-to-peer systems could be used to
-find blocks on the network.
-For example, Freenet [freenet-ieee]_, recent Gnutella-based clients
-(e.g. Shareaza [shareazaurl]_), and Overnet [overneturl]_
-also use SHA-1-based identifiers.
-Implementations on top of a DHT could use both the
-directory and the home store approach as defined by [iyer02squirrel]_.
-
-Unfortunately, we have not put a p2p-based implementation
-into use yet and can therefore only report on our design.
-Currently, we are working on a prototype implementation
-based on UDP, the GISP distributed hashtable [kato02gisp]_,
-and the directory approach (using the DHT to find a peer
-with a copy of the block, then using HTTP to download the block).
-Many practical problems have to be overcome before this
-implementation will be usable (for example seeding the
-table of known peers, and issues with UDP and network
-address translation [rfc3253]_).
-
-.. talk about efficiency, storing big media files only once--
-
- It is unclear whether this approach is efficient for text
- in the Storm framework; in the future, we may try storing
- the characters in the documents themselves, along with their
- permanent identifiers; however, this makes spoofing
- possible. For images or video, on the other hand,
- it is clearly beneficial if content appearing in different
- documents-- or different versions of a document-- is only
- stored once, in a block only referred to wherever
- the data is transcluded. This is similar to different Web pages
- including the same image.
-
-An important open issue with block storage is the set of UI conventions
-for listing, moving and deleting blocks.
-
-.. Currently, the only interface is a file system directory
- containing a set of blocks as files with hexadecimal,
- random-looking names. In Gzz, we currently trick our way around
- the problem; at startup time, we simply load the most current
- version of a document whose identifier is hard-wired into
- the software (mutable documents are described in section 6.1).
-
-
-Application-specific reverse indexing
-=====================================
-
-Finding links and transclusions in
-Xanalogical storage is an example of *reverse indexing*
-of Storm blocks: finding a block based on its contents.
-(For other examples, see section 6, below.)
-Storm provides a general API for indexing blocks in
-application-specific ways. We have implemented indexing
-on a local machine, but the interface is designed so that
-implementation on top of a distributed hashtable
-will be straightforward. (Again, our GISP-based implementation
-is in a very early stage.)
-
-In Storm, applications are not allowed to put arbitrary
-items into the index. Instead, applications that want
-to index blocks provide the following callback
-to a Storm pool::
-
- getItems(block) ->
- set of (key, value) pairs
-
-This callback analyzes a block and returns a set of
-hashtable items (key/value pairs) to be placed into the index.
-The Storm pool, in turn, provides
-the following interface to the application::
-
- get(key) -> set of (block, value) pairs
-
-This function finds all items created by this application
-with a given key, indicating both the application-provided
-value and the block for which the item was created.
-
-We use the ``getItems()`` approach instead of
-allowing applications to put arbitrary items into the database
-because binding items to blocks makes it easy for pools
-to e.g. remove associated items when deleting a block.
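A minimal sketch of this callback-driven indexing might look as follows (Python; all names are hypothetical simplifications, assuming a simple in-memory pool). The application supplies the ``getItems``-style callback; the pool owns the index:

```python
import hashlib

def block_id(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

class IndexedPool:
    """Hypothetical sketch of application-specific reverse indexing:
    the application provides a get_items(block) callback, and the pool
    maintains the key -> (block id, value) index itself."""

    def __init__(self, get_items):
        self._get_items = get_items   # callback: block -> {(key, value)}
        self._blocks = {}
        self._index = {}              # key -> set of (block id, value)

    def add(self, data: bytes) -> str:
        bid = block_id(data)
        self._blocks[bid] = data
        # Index the new block using the application-provided callback;
        # binding items to blocks lets the pool remove them on delete.
        for key, value in self._get_items(data):
            self._index.setdefault(key, set()).add((bid, value))
        return bid

    def get(self, key):
        # All items created with this key: (block id, value) pairs.
        return self._index.get(key, set())

# Example application callback: index a block by the words it contains.
def words(block: bytes):
    return {(w, True) for w in block.decode().split()}

pool = IndexedPool(words)
bid = pool.add(b"storm block storage")
assert (bid, True) in pool.get("storm")
```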
-
-In a networked implementation, each peer is responsible
-for indexing the blocks it stores. Since no peer can
-feasibly know all applications that may need indexing,
-there may be blocks available on the network that have
-not been indexed by a particular application.
-We do not see this as a problem --- it's just like a block
-being available only if there's a peer which wants it to be ---
-but applications must be prepared to deal with it.
-
-Locally, on the other hand, it is guaranteed that
-all blocks in a pool are indexed by all applications
-known by the pool. To ensure this, we check that all blocks
-are indexed when a pool is loaded, and add missing items to the index.
-
-One indexing application that may seem obvious is keyword-based
-full-text search. However, no research has been done
-in this direction; it is not clear whether the current
-interface is well suited to this, or whether current implementations
-are able to scale to the load of storing an item for each word
-occurring in a document.
-
-.. [There are two refs about keywords in DHTs-- should we ref these ?
-Hermanni]
-
- Not sure how applicable they are: our system is *not*
- as general or performant as a DHT (as explained above).
- Should read & find out whether they could be implemented
- through our index system at all... -b
-
-
-Versioning
-==========
-
-Mutable documents can be implemented on top of block storage
-using a combination of two mechanisms, *pointers* and *diffs*.
-A *pointer* is an updatable reference to a block,
-and a diff is a set of differences between versions,
-similar to what is stored e.g. by version control systems such as CVS.
-
-
-Pointers: implementing mutable resources
-----------------------------------------
-
-A Storm pointer is a globally unique identifier (usually created randomly)
-that can refer to different blocks over time. A block a pointer
-points to is called the pointer's *target* (Fig. [ref-storm_pointers]_).
-
-To assign a target to a pointer, we create a special kind of block,
-a *pointer block*, representing an assertion like *pointer P targets
-block B*. To find the target of pointer P, Storm searches for
-blocks of this form. This is one application of Storm
-indexing (Section 5), using P as the index key.
-
-.. uml:: storm_pointers
- :caption: The Storm pointer system. A pointer is implemented
- by a collection of pointer blocks, each of which gives a single
- target for the pointer and may obsolete other pointer blocks.
-
- class Pointer
-
- class PointerBlock
- assoc multi(*) - multi(1) Pointer
- assoc multi(*) - multi(1) role(target) Target
-
- ring = assoc PointerBlock multi(1) - multi(*) role(obsoleted) PointerBlock
-
- class Target
-
- ---
-
- Pointer.c = (0, 0);
- horizontally(100, foo, Pointer, PointerBlock);
- vertically(50, bar, PointerBlock, Target);
- ring.p = PointerBlock.e{right} .. PointerBlock.n{down};
-
-
-In addition to the pointer and the target, pointer blocks contain
-a list of zero or more *obsoleted* pointer blocks. When a new version
-is created, it usually supersedes one older version;
-the corresponding pointer block then 'obsoletes'
-the pointer block targeting the superseded version.
-Only the new, non-obsoleted block will be considered when
-loading the document (although the pointer blocks pointing to
-past versions remain accessible for tracing the document's history) [#]_.
-
-.. [#] All known pointer blocks for a pointer are still loaded
- when the pointer is resolved. Storm then discards
- the obsoleted ones.
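This resolution rule can be sketched as follows (Python; the data shapes are hypothetical simplifications of the actual pointer-block format). All known pointer blocks for a pointer are collected, then the obsoleted ones are discarded; more than one target may remain, which is exactly the alternative-current-versions case:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PointerBlock:
    """Hypothetical shape of a pointer block: the assertion
    'pointer targets block', plus the ids of obsoleted pointer blocks."""
    block_id: str
    pointer: str
    target: str
    obsoletes: frozenset = frozenset()

def current_targets(pointer: str, pointer_blocks):
    # Load all known pointer blocks for this pointer ...
    candidates = [b for b in pointer_blocks if b.pointer == pointer]
    obsoleted = set()
    for b in candidates:
        obsoleted |= b.obsoletes
    # ... then discard the obsoleted ones. Several targets may remain:
    # these are alternative 'current' versions of the document.
    return {b.target for b in candidates if b.block_id not in obsoleted}

v1 = PointerBlock("pb1", "P", "version-1")
v2 = PointerBlock("pb2", "P", "version-2", frozenset({"pb1"}))
assert current_targets("P", [v1, v2]) == {"version-2"}
```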
-
-If, on the other hand, two people collaborate on a document
-and produce two independent versions, neither will obsolete
-the other. When they synchronize their pools by copying
-all new blocks in either to the other, both versions will be
-considered 'current' by the system. The users can then take
-appropriate action, by consolidating the changes in both versions
-(manually or through an automatic merge algorithm),
-or by treating the two versions as alternatives. After the
-alternative versions have been consolidated, a pointer block
-obsoleting both consolidated previous versions is created.
-
-Currently, the pointer mechanism
-works only between trusted Storm pools, e.g.
-in a workgroup collaborating on a set of documents.
-In a multi-user environment, we usually want only one user
-or group to be able to publish official versions of a document.
-It is not yet clear how to do this,
-but digital signatures of pointer blocks seem promising.
-For long-term publishing, one-time signatures have been
-found useful [anderson98erl]_.
-
-.. digital signatures require a public key infrastructure
- and a trusted timestamping mechanism, which
- are hardly feasible for a system intended to be used
- for off-line as well as on-line work.
-
-The ability to retain multiple 'current' versions of a document
-can be useful, for example when there is no time to consolidate
-changes at the time of synchronization. However, we need
-to choose one such version when loading the document.
-For example, we could open an official or original version automatically
-if one exists.
-
-While we think that alternative current versions are useful for
-asynchronous collaboration, they aren't well suited to Web-like publishing.
-For this, a different system may be practical, where digitally signed
-pointer blocks
-store a target and a timestamp; when resolving a pointer, the newest
-pointer block for that pointer would then be selected.
-
-In summary, the current pointer system seems promising, but
-there are a number of unresolved issues with it:
-authenticating pointer blocks; the user interface for choosing
-between alternative current versions; and the suitability
-for Web-like publishing. More research is needed in this area.
-
-.. authenticating -> [possible refs: ConChord, SDSI/SPKI ? -Hermanni]
-
-
-Diffs: storing alternative versions efficiently
------------------------------------------------
-
-.. [Hm, should we move/remove 'Additionally, many versioning'
- paragraph into related work ? -Hermanni]
-
-The pointer system suggests that for each version of a document,
-we store an independent block containing this version. This
-obviously doesn't scale well when we keep a lot of versions
-each with only small changes. Instead, we use the well-known
-technique of storing only the differences between versions.
-
-We still refer to a version by the id of a block containing it.
-However, we do not necessarily *store* this block,
-even though we refer to it. Instead, we may create a *diff block*,
-containing the ids of two versions and the differences between them.
-When we want to load a version
-and do not have the block, we use Storm indexing to find
-all diff blocks from or to that version, trying to find
-a chain of differences starting at a known version. Then,
-we can apply the differences in order, and arrive at the version
-we seek.
-
-When we have reconstructed this version, we create a Storm block
-from it and check that it matches the id of the version
-we are seeking. This way, we do not need to place any trust
-in the diff blocks we are using. While anybody can create
-a diff block pretending to give us version X even though it really
-gives us version Y, we can still retrieve diff blocks from
-an untrusted network source because we can check whether a block
-has given us version X or Y by checking the cryptographic hash.
-
-.. In this scheme, we can easily drop a previous version
- by merging differences: If we have stored the differences
- from version ``A`` to ``B``, and ``B`` to ``C``,
- to drop version ``B``, we compute
- the difference from ``A`` to ``C``, and replace the two
- previous differences by it. If we also store a difference
- between version ``C`` and ``D``, it does not need
- to be altered, because it refers to *version* ``C`` and not
- the difference to ``C`` from ``B`` (as in the simplistic scheme).
-
- We can also store the block containing version ``D``
- in addition to storing the versions above. Then, we can reconstruct
- version ``C`` in two ways: By using the diffs from ``A`` to ``B``
- and ``B`` to ``C``, or, more efficiently, by applying the inverse
- of the diff from ``C`` to ``D`` to version ``D`` [#]_.
-
- .. [#] Of course, in reality the number of differences
- that can be 'skipped' will have to be much higher
- for this mechanism to be useful.
-
-Our current implementation is a layer above Storm block storage
-and indexing. This layer implements a ``load(id) -> version``
-interface through the following simplified algorithm:
-
-1. If the block for ``version-id`` is in the pool, return it.
-2. Else, search for diff blocks storing the difference
- between ``version-id`` and any other version.
-3. For each of these blocks, attempt to ``load()`` the *other* version.
-4. If successful, apply the difference.
-5. Check the hash of the resulting version. If correct, return it;
- if incorrect, go back to step 3.
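The five steps above can be rendered as a short recursive function. This is an illustrative sketch only: `pool`, `diff_index`, `block_id`, and the SHA-1-based id scheme are assumptions of this example, not the real Storm interfaces.

```python
import hashlib

def block_id(data: bytes) -> str:
    # Assumption of this sketch: ids are hex SHA-1 content hashes.
    return hashlib.sha1(data).hexdigest()

def load(version_id, pool, diff_index, seen=None):
    """Simplified load(id) -> version, following the five steps
    in the text. `pool` maps block id -> bytes; `diff_index` maps
    a version id to (other_id, apply_fn) pairs, where
    apply_fn(other_bytes) applies the stored difference."""
    if version_id in pool:                             # step 1
        return pool[version_id]
    seen = seen if seen is not None else set()
    seen.add(version_id)
    for other_id, apply_diff in diff_index.get(version_id, ()):  # step 2
        if other_id in seen:
            continue
        try:
            other = load(other_id, pool, diff_index, seen)       # step 3
        except KeyError:
            continue
        candidate = apply_diff(other)                  # step 4
        if block_id(candidate) == version_id:          # step 5: hash check
            return candidate
    raise KeyError(version_id)

# Usage: version B is stored only as a diff against version A.
v_a, v_b = b"hello", b"hello world"
pool = {block_id(v_a): v_a}
diff_index = {block_id(v_b): [(block_id(v_a), lambda a: a + b" world")]}
assert load(block_id(v_b), pool, diff_index) == v_b
```

The `seen` set prevents cycles when diff blocks reference each other in both directions.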
-
-As computing differences is file-format dependent, so is our system
-for storing versions. In our implementation, applications need to
-provide a callback interface for reading and writing versions
-and computing and applying differences.
-
-.. uml:: version_interfaces
- :caption: Diff interfaces
-
- class Version "interface"
- methods
- getDiffFrom(:Version): Diff
-
- class Diff "interface"
- methods
- applyTo(:Version): Version
- inverse(): Diff
-
- class VersionFormat "interface"
- methods
- readVersion(:InputStream): Version
- readDiff(:InputStream): Diff
-
- writeVersion(:OutputStream, :Version)
- writeDiff(:OutputStream, :Diff)
-
- ---
- Version.c = (100,0);
- Diff.c = (250, 0);
- VersionFormat.c = (175, -80);
- %horizontally(100, foo, Version, Diff);
- %vertically(120, bar, foo, VersionFormat);
-
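A rough Python rendering of the interfaces in the figure, together with a trivial whole-content diff to exercise them. The method names follow the diagram; the concrete `TextVersion`/`TextDiff` classes are assumptions made purely for illustration:

```python
from abc import ABC, abstractmethod

class Version(ABC):
    @abstractmethod
    def get_diff_from(self, other: "Version") -> "Diff": ...

class Diff(ABC):
    @abstractmethod
    def apply_to(self, version: Version) -> Version: ...
    @abstractmethod
    def inverse(self) -> "Diff": ...

# Hypothetical application-provided implementation: the "diff" simply
# records the old and new content wholesale, which is enough to show
# how the interfaces fit together.
class TextVersion(Version):
    def __init__(self, text):
        self.text = text
    def get_diff_from(self, other):
        return TextDiff(other.text, self.text)

class TextDiff(Diff):
    def __init__(self, old, new):
        self.old, self.new = old, new
    def apply_to(self, version):
        assert version.text == self.old  # diff only applies to its base
        return TextVersion(self.new)
    def inverse(self):
        return TextDiff(self.new, self.old)
```

A real `VersionFormat` would add the stream (de)serialization methods from the figure; they are omitted here for brevity.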
-
-The diff system is more complex than simple block storage,
-and therefore more prone to bugs. However, saving is still
-purely additive: new diffs are added, but old diffs are never
-changed. Therefore, when a save goes wrong, again only the
-changes made after the previous save are lost.
-
-.. With backward diffing, we remove the cached full version,
- but we can reconstruct it using the diffs. We believe that
- diff-based Storm storage is still more reliable than file storage,
- where a simple application bug can lose all previous work
- on a document.
-
-To protect against buggy ``Diff`` or ``VersionFormat``
-implementations, before storing a diff, we always check
-that we can reconstruct the appropriate version block from it;
-if this fails for some reason, we store the full version block
-instead. At the cost of some storage space, this protects
-the user's data.
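The store-with-fallback check might look roughly as follows. This is a hedged sketch: `make_diff` and `apply_diff` stand in for an application's `VersionFormat` implementation, and the tuple return value is an invention of this example.

```python
def save_version(new_bytes, base_bytes, make_diff, apply_diff):
    """Before storing a diff, verify that it reconstructs the new
    version; on any failure, fall back to storing the full block."""
    try:
        diff = make_diff(base_bytes, new_bytes)
        if apply_diff(base_bytes, diff) == new_bytes:
            return ("diff", diff)       # round-trip verified: safe to store
    except Exception:
        pass  # a buggy Diff/VersionFormat must not lose the user's data
    return ("block", new_bytes)         # fallback: store full version block

# A correct diff implementation stores the (space-saving) diff...
assert save_version(b"new", b"old", lambda o, n: n,
                    lambda o, d: d) == ("diff", b"new")
# ...while a buggy one silently falls back to the full block.
assert save_version(b"new", b"old", lambda o, n: n,
                    lambda o, d: b"junk") == ("block", b"new")
```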
-
-.. [this would be relevant, but is cut because of space constraints -b]
- Currently, Storm pool implementations do not know anything about diffs;
- all the functionality described here is implemented on top of them.
- For a networked system, however, it would be useful if a server
- could recreate version blocks before sending them to a client.
- Then, instead of transferring all the diffs, only the full version
- would have to be sent through the network.
-
-
-Discussion
-==========
-
-To evaluate the design, we revisit the issues raised by data mobility.
-For the two issues addressed, *dangling links* and *tracking
-alternative versions*, we work through each use case identified in the
-introduction to illustrate how Storm handles it.
-
-*Dangling links*. With Storm, moving a document between servers does not
-affect links to it, because the identifiers are location-independent.
-In a peer-to-peer implementation, looking up an id returns a location
-where the data is currently available. If the publisher removes the
-document permanently but it is archived elsewhere, the archives simply
-act as peers. When no network connection is available but a local copy
-exists, Storm can find it. Links also work when a document and a link
-to it are received independently, e.g. as attachments in separate
-e-mails, or when a link to a document on the local intranet is e-mailed.
-When people meet in person, e.g. on a train, and form an ad hoc network,
-a peer-to-peer implementation of Storm lets them see each other's
-public documents and follow links to them.
-
-*Tracking alternative versions*. Because Storm uses immutable blocks,
-each modification to a document creates a new block. When a document is
-modified on several independent, unconnected systems without synching
-in between, several versions of it will exist. Using diffs, each
-version is in effect a collection of changes to the original. What
-happens next is outside the scope of Storm: the authors may decide to
-merge the changes into a new joint version, but how that is done is
-file-format and hence application specific. If some of the new versions
-are not merged but forked into separate branches, they simply continue
-to exist (they may be assigned different names, which is again outside
-the scope of Storm).
-
-.. some things from the earlier treatment left here as notes:
-
- B sets the document public (how that is done depends on UI
- implementation), i.e. putting in a published pool, which may reside
- locally or externally e.g. on a (dedicated) server that is "always-on".
- These public/private pools are an area where future research is needed,
- possibly related to rights and permissions etc. too.
-
- Comments may be new entities(?) linking to it
-
- [At the end of this section ? -Hermanni]
- When Xanalogical storage is not applied, using Storm as a
- replacement/equivalent of a conventional file and versioning system is
- trivial?
-
-.. Besides the selected issues discussed above, a few remarks about further
-   evaluation of Storm follow. From a security point of view, the fact that
-   all data is stored in immutable blocks has obvious benefits for
-   reliability (data is never overwritten). Using SHA-1 cryptographic
-   content hashes as identifiers also provides verifiability. As for
-   usability, the ease of replication and caching combined with
-   location-independent identifiers enables several improvements, including
-   the possibility to keep working on shared documents even when the
-   network connection is unreliable or absent.
-
Conclusions
===========
@@ -966,16 +239,7 @@
Currently, we are working on a GISP-based peer-to-peer
implementation.
-No work on integrating Storm with current programs (in the spirit of Open
-Hypermedia) has been done yet.
-This makes Storm a rather monolithic approach at present.
-
-One possibility is to take an existing system
-(with features outside the focus of Gzz)
-which implements strict versioning, and to modify it to use Storm for storage.
-A candidate is the object-oriented Web publishing environment Zope [zope]_,
-which is Free Software. The
-open hypermedia protocol (OHP) may be another possibility [reich-davis99-ohp]_.
+We have written an HTTP gateway and plan integration with KDE.
Work is also needed on user interfaces for Storm.