
[Chicken-users] Blobs: a (modest?) proposal


From: Alaric Snell-Pym
Subject: [Chicken-users] Blobs: a (modest?) proposal
Date: Sat, 24 Jan 2009 12:10:14 +0000


In my work on Ugarit (and, hence, lzma) I've been shifting a lot of
arbitrary binary data about. I plumped for SRFI-4 u8vectors for
Ugarit, purely because at one point I want to prefix them with a
compression-algorithm byte and then strip that byte off again
elsewhere, but I feel in my heart of hearts I should have used blobs,
for I have no idea how the data I find in the random files being
backed up is structured.

There's a long tradition of considering arbitrary binary data as a
byte array, but that's slightly overspecifying it, in my opinion. Just
because any bit of memory *can* be interpreted as a byte array doesn't
mean it *is* a byte array, any more than the fact that any file in the
filesystem *can* be edited in vi means that it makes sense to do so.
For a start, as SRFI-4 makes clear, that region of memory can be seen
as a u8vector or an s8vector...

I think there's a valid semantic distinction between a blob - which is
purely a region of storage, which happens to be a multiple of 8 bits
in length - and a byte vector.
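
To make the distinction concrete, here is a minimal sketch in Python
(used purely as an analogy; it is not Chicken code). It assumes that
Python's bytes stands in for a blob and that memoryview.cast stands in
for choosing between a u8vector and an s8vector reading of the same
storage.

```python
# One region of storage, two interpretations of it.
blob = bytes([0x00, 0x7F, 0x80, 0xFF])

view = memoryview(blob)
as_unsigned = view.cast("B").tolist()  # every byte read as 0..255
as_signed = view.cast("b").tolist()    # the same bytes read as -128..127

print(as_unsigned)  # [0, 127, 128, 255]
print(as_signed)    # [0, 127, -128, -1]
```

Neither reading is "the" content of the blob; each is an
interpretation layered on top of the same four bytes.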

Which is why I made the lzma egg operate on blobs; lzma:compress and
lzma:decompress are just functions of type blob -> blob. I saw that
the z3 egg, which does a similar job, chose to use strings; there's
been some history of using strings for arbitrary data in Chicken,
which I think is wrong - strings imply character-sequence semantics.
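
As a rough illustration of the blob -> blob shape, the sketch below
uses Python's standard lzma module, not the Chicken lzma egg. The tag
value TAG_LZMA and the helper names pack and unpack are hypothetical,
invented for this example; they only mimic the idea of prefixing a
compression-algorithm byte and stripping it off again.

```python
import lzma

TAG_LZMA = 0x01  # hypothetical tag value, chosen for this sketch only


def pack(data: bytes) -> bytes:
    """Compress data and prefix a one-byte algorithm tag."""
    return bytes([TAG_LZMA]) + lzma.compress(data)


def unpack(packed: bytes) -> bytes:
    """Check and strip the tag byte, then decompress the remainder."""
    if not packed or packed[0] != TAG_LZMA:
        raise ValueError("unknown or missing compression tag")
    # Slicing a bytes object copies the payload; that copy is the price
    # of treating the data as an opaque blob in this simple version.
    return lzma.decompress(packed[1:])


original = b"arbitrary binary data " * 4
assert unpack(pack(original)) == original
```

Both functions take and return opaque bytes; nothing in their
signatures implies character-sequence semantics.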

So, I propose that people be mindful of this distinction and try to
make more use of the blob type. I don't propose breaking existing
code; things that operate on arbitrary data can happily accept
blob/string/u8vector/s8vector, but I think blob should be the default
in people's minds!
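
One way to picture "accept anything, think in blobs" is a small
normalising helper. The Python sketch below is an analogy only: as_blob
is a hypothetical name, bytes/str/array stand in for
blob/string/u8vector/s8vector, and encoding strings as UTF-8 is an
assumption made for the example.

```python
from array import array


def as_blob(data):
    """Return a flat byte-level view of data without interpreting it."""
    if isinstance(data, str):
        # Assumption for this sketch: strings are carried as UTF-8 bytes.
        data = data.encode("utf-8")
    # cast("B") presents any contiguous buffer as raw bytes, with no copy.
    return memoryview(data).cast("B")


for value in (b"\x01\x02", "hi", array("B", [1, 2]), array("b", [-1, 2])):
    print(type(value).__name__, as_blob(value).nbytes)
```

Code behind such a helper only ever sees "some bytes", whichever type
the caller happened to have in hand.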

Further to this, I am considering throwing together some useful blob
tools, to allow more to be done with blobs without needing to copy
them so much, and to deal with bigger blobs. This would comprise:

1) Replacements for the core blob functions, which operate on blobs
composed of a c-pointer and a size. (make-blob size) would malloc size
bytes and construct a blob with a finalizer that called free. Perhaps
for blobs below a certain size it'd just allocate them from the
nursery, using the normal approach to blobs and thus avoiding
registering a finalizer, and all the other blob functions would have a
conditional to detect which blob representation was in use. However,
FFI code that returns blobs can then easily wrap a malloced pointer
returned by a C function, and have the finalizer call free on it; or
use a different finalizer if the memory comes from some other kind of
memory pool. The flexibility would be there.

2) A wrapper for the mmap stuff in the posix unit, adding a function
that returns a blob wrapping the mmapped region, with a reference
count; when the last blob goes away, it's un-mmapped.

3) Blob I/O on file descriptors - file-read in the POSIX unit should
return a blob by default, not a string! It's too late to change that,
so I'd add a file-read-blob, and make file-write accept new-style blobs.

4) Similarly, sidling the new blobs into the lolevel unit functions, so
they can be the source or destination of move-memory!, have their
pointer extracted, and all that.

5) A new SRFI-4, which uses a blob as the underlying storage for every
vector. blob->*vector/shared reuses an existing blob, and the
non-shared versions just duplicate the blob then use that. An actual
SRFI-4 vector would become a record referencing the underlying blob, a
starting offset (subject to alignment, of course), and the length of
the vector in elements, so any subregion of a blob can be viewed as an
SRFI-4 vector; this would mean that sub*vector/shared functions could
be created that just made a new vector-record referencing the same
blob, but with adjusted offset/length fields.
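
The record described in point 5 can be sketched as follows. This is
Python, not Chicken, and everything named here (U16View, ref, set,
subview, ITEM_SIZE) is hypothetical; a bytearray stands in for the
underlying blob and only an unsigned 16-bit element type is shown.

```python
import struct

ITEM_SIZE = 2  # bytes per unsigned 16-bit element


class U16View:
    """Stand-in for a vector record: blob + byte offset + element count."""

    def __init__(self, blob, offset=0, length=None):
        if offset < 0 or offset % ITEM_SIZE:
            raise ValueError("offset must be non-negative and aligned")
        if length is None:
            length = (len(blob) - offset) // ITEM_SIZE
        if length < 0 or offset + length * ITEM_SIZE > len(blob):
            raise ValueError("view extends past the end of the blob")
        self.blob, self.offset, self.length = blob, offset, length

    def _pos(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.offset + i * ITEM_SIZE

    def ref(self, i):
        return struct.unpack_from("=H", self.blob, self._pos(i))[0]

    def set(self, i, value):
        struct.pack_into("=H", self.blob, self._pos(i), value)

    def subview(self, start, end):
        """Shared sub-vector: a new record over the same blob, no copy."""
        if not 0 <= start <= end <= self.length:
            raise IndexError((start, end))
        return U16View(self.blob, self.offset + start * ITEM_SIZE, end - start)


blob = bytearray(8)           # room for four 16-bit elements
whole = U16View(blob)
tail = whole.subview(2, 4)    # elements 2 and 3, sharing the same storage
tail.set(0, 0xBEEF)
print(whole.ref(2))           # 48879, i.e. 0xBEEF seen through the other view
```

Creating a sub-view allocates only the small record; the blob itself
is never duplicated.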

This would mean that there'd be a lot less copying involved in dealing
with blobby data. A foreign function that returns a malloced block
could have it returned in Chicken as a blob with zero copying, and
Chicken code could then happily interpret different parts of it as any
of the SRFI-4 types by just dropping lightweight shared vector
wrappers onto it.
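
The "wrap foreign memory, free it via a finalizer" idea from point 1
can be sketched in Python as below. This is an analogy under stated
assumptions: ForeignBlob is a hypothetical class, a bytearray stands in
for a malloced block, and the release callback stands in for free (or
whatever the owning memory pool requires).

```python
import weakref


class ForeignBlob:
    """Blob wrapping storage that was allocated by someone else."""

    def __init__(self, storage, release):
        self._view = memoryview(storage)
        # release(storage) runs at most once: on close(), or when this
        # object is garbage collected, whichever happens first.
        self._finalizer = weakref.finalize(self, release, storage)

    def __len__(self):
        return self._view.nbytes

    def view(self):
        """Zero-copy access to the wrapped storage."""
        return self._view

    def close(self):
        self._finalizer()


released = []
storage = bytearray(b"\x01\x02\x03\x04")   # pretend a C function returned this
blob = ForeignBlob(storage, lambda s: released.append(len(s)))

first_two = blob.view()[:2]   # a slice of the view shares the storage
storage[0] = 0xFF             # a write to the block is visible through it
print(first_two[0])           # 255

blob.close()
print(released)               # [4]
```

Because the release function is supplied by whoever created the blob,
the same wrapper works for memory from malloc or from any other pool.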

What do people think of this? Would it be welcome in the chicken core
once it's proven itself?

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/?author=4
