Re: [Gnu-arch-users] How does arch/tla handle encodings?


From: Marcus Sundman
Subject: Re: [Gnu-arch-users] How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 23:28:41 +0300
User-agent: KMail/1.7

On Saturday 28 August 2004 18:22, Jan Hudec wrote:
> On Sat, Aug 28, 2004 at 13:46:40 +0300, Marcus Sundman wrote:
> > On Saturday 28 August 2004 12:53, Jan Hudec wrote:
> > > On Fri, Aug 27, 2004 at 21:38:06 +0300, Marcus Sundman wrote:
> > > > On Friday 27 August 2004 21:23, Andrew Suffield wrote:
> > > > > On Fri, Aug 27, 2004 at 08:20:00PM +0300, Marcus Sundman wrote:
> > > > > > On Friday 27 August 2004 19:52, Andrew Suffield wrote:
> > > > > > > On Fri, Aug 27, 2004 at 06:50:23PM +0200, V. Haisman wrote:
> > > > > > > > File's encoding is imho metadata as much as permissions are.
> > > > > > >
> > > > > > > It's not. Encoding is data.
> > > > > >
> > > > > > Oh, get a clue. And a dictionary. The encoding info is data
> > > > > > about the data that is the content of the file. "Data about
> > > > > > data" is called "metadata". "Encoding" is an attribute of the
> > > > > > file, just as "filename" and "permissions" are.
> > > > >
> > > > > And I repeat: encoding is data.
> > > >
> > > > Yes, but it's also metadata. You said it isn't, but it is. Don't
> > > > pretend to be more stupid than you are.
> > >
> > > It is **NOT** metadata in the sense of filename, permissions,
> > > timestamp, ie. file attributes. It is metadata in the general sense
> > > "data about data".
> > >
> > > So while *calling* it metadata is ok, *treating* it as file
> > > attributes is not. The encoding is needed to understand the file, so
> > > it had better be deduced from its contents. The attributes do not bind
> > > that tightly and they can be lost at any moment. Especially since
> > > applications don't know how to handle them.
> >
> > Are you seriously suggesting that metadata is not actually metadata if
> > it is
>
> No. I am actually suggesting that it is a *DIFFERENT KIND* of metadata than
> file name, permissions and timestamp. And thus should be handled
> differently, if at all possible by the file format itself.

I think I agree that formats should be as self-contained as possible. 
However, many aren't. And even if the format supports all needed data you 
still have to know what format a given file is in. Hence we don't only need 
to store the encoding info, but also the file type info. So, why not simply 
store the mime-type of each file (with sensible and configurable defaults, 
of course)?
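
To make the idea concrete, here is a minimal sketch of "a stored type per file, with sensible and configurable defaults". It is not anything tla does today. The override table, the file names in it, and the function name mime_type_for are all made up for illustration; only the standard-library mimetypes module is real.

```python
import mimetypes

# Hypothetical per-project overrides (paths and values invented for this sketch).
OVERRIDES = {
    "README": "text/plain; charset=utf-8",
    "legacy/notes.txt": "text/plain; charset=iso-8859-15",
}

# Fallback used when nothing is configured and the extension says nothing.
DEFAULT = "application/octet-stream"


def mime_type_for(path, overrides=OVERRIDES, default=DEFAULT):
    """Return the explicitly stored type, else an extension-based guess, else the default."""
    if path in overrides:
        return overrides[path]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default
```

The explicit entry wins, so a file whose encoding cannot be embedded in its own format still gets one recorded alongside its type.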

> > mandatory? Only optional metadata is actually metadata? Both a file's
> > name and its encoding are properties of the file. The former can be
> > changed without modifying the contents of the file, the latter can't
> > necessarily. This is irrelevant. Both are equally metadata.
>
> Yes. They are equally metadata. Which does not mean they are best
> treated the same. There may be many ways of treating metadata and
> different ways are appropriate for different metadata.

You seem to be confusing two different things. On the one hand you have 
files of which the encoding is part of the semantics, and on the other hand 
you have files that may be represented in whatever encoding one wishes. 
(There is actually also a hybrid alternative, but we can safely put that in 
the former category at this point.)

If the encoding is part of a file's semantics, as is the case for files with 
embedded encoding info, then you obviously don't _have_to_ store the 
encoding info somewhere else, too. (You only need to store the file type.) 
For other files you *do* have to store the encoding info somewhere else. If 
there is also other metadata to store, then it would probably be best to 
use the same system for all metadata.

> > You just don't make sense. Is the "description" attribute metadata?
> > Let's say you have a picture that is displaying a particular shade of
> > red, and has the attribute "description: the color of my car". You use
> > this picture
>
> Then the comment (most graphical formats provide room for one) is the
> most appropriate place for that -- and that is within the actual file.

You are correct. I should have come up with a better example. Still, I think
you can imagine the same situation if graphics files didn't support
comments.

> > to find the correct shade when shopping for car paint. If you lose the
> > description attribute the picture is meaningless. The description is an
> > essential part of the picture and can't be deduced from it. Does this
> > make the attribute not metadata? Or how is this different from the
> > encoding of a text file? (And please don't say something stupid like
> > "it's different because the color of characters are irrelevant".)
>
> It makes the attribute metadata. But metadata of the contents, as
> opposed to metadata of the filesystem object.
>
> While the metadata of the filesystem object are best stored in the
> inode, perhaps as extended attributes, the content metadata are much
> better stored in the file itself, of course if the file format has room
> for them. If it does not, extended attributes are surely better than
> nothing. But they are not good.

I'm sorry, I don't understand what you mean by "the filesystem object". I 
don't see why e.g. the name of the photographer of a picture should be any 
more related to the inode than the encoding info of the picture. Sure, for 
maximum compatibility with badly behaving systems it can sometimes be a 
good idea to embed metadata in the same file as the data, but on the other 
hand, sometimes it's not. E.g., I bet 90% of all photographers with digital 
cameras have lost important information just because some stupid program 
decided to throw away the EXIF tag.

> > Also, the encoding can *not* be deduced from the file's contents. I
> > have already told why this is. E.g. if a file is in ISO-8859-2 there is
> > no way that the editor could know that it's not ISO-8859-1 or
> > ISO-8859-4 or ISO-8859-5 or ISO-8859-8 or ISO-8859-9 or ISO-8859-10 or
> > ISO-8859-13 or ISO-8859-14 or ISO-8859-15 or some other of the 30+
> > encodings for which the given byte sequence is valid.
>
> It definitely CAN -- if its format supports it. If you say that a file
> starting with a comment containing:
> # encoding: iso-8859-15
> is in said encoding (e.g. Python sources have this rule), then the
> encoding is deduced from the file contents. Just there is no standard
> for this. There is no standard for extended attributes there either.

So there is a handful of Python editors that support this, and all other
programs treat such files as if they were in the local system's default
encoding. But sure, if such a thing could be standardized it would be
great, since that would work in quite a few cases. Not in all, though.

> > > After all, that's what the byte-order-mark is for.  In most editors,
> > > the sequence 0xfe 0xff indicates utf-16be, 0xff 0xfe indicates
> > > utf-16le and 0xef 0xbb 0xbf indicates utf-8 encoding.
> >
> > No, the BOM is for specifying endianness of the encoding. (All Unicode
> > formats support a BOM, it's just that it's not needed for single byte
> > based ones, such as UTF-8. That said, I fully support using BOMs also
> > in UTF-8 files to more often detect badly behaving programs.) If you
> > don't know which encoding (or group of encodings) a file is in then you
> > can't possibly know how to interpret the first bytes of the file. There
> > is no way of knowing if a file beginning with the bytes 0xFE and 0xFF
> > is a big-endian UTF-16 file or an ISO-8859-1 file starting with "thorn
> > yuml" or something completely different in some other encoding.
>
> No, you really don't know that. You don't even know that is a text file.

Exactly.
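
The ambiguity is easy to demonstrate. The sketch below uses only Python's built-in codecs; the helper name decodings and the sample bytes are chosen for illustration.

```python
def decodings(data, encodings):
    """Map each encoding under which data decodes without error to the decoded text."""
    result = {}
    for name in encodings:
        try:
            result[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # not valid in this encoding, so it is not a candidate
    return result


# The same leading bytes read as a BOM or as two ordinary characters.
sample = b"\xfe\xff\x00A\x00B"
as_utf16 = sample.decode("utf-16")       # BOM consumed, big-endian text follows
as_latin1 = sample.decode("iso-8859-1")  # begins with thorn and y-diaeresis

# One byte, three valid and different single-byte interpretations.
candidates = decodings(b"\xe6", ["iso-8859-1", "iso-8859-2", "iso-8859-5"])
```

Every decode above succeeds, so the bytes alone give a program no reason to prefer one reading over another.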

> Actually arch could support external attributes on contents, because the
> contents are tagged with the id tag. Perhaps when the file-as-directory
> interface is finalized.

Sounds good.

> (There is currently a flamewar on linux-filesystem about reiser4
> concerning, among other things, this extended attributes topic).

I bet there is. We all knew there would be.


- Marcus Sundman



