Re: [libextractor] extractor metadata and XML/RDF


From: Christian Grothoff
Subject: Re: [libextractor] extractor metadata and XML/RDF
Date: Wed, 11 Jul 2007 06:25:16 -0600
User-agent: KMail/1.9.5

On Wednesday 11 July 2007 06:00, you wrote:
> Hi,
>
> Christian Grothoff wrote:
> >> Extractor looks like exactly the right tool for extracting metadata
> >> from legacy formats.  However, the resulting metadata are name-value
> >> pairs, which makes post-processing difficult.
> >
> > I don't see how it makes post-processing difficult.  It is pretty much
> > the simplest format possible.  Now, certainly having data in highly
> > standardized format (such as dates, numbers, etc.) would help certain
> > forms of post-processing.  However, given that some of the file-formats
> > are a bit vague in how they encode the data in the first place, I don't
> > see how it would be possible to always achieve this.
>
> Using a standardised format (even XML) would facilitate parsing,
> e.g. newline handling, Unicode, etc.  You're right about dates and
> numbers, haven't thought about that one.

I think your problems arise from using the "extract" command-line tool, not 
from using the library.  When you use the library directly, you get the 
metadata as a linked list with type + char*-content. So there are no problems 
with newlines in that case.  As for Unicode, internally LE defines that 
everything should be UTF-8.  When printing to the console, extract 
converts UTF-8 to whatever the locale is (which may result in some losses if 
the locale lacks ways to represent certain characters).  But again, this 
problem goes away once you use the library and not the command-line tool -- 
at that point, you do not need to do any parsing.
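
To illustrate the point about the linked list, here is a minimal sketch of 
walking such a list in C.  The names kw_type, kw_node, kw_first and kw_print 
are hypothetical stand-ins made up for this example; they are not the 
declarations from the libextractor headers, so check extractor.h for the 
real types and functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a keyword list entry: a type tag plus a
   NUL-terminated UTF-8 string.  NOT the real libextractor declaration. */
enum kw_type { KW_TITLE, KW_FORMAT, KW_TEMPLATE };

struct kw_node {
  enum kw_type type;
  const char *value;      /* UTF-8, may contain embedded newlines */
  struct kw_node *next;   /* NULL terminates the list */
};

/* Return the first value of the requested type, or NULL if there is none. */
static const char *kw_first(const struct kw_node *list, enum kw_type type)
{
  for (const struct kw_node *n = list; n != NULL; n = n->next)
    if (n->type == type)
      return n->value;
  return NULL;
}

/* Print every entry as "<type number>: <value>", one entry per call to
   fprintf.  Returns the number of entries printed. */
static size_t kw_print(FILE *out, const struct kw_node *list)
{
  size_t count = 0;

  assert(out != NULL);
  for (const struct kw_node *n = list; n != NULL; n = n->next) {
    fprintf(out, "%d: %s\n", (int) n->type, n->value);
    count++;
  }
  return count;
}
```

Because each value arrives as its own string, a title containing a newline 
is still a single list entry; nothing has to be split or unescaped, which is 
the difference from parsing the text printed by extract.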

> Also, using defined tags (maybe with URIs) helps to understand
> what the data is about (what's template, split, format)?

Template is the template used to create the document (see office products).  
Split are keywords generated by splitting metadata (see splitextractor).  
Format is the document format.  

> At the moment I'm ok with the tool.  We only use the extracted
> title, and don't care too much about Unicode and newlines.  IMHO
> having a more formal metadata format would be extremely helpful
> in the long term though.

I agree that good (verbose) documentation of what the different LE keyword 
types mean would be nice.  However, just 
saying what each type means is not enough -- we would also have to check all 
of the plugins to make sure that the types are used consistently according to 
those updated definitions.  If someone wants to do this, I'd be very happy 
about it. 

Christian
