
Re: [libextractor] extractor metadata and XML/RDF


From: Christian Grothoff
Subject: Re: [libextractor] extractor metadata and XML/RDF
Date: Mon, 9 Jul 2007 23:43:16 -0600
User-agent: KMail/1.9.5

On Monday 09 July 2007 10:07, Andreas Harth wrote:
> Hello,
>
> I'm working on SWSE [1], a Semantic Web Search Engine.  The aim
> is to collect arbitrary content from the Web and make the metadata
> available for search and query.
>
> Extractor looks like exactly the right tool for extracting metadata
> from legacy formats.  However, the resulting metadata are name-value
> pairs, which makes post-processing difficult.

I don't see how it makes post-processing difficult.  It is pretty much the
simplest format possible.  Now, certainly having values such as dates and
numbers in a highly standardized format would help certain forms of
post-processing.  However, given that some of the file formats are a bit
vague in how they encode the data in the first place, I don't see how it
would be possible to always achieve this.

> Do you have (or are there efforts in that direction) a more formal
> way of returning metadata? I can see XML or better RDF fitting there.
> I'd like to add some terms from standard ontologies (such as Dublin
> Core and Friend of a Friend) to the output, probably using sed
> scripts in the beginning if there is currently nothing else available.

The metadata types used by libextractor (LE) were motivated by Dublin Core.
Additional terms are added as needed by particular formats.  Improvements to
the set of available metadata types are welcome, but they should be driven by
adding new plugins or modifying existing ones to produce better terms, not by
just adding terms that will never be extracted.  I am not aware of any effort
to add support for RDF or XML.
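
Since the output is plain name-value pairs, mapping it onto Dublin Core in a
post-processing step could look roughly like the sketch below.  This is only
an illustration, not something LE provides: the keyword names in the table,
the mapping itself, and the helper names are assumptions, and the input is
assumed to be already parsed into (name, value) pairs.  It emits N-Triples
using the Dublin Core element namespace.

```python
# Dublin Core Metadata Element Set 1.1 namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from extractor-style keyword names to Dublin Core
# elements; the names on the left are assumptions, adjust to the real output.
KEYWORD_TO_DC = {
    "title": "title",
    "author": "creator",
    "date": "date",
    "mimetype": "format",
    "description": "description",
}


def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def pairs_to_ntriples(subject_uri, pairs):
    """Turn (name, value) pairs into N-Triples lines about subject_uri.

    Pairs whose name has no entry in KEYWORD_TO_DC are skipped.
    """
    lines = []
    for name, value in pairs:
        term = KEYWORD_TO_DC.get(name.strip().lower())
        if term is None:
            continue
        lines.append('<%s> <%s%s> "%s" .'
                     % (subject_uri, DC, term, escape_literal(value)))
    return lines


if __name__ == "__main__":
    sample = [("title", "Annual Report"), ("mimetype", "application/pdf")]
    for line in pairs_to_ntriples("http://example.org/report.pdf", sample):
        print(line)
```

Values are passed through as plain string literals; nothing here tries to
normalize dates or numbers, for the reason given above.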

Best regards,

Christian



