[GNUnet-SVN] r9950 - Extractor-docs/WWW


From: gnunet
Subject: [GNUnet-SVN] r9950 - Extractor-docs/WWW
Date: Fri, 1 Jan 2010 14:47:21 +0100

Author: grothoff
Date: 2010-01-01 14:47:21 +0100 (Fri, 01 Jan 2010)
New Revision: 9950

Modified:
   Extractor-docs/WWW/documentation.html
   Extractor-docs/WWW/index.html
Log:
docu

Modified: Extractor-docs/WWW/documentation.html
===================================================================
--- Extractor-docs/WWW/documentation.html       2010-01-01 13:24:36 UTC (rev 9949)
+++ Extractor-docs/WWW/documentation.html       2010-01-01 13:47:21 UTC (rev 9950)
@@ -17,11 +17,11 @@
 <link rel="SHORTCUT ICON" href="http://gnunet.org/libextractor/favicon.ico">
 </head>
 <body>
-
-
 <table width="99%" border="0" cellpadding="0" cellspacing="0">
-<tbody><tr><td colspan="2" width="99%" bgcolor="#99bbff" align="center">libextractor - Documentation</td></tr>
-<tr><td valign="top"><table width="15%" border="0" cellpadding="2" cellspacing="3">
+<tbody>
+<tr><td colspan="2" width="99%" bgcolor="#99bbff" align="center">libextractor - Documentation</td></tr>
+<tr><td valign="top">
+<table width="15%" border="0" cellpadding="2" cellspacing="3">
 <tbody><tr><th nowrap="nowrap" bgcolor="99BBFF"><a href="libextractor.html">Home</a></th></tr>
 <tr><th nowrap="nowrap" bgcolor="99BBFF"><a href="download.html">Download</a></th></tr>
 <tr><th nowrap="nowrap" bgcolor="99BBFF"><a href="documentation.html">Documentation</a></th></tr>
@@ -37,11 +37,11 @@
 This documentation covers the major aspects of libextractor.
 The man pages for <a href="man/extract.html">extract</a> and <a href="man/libextractor.html">libextractor</a> are also on-line.
 <br>
-An article describing libextractor was published in the <a href="http://www.linuxjournal.com/">LinuxJournal</a> and is available <a href="http://www.linuxjournal.com/article/7552">here</a>.
+An article describing libextractor was published in the <a href="http://www.linuxjournal.com/">LinuxJournal</a> and is available <a href="http://www.linuxjournal.com/article/7552">here</a>.  That article describes the API for versions 0.0.0 to 0.5.23 and not the more recent 0.6.x API.
 
 <a name="copyright"></a>
 <h2>Copyright and Contributions</h2>
-libExtractor is released under the GNU General Public License.
+libextractor is released under the GNU General Public License.
 All contributions must thus be put under the <a href="http://www.gnu.org/copyleft/gpl.html">GNU Public License (GPL)</a> or a compatible license.
 
 <h3>Mailing lists</h3>
@@ -64,8 +64,8 @@
 <p>
 Development of libextractor, and GNU in general, is a volunteer
 effort, and you can contribute.  For information, please
-read <a href="/help/">How to help GNU</a>.  If you'd like to get
-involved, it's a good idea to join the mailing list (see above).
+read <a href="/help/">How to help GNU</a>.  If you would like to get
+involved, it is a good idea to join the mailing list (see above).
 </p>
 
 <dl>
@@ -101,13 +101,10 @@
 <pre>
 # apt-get install libextractor-dev extract
 </pre>
-If you want to compile libextractor from source you will need an
-unusual amount of memory: 256 MB system memory is roughly the minimum,
-since gcc will take about 200 MB to compile one of the plugins.
-Otherwise, compiling by hand follows the usual sequence:
+Compiling by hand follows the usual sequence:
 <pre>
-$ tar xzvf libextractor.x.x.x.tar.gz
-$ cd libextractor.x.x.x
+$ tar xzvf libextractor.x.y.z.tar.gz
+$ cd libextractor.x.y.z
 $ ./configure
 $ make
 # make install
@@ -124,11 +121,11 @@
 
 <p>
 After installing libextractor, the extract tool can be used to obtain
-meta-data from documents.  By default, the extract tool uses the
-canonical set of plugins, which consists of all file-format-specific
+meta data from documents.  By default, the extract tool uses the
+canonical set of plugins, which consists of all format-specific
 plugins supported by the current version of libextractor together with
 the mime-type detection plugin.  If you are a user
-of <a href="http://www.ecst.csuchico.edu/%7Ejacobsd/bib/formats/bibtex.html">BibTeX</a>
+of <a href="http://www.ecst.csuchico.edu/~jacobsd/bib/formats/bibtex.html">BibTeX</a>
 the option <tt>-b</tt> is likely to come in handy to automatically
 create bibtex entries from documents that have been properly equipped
 with meta-data:
@@ -148,25 +145,7 @@
 }
 </pre>
 </p>
-
 <p>
-Another interesting option is <tt>-B LANG</tt>.  This option loads one
-of the language specific (but format-agnostic) plugins.  These plugins
-attempt to find plaintext in a document by matching strings in the
-document against a dictionary.  If the need for 200 MB of memory to
-compile libextractor seems mysterious, the answer lies in these
-plugins.  In order to be able to perform a fast dictionary search,
-a <a href="https://ng.gnunet.org/bloomfilter">bloomfilter</a>
-is created that allows fast probabilistic matching; gcc finds the
-resulting datastructure a bit hard to swallow.  The option <tt>-B</tt>
-is useful for formats that are undocumented, currently unsupported or
-for full-text search.  Note that the printable plugins typically print
-the entire text of the document in order.
-</p>
-
-<p>
-The supported languages at the moment are Danish (da), German (de), English (en), Spanish (es), Italian (it) and Norvegian (no).
-Supporting other languages is merely a question of adding (free) dictionaries in an appropriate character set.
 Further options are described in the extract manpage (<tt>man&nbsp;1&nbsp;extract</tt>).
 </p>
 <p>
@@ -175,6 +154,7 @@
 <h3>Examples:</h3>
 <pre>
 $ extract libextractor-0.1.3-1.src.rpm
+Keywords for file libextractor-0.1.3-1.src.rpm:
 os - linux
 resource-identifier - http://ovmj.org/libextractor/
 group -System Environment/Libraries
@@ -191,46 +171,44 @@
 unknown - SOURCE RPM 3.0
 mimetype - application/x-rpm
 </pre>
-<pre>$ extract extractor_logo.png
-unknown - The libextractor logo
+<pre>
+$ extract extractor_logo.png
+Keywords for file extractor_logo.png:
+image dimensions - 272x188
+thumbnail - (binary, 5932 bytes)
+image dimensions - 272x188
+thumbnail - (binary, 6427 bytes)
+image dimensions - 272x188
+thumbnail - (binary, 6427 bytes)
 mimetype - image/png
+mimetype - image/png
+image dimensions - 272x188
+keywords - The libextractor logo
 </pre>
-<p>
-The following is the output of extract for a Winword document using the plaintext extractors:
-</p>
-<pre>
-$ wget -q http://www.bayern.de/HDBG/polges.doc
-$ extract -B de polges.doc | head -n 4 
-unknown - FEE Politische Geschichte Bayerns
-Herausgegeben vom Haus der Geschichte als Heft 
-der zur Geschichte und Kultur Redaktion Manfred
- Bearbeitung Otto Copyright Haus der Geschichte
-München Gestaltung fürs Internet Rudolf Inhalt im.
-unknown - und das Deutsche Reich.
-unknown - und seine.
-unknown - Henker im Zeitalter von Reformation und Gegenreformation.
-</pre>
 
 <h2>Using the libextractor library</h2>
 <p>
-The following listing shows the code of a minimalistic program that uses libextractor.
-Compiling the fragment requires passing the option <tt>-lextractor</tt> to gcc.
-The <tt>EXTRACTOR_KeywordList</tt> is a simple linked list containing a keyword and a keyword type.
-For details and additional functions for loading plugins and manipulating the keyword list, see
+The following listing shows the code of a minimalistic program that
+uses libextractor.  Compiling the fragment requires passing the
+option <tt>-lextractor</tt> to gcc.  For details and additional
+functions for loading plugins and manipulating the keyword list, see
 the libextractor manpage (<tt>man&nbsp;3&nbsp;libextractor</tt>).
-Java programmers should note that a Java class that uses JNI to communicate with libextractor is also available.
-Python programmers will find that libextractor (since 0.5.0) can also be used from Python, just <tt>import Extractor</tt>.
+Java programmers should note that a Java class that uses JNI to
+communicate with libextractor is also available.  Python programmers
+will find that libextractor (since 0.5.0) can also be used from
+Python, just <tt>import Extractor</tt>.
 <br>
 <pre>
-int main(int argc, char * argv[]) {
-  EXTRACTOR_ExtractorList *extractors
-    = EXTRACTOR_loadDefaultLibraries();
-  EXTRACTOR_KeywordList *keywords
-    = EXTRACTOR_getKeywords(extractors, argv[1]);
-  EXTRACTOR_printKeywords(stdout,
-                          keywords);
-  EXTRACTOR_freeKeywords(keywords);
-  EXTRACTOR_removeAll(extractors);
+#include &lt;extractor.h&gt;
+
+int main(int argc, char * argv[]) 
+{
+  struct EXTRACTOR_PluginList *plugins
+    = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
+  EXTRACTOR_extract (plugins, argv[1],
+                     NULL, 0, 
+                     &EXTRACTOR_meta_data_print, stdout);
+  EXTRACTOR_plugin_remove_all (plugins);
 }
 </pre>
 </p>
@@ -277,51 +255,58 @@
 <p>
 The most complicated thing when writing a new plugin for libextractor is the writing of the actual parser for a specific format.
 Nevertheless, the basic pattern is always the same.
-The plugin library must be called <tt>libextractor_XXX.so</tt> where XXX denotes the file format supported by the plugin.
-The library must export a method <tt>libextractor_XXX_extract</tt> with the following signature:
+The plugin library must be called <tt>libextractor_XXX.so</tt> where XXX denotes the file format supported by the plugin and
+must be placed in the plugin directory (typically <tt>$PREFIX/lib/libextractor/</tt>).
+The library must export a method <tt>EXTRACTOR_XXX_extract</tt> with the following signature:
 <pre>
-struct EXTRACTOR_Keywords *
-libextractor_XXX_extract (char * filename,
-                          char * data,
-                          size_t size,
-                          struct EXTRACTOR_Keywords * prev,
-                          const char* options);
+int
+EXTRACTOR_XXX_extract (const char *data,
+                       size_t size,
+                       EXTRACTOR_MetaDataProcessor proc,
+                       void *proc_cls,
+                       const char* options);
 </pre>
 </p>
 <p>
-The argument filename specifies the name of the file that is being processed.
-<tt>data</tt> is a pointer to the (typically mmapped) contents of the
-file, and size is the filesize. Most plugins to not make use of the
-filename and just directly parse data directly, staring by verifying
-that the header of the data matches the specific format.
-<tt>prev</tt> is the list of keywords that have been extracted so far by other plugins for the file.
-The function is expected to return an updated list of keywords.
-The keywords are supposed to be converted into the UTF-8 character set by the plugin.
-If the format does not match the expectations of the plugin, <tt>prev</tt> is returned.
-Most plugins use a function like <tt>addKeyword</tt> to extend the list:
+<tt>data</tt> is a pointer to the contents of the
+file, and <tt>size</tt> is the number of bytes available in <tt>data</tt>. Most
+plugins start by verifying that <tt>size</tt> is sufficiently large and
+that the header of data matches the specific format.
+The <tt>extract</tt> function is expected to call <tt>proc</tt> with each
+meta data item found.  <tt>proc_cls</tt> must be passed as the first
+argument to <tt>proc</tt>, the other arguments correspond to the meta data found.
+Finally, <tt>options</tt> is an arbitrary string of options that the plugin is
+free to interpret. Most plugins ignore <tt>options</tt>.
 </p>
+<p>
+If the meta data extracted is a string, it is supposed to be converted
+into the UTF-8 character set by the plugin.  However, in cases where
+the character encoding used in the document is unknown, no conversion
+should be done.  Binary meta data can also be extracted.  Plugins
+indicate the format of the meta data using the <tt>format</tt>
+argument to <tt>proc</tt>.  Supported formats are UTF-8 strings, C
+Strings (for strings of unknown encoding) and binary data.  In
+addition to this rough categorization, the plugin is also supposed to
+indicate the mime type of the meta data.  For strings, that mime type
+is most often <tt>text/plain</tt>.  Finally, the plugin must specify
+the meta data type.  Common meta data types are &quot;author&quot;,
+&quot;title&quot; and &quot;mime-type&quot;.  The full signature of
+the &quot;proc&quot; callback is:
+</p>
 <pre>
-static void addKeyword(struct EXTRACTOR_Keywords ** list,
-                       char * keyword,
-                       EXTRACTOR_KeywordType type)
-{
-  EXTRACTOR_KeywordList * next;
-  next = malloc(sizeof(EXTRACTOR_KeywordList));
-  next-&gt;next = *list;
-  next-&gt;keyword = keyword;
-  next-&gt;keywordType = type;
-  *list = next;
-}
+typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
+                                           const char *plugin_name,
+                                           enum EXTRACTOR_MetaType type,
+                                           enum EXTRACTOR_MetaFormat format,
+                                           const char *data_mime_type,
+                                           const char *data,
+                                           size_t data_len);
 </pre>
 <p>
-A typical use of <tt>addKeyword</tt> is to add the mime-type once the
-file format has been established.  For example, the JPEG-extractor
-checks the first bytes of the JPEG header and then either aborts or
-claims the file to be a JPEG.  Note that the <tt>strdup</tt> in the
-code is important since the string will be deallocated later,
-typically in <tt>EXTRACTOR_freeKeywords()</tt>.  A list of supported
-keyword classifications (in the example <tt>EXTRACTOR_MIMETYPE</tt>)
-can be found in the <tt>extractor.h</tt> header file.
+If &quot;proc&quot; returns non-zero, the plugin should abort and
+return non-zero itself.  The &quot;extract&quot; function should
+always return zero unless a call to &quot;proc&quot; returned
+non-zero, in which case the plugin must return 1.
 </p>
 </td>
 </tr>
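
Reviewer's illustration (not part of r9950): the plugin contract described in the hunk above can be sketched as a small, self-contained C file. Everything here is an assumption made for illustration: the "demo" format and its "DEMO" magic bytes are invented, the enumerator names are placeholders standing in for whatever <extractor.h> really defines, and passing a length that includes the terminating NUL is a guess. Only the two function signatures are taken from the documentation text in this commit.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins so the sketch compiles without libextractor installed.
   A real plugin would include <extractor.h> instead; the enumerator
   names below are hypothetical placeholders, not the library's. */
enum EXTRACTOR_MetaType { SKETCH_METATYPE_MIMETYPE = 1 };
enum EXTRACTOR_MetaFormat { SKETCH_METAFORMAT_UTF8 = 1 };

typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);

/* Hypothetical plugin for an invented format whose files begin with "DEMO". */
int
EXTRACTOR_demo_extract (const char *data,
                        size_t size,
                        EXTRACTOR_MetaDataProcessor proc,
                        void *proc_cls,
                        const char *options)
{
  static const char mime[] = "application/x-demo";

  (void) options;               /* ignored, as most plugins do */
  if ( (size < 4) || (0 != memcmp (data, "DEMO", 4)) )
    return 0;                   /* not our format: no items, no error */
  /* Assumption: the length passed includes the terminating NUL. */
  if (0 != proc (proc_cls, "demo",
                 SKETCH_METATYPE_MIMETYPE, SKETCH_METAFORMAT_UTF8,
                 "text/plain", mime, sizeof (mime)))
    return 1;                   /* the callback asked us to abort */
  return 0;
}

/* Example callback: prints each item and counts it through cls. */
int
count_and_print (void *cls, const char *plugin_name,
                 enum EXTRACTOR_MetaType type,
                 enum EXTRACTOR_MetaFormat format,
                 const char *data_mime_type,
                 const char *data, size_t data_len)
{
  int *count = cls;

  (void) type; (void) format; (void) data_len;
  printf ("%s (%s): %s\n", plugin_name, data_mime_type, data);
  (*count)++;
  return 0;                     /* zero means "keep going" */
}

/* Example callback that always requests an abort. */
int
always_abort (void *cls, const char *plugin_name,
              enum EXTRACTOR_MetaType type,
              enum EXTRACTOR_MetaFormat format,
              const char *data_mime_type,
              const char *data, size_t data_len)
{
  (void) cls; (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  return 1;
}

/* Runs the sketch plugin over a buffer and returns how many items
   it reported to the callback. */
int
sketch_count_items (const char *buf, size_t len)
{
  int count = 0;

  EXTRACTOR_demo_extract (buf, len, &count_and_print, &count, NULL);
  return count;
}
```

Calling sketch_count_items on a buffer that starts with "DEMO" reports one item; any other buffer reports none. Handing always_abort to EXTRACTOR_demo_extract on a matching buffer makes the plugin return 1, which is the abort path the documentation describes.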

Modified: Extractor-docs/WWW/index.html
===================================================================
--- Extractor-docs/WWW/index.html       2010-01-01 13:24:36 UTC (rev 9949)
+++ Extractor-docs/WWW/index.html       2010-01-01 13:47:21 UTC (rev 9950)
@@ -2,8 +2,11 @@
 <html><head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <title>GNU libextractor - GNU Project - Free Software Foundation</title>
-<meta name="content-language" content="en"><meta name="language" content="en"><meta name="description" content="a simple library for keyword extraction"><meta name="author" content="Vids Samanta and Christian Grothoff">
-<meta name="rights" content="(C) 2002,2003,2004,2005,2006,2007,2009 by Vids Samanta and Christian Grothoff">
+<meta name="content-language" content="en">
+<meta name="language" content="en">
+<meta name="description" content="a simple library for keyword extraction">
+<meta name="author" content="Vids Samanta and Christian Grothoff">
+<meta name="rights" content="(C) 2002,2003,2004,2005,2006,2007,2009,2010 by Vids Samanta and Christian Grothoff">
 <meta name="keywords" content="keyword, extraction, mp3, html, pdf, images, jpeg, gif, ps, mime, real, qt, asf, mpeg, avi, riff, tiff, summary, summaries, kbps, format, mime-type, zip, elf, doc, ppt, xls, sha-1, md5, open office, sxw, dvi, id3, id3v2, id3v2.3, id3v2.4, thumbnails, exiv2, nsf, sid, flv, flac">
 <meta name="robots" content="index,follow">
 <meta name="revisit-after" content="28 days">
@@ -17,7 +20,8 @@
 <table width="99%" border="0" cellpadding="0" cellspacing="0">
 <tbody>
 <tr><td colspan="2" width="99%" bgcolor="#99bbff" align="center">GNU libextractor - a simple library for keyword extraction</td></tr>
-<tr><td valign="top"><table width="15%" border="0" cellpadding="2" cellspacing="3">
+<tr><td valign="top">
+<table width="15%" border="0" cellpadding="2" cellspacing="3">
 <tbody>
 <tr><th nowrap="nowrap" bgcolor="99BBFF"><a href="http://www.gnu.org/software/libextractor/">Home</a></th></tr>
 <tr><td bgcolor="efefef"><a href="#about">About</a></td></tr>
@@ -43,12 +47,13 @@
 libextractor can be downloaded from this site or the <a href="http://www.gnu.org/prep/ftp.html">GNU mirrors</a>.
 </p>
 <p>
-The goal is to provide developers of file-sharing networks or
-WWW-indexing bots with a universal library to obtain simple keywords to
-match against queries.
-libextractor contains a shell-command <tt>extract</tt> that, similar to the
-well-known <tt>file</tt> command, can extract meta-data from a file an print
-the results to stdout.
+The goal is to provide developers of file-sharing networks, browsers
+or WWW-indexing bots with a universal library to obtain simple
+keywords and meta data to match against queries and to show to users
+instead of only relying on filenames.  libextractor contains a
+shell-command <tt>extract</tt> that, similar to the
+well-known <tt>file</tt> command, can extract meta data from a file and
+print the results to stdout.
 </p>
 <p>
 Currently, libextractor supports the following formats:




