Re: lynx-dev Lynx parser-anaylser and structure (fwd)

From: Klaus Weide
Subject: Re: lynx-dev Lynx parser-anaylser and structure (fwd)
Date: Tue, 2 May 2000 13:44:22 -0500 (CDT)

> From: Gog the Dwarf <address@hidden>
> I am sorry to write to you this way, without being in the mailing list
> (I've subscribed but the work is *quite* urgent and need an answer).
> I am currently a student in Grenoble and need to write an HTML page
> interpreter that converts the page into calls to some C++ library (this
> is FLTK but that's not the big point).

I hope you are aware that lynx code is C and not C++.

> In order to make this work easy, I'd like to reuse Lynx sources. The
> problem is that I cannot find development guides that explain to me how
> it works.

Reusing Lynx source may not be the best way.  Very likely it's not the
fastest way.  You can probably find something easier to reuse, somewhere
"out there".

You may want to look at

   Linkname: Libwww - the W3C Sample Code Library

and especially

   Linkname: Libwww Architecture

and also the section "Should I use the Internal SGML/HTML Parser?" in

   Linkname: Libwww Quick Guide

You may end up deciding to use the W3C Library.  If not, *some* of
the stuff described in those pages still applies for the Lynx code -
because the WWW/Library part of the Lynx code is derived from a predecessor
version of the Library.

> I'd like to reuse the parser and analyser that builds the tree
> representation of the HTML code (I presume there is such a tree built),
> and I'd like to know which methods can be used to *read* the tree (to
> convert the file into the lib format).

There is no tree built in memory.  Input character data is parsed on the
fly (see SGML.c) and passed to a "structured stream" (see SGML.h) in
terms of [let's call it] "events": put_character, put_string, _write,
start_element, end_element, put_entity.  The regular Lynx structured
stream (HTML.c) doesn't keep a tree structure in memory, it just "does
things" in response to events - mostly, appending characters to the
HText object (GridText.c) that represents the rendered document via
the HText interface.
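
To make the "events" idea concrete, here is a minimal sketch in C.  It is
NOT the Lynx code: EventSink, toy_parse, TextOut and render are names
invented for this illustration, and the toy parser understands only
<name> and </name>.  It only shows the shape of the design: the parser
walks the input once and calls the sink, and the sink appends to an
output buffer without ever storing a tree.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event sink: one callback per parser event.
 * (Illustration only; not the structured-stream declaration in SGML.h.) */
typedef struct {
    void *ctx;
    void (*put_character)(void *ctx, char c);
    void (*start_element)(void *ctx, const char *name);
    void (*end_element)(void *ctx, const char *name);
} EventSink;

/* Toy on-the-fly parser: scans the input once and reports events.
 * Handles only <name> and </name>; no attributes, comments or entities. */
static void toy_parse(const char *in, const EventSink *sink)
{
    assert(in != NULL && sink != NULL);
    while (*in) {
        if (*in == '<') {
            int closing = (in[1] == '/');
            const char *p = in + 1 + closing;
            char name[32];
            size_t n = 0;

            while (*p && *p != '>') {
                if (n + 1 < sizeof name)
                    name[n++] = (char) tolower((unsigned char) *p);
                p++;
            }
            name[n] = '\0';
            if (closing)
                sink->end_element(sink->ctx, name);
            else
                sink->start_element(sink->ctx, name);
            in = (*p == '>') ? p + 1 : p;
        } else {
            sink->put_character(sink->ctx, *in++);
        }
    }
}

/* A sink that "does things" per event: it appends rendered text to a
 * fixed buffer and keeps no record of the document structure. */
typedef struct {
    char buf[256];
    size_t len;
} TextOut;

static void out_char(void *ctx, char c)
{
    TextOut *t = ctx;

    if (t->len + 1 < sizeof t->buf) {
        t->buf[t->len++] = c;
        t->buf[t->len] = '\0';
    }
}

static void out_start(void *ctx, const char *name)
{
    if (strcmp(name, "li") == 0) {      /* list items get a bullet */
        out_char(ctx, '*');
        out_char(ctx, ' ');
    }
}

static void out_end(void *ctx, const char *name)
{
    if (strcmp(name, "li") == 0 || strcmp(name, "p") == 0)
        out_char(ctx, '\n');            /* end the rendered line */
}

/* Parse html and return the rendered text stored in *t. */
const char *render(const char *html, TextOut *t)
{
    EventSink sink = { t, out_char, out_start, out_end };

    t->len = 0;
    t->buf[0] = '\0';
    toy_parse(html, &sink);
    return t->buf;
}
```

Calling render("<ul><li>a</li><li>b</li></ul>", &t) on a TextOut t leaves
two bulleted lines in t.buf; once the input has been consumed, nothing
about the <ul>/<li> nesting is kept anywhere.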

You could replace Lynx's HTML.c functions with something that builds some
form of memory representation of the structure, if that's what you need,
but you'll have to invent and implement that representation yourself.
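
As a rough idea of what such a replacement could look like, here is a
sketch of event handlers that build a tree instead of rendering.  Every
name in it (Node, TreeBuilder, tb_start_element, ...) is hypothetical and
the signatures are simplified; the real HTML.c functions take different
arguments (attributes, among other things), so treat this as an outline
of the technique only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory representation: a plain n-ary tree. */
typedef struct Node {
    char name[16];              /* element name, or "#text" */
    char text[64];              /* character data (text nodes only) */
    struct Node *parent, *first_child, *last_child, *next;
} Node;

typedef struct {
    Node *root;                 /* synthetic "#document" node */
    Node *current;              /* innermost element still open */
} TreeBuilder;

static Node *node_new(const char *name)
{
    Node *n = calloc(1, sizeof *n);     /* zero-filled: strings stay terminated */

    if (n == NULL)
        abort();
    strncpy(n->name, name, sizeof n->name - 1);
    return n;
}

static void append_child(Node *parent, Node *child)
{
    child->parent = parent;
    if (parent->last_child)
        parent->last_child->next = child;
    else
        parent->first_child = child;
    parent->last_child = child;
}

void tb_init(TreeBuilder *tb)
{
    tb->root = tb->current = node_new("#document");
}

/* start_element event: open a child of the current element, descend into it. */
void tb_start_element(TreeBuilder *tb, const char *name)
{
    Node *n = node_new(name);

    append_child(tb->current, n);
    tb->current = n;
}

/* end_element event: close the nearest open element with this name.
 * A stray end tag that matches nothing is ignored. */
void tb_end_element(TreeBuilder *tb, const char *name)
{
    Node *n = tb->current;

    while (n != tb->root && strcmp(n->name, name) != 0)
        n = n->parent;
    if (n != tb->root)
        tb->current = n->parent;
}

/* put_character event: extend the trailing text node, or start a new one.
 * Text beyond the fixed buffer size is silently dropped in this sketch. */
void tb_put_character(TreeBuilder *tb, char c)
{
    Node *last;
    size_t len;

    assert(tb != NULL && tb->current != NULL);
    last = tb->current->last_child;
    if (last == NULL || strcmp(last->name, "#text") != 0) {
        last = node_new("#text");
        append_child(tb->current, last);
    }
    len = strlen(last->text);
    if (len + 1 < sizeof last->text)
        last->text[len] = c;
}

/* Free a node together with its following siblings and all descendants. */
void tb_free(Node *n)
{
    while (n) {
        Node *next = n->next;

        tb_free(n->first_child);
        free(n);
        n = next;
    }
}
```

After the events for <p>hi<b>!</b></p> have been fed in, tb.root has one
child "p", whose children are a "#text" node holding "hi" and a "b"
element.  Reading the tree is then an ordinary walk over first_child and
next.  The fixed-size buffers just keep the sketch short; a real
implementation would want growable strings and attribute storage.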

You'll need time to spare and familiarity with a debugger to understand
how functions are called.  (Setting a breakpoint, e.g. 'b HTML_start_element'
in gdb, may be a good place to start.)

Good luck.  Let us know what you come up with.

