Re: lynx-dev Lynx parser-anaylser and structure (fwd)


From:	Klaus Weide
Subject:	Re: lynx-dev Lynx parser-anaylser and structure (fwd)
Date:	Tue, 2 May 2000 13:44:22 -0500 (CDT)

> From: Gog the Dwarf <address@hidden>
> 
> I am sorry to write to you this way, without being on the mailing list
> (I've subscribed, but the work is *quite* urgent and I need an answer).
> 
> I am currently a student in Grenoble and need to write an HTML page
> interpreter that converts the page for use with a C++ library (it is
> FLTK, but that's not the main point).

I hope you are aware that lynx code is C and not C++.

> To make this work easier, I'd like to reuse the Lynx sources. The
> problem is that I cannot find development guides that explain how it
> works.

Reusing Lynx source may not be the best way.  Very likely it's not the
fastest way.  You can probably find something easier to reuse, somewhere
"out there".

You may want to look at

   Linkname: Libwww - the W3C Sample Code Library
        URL: http://www.w3.org/Library/

and especially

   Linkname: Libwww Architecture
        URL: http://www.w3.org/Library/User/Architecture/

and also the section "Should I use the Internal SGML/HTML Parser?" in

   Linkname: Libwww Quick Guide
        URL: http://www.w3.org/Library/User/Start.html

You may end up deciding to use the W3C Library.  If not, *some* of
the stuff described in those pages still applies to the Lynx code -
because the WWW/Library part of the Lynx code is derived from a predecessor
version of the Library.

> I'd like to reuse the parser and analyser that builds the tree
> representation of the HTML code (I presume there is such a tree built),
> and to know which methods can be used in order to *read* the tree (to
> convert the file into the lib format).

There is no tree built in memory.  Input character data is parsed on the
fly (see SGML.c) and passed to a "structured stream" (see SGML.h) in
terms of [let's call it] "events": put_character, put_string, _write,
start_element, end_element, put_entity.  The regular Lynx structured
stream (HTML.c) doesn't keep a tree structure in memory, it just "does
things" in response to events - mostly, appending characters to the
HText object (GridText.c) that represents the rendered document via
the HText interface.
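
To make the event idea concrete, here is a minimal sketch in C.  It is
not the declaration from SGML.h: EventSink, TextBuffer, toy_parse and
make_text_sink are hypothetical names invented for illustration, the
toy parser only understands bare <tag> and </tag>, and only three of
the events are modelled.  The point is that the sink keeps no tree; it
just appends to a buffer as events arrive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event sink: one callback per parser "event".
 * Illustrative only; these are not Lynx's real declarations. */
typedef struct EventSink {
    void (*put_character)(void *ctx, char c);
    void (*start_element)(void *ctx, const char *tag);
    void (*end_element)(void *ctx, const char *tag);
    void *ctx;                      /* state owned by the sink */
} EventSink;

/* A sink that builds no tree: it only accumulates rendered text. */
typedef struct TextBuffer {
    char data[256];
    size_t len;
} TextBuffer;

static void buf_append(TextBuffer *b, const char *s, size_t n)
{
    size_t room = sizeof b->data - 1 - b->len;
    if (n > room)
        n = room;                   /* silently truncate when full */
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
}

static void text_put_character(void *ctx, char c)
{
    buf_append(ctx, &c, 1);
}

static void text_start_element(void *ctx, const char *tag)
{
    /* "Do something" in response to the event: <p> starts a new line. */
    if (strcmp(tag, "p") == 0)
        buf_append(ctx, "\n", 1);
}

static void text_end_element(void *ctx, const char *tag)
{
    (void)ctx;
    (void)tag;                      /* nothing to do in this sketch */
}

static EventSink make_text_sink(TextBuffer *b)
{
    EventSink s = { text_put_character, text_start_element,
                    text_end_element, b };
    return s;
}

/* Toy on-the-fly parser: emits events as it scans, stores nothing. */
static void toy_parse(const char *in, const EventSink *sink)
{
    assert(sink != NULL);
    while (*in) {
        if (*in == '<') {
            char tag[32];
            size_t n = 0;
            int closing = 0;
            in++;
            if (*in == '/') {
                closing = 1;
                in++;
            }
            while (*in && *in != '>') {
                if (n < sizeof tag - 1)
                    tag[n++] = *in;
                in++;
            }
            tag[n] = '\0';
            if (*in == '>')
                in++;
            if (closing)
                sink->end_element(sink->ctx, tag);
            else
                sink->start_element(sink->ctx, tag);
        } else {
            sink->put_character(sink->ctx, *in++);
        }
    }
}
```

Feeding "<p>Hi</p><p>there</p>" through toy_parse with a text sink
leaves "\nHi\nthere" in the buffer; nothing about the element nesting
is kept once the events have been handled.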

You could replace Lynx's HTML.c functions with something that builds some
form of memory representation of the structure, if that's what you need,
but you'll have to invent and implement that representation yourself.
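
As a rough idea of what such a replacement might look like, here is a
minimal sketch in C of handlers that build a tree from start/end
events.  Everything in it (Node, TreeBuilder, the tree_* functions) is
hypothetical and not part of Lynx; it ignores attributes and character
data, and it assumes that an end tag closes the nearest open element
with that name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory element node; not a Lynx data structure. */
typedef struct Node {
    char name[32];
    struct Node *parent;
    struct Node *first_child;
    struct Node *last_child;
    struct Node *next_sibling;
} Node;

typedef struct TreeBuilder {
    Node *root;                     /* synthetic "#document" node */
    Node *current;                  /* innermost open element */
} TreeBuilder;

static Node *node_new(const char *name, Node *parent)
{
    Node *n = calloc(1, sizeof *n); /* zeroed, so name stays terminated */
    if (n == NULL)
        return NULL;
    strncpy(n->name, name, sizeof n->name - 1);
    n->parent = parent;
    if (parent != NULL) {
        if (parent->last_child != NULL)
            parent->last_child->next_sibling = n;
        else
            parent->first_child = n;
        parent->last_child = n;
    }
    return n;
}

static int tree_init(TreeBuilder *tb)
{
    tb->root = node_new("#document", NULL);
    tb->current = tb->root;
    return tb->root != NULL ? 0 : -1;
}

/* Called for a start_element event: open a child of the current node. */
static void tree_start_element(TreeBuilder *tb, const char *name)
{
    Node *n;
    assert(tb->current != NULL);
    n = node_new(name, tb->current);
    if (n != NULL)
        tb->current = n;
}

/* Called for an end_element event: close the nearest open element with
 * this name, implicitly closing anything nested inside it.  A stray end
 * tag with no matching open element is ignored. */
static void tree_end_element(TreeBuilder *tb, const char *name)
{
    Node *n = tb->current;
    while (n != tb->root && strcmp(n->name, name) != 0)
        n = n->parent;
    if (n != tb->root)
        tb->current = n->parent;
}

/* Reading the tree: a recursive walk, here just counting nodes. */
static size_t tree_count(const Node *n)
{
    size_t total = 1;
    const Node *c;
    for (c = n->first_child; c != NULL; c = c->next_sibling)
        total += tree_count(c);
    return total;
}

static void tree_free(Node *n)
{
    Node *c = n->first_child;
    while (c != NULL) {
        Node *next = c->next_sibling;
        tree_free(c);
        c = next;
    }
    free(n);
}
```

A converter would walk the finished tree the same way tree_count does,
emitting whatever the target library needs at each node.  Real HTML
needs considerably more care (optional end tags, empty elements, text
nodes) than this sketch attempts.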

You'll need time to spare and familiarity with a debugger to understand
how the functions are called.  (Setting a breakpoint with
'b HTML_start_element' may be a good place to start.)

Good luck.  Let us know what you come up with.

   Klaus
