Re: [Lynx-dev] seporating main text from whole page

From: David Woolley
Subject: Re: [Lynx-dev] seporating main text from whole page
Date: Fri, 30 Mar 2007 08:00:33 +0100 (BST)

> i need to get only the report body, not the whole page as Lynx does....

As I noted, that's almost certainly illegal.  However, there is an
XML-based format that is commonly used to give abstracts of news items.
I won't name it, in case part of your task was to discover it, but if
the commercial people are on the ball, you will only get enough of
the article to make you want to read the full page, with its 

(In most cases, if you are given the whole article, you are probably
viewing a propaganda site, rather than a news site; i.e. the editorial
is the advertising.)

One other point: in the unlikely event of actually dealing with something
that was designed with the semantic web in mind, you would need to
process the document object model, which means using a full SGML parser.
Normal web browsers are about taking syntactically badly broken HTML
and making it visually usable; they therefore devote most of their code
to dealing with SGML violations, whereas a semantic web document ought to
be easy to parse.
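
As a rough illustration of the "easy to parse" case, here is a minimal
sketch in Python.  It assumes the page is well-formed XML (e.g. XHTML)
and that the report sits in an element with a known id.  The id
"report-body" and the sample page are made up for the example, and
xml.dom.minidom is a plain XML parser standing in for a full SGML one,
so it will simply reject the broken HTML that browsers tolerate.

```python
from xml.dom import minidom

# Hypothetical sample page; the id "report-body" is an assumption made
# for this sketch, not something taken from any real site.
SAMPLE = """<html>
  <body>
    <div id="nav">Home | News | Sport</div>
    <div id="report-body"><p>First paragraph.</p><p>Second one.</p></div>
    <div id="ads">Buy things!</div>
  </body>
</html>"""


def text_of(node):
    """Concatenate all text nodes below node, in document order."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def extract_by_id(markup, wanted_id):
    """Return the text of the first element with the given id, or None.

    Raises xml.parsers.expat.ExpatError if markup is not well-formed.
    """
    doc = minidom.parseString(markup)
    # "*" matches every element in the document.
    for elem in doc.getElementsByTagName("*"):
        if elem.getAttribute("id") == wanted_id:
            return text_of(elem)
    return None


if __name__ == "__main__":
    print(extract_by_id(SAMPLE, "report-body"))
```

On a real-world page the parse step would most likely fail outright,
which is the point made above: the document object model approach only
pays off when the markup was written to be machine-readable.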

