
[Lynx-dev] Extracting text from an HTML file (was: help)

From: David Woolley
Subject: [Lynx-dev] Extracting text from an HTML file (was: help)
Date: Sat, 01 Mar 2008 12:03:12 +0000
User-agent: Thunderbird (X11/20071031)

Tim Chase wrote:
>> is it possible to use a program to get all the text only of
>> the html? as if I open the html with a browser, then click
>> ctrl+a and then copy paste all the selected text

You need to define what you mean by this much more precisely. Do you want title elements to be included? Do you want initial values for input elements (which are attributes, not text nodes) to be included? Do you want text nodes in a pre element handled specially? Can you constrain the input to be valid HTML (most web sites aren't)? If not, what error recovery do you want? Etc.

> Sounds like you're reaching for the "-dump" parameter that Lynx
> supports, as described in the man-page:
>
>    lynx -dump

This will insert extra characters to achieve a rendering of the text. If only the input characters are wanted, one might be better using a Perl script to strip out all the tags, directives, etc., and resolve entities.
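
As a rough illustration of the tag-stripping approach, here is a minimal sketch in Python rather than Perl, using only the standard library. The class and function names are made up for this example, and the decisions it bakes in (title text is kept, script and style contents are dropped, attribute values are ignored, whitespace is left exactly as found) are assumptions that answer some of the questions above in one particular way.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes only; tags, comments and declarations are dropped."""

    # Assumption: the contents of these elements are not wanted as "text".
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; and &#233;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the concatenated text nodes of an HTML string, unrendered."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Calling html_to_text("<p>Fish &amp; chips</p>") gives back the input characters with the entity resolved and nothing added for layout. The parser is tolerant of invalid markup, but what it does with badly broken input is its own choice of error recovery, not necessarily the one you want.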

You could also use the nsgmls tools, provided the input is valid, to get the infoset representation and then strip out the tag and attribute lines. You still need to decide how you will deal with the resulting newlines and any newlines in the original text nodes.
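
A sketch of the filtering step, assuming the usual nsgmls line-oriented output in which each line starts with a one-character code: "(" and ")" for start and end tags, "A" for attributes, and "-" for character data, with record ends inside the data written as a backslash followed by "n". The function name is hypothetical, and only the escapes I am reasonably sure of are handled; anything else is passed through untouched.

```python
import re

# Escapes handled inside data lines (assumed from the nsgmls output format):
#   \\   a literal backslash
#   \n   a record end, emitted here as a newline
#   \|   brackets around internal SDATA entity text, dropped here
#   \ooo a character given as three octal digits
_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def _unescape(match):
    code = match.group(1)
    if code == "\\":
        return "\\"
    if code == "n":
        return "\n"
    if code == "|":
        return ""
    return chr(int(code, 8))


def esis_text(lines):
    """Keep only the data lines of nsgmls output and undo their escapes."""
    parts = []
    for line in lines:
        if line.startswith("-"):  # tag, attribute and other lines are skipped
            data = line[1:].rstrip("\r\n")
            parts.append(_ESCAPE.sub(_unescape, data))
    return "".join(parts)
```

This could be fed the lines of the parser's output, for example from a pipe. Whether a record end should become a newline, a space, or nothing at all is exactly the decision mentioned above; the sketch picks a newline.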

You could probably modify lynx to dump the text nodes as it identifies them, but Lynx is quite big and complex.

P.S. when posting to support lists, please use a subject that is a precis of the complete question.

> This can then be automated via a script, or you may be able to
> use the '-crawl' parameter in conjunction with -dump to walk a
> site.  I didn't see anything in my man-page to limit link
> recursion-depth as wget offers.
>
> If you don't want the link-lists, you can use the -nolist
> parameter as well.
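
For the scripted route, a minimal sketch of driving Lynx from Python. It assumes a lynx binary on the PATH and uses only the -dump and -nolist options mentioned above; the function names are invented for this example, and the result is still Lynx's rendering of the page, not the raw text nodes.

```python
import subprocess


def lynx_dump_command(url, keep_link_list=False):
    """Build the argument list for a rendered text dump of one URL."""
    cmd = ["lynx", "-dump"]
    if not keep_link_list:
        cmd.append("-nolist")  # suppress the list of links after the text
    cmd.append(url)
    return cmd


def lynx_dump(url, keep_link_list=False):
    """Run lynx and return its rendered text; raises if lynx exits non-zero."""
    result = subprocess.run(
        lynx_dump_command(url, keep_link_list),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Looping lynx_dump over a list of URLs gives a crude alternative to -crawl, with the recursion depth under the script's own control.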



David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
