[Lynx-dev] Extracting text from an HTML file (was: help)
From: David Woolley
Date: Sat, 01 Mar 2008 12:03:12 +0000
User-agent: Thunderbird 2.0.0.9 (X11/20071031)
Tim Chase wrote:
>> is it possible to use a program to get all the text only of
>> the html? as if I open the html with a browser, then click
>> ctrl+a and then copy paste all the selected text
You need to define what you mean by this much more precisely. Do you
want title elements to be included? Do you want initial values for
input elements (which are attributes, not text nodes) to be included?
Do you want text nodes in a pre element handled specially? Can you
constrain the input to be valid HTML (most web sites aren't)? If not,
what error recovery do you want? Etc.
> Sounds like you're reaching for the "-dump" parameter that Lynx
> supports, as described in the man-page:
>
>   lynx -dump http://www.example.com
This will insert extra characters to achieve a rendering of the text.
If only the input characters are wanted, one might be better off using a
Perl script to strip out all the tags, directives, etc., and resolve
entities.
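
As a rough illustration of the tag-stripping approach (in Python rather than Perl, using only the standard library's html.parser), the sketch below keeps text nodes and resolves entities. The names TextNodeCollector and extract_text are made up for this example, and dropping the contents of script and style elements is an assumption about what is wanted, not something settled above.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect text nodes only; tags, comments and declarations are dropped."""

    # Assumption: the contents of these elements are not wanted as "text".
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; before
        # the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html_source):
    """Return the concatenated text nodes of html_source, unmodified."""
    parser = TextNodeCollector()
    parser.feed(html_source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Note that this adds no characters of its own, so text from adjacent elements runs together; attribute values (such as initial values of input elements) are ignored entirely.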
You could also use the nsgmls tools, provided the input is valid, to get
the infoset representation and then strip out the tag and attribute
lines. You still need to decide how you will deal with the resulting
newlines and any newlines in the original text nodes.
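
One possible newline policy, sketched in Python with the standard library's html.parser: collapse each run of whitespace (including newlines) to a single space, except inside pre elements, where text nodes are kept verbatim. This is only one of several defensible choices, and the names WhitespaceAwareCollector and text_with_policy are invented for the example.

```python
import re
from html.parser import HTMLParser


class WhitespaceAwareCollector(HTMLParser):
    """Collect text nodes, collapsing whitespace except inside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._pre_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data):
        if self._pre_depth:
            # Inside <pre>: keep newlines and spacing exactly as written.
            self.chunks.append(data)
        else:
            # Elsewhere: any run of whitespace becomes one space.
            self.chunks.append(re.sub(r"\s+", " ", data))


def text_with_policy(html_source):
    """Return the text nodes of html_source under the policy above."""
    parser = WhitespaceAwareCollector()
    parser.feed(html_source)
    parser.close()
    return "".join(parser.chunks)
```

The sketch does not insert separators between block-level elements; whether it should is another of the decisions that has to be made up front.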
You could probably modify lynx to dump the text nodes as it identifies
them, but Lynx is quite big and complex.
P.S. when posting to support lists, please use a subject that is a
precis of the complete question.
> This can then be automated via a script, or you may be able to
> use the '-crawl' parameter in conjunction with -dump to walk a
> site. I didn't see anything in my man-page to limit link
> recursion-depth as wget offers.
>
> If you don't want the link-lists, you can use the -nolist
> parameter as well.
>
> -tim
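
A minimal sketch of automating the -dump and -nolist invocation from a script, here in Python. It assumes a lynx binary is on the PATH; the helper names build_lynx_command and dump_page are hypothetical, and only the -dump and -nolist options mentioned above are used.

```python
import subprocess


def build_lynx_command(url, with_link_list=False):
    """Build the argument list for a 'lynx -dump' run on url."""
    cmd = ["lynx", "-dump"]
    if not with_link_list:
        # -nolist suppresses the list of links appended to the dump.
        cmd.append("-nolist")
    cmd.append(url)
    return cmd


def dump_page(url, with_link_list=False):
    """Run lynx and return its rendered text (requires lynx installed)."""
    result = subprocess.run(
        build_lynx_command(url, with_link_list),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if lynx exits non-zero
    )
    return result.stdout
```

For example, dump_page("http://www.example.com") would return the same rendering that the command line prints, extra layout characters included.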
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.