Re: [Lynx-dev] Extracting text from an HTML file

From:	David Woolley
Subject:	Re: [Lynx-dev] Extracting text from an HTML file
Date:	Sat, 01 Mar 2008 12:12:40 +0000
User-agent:	Thunderbird 2.0.0.9 (X11/20071031)

is it possible to use a program to get all the text only of
the html? as if I open the html with a browser, then click
ctrl+a and then copy paste all the selected text

Also, depending on what sites you are going to do this on, make sure you read their terms and conditions, and, unless you have permission by other means, use a tool that honours the robots exclusion rules. I don't believe that Lynx has any provision for doing that, so you will probably need to use wget, if you do not have explicit permission to machine process the pages.

Many sites forbid any machine processing other than that needed to display the page in real time, and in accordance with the HTML they provided (i.e. you cannot strip adverts), or to cache pages that are not marked uncacheable. They also, generally, accept indexing for search engines, but may use the robots mechanism to restrict this. Robots exclusion applies to any automated process, not just indexing, and can selectively bar robots.
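As a rough illustration of what "honouring the robots exclusion rules" means for a home-made script, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.org URLs and the agent names "text-extractor" and "some-indexer" are all made up for the example; a real script would fetch the site's own robots.txt and use its own agent name. It also shows a site barring one robot selectively while only restricting the others.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: every robot is kept out of /private/, and the
# (made-up) robot "text-extractor" is barred from the whole site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: text-extractor
Disallow: /
"""

if __name__ == "__main__":
    urls = (
        "http://example.org/index.html",
        "http://example.org/private/a.html",
    )
    for agent in ("some-indexer", "text-extractor"):
        for url in urls:
            verdict = allowed_by_robots(EXAMPLE_ROBOTS, agent, url)
            print(agent, url, "allowed" if verdict else "disallowed")
```

Passing this check only says what the robots file permits; it does not replace reading the site's terms and conditions.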




--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
