Re: [Lynx-dev] Extracting text from an HTML file

From: David Woolley
Subject: Re: [Lynx-dev] Extracting text from an HTML file
Date: Sat, 01 Mar 2008 12:12:40 +0000
User-agent: Thunderbird (X11/20071031)

is it possible to use a program to get all the text only of
the html? as if I open the html with a browser, then click
ctrl+a and then copy paste all the selected text

Also, depending on what sites you are going to do this on, make sure you read their terms and conditions, and, unless you have permission by other means, use a tool that honours the robots exclusion rules. I don't believe that Lynx has any provision for doing that, so you will probably need to use wget, if you do not have explicit permission to machine process the pages.

Many sites forbid any machine processing other than that needed to display the page in real time, in accordance with the HTML they provided (i.e. you cannot strip adverts), or to cache pages that are not marked uncacheable. They also, generally, accept indexing by search engines, but may use the robots mechanism to restrict this. Robots exclusion applies to any automated process, not just indexing, and can selectively bar individual robots.
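
As an illustration of what honouring the robots exclusion rules involves, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the robot names ("badbot", "mytool") and the example.org URLs are all made up for the example; a real tool would fetch the site's own robots.txt and identify itself with its own user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It bars one named robot entirely and bars everyone from /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: badbot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# The selectively barred robot is refused everywhere.
print(may_fetch(parser, "badbot", "http://example.org/index.html"))
# Any other robot falls under the "*" rules.
print(may_fetch(parser, "mytool", "http://example.org/index.html"))
print(may_fetch(parser, "mytool", "http://example.org/private/notes.html"))
```

Against a live site, the same parser can be pointed at the real file with set_url() followed by read() instead of parse(). Passing this check only covers the robots rules; it does not replace reading the site's terms and conditions.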

David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.
