From: | David Woolley |
Subject: | Re: [Lynx-dev] Extracting text from an HTML file |
Date: | Sat, 01 Mar 2008 12:12:40 +0000 |
User-agent: | Thunderbird 2.0.0.9 (X11/20071031) |
Is it possible to use a program to get only the text of an HTML file, as if I opened the file in a browser, pressed Ctrl+A, and then copied and pasted all the selected text?
Also, depending on which sites you are going to do this on, make sure you read their terms and conditions and, unless you have permission by other means, use a tool that honours the robots exclusion rules. I don't believe that Lynx has any provision for doing that, so, if you do not have explicit permission to machine process the pages, you will probably need to use wget.
Many sites forbid any machine processing other than that needed to display the page in real time and in accordance with the HTML they provided (i.e. you cannot strip adverts), or to cache pages that are not marked uncacheable. They also generally accept indexing for search engines, but may use the robots mechanism to restrict this. Robots exclusion applies to any automated process, not just indexing, and can selectively bar individual robots.
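If you end up scripting this yourself rather than using wget, a rough sketch of a robots exclusion check might look like the following. It uses Python's standard urllib.robotparser module; the helper name allowed_by_robots, the user-agent strings and the sample rules are all made up for illustration, and it checks rules that are already in hand rather than fetching them from a site. It is no substitute for reading the site's terms and conditions.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper: it parses rules passed in as a string. Against a
    live site you would instead call parser.set_url(".../robots.txt") and
    parser.read() to download the rules first.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every robot is kept out of /private/, and one named
# robot ("badbot") is barred from the whole site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: badbot
Disallow: /
"""

if __name__ == "__main__":
    for agent, url in [
        ("my-text-extractor", "http://example.com/index.html"),
        ("my-text-extractor", "http://example.com/private/page.html"),
        ("badbot", "http://example.com/index.html"),
    ]:
        verdict = "may fetch" if allowed_by_robots(EXAMPLE_ROBOTS, agent, url) else "must not fetch"
        print(f"{agent} {verdict} {url}")
```

The second rule group shows the selective barring mentioned above: the same page is open to one robot and closed to another, depending only on the name the robot announces.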
-- David Woolley Emails are not formal business letters, whatever businesses may want. RFC1855 says there should be an address here, but, in a world of spam, that is no longer good advice, as archive address hiding may not work.