lynx-dev

Re: [Lynx-dev] pse help.


From: David Woolley
Subject: Re: [Lynx-dev] pse help.
Date: Thu, 11 Jun 2009 08:15:47 +0100
User-agent: Thunderbird 2.0.0.21 (X11/20090302)

karsten harazim wrote:
wonder if it seems to be possible to extract information from existing websites into some Excel document, like extracting all names, addresses, phone numbers, email, url etc. from pages like that: http://www.muenster.de/schulen-alle-1.html

Technically, you need something like XSLT to do this, although you are rather dependent on the author actually writing HTML according to the true spirit of HTML, which is rather rare. You may need to convert the HTML to XML syntax before using XSLT.
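
As a rough illustration of the extraction step, here is a minimal sketch in Python. It does not use XSLT (the Python standard library has no XSLT processor); it swaps in the standard html.parser module instead, which also tolerates HTML that is not well-formed XML. It assumes, purely for illustration, that the data sits in table rows; the structure of the real page is not known here. The names TableRowExtractor, extract_rows and rows_to_csv are made up for this example. CSV is used as the output because spreadsheet programs can open it.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Assumption: the page keeps its data in table rows.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Store the current cell with runs of whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # copes with a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # copes with a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def extract_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

If the page does not use tables, the tag names tested in handle_starttag and handle_endtag would have to be changed to whatever markup the author actually used, which is the same dependence on the author's HTML mentioned above.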

For the actual download, you would be better off using one of the specialist tools, like curl or wget.
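
If the download has to happen inside a script rather than through curl or wget, a minimal sketch using Python's standard urllib.request might look like the following. This is a substitute for those tools, not an equivalent; the function name fetch_page and the fallback to UTF-8 when the server declares no character set are assumptions made for this example.

```python
import urllib.request


def fetch_page(url, timeout=30):
    """Download one page and return it as text.

    Assumption: if no charset is declared in the response headers,
    the body is decoded as UTF-8, replacing undecodable bytes.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to whatever does the extraction.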

However, actually doing so is likely to be illegal. Even if the information is a pure collection of facts, in countries like the UK, it would be covered by a database copyright. At least one reason why Lynx can get blocked from sites is that it is often used to extract information without the surrounding advertising/branding.

--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.



