guile-user

Re: salutations and web scraping


From: Catonano
Subject: Re: salutations and web scraping
Date: Mon, 16 Jan 2012 21:06:48 +0100

Andy,

On 10 January 2012 at 22:46, Andy Wingo <address@hidden> wrote:
Hi Catonano,

On Fri 30 Dec 2011 23:58, Catonano <address@hidden> writes:

> I'm a beginner, I never wrote a single line of LISP or Scheme in my life
> and I'm here for asking for directions and suggestions.

Welcome! :-)

Thank you so much for your reply. I had been eagerly waiting for a signal from the list and I missed it! I'm sorry.

Gmail's learning mechanism still hasn't learned enough about my interest in this topic, so it didn't promptly notify me of your reply. I had to dig through the folder structure I had laid out to find it. As for me, I haven't learned enough about the woes of Gmail's learning mechanism. I guess we're both learning now.

Well, I was attempting a joke ;-)



> my boldness is such that I'd ask you to write for me an example
> skeleton code.


Hey, it's fair, I think; that is a new part of Guile, and there is not a
lot of example code.


Thanks, Andy, I'm grateful for this. I actually managed to set up Geiser, load a file, and get a prompt with that file loaded. Cool ;-) But there were still some things I didn't know that your post made clear.
 
Generally, we figure out how to solve problems at the REPL, so fire up
your Guile:

 $ guile
 ...
 scheme@(guile-user)>

(Here I'm assuming you have guile 2.0.3.)


Use the web modules.  Let's assume we're grabbing http://www.gnu.org/,
for simplicity:

 > (use-modules (web client) (web uri))
 > (http-get (string->uri "http://www.gnu.org/software/guile/"))
 [here the text of the web page gets printed out]

OK, I had managed to get this far (thanks to the help I received in the Guile channel on IRC).

Actually there are two return values: the response object, corresponding
to the headers, and the body.  If you scroll your terminal up, you'll
see that they get labels like $1 and $2.

I didn't know there were two values, thanks.
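
Outside the REPL the $1 and $2 labels are not available, so the two return values have to be bound by name. A minimal sketch, assuming Guile 2.0 with the web modules loaded as above; `receive` comes from (ice-9 receive), and the names `response` and `body` are only illustrative:

```scheme
(use-modules (web client) (web uri) (web response) (ice-9 receive))

;; http-get returns two values; receive binds each of them to a name.
(receive (response body)
    (http-get (string->uri "http://www.gnu.org/software/guile/"))
  ;; response corresponds to the headers; print the numeric status code
  (display (response-code response))
  (newline)
  ;; body is the page content, the value labelled $2 at the REPL
  (display body)
  (newline))
```

The same binding could also be written with `call-with-values` or with `let-values` from SRFI 11; `receive` is used here only because it is short.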

Now you need to parse the HTML.  The best way to do this is with the
pragmatic HTML parser, htmlprag.  It's part of guile-lib.  So download
and install guile-lib (it's at http://www.nongnu.org/guile-lib/), and
then, assuming the html is in $2:

I had seen those $i things, but I hadn't understood that stuff was "inside" them and that I could use them, so I was using a lot of (define this that). This is probably why I missed the two values returned by http-get. Thanks!

 
 > (use-modules (htmlprag))
 > (define the-web-page (html->sxml $2))


And I didn't know about htmlprag, thanks.
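
The parsing step can be tried without any network access. A small sketch, assuming guile-lib is installed so that (htmlprag) can be loaded; the literal string and the names `sample-html` and `sample-page` are made up for illustration and stand in for the body fetched with http-get:

```scheme
(use-modules (htmlprag))

;; A small literal page stands in for the HTML held in $2.
(define sample-html
  "<html><head><title>Example</title></head><body><p>Hello</p></body></html>")

;; html->sxml accepts a string (or an input port) and returns an SXML tree.
(define sample-page (html->sxml sample-html))

;; Print the resulting s-expression.
(write sample-page)
(newline)
```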
 

That parses the web page to s-expressions.  You can print the result
nicely:

 > ,pretty-print the-web-page

Thanks, I didn't know this either.
 

Now you need to get something out of the web page.  The hackiest way to
do it is just to match against the entire page.  Maybe someone else can
come up with an example, but I'm short on time, so I'll proceed to The
Right Thing -- the problem is that whitespace is significant, and maybe
all you want is the contents of "the <title> in the <head> in the
<html>."

So in XML you'd use XPATH.  In SXML you'd use SXPATH.  It's hard to use
right now; we really need to steal
http://www.neilvandyke.org/webscraperhelper/ from Neil van Dyke.  But
you can see from his docs that the thing would be

 > (use-modules (sxml xpath))
 > (define matcher (sxpath '(// html head title)))
 > (matcher the-web-page)
 $3 = ((title "GNU Guile (About Guile)"))
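
If only the text of the title is wanted rather than the whole (title ...) node, a `*text*` step can be added at the end of the path. A sketch under the assumption that the page has the shape shown in $3; the hand-written tree `sample-tree` and the name `title-text` are illustrative only:

```scheme
(use-modules (sxml xpath))

;; A hand-written SXML tree standing in for the-web-page.
(define sample-tree
  '(*TOP* (html (head (title "GNU Guile (About Guile)"))
                (body (p "Hello")))))

;; The trailing *text* step selects the string children of <title>,
;; so the matcher returns a list of strings instead of title nodes.
(define title-text (sxpath '(// html head title *text*)))

(write (title-text sample-tree))
(newline)
```

The result is still a list, since a path can match more than once; taking its `car` gives the first match when the list is not empty.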


I was going to attempt something along these lines:

(sxml-match (xml->sxml page) [(div (@ (id "real_player") (rel ,url))) (str

but I'm going to explore your approach too. I hadn't gotten that far yet: I had stumbled on something I thought was a bug, but I also had other things to do (this is a pet project), so this had to wait.

But I'll surely let you know.

Thanks again for your help
Bye
Cato
 
