[Chicken-users] Parsing HTML, best practice with Chicken


From: mfv
Subject: [Chicken-users] Parsing HTML, best practice with Chicken
Date: Mon, 29 Dec 2014 03:28:15 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

Hello, 

I am currently playing around with Chicken and the web. More precisely, I
want to build a web link collection and see how well scraping web sites for
links and content goes for me.

Which eggs would you recommend for that? What should I avoid doing? 

So far, I have been fetching the site with http-client, converting the raw
HTML to SXML with html-parser, and trying to process the resulting list with
matchable/srfi-13. I am not sure how much good it will do to use regex on those
lists. Are there any packages like Python's Beautiful Soup in the Chicken
arsenal?

So far, I have had some trouble parsing the resulting SXML, both with
matchable and string-contains.

Cheers, 

  Piotr


ps: ze code so far:



;; version 0.0.3

; high-level HTTP client, HTML->SXML parser, pattern matching, and the
; SRFI-1 (lists: fold, filter) and SRFI-13 (strings) libraries
(use http-client html-parser matchable srfi-1 srfi-13)

; grab a website
(define lnk
  "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
(define raw (with-input-from-request lnk #f read-string))

;; convert site crawl data from html to sxml
(define sxml (html->sxml raw))

;; saving function
;; * display is more suitable here, for it prints \n and other special
;;   characters as themselves rather than as escapes
;; * might be good to remove these things from regex processing, too.
(define (savedata somedata filename)
  (call-with-output-file filename
    (lambda (p)
      (let f ((ls somedata))
        (unless (null? ls)
          (display (car ls) p)   ; changed: write -> display
          (newline p)
          (f (cdr ls)))))))

; check how parsable the output is
(savedata sxml "output.txt")

;; non-TCO
(define (flatten x)
    (cond ((null? x) '())
          ((not (pair? x)) (list x))
          (else (append (flatten (car x))
                        (flatten (cdr x))))))

(define sxmlflat (flatten sxml))

;; ***************
;; Multi-check procedure is needed to check whether STRING element has:
;;  journal-id: "10.1002" 
;;  link string: "issuetoc"
;; 
;; function:
;;   takes a list of strings and checks whether the element contains all of
;;   them (AND operator).
;; ***************


;; --- member? returns #t if element x is in list lst.
;; --- ref:
;; --- http://stackoverflow.com/questions/14668616/scheme-fold-map-and-filter-functions
;; --- use: (member? "a" (list "a" 1)) --> #t
(define (member? x lst)
  (fold (lambda (e r)
          (or r (equal? e x)))
        #f lst))

;; --- string-contains/m returns #t if all strings of list lsstr are in
;; --- string str, and #f otherwise (also for non-string elements).
;; --- case-insensitive string matching.
;; --- does not check if lsstr is empty. This would return #t.
;; --- note: a one-armed `if` yields an unspecified value when its test
;; --- fails, and filter treats that as true, so `and` is used instead.
;; --- use: (string-contains/m "Somestring" '("10.1002" "issuetoc"))
(define (string-contains/m str lsstr)
  (and (string? str)
       (not (member? #f (map (lambda (x) (string-contains-ci str x))
                             lsstr)))))


(savedata
 (filter (lambda (x) (string-contains/m x '("10.1002" "http://" "toc")))
         sxmlflat)
 "filtered3.txt")

;; Something is wrong with those bloody strings!
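
One possible alternative to flattening the whole tree is to walk the SXML
and pick out only the href attribute values. The following is a minimal
sketch, not a tested solution for the site above: collect-hrefs is a
hypothetical helper (not part of any egg), and it assumes that links show up
in the SXML as (a (@ (href "...")) ...), that no element is itself named
href, and that the tree consists of proper lists. The sample tree is made up
for illustration.

```scheme
;; collect-hrefs: hypothetical helper, plain R5RS, no eggs required.
;; Returns the string values of all (href "...") attribute nodes in
;; document order.
(define (collect-hrefs node)
  (cond ((not (pair? node)) '())              ; strings, symbols, '()
        ((and (eq? (car node) 'href)          ; an (href "...") attribute
              (pair? (cdr node))
              (string? (cadr node)))
         (list (cadr node)))
        (else (append (collect-hrefs (car node))
                      (collect-hrefs (cdr node))))))

;; made-up sample tree in the assumed SXML shape
(define sample
  '(*TOP* (html (body (a (@ (href "http://example.org/toc")) "TOC")
                      (p "text"
                         (a (@ (href "/issuetoc") (class "x")) "Issue"))))))

(collect-hrefs sample)
;; => ("http://example.org/toc" "/issuetoc")
```

The resulting list contains only strings, so it could be passed to a string
filter such as string-contains/m without the non-string elements that
flatten leaves in.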


