#1805: `html->sxml` with escaped quotes breaks text into multiple nodes
From: Chicken Trac
Subject: #1805: `html->sxml` with escaped quotes breaks text into multiple nodes
Date: Fri, 10 Jun 2022 18:53:06 -0000
#1805: `html->sxml` with escaped quotes breaks text into multiple nodes
----------------------------+-----------------------------------
Reporter: Jeremy Steward    | Owner: Alex Shinn
Type: defect                | Status: assigned
Priority: minor             | Milestone: someday
Component: extensions       | Version: 5.3.0
Keywords:                   | Estimated difficulty:
----------------------------+-----------------------------------
There's some weirdness with escaping quotes in text when using
`html->sxml`. Perhaps a short example would be sufficient to explain the
problem I'm encountering:
{{{
(html->sxml "<p>foo&apos;bar&quot;baz</p>")
;=> (*TOP* (p "foo" "'" "bar" "\"" "baz"))
}}}
As a counter-example, I'll use the
[https://wiki.call-cc.org/eggref/5/ssax: ssax egg]:
{{{
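(call-with-input-string "<p>foo&apos;bar&quot;baz</p>"
  ;; reader procedure assumed here; the report as archived omits it
  (lambda (port) (ssax:xml->sxml port '())))
;=> (*TOP* (p "foo'bar\"baz"))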
}}}
I guess fundamentally it's a question of whether there should be one text
node or not. I would argue that in this particular case, it should be a
single node. I have been using html-parser to try and scrape some web
pages, and this is extremely unexpected! Especially so if one uses
`txpath` / `sxpath` on the final result, as `//p/text()` queries will not
necessarily behave as expected. You would have to evaluate `(apply
string-append ((txpath "//p/text()") sxml))` to get the full contents of
the text.
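
One possible workaround, sketched here under assumptions, is to post-process the tree so that runs of adjacent strings collapse into one text node before any path queries are run. The procedure name `merge-adjacent-text` is hypothetical and is not part of html-parser or ssax; it only uses standard Scheme list operations.

```scheme
;; Hypothetical helper, not part of html-parser: walk an SXML tree and
;; join runs of adjacent string children into a single string.
(define (merge-adjacent-text node)
  (define (merge-children kids)
    (cond ((null? kids) '())
          ;; Two strings in a row: join them, then re-check against the rest.
          ((and (string? (car kids))
                (pair? (cdr kids))
                (string? (cadr kids)))
           (merge-children
            (cons (string-append (car kids) (cadr kids)) (cddr kids))))
          ;; Anything else: recurse into it and continue with the siblings.
          (else
           (cons (merge-adjacent-text (car kids))
                 (merge-children (cdr kids))))))
  (if (pair? node)
      (cons (car node) (merge-children (cdr node)))
      node))

(merge-adjacent-text '(*TOP* (p "foo" "'" "bar" "\"" "baz")))
;=> (*TOP* (p "foo'bar\"baz"))
```

Wrapping the parser call, as in `(merge-adjacent-text (html->sxml ...))`, should then let `//p/text()` return a single string for the paragraph above. Strings separated by an element, such as `(p "a" (b "c") "d")`, are left as separate nodes.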
Is there a rationale for this, or is that some kind of limitation of the
parser? I know that tags may also contain sub-tags in HTML, but I'm not
sure a new node should be made if a tag's contents are not HTML tags
themselves.
--
Ticket URL: <https://bugs.call-cc.org/ticket/1805>
CHICKEN Scheme <https://www.call-cc.org/>
CHICKEN Scheme is a compiler for the Scheme programming language.