Re: [Lynx-dev] Extract links from html with application/ld+json script
From: David Woolley
Subject: Re: [Lynx-dev] Extract links from html with application/ld+json script
Date: Sun, 17 Dec 2023 21:59:27 +0000
User-agent: Mozilla Thunderbird
Looking a bit further, ld+json is a database serialisation format, based
on javascript, but it is declarative. It definitely isn't HTML, but one
could render it by basically pretty printing, without the need to handle
the generalities of javascript. You may, though, have to manually
extract it from the page, as I suspect general execution of javascript
may be needed to actually find it reliably.
Lynx does not even have a JSON interpreter and I'm sure it doesn't have
a JSON pretty printer.
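Outside Lynx, the manual extraction and pretty printing can be done with a few lines of script. The following is a minimal sketch in Python using only the standard library; the names LdJsonExtractor and pretty_print_ld_json are hypothetical, and it assumes the ld+json block is present in the static HTML rather than being inserted later by javascript, which may not hold for every page.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # The type attribute may be missing or valueless, hence the "or".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ld_json = True
            self._chunks = []

    def handle_data(self, data):
        # Script content can arrive in several pieces, so accumulate it.
        if self._in_ld_json:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self.blocks.append("".join(self._chunks))
            self._in_ld_json = False


def pretty_print_ld_json(html_text):
    """Return each parseable ld+json block as an indented JSON string."""
    parser = LdJsonExtractor()
    parser.feed(html_text)
    parser.close()
    results = []
    for raw in parser.blocks:
        try:
            results.append(json.dumps(json.loads(raw), indent=2))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON rather than guessing.
            continue
    return results
```

Calling pretty_print_ld_json on the saved page source returns one indented string per block, which can then be read in a pager or any text viewer.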
Using <http://jsonprettyprint.net/json-pretty-print> to pretty print it,
the core of one of the items comes out as (I've just used an extract to
minimise copyright issues):
{
  "@type": "VideoObject",
  "name": "The Chokepoint (EGC Finals)",
  "url": "https://clips.twitch.tv/ElatedIncredulousPepperOpieOP-oUeW6hXXZs8nmWtX",
  "description": "Watch EGCTV's clip of Age of Empires IV on Twitch!",
  "thumbnailUrl": [
    "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-86x45.jpg",
    "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-260x147.jpg",
    "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-480x272.jpg"
  ],
  "uploadDate": "2023-12-17T16:16:18Z",
  "duration": "PT60S",
  "position": 2,
  "interactionStatistic": {
    "@type": "InteractionCounter",
    "interactionType": {
      "@type": "http://schema.org/WatchAction"
    },
    "userInteractionCount": 29
  },
  "embedUrl": "https://player.twitch.tv?video=1542310342&autoplay=true&parent=meta.tag"
},
I'm pretty sure that most of the tags have no intrinsic meaning, and you
still need the full javascript code, or to guess from the names, to
correctly interpret them.
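Guessing from the names can at least be partly mechanised. As a rough sketch (the function name collect_urls is hypothetical, and treating any string that starts with http:// or https:// as a candidate link is my assumption, not something the format guarantees), one can walk the parsed JSON and list every URL-looking value together with the path of names leading to it:

```python
import json


def collect_urls(node, path=""):
    """Return (path, url) pairs for every string value that looks like a URL."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found.extend(collect_urls(value, f"{path}.{key}" if path else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(collect_urls(value, f"{path}[{index}]"))
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        found.append((path, node))
    return found


# A cut-down version of the item quoted above.
item = json.loads("""
{
  "@type": "VideoObject",
  "url": "https://clips.twitch.tv/ElatedIncredulousPepperOpieOP-oUeW6hXXZs8nmWtX",
  "position": 2,
  "interactionStatistic": {
    "interactionType": {"@type": "http://schema.org/WatchAction"}
  }
}
""")

for where, url in collect_urls(item):
    print(where, url)
```

Note that this also reports the schema.org value under interactionType, which is a type identifier rather than something to follow, so a person still has to judge from the names which entries are real links.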
The actual HTML doesn't include anything renderable. Everything is done
as empty DIVs and relies on styling for any display, so can't be
considered foreground content. There is some directly renderable
content, but it is SVG, with no accessible text fallback. SVG is an
image format, so it is useless for a text-only browser.
On 17/12/2023 20:44, David Woolley wrote:
On 17/12/2023 19:31, Super Bonaci via Lynx-dev wrote:
> Lynx is not able to extract most html links inside the html file.
There are no HTML links in 9ed7a8bb: there are no anchor elements, and
all occurrences of href but one are in link elements, which don't
generate visible inline hyperlinks; the remaining one is in javascript
code! I think this is a Javascript application program, not an HTML
document. Lynx doesn't have a javascript interpreter and doesn't parse
HTML in a way that creates a document object model in a format that
would allow such an interpreter to do anything non-trivial.
Any links are created by manipulating the document in the browser, which
Lynx can't do.
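Anyone wanting to repeat that check on another page can count which elements carry an href attribute in the static source. This is only a sketch, using Python's standard html.parser; the names HrefCounter and count_hrefs are hypothetical. It counts attributes on real elements only, so an href that appears inside javascript text is not counted.

```python
from collections import Counter
from html.parser import HTMLParser


class HrefCounter(HTMLParser):
    """Count href attributes per element name in the static markup."""

    def __init__(self):
        super().__init__()
        self.href_by_tag = Counter()

    def handle_starttag(self, tag, attrs):
        # Script bodies are delivered as data, not tags, so they are ignored.
        if any(name == "href" for name, _ in attrs):
            self.href_by_tag[tag] += 1


def count_hrefs(html_text):
    parser = HrefCounter()
    parser.feed(html_text)
    parser.close()
    return parser.href_by_tag
```

A result with entries only under link, and nothing under a, indicates that the page as delivered has no hyperlinks for a non-javascript browser to show.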
Supporting javascript applications would require a complete rewrite from
first principles. The result would not be Lynx.
I suspect the same is true of the other document.
> Since the Lynx version is from 2018
I don't think there have been major changes in HTML in the last five
years that would break a real HTML document on Lynx. The problem with
web applications is over a decade old. It goes back to the original
Netscape, but was solidified when the Web Hypertext Application
Technology Working Group (WHATWG) effectively took over control of HTML from W3C
leading to the creation of HTML5. Although that can be used for pure
documents, the name of the working group clearly indicates that the
intention was otherwise. That happened about 19 years ago.
Commercial artists and marketing managers don't buy into the TBL notion
of HTML and want programs that can be run on the advertising consumer's
machine. Whilst there are some cases where this is valid, for
technical or privacy reasons, most such applications are written for
marketing reasons.
Some text mode browsers handle some javascript uses, but I'm pretty sure
they would not cope with your examples.
The only certain way of finding the links in javascript code is to run
the program.