lynx-dev

Re: lynx-dev TRST: the next step


From: Klaus Weide
Subject: Re: lynx-dev TRST: the next step
Date: Sun, 7 Nov 1999 14:34:27 -0600 (CST)

On Fri, 5 Nov 1999, Philip Webb wrote:

> looking forward to better table support, i compiled 2-8-3dev.14 .
> unfortunately, the first set of tables i wanted to make use of
> did not meet TRST's extremely limited capabilities.
> 
> in README.TRST, Klaus states:
> 
>   A table is simple enough if each of its TR rows translates
>   into at most one display line in LTMS formatting (excluding leading
>   and trailing line breaks), and the width required by each row
>   (before as well as after fixup) does not exceed the available screen size.
>   Note that this excludes all tables where some of the cells are marked up
>   as block elements ('paragraphs').
> 
> Lynx should ignore <P> ... </P> blocks when they occur inside tables:
> they can never improve Lynx rendering & can make it very bad.

... except if the table is used just for display formatting purposes
of non-tabular material.  In that case, mostly ignoring the table elements
(minimal table support) and honoring <P> is better.  I don't have any
good heuristics to determine which case we are dealing with (and determine
it early enough).

> to see what happens goto  www.chass.utoronto.ca/~purslow/trst.html ,
> where there is an example of a table from a recent transit report
> with the <P> tags deleted (by hand) & then with them left in place.
> Lynx makes an excellent job of the first; the second is grotesque.

An excellent job?  Not exactly.  Did you notice the difference when
toggling ^V?

That table is full of invalid HTML (invalid nesting), even after your
removal of <P>, which has an effect on lynx and shows up as different
rendering in TagSoup/SortaSGML.  If you want to work on this, it might
be useful to start with less broken examples.

Also note the alignment of cell contents: all the numeric fields are
left-aligned, when they should be right-aligned.  Since the HTML sets
the alignment in ALIGN attributes of the <P>s, rather than of the
<TD>s, you lose the alignment info by just excising the <P>s.

> surely, it's an easy matter to have Lynx ignore <P> tags within tables,
> at least as an option in  lynx.cfg .  if Klaus doesn't care to do it
> -- certainly, he's under no obligation to -- 

He even thinks he has some valid excuses for not starting down that road...

You start with <P>.  But what about <BR>?  What about <CENTER>, <DIV>, or
all the other tags that normally cause line breaks?  What if the cell
contains more than one <P>?
If treating <P>, alone, specially works well for your favorite example,
it doesn't yet make it a good approach for tables in general.

> could he or someone else
> tell me where to start in the source to do it myself.

That's the spirit. :)  [*]

I'll certainly tell you where to look.  I am kinda curious whether you
can come up with something that works well enough (for more than a few
selected test cases).

Look in HTML.c, locate the code under

    case HTML_P:

The first occurrence (in HTML_start_element) is for <P [...]>, the
second (in HTML_end_element) is for </P>.

You want to 'do nothing' under a certain condition, in this case, if we
are in a table.  It turns out there already is a convenient flag to
test, me->inTABLE.  It also turns out that me->inTABLE gets set to
TRUE for every <TABLE> and gets unset (to FALSE) for every </TABLE>,
so it won't work as expected for nested TABLEs, but it should be good
enough for a start (or actually work better this way, accidentally).
Note that me->inTABLE isn't anything TRST specific.
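
To make the nesting caveat concrete, here is a minimal sketch comparing a
single TRUE/FALSE flag with a depth counter.  ToyState and the toy_*
functions are hypothetical stand-ins written for this illustration; they
are not Lynx's real structure or code, and the counter is only an assumed
alternative, not something the Lynx source is known to contain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parser state; not Lynx's real structure. */
typedef struct {
    bool in_table;    /* mimics the behaviour described for me->inTABLE */
    int  table_depth; /* assumed alternative: number of open <TABLE>s */
} ToyState;

/* Called for every <TABLE>. */
static void toy_table_start(ToyState *s)
{
    s->in_table = true;
    s->table_depth++;
}

/* Called for every </TABLE>. */
static void toy_table_end(ToyState *s)
{
    assert(s->table_depth > 0);
    s->in_table = false;   /* cleared even if an outer table is still open */
    s->table_depth--;
}

static bool toy_inside_by_flag(const ToyState *s)  { return s->in_table; }
static bool toy_inside_by_depth(const ToyState *s) { return s->table_depth > 0; }
```

After <TABLE> <TABLE> </TABLE>, the flag reports "not in a table" while
the counter still reports "in a table", which is the difference described
above for nested tables.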

So you want something like, under 'case HTML_P:',
      if (me->inTABLE) {
         /* do nothing */
      }
      /* else do all the usual P stuff */

or

      if (!me->inTABLE) {
         /* Do all the usual P stuff */
      }

In the case of P, the 'usual P stuff' is in a separate function
LYHandlePlike (which you'll find in LYCharUtils.c).  You could
essentially get what you want by making that function do nothing,
by inserting

    if (me->inTABLE)
        return;
 
somewhere in it.

LYHandlePlike also gets called in some other (rare) cases, but
as a start you could ignore that (you want to eliminate line breaks,
so you probably want to ignore them for more than just P anyway...).
Also ignore the CHECK_ID(HTML_P_ID), or rather let it continue
to be called even if you suppress the rest of P handling.
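
The ordering matters here: the ID check runs first, and only the rest of
the P handling is skipped inside a table.  A minimal sketch of that
ordering follows.  ToyDoc and toy_handle_p are hypothetical names made up
for this illustration, and the counter standing in for CHECK_ID assumes
that the macro's job is to record the element's ID attribute; the real
LYHandlePlike has a different signature and does much more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the document state; not Lynx's real structure. */
typedef struct {
    bool in_table;        /* plays the role of me->inTABLE */
    int  ids_registered;  /* stands in for what CHECK_ID is assumed to do */
    int  breaks_emitted;  /* stands in for the usual P stuff */
} ToyDoc;

/* Toy P handler: always note the ID, skip the break inside a table. */
static void toy_handle_p(ToyDoc *doc, const char *id)
{
    assert(doc != NULL);

    if (id != NULL)
        doc->ids_registered++;  /* kept even when the break is suppressed */

    if (doc->in_table)
        return;                 /* do nothing else inside a table */

    doc->breaks_emitted++;      /* the usual P stuff */
}
```

With in_table set, a <P ID="x"> still counts as one registered ID but
emits no break, so links to that ID would keep working.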

   ----------------

This gives you "Lynx ignore[s] <P> tags within tables" as you asked
for, more or less.  It does not give you "Lynx ignores <P> tags within
tables if the table is a TRST candidate", which is not what you asked
for but maybe meant.  The code in HTML.c does not know whether TRST
handling is active, as a matter of separation of functions.

   ----------------

[*] Or is it?

What happened to the idea of fixing up HTML for lynx with external
programs?   After all, that would be much more flexible, you can tune
some external script more easily for a specific situation than trying
to find the optimal tweaks in Lynx C code (which would likely work
well only in some cases, and work badly in other cases).

See the example lynxcgi method, and instructions, given in
<http://www.flora.org/lynx-dev/html/month0799/msg00098.html>.
Here is a script, to use in place of the one in that message,
that works for your table:

(disclaimer: your sed may act slightly differently, this is
 GNU sed version 2.05)

#! /bin/sh
lynx -mime_header "$QUERY_STRING" | \
sed \
-e "\
/<[Tt][Aa][Bb][Ll][Ee]/,/<\/[Tt][Aa][Bb][Ll][Ee]/{;\
  :L;\
/<TD/{;\
/<\/TD>/!{;\
   N;\
   bL\
  }
 }
 s/\n//g;\
 s/<\/\?FONT[^>]*>//g;\
 s/<B><P \(ALIGN=\"[a-zA-Z]\+\"\)>/<P \1><B>/g;\
 s/<TD \([^>]*\)><P \([^>]*\)>/<TD \1 \2><P>/g;\
 s/<\/\?[Pp]\( [^>]*\)\?>//g;\
}"

What it does:
- between <TABLE> and </TABLE>
  - if a line has "<TD", join subsequent lines until there is also
    a "</TD>" on the combined line
  - remove "<FONT>" and "</FONT>"
  - Move "<B>" from before "<P>" to after it.  (This, together with the
    previous, gets rid of invalid nesting.)
  - move attributes of <P> to become attributes of preceding "<TD>".
    This moves "up" ALIGN attributes so TRST can honor them.
  - finally get rid of "<P>" and "</P>".

(I'm sure this could be simplified more.  It's probably much easier
to do, and less cryptic, in perl.)
Try to do the equivalent in Lynx's C code, but generalized enough that it
deals with more than a specific example.
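
For a sense of scale, here is a minimal C sketch of just the last step of
the script (dropping "<P ...>" and "</P>") applied to a plain text buffer.
The names p_tag_length and strip_p_tags are hypothetical, made up for this
illustration; this is not how Lynx's SGML parser sees the document, and it
does none of the line joining, FONT removal, or attribute moving.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* If s starts a <P ...> or </P> tag, return the tag's length, else 0. */
static size_t p_tag_length(const char *s)
{
    const char *q = s;
    const char *end;

    if (*q != '<')
        return 0;
    q++;
    if (*q == '/')
        q++;
    if (*q != 'P' && *q != 'p')
        return 0;
    q++;
    /* The tag name must end here, so <PRE> or <PARAM> do not match. */
    if (*q != '>' && !isspace((unsigned char)*q))
        return 0;
    end = strchr(q, '>');
    if (end == NULL)
        return 0;               /* unterminated tag: leave it alone */
    return (size_t)(end - s) + 1;
}

/* Remove every <P ...> and </P> tag from buf, in place. */
static void strip_p_tags(char *buf)
{
    char *src = buf;
    char *dst = buf;

    assert(buf != NULL);
    while (*src != '\0') {
        size_t n = p_tag_length(src);
        if (n > 0)
            src += n;           /* skip the whole tag */
        else
            *dst++ = *src++;    /* copy everything else unchanged */
    }
    *dst = '\0';
}
```

Note that this throws away the ALIGN attribute along with the <P>, which
is exactly the loss of alignment info the script avoids by moving the
attributes to the <TD> first.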

   Klaus

