lynx-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

lynx-dev Re: [PATCH 2.8.5-dev14] *Really* large tables


From: Ilya Zakharevich
Subject: lynx-dev Re: [PATCH 2.8.5-dev14] *Really* large tables
Date: Thu, 10 Apr 2003 13:13:17 -0700
User-agent: Mutt/1.4i

On Thu, Apr 10, 2003 at 12:24:38AM -0700, Ilya Zakharevich wrote:
> With the supplied file 
> 
>   http://ilyaz.org/software/tmp/table_2col_bold_it_500000.html.gz
> 
> the footprint on my system (EMX malloc(), slowish, but quite dense) is
> 107M.  Same result with my malloc (aka the Perl one) - known for
> extreme efficiency (as usual, "at least in some situations" ;-).  So
> it looks "as good as it gets".  Let us break it down:
> 
> One gets the footprint 80 bytes per line if the same text is not put
> into a table.  This is 40M.
> 
> To process the table: eventually we need a large array of "row
> descriptors".  Each one takes 36byte.  The minimal footprint for this
> is 18M.  The algorithm used (reallocation with size increased by 50%
> each time) may live a "trail" of reallocated chunks 2 times as big -
> so the maximal footprint is 56M (but see below).
> 
> The cells take circa 20byte per cell.  The used algorithm allocateds
> them in several arenas, so there very little overhead.  This gives
> extra 20M.
> 
> So far, so good.  So the footprint to process the table is:
> 
>   a) to set the text: 40M
> 
>   b) to store the temporary table structures: 38M..76M;
> 
>   c) to reset the text with new "indentations": ???

Investigations with debugger + memory-use analyser give:

   93M before endStblTABLE().
   93M after finishTABLE().
   106M after insertion of blanks.
   106M after Stbl_free()

So the footprint 93M before (c), 106M after (c).  I think it is a pure
artefact of the current algorithm of line storage.  So the overheads
so far look like this:

   (a) text storage: 40M;

   (b) temporary table layout info: could be 38M, is 53M;

   (c) text storage redone due to whitespace inclusion: 13M.

There is a lot of possibilities to improve the constants (now when the
algorithm is actually linear):

   (a1) Even with styles enabled, the text includes ^A and ^B (etc)
        characters.  [Do not know how they get there, in style-less
        build they denote boundaries of bold/underwrited text.]  [They
        are sometimes visible during a partial display stage.]
        Removing them may improve (slightly) the memory footprint
        taken by lines-as-strings.

   (a2) The current algorithm for storing styles info is extremely
        wasteful.  Each line contains info about the whole stack of
        styles which lead to the style of this line.  E.g., in the
        example above each line should be marked at the beginning as
        start-html, start-body, start-table, and at the end as
        end-table, end-body, end-html.  Since all 3 these styles are
        not switched off until the end of the line, only the inner one
        matters.

        Moreover, almost each line of an HTML file will have start-html,
        and start-body tags.  Is not it better to just assume them,
        and (e.g.) just explicitly disable (an assumed) start-body
        style for the display of the <head> matter?

        I think (a1)+(a2) can lead to circa 20M winning (maybe more).

    (b) Instead of array of row-info elements for table information
        storage (rowinfo table[500000]) one could store an array of
        pointers (rowinfo *table[500000]), and allocate the rowinfo's
        themselves by chunks.  This way we need to realloc()ate only a
        2M array (instead of 18M one).  This gives a pessimization of
        2M, and an optimization of whatever reallocations we may save.
        It is reasonable to expect a winning of circa 10M on this stage.

    (c) I have no idea how to avoid this 13M waste...  Any thoughts?

Summary: currently we take 107M to render a simple 500K-line table.
One should expect that it should be possible to shave circa 40M out of
this.

Yours,
Ilya

; To UNSUBSCRIBE: Send "unsubscribe lynx-dev" to address@hidden

reply via email to

[Prev in Thread] Current Thread [Next in Thread]