lynx-dev on caching (long)

From: Klaus Weide
Subject: lynx-dev on caching (long)
Date: Tue, 17 Nov 1998 13:51:08 -0600 (CST)

On Mon, 16 Nov 1998, Leonid Pauzner wrote:

> >> I want to add internal cache for html sources in Lynx,
> >> will explain my point of view in a separate message,
> >> currently I am not sure where the stream should be redirected
> >> and a little lost with anchor structure survey,
> >> so you help appreciated.
> > Ok, I will look at that discussion.
> "Why doesn't lynx cache HTML sources" - is a talk, and
> "lynx internal cache proposal" assume more positive efforts.

Who says that talk has to be negative? :)

Here are various thoughts on the subject of caching.  First, it is good
that you have read the HTTP 1.1 RFC, but you overemphasize the
importance of the version number.  If-Modified-Since has been in HTTP
1.0 for a long time, and using ETags instead or in addition isn't that
much more work.  Also, you don't need to send "HTTP/1.1" in a request in
order to make a server use most HTTP/1.1 features.  Apache certainly
sends ETag headers whether the client says HTTP/1.1 or not.  Most of
the stuff that is new in HTTP 1.1 can be used by HTTP 1.0 clients and
servers.  That's the beauty of compatible protocols...
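To illustrate (a sketch only, with invented structures and function names, not actual Lynx code): a conditional GET that any HTTP/1.0 server can act on might be built like this, sending If-Modified-Since whenever we have a date and If-None-Match in addition when we happen to have an ETag; a server that doesn't understand the latter simply ignores it:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical cached metadata for one URL (invented, not Lynx structures). */
typedef struct {
    char last_modified[64];   /* e.g. "Sat, 14 Nov 1998 08:12:31 GMT" */
    char etag[64];            /* e.g. "\"2f5cd-a1b\"", empty string if none */
} CacheMeta;

/* Build a conditional GET.  If-Modified-Since works with plain HTTP/1.0
 * servers; If-None-Match is sent in addition when an ETag is cached. */
static void build_conditional_get(const CacheMeta *meta, const char *path,
                                  char *buf, size_t len)
{
    size_t n = 0;
    n += snprintf(buf + n, len - n, "GET %s HTTP/1.0\r\n", path);
    if (meta->last_modified[0])
        n += snprintf(buf + n, len - n, "If-Modified-Since: %s\r\n",
                      meta->last_modified);
    if (meta->etag[0])
        n += snprintf(buf + n, len - n, "If-None-Match: %s\r\n", meta->etag);
    snprintf(buf + n, len - n, "\r\n");
}
```

If the server answers 304 Not Modified, the cached copy is served; otherwise the 200 response replaces it.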

But if a client sends a request line with "HTTP/1.1", then the server
is entitled to assume that the client
 - wants persistent connections (unless "Connection: close" is sent)
 - understands "Transfer-Encoding: chunked"
 - understands intermediate responses like "100 Continue" (probably;
   not sure whether something has changed there)
So there are no good reasons to pretend we are fully implementing HTTP 1.1,
and there are good reasons not to do so.  (Except in your own private
copy of lynx, to see what would go wrong...)


In principle, and IMO, caching should be like this:

     U             L1            L2            L3            O
   -------       -------       -------       -------       -------       
   |     | ----> |     | ----> |     | ----> |     | ----> |     |
   |     |       |     |       |     |       |     |       |     |       
   -------       -------       -------       -------       -------       
    user          cache         cache         cache         origin

Each of the intermediate boxes Ln represents a level of caching.
U is the "user" of cached data (not necessarily human, but a part of
an application).  The arrows show the flow of requests; the response
data of course goes the other way.

Well, so far this isn't very original.  It is meant to be general:
"cache" can stand for all kinds of things.  Let's see how this can
apply to an HTTP request from Lynx.

 U:   mainloop()
L1:   cache of rendered documents - WHAT WE HAVE
L2:   ??? cache for raw bytes - WHAT WE DON'T HAVE
L3:   a proxy cache in the network
 O:   HTTP origin server
(Btw., in some respects the internal links stuff can be viewed as an
additional first cache L0, consisting of only the viewed document,
but let's ignore that now.)

As you can see, I suggest a place where a new cache level for raw data
should belong - if we want to do it right.  Note that I leave it open
whether that new level stores data in memory, in files, or maybe even
both.  Do you agree so far?

Each "cache" box stands for
 - cached data
 - metadata: for example How old, Time Received, validity flags, ...
 - logic and code to
   - handle requests (incoming arrow from left)
   - see whether we have the data
   - determine whether data can be used for reply
   - make request (become a user of the next level, arrow to the right)
   - respond (possibly negatively, with an error message)
   - store new data if appropriate
   - keep metadata updated

For the left part of the diagram, communication between the levels is
function calls and return values and (for passing the data back)
HTStream stages.  To the right of L2, it's HTTP messages.

The simple diagram implies that a cache level automatically "writes
through" the request to the following level if it cannot satisfy the
request.  That is, of the two styles of communication:

A.   -- caller: Gimme data if you have, or an error.
     -- cache: Ok, here it is...   
     -- cache: Nope, try something else without me.

B:   -- caller: Gimme data, whether you already have it or not.
     -- cache: Ok, here it is... (I had it cached)
     -- cache: Ok, let's see... yep, I got it for you.

the second is preferred.  With B, levels can more easily be chained.
(We get a linear diagram.)  With A, we'd probably have something like
         U            Ln
      ------- Req.  ------- Req.     
      |     | ----> |     | ---->
      |     | <---- |     | <----     
      ------- data  -------  data    
       \   ^                    /
   Req. \   \ data             /
         v   \                / data
         -------             /
         |     | <----------
         |     | 

This looks more complicated...  One part of the program (the top left
box) now has to do a lot more: talk to two caches instead of one
(probably in a different language), determine which to use, resolve
conflicts, and know a lot about the caches' metadata.  Now imagine that
this has to be implemented in (or somehow controlled by) mainloop() and
getfile(), which are already horribly complicated... (I know, my fault.)

I am afraid that the new caching level will look more like the second
diagram if we are not careful.  Initially that may look like a win:
the new routines, after all, have to do less (they don't have to pass on
requests).  But even if this new stuff is only meant for special cases
(reloading the current document for '*', '[', ^V etc.), the caller has to
keep track of it; and it will get more complicated if someone later
wants to build a more complete caching system.
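To make style B's "write through" chaining concrete, here is a toy sketch (all structures and names invented for illustration, vastly simplified, not Lynx code): each level first tries to answer from its own data, otherwise becomes a user of the next level and stores whatever comes back.

```c
#include <string.h>

#define SLOTS 8

/* One cache level: some cached data plus a pointer to the next level
 * to the right (NULL past the last level).  Invented, not Lynx code. */
typedef struct CacheLevel {
    struct CacheLevel *next;
    const char *url[SLOTS];
    const char *data[SLOTS];
    int n;
} CacheLevel;

/* Style-B request handling: satisfy the request from this level if
 * possible, otherwise pass it through and store the new data. */
const char *cache_request(CacheLevel *lv, const char *url)
{
    int i;
    const char *data;
    if (lv == NULL)
        return NULL;                      /* past the last level: failure */
    for (i = 0; i < lv->n; i++)
        if (strcmp(lv->url[i], url) == 0)
            return lv->data[i];           /* hit: respond directly */
    data = cache_request(lv->next, url);  /* miss: ask the next level */
    if (data != NULL && lv->n < SLOTS) {  /* store new data if appropriate */
        lv->url[lv->n] = url;
        lv->data[lv->n] = data;
        lv->n++;
    }
    return data;
}
```

The caller always talks only to the leftmost level, in one language; adding or removing a level in the middle changes nothing for it.  (Real metadata, validation, and expiry are omitted here.)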

The second diagram suggests some splitting of the data stream; this could
mean that the same bytes, while being read from the next level (the network),
are simultaneously written to a disk file (or memory buffer) and towards
the HText rendering/caching.  I don't mean that this is bad -- it is most
certainly better than first-write-to-file,-then-load-file-data -- but it
should ideally be handled within one box: one set of functions
(perhaps including a new HTStream stage) which interoperate and share (meta-)
data, while the rest of the program doesn't need to know much about their
internals.
So let's look more at the linear model.  For it to work, the caller has
to be able to tell the next cache level what it wants, and for that we
need to have a language... If that language is expressive enough (and
precise enough), the caller does not need to bypass the normal line of
request/response communication to get direct access to the cache's data.

To explain more concretely what I mean, let's look at mainloop().  It normally
passes a request to getfile() which passes it on by calling lower levels.
The lower levels decide what to do based on the URL and other parameters
(and global variables, which ideally should be parameters).  Somewhere in
HTAccess.c it is decided whether to satisfy the request from the rendered-
document-cache.  mainloop() just gets a result after all is done; it doesn't
(and usually cannot) know whether the document loaded now came from the
cache or from the network (or a proxy server etc.).  This nice separation
is sometimes violated: when mainloop() wants to know whether a POST result
is still in the cache, in order to prompt the user if appropriate.  If the
language in which mainloop() talks to getfile() etc. could express "Give
me this document, but if it is not cached ask for confirmation first",
then mainloop() wouldn't have to do so much work.

Currently the caching-related request vocabulary already consists of a bunch
of flags: `reloading', `LYforce_no_cache', `LYoverride_no_cache',
`LYinternal_flag', ..., text->no_cache, anchor->no_cache, in conjunction with
various other bits and pieces.  This is very confusing.  These are all needed
so HTLoadDocument() (& friends) can decide what to do.  Now imagine
what happens if we add another cache level.  Or just modify or enhance the
current cache's behavior.

Again, putting more of the decision making in mainloop() may seem an easy way
out; I think instead there is an urgent need to rationalise the language,
before much gets added cachewise.

There are two places where we may look for guidance:
 * The HTTP protocol itself.
 * Newer versions of libwww.

Since HTTP is engineered to support caching, and is quite detailed about it,
it may tell us a lot about what kinds of requests a level of a cache
hierarchy can make - what kinds of parameters are needed, what modes of
operation one could think of etc.  With necessary modifications of course,
for the program-internal rather than over-the-network case.  Note that
HTTP caching is completely based on the style B communication (above).
(Squid caches talking ICP with each other are something else.)

In libwww 5.2, I find in <>:

  ------------ snip ------------
HTTP Cache Validation and Cache Control

   The Library has two concepts of caching: in memory and on file.  When
   loading a document, this flag can be set in order to define who can give
   a response to the request.  The memory buffer is considered to be
   equivalent to a history buffer.  That is, it doesn't follow the same
   expiration mechanism that is characteristic for a persistent file cache.

   You can also set the cache to run in disconnected mode - see the [23]Cache
   manager for more details on how to do this.
typedef enum _HTReload {
    HT_CACHE_OK             = 0x0,              /* Use any version available */
    HT_CACHE_FLUSH_MEM      = 0x1,      /* Reload from file cache or network */
    HT_CACHE_VALIDATE       = 0x2,                   /* Validate cache entry */
    HT_CACHE_END_VALIDATE   = 0x4,                  /* End to end validation */
    HT_CACHE_FLUSH          = 0x10,                     /* Force full reload */
    HT_CACHE_ERROR          = 0x20         /* An error occurred in the cache */
} HTReload;

extern void HTRequest_setReloadMode (HTRequest *request, HTReload mode);
extern HTReload HTRequest_reloadMode (HTRequest *request);

------------ snip ------------

Hey, that looks just like what we need!  Instead of setting many variables,
mainloop() and other higher-level functions could just set one request
variable to the right value, using max() if appropriate, to bump up the,
hmm, nocache-nature of a request.

Rough correspondence to current flags (omitting the two that may not really
belong here):
    HT_CACHE_OK             default or LYoverride_no_cache
    HT_CACHE_FLUSH_MEM      LYforce_no_cache, text->no_cache
    HT_CACHE_FLUSH          `reloading'

Well, the correspondence isn't a direct one. But I think something like
this should replace the current flags as far as possible.  With added
values if we need more.
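For instance (mode names modeled on the libwww enum quoted above, the struct and helper invented for illustration, not actual code), the whole flag soup could reduce to one ordered per-request mode plus a "bump" helper, so callers can only strengthen a request, never accidentally weaken it:

```c
/* Ordered reload modes, weakest to strongest; a sketch after libwww's
 * HTReload, not actual Lynx code. */
typedef enum {
    CACHE_OK        = 0,   /* any cached version will do */
    CACHE_FLUSH_MEM = 1,   /* skip the rendered-document cache */
    CACHE_VALIDATE  = 2,   /* revalidate (If-Modified-Since / ETag) */
    CACHE_FLUSH     = 3    /* force full reload from the origin */
} ReloadMode;

typedef struct {
    ReloadMode reload;     /* ... plus the other request fields ... */
} Request;

/* Bump the "nocache-nature" of a request with max(): never weaken an
 * already stronger demand. */
void request_bump_reload(Request *req, ReloadMode mode)
{
    if (mode > req->reload)
        req->reload = mode;
}
```

Each place that now sets `reloading', `LYforce_no_cache' etc. would instead call the bump helper, and HTLoadDocument() would inspect a single value.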

What do you think?


Some comments on your concrete suggestions:

> Currently I think cache should be checked in HTLoadDocument() or HTLoad(),

Maybe add another function call level?

> than in HTLoadHTTP() we try If-Modified_Since and update cached header
> from (any) responce, and the last step -

No, only for 200 responses (and maybe similar ones).
Think about what happens if we get 4xx or 5xx responses intermixed with
2xx responses for the same URL (maybe unreliable server...)
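The point could be expressed as a small guard (invented helper, not Lynx code): only a successful response may overwrite cached data, so a transient 4xx/5xx from a flaky server never destroys a good cache entry.

```c
/* Should this response's body replace the cached data?  304 Not Modified
 * refreshes only metadata (handled elsewhere); errors touch nothing.
 * Invented helper for illustration, not Lynx code. */
int response_updates_cache(int status)
{
    if (status == 304)
        return 0;                        /* revalidated: keep cached body */
    return status >= 200 && status < 300;   /* 2xx only */
}
```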

Where do you plan to keep all the metainformation? (Last-modified, Etag, ...)
In the node_anchor as now?  Maybe that is not so good.

> update cached data in LYAddVisitedLink() when everything done OK.

Much too late (too high level) for my taste...

Anyway, I am not sure what you mean by "update cached data".
I have to read your proposal more carefully.

> Anyway,
> we should kludge into HTStream properly...
> and the Anchor structure should be coupled with cached source
> when doing HTuncache_current_document() for rendered image.


Well, all I wrote above was so much theory.  In a practical implementation,
maybe we end up with something completely different.  But before you go
ahead, could you try to explain how your ideas fit into my understanding?
Also, may I suggest you have a look at <>
and see whether you can reuse anything from there.

> By introducing a simple invariant
> "current_char_set and UCLYhndl_for_unspec always valid"
> I wipe lots of  if (hndl>= 0)  and made the code more understandable :-)

Well, good to know that that is now meant to be always true. :)

