wget-dev

Re: wget | wget should save directory listings as index.html (#11)


From: Yaroslav Nikitenko (@ynikitenko)
Subject: Re: wget | wget should save directory listings as index.html (#11)
Date: Thu, 26 May 2022 17:28:08 +0000



Yaroslav Nikitenko commented:


According to 
[w3techs](https://w3techs.com/technologies/details/ws-microsoftiis), 
"Microsoft-IIS is used by 6.0% of all the websites whose web server we know", 
so you are probably right.

I agree that saving several versions is possible with
*--convert-links*. If I understand correctly, `directory` and `directory/` will be
stored as `directory.1` and `directory.2` if `directory/x` is found (because
you didn't write that `wget2` would save `directory/` as
`directory/index.html`). Here the order of .1/.2 is arbitrary as well, but
probably not important because of link conversion. So I'll write about the
default mirroring (without *--convert-links*).

Mirroring a site so that we save as many of its pages as we can seems important,
but doing it correctly in every case is impossible in general.

`wget` saves to a file system. For `directory`, `directory/` and
`directory/index.html` it can save only one correct file (`index.html`, which
will directly correspond to the original site); the two other files (if they exist
and are all different) will have to be saved under new names, which can always:

1) conflict with another path from the site (as happens with `index.html`;
however, the site can also have `directory.1`, like
https://docs.djangoproject.com/en/4.1/)
2) falsely represent a non-existent site page. What if we never had
`directory.1` on our site and are surprised that it appears in our
downloaded files? This may be pretty minor, but I explicitly forbid my server
to serve `dir/index.html` as a separate path just to avoid content duplication
with `dir`.
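
To make the collision concrete, here is a minimal Python sketch. The mapping `naive_local_path` and the helper `find_conflicts` are hypothetical names for illustration only; this is not what `wget` or `wget2` actually does. It assumes site-relative paths and that a trailing slash is stored as `index.html`.

```python
from pathlib import PurePosixPath


def naive_local_path(url_path: str) -> str:
    """Hypothetical mapping: a trailing slash is stored as index.html."""
    if url_path.endswith("/"):
        return url_path + "index.html"
    return url_path


def find_conflicts(url_paths):
    """Return pairs of URL paths that cannot coexist on disk.

    Two paths conflict if they map to the same local file, or if one
    local file would have to be a directory containing the other.
    """
    urls = list(url_paths)
    local = {u: PurePosixPath(naive_local_path(u)) for u in urls}
    conflicts = []
    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            pa, pb = local[a], local[b]
            if pa == pb or pa in pb.parents or pb in pa.parents:
                conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    for pair in find_conflicts(["directory", "directory/", "directory/index.html"]):
        print(pair)
```

Under this mapping all three pairs conflict, so at most one of the three paths can keep its "natural" local name.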

This process of picking new names can be endless (if we have `directory.1`, we
save to `directory.2`, etc.; and the same with `index` and `index.1`, etc.), and
in the end there is probably no real difference between `directory.1` and
`index.1.html`. Ignoring conflicting names completely may be a no less justified option.
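
The renaming step can be sketched in a few lines of Python. The function `next_free_name` is a hypothetical helper, and the `.N` suffix scheme is only the one used as an example in this comment, not a documented `wget` rule.

```python
def next_free_name(base: str, taken: set) -> str:
    """Return the first name of the form base.N (N = 1, 2, ...) not in `taken`."""
    n = 1
    while f"{base}.{n}" in taken:
        n += 1
    return f"{base}.{n}"


if __name__ == "__main__":
    # The site itself already serves `directory.1`, so we fall through to `.2`.
    print(next_free_name("directory", {"directory.1"}))
```

The loop always terminates for a finite set of known names, but a path found later on the site can still collide with a name chosen earlier, which is why the process has no natural end.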

This is why I think that this can be solved either

a) through complicated options, with which the user can describe the exact
algorithm they want for their site, or
b) as a "default" solution for most cases. This is not a solution in the strict
sense if we want to browse the site locally; but as I wrote, such a solution does not
exist in general. Whether to save "difficult" paths and what names they get
in this case is optional (not very important).

For the default algorithm, I will start with the preferences.

- `directory/index.html` should have the highest priority, because
`directory`/`directory/` can be auto-generated listings, and are thus less
important. I don't know what `wget2` tracks, but if it knows that `directory`
was saved as `index.html` before, it should replace that with the actual
`index.html` (what happens to that version of `directory` is optional).
- `directory`/`directory/` are more or less the same. If there is no
`directory/index.html` saved (and once we learn that it is a directory because we
see a slash after it), save it as `index.html`, because this is the "native"
(most basic) representation of a web directory. If `index.html` already exists,
this can be because of 1) a real `index.html` or 2) a previous save of the directory
under the other slash variant. In that case the new version is saved under an
optional name (because in case 1 it has a lower priority than `index.html`,
and in case 2 users typically navigate from the root of the site, so if
the other link was found earlier, it should probably have a higher weight;
if both `directory` and `directory/` are found on one page, this may still
hold, because more important links are closer to the top). It seems that this
algorithm slightly contradicts what I wrote about `index` above (the site
maintainer could forget that they have an old `index.html` and use just directory
paths on the main page); maybe you have some concrete examples that we could
look at to choose better priorities?

In case we see `directory/whatever`, we should rename `directory` to 
`directory/index.html` (unless that file already exists, which I discussed 
above).
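
The preferences above can be sketched as a small in-memory model in Python. Everything here is an assumption for illustration: the class `MirrorLayout`, its methods, and in particular the `index.html.N` fallback names are hypothetical and do not describe what `wget` or `wget2` implements. URL paths are site-relative without a leading slash.

```python
class MirrorLayout:
    """Toy model of the local tree: local path -> (kind, source URL path)."""

    def __init__(self):
        self.files = {}

    def _fallback(self, wanted):
        # Hypothetical "optional name" for a lower-priority version.
        n = 1
        while f"{wanted}.{n}" in self.files:
            n += 1
        return f"{wanted}.{n}"

    def _store_listing(self, directory, url):
        # A directory page goes to index.html unless that name is taken.
        index = f"{directory}/index.html"
        target = self._fallback(index) if index in self.files else index
        self.files[target] = ("listing", url)
        return target

    def _promote(self, path):
        # `path` was saved as a plain file; we now know it is a directory.
        if path in self.files:
            _, source = self.files.pop(path)
            self._store_listing(path, source)

    def save(self, url):
        # Seeing `directory/whatever` turns every saved ancestor into a directory.
        parts = url.strip("/").split("/")
        for i in range(1, len(parts)):
            self._promote("/".join(parts[:i]))

        if url.endswith("/index.html"):
            # A real index.html has the highest priority and displaces a listing.
            old = self.files.get(url)
            if old is not None and old[0] == "listing":
                self.files[self._fallback(url)] = old
            self.files[url] = ("index", url)
            return url

        if url.endswith("/"):
            directory = url.rstrip("/")
            self._promote(directory)
            return self._store_listing(directory, url)

        if any(p.startswith(url + "/") for p in self.files):
            # Already known to be a directory because children were saved.
            return self._store_listing(url, url)

        self.files[url] = ("file", url)
        return url


if __name__ == "__main__":
    m = MirrorLayout()
    for u in ["directory", "directory/x", "directory/", "directory/index.html"]:
        print(u, "->", m.save(u))
    print(sorted(m.files))
```

In this run `directory` is first stored as a plain file, moved to `directory/index.html` once `directory/x` is seen, and finally displaced to a fallback name when the real `directory/index.html` arrives. The fallback names can of course still clash with real site paths, as discussed above.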

-- 
Reply to this email directly or view it on GitLab: 
https://gitlab.com/gnuwget/wget/-/issues/11#note_961446277