
Re: [Bug-wget] bad filenames (again)


From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Mon, 24 Aug 2015 15:44:09 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )

On Saturday 22 August 2015 00:39:01 Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:
> > > Content-Disposition: attachment;
> > > filename="20101202_%EB...%A8-%EB%B0%B1_.sgf"
> > > This encodes a valid utf-8 filename, and that name should be used.
> > > So wget should save this file under the name
> > > 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
> > 
> > This is a different issue. Here we are talking about the encoding of HTTP
> > headers, especially 'filename' values within Content-Disposition HTTP
> > header. Wget simply does not parse this correctly - it is just not coded
> > in. It is just Wget missing some code here (worth opening a separate
> > bug).
> Good, saved for later.

Just implemented (or let's say fixed) Content-Disposition in wget2. It now 
saves the file as
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
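
A minimal sketch of the idea in Python (this is not the wget2 code; the function name and the regex-based parsing are assumptions made for illustration). It extracts a quoted 'filename' value, percent-decodes it, and uses the result only if the bytes are valid UTF-8:

```python
import re
from typing import Optional
from urllib.parse import unquote_to_bytes


def filename_from_content_disposition(header: str) -> Optional[str]:
    """Hypothetical helper: return the decoded 'filename' value, or None."""
    # Only the plain quoted form filename="..." is handled here;
    # the filename* form is deliberately left out of this sketch.
    match = re.search(r'filename="([^"]*)"', header)
    if match is None:
        return None
    raw = match.group(1)
    try:
        # Percent-decode to bytes, then accept the name only if it is valid UTF-8.
        return unquote_to_bytes(raw).decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 after decoding: keep the value exactly as the server sent it.
        return raw
```

The fallback branch is a design choice of this sketch: if the decoded bytes are not UTF-8, guessing another charset seems riskier than keeping the raw value.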

Content-Disposition (filename, filename*) is standardized, but browsers seem 
to parse it very differently, ignoring the standards.
See 
http://stackoverflow.com/questions/93551/how-to-encode-the-filename-parameter-of-content-disposition-header-in-http
(answer 2 from Martin Ørding-Thomsen)

But that's just FYI. Different issue.


> > If the server AND the document do not explicitly specify the character
> > encoding, there still is one - namely the default. Has been ISO-8859-1
> > a while ago. AFAIR, HTML5 might have changed that (too late for me now
> > to look it up).
> 
> Yes - that is our main difference. You read the standard and find there
> what everyone is supposed to do, or what the default is.
> I download stuff from the net and encounter lots of things people do,
> that are perhaps not according to the most recent standard,
> and may differ from the default.
> 
> As a consequence I prefer to base the decision about what to do
> on the form of the filename (ASCII / UTF-8 / other), not on the
> headers encountered on the way to this file.
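
For reference, deciding by the form of the filename could look roughly like this. It is a sketch only; the function name and the three labels are assumptions for illustration, not anything in Wget:

```python
def classify_filename(raw: bytes) -> str:
    """Hypothetical helper: label a raw filename as 'ascii', 'utf-8' or 'other'."""
    if raw.isascii():
        return "ascii"
    try:
        # Strict decoding: any invalid byte sequence raises UnicodeDecodeError.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"
```

Note that a name in some other 8-bit charset can occasionally happen to be valid UTF-8, so "utf-8" here means "decodes as UTF-8", not "was certainly meant as UTF-8".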

I guess we can find an easy agreement.

1. Wget has to obey the defaults. If that fails or we find a well-known 
misbehavior (server/document fault), handle it automatically.
That's how we try to do it now.

2. If a problem still arises, the user should be able to intervene, using 
special command-line options for fine-tuning Wget's behavior.

Of course we try our best, so that 2. is normally not necessary.

You already gave some examples, one of which (the Content-Disposition example) 
has already led to an improvement (I'll transfer the code to Wget1.x soon).
The other two obeyed the standards (one had f*cked up content, but that didn't 
touch Wget's functionality).

I would ask you to give more examples of websites that you think don't follow 
the standards and/or where Wget has problems parsing out the links.
That would be 50% of the work.

> (By the way, I checked my conjecture that iconv from UTF-8
> to UTF-8 need not be the identity map, and that is indeed the case.
> On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)

We should have a 'shortcut', so if to-charset and from-charset are the same, 
we don't convert. 
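
Roughly like this, sketched in Python with the built-in codecs standing in for iconv (the function name is hypothetical, and comparing names via codecs.lookup() is just one possible way to treat 'UTF-8' and 'utf8' as the same charset):

```python
import codecs
import unicodedata


def convert(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Hypothetical helper: convert between charsets, skipping no-op conversions."""
    # codecs.lookup() maps aliases such as 'UTF-8' and 'utf8' to one canonical name.
    if codecs.lookup(from_charset).name == codecs.lookup(to_charset).name:
        return data  # shortcut: the bytes are returned untouched
    return data.decode(from_charset).encode(to_charset)


# NFD and NFC spell the same text with different bytes, which is why an
# "identity" conversion that normalizes would silently change a filename.
nfd = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")  # 'e' + combining acute
nfc = unicodedata.normalize("NFC", "\u00e9").encode("utf-8")  # precomposed character
```

With the shortcut, convert(nfd, "UTF-8", "utf8") hands back the NFD bytes unchanged instead of passing them through a converter that might normalize them.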

Tim
