bug-wget

From: ah
Subject: wget prints out information in unicode characters where ASCII could suffice
Date: Sat, 21 Mar 2020 14:40:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.1.1

Hello,

When wget gets a page successfully (consider for example: wget www.gnu.org), it reports something like this:

...output omitted...
2020-03-21 14:00:41 (1.43 MB/s) - ‘index.html’ saved [1114171/1114171]

Please notice that the two apostrophes enclosing the fetched filename are Unicode characters (U+2018 and U+2019, I guess?), whereas the ASCII apostrophe character ' is completely sufficient.

What implications does that have, apart from polluting the terminal?

For one, when a user tries to copy+paste the fetched filename (e.g. index.html) from wget's output, either the apostrophes are copied into the buffer and mess up further commands, or they are not copied and the user needs to add apostrophes manually when pasting. E.g. try

ls ‘index.html’

it fails with

ls: cannot access '‘index.html’': No such file or directory
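
The failure above happens because the shell gives quoting meaning only to the ASCII characters ' and "; U+2018 and U+2019 are treated as ordinary bytes of the filename. As a rough sketch of what a user has to do before reusing a pasted name, the helper below strips one leading U+2018 and one trailing U+2019. The name strip_curly_quotes is made up for illustration and is not part of wget or any other tool; the sketch assumes the pasted text is UTF-8 encoded.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): remove one leading U+2018 and
# one trailing U+2019 from a pasted word. Assumes UTF-8 encoded input.
strip_curly_quotes() {
    lq=$(printf '\342\200\230')   # UTF-8 bytes of U+2018
    rq=$(printf '\342\200\231')   # UTF-8 bytes of U+2019
    word=${1#"$lq"}               # drop the opening quote, if present
    word=${word%"$rq"}            # drop the closing quote, if present
    printf '%s\n' "$word"
}
```

With this in place, something like ls -- "$(strip_curly_quotes "$pasted")" would look for index.html rather than for a name that includes the quote characters ($pasted here stands for whatever was copied from the terminal).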

However, the single (ASCII) quotes are very important for a lot of users in the case where filenames contain spaces or other characters that the shell does not like and that need escaping. So it's a good idea to have them, but who would have thought that the devil is idle and decided to replace all apostrophes in GNU software with Unicode!

So, ideally (AFAIC) wget, on successful completion, should have printed this:

2020-03-21 14:00:41 (1.43 MB/s) - 'index.html' saved [1114171/1114171]

(notice the single ASCII apostrophe for opening AND closing the filename)

and then the user could just copy that string, apostrophes included, and paste it into further commands.


I understand that there is danger in copy+paste-ing information from a program's output. But this is not relevant here, as it is none of wget's business to deter users from copy-pasting its output. If that's a real concern, then consider printing the filename in hex or as an image, or call the copy-paste police and snitch on the user when he/she attempts to use it.

But copy-paste is not the real issue here. There is another issue, far more important: shell scripts processing wget's output.

That brings us to yet another case-in-point where this behaviour of wget makes our lives more difficult: using wget's output in a shell script in order to find out the name of the fetched file. Now, all of a sudden, our shell scripts must deal with Unicode characters too. This is a no-go scenario in many industrial places. A shell script may be classified as sub-standard if it has to deal with Unicode, because of the cans of worms that opens.
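
To make the scripting burden concrete, here is a minimal sketch of what such a script ends up doing: it first maps the two Unicode quotes back to ASCII apostrophes and only then extracts the name. The function name saved_name is hypothetical, and the pattern assumes the exact English "saved [N/M]" wording shown in the example above, which may differ between wget versions and locales.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): print the filename from a
# "... - 'NAME' saved [N/M]" log line, whichever quote style was used.
# Prints nothing if the line does not match.
saved_name() {
    # Build the UTF-8 byte sequences for U+2018 and U+2019 from octal
    # escapes, so that this script itself stays pure ASCII.
    lq=$(printf '\342\200\230')
    rq=$(printf '\342\200\231')
    # LC_ALL=C makes sed work on bytes, independent of the caller's locale.
    printf '%s\n' "$1" | LC_ALL=C sed -n \
        -e "s/$lq/'/g" -e "s/$rq/'/g" \
        -e "s/^.* - '\(.*\)' saved \[.*]\$/\1/p"
}
```

Since wget writes its log messages to standard error, a script would typically feed the function with something along the lines of: wget www.gnu.org 2>&1 | while IFS= read -r line; do saved_name "$line"; done. With ASCII apostrophes in the output, the two quote-mapping expressions would simply not be needed.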

In conclusion, my opinion is that this bug is one of the most unpleasant and dangerous bugs in wget, as it pollutes the terminal with Unicode characters when ASCII characters are more than enough to convey the information to the user. It opens not one but a tonne of cans of worms and can have serious side effects on script processing in industry.

I would therefore URGE you to reconsider the use of unicode characters for mere aesthetic reasons especially when ASCII characters can be used for the same purpose. Aesthetics is a very subjective criterion as you know.

There must be serious reasons to give the KISS principle the capital punishment. Is this what GNU has come to?

On a parallel note, please accept my congratulations on what is otherwise very good software. I use wget daily and I thank you (I too have contributed to public domain software and to software under GNU licensing, spreading the karma of GNU).

bw,