bug-wget

Re: [Bug-wget] Unexpected character on a downloaded page


From: Angel Tsankov
Subject: Re: [Bug-wget] Unexpected character on a downloaded page
Date: Mon, 16 Jun 2014 14:08:01 +0300
User-agent: Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

On 06/15/14 23:28, Ángel González wrote:
On 14/06/14 20:31, Angel Tsankov wrote:
Why does wget 1.15 (and 1.12) insert Â in several places in the copy
it makes of the following page:

http://www.helloquizzy.com/results/helen-fisher-personality-type-test/?var_Explorer=1&var_Negotiator=1&var_Director=1&var_Builder=1&fromCGI=1

Short answer: because that's what is at that page.

Long answer: That page contains several non-breaking spaces (U+00A0,
decimal 160), which when encoded as UTF-8 become the bytes C2 A0. If you
read the page as if it were ISO-8859-1, the byte C2 is displayed
as the glyph Â (and A0 as a non-breaking space).
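
The byte-level effect described above can be reproduced with a minimal
Python sketch. The variable names are only illustrative, and the snippet
assumes nothing about the page beyond the presence of U+00A0:

```python
# U+00A0 (NO-BREAK SPACE) encoded as UTF-8 yields the two bytes C2 A0.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes.hex())  # hex dump of the encoded bytes

# Decoding those same bytes as ISO-8859-1 maps each byte to one character:
# C2 becomes U+00C2 (the glyph Â) and A0 becomes U+00A0 again.
misread = utf8_bytes.decode("iso-8859-1")
print([hex(ord(ch)) for ch in misread])

# Decoding with the charset the server declared recovers the original text.
correct = utf8_bytes.decode("utf-8")
```

The stray Â is therefore a display artifact of the decoding step; the
stored bytes are the same in both cases.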

The page correctly states it's in utf-8:
Content-Type: text/html; charset=utf-8
so it should be read in utf-8 mode.
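
As a rough illustration of honoring that header, the sketch below pulls
the charset out of a Content-Type value and uses it to decode a body. The
helper name charset_from_content_type, the sample body bytes, and the
windows-1252 default are assumptions made for the example, not anything
wget or a browser is confirmed to do:

```python
from email.message import Message


def charset_from_content_type(header_value, default="windows-1252"):
    """Return the charset parameter of a Content-Type value.

    Hypothetical helper; `default` is an assumed fallback used when
    the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Header value as quoted in the message; the body bytes are made up.
header = "text/html; charset=utf-8"
body = b"a\xc2\xa0b"

charset = charset_from_content_type(header)
text = body.decode(charset)  # C2 A0 decodes to a single U+00A0
```

With a header of just "text/html", the helper would return the assumed
default, and the same bytes would decode to three characters instead of
one between "a" and "b".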

(wget is doing nothing here; it's just receiving bytes and storing them in
the file as-is)

Indeed, the browser (Firefox 27.0.1) displays the original page in UTF-8 and the downloaded page in Windows-1252 (which turned out to be the fallback encoding for pages that do not declare their encoding). But if "wget is doing nothing here" why does the browser think that only the original page declares its encoding?


Regards,

Angel Tsankov



