Re: byte-order marks
From: Ludovic Courtès
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 14:27:11 +0100
User-agent: Gnus/5.130005 (Ma Gnus v0.5) Emacs/24.2 (gnu/linux)
Andy Wingo <address@hidden> writes:
[...]
>> Regarding byte-order marks, my preference is that users should explicitly
>> consume BOMs if that's what they want (ideally using some convenience
>> procedure provided by Guile). Sometimes consuming the BOM is the wrong
>> thing. For example, if the user is copying a file to another file, or
>> to a socket, it may be important to preserve the BOM.
>
> If you are copying a binary file, you should use binary APIs. Otherwise
> you can misinterpret the characters, and potentially write them as a
> different encoding.
>
> Also otherwise, without O_BINARY on Windows, you will end up munging
> line-ends. So from a portable perspective, reading a file as
> characters already implies munging the text.
Agreed. Reading textual data implies interpretation of its byte
structure, and the BOM is just part of that meta-data.
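
To illustrate the distinction between binary and textual reads, here is a minimal sketch in Python rather than Guile (chosen only for illustration; nothing here is Guile's API). It assumes UTF-8 input and uses Python's "utf-8-sig" codec as a stand-in for a textual port that treats a leading BOM as metadata.

```python
import codecs
import io

# UTF-8 text preceded by a byte-order mark, as it might sit on disk.
data = codecs.BOM_UTF8 + "héllo\n".encode("utf-8")

# Binary read: bytes pass through untouched, so the BOM is preserved.
binary_copy = io.BytesIO(data).read()

# Textual read with "utf-8-sig": a leading BOM is consumed during decoding.
text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8-sig").read()

# Textual read with plain "utf-8": the BOM surfaces as U+FEFF in the text.
raw_text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8").read()
```

A byte-for-byte copy keeps the BOM, while the textual read that interprets it yields only the characters the file was meant to carry.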
>> If others feel strongly that BOMs should be consumed by default, then
>> the following compromise is about as far as I'd (reluctantly) consider
>> going:
>>
>> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of
>> the stream, but *only* if the BOM is already in the encoding specified
>> by the user (possibly via an explicit call to 'file-encoding').
>
> The problem is that we have no way of knowing what file encoding the
> user specifies. The encoding could come from the environment, or from
> some fluid that some other piece of code binds. We are really missing
> an encoding argument to open-file.
Well, ‘%default-port-encoding’ is really an argument to ‘open-file’,
though admittedly not a convenient one. However, there’s no way to open
a file in binary mode when using ‘open-input-file’,
‘call-with-input-file’, etc.
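
For reference, the compromise quoted above (consume a BOM only when it matches the encoding the user asked for) could look roughly like the following. This is a hedged Python sketch, not Guile code: `BOMS` and `read_text` are hypothetical names, and the explicit `encoding` parameter stands in for the encoding argument that is missing from open-file.

```python
import codecs

# BOM byte sequences keyed by the encoding the caller requested
# (an illustrative table, not an exhaustive one).
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}

def read_text(data: bytes, encoding: str) -> str:
    """Decode data, skipping a leading BOM only if it matches encoding."""
    bom = BOMS.get(encoding.lower())
    if bom is not None and data.startswith(bom):
        data = data[len(bom):]
    return data.decode(encoding)
```

With this rule, a UTF-8 BOM in a stream the caller declared as Latin-1 is left alone and decoded as three ordinary characters.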
>> Having said all this, if 'open-input-file' is changed to no longer call
>> 'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
>> BOMs to its list of heuristics, though I tend to agree with Mike that a
>> coding declaration should take precedence, for the reasons he described.
>
> OK. Incidentally we should relax the scan-for-encoding requirement that
> the coding be in a comment, as we will begin compiling javascript, lua,
> etc files in the future.
OTOH, that would make it more likely that the “coding:” sequence is
misinterpreted as a coding declaration in contexts that have nothing to
do with that.
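
A small Python sketch of that risk follows. The two patterns are assumptions made up for illustration and do not reproduce what 'scm_i_scan_for_file_encoding' actually does: `STRICT` accepts “coding:” only after a `;` comment marker on the same line, while `RELAXED` accepts it anywhere.

```python
import re

# Strict: "coding:" must follow a ';' comment marker on the same line.
STRICT = re.compile(r";.*?coding:\s*([-\w.]+)")
# Relaxed: "coding:" anywhere in the scanned text counts.
RELAXED = re.compile(r"coding:\s*([-\w.]+)")

def scan(header: str, pattern: re.Pattern):
    """Return the declared encoding name found by pattern, or None."""
    match = pattern.search(header)
    return match.group(1) if match else None

# A genuine declaration inside a comment.
declared = ";; -*- coding: utf-8 -*-\n(display 1)\n"
# The same character sequence appearing incidentally in a string literal.
incidental = '(display "see coding: guidelines")\n'
```

Both patterns find `utf-8` in `declared`, but only the relaxed one reports a (bogus) encoding named `guidelines` for `incidental`.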
> I liked that my solution "just worked" with a small amount of code and
> no changes to the rest of the application. I can't help but think that
> requiring the user to put in more code is going to infect an endless set
> of call sites with little "helper" procedures that aren't going to be
> more correct in aggregate.
For textual files, it doesn’t seem unreasonable for ‘open-input-file’ to
consume the BOM, IMO. It’s not much different from the ‘eol-style’
transcoders.
Ludo’.