guile-devel

Re: Program received signal SIGSEGV, Segmentation fault.


From: Mark H Weaver
Subject: Re: Program received signal SIGSEGV, Segmentation fault.
Date: Sat, 17 Nov 2012 14:56:33 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)

Bruce Korb <address@hidden> writes:
> On 11/16/12 20:22, Mark H Weaver wrote:
>> Bruce, if you refuse to fix these problems properly, you will end up
>
> Hi Mark,
>
> My program's intent is to read text from two inputs and weave them
> together.  It has no need to know or understand the encoding in any way,

To weave them together, you need to interpret the input characters to
recognize the start and end points of each segment that you will copy to
the output.  Therefore, you need to know the correct character encoding.

Let me give you an example of what can happen if you blindly interpret
all inputs as a series of bytes, and interpret all of the bytes that
fall within the ASCII range as delimiters, macro invocations,
expressions, or whatever.

Consider the following quoted string containing a single Chinese
character, and stored in a file using the GBK character encoding:

   "甛"

The bytes in the file corresponding to those three characters are:

   22 AE 5C 22  (hex)

These same bytes, interpreted as ISO-8859-1 (Latin-1), correspond to the
following four characters:

   "®\"

So if autogen reads this file as a sequence of bytes (or coaxes Guile
into doing so) it will see a backslash before the closing quote, and
thus treat it as an escape and keep reading the string.  At which point
your Chinese user is scratching his head and wondering what went wrong,
because he sees no backslash; he sees only a single Chinese character
between the quotes.
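
To make the failure concrete, here is a minimal sketch in C of the two
ways of scanning those four bytes.  The function names and the sample
array are hypothetical and are not taken from autogen or Guile.  The
second scanner assumes only the GBK rule that a lead byte in the range
0x81..0xFE is followed by one trail byte of the same character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not autogen code: byte-wise scan for the closing
   quote of a string that starts at buf[0] == '"'.  Every 0x5C byte is
   treated as a backslash escape.  Returns the index of the closing
   quote, or -1 if the buffer ends first. */
long scan_string_bytes(const unsigned char *buf, size_t len)
{
    assert(buf != NULL && len > 0 && buf[0] == '"');
    for (size_t i = 1; i < len; i++) {
        if (buf[i] == '\\') {
            i++;                /* skip the supposedly escaped byte */
        } else if (buf[i] == '"') {
            return (long)i;
        }
    }
    return -1;
}

/* Same scan, but a GBK lead byte (0x81..0xFE) consumes the following
   trail byte, so a 0x5C trail byte is never mistaken for a backslash. */
long scan_string_gbk(const unsigned char *buf, size_t len)
{
    assert(buf != NULL && len > 0 && buf[0] == '"');
    for (size_t i = 1; i < len; i++) {
        if (buf[i] >= 0x81 && buf[i] <= 0xFE) {
            i++;                /* skip the trail byte */
        } else if (buf[i] == '\\') {
            i++;                /* skip the escaped byte */
        } else if (buf[i] == '"') {
            return (long)i;
        }
    }
    return -1;
}

/* The four bytes from the example above: 22 AE 5C 22. */
const unsigned char gbk_sample[4] = { 0x22, 0xAE, 0x5C, 0x22 };
```

Called on gbk_sample, scan_string_bytes returns -1 because it consumes
the closing quote as an escaped character, while scan_string_gbk returns
3, the index of the real closing quote.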

> I want to hand the Guile library a string, a la
>    (define my-val (get "val-string"))
> where "get" is a function that pulls bytes from the input.

The example above demonstrates that you expect Guile to parse a string
literal, and thus it needs to know how to interpret the bytes as
characters.  For example, it needs to know whether 5C is really a
backslash, or the second byte of a two-byte character sequence for some
Chinese character.

>> But if that's really what you want, fine, here's how you do it:
>> 
>>   (fluid-set! %default-port-encoding "ISO-8859-1")
>>   (set-port-encoding! (current-output-port) "ISO-8859-1")
>>   (set-port-encoding! (current-input-port) "ISO-8859-1")
>>   (set-port-encoding! (current-error-port) "ISO-8859-1")
>> 
>> and make sure to *not* set the locale.
>
> Every time I have a fragment of scheme code, I have a new port.

The (fluid-set! %default-port-encoding "ISO-8859-1") should cause all
ports opened in the future to use the ISO-8859-1 (Latin-1) character
encoding, as long as you haven't called 'setlocale'.  The only reason we
need to call 'set-port-encoding!' on the other ports is that they've
already been opened.

> Doing it this way would require concatenating that text with
> the text to invoke.  That adds an allocate, two string copies
> and a free to every scheme invocation.

I don't understand what you mean here.

> I'll poke around, but
> I am guessing there would have to be some more of this set up
> for each scheme sequence, yes?
>         {
>             SCM ln = AG_SCM_INT2SCM(line);
>             scm_set_port_filename_x(port, file);
>             scm_set_port_line_x(port, ln);
>             scm_set_port_column_x(port, SCM_INUM0);
>         }

I don't think you should need to add anything here, but this reminds me
of another problem with interpreting the inputs as byte streams: the
column number in error messages will not be correct.  It will be a byte
offset within the line rather than a character position.
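
As a rough illustration of how far the two numbers can drift apart, the
sketch below counts characters up to a given byte offset.  It assumes
the line is UTF-8 encoded, which is an assumption made only for this
example, and utf8_column is a hypothetical helper, not a Guile or
autogen function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of characters that start before
   byte_offset in a UTF-8 encoded line.  Continuation bytes have the
   bit pattern 10xxxxxx and do not start a new character, so they are
   not counted. */
size_t utf8_column(const char *line, size_t byte_offset)
{
    assert(line != NULL);
    size_t chars = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        if (((unsigned char)line[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the seven-byte line "\"\xE7\x94\x9B\" x" (a quote, one three-byte
character, a quote, a space, and an x), the x sits at byte offset 6, but
utf8_column(line, 6) returns 4.  A byte-oriented reader would report
the first number; a user looking at the text would expect the second.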

     Mark


