Re: Wide string strategies


From: Mike Gran
Subject: Re: Wide string strategies
Date: Thu, 09 Apr 2009 20:39:18 -0700

On Thu, 2009-04-09 at 22:25 +0200, Ludovic Courtès wrote: 
> Hi!

> > -  SCM_WTA_DISPATCH_1 (*SCM_SUBR_GENERIC (proc), arg1,
> > -                 SCM_ARG1, scm_i_symbol_chars (SCM_SNAME (proc)));
> > +  {
> > +    char *str = scm_to_locale_string (scm_symbol_to_string (SCM_SNAME (proc)));
> > +    SCM_WTA_DISPATCH_1 (*SCM_SUBR_GENERIC (proc), arg1, SCM_ARG1, str);
> > +    free (str);
> > +  }
> 
> This is the kind of thing we can't afford in most cases.
> 
> Here STR is only needed because `SCM_WTA_DISPATCH_1 ()' calls
> `scm_wrong_type_arg ()', which operates on C strings.
> 
> One solution would be to change `scm_wrong_type_arg ()' to operate on
> opaque strings (e.g., take an `SCM' instead of `const char *').  The
> same applies to all the functions in "error.h", and probably many
> others.
> 

Makes sense.

> I think procedures like `scm_i_string_ref_eq_char ()' are a good idea
> because it fulfills the goal of having an opaque string type *and* the
> goal of being able to handle them easily in C.

I like it, too.

> All the POSIX interface needs fast access to ASCII strings.  How about
> something like:
> 
>   const char *layout = scm_i_ascii_symbol_chars (SCM_PACK (slayout));
> 
> where `scm_i_ascii_symbol_chars ()' throws an exception if its argument
> is a non-ASCII symbol?
> 
> This would mean special-casing ASCII stringbufs so that we can treat
> them as C strings.

OK.  Fast ASCII strings for the evaluator and for POSIX should be easy
enough.  Are there any other modules that definitely require fast
strings?
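
To make the ASCII fast path concrete, here is a minimal sketch in plain C. It is not Guile code: `struct toy_stringbuf`, `toy_all_ascii` and `toy_ascii_chars` are hypothetical names, and the layout is an assumption made only for illustration. The idea is that a buffer flagged as ASCII at creation time can be handed out directly as a C string, while anything else is refused (the sketch returns NULL where the proposed `scm_i_ascii_symbol_chars ()' would throw).

```c
#include <stddef.h>

/* Hypothetical stand-in for a string buffer; not Guile's real layout.  */
struct toy_stringbuf
{
  int is_ascii;        /* nonzero if every byte is below 0x80 */
  size_t len;
  const char *bytes;   /* NUL-terminated when is_ascii is nonzero */
};

/* Return nonzero if the first LEN bytes of S are all 7-bit ASCII.  */
static int
toy_all_ascii (const char *s, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if ((unsigned char) s[i] >= 0x80)
      return 0;
  return 1;
}

/* Fast path: return the bytes as a C string without copying, or NULL
   if the buffer is not ASCII.  A real implementation would presumably
   signal an error instead of returning NULL.  */
static const char *
toy_ascii_chars (const struct toy_stringbuf *buf)
{
  return buf->is_ascii ? buf->bytes : NULL;
}
```

The point of the flag is that the check is paid once, when the buffer is built, so callers such as the POSIX wrappers get a pointer with no conversion and no allocation.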

Also, the interaction between strings and sockets needs more thought.
If sendto and recvfrom are used for datagram transmission, as their
docstrings suggest, then locale string conversion could be a
bad idea.  (And, these functions should also operate on u8vectors, but
that's another issue.)

To be more general, I know some apps depend on 8-bit strings and use
them as storage of non-string binary data.  I think SND falls into this
category.  I wonder if ultimately wide strings would have to be a
run-time option that is off by default.  But I am (choose your English
idiom here) getting ahead of myself, or jumping the gun, or putting the
cart before the horse.

> > +SCM_INTERNAL int scm_i_string_ref_eq_char (SCM str, size_t x, char c);
> > +SCM_INTERNAL int scm_i_symbol_ref_eq_char (SCM str, size_t x, char c);
> > +SCM_INTERNAL int scm_i_string_ref_neq_char (SCM str, size_t x, char c);
> > +SCM_INTERNAL int scm_i_symbol_ref_neq_char (SCM str, size_t x, char c);
> 
> I'd remove the `neq' variants.
> 

Sure.

> > +SCM_INTERNAL int scm_i_string_ref_eq_int (SCM str, size_t x, int c);
> 
> Does it assume sizeof (int) >= 32 ?

I suppose it does.  But, I only used it to compare to the output of
scm_getc which also returns an int.
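
Reading the question as one about bit width (sizeof counts bytes, so the width is `sizeof (int) * CHAR_BIT'), the assumption can be checked at compile time. The following is only a sketch: `toy_int_has_32_bits` and `toy_codepoint_eq_int` are hypothetical names, and the comparison stands in for whatever `scm_i_string_ref_eq_int ()' does with a value returned by a getc-style function.

```c
#include <limits.h>
#include <stdio.h>

/* Compile-time check: the array size is -1, and compilation fails,
   if int has fewer than 32 bits.  */
typedef char toy_int_has_32_bits[(sizeof (int) * CHAR_BIT >= 32) ? 1 : -1];

/* Hypothetical helper: compare a code point with the int returned by
   a getc-style function.  EOF never matches any code point.  */
static int
toy_codepoint_eq_int (unsigned long codepoint, int c)
{
  if (c == EOF)
    return 0;
  return codepoint == (unsigned long) c;
}
```

With a 32-bit int, every Unicode code point up to 0x10FFFF fits alongside the out-of-band EOF value, which is why the comparison against a getc-style result is safe under that assumption.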

> 
> > +SCM_INTERNAL size_t scm_i_string_contains_char (SCM str, char ch);
> 
> Since it really returns a boolean, I'd use `int' as the return type.

Makes sense.

> 
> > +SCM_INTERNAL char *scm_i_string_to_write_sz (SCM str);
> > +SCM_INTERNAL scm_t_uint8 *scm_i_string_to_u8sz (SCM str);
> > +SCM_INTERNAL SCM scm_i_string_from_u8sz (const scm_t_uint8 *str);
> > +SCM_INTERNAL const char *scm_i_string_to_failsafe_ascii_sz (SCM str);
> > +SCM_INTERNAL const char *scm_i_symbol_to_failsafe_ascii_sz (SCM str);
> 
> What does "sz" mean?

Back in the day, "sz" was Microsoft-speak for the pointer to the first
character of a null-terminated char string.  By not knowing that, you
have demonstrated that you remain unpolluted. ;-) I probably was trying
to avoid writing "scm_i_string_to_string."

> 
> > +/* For ASCII strings, SUB can be used to represent an invalid
> > +   character.  */
> > +#define SCM_SUB ('\x1A')
> 
> Why SUB?  How about `SCM_I_SUB_CHAR', `SCM_I_INVALID_ASCII_CHAR' or
> similar?

If you're asking why SUB is set to 0x1A, the standard ECMA-48 says 0x1A
should be used to indicate an invalid ASCII character.  If you're asking
why I just called it SCM_SUB, laziness.

SCM_I_INVALID_ASCII_CHAR works for me.
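
For illustration, a "failsafe ASCII" conversion using SUB might look like the sketch below. This is an assumption about the intent, not the patch's code: `TOY_INVALID_ASCII_CHAR` and `toy_to_failsafe_ascii` are hypothetical names, and the input is modelled as an array of code points rather than a Guile string.

```c
#include <stddef.h>

/* SUB (0x1A) marks a character that cannot be represented in ASCII.  */
#define TOY_INVALID_ASCII_CHAR ('\x1A')

/* Copy LEN code points from SRC into DST as ASCII, replacing every
   code point above 0x7F with SUB.  DST must have room for LEN + 1
   bytes; the result is NUL-terminated.  */
static void
toy_to_failsafe_ascii (const unsigned long *src, size_t len, char *dst)
{
  size_t i;

  for (i = 0; i < len; i++)
    dst[i] = (src[i] < 0x80) ? (char) src[i] : TOY_INVALID_ASCII_CHAR;
  dst[len] = '\0';
}
```

Because the conversion never fails, it suits places like error reporting where throwing a second exception would be unhelpful.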

> 
> Thanks,
> Ludo'.
> 
> 
I'll try to rework this next week.

-Mike





