From: Michal Nazarewicz
Subject: bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603)
Date: Tue, 21 Mar 2017 03:09:54 +0100
On Sat, Mar 11 2017, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>> Date: Thu, 9 Mar 2017 22:51:44 +0100
>>
>> Implement unconditional special casing rules defined in Unicode standard.
>>
>> Among other things, they deal with cases when a single code point is
>> replaced by multiple ones because single character does not exist (e.g.
>> ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning
>> into SS).
>>
>> * admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
>> standard distribution.
>> * admin/unidata/README: Mention SpecialCasing.txt.
>>
>> * admin/unidata/unidata-gen.el (unidata-gen-table-special-casing): New
>> function for generating ‘special-casing’ character Unicode property
>> built from the SpecialCasing.txt Unicode data file.
>
> This new property is attainable via get-char-code-property, right? If
> so, it should be documented in the Elisp manual, in the "Character
> Properties" node.
>
> I think I'd also like to see a few simple tests for this property.
Done and done. I’ve actually split this property into three separate
ones: special-uppercase, special-lowercase and special-titlecase. The
single property was the odd one out, since it mapped one character to
several values.
>> diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
>> index cf47db4a814..ba1cf2606ce 100644
>> --- a/doc/lispref/strings.texi
>> +++ b/doc/lispref/strings.texi
>> @@ -1166,6 +1166,29 @@ Case Conversion
>> @end example
>> @end defun
>>
>> + Note that case conversion is not a one-to-one mapping and the length
>> +of the result may differ from the length of the argument (including
>> +being shorter). Furthermore, because passing a character forces
>> +return type to be a character, functions are unable to perform proper
>> +substitution and result may differ compared to treating
>> +a one-character string. For example:
>> +
>> +@example
>> +@group
>> +(upcase "ﬁ") ; note: single character, ligature "fi"
>> + @result{} "FI"
>> +@end group
>> +@group
>> +(upcase ?ﬁ)
>> + @result{} 64257 ; i.e. ?ﬁ
>> +@end group
>> +@end example
>> +
>> + To avoid this, a character must first be converted into a string,
>> +using @code{string} function, before being passed to one of the casing
>> +functions. Of course, no assumptions on the length of the result may
>> +be made.
>
> Once the ELisp manual describes the new special-casing property, the
> above text should include a cross-reference to that description.
Ah, actually forgot about that one. I don’t want to resend the patch,
but I’ll add:
+ Mappings for such special cases are taken from the
+@code{special-uppercase}, @code{special-lowercase} and
+@code{special-titlecase} properties (@pxref{Character Properties}).
+
before submitting.
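
As a rough illustration of how the string workaround and the new
properties fit together, here is a minimal sketch. It assumes an Emacs
built with this patch series applied; `my-upcase-char-as-string' is
a hypothetical helper name used only for this example and is not part
of the patch. The expected values for ß and the ﬁ ligature follow the
unconditional mappings in SpecialCasing.txt.

```elisp
;; Sketch only: assumes an Emacs with the special-casing patches applied.
;; `my-upcase-char-as-string' is a hypothetical helper, not part of the patch.
(defun my-upcase-char-as-string (char)
  "Return the upper-case form of CHAR as a string.
Going through `string' lets `upcase' expand CHAR into several
code points, so the result may be longer than one character."
  (upcase (string char)))

;; A character argument forces a character result, so the ligature
;; U+FB01 comes back unchanged.
(upcase ?\uFB01)                     ; => 64257

;; Converting to a string first allows the full mapping.
(my-upcase-char-as-string ?\uFB01)   ; => "FI"
(my-upcase-char-as-string ?\u00DF)   ; => "SS"

;; A string value here means a special upper-case mapping exists for
;; the character; nil means only the ordinary `uppercase' property
;; applies.
(get-char-code-property ?\u00DF 'special-uppercase) ; => "SS"
```

Callers should not assume anything about the length of the returned
string, since a mapping may produce more (or fewer) characters than it
was given.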
>> DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
>> doc: /* Convert argument to upper case and return that.
>> The argument may be a character or string. The result has the same type.
>> -The argument object is not altered--the value is a copy.
>> +The argument object is not altered--the value is a copy. If argument
>> +is a character, characters which map to multiple code points when
>> +cased, e.g. ﬁ, are returned unchanged.
>> See also `capitalize', `downcase' and `upcase-initials'. */)
>
> This (and other similar doc strings) should mention the special-casing
> property as the way to know in advance which characters will remain
> unchanged due to that.
Done.
--
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»