Re: faster fnmatch
Mon, 20 Apr 2009 14:18:19 +0200
On Sat, Apr 18, 2009 at 08:53:13PM +0200, Bruno Haible wrote:
> Ondrej Bilka wrote:
> > I looked more into the source and discovered fnmatch doesn't work as I imagined.
> > By default it converts strings into wide characters and matches there.
> > UTF-8 allows searching to be done bytewise. It's faster in most cases.
> fnmatch converts to wide characters because it often makes several passes
> across many characters of the string, and at each pass it has to call mbrtowc
> for looking up the extent of that character. And while UTF-8 is the most
> common encoding, there are other ones, such as ISO-8859-2 or GB18030, for
> which mbrtowc is really expensive.
I use it only when the encoding is UTF-8 or single-byte; otherwise I convert.
> > Is it OK to just use the original fnmatch if the pattern contains an extended
> > wildcard or a non-ASCII symbol?
> No. If the encoding is GB18030 and the pattern is "*5*", and you attempt
> to search for the '5' byte for byte, you will find a match where there
> is actually none - because multibyte characters in GB18030 can contain
> values in the range 0x30..0x39 in bytes 2..4.
> Similarly for the BIG5, BIG5-HKSCS, GBK, and SHIFT_JIS encodings.
Shift-state encoding is a great idea.
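
The GB18030 false match described above can be sketched as follows. This is
only an illustration, not code from the patch: the helper names
(gb18030_char_len, contains_byte, contains_char) are made up here, and the
length rule is a simplification that assumes well-formed input (a lead byte
of 0x80 or above followed by a byte in 0x30..0x39 starts a four-byte
character, any other non-ASCII lead byte starts a two-byte character).

```c
#include <stddef.h>
#include <string.h>

/* Simplified length of the GB18030 character starting at s.
   Hypothetical helper: assumes valid input, does no validation. */
static size_t gb18030_char_len(const unsigned char *s)
{
    if (s[0] < 0x80)
        return 1;                       /* ASCII */
    if (s[1] >= 0x30 && s[1] <= 0x39)
        return 4;                       /* four-byte form */
    return 2;                           /* two-byte form */
}

/* Naive byte-for-byte search, as one would do for "*5*". */
static int contains_byte(const char *s, char c)
{
    return strchr(s, c) != NULL;
}

/* Character-aware search: only single-byte characters can equal c. */
static int contains_char(const char *s, char c)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        size_t n = gb18030_char_len(p);
        size_t i;

        if (n == 1 && *p == (unsigned char)c)
            return 1;
        /* Stop on truncated input instead of reading past the end. */
        for (i = 0; i < n; i++)
            if (p[i] == 0)
                return 0;
        p += n;
    }
    return 0;
}
```

With the four-byte sequence "\x81\x35\xF4\x37" the byte search reports a '5'
(the second byte is 0x35), while the character-aware scan does not.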
> > Here is a casefold patch for fnmatch (abusing wchar=u32).
> wchar_t == ucs4_t is only generally true on glibc systems, not on
> Solaris, FreeBSD, AIX, etc.
Is there any way to detect it?
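
One possible check, as a sketch only: C99 lets an implementation define the
macro __STDC_ISO_10646__ when wchar_t values are ISO 10646 code points.
Combining that with a size test gives a conservative answer; a platform that
does not define the macro is simply treated as "not UCS-4" and would take the
fallback path. The function name wchar_is_ucs4 is made up for this example.

```c
#include <wchar.h>

/* Returns 1 if wchar_t can be treated as UCS-4, 0 otherwise.
   Conservative: a missing __STDC_ISO_10646__ is taken as "no". */
static int wchar_is_ucs4(void)
{
#if defined __STDC_ISO_10646__
    return sizeof(wchar_t) >= 4;
#else
    return 0;
#endif
}
```

Because the macro is known at compile time, the same condition could also
select between the two code paths with #if instead of a run-time call.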
I am sending a version that should produce the same matches as fnmatch but falls
back to it in a lot of cases:
1. weird encoding. Here I can't do much.
2. fold - TODO
3. multibyte in . Does mblen recognize a+acute as one character?
4. collation. Code for collation is rather incomprehensible.
How does this relate to regexp collation?
5. extended wildcards. except negation they are easy to add but they are rarely