Date: Wed, 4 Jan 2023 04:38:33 +0000

sort seems to do extra computation on long lines containing invalid UTF-8 
characters, and could hang for days on just two lines.

Here is the minimal example I could make to reproduce the bug:
$ perl -e 'print "\xcd\xe5\xe0"; print "\n"' > file1
$ perl -e 'print "\xcd\xe5\xe0"x1000; print "\n"' > file2

To verify:
$ ls -l file*
-rw-rw-r-- 1 u u    4 Jan  4 12:13 file1
-rw-rw-r-- 1 u u 3001 Jan  4 12:13 file2
$ xxd -p file1
$ xxd -p file2

$ export LC_ALL=en_US.UTF8
$ time sort --debug file1 file2
sort: using 'en_US.UTF8' sorting rules
real    0m1.951s
user    0m1.951s
sys     0m0.000s

It took nearly two seconds to sort two lines from two files.
If I replace the \xe0 with \x61 in the first (small) file, the time drops 
to milliseconds:
$ perl -e 'print "\xcd\xe5\x61"; print "\n"' > file3
$ time sort --debug file3 file2
sort: using 'en_US.UTF8' sorting rules
real    0m0.007s
user    0m0.003s
sys     0m0.003s

The time increases as one of the files gets larger; see for instance 
with 2k repetitions:
$ perl -e 'print "\xcd\xe5\xe0"x2000; print "\n"' > file4
$ time sort --debug file1 file4
sort: using 'en_US.UTF8' sorting rules
real    0m7.696s
user    0m7.690s
sys     0m0.004s
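
The growth can be measured more systematically with a loop such as the one 
below. It is a rough sketch under assumptions: the file names (short, long_N) 
and the repetition counts are illustrative, it relies on GNU date for 
nanosecond timestamps, and the en_US.UTF8 locale must be installed for the 
slowdown to show up. The last line only divides the two timings reported 
above.

```shell
#!/bin/bash
# Short line, same bytes as file1 in the report.
perl -e 'print "\xcd\xe5\xe0"; print "\n"' > short

# Illustrative sizes, kept small so the loop finishes quickly.
for n in 100 200 400 800; do
    perl -e 'print "\xcd\xe5\xe0" x $ARGV[0]; print "\n"' "$n" > "long_$n"
    start=$(date +%s%N)                  # GNU date: nanoseconds since epoch
    LC_ALL=en_US.UTF8 sort short "long_$n" > /dev/null
    end=$(date +%s%N)
    echo "$n repetitions: $(( (end - start) / 1000000 )) ms"
done

# Ratio of the reported timings for 2000 versus 1000 repetitions.
awk 'BEGIN { printf "ratio: %.2f\n", 7.696 / 1.951 }'
```

Doubling the line length multiplied the reported time by roughly 3.9, which 
would be consistent with running time that grows about quadratically in the 
length of the line, although two data points are not enough to be sure.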

I would expect sort to take milliseconds at most in all cases for two 
moderately long lines.

$ uname -a
Linux 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 
x86_64 x86_64 x86_64 GNU/Linux
$ apt list --installed coreutils
coreutils/focal,now 8.30-3ubuntu2 amd64 [installed]
$ sort --version
sort (GNU coreutils) 8.30

Xavier de Carné de Carnavalet





