bug-coreutils

bug#60544: sort hangs on lengthy line with invalid UTF8 characters


From: DE CARNE DE CARNAVALET, Xavier [COMP]
Subject: bug#60544: sort hangs on lengthy line with invalid UTF8 characters
Date: Wed, 4 Jan 2023 04:38:33 +0000

sort seems to do a lot of extra computation on a long line containing invalid
UTF-8 characters, and could hang for days on just two lines.

Here is the minimal example I could make to reproduce the bug:
$ perl -e 'print "\xcd\xe5\xe0"; print "\n"' > file1
$ perl -e 'print "\xcd\xe5\xe0"x1000; print "\n"' > file2

To verify:
$ ls -l file*
-rw-rw-r-- 1 u u    4 Jan  4 12:13 file1
-rw-rw-r-- 1 u u 3001 Jan  4 12:13 file2
$ xxd -p file1
cde5e00a
$ xxd -p file2
cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0
[...]
cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0cde5e0
0a

Then:
$ export LC_ALL=en_US.UTF8
$ time sort --debug file1 file2
sort: using 'en_US.UTF8' sorting rules
[...]
real    0m1.951s
user    0m1.951s
sys     0m0.000s

It took nearly two seconds to sort two lines from two files.
If I replace the \xe0 with \x61 in the first (small) file, the time drops
to milliseconds:
$ perl -e 'print "\xcd\xe5\x61"; print "\n"' > file3
$ time sort --debug file3 file2
sort: using 'en_US.UTF8' sorting rules
[...]
real    0m0.007s
user    0m0.003s
sys     0m0.003s

The time increases as one of the files gets larger; see, for instance,
2k repetitions:
$ perl -e 'print "\xcd\xe5\xe0"x2000; print "\n"' > file4
$ time sort --debug file1 file4
sort: using 'en_US.UTF8' sorting rules
[...]
real    0m7.696s
user    0m7.690s
sys     0m0.004s
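
The growth can be checked at several sizes with a small script. The following
is only a sketch, not part of the runs above: the helper names make_line and
time_sort_ms are made up for illustration, it assumes GNU date (for %N) and an
installed en_US.UTF8 locale, and it uses printf octal escapes instead of perl
to write the same bytes.

```shell
#!/bin/sh
# Sketch: time `sort` on a one-sequence line vs. a line holding N repetitions
# of the invalid UTF-8 sequence \xcd\xe5\xe0. Helper names are hypothetical.

# Write the bytes cd e5 e0 (octal 315 345 340) $1 times, then a newline, to $2.
make_line() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '\315\345\340'
        i=$((i + 1))
    done > "$2"
    printf '\n' >> "$2"
}

# Print the wall-clock time, in milliseconds, that sort takes on files $1 and $2.
# Assumption: GNU date, where %N expands to nanoseconds.
time_sort_ms() {
    start=$(date +%s%N)
    LC_ALL=en_US.UTF8 sort "$1" "$2" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

dir=$(mktemp -d)
make_line 1 "$dir/small"
for n in 250 500 1000; do
    make_line "$n" "$dir/big"
    echo "$n repetitions: $(time_sort_ms "$dir/small" "$dir/big") ms"
done
rm -rf "$dir"
```

In the timings above, going from 1000 to 2000 repetitions raised the real time
from 1.951s to 7.696s, a factor of about 3.9, which would be consistent with
roughly quadratic growth in the line length. The script should show whether
the same pattern holds at other sizes.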

I would expect sort to take milliseconds at most in all cases for two
moderately long lines.

$ uname -a
Linux 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 
x86_64 x86_64 x86_64 GNU/Linux
$ apt list installed coreutils
coreutils/focal,now 8.30-3ubuntu2 amd64 [installed]
$ sort --version
sort (GNU coreutils) 8.30

Xavier de Carné de Carnavalet


Attachments: file1, file2, file3, file4

