Re: sed UTF-8 processing problem

bug-gnu-utils

From:	Eli Zaretskii
Subject:	Re: sed UTF-8 processing problem
Date:	Tue, 15 Jun 2021 14:43:05 +0300

> From: Klaus Dechet <kdec.net@gmail.com>
> Date: Mon, 14 Jun 2021 23:15:31 +0200
> 
> Running sed in a Windows 10 cmd terminal.
> 
> sed --version
> GNU sed version 4.2.1
> Copyright (C) 2009 Free Software Foundation, Inc.
> 
> In cmd terminal I enter the following:
> 
> D:\Temp>chcp 65001
> D:\Temp>echo aΣb
> aΣb
> D:\Temp>echo aΣb > utf82.txt
> File utf82.txt is UTF-8 encoded and has Σ (U+03A3) encoded in 2 bytes.
> 
> D:\Temp>echo aΣb | sed s/./X/g
> XXXXX
> 
> This shows that sed is not processing UTF-8 encoding properly.
> 
> 
> D:\Temp>echo aΣb | sed s/./X/g > sedoutput.txt
> 
> sedoutput.txt is Windows-1252 (ANSI) encoded.
> 
> 
> Question: How do I get sed to handle and produce UTF-8 encoded files by
> default?

You can't, not even on Windows 10: Windows support for UTF-8 encoded
text is still very rudimentary.  In particular, "chcp 65001" changes
only the console's code page; it doesn't cause Sed (or any other
console application) to use UTF-8 as the locale codeset, unless the
program was specifically modified to support that.  The root cause of
the problem here is that the Windows C runtime library doesn't support
UTF-8 encoding in text-processing functions, and also doesn't change
the locale's codeset when you use chcp.  Sed therefore treats each
byte of the two-byte Σ as a separate character, which is why you get
more X's than there are characters.

Sorry.
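
The byte-versus-character effect can be reproduced outside Sed.  The
following is only a rough Python sketch, not what Sed does internally:
it applies the same "replace every character with X" substitution
once to the text as characters and once to its UTF-8 bytes.  The
trailing space in the sample string is an assumption (cmd's echo
usually passes along the space before the pipe), made so that the
byte count matches the five X's reported above.

```python
import re

# Assumed input: what "echo aΣb | ..." sends down the pipe in cmd,
# including the space before the pipe, without the line terminator.
text = "aΣb "

# Multibyte-aware view: '.' matches one character at a time.
as_chars = re.sub(r".", "X", text)

# Single-byte view: the same text as UTF-8 bytes, with '.' matching
# one byte at a time, so the two bytes of Σ are replaced separately.
utf8_bytes = text.encode("utf-8")
as_bytes = re.sub(rb".", b"X", utf8_bytes)

print(len(text), as_chars)                        # 4 XXXX
print(len(utf8_bytes), as_bytes.decode("ascii"))  # 5 XXXXX
```

The second result corresponds to what a program sees when its locale
codeset is a single-byte one: every byte counts as one character.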
