Re: sed UTF-8 processing problem
From: Eli Zaretskii
Subject: Re: sed UTF-8 processing problem
Date: Tue, 15 Jun 2021 14:43:05 +0300

> From: Klaus Dechet <kdec.net@gmail.com>
> Date: Mon, 14 Jun 2021 23:15:31 +0200
>
> Running sed in windows 10 cmd terminal.
>
> sed --version
> GNU sed version 4.2.1
> Copyright (C) 2009 Free Software Foundation, Inc.
>
> In cmd terminal I enter the following:
>
> D:\Temp>chcp 65001
> D:\Temp>echo aΣb
> aΣb
> D:\Temp>echo aΣb > utf82.txt
> File utf82.txt is utf-8 encoded and has Σ encoded in 2 bytes (\u03A3)
>
> D:\Temp>echo aΣb | sed s/./X/g
> XXXXX
>
> This shows that sed is not processing UTF-8 encoding properly.
>
>
> D:\Temp>echo aΣb | sed s/./X/g > sedoutput.txt
>
> sedoutput.txt is ANSI-1252 encoded.
>
>
> Question: How do I get sed to handle and produce UTF-8 encoded files per
> default?

You can't, not even on Windows 10: its support for UTF-8 encoded text
is still very rudimentary.  In particular, "chcp 65001" doesn't cause
Sed (or any other console application) to use UTF-8 as the locale
codeset, unless the program was specifically modified to support that.

The root cause of the problem here is that the Windows C runtime
library doesn't support UTF-8 encoding in its text-processing
functions, and it also doesn't change the locale's codeset when you
use chcp.

Sorry.
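
To illustrate where the five X's in the report come from, here is a minimal sketch in Python. It is not how Sed is implemented; it only contrasts a byte-oriented match of "." (what a program effectively does under a single-byte locale codeset) with a character-oriented one. It assumes that cmd's echo passes along the space that precedes the pipe character, so the input line is "aΣb " rather than "aΣb".

```python
import re

# Assumption: "echo aΣb | sed ..." in cmd sends the text plus the space
# that precedes the pipe, so the line Sed receives is "aΣb ".
line = "aΣb "

# Byte-oriented view: with a single-byte locale codeset, every byte is
# treated as one character. Σ (U+03A3) is two bytes in UTF-8: 0xCE 0xA3.
raw = line.encode("utf-8")
bytewise = re.sub(rb".", b"X", raw)

# Character-oriented view: what a UTF-8 aware program would do, with
# Σ counted as a single character.
charwise = re.sub(r".", "X", line)

print(len(raw), bytewise.decode("ascii"))   # 5 XXXXX
print(len(line), charwise)                  # 4 XXXX
```

The byte-oriented result matches the XXXXX shown in the report, which is consistent with Sed not using UTF-8 as its codeset there.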