Re: Japanese '者' (U+8005) is replaced with \350\200
From: OKUMURA, Akira
Subject: Re: Japanese '者' (U+8005) is replaced with \350\200
Date: Fri, 8 Jan 2021 15:08:59 +0900
Dear Karl,
Thank you. I have attached the output.
$ wdiff input1.txt input2.txt
16848 1月2日 40代男性 豊橋市 [-陽性者と接触-] {+知人が陽性?+}? 豊橋市発表445
Here is the output copied and pasted from my terminal. The \350\200 bytes,
which are seen in Emacs, correspond to the characters shown as "?+}?" above.
I am sure that it is not an Emacs issue, because Python also fails to decode
the same output as UTF-8:
$ wdiff input1.txt input2.txt > wdiff.txt
$ python3
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('wdiff.txt')
>>> line = f.readlines()[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 78-79: invalid continuation byte
$ wdiff --version
wdiff (GNU wdiff) 1.2.2
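
For anyone who wants to reproduce the check without hitting the traceback, here is a minimal sketch that reads the data as raw bytes and reports where the UTF-8 decoding breaks. The function name find_invalid_utf8 and the sample bytes are my own illustration, not part of wdiff; the sample simply cuts '者' after its second byte, which is an assumption about the kind of damage, not a copy of the real output.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, bad_bytes) pairs for each undecodable run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing else is broken.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume after the offending bytes
    return problems


if __name__ == "__main__":
    # Hypothetical sample: the last character is missing its final byte.
    sample = "陽性者".encode("utf-8")[:-1] + b"?"
    for offset, bad in find_invalid_utf8(sample):
        print(offset, bad)
```

To run it on the real file, pass open('wdiff.txt', 'rb').read() instead of the sample. Opening the file in binary mode avoids the UnicodeDecodeError shown above, since no decoding happens until the function is called.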
--
OKUMURA, Akira oxon@mac.com / oxon@nagoya-u.jp
⌘ Junior Associate Professor at
- Institute for Space–Earth Environmental Research (ISEE)
- Kobayashi–Maskawa Institute for the Origin of Particles and the Universe (KMI)
Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
Office/Lab/Fax: +81 (0)52-789-4320/4324/4313
skype:okumura.akira
wdiff.txt
Description: Text document
> On Jan 8, 2021, at 7:27, Karl Berry <karl@freefriends.org> wrote:
>
> generates a result with a broken word, in which a Japanese
> character, '者', (Unicode U+8005) is replaced with
> \350\200 when opening the result in Emacs.
>
> Sorry, this probably isn't very helpful, but ... are you sure it's not
> an Emacs issue? As far as I can tell, wdiff is just outputting the bytes
> it sees.
>
> Running wdiff | od -c on your input files, I see the three bytes (in
> octal) 0350 0200 0205 in order in the output. I'm using LC_ALL=C to avoid
> locale interpretations getting in the way. --best, karl.
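
The byte values discussed in this thread can be checked independently of wdiff, od, and Emacs. The following is a small sketch, with a helper name (utf8_octal) that I made up for illustration: it prints the UTF-8 encoding of '者' (U+8005) in octal and shows that the first two bytes alone are not valid UTF-8.

```python
def utf8_octal(text: str):
    """Return the UTF-8 bytes of text as zero-padded octal strings."""
    return [format(b, "04o") for b in text.encode("utf-8")]


if __name__ == "__main__":
    ch = "\u8005"  # 者
    print(utf8_octal(ch))  # ['0350', '0200', '0205']

    # Dropping the third byte leaves an incomplete sequence.
    truncated = ch.encode("utf-8")[:2]
    try:
        truncated.decode("utf-8")
    except UnicodeDecodeError as err:
        print("cannot decode:", err.reason)
```

If the third byte (\205) is lost, the remaining \350\200 pair is exactly what a viewer would display as raw octal escapes.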