Re: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
From: Archie Cobbs
Subject: Re: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date: Thu, 18 Nov 2004 11:57:41 -0600 (CST)

Jeroen Frijters wrote:
> The string isn't valid Unicode so the UTF-8 encoder is within its rights
> to encode the surrogate as an invalid character.
Correct (unfortunately :-)
> > Yes, which is how I came across this bug. There are classes
> > in Classpath that store arbitrary binary data within String
> > objects.
>
> Class files don't use UTF-8 to encode strings, they use the format used
> by DataOutputStream.writeUTF (what Sun calls "modified UTF").
Right, though it would be nice if there were an encoder
for "modified UTF" as well.
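
To make the idea concrete, here is a rough sketch of what such an encoder could look like. It is hand-written for illustration only: the class name ModifiedUtfSketch and the method encode are hypothetical, it is not a java.nio.charset.CharsetEncoder, and it is not code from Classpath. It assumes the format described for DataOutputStream.writeUTF (each 16-bit char encoded on its own, with U+0000 written as two bytes) and leaves out the 2-byte length prefix that writeUTF emits.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class ModifiedUtfSketch {

    // Hypothetical helper: encodes each UTF-16 code unit independently,
    // the way writeUTF does, but without the 2-byte length prefix.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // Plain ASCII (except NUL): one byte
                out.write(c);
            } else if (c <= 0x07FF) {
                // Two bytes; this branch also covers U+0000, giving C0 80
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Three bytes; surrogates are NOT paired up, so a lone
                // surrogate is encoded like any other 16-bit value
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // ASCII, NUL, and a lone high surrogate
        String s = "a\u0000\ud8aa";

        // Reference output from writeUTF, with the length prefix stripped
        ByteArrayOutputStream bas = new ByteArrayOutputStream();
        DataOutputStream das = new DataOutputStream(bas);
        das.writeUTF(s);
        das.close();
        byte[] full = bas.toByteArray();
        byte[] payload = Arrays.copyOfRange(full, 2, full.length);

        // Compare the sketch against what writeUTF produced
        System.out.println(Arrays.equals(encode(s), payload));
    }
}
```

Because every char is handled on its own, nothing in this scheme can reject an unpaired surrogate, which is the property that matters for strings holding arbitrary binary data.
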
> So maybe all we need to do is make sure that
> DataOutputStream.writeUTF/DataInputStream.readUTF can roundtrip *any*
> string (even if it has invalid Unicode characters).
Definitely. Here's a test case (this one works, i.e. the string
round-trips and it prints "true"):

import java.io.*;

public class xx {
    public static void main(String[] args) throws Exception {
        // A lone high surrogate: not valid Unicode on its own
        String s = "\ud8aa";

        // Write the string in "modified UTF" format
        ByteArrayOutputStream bas = new ByteArrayOutputStream();
        DataOutputStream das = new DataOutputStream(bas);
        das.writeUTF(s);
        das.close();

        // Read it back and check that it survived the round trip
        DataInputStream dis = new DataInputStream(
            new ByteArrayInputStream(bas.toByteArray()));
        String t = dis.readUTF();
        System.out.println(s.equals(t));
    }
}

My error was assuming that "UTF-8" encoding and Java's "modified UTF"
were the same thing when in fact they are different.
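
For anyone who wants to see the difference directly, the following sketch dumps the bytes each encoding produces for two problem cases: U+0000 and a lone surrogate. The class and helper names (Utf8VsModifiedUtf, hex, modifiedUtf, roundTripsUtf8) are made up for this example, and it uses StandardCharsets, which is newer than this thread. How a "UTF-8" encoder treats the lone surrogate is up to the implementation; the JDKs I am aware of substitute a replacement byte, so the round trip through UTF-8 fails there.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8VsModifiedUtf {

    // Hypothetical helper: space-separated lowercase hex dump
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // Bytes written by DataOutputStream.writeUTF, including the
    // 2-byte length prefix
    static byte[] modifiedUtf(String s) {
        try {
            ByteArrayOutputStream bas = new ByteArrayOutputStream();
            DataOutputStream das = new DataOutputStream(bas);
            das.writeUTF(s);
            das.close();
            return bas.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if encoding to UTF-8 and decoding again gives back the same string
    static boolean roundTripsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String lone = "\ud8aa";

        // Standard UTF-8 writes NUL as a single zero byte
        System.out.println("UTF-8        NUL:  " + hex(nul.getBytes(StandardCharsets.UTF_8)));
        // Modified UTF writes it as c0 80 (after the 00 02 length prefix)
        System.out.println("modified UTF NUL:  " + hex(modifiedUtf(nul)));

        // What the UTF-8 encoder does with a lone surrogate is implementation-defined
        System.out.println("UTF-8        lone: " + hex(lone.getBytes(StandardCharsets.UTF_8)));
        // Modified UTF encodes the 16-bit value as three bytes: ed a2 aa
        System.out.println("modified UTF lone: " + hex(modifiedUtf(lone)));

        System.out.println("UTF-8 round trip of lone surrogate: " + roundTripsUtf8(lone));
    }
}
```
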
-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com