
Re: Finding and mapping all UTF-8 characters

From: Pascal J. Bourguignon
Subject: Re: Finding and mapping all UTF-8 characters
Date: Sat, 05 Dec 2009 17:38:08 +0100
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/22.3 (darwin)

deech <address@hidden> writes:

> Hi all,
> I recently cut-and-pasted large chunks of text into an HTML document.
> When I tried to save the document I was warned that it was ISO-Latin
> but there were UTF-8 characters in the text.

I doubt it warned that.

ISO-Latin is not a character encoding; it is a family of character
encodings.  An HTML document is not encoded by a family of encodings,
but by one single encoding.

UTF-8 is a character encoding.  A character is not a character
encoding.

So a sentence saying that "a document is ISO-Latin but there are
UTF-8 characters in the text" is totally meaningless.

> Is there a way to (1) search for the UTF-8 encoded characters in a
> document and

No, it is not possible, because the characters in a document are not
encoded; they are just characters, that's all.
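What you can do instead is search for characters that fall outside the
repertoire of a target encoding such as ISO-8859-1 (Latin-1).  A minimal
sketch in Python (not part of the original post; the function name is my
own):

```python
def chars_outside_encoding(text, encoding="iso-8859-1"):
    """Return (index, char) pairs for characters in `text` that
    cannot be represented in the given encoding."""
    out = []
    for i, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            out.append((i, ch))
    return out

# 'é' is in Latin-1, but '∈' (ELEMENT OF) is not:
print(chars_outside_encoding("café ∈ set"))  # → [(5, '∈')]
```

Note that the test is against an *encoding's* repertoire, not against
"UTF-8 characters": UTF-8 can encode every Unicode character, so no
character is outside it.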

> (2) map them to a sensible ASCII character?

How do you map sensibly ∈, ㎲, 纺 or ⇣ to the characters in the ASCII
character set?

But even if you chose a mapping (you could, for example, map the
characters to their names: ELEMENT_OF, SQUARE_MU_S, U7EBA, and
DOWNWARDS_DASHED_ARROW), why would you want to do such a thing?
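For what it's worth, such a crude, lossy mapping is easy to sketch with
Python's `unicodedata` module (this is my own illustration, not
something the original post proposes doing):

```python
import unicodedata

def ascii_name_fallback(text):
    """Replace each non-ASCII character with its Unicode character
    name (spaces turned into underscores) -- a lossy mapping."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # Fall back to the code point if the character is unnamed.
            name = unicodedata.name(ch, "U+%04X" % ord(ch))
            parts.append(name.replace(" ", "_"))
    return "".join(parts)

print(ascii_name_fallback("x ∈ S"))  # → x ELEMENT_OF S
```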

HTML is perfectly able to use encodings that can encode unicode
characters, and all the current browsers are able to deal with HTML
documents encoding unicode characters, so why would you want to
massacre your document?
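Declaring the document's single encoding is one line in the HTML head
(a minimal sketch; Latin-1 could be named here instead, but UTF-8
covers all of Unicode):

```html
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
```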

(There is a valid reason for wanting to do that, but if you don't
know it, then you don't have it.)

__Pascal Bourguignon__
