
Frequently Asked Questions

What is the difference between UTF-8 and UTF-16?

ISO
UTF-8 uses a variable number of bytes to store a Unicode code point. The length depends on the code range and varies from 1 to 4 bytes (the original design allowed up to 6 bytes, but RFC 3629 restricts it to 4). Because its code unit is 8 bits (1 byte), it is called "UTF-8". UTF-8 is well suited to the Internet, networks, and applications that must work over slow connections. UTF-16 is the Unicode (or UCS) Transformation Format, 16-bit encoding form.
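As a rough illustration of the variable length, the sketch below maps a code point to its UTF-8 byte count using the ranges from RFC 3629 and cross-checks the result against Python's built-in encoder. The function name utf8_length is made up for this example.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point (RFC 3629 ranges)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1          # ASCII range
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4              # supplementary planes


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    # Cross-check the range table against the built-in encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```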

How do I convert an unpaired UTF-16 surrogate to UTF-8?

FAQ - UTF-8, UTF-16, UTF-32 & BOM
A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]
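One way to see this behaviour is with Python's built-in codecs, which in their default strict mode refuse an unpaired surrogate rather than emit a 3-byte sequence for it. This is only a sketch; the helper name utf16_units_to_utf8 is an assumption made for the example.

```python
def utf16_units_to_utf8(units: list) -> bytes:
    """Convert a list of 16-bit code units to UTF-8, rejecting ill-formed input."""
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    # The default strict error handler raises UnicodeDecodeError on an
    # unpaired surrogate instead of passing it through.
    text = raw.decode("utf-16-be")
    return text.encode("utf-8")


print(utf16_units_to_utf8([0x0041, 0xD83D, 0xDE00]))   # well-formed surrogate pair

try:
    utf16_units_to_utf8([0x0041, 0xD800, 0x0042])       # lone high surrogate
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```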

How many bits are used to represent Unicode, ASCII, UTF-16, and UTF-8 characters?

JAVA interview questions
ASCII requires 7 bits, although it is usually represented as 8 bits. Unicode code points range up to U+10FFFF and so need up to 21 bits; the older claim that Unicode requires 16 bits holds only for the Basic Multilingual Plane. UTF-8 represents characters using 8-, 16-, 24-, and 32-bit patterns. UTF-16 uses 16-bit and 32-bit patterns.

What is the difference between UCS-2 and UTF-16?

FAQ - Basic Questions
UCS-2 is what a Unicode implementation was up to Unicode 1.1, *before* surrogate code points and UTF-16 were added as concepts to Version 2.0 of the standard. This term should now be avoided. When interpreting what people have meant by "UCS-2" in past usage, it is best thought of as not a data format, but as an indication that an implementation does not interpret any supplementary characters. In particular, for the purposes of data exchange, UCS-2 and UTF-16 are identical formats.

How do I get UTF-8?

Tomcat FAQ - Miscellaneous Questions
It is not broken, your tag probably is. Many bug reports have been filed about this. Here is the bug report with all the gory details.

Java Internationalization FAQ
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF).
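The mapping described above can be sketched in a few lines. The function name to_utf16_units is hypothetical; the arithmetic follows the standard surrogate-pair rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_utf16_units(code_point: int) -> list:
    """Encode one Unicode code point as one or two 16-bit code units."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        return [code_point]                 # BMP: the unit has the same value
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in to_utf16_units(0x1F600)])   # ['0xd83d', '0xde00']
```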

FAQ - UTF-8, UTF-16, UTF-32 & BOM
UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode. Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)

Unicode FAQ
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16. UTF-16 allows access to 63K characters as single Unicode 16-bit units.

Perl-XML Frequently Asked Questions
Since Unicode supports character positions higher than 255, a representation of those characters will obviously require more than one 8-bit byte. There is more than one system for representing Unicode characters as byte sequences. UTF-8 is one such system. It uses a variable number of bytes (from 1 to 4 according to RFC 3629) to represent each character. This means that the most common characters (i.e., 7-bit ASCII) only require one byte.

UTF-8 and Unicode FAQ
UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official terms for these encodings are UCS-2 and UCS-4, respectively.

How do you handle UTF-8?

Grapeshot - Developer - FAQs
Grapeshot has a very professional approach to a multitude of character sets. Grapeshot indexing routines identify the character set in use within a document and introduce appropriate stemming routines as part of tokenising the words or phrases within the incoming text. Tokenisation includes word splitting or character separation, as well as dealing with the idiosyncrasies of punctuation within each language.

What can I do with a UTF-8 string?

Perl-XML Frequently Asked Questions
You could obviously convert a UTF-8 encoded string to some other encoding, but before we get on to that, let's look at what you can do with it in its 'natural state'. If you wish to display the string in a web browser, no conversion is necessary. Modern browsers can understand UTF-8 directly, as can be seen on this page on the kermit project web site (some characters in the page will not display correctly without the correct fonts installed but that's a font issue rather than an encoding issue).

What is the UTF-8 encoding?

Java Internationalization FAQ
UTF-8 stands for Unicode (or UCS) Transformation Format, 8-bit encoding form. It is a transmission format for Unicode that uses 8-bit code units.

What is the definition of UTF-8?

FAQ - UTF-8, UTF-16, UTF-32 & BOM
UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5 “Encoding Forms” and Section 3.9 “Unicode Encoding Forms” in the Unicode Standard. See, in particular, Table 3-5 UTF-8 Bit Distribution and Table 3-6 Well-formed UTF-8 Byte Sequences, which give succinct summaries of the encoding form. Also see sample code which implements conversions between UTF-8 and other encoding forms.

Who invented UTF-8?

UTF-8 and Unicode FAQ
The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history).

How about issues in porting to UTF-16 or UTF-32?

FAQ - Programming Issues
There is no simple answer to that. The optimal solution depends on the nature of your application, the nature of the data it reads, and the nature of the APIs you are going to use. Assuming that your application currently reads and manipulates ASCII strings, the first thing to look at is the encoding form of Unicode you are going to use. [MD] & [MS]

When would using UTF-16 be the right approach?

FAQ - Programming Issues
If the APIs you are using, or plan to use, are UTF-16 based, which is the typical case, then working with UTF-16 directly is likely your best bet. Converting data for each individual call to an API is difficult and inefficient, while working around the occasional character that takes two 16-bit code units in UTF-16 is not particularly difficult (and does not have to be expensive). [MD] & [MS]
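To give a sense of how little work the two-unit case involves, here is a minimal sketch that walks a sequence of UTF-16 code units and yields whole code points. The function name iter_code_points is an assumption for illustration, not part of any particular API.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            yield unit                      # ordinary BMP code unit
            i += 1


print([hex(cp) for cp in iter_code_points([0x0048, 0xD83D, 0xDE00])])
```

Only the first branch deals with surrogates; every other code unit passes straight through.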

So how do we get invalid UTF-8 sequences into an Oracle database?

TOYS Frequently Asked Questions
The most common cause is the move towards UTF-8 as the database character set. This is a good idea but unfortunately there appear to be implementation issues which need to be resolved. Basically, if the character set on the client is set to the same as the character set on the server then Oracle does not validate that the character data passed to it is actually valid.
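Since the server may not check the data in this situation, one possible safeguard is to validate byte strings on the client before sending them. The sketch below is independent of Oracle and uses no database API; is_valid_utf8 is a hypothetical helper built on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        # Strict decoding rejects truncated sequences, overlong forms,
        # and encoded surrogates.
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("caf\u00e9".encode("utf-8")))   # True
print(is_valid_utf8(b"caf\xe9"))                    # False: a Latin-1 byte, not UTF-8
```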

What is UTF-8 Character Encoding in WebMail?

E-Marketing Associates ~ Web Site Design, Hosting, Marketing...
Outbound messages sent from WebMail are fully standards compliant with The Unicode Standard, the internationally recognized standard for multilingual communication on the Internet and all modern computer systems worldwide. Unicode ensures that the characters you use in your message are the same characters that the recipient of your message sees.

How do I turn on UTF-8 support in the client?

SILC Secure Internet Live Conferencing
You can give the /set term_type command to see what encoding is currently used. If it is something other than "utf-8", you can turn on UTF-8 by giving the command /set term_type utf-8. Your terminal naturally needs to support UTF-8 properly. In SILC all text messages are UTF-8 encoded, and the client is able to display the message correctly even if your terminal does not support UTF-8. However, if your terminal supports UTF-8, you should turn it on with the /set term_type utf-8 command.

Why are some people opposed to UTF-16?

FAQ - UTF-8, UTF-16, UTF-32 & BOM
People familiar with variable-width East Asian character sets such as Shift-JIS (SJIS) are understandably nervous about UTF-16, which sometimes requires two code units to represent a single character. They are well acquainted with the problems that variable-width codes have caused.

Will UTF-16 ever be extended to more than a million characters?

FAQ - UTF-8, UTF-16, UTF-32 & BOM
No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters. Over a million possible codes is far more than enough for the goal of Unicode of encoding characters, not glyphs.
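The upper bound of 1,114,111 follows directly from the surrogate ranges. As a quick worked check (the variable names are chosen just for this sketch):

```python
BMP_CODE_POINTS = 0x10000                  # U+0000..U+FFFF, one code unit each
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1      # 1024 possible first units
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1       # 1024 possible second units

# Every (high, low) combination addresses one supplementary code point.
supplementary = HIGH_SURROGATES * LOW_SURROGATES
total = BMP_CODE_POINTS + supplementary

print(supplementary)     # 1048576
print(total - 1)         # 1114111, i.e. 0x10FFFF
```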

Does UTF-16 have an alternative representation?

Unicode FAQ
Yes, all characters represented in UTF-16, both those represented with 16 bits and those with a surrogate pair, can be represented as a single 32-bit unit in UTF-32. This single 4-byte code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. For more information, see UTR #19: UTF-32.
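A small check of this correspondence, assuming Python's built-in utf-32-be codec: each character becomes exactly one 32-bit unit whose value equals its scalar value.

```python
import struct

text = "A\u20ac\U0001F600"
raw = text.encode("utf-32-be")                      # explicit byte order, so no BOM
units = struct.unpack(f">{len(raw) // 4}I", raw)    # read big-endian 32-bit units

# Each 32-bit unit equals the scalar value of the corresponding character.
assert list(units) == [ord(ch) for ch in text]
print([f"U+{u:04X}" for u in units])
```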

How can I convert my GEDCOM to UTF-8?

PhpGedView FAQ - Online genealogy at its best
Your GEDCOM file should be encoded in the UTF-8 character set, especially if you use special characters. Most of the current commercial packages allow you to specify the character set when you export your GEDCOM. If UTF-8 is not one of the supported options, then you should export your GEDCOM first using the Unicode or Windows character set. A common encoding option for GEDCOMs is ANSI.
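If the export is in ANSI, a conversion outside the genealogy program might look like the sketch below. It assumes that "ANSI" means Windows code page 1252, which is common but not guaranteed; the function name and file names are placeholders, and the character set declared in the GEDCOM header may also need updating by hand.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    raw = Path(src).read_bytes()
    # Decoding raises UnicodeDecodeError if the file is not really in
    # source_encoding, which is safer than silently writing bad data.
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


# Hypothetical usage:
# convert_to_utf8("family_ansi.ged", "family_utf8.ged")
```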

What can Perl do with a UTF-8 string?

Perl-XML Frequently Asked Questions
Perl versions prior to 5.6 had no knowledge of UTF-8 encoded characters. You can still work with UTF-8 data in these older Perl versions, but you'll probably need the help of a module like Unicode::String to deal with the non-ASCII characters. The built-in functions in Perl 5.6 and later are UTF-8 aware, so, for example, length will return the number of characters rather than the number of bytes in a string, and ord can return values greater than 255.


© Copyright 2007-2008 QueryCAT