Common Unicode and UTF-8 issues

Unicode is a character set (a list of letters, digits, punctuation marks and other writing symbols) which covers almost all writing systems currently in use worldwide, and also many historic systems of significant academic interest. It covers not only the Latin A-Z alphabet with international accents (such as é and á) and local symbols (such as the £ Sterling sign or € Euro symbol), but also non-Latin scripts such as Greek, Hebrew, Arabic and Chinese. Each symbol is given a number.

The first 128 symbols of Unicode are identical to the older ASCII character set standard, which is also identical to the first half of both the ISO Western (ISO 8859-1) and ISO Celtic (ISO 8859-14) character sets. This means that, provided only unaccented Latin letters are used and local symbols such as the £ pound sign are avoided, every symbol has the same number in ASCII, ISO Western, ISO Celtic and Unicode. For example, uppercase A has the number 65 and lowercase z has the number 122.
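The shared numbering can be checked directly. The following is a minimal Python sketch, not part of HESA's tooling; `latin-1` and `iso8859_14` are Python's codec names for ISO Western and ISO Celtic, and the sample text is purely illustrative.

```python
# Unaccented Latin letters have the same number in ASCII and Unicode.
print(ord("A"))  # 65
print(ord("z"))  # 122

# Illustrative sample using only symbols numbered below 128.
sample = "HESA Az"

# Python codec names: latin-1 = ISO 8859-1, iso8859_14 = ISO 8859-14.
encodings = ["ascii", "latin-1", "iso8859_14", "utf-8"]
encoded = {name: sample.encode(name) for name in encodings}

# Every encoding produces exactly the same bytes for this text.
all_identical = len(set(encoded.values())) == 1
print(all_identical)  # True
```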

The Welsh character ŵ (w-circumflex) is included in Unicode and ISO Celtic but not in ASCII or ISO Western. Conversely, the Scandinavian character ð (eth) is included in Unicode and ISO Western but not in ASCII or ISO Celtic. To make matters more difficult, these two characters are assigned the same number (240) in their respective ISO character sets. Using Unicode solves this problem of ISO character set incompatibility, since these two characters are assigned different numbers in Unicode.
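The clash between the two ISO sets can be illustrated with a short Python sketch. The codec names are Python's own; the variable names are only for illustration.

```python
# The single byte 240 (hex F0) means different things in the two ISO sets.
raw = bytes([240])

as_western = raw.decode("latin-1")     # ISO 8859-1: eth
as_celtic = raw.decode("iso8859_14")   # ISO 8859-14: w-circumflex

# In Unicode the two characters have distinct numbers.
print(ord(as_western))  # 240 (U+00F0, eth)
print(ord(as_celtic))   # 373 (U+0175, w-circumflex)

# ISO Western has no way to represent w-circumflex at all.
try:
    as_celtic.encode("latin-1")
    western_can_encode = True
except UnicodeEncodeError:
    western_can_encode = False
print(western_can_encode)  # False
```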

Unicode plays a vital part in ensuring that HESA can correctly represent the characters used throughout the United Kingdom.

Until recently, most computer systems in the UK were configured to use the ISO Western character set. In Welsh-speaking areas, and to a lesser extent in Scottish and Irish Gaelic speaking areas, computer systems may instead have been configured to use the ISO Celtic character set. Using Unicode removes the need to choose between these two incompatible ISO character sets, since Unicode includes every character from both.

UTF-8 is a standard for representing Unicode numbers in computer files. Symbols with a Unicode number from 0 to 127 are represented exactly the same as in ASCII, using one 8-bit byte. This includes all Latin alphabet letters without accents. Provided accented characters and local symbols such as the £ pound sign are not used, files in ASCII, ISO Western, ISO Celtic and UTF-8 will be directly interchangeable (although the first three bytes of a UTF-8 file may be a special header, as discussed below).

Symbols with a Unicode value above 127, which include the £ pound sign and accented letters such as é, are encoded using two or more bytes. This means that files containing these characters will not be compatible with the ASCII, ISO Western or ISO Celtic character sets. Most computer operating systems, most of the time, convert seamlessly between UTF-8 and whichever ISO character set the computer is set up to use. However, there is an exception to this rule which is particularly pertinent to the United Kingdom.
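As a rough illustration of the multi-byte encoding, the Python sketch below encodes a few characters above 127 and shows what happens when UTF-8 bytes are misread as ISO Western. The choice of characters is illustrative only.

```python
# Characters above 127 take two or more bytes in UTF-8.
pound = "\u00a3"    # pound sign
e_acute = "\u00e9"  # e with acute accent
w_circ = "\u0175"   # w-circumflex
euro = "\u20ac"     # Euro symbol

for ch in (pound, e_acute, w_circ, euro):
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(utf8)} bytes: {utf8.hex()}")
# U+00A3 -> 2 bytes: c2a3
# U+00E9 -> 2 bytes: c3a9
# U+0175 -> 2 bytes: c5b5
# U+20AC -> 3 bytes: e282ac

# ISO Western stores the pound sign as one byte, so the files differ.
pound_western = pound.encode("latin-1")
print(pound_western.hex())  # a3

# Misreading the UTF-8 bytes as ISO Western yields two wrong characters.
misread = pound.encode("utf-8").decode("latin-1")
print(len(misread))  # 2
```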

Accented letters and national symbols may be rare or entirely absent in the kinds of data that HESA is interested in (they are usually restricted to proper nouns or financial values), so it can be difficult to tell a UTF-8 file from an ASCII file. It is therefore good practice to prefix a UTF-8 file with three special bytes, called the Byte Order Mark (BOM) header. Viewed as ISO Western, these bytes look like ï»¿ . Viewed as UTF-8, these characters will generally not appear. If you see these strange characters at the start of a file, it is a strong indication that your computer system may not be correctly set up to use Unicode.
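The BOM header can be detected programmatically. The sketch below uses Python's standard `codecs.BOM_UTF8` constant and `utf-8-sig` codec; the helper name `has_utf8_bom` and the sample element are hypothetical.

```python
import codecs

# The UTF-8 BOM header is the three bytes EF BB BF.
print(codecs.BOM_UTF8.hex())  # efbbbf


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)


# Illustrative file content: BOM header followed by a made-up XML element.
data = codecs.BOM_UTF8 + "<Example/>".encode("utf-8")
print(has_utf8_bom(data))  # True

# Read as ISO Western, the BOM shows up as three stray characters.
misread = data.decode("latin-1")
print([hex(ord(c)) for c in misread[:3]])  # ['0xef', '0xbb', '0xbf']

# The utf-8-sig codec removes the BOM when decoding.
clean = data.decode("utf-8-sig")
print(clean)  # <Example/>
```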

The HESA data collection system always outputs its UTF-8 files with BOM headers. It is strongly recommended that institutions use UTF-8 BOM headers in their submitted XML files. For some XML collections, BOM headers may be mandatory; please check the appropriate coding manual.
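One way to produce a file with a BOM header is sketched below, assuming Python is used to write the file. The file name and XML content are invented for illustration and do not follow any HESA schema.

```python
import os
import tempfile

# Hypothetical output location and content, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "example.xml")
content = '<?xml version="1.0" encoding="UTF-8"?>\n<Example>Caf\u00e9</Example>\n'

# The utf-8-sig codec writes the three BOM bytes before the text.
with open(path, "w", encoding="utf-8-sig", newline="\n") as f:
    f.write(content)

# Confirm that the file starts with the BOM header.
with open(path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes.hex())  # efbbbf
```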

If data is submitted to HESA using ISO Western or ISO Celtic, then, depending on the collection, the data may be rejected, or it may be automatically converted to Unicode with UTF-8 (again, please check the appropriate coding manual). This means that it is theoretically possible to send single-byte encoded files but receive back multi-byte files. HESA will never change the data sent, but may change the method used to encode it.
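The effect of such a conversion can be sketched in Python. This is an illustration of the general principle only, not HESA's actual conversion process; the function name and sample text are hypothetical.

```python
def western_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO Western bytes as UTF-8 (hypothetical helper)."""
    return data.decode("latin-1").encode("utf-8")


# Illustrative single-byte input containing a pound sign.
original = "Fee: \u00a39000".encode("latin-1")
converted = western_to_utf8(original)

# The pound sign grows from one byte to two, so the file gets longer.
print(len(original))   # 10
print(len(converted))  # 11

# The text itself is unchanged; only the encoding differs.
same_text = original.decode("latin-1") == converted.decode("utf-8")
print(same_text)  # True
```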

XML files also carry the attribute encoding="UTF-8" in the XML declaration at the top of the file. This is mandatory for all XML collections. Please check the appropriate coding manual.

It is important to note that simply adding the phrase encoding="UTF-8" does not automatically transform a file into a UTF-8 file; this attribute is a belt-and-braces indicator, so that systems reading these files know to expect UTF-8 (even if that expectation is subsequently proved wrong). It is therefore recommended that systems are set to use Unicode with UTF-8, and that they do not rely on this XML encoding attribute alone.
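A file's declared encoding can be checked against its actual bytes before submission. The following Python sketch shows one possible check; the helper name and the sample XML are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative XML that declares UTF-8 and contains an accented letter.
declared = '<?xml version="1.0" encoding="UTF-8"?><Name>Si\u00f4n</Name>'

good = declared.encode("utf-8")   # bytes match the declaration
bad = declared.encode("latin-1")  # same declaration, ISO Western bytes

print(is_valid_utf8(good))  # True
print(is_valid_utf8(bad))   # False
```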
