
4. Common Unicode and UTF-8 issues
The HESA data collection system operates entirely through a web-browser interface. To make full use of the system you need to use one of the following web browsers:
Most browsers can be downloaded free of charge from the internet. You may wish to ask your local IT support team to do this for you.
Users may encounter issues with the data collection system if they are using Microsoft Internet Explorer 6.0 on Windows 2000 to Windows XP. See "Upload Problem with Microsoft Internet Explorer 6.0" for full details.
Files submitted to the data collection system must use specific file formats. These are as follows:
| File format | Applicable data collections |
| XML | Aggregate Offshore record |
| Fixed-length or comma separated ASCII files | DLHE record (until C10018) |
| Web form | Campus Information System |
| Excel spreadsheet | FSR with HE-BCI Survey (FSR and HE-BCI Part B) |
*Note that when using Zip archives you need to ensure that the Zip archive contains only one file.
XML is the eXtensible Markup Language and is a W3C recommendation. There are many XML training resources on the web, including the W3C Tutorial on XML and the resources of W3Schools.
XML files must be encoded with UTF-8 and schema validation will be in place to ensure this. Institutions must specify the encoding used in their XML files in the first line of the file (i.e. <?xml version="1.0" encoding="UTF-8" ?>) and ensure that their files are actually saved with that encoding. If XML files are edited with some text editors and the encoding is not specified or does not match the actual file encoding, there may be problems when submitting these files for validation.
Where element content includes characters with special meaning in XML (such as < less than, > greater than, & ampersand, ' apostrophe and " quotation mark), those characters must be replaced with the appropriate entity reference. Further information is available at: http://www.w3schools.com/xml/xml_syntax.asp.
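The two requirements above (a UTF-8 declaration in the first line, and content actually saved as UTF-8) can be checked before submission. The following is a minimal sketch only; the function name check_xml_encoding and the sample data are illustrative assumptions and are not part of the HESA system, whose own validation may differ.

```python
import re


def check_xml_encoding(raw: bytes) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # Skip a UTF-8 byte order mark, if present, before reading the declaration.
    body = raw[3:] if raw.startswith(b"\xef\xbb\xbf") else raw
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', body)
    if match is None:
        problems.append("no encoding declared in the first line")
    elif match.group(1).upper() != b"UTF-8":
        problems.append("declared encoding is not UTF-8")
    # The declaration alone proves nothing, so also try to decode the bytes.
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file content is not valid UTF-8")
    return problems


document = '<?xml version="1.0" encoding="UTF-8" ?><NAME>Siân</NAME>'
good = document.encode("utf-8")      # saved as declared
bad = document.encode("iso8859_1")   # declares UTF-8 but saved as ISO Western

print(check_xml_encoding(good))  # []
print(check_xml_encoding(bad))   # ['file content is not valid UTF-8']
```

Note that a file containing only unaccented ASCII characters passes the decoding check whichever of these encodings it was saved in, because the bytes are identical.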
| Character Name | Entity Reference | Character Reference |
| Ampersand | &amp; | &#38; |
| Left angle bracket | &lt; | &#60; |
| Right angle bracket | &gt; | &#62; |
| Straight quotation mark | &quot; | &#34; |
| Apostrophe | &apos; | &#39; |
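When XML is generated by a program, the replacement of these five characters can be automated. The sketch below uses Python's standard xml.sax.saxutils.escape function; the helper name escape_xml_text and the sample value are illustrative assumptions.

```python
from xml.sax.saxutils import escape

# escape() replaces &, < and > by default; this mapping adds the two quote characters.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_xml_text(value: str) -> str:
    """Replace the five special XML characters with their entity references."""
    return escape(value, QUOTES)


print(escape_xml_text("Marks & Spencer's <\"Simply Food\">"))
# Marks &amp; Spencer&apos;s &lt;&quot;Simply Food&quot;&gt;
```

The ampersand is replaced first, so the ampersands introduced by the other entity references are not escaped a second time.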
Institutions are advised to compile their XML files in the following format:
<COURSEID>ENGLISH01</COURSEID>
And not:
<COURSEID>
ENGLISH01
</COURSEID>
Including additional formatting line breaks between an element's tags and its data may result in a schema error, because the file is read as containing whitespace, which may not comply with the required data type.
Unicode is a character set (list of letters, digits, punctuation marks and other writing symbols) which covers almost all writing systems currently in use worldwide, and also many historic systems of significant academic interest. It covers not only the Latin A-Z alphabet with international accents (such as é and á) and local symbols (such as the £ Sterling sign or € Euro symbol), but also covers non-Latin scripts such as Greek, Hebrew, Arabic and Chinese. Each symbol is given a number.
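The effect of the extra line breaks can be seen by parsing both layouts. This is an illustrative sketch using Python's standard ElementTree parser; whether a particular value is then rejected depends on the data type defined in the schema for that element.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<COURSEID>ENGLISH01</COURSEID>")
spread = ET.fromstring("<COURSEID>\nENGLISH01\n</COURSEID>")

# repr() makes the line breaks visible as \n.
print(repr(compact.text))  # 'ENGLISH01'
print(repr(spread.text))   # '\nENGLISH01\n'
```

In the second layout the line breaks are part of the element's value, so the value is no longer the same string as the bare course identifier.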
The first 128 symbols of Unicode are identical to the older ASCII character set standard, which is also identical to the first half of both the ISO Western (ISO 8859-1) and ISO Celtic (ISO 8859-14) character sets. This means that, provided only unaccented Latin alphabet letters are used and local symbols such as the £ pound sign are avoided, symbols in ASCII, ISO Western, ISO Celtic and Unicode will all have the same number. For example, uppercase A has the number 65 and lowercase z has the number 122.
The Welsh character ŵ (w-circumflex) is included in Unicode and ISO Celtic but is included in neither ASCII nor ISO Western. Conversely, the Scandinavian character ð (eth) is included in Unicode and ISO Western but in neither ASCII nor ISO Celtic. To make matters more difficult, these two characters are assigned the same number in their respective ISO character sets. Using Unicode solves this problem of ISO character set incompatibility, since these two characters are assigned different numbers in Unicode.
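The clash between the two ISO character sets can be demonstrated with Python's built-in codecs (named iso8859_1 and iso8859_14 in the Python standard library). This is an illustrative sketch only.

```python
raw = b"\xf0"  # a single byte with the value 240

# The same byte is a different character in each ISO character set.
print(raw.decode("iso8859_1"))   # ð (ISO Western)
print(raw.decode("iso8859_14"))  # ŵ (ISO Celtic)

# In Unicode the two characters have different numbers.
print(ord("ð"), ord("ŵ"))  # 240 373

# Unaccented letters have the same number, and the same byte, everywhere.
print(ord("A"), ord("z"))  # 65 122
for codec in ("ascii", "iso8859_1", "iso8859_14", "utf-8"):
    assert "Az".encode(codec) == b"Az"
```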
Unicode plays a vital part in ensuring that HESA can support the correct representation of signs used throughout the United Kingdom.
Until recently, most computer systems in the UK were configured to use the ISO Western character set. In Welsh-speaking areas, and to a lesser extent in Scottish and Irish Gaelic speaking areas, computer systems may instead have been configured to use the ISO Celtic character set. Using Unicode removes the need to choose between these two incompatible ISO character sets, since Unicode includes the characters of both.
UTF-8 is a standard for representing Unicode numbers in computer files. Symbols with a Unicode number from 0 to 127 are represented exactly the same as in ASCII, using one 8-bit byte. This includes all Latin alphabet letters without accents. Provided accented characters and local symbols such as the £ pound sign are not used, files in ASCII, ISO Western, ISO Celtic and UTF-8 will be directly interchangeable (although the first three bytes of the file may be a special UTF-8 header, as discussed below).
Symbols which have a Unicode value above 127, which include the £ pound sign and accented letters such as é, are encoded using two or more bytes. This means that files with these characters will not be compatible with the ASCII, ISO Western or ISO Celtic character sets. Most computer operating systems, most of the time, convert seamlessly between UTF-8 and whichever ISO character set the computer is set up to use. However, there is an exception to this rule which is particularly pertinent to the United Kingdom.
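The number of bytes UTF-8 uses for a given symbol can be inspected directly. The sketch below is illustrative; the sample characters are chosen from those mentioned in this section.

```python
for ch in ("A", "£", "é", "ŵ", "€"):
    encoded = ch.encode("utf-8")
    # Columns: symbol, Unicode number, number of UTF-8 bytes, bytes in hexadecimal.
    print(ch, ord(ch), len(encoded), encoded.hex())

# A 65 1 41
# £ 163 2 c2a3
# é 233 2 c3a9
# ŵ 373 2 c5b5
# € 8364 3 e282ac
```

Only the first symbol, with a number below 128, is stored as a single byte identical to its ASCII representation.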
Because accented letters and national symbols can be rare or entirely absent in the kinds of data that HESA is interested in (they are usually restricted to proper nouns or financial values), it can be difficult to tell a UTF-8 file from an ASCII file. It is therefore good practice to prefix a UTF-8 file with three special bytes, called the Byte Order Mark header (BOM header). Viewed in ISO Western, these bytes look like ï»¿. Viewed in a Unicode-aware application, these characters will generally not appear. If you see these strange characters at the start of a file, it is a strong indication that your computer system may not be correctly set up to use Unicode.
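The three BOM bytes, and how they appear when misread as ISO Western, can be shown with Python's standard codecs. In this sketch the helper name has_utf8_bom and the sample data are illustrative assumptions; the utf-8-sig codec is Python's name for UTF-8 with a BOM header.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Report whether the data starts with the three-byte UTF-8 BOM header."""
    return raw.startswith(codecs.BOM_UTF8)


# Encoding with utf-8-sig writes the BOM header before the data.
data = "<NAME>Siân</NAME>".encode("utf-8-sig")

print(has_utf8_bom(data))            # True
print(data[:3].hex())                # efbbbf
print(data[:3].decode("iso8859_1"))  # ï»¿
print(data.decode("utf-8-sig"))      # <NAME>Siân</NAME>
```

Decoding with utf-8-sig removes the header again, so the BOM does not become part of the data.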
The HESA data collection system always outputs its UTF-8 files with BOM headers. It is strongly recommended that institutions use UTF-8 BOM headers in their submitted XML files. For some XML collections, BOM headers may be mandatory; please check the appropriate coding manual.
If data is submitted to HESA using ISO Western or ISO Celtic, then, depending on the collection, the data may be rejected, or it may be automatically converted to Unicode with UTF-8 (again, please check the appropriate coding manual). This means that it is theoretically possible to send single-byte encoded files but receive back multi-byte files. HESA will never change the data sent, but may change the method used to encode it.
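The change from a single-byte to a multi-byte file can be illustrated as follows. This is a sketch of the general idea only and does not describe HESA's own conversion process; the function name to_utf8_with_bom and the sample data are assumptions.

```python
def to_utf8_with_bom(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode single-byte data as UTF-8 with a BOM header."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8-sig")


original = "Fee: £9000".encode("iso8859_1")  # 10 characters, 10 bytes
converted = to_utf8_with_bom(original, "iso8859_1")

print(len(original), len(converted))  # 10 14
# The data is unchanged; only the method used to encode it differs.
print(converted.decode("utf-8-sig") == original.decode("iso8859_1"))  # True
```

The converted file is four bytes longer: three for the BOM header and one because the £ sign now occupies two bytes.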
XML files also carry the high-level attribute encoding="UTF-8" in their first line. This is mandatory for all XML collections; please check the appropriate coding manual.
It is important to note that simply adding the phrase encoding="UTF-8" does not automatically transform a file into a UTF-8 file; this attribute is a belt-and-braces indicator, so that systems reading these files know to expect UTF-8 (even if that expectation is subsequently proved wrong). It is therefore recommended that systems are set to use Unicode with UTF-8, rather than relying on this XML encoding attribute.
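The difference between declaring UTF-8 and actually saving as UTF-8 can be sketched as follows. The file names and the helper is_valid_utf8 are illustrative assumptions; the files are written to a temporary directory.

```python
import os
import tempfile

DOC = '<?xml version="1.0" encoding="UTF-8" ?>\n<NAME>Siân</NAME>\n'


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    wrong = os.path.join(tmp, "wrong.xml")
    right = os.path.join(tmp, "right.xml")
    # Same text and same declaration, but saved as ISO Western.
    with open(wrong, "w", encoding="iso8859_1", newline="\n") as f:
        f.write(DOC)
    # Saved as UTF-8 with a BOM header, matching the declaration.
    with open(right, "w", encoding="utf-8-sig", newline="\n") as f:
        f.write(DOC)
    with open(wrong, "rb") as f:
        wrong_ok = is_valid_utf8(f.read())
    with open(right, "rb") as f:
        right_ok = is_valid_utf8(f.read())

print(wrong_ok, right_ok)  # False True
```

Both files contain the same declaration, but only the one written with a UTF-8 encoding is a UTF-8 file.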