
Technical formats

 


1. System formats

2. File formats

3. XML files

4. Common Unicode and UTF-8 issues

1. System formats

The HESA data collection system operates entirely through a web-browser interface. To make full use of the system you need to use one of the following web browsers:

Most browsers can be downloaded free of charge from the internet. You may wish to ask your local IT support team to do this for you.

Users may encounter issues with the data collection system if they are using Microsoft Internet Explorer 6.0 on Windows 2000 or Windows XP. See "Upload Problem with Microsoft Internet Explorer 6.0" for full details.

2. File formats

Files submitted to the data collection system must use specific formats. These are as follows:

File format                                   Applicable data collections

XML                                           Aggregate Offshore record
                                              ITT In-Year record
                                              Student record
                                              KIS record
                                              DLHE record (from C11018)
                                              Staff record (from C12025)

Fixed-length or comma separated ASCII files   DLHE record (until C10018)
                                              Staff record (until C11025)

Web form                                      Campus Information System
                                              EMS record
                                              FSR with HE-BCI Survey (HE-BCI Part A)

Excel spreadsheet                             FSR with HE-BCI Survey (FSR and HE-BCI Part B)


Tips:

  • The file can have any name.
  • Files can be compressed using PKZip/WinZip. Compression can reduce the upload time significantly*.
  • Complete data can be sent in one or more files (all files must pass validation).

*Note that when using Zip archives you need to ensure that the Zip archive contains only one file.
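The compression tip and the single-file rule above can be sketched in Python. This is only an illustration: the helper names `zip_single_file` and `count_files` and the member name `student.xml` are hypothetical, and the archive is built in memory rather than taken from any HESA tool.

```python
import io
import zipfile


def zip_single_file(member_name: str, data: bytes) -> bytes:
    """Return the bytes of a Zip archive that holds exactly one compressed file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, data)
    return buffer.getvalue()


def count_files(zip_bytes: bytes) -> int:
    """Count the files (ignoring directory entries) inside a Zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


# Hypothetical payload: repetitive XML compresses well.
payload = b"<COURSEID>ENGLISH01</COURSEID>\n" * 1000
archive_bytes = zip_single_file("student.xml", payload)

print(count_files(archive_bytes) == 1)    # the archive holds a single file
print(len(archive_bytes) < len(payload))  # the archive is smaller than the raw data
```

Running a check like `count_files` before upload is one way to catch an archive that accidentally contains more than one file.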

3. XML files

XML is the eXtensible Markup Language and is a W3C recommendation. There are many XML training resources on the web, including the W3C Tutorial on XML and the resources of W3Schools.

XML files must be encoded with UTF-8, and schema validation will be in place to ensure this. Institutions must specify the encoding used in their XML files in the first line of the file (i.e. <?xml version="1.0" encoding="UTF-8" ?>) and must ensure that their files are actually saved with that encoding. If an XML file is edited in a text editor and the encoding is not specified, or does not match the actual file encoding, there may be problems when the file is submitted for validation.
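As a minimal sketch of the two requirements above, the following Python writes the declaration as the first line and encodes the whole file as UTF-8 in one step, so the declared and actual encodings cannot drift apart. The helper names `build_xml_bytes` and `declared_encoding` and the sample element are assumptions made for illustration.

```python
import re

DECLARATION = '<?xml version="1.0" encoding="UTF-8" ?>'


def build_xml_bytes(body: str) -> bytes:
    """Prefix the body with the XML declaration and encode everything as UTF-8."""
    return (DECLARATION + "\n" + body).encode("utf-8")


def declared_encoding(raw: bytes):
    """Return the encoding named on the first line of the file, or None if absent."""
    first_line = raw.split(b"\n", 1)[0].decode("ascii", errors="replace")
    match = re.search(r'encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', first_line)
    return match.group(1) if match else None


raw = build_xml_bytes("<NAME>Siân</NAME>")
print(declared_encoding(raw))     # UTF-8
print(raw.decode("utf-8")[-17:])  # <NAME>Siân</NAME>
```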

Special characters in XML

Characters with special meaning in XML (such as < less than, > greater than, & ampersand, ' apostrophe and " quotation mark) must be replaced with the appropriate entity reference wherever they appear in element data. Further information is available at: http://www.w3schools.com/xml/xml_syntax.asp.

Character name            Entity reference    Character reference

Ampersand                 &amp;               &
Left angle bracket        &lt;                <
Right angle bracket       &gt;                >
Straight quotation mark   &quot;              "
Apostrophe                &apos;              '
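The replacements in the table above can be applied programmatically. The sketch below uses `escape` from Python's standard `xml.sax.saxutils` module, which handles &, < and > by default; the two quotation characters are passed in as extra entities. The wrapper name `escape_xml_text` and the sample strings are illustrative assumptions.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > itself; quotes are supplied as extra replacements.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_xml_text(value: str) -> str:
    """Replace the five XML special characters with their entity references."""
    return escape(value, QUOTE_ENTITIES)


print(escape_xml_text("Marks & Spencer's"))  # Marks &amp; Spencer&apos;s
print(escape_xml_text('a < b > "c"'))        # a &lt; b &gt; &quot;c&quot;
```

The ampersand is replaced first, so the ampersands introduced by the other entity references are not escaped a second time.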

File format

Institutions are advised to compile their XML files in the following format:

<COURSEID>ENGLISH01</COURSEID>

And not:

<COURSEID>

ENGLISH01

</COURSEID>

Additional line breaks between an element's tags and its data may result in a schema error, because the line breaks are read as whitespace within the value, which may then not comply with the required data type.
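To see why the second layout causes trouble, the short Python sketch below parses both forms with the standard `xml.etree.ElementTree` module and prints the element text exactly as a parser receives it. The helper name `element_text` is hypothetical, and the schema check itself is not reproduced here.

```python
import xml.etree.ElementTree as ET


def element_text(xml_fragment: str) -> str:
    """Return the text of the root element exactly as an XML parser reads it."""
    return ET.fromstring(xml_fragment).text or ""


compact = element_text("<COURSEID>ENGLISH01</COURSEID>")
spread = element_text("<COURSEID>\n\nENGLISH01\n\n</COURSEID>")

print(repr(compact))              # 'ENGLISH01'
print(repr(spread))               # '\n\nENGLISH01\n\n'
print(len(compact), len(spread))  # 9 13
```

The line breaks become part of the value, so a field that must match a fixed pattern or length no longer does.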

4. Common Unicode and UTF-8 issues

Unicode is a character set (a list of letters, digits, punctuation marks and other writing symbols) which covers almost all writing systems currently in use worldwide, and also many historic systems of significant academic interest. It covers not only the Latin A-Z alphabet with international accents (such as é and á) and local symbols (such as the £ Sterling sign or € Euro symbol), but also non-Latin scripts such as Greek, Hebrew, Arabic and Chinese. Each symbol is given a number.

The first 128 symbols of Unicode are identical to the older ASCII character set standard, which is also identical to the first half of both the ISO Western (ISO 8859-1) and ISO Celtic (ISO 8859-14) character sets. This means that, provided only unaccented Latin alphabet letters are used and local symbols such as the £ pound sign are avoided, symbols in ASCII, ISO Western, ISO Celtic and Unicode all have the same number. For example, uppercase A has the number 65 and lowercase z has the number 122.
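The shared numbering can be checked directly. The Python sketch below prints the numbers for A and z and confirms that the same text produces identical bytes in all four encodings; the codec names are those of Python's standard library, and the variable names are illustrative.

```python
text = "Az"

print(ord("A"), ord("z"))  # 65 122

# Python codec names for ASCII, ISO Western, ISO Celtic and UTF-8.
codec_names = ["ascii", "latin_1", "iso8859_14", "utf_8"]
encoded = {name: text.encode(name) for name in codec_names}

# Every encoding yields the same two bytes, 65 and 122.
print(all(value == bytes([65, 122]) for value in encoded.values()))  # True
```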

The Welsh character ŵ (w-circumflex) is included in Unicode and ISO Celtic but not in ASCII or ISO Western. Conversely, the Scandinavian character ð (eth) is included in Unicode and ISO Western but not in ASCII or ISO Celtic. To make matters more difficult, these two characters are assigned the same number in their respective ISO character sets. Using Unicode solves this problem of ISO character set incompatibility, since the two characters are assigned different numbers in Unicode.
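The clash can be demonstrated with Python's standard codecs, in which both characters occupy position 240 (hexadecimal F0) of their respective ISO character sets. This is a small illustration rather than part of any HESA tooling, and the variable names are arbitrary.

```python
shared_byte = bytes([0xF0])  # one byte, value 240

as_celtic = shared_byte.decode("iso8859_14")  # ISO Celtic reading
as_western = shared_byte.decode("latin_1")    # ISO Western (ISO 8859-1) reading

print(as_celtic, as_western)            # ŵ ð
print(ord(as_celtic), ord(as_western))  # 373 240 (distinct Unicode numbers)
```

The same stored byte therefore displays as a different letter depending on which ISO character set the reading system assumes, whereas the two Unicode numbers never collide.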

Unicode plays a vital part in ensuring that HESA can support the correct representation of signs used throughout the United Kingdom.

Until recently, most computer systems in the UK were configured to use the ISO Western character set. In Welsh-speaking areas, and to a lesser extent in Scottish and Irish Gaelic speaking areas, computer systems may instead have been configured to use the ISO Celtic character set. Using Unicode removes the need to choose between these two incompatible ISO character sets, since Unicode includes the characters of both.

UTF-8 is a standard for representing Unicode numbers in computer files. Symbols with a Unicode number from 0 to 127 are represented exactly as in ASCII, using one 8-bit byte. This includes all Latin alphabet letters without accents. Provided that accented characters and local symbols such as the £ pound sign are not used, files in ASCII, ISO Western, ISO Celtic and UTF-8 are directly interchangeable (although the first three bytes of a UTF-8 file may be a special header, as discussed below).

Symbols with a Unicode value above 127, which include the £ pound sign and accented letters such as é, are encoded using two or more bytes. Files containing these characters are therefore not compatible with the ASCII, ISO Western or ISO Celtic character sets. Most of the time, computer operating systems convert seamlessly between UTF-8 and whichever ISO character set the computer is set up to use. However, there is an exception to this rule which is particularly pertinent to the United Kingdom.
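The byte counts can be illustrated with a few of the symbols mentioned in this section. In the Python sketch below, the helper name `utf8_length` is hypothetical; the final line shows what happens when UTF-8 bytes are wrongly read as ISO Western.

```python
def utf8_length(symbol: str) -> int:
    """Return the number of bytes UTF-8 uses to store one symbol."""
    return len(symbol.encode("utf-8"))


for symbol in ["A", "£", "é", "ŵ", "€"]:
    print(symbol, utf8_length(symbol))
# A 1
# £ 2
# é 2
# ŵ 2
# € 3

# UTF-8 bytes for the pound sign, misread as ISO Western:
print("£".encode("utf-8").decode("latin_1"))  # Â£
```

A stray Â in front of a pound sign is therefore a common symptom of a UTF-8 file being displayed as ISO Western.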

Accented letters and national symbols can be rare or entirely absent in the kinds of data that HESA is interested in (they are usually restricted to proper nouns or financial values), so it can be difficult to tell a UTF-8 file from an ASCII file. It is therefore good practice to prefix a UTF-8 file with three special bytes, called the Byte Order Mark header (BOM header). Viewed in ISO Western, these bytes look like ï»¿. Viewed in Unicode, they will generally not appear. If you see these strange characters at the start of a file, it is a strong indication that your computer system may not be correctly set up to use Unicode.

The HESA data collection system always outputs its UTF-8 files with BOM headers. It is strongly recommended that institutions use UTF-8 BOM headers in their submitted XML files. For some XML collections, BOM headers may be mandatory; please check the appropriate coding manual.
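A minimal Python sketch of working with the BOM header follows. It relies on the standard library constant `codecs.BOM_UTF8` and the `utf-8-sig` codec; the helper names `has_bom` and `add_bom` and the sample content are assumptions made for illustration.

```python
import codecs


def has_bom(raw: bytes) -> bool:
    """Report whether the data starts with the three-byte UTF-8 BOM header."""
    return raw.startswith(codecs.BOM_UTF8)


def add_bom(raw: bytes) -> bytes:
    """Prefix UTF-8 data with a BOM header unless one is already present."""
    return raw if has_bom(raw) else codecs.BOM_UTF8 + raw


xml_bytes = '<?xml version="1.0" encoding="UTF-8" ?>\n<FEE>£9000</FEE>'.encode("utf-8")
with_bom = add_bom(xml_bytes)

print(list(with_bom[:3]))                  # [239, 187, 191]
print(with_bom[:3].decode("latin_1"))      # ï»¿ (the header as seen in ISO Western)
print(with_bom.decode("utf-8-sig")[:5])    # <?xml (the header is dropped on reading)
```

In Python, writing text with the `utf-8-sig` codec adds the same header automatically, and reading with it removes the header if present.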

If data is submitted to HESA using ISO Western or ISO Celtic, then, depending on the collection, the data may be rejected, or it may be automatically converted to Unicode with UTF-8 (again, please check the appropriate coding manual). This means that it is theoretically possible to send single-byte encoded files but receive back multi-byte files. HESA will never change the data sent, but may change the method used to encode it.

XML files also have the high-level attribute encoding="UTF-8". This attribute is mandatory for all XML collections; please check the appropriate coding manual.

It is important to note that simply adding the phrase encoding="UTF-8" does not automatically transform a file into a UTF-8 file; this attribute is a belt-and-braces indicator, so that systems reading these files know to expect UTF-8 (even if that expectation is subsequently proved wrong). It is therefore recommended that systems are set to use Unicode with UTF-8, rather than relying on this XML encoding attribute.
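The point that the attribute alone does not change the file can be shown with a short Python sketch. The same text, carrying the same encoding="UTF-8" attribute, is saved once as ISO Western and once as UTF-8, and only the second is readable as UTF-8. The helper name `is_valid_utf8` and the sample content are hypothetical, and this check is not HESA's validation.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes can actually be read as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = '<?xml version="1.0" encoding="UTF-8" ?>\n<FEE>£9000</FEE>'

saved_as_iso_western = text.encode("latin_1")  # the attribute says UTF-8, the bytes do not
saved_as_utf8 = text.encode("utf-8")

print(is_valid_utf8(saved_as_iso_western))  # False
print(is_valid_utf8(saved_as_utf8))         # True
```

Note that a file containing only unaccented Latin letters passes this check under either encoding, which is why the check only becomes informative once symbols such as £ or ŵ appear.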