
Processing error

Processing error covers errors introduced during data capture, coding, editing and tabulation of the data. This section describes the processes used, and the quality assurance apparatus employed, to avoid bias in processing and to limit variance. We also cover the issues that have arisen and our estimates of their impact.

HESA’s processing practices and quality assurance approach are explained in the Survey methodology section on data processing.[1] It covers data capture, data quality checking, SIC/SOC data coding (where HESA employs a specialist contractor), free text field ‘cleaning’, and derived fields.

SIC and SOC coding

SIC and SOC codes are applied wherever we have sufficient data to allow this. The data processing section of the Survey methodology explains this further.[2] An experienced external supplier (Oblong[3]) undertakes this coding, and the quality checks they apply are explained in the Survey methodology. The established SIC-coding methodology has proved stable over the long term.[4] A new method had to be developed for SOC coding. Provisional SOC codes were processed by Oblong using an agreed method. These were then supplied to HE providers (through the Portal), which were invited to quality assure the data for themselves. During the first year of operation this was a semi-structured quality assurance process that relied on the varying resource providers were able to bring. Although we received feedback from only a sub-set of providers, any changes to SOC coding resulting from this feedback were applied consistently across the entire collection. Since the second year of operations, the process has been streamlined and simplified.

All the provider feedback received is placed into one of the following four categories: Systemic (where the error is widespread and there is a clear pattern of miscoding); Non-systemic (isolated cases); Inconsistent (where multiple records in an occupation group are coded inconsistently with no obvious pattern) or Not actionable (no basis or evidence exists for coding to be changed).

This categorisation helped us identify potential processing issues that affected some records across the entire dataset. Non-systemic issues could not be used to improve individual-level data, as this would have been inequitable and would have introduced bias through inconsistent application. The exercise revealed some systemic errors in SOC coding, and also scrutinised some areas where the coding ultimately met our quality standards. An overview of this process can be found in the data processing section of the operational survey information.[5] Further information on the exercise undertaken to review feedback and improve the data processing approach is available in a detailed briefing, which sets out the impact of the issues identified.[6] The briefing also describes additional internal checks, and their outcomes, which were carried out independently on the entire dataset.

The results from this year’s assessment highlighted a continued reduction in the number of issues identified as a result of provider feedback. In year one, 66 issues were identified as either inconsistent or systemic, reducing to 42 in year two, 40 in year three and only 10 in year four. The number of systemic issues identified each year has reduced even further, with only two identified this year as a result of provider feedback. As a result of this comprehensive checking exercise, we believe the sources of systematic processing error identified by HE provider and manual quality checks have been removed, and the processing system fixed. There is no evidence of any remaining bias in the coding strategy for SOC, and any remaining processing error in year four data is likely to be minimal, and the product of random variation only.
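The scale of the reduction can be made explicit with a short calculation. The sketch below uses only the four issue counts quoted above; the function and variable names are illustrative and are not part of any HESA tooling.

```python
# Issues classed as inconsistent or systemic, by survey year (figures quoted in the text).
issues_by_year = {1: 66, 2: 42, 3: 40, 4: 10}


def percentage_change(previous: int, current: int) -> float:
    """Return the percentage change from previous to current (negative means a reduction)."""
    return (current - previous) / previous * 100


years = sorted(issues_by_year)
# Compare each year with the one before it.
for prev, curr in zip(years, years[1:]):
    change = percentage_change(issues_by_year[prev], issues_by_year[curr])
    print(f"Year {prev} to year {curr}: {change:+.1f}%")

# Overall change between the first and the fourth year.
overall = percentage_change(issues_by_year[1], issues_by_year[4])
print(f"Year 1 to year 4: {overall:+.1f}%")
```

On these figures, the number of inconsistent or systemic issues fell by roughly 85% between year one and year four, with most of the fall occurring in the first and the latest year.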

During the second year of surveying we also conducted research into the reliability of our approach to coding, using established methods for this. In addition to the report on internal quality assurance work, on 29 April 2021 we published a second report detailing this independent verification of the reliability of our approach.[7] An exercise was carried out to compare codes returned by the primary coder for Graduate Outcomes with those returned by an independent organisation to validate HESA’s approach to coding and the outputs that follow. Independent coding of occupations by the Office for National Statistics found ‘almost perfect’ alignment between coders at the major-group level.
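The phrase ‘almost perfect’ is the label conventionally attached to the top band of the Landis and Koch scale for Cohen's kappa, a chance-corrected measure of agreement between two coders. The sketch below shows how such a statistic can be computed. It assumes a kappa-style measure purely for illustration: the published verification report remains the authority on the method actually used, and the codes in the example are invented, not survey data.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders over the same records."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same, non-empty set of records")
    n = len(codes_a)
    # Proportion of records on which the two coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's marginal distribution.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both coders used a single identical code throughout.
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Map a kappa value to the descriptive bands of Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Invented SOC major-group codes for ten records (illustration only).
primary_codes = [1, 2, 2, 3, 2, 1, 4, 2, 3, 2]
independent_codes = [1, 2, 2, 3, 2, 1, 4, 2, 2, 2]

kappa = cohens_kappa(primary_codes, independent_codes)
print(f"kappa = {kappa:.2f} ({landis_koch_band(kappa)})")
```

Comparing at the major-group level, as in the exercise described above, means that two codes count as agreeing whenever they fall in the same major group, even if they differ at finer levels of the classification.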

Handling free text responses

Most questions in Graduate Outcomes map directly to established lists of values, and details of these are available in the coding manual.[8] However, there is sometimes an “Other” option that permits a free text response. In this subsection and the subsequent ones, we cover the most important issues relating to free text processing, and explain the risks around processing error, giving our estimates for this.

At the end of the collection process, data returned for questions that permit a free-text response goes through a cleansing process, in order to improve data quality. This is usually where the respondent has not chosen a value from the drop-down list provided but has instead selected “other” and typed their own answer.

This cleansing process is undertaken for the town, city or area of employment or self-employment / running own business. Prior to the removal of free text boxes from the survey, information relating to home country, country of further study, employment and self-employment / running own business, and salary currency was also cleansed in a similar way. Where possible, the free text is mapped to an appropriate value from a dictionary published within the appropriate derived field specification.
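As a rough illustration of dictionary-based cleansing, the sketch below normalises a free text response and looks it up in a small mapping. The dictionary entries, the normalisation rules and the function names are all assumptions made for the example. The dictionaries actually used are those published in the derived field specifications.

```python
import re
import unicodedata

# Hypothetical extract of a cleansing dictionary: normalised free text -> published value.
AREA_DICTIONARY = {
    "cheltenham": "Gloucestershire",
    "gloucester": "Gloucestershire",
    "leeds": "Leeds",
    "city of leeds": "Leeds",
}


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    return " ".join(letters_only.split())


def map_free_text(text: str, dictionary=AREA_DICTIONARY):
    """Return the dictionary value for a response, or None if it cannot be mapped."""
    return dictionary.get(normalise(text))


print(map_free_text("  Cheltenham, "))   # mapped via the normalised key "cheltenham"
print(map_free_text("City-of-Leeds"))    # punctuation is treated as a separator
print(map_free_text("somewhere north"))  # unmapped responses return None
```

Responses that return no match would need a fallback, such as manual review or mapping at a coarser geographic level.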

We have encountered some specific issues in the processing of UK-based location information, which we turn to next. Later subsections offer comparable quality descriptions of cleansing of further study and home country data.

Location of work data – handling free text

Location of work is collected from graduates who are in paid work for an employer, voluntary or unpaid work. Respondents in employment are asked to tell us where they worked during the census week.[9] From 2020/21, a drop-down list was introduced to reduce the amount of free text data for cleansing. The majority of respondents supplied data that we could process into a structured format, such as their employer’s postcode[10] or an area name from the drop-down list. Where both a valid full or outward postcode and area information have been supplied, the postcode information is used in priority for mapping the data to a county / unitary authority.
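The priority rule described here (use a valid full or outward postcode first, and fall back to area information otherwise) can be sketched as follows. The regular expressions are a simplified approximation of UK postcode formats, and the two lookup tables and all function names are hypothetical. This is not HESA's published algorithm, which is specified in the derived field documentation.

```python
import re

# Simplified patterns: outward code (e.g. "GL50"), optionally followed by inward code ("1AB").
FULL_POSTCODE = re.compile(r"^([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})$")
OUTWARD_ONLY = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]?$")

# Hypothetical lookup tables, for illustration only.
OUTWARD_TO_AUTHORITY = {"GL50": "Gloucestershire", "LS1": "Leeds"}
AREA_TO_AUTHORITY = {"cheltenham": "Gloucestershire", "leeds": "Leeds"}


def outward_code(raw):
    """Return the outward part of a full or outward-only postcode, else None."""
    if not raw:
        return None
    text = raw.strip().upper()
    match = FULL_POSTCODE.match(text)
    if match:
        return match.group(1)
    return text if OUTWARD_ONLY.match(text) else None


def resolve_authority(postcode, area):
    """Map a response to a county / unitary authority, preferring postcode over area."""
    code = outward_code(postcode)
    if code in OUTWARD_TO_AUTHORITY:
        return OUTWARD_TO_AUTHORITY[code], "postcode"
    if area and area.strip().lower() in AREA_TO_AUTHORITY:
        return AREA_TO_AUTHORITY[area.strip().lower()], "area"
    return None, "unmapped"


# The postcode wins even though the area points elsewhere.
print(resolve_authority("gl50 1ab", "Leeds"))
# With no usable postcode, the area information is used instead.
print(resolve_authority(None, "Leeds"))
```

Returning the source of the match alongside the result makes it possible to tabulate how each record was mapped.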

From 2019/20, free text boxes relating to home country, country of further study, employment and self-employment / running own business and salary currency were removed due to low usage. From 2020/21, free text boxes relating to provider of further and previous study were also removed. 

Across all years, around 7% of those graduates in work during the census week did not provide any location information. These graduates are excluded from the table below.

In 2020/21, 0.8% of graduates who indicated the country in which they were employed did not provide any additional postcode or free text location information (between 0.5% and 0.6% in earlier years). Although this is difficult to identify precisely, across the years around 0.8% provided free text indicating that they refused, did not want to or were unable to provide more detailed location information, or that they were working remotely or at various locations. In 2020/21, with the introduction of the drop-down list, this figure dropped to around 0.6%.

Location of self-employment or own business is collected from graduates who are in self-employment or running their own business during the census week. In 2020/21, of those graduates in self-employment or running their own business during the census week, 10% (9% in 2019/20 and 2018/19; 10% in 2017/18) did not provide any location information. These graduates are also excluded from the table below.

HESA has developed an algorithm for processing free text information, combining it with information collected through drop-down menus, and mapping postcodes to counties / unitary authorities and regions.

The processing of free text information relating to UK location of work is complex and two-fold. The first iteration was based on the processing fields used to clean area (ZEMPAREA[11], ZBUSAREA[12]) and postcode (ZEMPPCODE, ZBUSPCODE) information. Cleaned postcode information was mapped to county/unitary authority or region and combined with the cleaned area information.

With the enhancement of the derived field mapping process, a large majority of graduates who provided some UK location information could be mapped to county / unitary authority level (derived in XEMPLOCUC / XBUSLOCUC). The matching process is specified in more detail within the derived field documents[10] for XEMPLOCUC, XEMPLOCGR, XBUSLOCUC and XBUSLOCGR.

As a result, from year two, data has been released at a more granular geographic resolution. Users of microdata will also notice improvements in geographical resolution and should assess data quality for uses below regional level. Improving geographical resolution further remains a priority, as we are aware of strong user demand for high-resolution place-based analysis.

We continue to look to make improvements to the survey instruments and also to the algorithmic approach we utilise in data processing.

Table 1: Location of work, self-employment or own business data - processing free-text responses

| Employment in the UK | 2017/18 | 2017/18 (%) | 2018/19 | 2018/19 (%) | 2019/20 | 2019/20 (%) | 2020/21 | 2020/21 (%) |
|---|---|---|---|---|---|---|---|---|
| Postcode information | 144450 | 58.9% | 156570 | 62.7% | 158100 | 64.8% | 176590 | 69.4% |
| Item selected from drop-down | N/A | N/A | N/A | N/A | N/A | N/A | 71475 | 28.1% |
| Free text mapped to county / unitary authority | 91765 | 37.4% | 84885 | 34.0% | 79030 | 32.4% | 2480 | 1.0% |
| Free text mapped to Government office region (England only) | 785 | 0.3% | 650 | 0.3% | 105 | 0.0% | 25 | 0.0% |
| Mapped to country | 8315 | 3.4% | 7560 | 3.0% | 6780 | 2.8% | 3765 | 1.5% |
| Total with location info | 245315 | 100.0% | 249670 | 100.0% | 244015 | 100.0% | 254330 | 100.0% |

| Self-employment / own business in the UK | 2017/18 | 2017/18 (%) | 2018/19 | 2018/19 (%) | 2019/20 | 2019/20 (%) | 2020/21 | 2020/21 (%) |
|---|---|---|---|---|---|---|---|---|
| Postcode information | 15690 | 54.8% | 18350 | 60.2% | 18675 | 61.6% | 19415 | 64.0% |
| Item selected from drop-down | N/A | N/A | N/A | N/A | N/A | N/A | 9750 | 32.2% |
| Free text mapped to county / unitary authority | 11515 | 40.2% | 10765 | 35.3% | 10335 | 34.1% | 400 | 1.3% |
| Free text mapped to Government office region (England only) | 130 | 0.4% | 105 | 0.3% | 20 | 0.1% | 5 | 0.0% |
| Mapped to country | 1305 | 4.5% | 1285 | 4.2% | 1275 | 4.2% | 750 | 2.5% |
| Total with location info | 28640 | 100.0% | 30510 | 100.0% | 30305 | 100.0% | 30320 | 100.0% |
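The percentages in Table 1 are each category's share of the "Total with location info" row for that year. The sketch below reproduces the 2020/21 employment column from the published counts. The variable and function names are illustrative. The small gap between the sum of the categories and the published total is consistent with the counts having been rounded independently, although that explanation is an assumption rather than something stated in this section.

```python
# Counts for employment in the UK, 2020/21, as published in Table 1.
employment_2020_21 = {
    "Postcode information": 176590,
    "Item selected from drop-down": 71475,
    "Free text mapped to county / unitary authority": 2480,
    "Free text mapped to Government office region (England only)": 25,
    "Mapped to country": 3765,
}
published_total = 254330  # "Total with location info" for 2020/21


def shares(counts: dict, total: int) -> dict:
    """Return each category's share of the total, as a percentage to one decimal place."""
    return {label: round(count / total * 100, 1) for label, count in counts.items()}


for label, share in shares(employment_2020_21, published_total).items():
    print(f"{label}: {share}%")

# Difference between the summed categories and the published total.
print("Sum of categories minus published total:",
      sum(employment_2020_21.values()) - published_total)
```

The same calculation applies to the self-employment rows, using that group's own totals as the denominator.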



[1] See https://www.hesa.ac.uk/data-and-analysis/graduates/methodology/data-processing

[2] See https://www.hesa.ac.uk/data-and-analysis/graduates/methodology/data-processing#data-coding

[3] Information on our suppliers is here: https://www.hesa.ac.uk/innovation/outcomes/about/our-suppliers

[4] HESA has commissioned Oblong as a SIC code supplier in the past, using DLHE data that was similar to the structure of the relevant parts of Graduate Outcomes data. This longstanding methodology continued to prove robust.

[5] See https://www.hesa.ac.uk/definitions/operational-survey-information#data-classification-sicsoc

[6] See https://www.hesa.ac.uk/files/Graduate_Outcomes_SOC_Review_Summary_20220413.pdf

[7] See https://www.hesa.ac.uk/files/Graduate-Outcomes-SOC-coding-Independent-verification-analysis-report-20210429.pdf

[8] The Graduate Outcomes survey results coding manual is available here: https://www.hesa.ac.uk/collection/c20072

[9] This data is gathered through various survey questions (dependent on routing) and stored in the fields: EMPPLOC; EMPPCODE; EMPPCODE_UNKNOWN; EMPCOUNTRY; and EMPCITY. We also collect parallel data on self-employed graduates, using the fields: BUSEMPPLOC; BUSEMPPCODE; BUSEMPPCODE_UNKNOWN; BUSEMPCOUNTRY; and BUSEMPCITY. Results for these fields are similar in proportion to those in employment, though the prevalence of self-employment is much lower, and hence we do not offer a detailed analysis on this much smaller group. Detailed metadata on all these fields can be viewed by following links from the data items index in the Graduate Outcomes survey results record coding manual, here: https://www.hesa.ac.uk/collection/c20072/index

[10] Post-processing, location data can be found in the following derived fields: XWRKLOCGR; XWRKLOCN; XWRKLOCUC; XSTULOCGR; XSTULOCN; XSTULOCUC; XEMPLOCGR; XEMPLOCN; XEMPLOCUC; XBUSLOCGR; XBUSLOCN; and XBUSLOCUC. Details of the processing involved in their production are available by following the relevant links from the derived fields specification contents page in the Graduate Outcomes survey results record coding manual, here: https://www.hesa.ac.uk/collection/c20072/derived/contents

[11] See the derived field specification: https://www.hesa.ac.uk/collection/c20072/derived/zemparea

[12] See the derived field specification: https://www.hesa.ac.uk/collection/c20072/derived/zbusarea