
Processing error

Processing error covers errors introduced during data capture, coding, editing and tabulation of the data. This section describes the processes used and the quality assurance measures employed to avoid bias in processing and to limit variance. We cover the issues that have arisen, and our estimates of their impact.

HESA’s processing practices and quality assurance approach are explained in the Survey methodology section on data processing.[1] It covers data capture, data quality checking, SIC/SOC data coding (where HESA employs a specialist contractor), free text field ‘cleaning’, and derived fields.

SIC and SOC coding

SIC and SOC codes are applied wherever we have sufficient data to allow this. The data processing section of the Survey methodology explains this further.[2] An experienced external supplier (Oblong[3]) undertakes this coding, and the quality checks they apply are explained in the Survey methodology. The established SIC-coding methodology has proved stable over the long term.[4] A new method had to be developed for SOC coding. Oblong processed provisional SOC codes using an agreed method. These were then supplied (through the Portal) to HE providers, which were invited to quality assure the data for themselves. During the first year of operation this was a semi-structured quality assurance process that relied on the varying resource that providers were able to bring. Although we received feedback from only a subset of providers, any changes to SOC coding resulting from this feedback were applied consistently across the entire collection. Since the second year of operations, the process has been streamlined and simplified.

All the provider feedback received was placed into one of the following four categories: Systemic (where the error is widespread and there is a clear pattern of miscoding); Non-systemic (isolated cases); Inconsistent (where multiple records in an occupation group are coded inconsistently with no obvious pattern); or Not actionable (no basis or evidence exists for coding to be changed).

This helped us identify potential processing issues that affected records across the entire dataset. Non-systemic issues could not be used to improve individual-level data, as this would have been inequitable and would have introduced bias through inconsistent application. This exercise revealed some systemic errors in SOC coding, and also scrutinised some areas where the coding ultimately met our quality standards. An overview of this process can be found in the data processing section of the operational survey information.[5] Further information on the exercise undertaken to review feedback and improve the data processing approach is available in a detailed briefing, which sets out the impact of the issues identified.[6] It also describes additional internal checks, carried out independently on the entire dataset, and their outcomes.

The results from this year’s assessment highlighted a continued reduction in the number of issues identified as a result of provider feedback. In year one, 66 issues were identified as either inconsistent or systemic, reducing to 42 in year two and 40 in year three. The number of systemic issues identified this year is far lower, with only six systemic issues resulting from the process, compared to 12 last year. As a result of this comprehensive checking exercise, we believe the sources of systematic processing error identified by HE provider and manual quality checks have been removed, and the processing system fixed. There is no evidence of any remaining bias in the coding strategy for SOC, and any remaining processing error in year three data is likely to be minimal, and the product of random variation only.

During the second year of surveying we also conducted research into the reliability of our approach to coding, using established methods for this. In addition to the report on internal quality assurance work, on 29 April 2021 we published a second report detailing this independent verification of the reliability of our approach.[7] An exercise was carried out to compare codes returned by the primary coder for Graduate Outcomes with those returned by an independent organisation to validate HESA’s approach to coding and the outputs that follow. Independent coding of occupations by the Office for National Statistics found ‘almost perfect’ alignment between coders at the major-group level.
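The phrase ‘almost perfect’ matches the top band (above 0.80) of the Landis and Koch scale commonly used to interpret Cohen's kappa. The text above does not name the statistic used, so the following is only a minimal sketch of how agreement at major-group level could be measured, assuming a kappa-type statistic. The four-digit codes are made up for illustration and are not survey data; the function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders' labels over the same records."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(codes_a)
    # Share of records where both coders chose the same label.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


def major_group(soc_code):
    """First digit of a four-digit SOC unit group code."""
    return soc_code[0]


# Illustrative codes only (assumption): one list per coder, same record order.
primary = ["2136", "2231", "3545", "4159", "2314", "7111"]
independent = ["2135", "2231", "3543", "4159", "2314", "6145"]

kappa = cohen_kappa(
    [major_group(c) for c in primary],
    [major_group(c) for c in independent],
)
print(f"kappa at major-group level: {kappa:.2f}")
```

Comparing at major-group level means that two coders who disagree on the detailed unit group but agree on the first digit still count as agreeing, which is why agreement at this level is typically higher than at the four-digit level.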

Handling free text responses

Most questions in Graduate Outcomes map directly to established lists of values, and details of these are available in the coding manual.[8] However, there is often an “Other” option that permits a free text response. In this subsection and the subsequent ones, we cover the most important issues relating to free text processing, and explain the risks around processing error, giving our estimates for this.

At the end of the collection process, data returned for questions that permit a free-text response goes through a cleansing process, in order to improve data quality. This is usually where the respondent has not chosen a value from the drop-down list provided but has instead selected “other” and typed their own answer. This process also runs for questions seeking postcode, city/area and country of employment, or self-employment / running own business; country in which graduate is living and of further study; provider of further study, and salary currency. Where possible, the free text is mapped to an appropriate value from a dictionary published within the appropriate derived field specification.
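As a rough illustration of the dictionary mapping described above, the sketch below normalises a free-text answer and looks it up in a dictionary. The dictionary entries, output codes and function names are hypothetical; the actual dictionaries are those published in the derived field specifications.

```python
import re

# Hypothetical dictionary for salary currency (assumption, for illustration).
CURRENCY_DICTIONARY = {
    "us dollars": "USD",
    "usd": "USD",
    "euro": "EUR",
    "euros": "EUR",
}


def normalise(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def map_free_text(text, dictionary):
    """Return the dictionary value for a free-text answer, or None if unmapped."""
    if text is None:
        return None
    return dictionary.get(normalise(text))


print(map_free_text("US Dollars!", CURRENCY_DICTIONARY))  # mapped
print(map_free_text("bitcoin", CURRENCY_DICTIONARY))      # not mapped
```

Answers that do not match any dictionary entry stay unmapped rather than being guessed, which corresponds to the "not mapped" rows reported in the tables in this section.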

We have encountered some specific issues in the processing of UK-based location information, which we turn to next. Later subsections offer comparable quality descriptions of cleansing of further study and home country data.

Location of work data – handling free text

Location of work is collected from graduates who are in paid work for an employer, in voluntary or unpaid work, or contracted to start a job in the next month. Respondents in employment are asked to tell us where they worked during the census week and respondents contracted to start a job are asked where they will be working.[9] The majority of respondents supplied data that we could process into a structured format, such as their employer’s postcode.[10] Across all years, around 7% of those graduates in work during the census week did not provide any location information. In 2019/20, of those graduates contracted to start a job in the next month, 12% (11% in 2018/19; 10% in 2017/18) did not provide any location information. These graduates are excluded from the table below.

Consistently over time, around 0.5% of graduates who indicated the country in which they were employed or due to start a job did not provide any additional postcode or free text location information. Although difficult to identify precisely, around 0.3% provided free text information indicating that they refused, did not want to, or were unable to provide more detailed location information.

Location of self-employment or own business is collected from graduates who are in self-employment or running their own business during the census week. In 2019/20, of those graduates in self-employment or running their own business during the census week, 9% (9% in 2018/19, 10% in 2017/18) did not provide any location information. These graduates are also excluded from the table below.

HESA has developed an algorithm for processing free text information, combining it with information collected through drop-down menus, and mapping postcodes to counties and regions.

ZEMPCOUNTRY[11] and ZBUSCOUNTRY[14] are processing fields which clean up the data provided by a graduate responding to the Graduate Outcomes survey question “In which country is your place of work?” by combining the data provided in EMPCOUNTRY / BUSEMPCOUNTRY (based on a restricted list of values available to the respondent as a drop-down menu) with that from the free text fields EMPCOUNTRY_OTHER / BUSEMPCOUNTRY_OTHER. Due to low numbers of graduates supplying free text information which could be mapped to country information, the free text fields were removed from 2019/20.
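The combination of a drop-down value with a free text fallback can be sketched as follows. This is not the published derivation: the "OTHER" sentinel, the country codes and the lookup contents are assumptions made for illustration, and the function name is hypothetical.

```python
# Hypothetical lookup from cleaned free text to a country code (assumption).
FREE_TEXT_COUNTRIES = {
    "france": "FR",
    "deutschland": "DE",
}


def clean_country(selected, other_text, free_text_lookup):
    """Prefer the drop-down value; fall back to mapped free text; else None.

    `selected` stands in for EMPCOUNTRY / BUSEMPCOUNTRY and `other_text`
    for EMPCOUNTRY_OTHER / BUSEMPCOUNTRY_OTHER.
    """
    # "OTHER" is an assumed sentinel for "not in the drop-down list".
    if selected and selected != "OTHER":
        return selected
    if other_text:
        return free_text_lookup.get(other_text.strip().lower())
    return None


print(clean_country("ES", None, FREE_TEXT_COUNTRIES))
print(clean_country("OTHER", " France ", FREE_TEXT_COUNTRIES))
print(clean_country("OTHER", "somewhere abroad", FREE_TEXT_COUNTRIES))
```

The three calls correspond to the three outcomes reported for non-UK employment in Table 29: country selected from drop-down, free text mapped, and no free text or not mapped.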

The processing of free text information relating to UK location of work is more complex and has two stages. The first iteration was based on the processing fields used to clean area (ZEMPAREA[12], ZBUSAREA[13]) and postcode (ZEMPPCODE, ZBUSPCODE) information. Cleaned postcode information was mapped to county/unitary authority or region and combined with the cleaned area information. Following the collection of 2017/18 data, a second round of more in-depth cleaning was carried out on the free text information supplied in EMPCITY and BUSEMPCITY, using standard area lists for mapping. This method was adopted as standard from 2018/19.

With the enhancement of the derived field mapping process, a large majority of graduates who provided some UK location information could be mapped to county / unitary authority level (derived in XEMPLOCUC / XBUSLOCUC). The matching process is specified in more detail within the derived field documents[10] for XEMPLOCUC, XEMPLOCGR, XBUSLOCUC and XBUSLOCGR.
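The tiered mapping described above (postcode first, then free text to county / unitary authority, then region, then country only) can be illustrated with a short sketch. The lookup tables, their keys and the function names are hypothetical; the actual matching rules are those in the derived field specifications.

```python
# Hypothetical lookup tables (assumptions, for illustration only).
POSTCODE_TO_COUNTY = {"GL50": "Gloucestershire"}
AREA_TO_COUNTY = {"cheltenham": "Gloucestershire"}
AREA_TO_REGION = {"yorkshire": "Yorkshire and The Humber"}


def outward_code(postcode):
    """Outward part of a UK postcode, e.g. 'GL50 1HZ' -> 'GL50'."""
    compact = postcode.replace(" ", "").upper()
    # The inward part of a UK postcode is always three characters.
    return compact[:-3] if len(compact) >= 5 else None


def resolve_uk_location(postcode, city_text, country):
    """Return (level, value) at the finest geography that can be mapped."""
    if postcode:
        county = POSTCODE_TO_COUNTY.get(outward_code(postcode))
        if county:
            return ("postcode", county)
    key = (city_text or "").strip().lower()
    if key in AREA_TO_COUNTY:
        return ("county_ua", AREA_TO_COUNTY[key])
    if key in AREA_TO_REGION:
        return ("region", AREA_TO_REGION[key])
    # Nothing finer could be mapped: fall back to the country alone.
    return ("country", country)


print(resolve_uk_location("gl50 1hz", None, "England"))
print(resolve_uk_location(None, "Yorkshire", "England"))
print(resolve_uk_location(None, "somewhere", "Wales"))
```

The four possible levels returned here mirror the UK rows of Table 29: postcode information, free text mapped to county / unitary authority, free text mapped to Government office region, and mapped to country.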

As a result, from year two, data has been released at a more granular geographic resolution. Users of microdata will also notice improvements in geographical resolution and should assess data quality for uses below regional level.

We continue to view improvements to geographical resolution as a priority, as we are aware of strong user demand for high-resolution place-based analysis. We are currently undertaking a programme of work to evaluate the pros and cons of various options for this, including improvements to the survey instruments and to the algorithmic approach we use in data processing.

Table 29: Location of work, self-employment or own business data - processing free-text responses

| | 2017/18 (number) | 2017/18 (%) | 2018/19 (number) | 2018/19 (%) | 2019/20 (number) | 2019/20 (%) |
|---|---|---|---|---|---|---|
| Employment not in the UK | | | | | | |
| Country selected from drop-down | 32710 | 99.3% | 36870 | 99.5% | 39415 | 99.7% |
| Free text mapped | 50 | 0.1% | 10 | 0.0% | N/A | N/A |
| No free text or not mapped | 185 | 0.6% | 165 | 0.4% | 110 | 0.3% |
| Total | 32940 | 100.0% | 37045 | 100.0% | 39525 | 100.0% |
| Employment in the UK | | | | | | |
| Postcode information | 144450 | 58.9% | 156570 | 62.7% | 158100 | 64.8% |
| Free text mapped to county / unitary authority | 91765 | 37.4% | 84885 | 34.0% | 79030 | 32.4% |
| Free text mapped to Government office region (England only) | 785 | 0.3% | 650 | 0.3% | 105 | 0.0% |
| Mapped to country | 8315 | 3.4% | 7560 | 3.0% | 6780 | 2.8% |
| Total | 245315 | 100.0% | 249670 | 100.0% | 244015 | 100.0% |
| Due to start work not in the UK | | | | | | |
| Country from drop-down | 1405 | 99.2% | 1470 | 99.5% | 1735 | 99.4% |
| Free text mapped | 0 | 0.1% | 0 | 0.0% | N/A | N/A |
| No free text or not mapped | 10 | 0.7% | 5 | 0.5% | 10 | 0.6% |
| Total | 1415 | 100.0% | 1480 | 100.0% | 1745 | 100.0% |
| Due to start work in the UK | | | | | | |
| Postcode information | 3110 | 45.1% | 3340 | 48.6% | 4540 | 54.7% |
| Free text mapped to county / unitary authority | 3425 | 49.7% | 3235 | 47.1% | 3485 | 42.0% |
| Free text mapped to Government office region (England only) | 25 | 0.4% | 30 | 0.5% | 5 | 0.1% |
| Mapped to country | 330 | 4.8% | 270 | 3.9% | 265 | 3.2% |
| Total | 6890 | 100.0% | 6875 | 100.0% | 8295 | 100.0% |
| Self-employment not in the UK | | | | | | |
| Country selected from drop-down | 7165 | 21.7% | 8445 | 22.8% | 9480 | 24.0% |
| Free text mapped | 5 | 0.0% | 5 | 0.0% | N/A | N/A |
| No free text or not mapped | 70 | 0.2% | 40 | 0.1% | 30 | 0.1% |
| Total | 7240 | 22.0% | 8490 | 22.9% | 9510 | 24.1% |
| Self-employment in the UK | | | | | | |
| Postcode information | 15690 | 6.4% | 18350 | 7.4% | 18675 | 7.7% |
| Free text mapped to county / unitary authority | 11515 | 4.7% | 10765 | 4.3% | 10335 | 4.2% |
| Free text mapped to Government office region (England only) | 130 | 0.1% | 105 | 0.0% | 20 | 0.0% |
| Mapped to country | 1305 | 0.5% | 1285 | 0.5% | 1275 | 0.5% |
| Total | 28640 | 11.7% | 30510 | 12.2% | 30305 | 12.4% |

Further study data – handling free text

Further study data on the provider attended[15] reflects the very large number of HE providers at which UK graduates go on to study. Provider information is collected from graduates undertaking further study during the census week, those who are due to start studying in the next month, and those who undertook interim study. Graduates can either select their UK provider from a drop-down menu or provide details in a free text box. Where a graduate selects their provider from the drop-down menu, they are assumed to be studying in the UK; otherwise they are asked for the country of their provider. Graduates provide country of further study information by selecting from the country drop-down menu. Prior to cohort D of 2019/20, a free text box was provided as an alternative to picking from the drop-down menu. Note that from 2019/20, graduates who selected a provider from the drop-down menu were also asked to select their country of provider from the drop-down menu. As part of the further study linking work, free text information collected for further study, due to start study and interim study (UCNAME_OTHER and PREVUCNAME_OTHER1-3) went through a manual cleaning process for UK-domiciled graduates studying in the UK. The results of this cleaning process are included in the table below for information. Graduates who indicated they were in interim study are not asked for their country of study and are excluded from the table below.

In 2019/20, of those graduates who were in further study in the census week, 15% (17% in 2018/19, 20% in 2017/18) did not select a provider from the drop-down menu and did not provide any free text information. Of those due to start study, 24% (22% in 2018/19, 28% in 2017/18) did not provide any provider information. These graduates have been excluded from the table below.

Table 30: Provider - processing free text responses

| | 2017/18 (number) | 2017/18 (%) | 2018/19 (number) | 2018/19 (%) | 2019/20 (number) | 2019/20 (%) |
|---|---|---|---|---|---|---|
| Further study | | | | | | |
| Provider selected from drop-down | 47530 | 79.6% | 52925 | 76.7% | 54100 | 77.3% |
| Free text mapped | 1575 | 2.6% | 1930 | 2.8% | 2075 | 3.0% |
| UK study - free text not mapped | 6150 | 10.3% | 8700 | 12.6% | 7775 | 11.1% |
| Non-UK study - free text not mapped | 4475 | 7.5% | 5410 | 7.8% | 6025 | 8.6% |
| Total | 59730 | 100.0% | 68970 | 100.0% | 69975 | 100.0% |
| Due to start study | | | | | | |
| Provider selected from drop-down | 13720 | 82.2% | 14980 | 82.3% | 12845 | 79.7% |
| Free text mapped | 475 | 2.9% | 515 | 2.8% | 500 | 3.1% |
| UK study - free text not mapped | 1720 | 10.3% | 1840 | 10.1% | 1840 | 11.4% |
| Non-UK study - free text not mapped | 785 | 4.7% | 855 | 4.7% | 935 | 5.8% |
| Total | 16705 | 100.0% | 18190 | 100.0% | 16120 | 100.0% |
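As a worked check on how the percentages in Table 30 relate to the counts, the sketch below recomputes the 2017/18 "Further study" shares from the published figures. The variable names are illustrative. Published counts appear to be rounded, so shares recomputed from them may not match the published percentages exactly for very small categories in other tables.

```python
# Published 2017/18 counts for the "Further study" block of Table 30.
further_study_2017_18 = {
    "Provider selected from drop-down": 47530,
    "Free text mapped": 1575,
    "UK study - free text not mapped": 6150,
    "Non-UK study - free text not mapped": 4475,
}

total = sum(further_study_2017_18.values())

# Each share is the category count as a percentage of the block total.
shares = {
    category: round(100 * count / total, 1)
    for category, count in further_study_2017_18.items()
}

print(total)
for category, share in shares.items():
    print(f"{category}: {share}%")
```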

Home country – handling free text

Home country information is collected from graduates who are doing something other than being in some form of employment or further study during the census week.[16] Graduates select their home country from a drop-down menu. Prior to 2019/20, they could instead provide details in a free text box; due to the small numbers using it, this option was removed in 2019/20. Across all years, 2% of graduates did not provide any home country information. These graduates are excluded from the table below.

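Table 31: Home country - processing free text responses

| | 2017/18 (number) | 2017/18 (%) | 2018/19 (number) | 2018/19 (%) | 2019/20 (number) | 2019/20 (%) |
|---|---|---|---|---|---|---|
| Home country | | | | | | |
| Country from drop-down | 35040 | 99.9% | 42160 | 100.0% | 37965 | 100.0% |
| Free text mapped | 20 | 0.1% | 0 | 0.0% | N/A | N/A |
| Free text not mapped | 20 | 0.1% | 5 | 0.0% | N/A | N/A |
| Total | 35080 | 100.0% | 42165 | 100.0% | 37965 | 100.0% |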

 



[1] See https://www.hesa.ac.uk/data-and-analysis/graduates/methodology/data-processing

[2] See https://www.hesa.ac.uk/data-and-analysis/graduates/methodology/data-processing#data-coding

[3] Information on our suppliers is here: https://www.hesa.ac.uk/innovation/outcomes/about/our-suppliers

[4] HESA has commissioned Oblong as a SIC code supplier in the past, using DLHE data that was similar to the structure of the relevant parts of Graduate Outcomes data. This longstanding methodology continued to prove robust.

[5] See https://www.hesa.ac.uk/definitions/operational-survey-information#data-classification-sicsoc

[6] See https://www.hesa.ac.uk/files/Graduate_Outcomes_SOC_Review_Summary_20220413.pdf

[7] See https://www.hesa.ac.uk/files/Graduate-Outcomes-SOC-coding-Independent-verification-analysis-report-20210429.pdf

[8] The Graduate Outcomes survey results coding manual is available here: https://www.hesa.ac.uk/collection/c19072

[9] This data is gathered through various survey questions (dependent on routing) and stored in the fields: EMPPLOC; EMPPCODE; EMPPCODE_UNKNOWN; EMPCOUNTRY; EMPCOUNTRY_OTHER; and EMPCITY. We also collect parallel data on self-employed graduates, using the fields: BUSEMPPLOC; BUSEMPPCODE; BUSEMPPCODE_UNKNOWN; BUSEMPCOUNTRY; BUSEMPCOUNTRY_OTHER; and BUSEMPCITY. Results for these fields are similar in proportion to those in employment, though the prevalence of self-employment is much lower, and hence we do not offer a detailed analysis of this much smaller group. Detailed metadata on all these fields can be viewed by following links from the data items index in the Graduate Outcomes survey results record coding manual, here: https://www.hesa.ac.uk/collection/c19072/index

[10] Post-processing, location data can be found in the following derived fields: XWRKLOCGR; XWRKLOCN; XWRKLOCUC; XSTULOCGR; XSTULOCN; XSTULOCUC; XEMPLOCGR; XEMPLOCN; XEMPLOCUC; XBUSLOCGR; XBUSLOCN; XBUSLOCUC; and XCURRLOC. Details of the processing involved in their production are available by following the relevant links from the derived fields specification contents page in the Graduate Outcomes survey results record coding manual, here: https://www.hesa.ac.uk/collection/c19072/derived/contents

[11] See the specification for ZEMPCOUNTRY: https://www.hesa.ac.uk/collection/c19072/derived/zempcountry

[12] See the derived field specification: https://www.hesa.ac.uk/collection/c19072/derived/zemparea

[13] See the derived field specification: https://www.hesa.ac.uk/collection/c19072/derived/zbusarea

[14] See derived field specification for ZBUSCOUNTRY at: https://www.hesa.ac.uk/collection/c19072/derived/zbuscountry

[15] See the specification for free text ‘other’ responses at https://www.hesa.ac.uk/collection/c19072/a/ucname_other. This is returned where a respondent does not locate a suitable option from the list of values at: https://www.hesa.ac.uk/collection/c19072/a/ucname

[16] See derived field specification at https://www.hesa.ac.uk/collection/c19072/derived/ZHOMECOUNTRY