
Processing error

Processing error covers errors arising during data capture, coding, editing and tabulation of the data. This section describes the processes used and the quality assurance measures employed to avoid bias in processing and to limit the incidence of variance. We cover the issues that have arisen, and our estimates of their impact.

HESA’s processing practices and quality assurance approach are explained in the Survey methodology section on data processing.[1] It covers data capture, data quality checking, SIC/SOC data coding (where HESA employs a specialist contractor), free text field ‘cleaning’, and derived fields.

Imputation and editing

No instances of imputation have occurred during the second year of surveying.

In the first year of Graduate Outcomes data processing, HESA applied imputation to one variable, which records the country in which the graduate was studying in the census week.[2] This variable (STUCOUNTRY) must be answered when the previous question, which identifies the university or college the graduate studied at (UCNAME), has an answer that is either not in the pre-defined list of providers or has not been answered. However, an issue arose in the routing: if UCNAME had not been answered, STUCOUNTRY did not display and could not be answered. This issue affected respondents meeting the above conditions prior to a fix being applied on 2019-03-22. Of the 2,260 graduates with missing data, we successfully imputed observations for 625.

Our solution was, where possible, to fill the gaps by imputation, using linked data from the 2018/19 Student record(s) to identify graduates who studied at a UK higher education provider whose data is collected by HESA. A fuzzy matching process was carried out to link these graduates to the HESA Student and AP records and to pick up the appropriate country code of further study (England = XF, Wales = XI, Scotland = XH, Northern Ireland = XG).
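The real process linked graduate records to the HESA Student and AP records; purely as an illustration of the kind of fuzzy string matching such linkage relies on, the sketch below matches a free-text provider name against a hypothetical lookup of known providers and imputes the corresponding country of study code when the match is sufficiently close. The provider entries, threshold and function names are assumptions, not the production implementation.

```python
from difflib import SequenceMatcher

# Hypothetical lookup of HESA-registered providers to country of study codes
# (England = XF, Wales = XI, Scotland = XH, Northern Ireland = XG).
PROVIDER_COUNTRY = {
    "UNIVERSITY OF EXAMPLESHIRE": "XF",   # illustrative entries only
    "EXAMPLE UNIVERSITY OF WALES": "XI",
}

def normalise(name: str) -> str:
    """Upper-case and collapse whitespace so trivial differences don't block a match."""
    return " ".join(name.upper().split())

def impute_stucountry(free_text_provider: str, threshold: float = 0.9) -> str | None:
    """Return an imputed country code when the free-text provider name fuzzy-matches
    a known provider closely enough; otherwise None (the value stays missing)."""
    candidate = normalise(free_text_provider)
    best_score, best_country = 0.0, None
    for provider, country in PROVIDER_COUNTRY.items():
        score = SequenceMatcher(None, candidate, provider).ratio()
        if score > best_score:
            best_score, best_country = score, country
    return best_country if best_score >= threshold else None
```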

SIC and SOC coding

SIC and SOC codes are applied wherever we have sufficient data to allow this. The data processing section of the Survey methodology explains this further.[3] An experienced external supplier (Oblong[4]) undertakes this coding, and the quality checks they apply are explained in the Survey methodology. Established SIC-coding methodology has proved stable over the long term.[5] A new method had to be developed for SOC coding. Provisional SOC codes were produced by Oblong using an agreed method. These were then supplied to HE providers (through the Portal), which were invited to quality assure the data for themselves. During this phase, more than 90 providers undertook peer review. During the first year of operation this was a semi-structured quality assurance process that relied on the varying resources providers were able to bring. Although we received feedback from only a subset of providers, any changes to SOC coding resulting from this feedback were applied consistently across the entire collection. During the second year of operations, learning from the first year has been applied to streamline and simplify the process, but it remains essentially the same.

All the provider feedback received was placed into one of the following four categories: Systemic (the error is widespread and there is a clear pattern of miscoding); Non-systemic (isolated cases); Inconsistent (multiple records in an occupation group are coded inconsistently with no obvious pattern); or Not actionable (no basis or evidence exists for the coding to be changed).
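As a purely illustrative sketch of how this triage could be represented in a processing pipeline (the category definitions come from the text above; the code itself is hypothetical):

```python
from collections import Counter
from enum import Enum

class FeedbackCategory(Enum):
    """The four triage categories applied to provider feedback on SOC coding."""
    SYSTEMIC = "Systemic"              # widespread error with a clear pattern of miscoding
    NON_SYSTEMIC = "Non-systemic"      # isolated cases
    INCONSISTENT = "Inconsistent"      # occupation group coded inconsistently, no obvious pattern
    NOT_ACTIONABLE = "Not actionable"  # no basis or evidence for changing the coding

def summarise_feedback(items: list[FeedbackCategory]) -> Counter:
    """Tally feedback by category; systemic findings are the ones that drive
    collection-wide coding fixes rather than record-level edits."""
    return Counter(items)

# Hypothetical usage
print(summarise_feedback([FeedbackCategory.SYSTEMIC, FeedbackCategory.NOT_ACTIONABLE]))
```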

This categorisation helped us identify potential processing issues that affected a large number of records. Non-systemic issues could not be used to improve individual-level data, as this would have been inequitable and would have introduced bias through inconsistent application. The exercise revealed some systemic errors in SOC coding, and also scrutinised some areas where the coding ultimately met our quality standards. An overview of this process can be found in the data processing section of the operational survey information.[6] Further detail on the exercise undertaken to review feedback and improve the data processing approach is available in a separate briefing, which sets out the impact of the issues identified.[7]

As a result of the exercise described above, 8% of records from the first year of surveying had their SOC codes changed prior to the production of the outputs. Many users group all codes within SOC major groups 1-3 to identify ‘professional and managerial’, ‘highly skilled’, or ‘graduate’ jobs. Of all the records that changed at the major group level as a result of identified coding problems, 28% moved from major groups 4-9 to 1-3, 13% moved from major groups 1-3 to 4-9, and the remaining 59% stayed within the same of these two groupings. We also amended the logic to remove the impact of qualification requirements on coding and to allow many partially-completed responses to be coded, increasing the usefulness of the data.[8] As a result of this comprehensive checking exercise, we believe the sources of systematic processing error identified by HE providers’ manual quality checks have been removed and the processing system fixed. There is no evidence of any remaining bias in the SOC coding strategy, and any remaining processing error in year one data is likely to be minimal and the product of random variation only.
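For readers who want to reproduce this kind of breakdown on their own extracts, the sketch below shows one way to classify changed records by the 1-3 versus 4-9 groupings using the first digit of the SOC code. The example code pairs are invented and the logic is a simplification, not HESA's production process.

```python
# Hypothetical (old_code, new_code) pairs for records whose SOC code changed at the
# major group level; the first digit of a SOC code is its major group.
changed = [("6145", "2231"), ("2314", "6125"), ("5223", "8214")]

def grouping(soc_code: str) -> str:
    """Collapse SOC major groups into the two bands commonly used by analysts."""
    return "1-3" if soc_code[0] in "123" else "4-9"

moves = {"4-9 to 1-3": 0, "1-3 to 4-9": 0, "same grouping": 0}
for old, new in changed:
    if grouping(old) == grouping(new):
        moves["same grouping"] += 1
    elif grouping(new) == "1-3":
        moves["4-9 to 1-3"] += 1
    else:
        moves["1-3 to 4-9"] += 1

total = len(changed)
print({k: f"{100 * v / total:.0f}%" for k, v in moves.items()})
```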

During the second year of operations our engagement with providers over quality assurance was streamlined. On 29 April 2021 we published a report summarising the year two SOC coding assurance we have undertaken.[9] The summary of SOC coding assurance for the second year of the Graduate Outcomes survey shows that 3,048 queries were received from HE providers about the SOC codes assigned to graduates’ jobs. From these queries, systemic issues were found with the coding of 12 occupation groups, and a further 30 groups were found to have been coded inconsistently. 95.5% of queries were deemed non-systemic or not actionable.

During the second year of surveying we also conducted research into the reliability of our approach to coding, using established methods for this. In addition to the report on internal quality assurance work, on 29 April 2021 we published a second report detailing this independent verification of the reliability of our approach.[10] An exercise was carried out to compare codes returned by the primary coder for Graduate Outcomes with those returned by an independent organisation to validate HESA’s approach to coding and the outputs that follow. Independent coding of occupations by the Office for National Statistics found ‘almost perfect’ alignment between coders at the major-group level.
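The published report describes the verification exercise in full. As a hedged illustration of one common way to quantify agreement between two coders at major-group level, the sketch below computes Cohen's kappa on hypothetical paired assignments; on the commonly cited Landis and Koch scale, values above 0.8 are labelled ‘almost perfect’. This is not necessarily the measure used in the published verification.

```python
from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders assigning SOC major groups to the same records."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a.keys() | freq_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical major-group assignments for five records
primary = ["2", "3", "1", "6", "2"]
independent = ["2", "3", "2", "6", "2"]
print(round(cohens_kappa(primary, independent), 2))
```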

Having demonstrated a high standard of reliability in our approach, we intend to refine engagements with HE providers over SOC coding assessment to reduce the burden they experience in supporting quality assurance, and to streamline our production processes.

Handling free text responses

Most questions in Graduate Outcomes map directly to established lists of values, and details of these are available in the coding manual.[11] However, there is often an “Other” option that permits a free text response. In this subsection and the subsequent ones, we cover the most important issues relating to free text processing, and explain the risks around processing error, giving our estimates for this.

At the end of the collection process, data returned for questions that permit a free-text response goes through a cleansing process in order to improve data quality. This is usually where the respondent has not chosen a value from the drop-down list provided but has instead selected “other” and typed their own answer. This process also runs for questions seeking the postcode, city/area and country of employment or self-employment/running own business; the country in which the graduate is living and the country of further study; the provider of further study; and the salary currency. Where possible, the free text is mapped to an appropriate value from a dictionary published within the appropriate derived field specification.
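As an indicative sketch only (the real dictionaries are published within the derived field specifications), the mapping step can be thought of as a normalise-then-look-up operation like the following; the dictionary entries and function name here are hypothetical.

```python
import re

# Hypothetical extract of a free-text-to-code dictionary; the real mappings are
# published within the relevant derived field specifications.
COUNTRY_DICTIONARY = {
    "ENGLAND": "XF",
    "UNITED KINGDOM": "XK",
    "U K": "XK",
}

def cleanse_free_text(raw: str) -> str | None:
    """Normalise an 'Other' free-text response and map it to a coded value if possible."""
    normalised = re.sub(r"[^A-Z ]", " ", raw.upper())  # strip punctuation and digits
    normalised = " ".join(normalised.split())          # collapse whitespace
    return COUNTRY_DICTIONARY.get(normalised)          # None means left unmapped

print(cleanse_free_text("u.k."))  # -> "XK"
```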

We have encountered some specific issues in the processing of UK-based location information, which we turn to next. Later subsections offer comparable quality descriptions of cleansing of further study and home country data.

Location of work data – handling free text

Location of work is collected from graduates who are in paid work for an employer, in voluntary or unpaid work, or contracted to start a job in the next month. Respondents in employment are asked to tell us where they worked during the census week.[12] The majority of respondents supplied data that we could process into a structured format, such as their employer’s postcode.[13] In 2018/19, of those graduates in work during the census week, 7% (7% in 2017/18) did not provide any location information, and of those graduates contracted to start a job in the next month, 11% (10% in 2017/18) did not provide any location information. These graduates are excluded from the table below.

HESA has developed an algorithm[14] for the processing field ZEMPCOUNTRY, which cleans up the data provided by a graduate responding to the Graduate Outcomes survey question “In which country is your place of work?”. It does this by combining the data provided in EMPCOUNTRY (which is based on a restricted list of values available to the respondent as a drop-down menu) with that from the free text field EMPCOUNTRY_OTHER.
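A minimal sketch of that combination logic, assuming hypothetical lookup data and simplified inputs (the authoritative definition is the published ZEMPCOUNTRY specification):

```python
# Hypothetical extract of the country lookup used for free-text answers.
FREE_TEXT_COUNTRIES = {"FRANCE": "FR", "IRELAND": "IE"}

def derive_zempcountry(empcountry: str | None, empcountry_other: str | None) -> str | None:
    """Sketch of combining the two inputs: prefer the coded drop-down value,
    otherwise attempt to map the cleansed free text."""
    if empcountry:                           # respondent used the drop-down list
        return empcountry
    if empcountry_other:                     # respondent typed an answer instead
        key = " ".join(empcountry_other.upper().split())
        return FREE_TEXT_COUNTRIES.get(key)  # None if the text cannot be mapped
    return None                              # no country information supplied

print(derive_zempcountry(None, "  france "))  # -> "FR"
```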

With the enhancement of the derived field mapping process, a large majority of graduates who provided some UK location information could be mapped to county / unitary authority level (derived in XEMPLOCUC). Our exact matching process is specified in detail in the processing fields ZEMPAREA[15] and ZBUSAREA[16] and the derived fields XEMPLOCUC, XEMPLOCGR, XBUSLOCUC and XBUSLOCGR. The previous iteration of the quality report goes into detail about the problems we faced and our course of action for the publication of year one data.
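Again as a simplified sketch rather than the specified process, the county / unitary authority derivation can be pictured as a cascade from the most precise information available (postcode) down to city/area free text; the lookups below are hypothetical extracts.

```python
# Hypothetical extracts of the lookups behind the county / unitary authority derivation;
# the production rules live in ZEMPAREA, ZBUSAREA and the XEMPLOC*/XBUSLOC* fields.
POSTCODE_DISTRICT_TO_UA = {"CF10": "Cardiff", "LS2": "Leeds"}
CITY_TO_UA = {"CARDIFF": "Cardiff", "LEEDS": "Leeds"}

def map_to_unitary_authority(postcode: str | None, city: str | None) -> str | None:
    """Prefer the postcode (most precise), fall back to city/area free text, and return
    None when neither maps (the record then falls back to GOR or country level)."""
    if postcode:
        district = postcode.upper().split()[0]   # e.g. "CF10 3AT" -> "CF10"
        if district in POSTCODE_DISTRICT_TO_UA:
            return POSTCODE_DISTRICT_TO_UA[district]
    if city:
        return CITY_TO_UA.get(" ".join(city.upper().split()))
    return None
```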

Table 26 Location of work data - processing free-text responses

|  | 2017/18 In work | 2017/18 Due to start | 2017/18 Total | 2017/18 % | 2018/19 In work | 2018/19 Due to start | 2018/19 Total | 2018/19 % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Non-UK |  |  |  |  |  |  |  |  |
| Country selected from drop-down | 32705 | 1405 | 34110 | 99.3% | 36870 | 1470 | 38345 | 99.5% |
| Free text country information mapped | 50 | 0 | 50 | 0.1% | 10 | 0 | 10 | 0.0% |
| Free text country information not mapped (including NULLs) | 185 | 10 | 195 | 0.6% | 165 | 5 | 170 | 0.4% |
| Total in work not in the UK | 32940 | 1415 | 34355 | 100.0% | 37045 | 1480 | 38525 | 100.0% |
| UK |  |  |  |  |  |  |  |  |
| Mapped to county/unitary authority level | 235705 | 6530 | 242235 | 96.1% | 241215 | 6570 | 247785 | 96.6% |
| …of whom gave postcode | 143945 | 3105 | 147050 |  | 156330 | 3335 | 159665 |  |
| Mapped to GOR level but not county/UA level | 1055 | 30 | 1085 | 0.4% | 725 | 30 | 760 | 0.3% |
| Mapped to country level (based on EMPPLOC) | 8545 | 330 | 8875 | 3.5% | 7725 | 270 | 8000 | 3.1% |
| …of whom refused to give information (approximate) | 1110 | 50 | 1160 |  | 1335 | 50 | 1385 |  |
| …of whom gave NULL response | 1155 | 40 | 1195 |  | 1285 | 35 | 1320 |  |
| Total in work in the UK | 245305 | 6890 | 252195 | 100.0% | 249670 | 6875 | 256545 | 100.0% |

Location of self-employment or own business is collected from graduates who are in self-employment or running their own business during the census week. In 2018/19, of those graduates in self-employment or running their own business during the census week, 9% (10% in 2017/18) did not provide any location information. These graduates are excluded from the table below. With the enhancement of the derived field mapping process,[17] a large majority of graduates who provided some UK location information could be mapped to county / unitary authority level (derived in XBUSLOCUC).[18]

Table 27 Location of self-employment or own business - processing free-text responses

|  | 2017/18 Total | 2017/18 % | 2018/19 Total | 2018/19 % |
| --- | --- | --- | --- | --- |
| Non-UK |  |  |  |  |
| Country selected from drop-down | 7165 | 99.0% | 8445 | 99.5% |
| Free text country information mapped | 5 | 0.1% | 5 | 0.0% |
| Free text country information not mapped (including NULLs) | 70 | 1.0% | 40 | 0.5% |
| Total in self-employment/own business not in the UK | 7240 | 100.0% | 8490 | 100.0% |
| UK |  |  |  |  |
| Mapped to county/unitary authority level | 27145 | 94.8% | 29085 | 95.3% |
| …of whom gave postcode | 15625 |  | 18320 |  |
| Mapped to GOR level but not county/UA level | 160 | 0.6% | 115 | 0.4% |
| Mapped to country level (based on EMPPLOC) | 1330 | 4.6% | 1305 | 4.3% |
| …of whom refused to give information (approximate) | 240 |  | 315 |  |
| …of whom gave NULL response | 150 |  | 165 |  |
| Total in self-employment/own business in the UK | 28635 | 100.0% | 30510 | 100.0% |

As a result, the year two data has been released at a more granular geographic resolution, and we have reprocessed year one data so that it too reaches a higher resolution. Users of microdata will also notice improvements in geographical resolution and should assess data quality carefully for uses below regional level. Improving geographical resolution further remains a priority for HESA, as we are aware of strong user demand for high-resolution place-based analysis.

We continue to view improvements to geographical resolution as a priority and are currently undertaking a programme of work to evaluate the pros and cons of various options for this, including improvements to the survey instruments and also to the algorithmic approach we utilise in data processing.

Further study data – handling free text

Further study data on the provider attended[19] reflects the very large number of HE providers that UK graduates go on to study at. Provider information is collected from graduates undertaking further study during the census week or those who are due to start studying in the next month. Graduates can either select their UK provider from a drop-down menu or can provide details in a free text box. Where a student selects their UK provider from the drop-down menu, they are assumed to be studying in the UK, otherwise they are asked for their country of provider. Graduates provide country of further study information by selecting from the country drop-down menu or entering in the free text box.

Of those 2018/19 graduates who were in further study in the census week, 17% (20% in 2017/18) did not provide any information about their provider and of those due to start study, 22% (28% in 2017/18) did not provide any information about their provider. Of the 2018/19 graduates in further study in the census week who did not select their provider from the drop-down menu, 5% (10% in 2017/18) did not provide any country information and of those due to start study, 10% (17% in 2017/18) did not provide any country information.

The first table below excludes those graduates who did not supply any provider information, and the second excludes those who did not supply any country information and did not select their provider from the drop-down menu. Most graduates use the drop-down menus to supply provider and country information; only a small proportion use the free text box.

Table 28 Provider - processing free text responses

|  | 2017/18 In study | 2017/18 Due to start | 2017/18 Total | 2017/18 % | 2018/19 In study | 2018/19 Due to start | 2018/19 Total | 2018/19 % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Provider selected from drop-down menu | 47525 | 13720 | 61245 | 80.1% | 52920 | 14980 | 67900 | 77.9% |
| Free text provider information mapped (studying in the UK) | 395 | 105 | 500 | 0.7% | 590 | 130 | 720 | 0.8% |
| Free text provider information not mapped (studying in the UK) | 7245 | 2075 | 9320 | 12.2% | 9900 | 2200 | 12100 | 13.9% |
| Not studying in the UK | 4560 | 800 | 5355 | 7.0% | 5555 | 880 | 6435 | 7.4% |
| Total | 59725 | 16700 | 76425 | 100.0% | 68965 | 18190 | 87155 | 100.0% |

Table 29 Provider country - processing free text responses

|  | 2017/18 In study | 2017/18 Due to start | 2017/18 Total | 2017/18 % | 2018/19 In study | 2018/19 Due to start | 2018/19 Total | 2018/19 % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Provider selected from drop-down | 47525 | 13720 | 61245 | 71.3% | 52920 | 14980 | 67900 | 68.2% |
| Country selected from drop-down | 19080 | 5430 | 24510 | 28.5% | 25615 | 5975 | 31590 | 31.7% |
| Free text country information mapped | 30 | 15 | 50 | 0.1% | 15 | 20 | 35 | 0.0% |
| Free text country information not mapped | 65 | 30 | 100 | 0.1% | 35 | 15 | 50 | 0.1% |
| Total | 66700 | 19200 | 85905 | 100.0% | 78590 | 20990 | 99580 | 100.0% |

Home country – handling free text

Home country information is collected from graduates who are doing something other than being in some form of employment or further study during the census week.[20] Graduates can either select their home country from a drop-down menu or can provide details in a free text box. In both 2018/19 and 2017/18, 2% of graduates did not provide any home country information. The tables below exclude those graduates who did not provide any home country information.

Table 30 Home country - processing free text responses

|  | 2017/18 Total | 2017/18 % | 2018/19 Total | 2018/19 % |
| --- | --- | --- | --- | --- |
| Country selected from drop-down | 35035 | 99.9% | 42160 | 100.0% |
| Free text country information mapped | 20 | 0.1% | 0 | 0.0% |
| Free text country information not mapped | 20 | 0.1% | 5 | 0.0% |
| Total (excluding Null) | 35070 | 100.0% | 42165 | 100.0% |



[5] HESA has commissioned Oblong as a SIC code supplier in the past, using DLHE data, which was similar in structure to the relevant parts of the Graduate Outcomes data. This longstanding methodology has continued to prove robust.

[8] We undertook an investigation to determine whether an improved methodology could yield more complete SOC code data without reducing consistency. We learned that accurate coding could be achieved for some records that had not previously been coded. As a reminder, our previous methodology required four fields (Company Name; SIC code; Job title; Job Duties Description) to be completed by the graduate for a SOC code to be assigned. However, we have found that where responses of sufficient quality have been provided for the job title and job duties, even where the employer’s name and/or SIC code are missing, we can derive a code satisfactorily.

[11] The Graduate Outcomes survey results coding manual is available here: https://www.hesa.ac.uk/collection/c18072

[12] This data is gathered through various survey questions (dependent on routing) and stored in the fields: EMPPLOC; EMPPCODE; EMPPCODE_UNKNOWN; EMPCOUNTRY; EMPCOUNTRY_OTHER, and; EMPCITY. We also collect parallel data on self-employed graduates, using the fields: BUSEMPPLOC; BUSEMPPCODE; BUSEMPPCODE_UNKNOWN; BUSEMPCOUNTRY; BUSEMPCOUNTRY_OTHER, and; BUSEMPCITY. Results for these fields are similar in proportion to those in employment, though the prevalence of self-employment is much lower, and hence we do not offer a detailed analysis on this much smaller group. Detailed metadata on all these fields can be viewed by following links from the data items index in the Graduate Outcomes survey results record coding manual, here: https://www.hesa.ac.uk/collection/c18072/index

[13] Post-processing, location data can be found in the following derived fields: XMLOCGR; XMLOCN; XMLOCUC; XSTULOCGR; XSTULOCN; XSTULOCUC; XEMPLOCGR; XEMPLOCN; XEMPLOCUC; XBUSLOCGR; XBUSLOCN; XBUSLOCUC, and; XCURRLOC. Details of the processing involved in production is described by following the relevant links available from the derived fields specification contents page in the Graduate Outcomes survey results record coding manual, here: https://www.hesa.ac.uk/collection/c18072/derived/contents

[17] See derived field specification for ZBUSCOUNTRY at: https://www.hesa.ac.uk/collection/c18072/derived/zbuscountry

[18] See derived field specifications XBUSLOCGR and XBUSLOCUC (navigating from https://www.hesa.ac.uk/collection/c18072/derived/contents)

[19] See the specification for free text ‘other’ responses at https://www.hesa.ac.uk/collection/c18072/a/ucname_other. This is returned where a respondent does not locate a suitable option from the list of values at: https://www.hesa.ac.uk/collection/c18072/a/ucname