Data classification (SIC/SOC)

18/19 SOC coding update

HESA has been exploring the potential use of the newly released SOC2020 coding framework and, working with the Steering Group, has decided to use this framework for the coding of year two (18/19) survey data. We are currently undertaking the planning and implementation work required to make this change, and have delayed the provision of the raw feed of SIC / SOC data until this work is complete.
 
We’re keen to share the lessons learned from year one and how these have helped to inform our finalised approach to the coding of year two data. In the coming months, we will provide a full overview, including our approach to quality assurance and how we intend to engage with the sector.
 
We would like to thank providers for their patience while we develop this approach, and to thank those who’ve engaged with HESA regarding occupation coding and other aspects of the survey during this unprecedented time for the sector.

Outcomes of 17/18 SOC coding assessment

We would like to express our sincere thanks to all providers for their cooperation and patience during the assessment process. We have now provided an overview of the process, the action we’ve taken, and our next steps - this has been sent to all providers for their information. Please click the button below to view this:

View the SOC assessment outcomes

Which occupation groups have been identified in the 17/18 data as having systemic errors?
During the process outlined above, the following occupation groups have been identified as having systemic issues. These groups have been addressed in the data delivery to providers in late March 2020.

Secretary in educational establishment

Campaigner

Investment banking associates

Marketing assistant

Auditors

Accountants

Buying assistant

Influencer

Investment administrator

University researcher

Electronics engineer

Mechanical engineer

Construction engineer

Assurance associate

Event administrator

Research administrator (higher education)

Electrical engineer

Test engineer

Brewer

IT and software engineer

Administrators at HE providers

Gallery assistant

Private secretary

Aircraft engineer

Merger and acquisition analyst

Unqualified teacher

Traders/dealers

Support worker (housing)

When can providers expect delivery of additional SOC codes for 17/18?
Following the delivery of final provider survey data in late March, we received questions regarding a number of SOC records with a code of “00010 - insufficient information to code”. HESA’s coding methodology is underpinned by the need for data quality. As such, we aim to collect the requisite amount of data (all four fields outlined below) and this data must be of sufficient quality to allow correct coding, as determined by our coding partner. A large percentage of the records coded 00010 did not meet these criteria and were therefore not sent for coding, as per our methodology. 

Following an investigation, we delivered a new iteration of the collection results data into the provider portal on 22 May. This delivery included additional SOC coding of a subset of partial responses where we are satisfied that accurate coding can be achieved. For example, this applies where responses of sufficient quality have been provided in job title and job duties, even if the employer’s name and/or duties are missing.
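
As an illustration only, the rule described above could be sketched as follows. This is not HESA's or Oblong's actual logic. It assumes that the four fields are the SOC coding fields listed further down this page (company name, SIC code, job title, job duties), and it uses a hypothetical minimum-length test as a stand-in for "sufficient quality", which in practice is determined by the coding partner.

```python
from dataclasses import dataclass
from typing import Optional

INSUFFICIENT_INFO_CODE = "00010"  # code quoted in the text above


@dataclass
class Response:
    # Assumed to be the four SOC coding fields; names are hypothetical.
    company_name: Optional[str] = None
    sic_code: Optional[str] = None
    job_title: Optional[str] = None
    job_duties: Optional[str] = None


def _present(value: Optional[str], min_chars: int = 3) -> bool:
    # Hypothetical stand-in for a real quality judgement.
    return value is not None and len(value.strip()) >= min_chars


def can_send_for_coding(r: Response, allow_partial: bool = True) -> bool:
    """Return True if the response would be sent for SOC coding."""
    fields = (r.company_name, r.sic_code, r.job_title, r.job_duties)
    if all(_present(v) for v in fields):
        return True
    # Partial responses: job title and job duties alone may be enough.
    return allow_partial and _present(r.job_title) and _present(r.job_duties)
```

With `allow_partial=False` the sketch reproduces the stricter rule under which incomplete records receive the insufficient-information code; with the default it also accepts records that have only a usable job title and job duties.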

Further information on coding processes

Who is completing the data classification coding for Graduate Outcomes?
Oblong is our supplier for the coding of occupations and industries that graduates are working in (known as SIC and SOC coding). They are business data experts, with their main focus being the classification of businesses and database cleaning/enhancement. They have been SIC coding DLHE for the past 6 years, using specially developed coding software, in combination with a highly experienced manual research team.

The classification of Graduate Outcomes is key to allowing analysis and understanding of this large data source, and accuracy and consistency are paramount given the scrutiny and importance of the data. Learn more about Oblong.

What’s the approach to SIC coding?
Over the years, Oblong has developed self-learning software to deal with the classification of company data. This software has been finely tuned to work with HESA data. Graduate Outcomes will use Oblong's Business Data, Unity matching software suite and AutoSIC software to add industry classifications - SIC codes - to companies that employ graduates. The dedicated manual research team quality-checks most of the data and infills the gaps where the system can't add a SIC.

The fields Oblong use for SIC coding are:

  • Company Name
  • Company Town/City
  • Company Postcode
  • Country
  • Company Description
  • Job title (to help with School/Healthcare classifications)
  • Course title
  • JACS level 3 grouping
  • Level of qualification

What’s the approach to SOC coding?
As part of the Graduate Outcomes survey, Oblong has also been contracted to add occupation classifications - SOC codes - to summarise the type of job each graduate undertakes for the company they work for. Oblong will use its new self-learning AutoSOC software to add classifications. This will be followed up by manual quality checks on most of the data and manual infill on those records the system can't classify.

The fields Oblong use for SOC coding are:

  • Company Name
  • SIC code
  • Job title
  • Job Duties Description

They will also take into account whether the company is an NHS organisation, and whether the graduate is self-employed, freelance, running or owning their own business, or supervising staff.

The coding system uses various methods to SOC code a record. It looks for keywords in the job title and job duties fields, and takes into account whether or not the qualification was required, before choosing the SOC. The system learns from data that has been previously coded (including manually SOC coded records), so if it sees a record with similar details to one it has seen before, it can assign the same SOC as last time. Oblong are manually reviewing all of the SOC codes the system produces, and then following this up with a second manual quality check and a final consistency check at the end of each cohort.
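
The two ideas in the paragraph above (keyword matching and reusing codes from previously coded records) can be sketched roughly as follows. This is a simplified illustration, not Oblong's AutoSOC software: the keyword table, the placeholder codes (`SOC-A` and so on) and the class name are all hypothetical, and "similar details" is reduced here to an exact match after normalising case and whitespace.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword table; the values are placeholders, not real SOC codes.
KEYWORD_CODES: Dict[str, str] = {
    "brewer": "SOC-A",
    "auditor": "SOC-B",
}


class SketchCoder:
    """Toy coder: reuse previously coded records, else fall back to keywords."""

    def __init__(self) -> None:
        self.seen: Dict[Tuple[str, str], str] = {}

    @staticmethod
    def _key(job_title: str, job_duties: str) -> Tuple[str, str]:
        # Normalise case and whitespace so trivially different text matches.
        return (" ".join(job_title.lower().split()),
                " ".join(job_duties.lower().split()))

    def learn(self, job_title: str, job_duties: str, code: str) -> None:
        # Store a coded record (for example, one coded manually).
        self.seen[self._key(job_title, job_duties)] = code

    def code(self, job_title: str, job_duties: str) -> Optional[str]:
        key = self._key(job_title, job_duties)
        if key in self.seen:
            return self.seen[key]
        words = set(key[0].split()) | set(key[1].split())
        for keyword, code in KEYWORD_CODES.items():
            if keyword in words:
                return code
        return None  # nothing matched: left for manual infill
```

A record that returns `None` corresponds to the case the system can't classify, which the text says is handled by manual infill; calling `learn` with the manually assigned code means the same details are coded automatically next time.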

What is the ongoing data quality process for SIC / SOC data?
The automated coding system uses various methods to try to SOC code a record using different combinations of all the available data. The system learns from data it has previously coded (including manually SOC coded records). Therefore, when it encounters a record with similar details to one it has seen before, it can assign the same SOC as the last time.

Currently, all SOC codes produced by the system are manually reviewed, and then followed up with a second manual quality check and a final consistency check at the end of each cohort.

A final data quality review will take place at the end of the collection. This will involve consistency checks across all cohorts to make sure no single cohort within the collection looks different to the rest. This is particularly important as most provider queries only started emerging towards the end of cohort C, by which point more than two cohorts had already been coded. The end of collection consistency checks will ensure that any amendments are applied across all four cohorts.
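
One way to picture an end-of-collection consistency check is to compare, for each job title, the most common SOC code in each cohort and flag titles where the cohorts disagree. The sketch below is an assumption about how such a check might look, not a description of the actual review; the function name, the record layout and the example codes are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def inconsistent_titles(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, str]]:
    """Flag job titles whose most common code differs between cohorts.

    records: (cohort, job_title, soc_code) tuples.
    Returns {job_title: {cohort: most_common_code}} for flagged titles.
    """
    per_title = defaultdict(lambda: defaultdict(Counter))
    for cohort, title, code in records:
        per_title[title.strip().lower()][cohort][code] += 1

    flagged = {}
    for title, cohorts in per_title.items():
        modal = {c: counts.most_common(1)[0][0] for c, counts in cohorts.items()}
        if len(set(modal.values())) > 1:  # cohorts disagree on the usual code
            flagged[title] = modal
    return flagged
```

A flagged title would then be a candidate for manual review, with any resulting amendment applied across all cohorts rather than only the one where the query was raised.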

By the end of the process, every SOC and SIC code will have been manually checked at least twice – the thoroughness of the check on each record depends on the confidence the system has in the code allocated.

For telephone interviews, Oblong has an open channel with IFF Research (our contact centre supplier) to discuss data collection improvements to assist with collecting better data for more accurate coding. Of the responses collected through telephone interviewing, any uncodable records identified by Oblong are sent back to IFF for a follow-up interview where there is a reasonable case for going back to the respondents. Over the last few cohorts, the number of uncodable telephone responses has reduced.

What is a systemic error?
The most common example of systemic error is the incorrect coding of an entire professional group. At times, providers may not identify this using data from just one cohort. This may require a review of data across multiple cohorts.

In addition, it’s important to note that any form of coding is affected by subjective judgment. The reason for undertaking consistency checks is to reduce this effect and make the results as objective as possible. To this end, we advise that any judgment about systemic errors should be based entirely on data provided by the respondent. Subjective opinion and/or contextual, extraneous information will not be considered when reviewing the case for systemic issues.