
Data processing

On this page: Data capture | Data quality checking | SIC/SOC data coding | Quality assurance process | Free text field cleaning | Derived fields

Data capture

All data collected across the three modes (online, telephone and postal) are captured in a single location provided by the same software used to administer the survey. Every day, all completed survey results are transferred to HESA’s internal databases. These are processed overnight, ready for dissemination through the provider portal the following day.

Data captured in the internal databases are also used for quality assurance and output production.

Data quality checking

A separate Data quality report is published on the website, detailing a current assessment of the strengths and weaknesses of the Graduate Outcomes data as well as providing information on any known quality issues. It also forms part of an advanced user’s guide to further information HESA has published on Graduate Outcomes, signposting technical specifications, papers, and reports of interest to analysts.

SIC/SOC data coding

Where we have received sufficient data (more than one alpha-numeric character in one of the four employment fields) in the employment and/or self-employment sections of the survey, responses are passed on to Oblong, our supplier for coding of Standard Industrial Classifications (SIC) and Standard Occupational Classifications (SOC). Surveys completed in Welsh are first translated and then sent to the coding supplier.
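
As an illustration only, the sufficiency rule described above could be sketched in Python as follows. The four field names are hypothetical, since this page does not name the employment fields, and the function is not the production check.

```python
# Hypothetical names for the four employment fields (not taken from the survey specification).
EMPLOYMENT_FIELDS = ("job_title", "job_duties", "company_name", "company_description")


def has_sufficient_data(response: dict) -> bool:
    """Return True if any employment field holds more than one alphanumeric character."""
    for field in EMPLOYMENT_FIELDS:
        value = response.get(field) or ""  # treat missing or None as empty
        if sum(ch.isalnum() for ch in value) > 1:
            return True
    return False
```

Under this sketch, a response whose only employment entry is a single letter or a punctuation mark would not be passed on for coding.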

The SOC2020 framework is being adopted for the 18/19 collection.

The fields used for SIC coding are:

Company Name
Company Town/City
Company Postcode
Country
Company Description
Job Title (to help with School/Healthcare classifications)
Course Title
Self-Employed or Own Business

The fields used for SOC coding are:

Company Name
SIC Code
Job Title
Job Duties Description
Most Important Activity
Self-Employed
Own Business
Portfolio
Qualification Required
Course Title
For business owners, whether they have employees
Whether they Supervise Others
Company Description

The Company Description can help in some cases to clarify the SOC code. A combination of the Course Title studied and the Qualification Required question, where appropriate, helps to inform and give confidence to the coding.

Ideally, all of the above variables are needed to obtain the most relevant SOC code for a given record. In some instances, it may be possible to obtain a code even when all the information is not provided. However, as previously noted, at least one of the four employment fields must be provided as a minimum.

Over the years, our supplier has developed self-learning software to deal with the classification and matching of company data. This software has been re-written and trained to work with HESA data. It utilises fuzzy logic, recognises common typos and uses spelling error algorithms to deal with the free text in the data. The software is underpinned by our supplier’s own database of UK companies and uses machine learning on both SIC and SOC from historic data to improve coding. The supplier also employs a dedicated team of manual coders who check all codes and fill gaps where the software could not apply a code.
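
The supplier's software is proprietary and its matching method is not documented here. As a rough sketch of the general idea of tolerant matching against a company list, the example below uses the similarity matching in Python's standard `difflib` module. The company list, the `match_company` function and the cutoff value are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference list standing in for the supplier's database of UK companies.
KNOWN_COMPANIES = ["Tesco", "Barclays", "Rolls-Royce", "National Health Service"]


def match_company(raw_name, cutoff=0.8):
    """Return the closest known company name, or None if nothing is similar enough."""
    lookup = {name.lower(): name for name in KNOWN_COMPANIES}
    candidates = get_close_matches(
        raw_name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[candidates[0]] if candidates else None
```

A misspelling such as "Barclys" would resolve to "Barclays" in this sketch, while a name with no close counterpart returns `None` and would be left for manual coding.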

Our supplier first loads the data into their systems and pre-processes it, tidying it up, addressing common issues and putting it into the right format ready for further automated processing. Each field has its own set of unique pre-processing tasks, which can include keyword replacement, keyword removal and character substitution.
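
The three kinds of pre-processing task named above can be sketched as follows. The specific rules (which characters are substituted, which keywords are replaced or removed) are invented for illustration and are not the supplier's actual rule sets.

```python
import re

# Illustrative rules only; the real per-field rule sets are not published on this page.
CHAR_SUBSTITUTIONS = str.maketrans({"&": " and ", "\u2019": "'"})
KEYWORD_REPLACEMENTS = {"ltd": "limited", "uni": "university"}
KEYWORDS_TO_REMOVE = {"the"}


def preprocess(text):
    """Apply character substitution, keyword removal and keyword replacement."""
    text = text.lower().translate(CHAR_SUBSTITUTIONS)  # character substitution
    tokens = re.findall(r"[a-z0-9']+", text)           # split into simple word tokens
    tokens = [
        KEYWORD_REPLACEMENTS.get(token, token)         # keyword replacement
        for token in tokens
        if token not in KEYWORDS_TO_REMOVE             # keyword removal
    ]
    return " ".join(tokens)
```

In practice each field would have its own rule set; a single shared set is used here only to keep the sketch short.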

Next, industry classifications (SIC codes) are automatically added to companies that employ graduates. The manual coding team then complete an initial check of the data and fill the gaps where the system cannot apply a SIC code. The codes are then checked again by a quality control team and amended where necessary.

The data are then automatically SOC coded, and the system uses various methods to apply a SOC code to a record. It looks for keywords in both the job title and job duties fields. The system learns from data that have already been coded (including previous manually SOC coded records), so if it sees a record with similar details to one that was seen before, it can be assigned the same SOC code.
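
A minimal sketch of the two mechanisms described above, keyword lookup and reuse of codes from previously coded records, is given below. It is a simplification: "similar details" is reduced to an exact match after normalising case and spacing, and the class name, keyword rules and codes are placeholders rather than real SOC codes or the supplier's logic.

```python
def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


class SocCoder:
    def __init__(self, keyword_rules):
        self.keyword_rules = keyword_rules  # keyword -> placeholder code
        self.seen = {}                      # (job title, job duties) -> code already assigned

    def learn(self, job_title, job_duties, code):
        """Remember a coded record (for example one coded manually)."""
        self.seen[(normalise(job_title), normalise(job_duties))] = code

    def code(self, job_title, job_duties):
        key = (normalise(job_title), normalise(job_duties))
        if key in self.seen:                # same details as an earlier record
            return self.seen[key]
        words = set(key[0].split()) | set(key[1].split())
        for keyword, code in self.keyword_rules.items():
            if keyword in words:            # keyword found in title or duties
                return code
        return None                         # no code applied: left for manual coders
```

A record that returns `None` corresponds to a gap that the manual coding team would fill, after which `learn` could be called so that later records with the same details are coded automatically.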

Of the responses collected through telephone interviewing, any uncodable records identified by Oblong are sent back to IFF for a follow-up interview where there is a reasonable case for going back to the respondents.

Quality assurance process

The SOC codes generated by the automated process are manually reviewed, and the gaps filled where the automated systems could not apply a code. All records are then sent to the SOC quality checking team to be checked before being released back to HESA.

The manual coders are in constant contact with each other and the quality team, and any new/different occupations encountered are discussed with the quality team, who will then research an occupation if necessary, or discuss with HESA or the ONS if required.

For most of the job titles, the coding index (list of job titles in the SOC framework) contains the job title and records can be coded from it. Where the job title is not in the indexes, the detail in the job duties is used to ascertain what the job involves and to code accordingly. Due to the international element of the data, jobs which do not appear in the indexes are also encountered. Coders are adept at assessing the job duties and placing the job with the appropriate code, and this is all subsequently checked by a quality checker. If a coder still cannot code a record, they raise a query with the quality checkers, who will discuss with other team members, research the role if necessary, and advise on coding. Research is done online using reputable sources (for example the website of the company where the person works, NHS websites, and large well-known job sites, where one can see what qualifications are required and what a job involves). Where appropriate, the documentation which the coders use is subsequently amended for future reference.

Carrying out this exercise over multiple years, with the volume of data involved, allows Oblong to refresh their databases with new jobs that did not exist before. When new jobs are encountered, a decision is made on an appropriate code and this information is disseminated to all coders via their coding indexes for future reference.

A consistency check is completed at the end of each cohort and, for many records, a final data consistency/quality review takes place at the end of the collection. This involves consistency checks across employers, job titles and all cohorts to make sure no single cohort within the collection looks different to the rest. By the end of the process, every SOC and SIC code will have been manually checked multiple times.
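
One simple form such a consistency check could take is sketched below: records are grouped by employer and job title, and any group that has received more than one SOC code is flagged for review. The record layout and the function name are assumptions; the actual checks are not specified on this page.

```python
from collections import defaultdict


def find_inconsistent_codes(records):
    """Return (employer, job title) groups that carry more than one SOC code."""
    codes_by_key = defaultdict(set)
    for rec in records:
        key = (rec["employer"].strip().lower(), rec["job_title"].strip().lower())
        codes_by_key[key].add(rec["soc"])
    # Keep only the groups where the same employer and job title received different codes.
    return {key: sorted(codes) for key, codes in codes_by_key.items() if len(codes) > 1}
```

Running this over records from all cohorts together would surface cases where one cohort has been coded differently from the rest for the same employer and job title.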

With the introduction of SOC2020 for year two, our supplier has taken the opportunity to review all logic and associated reference data (over 50,000 sets of keywords and SOC associations) within the SOC coding automated systems, to ensure that provider feedback has been embedded in the software, and has also refined and added to the guidance documentation used to manually classify the responses. The manual coders have been retrained on the new SOC2020 taxonomy, and continue to be re-briefed on an ongoing basis following changes based on provider feedback.

Oblong also provide a standardised company name, improved business postcode, Companies House registration numbers and employee size information in the final, returned data, in order to aid analysis.

Following receipt of coded data, HESA undertook an extensive data quality assurance exercise this year. This was governed by a 3-stage quality assurance strategy, recommended by HESA and approved by the steering group. The 3 stages were:

  1. Identification of systemic coding issues using feedback from providers
  2. Identifying non-random anomalies in the distribution of coded data
  3. Independent verification of coded data by an external organisation

All three stages of quality assurance were successfully completed in Spring 2021 and their outcomes have been made available to users and other members of the public.

Free text field cleaning

At the end of the collection process, data returned for questions that permit a free-text response go through a cleansing process, in order to improve data quality. This is usually where the respondent has not chosen a value from the drop-down list provided but has instead selected “other” and typed their own answer. This process also runs for questions seeking postcode, city/area and country of employment, or self-employment / running own business; country in which the graduate is living and country of further study; provider of further study; and salary currency. Where possible, the free text is mapped to an appropriate value in the drop-down menu or to the appropriate country or region.
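
The mapping step can be pictured with the following sketch, which tries an exact match against the drop-down values first and then falls back to a table of known alternative spellings. The drop-down values, the alias table and the function name are hypothetical examples, not the lists used in the survey.

```python
# Hypothetical drop-down values and aliases, for illustration only.
DROPDOWN_VALUES = ["United Kingdom", "France", "Germany"]
ALIASES = {
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
    "deutschland": "Germany",
}


def map_free_text(answer):
    """Map a typed answer to a drop-down value, or return None if no mapping is known."""
    key = " ".join(answer.lower().split())  # normalise case and spacing
    for value in DROPDOWN_VALUES:
        if key == value.lower():
            return value
    return ALIASES.get(key)
```

Answers that return `None` stay as free text, reflecting the "where possible" qualification above.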

For year two and retrospectively for year one, a second round of more in-depth cleaning has been carried out on the free text questions relating to city/area of employment and self-employment / running own business. This involved looking for common area names contained within the free text information; free text information contained in common area names; accounting for spelling mistakes; cross-checking area names with other graduates providing the same free text information in addition to a valid postcode; comparing free text relating to both employment and self-employment / running own business supplied by the same graduate. This second round of cleaning was used as an enhancement of the derived fields process.
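
Two of the techniques listed above are sketched below under stated assumptions: matching on containment between the free text and a list of area names, and borrowing an area from other graduates who typed the same free text and also supplied a valid postcode. The area list, the postcode lookup and both function names are illustrative, and the minimum-length and single-match rules are design choices of this sketch rather than documented HESA rules.

```python
from collections import Counter

# Hypothetical list of common area names.
KNOWN_AREAS = ["Manchester", "Leeds", "Cardiff"]


def area_from_text(free_text, min_len=4):
    """Match when an area name is contained in the text, or the text in an area name."""
    text = " ".join(free_text.lower().split())
    if len(text) < min_len:  # very short strings match too easily
        return None
    matches = [a for a in KNOWN_AREAS if a.lower() in text or text in a.lower()]
    return matches[0] if len(matches) == 1 else None  # ambiguous text is left alone


def area_from_peers(free_text, peer_records, postcode_to_area):
    """Use the most common area among peers who gave the same text plus a known postcode.

    peer_records is an iterable of (free_text, postcode) pairs from other graduates.
    """
    key = " ".join(free_text.lower().split())
    votes = Counter(
        postcode_to_area[postcode]
        for text, postcode in peer_records
        if " ".join(text.lower().split()) == key and postcode in postcode_to_area
    )
    return votes.most_common(1)[0][0] if votes else None
```

Returning `None` for ambiguous or unmatched text keeps the sketch conservative: a value is only assigned when a single area is supported.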

Derived fields

Further aggregation of some key fields is carried out to produce standard derived breakdowns used across HESA’s published material. Key areas of derivation include: minimum response for inclusion in publication; method of response; activity (including most important activity); location of activity; grouping of standard industrial classification (SIC), standard occupational classification (SOC) and salary; and employment and study undertaken after graduation and prior to survey activity. Details of these derivations will be published within the survey results coding manual.
