Data processing

Data capture

All data collected across the three modes (online, telephone and postal) are captured in a single location provided by the same software used to administer the survey. Every day, all completed survey results are transferred to HESA’s internal databases. These are processed overnight, ready for dissemination through the provider portal the following day.

Data captured in the internal databases are also used for quality assurance and output production.

Data quality checking

A series of data quality checks was carried out on the data collected in year one. Most of these checks will remain relevant in future years, except those that relate to fixed survey components such as routing and completion logic. The main areas of consideration were:

  • Survey completion logic - assessment of the coding of data fields which informed whether the survey was completed or not. Where coding issues were identified, fixes were implemented in a timely manner.
  • Survey routing - capturing any errors in survey routing, such as questions being answered that did not apply to the activities selected, or compulsory questions being skipped while graduates were still allowed to proceed to later sections. For example, a small number of cases (fewer than 50 graduates) were attributed to the “back” button in the online survey, which allowed graduates to return to earlier questions and delete answers. This issue was virtually eliminated when the feature was disabled in the online survey mid-way through cohort A (17/18 collection), and it remained disabled for the following cohorts. The “back” button is retained only in the telephone survey, to maintain a good interviewer-respondent relationship. Those completing the survey online can contact HESA by email if they wish to request a change to their survey answers or to reset their record completely.
  • Free text fields - analysis of the data captured in the free text fields and proposals for future modifications to help improve quality. This included identifying trends in responses from graduates who were unable to select the appropriate response from a drop-down menu and opted to select “other” and complete the free text box.
  • Salary - analysis of salaries returned by graduates, including the percentage of known salaries among graduates paid in UK pounds who were in full-time employment, self-employed or running their own business, and known salaries split by currency. This also included comparison of minimum, maximum, average (mean and median) and missing values against previously published material from DLHE and other national sources.
  • Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) - analysis of trends found within the free text fields for those graduates for whom SIC or SOC could not be coded. Further information on the SIC/SOC coding process is covered below.
  • Partial responses - analysis of what could constitute a sufficient response in order to be included within published material.
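
As an illustration of the salary checks listed above, the following is a minimal sketch of how minimum, maximum, mean, median and missing values might be summarised before comparison with other sources. The function name `summarise_salaries` and the example figures are invented for illustration and do not describe HESA's actual tooling or data.

```python
from statistics import mean, median
from typing import Optional


def summarise_salaries(salaries: list[Optional[float]]) -> dict:
    """Summarise a list of salaries, where None marks a missing value."""
    known = [s for s in salaries if s is not None]
    summary = {
        "count": len(salaries),
        "missing": len(salaries) - len(known),
        # Percentage of records with a known salary.
        "pct_known": 100 * len(known) / len(salaries) if salaries else 0.0,
    }
    if known:
        summary.update(
            minimum=min(known),
            maximum=max(known),
            mean=mean(known),
            median=median(known),
        )
    return summary


# Invented example values, for illustration only.
example = summarise_salaries([21000, 25000, None, 30000, None])
print(example)
```

A summary of this kind could then be set side by side with the equivalent figures from earlier publications to highlight unexpected differences.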

SIC/SOC data coding

Where we have received sufficient data in the employment and/or self-employment sections of the questionnaire, responses are passed on to our supplier for Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) coding.

Over the years, our supplier has developed self-learning software to deal with the classification of company data. This software has been trained to work with HESA data. Graduate Outcomes uses the supplier's specialised software suite to add industry classifications (SIC codes) to companies that employ graduates. The supplier's dedicated manual research team quality checks most of the data and fills the gaps where the system cannot apply a SIC code.

Surveys completed in Welsh are first translated and then sent to the coding supplier following the above process.

The fields used for SIC coding are:

  • Company Name
  • Company Town/City
  • Company Postcode
  • Country
  • Company Description
  • Job title (to help with School/Healthcare classifications)
  • Course title
  • JACS level 3 grouping
  • Level of qualification

The fields used for SOC coding are:

  • Company Name
  • SIC code
  • Job title
  • Job Duties Description

Our supplier also takes into account whether the company is an NHS organisation and whether the graduate is self-employed, freelance, running their own business, supervising staff, or the owner of the business.

The coding system uses various methods to SOC code a record. It looks for keywords in the job title and job duties field. The system learns from data that has been previously coded (including manual SOC coded records), so if it sees a record with similar details to one that was seen before, it can be assigned the same SOC as last time.
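
The general idea described above (reuse codes from previously coded records, otherwise fall back to keywords) can be sketched as follows. This is a simplified illustration and not the supplier's software: the class `SocCoder`, the keyword rules and the placeholder codes such as "SOC-A" are all hypothetical, and "similar details" is reduced here to an exact match on normalised job title and duties.

```python
from typing import Optional

# Hypothetical keyword rules. The codes are placeholders, not real SOC codes.
KEYWORD_RULES = {"nurse": "SOC-A", "teacher": "SOC-B"}


def normalise(text: str) -> str:
    """Lower-case the text and collapse repeated whitespace."""
    return " ".join(text.lower().split())


class SocCoder:
    def __init__(self) -> None:
        # Previously coded records: (job title, duties) -> SOC code.
        self.history: dict[tuple[str, str], str] = {}

    def learn(self, job_title: str, duties: str, soc: str) -> None:
        """Store a coded record (for example, one coded manually)."""
        self.history[(normalise(job_title), normalise(duties))] = soc

    def suggest(self, job_title: str, duties: str) -> Optional[str]:
        """Suggest a code, or return None to signal manual coding is needed."""
        key = (normalise(job_title), normalise(duties))
        if key in self.history:
            # Same details as a record seen before: reuse its code.
            return self.history[key]
        words = set(f"{key[0]} {key[1]}".split())
        for keyword, soc in KEYWORD_RULES.items():
            if keyword in words:
                return soc
        return None


coder = SocCoder()
coder.learn("Data Analyst", "Builds reports", "SOC-C")
print(coder.suggest("data  analyst", "builds reports"))  # reuses the learned code
print(coder.suggest("Staff Nurse", "Ward care"))         # matched by keyword
print(coder.suggest("Astronaut", "Flies spacecraft"))    # no suggestion
```

In this sketch a result of `None` stands for a record that the system could not code and that would need manual research.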

Currently, all SOC codes produced by the system are manually reviewed, and then followed up with a second manual quality check and a final consistency check at the end of each cohort. A final data quality review takes place at the end of the collection. This involves consistency checks across all cohorts to make sure no single cohort within the collection looks different to the rest.

By the end of the process, every SOC and SIC code will have been manually checked at least twice. Find out more about Graduate Outcomes SIC and SOC coding on our website.

At the end of year one data collection, providers had the opportunity to review their data including the draft SOC (occupation) coding and submit feedback to HESA. All of the provider feedback received was individually reviewed and tracked. Outcomes from this review have been published on the website, alongside a description of the review process itself and next steps.

Free text field cleaning

At the end of the collection process, data returned for questions that permit a free-text response go through a cleansing process to improve data quality. This usually applies where the respondent has not chosen a value from the drop-down list provided but has instead selected “other” and typed their own answer. The process also runs for questions seeking the postcode, city/area and country of employment, self-employment or running own business; the country in which the graduate is living; the country of further study; the provider of further study; and the salary currency. Where possible, the free text is mapped to an appropriate value in the drop-down menu or to the appropriate country or region.
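
One way to picture this mapping step is the sketch below, which matches a typed answer to the closest drop-down option using Python's standard `difflib` module. It is an assumption-laden illustration rather than HESA's cleansing process: the function `map_free_text`, the option list and the 0.8 similarity cutoff are all hypothetical choices.

```python
from difflib import get_close_matches
from typing import Optional

# Illustrative drop-down options, not the survey's actual list.
COUNTRIES = ["United Kingdom", "France", "Germany", "Ireland"]


def map_free_text(answer: str, options: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a free-text answer to a drop-down option, or None if no close match."""
    cleaned = " ".join(answer.split()).casefold()
    lookup = {option.casefold(): option for option in options}
    if cleaned in lookup:
        # Exact match once case and spacing are ignored.
        return lookup[cleaned]
    # Otherwise accept the single closest option above the similarity cutoff.
    matches = get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


print(map_free_text("  france ", COUNTRIES))
print(map_free_text("Germny", COUNTRIES))
print(map_free_text("xyz", COUNTRIES))
```

Answers that return `None` in a sketch like this would be left unmapped, which corresponds to the "where possible" qualification above.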

Derived fields

Further aggregation of some key fields is carried out to produce standard derived breakdowns used across HESA’s published material. Key areas of derivation include minimum response for inclusion in publication; method of response; activity (including most important activity); location of activity; grouping of Standard Industrial Classification (SIC), Standard Occupational Classification (SOC) and salary; and employment and study undertaken after graduation and prior to survey activity. Details of these derivations will be published within the survey results coding manual.
