Data processing

Data capture

All data collected across the three modes (online, telephone and postal) are captured in a single location provided by the same software used to administer the survey. Every day, all completed survey results are transferred to HESA’s internal databases. These are processed overnight, ready for dissemination through the provider portal the following day.

Data captured in the internal databases are also used for quality assurance and output production.

Data quality checking

A series of data quality checks were carried out on the data collected in year one. Most of these checks will also be relevant in future years except those that relate to fixed survey components such as routing and completion logic. The main areas of consideration included:

  • Survey completion logic - assessment of the coding of data fields which informed whether the survey was completed or not. Where coding issues were identified, fixes were implemented in a timely manner.
  • Survey routing - capturing any errors in survey routing, such as questions answered that did not apply given the activities selected, or compulsory questions skipped while graduates were still able to proceed to later sections. For example, a small number of cases (fewer than 50 graduates) were attributed to the “back” button in the survey, which allowed graduates to return to earlier questions and delete answers. This issue was virtually eliminated mid-way through cohort A (17/18 collection) and in the following cohorts, when the button was disabled in the online survey. It is retained only in the telephone survey, to maintain a good interviewer-respondent relationship. Those completing the survey online can contact HESA by email if they wish to request a change to their survey answers or completely reset the record.
  • Free text fields - analysis of the data captured in the free text fields and proposals for future modifications to help improve quality. This included identifying trends in responses from graduates who were unable to select the appropriate response from a drop-down menu and opted to select “other” and complete the free text box.
  • Salary - analysis of salaries returned by graduates, including the percentage of known salaries among graduates paid in UK pounds who were in full-time employment, self-employed or running their own business, and known salaries split by currency. This also included comparison of minimum, maximum, average (mean and median) and missing values against previously published material in DLHE and other national sources.
  • Standard Industrial Classification (SIC) and Standard Occupational Classification (SOC) - analysis of trends found within the free text fields for those graduates for whom SIC or SOC could not be coded. Further information on the SIC/SOC coding process is covered below.
  • Partial responses - analysis of what could constitute a sufficient response in order to be included within published material.
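As an illustration of the salary check listed above, the following minimal sketch computes the summary figures that would be compared against other sources. It assumes salaries arrive as a list in which `None` marks a missing answer; the function name and the output keys are hypothetical and do not describe HESA's actual tooling.

```python
from statistics import mean, median


def summarise_salaries(salaries):
    """Summarise a list of salaries in which None marks a missing answer."""
    known = [s for s in salaries if s is not None]
    summary = {
        "missing": len(salaries) - len(known),
        # Share of respondents with a known salary, as a percentage.
        "percent_known": 100 * len(known) / len(salaries) if salaries else 0.0,
    }
    if known:
        summary.update(
            minimum=min(known),
            maximum=max(known),
            mean=mean(known),
            median=median(known),
        )
    return summary
```

In practice the same summary would be produced separately for each currency and employment basis before any comparison is made.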

SIC/SOC data coding

Where we have received sufficient data (more than one alpha-numeric character in one of the four employment fields) in the employment and/or self-employment sections of the survey, responses are passed on to Oblong, our supplier for coding of Standard Industrial Classifications (SIC) and Standard Occupational Classifications (SOC). Surveys completed in Welsh are first translated and then sent to the coding supplier.
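The sufficiency rule above can be sketched as a simple check. This is an illustration only: the text does not name the four employment fields, so the field names below are assumptions, as is the representation of a response as a dictionary.

```python
# Hypothetical field names: the text refers to "four employment fields"
# without listing them.
EMPLOYMENT_FIELDS = ("company_name", "job_title", "job_duties", "company_description")


def count_alphanumeric(text):
    """Count the alphanumeric characters in a free-text value (None counts as empty)."""
    return sum(ch.isalnum() for ch in (text or ""))


def has_sufficient_data(record, fields=EMPLOYMENT_FIELDS, threshold=1):
    """Return True if any field holds more than `threshold` alphanumeric characters."""
    return any(count_alphanumeric(record.get(name)) > threshold for name in fields)
```

Under these assumptions, a record whose only entry is a single letter or a dash would not be sent for coding, while a two-letter job title such as "IT" would.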

The SOC2020 framework is being adopted for the 18/19 collection.

The fields used for SIC coding are:

  • Company Name
  • Company Town/City
  • Company Postcode
  • Country
  • Company Description
  • Job Title (to help with School/Healthcare classifications)
  • Course Title
  • Self-Employed or Own Business

The fields used for SOC coding are:

  • Company Name
  • SIC Code
  • Job Title
  • Job Duties Description
  • Most Important Activity
  • Self-Employed
  • Own Business
  • Portfolio
  • Qualification Required
  • Course Title
  • For business owners, whether they have employees
  • Whether they Supervise Others
  • Company Description

The Company Description can help in some cases to clarify the SOC code. A combination of the Course Title studied and the Qualification Required question, where appropriate, helps to inform and give confidence to the coding.

Ideally, all of the above variables are needed to obtain the most relevant SOC code for a given record. In some instances, it may be possible to obtain a code even when not all of the information is provided. However, as previously noted, at least one of the four employment fields must be provided as a minimum.

SIC/SOC coding process

Over the years, our supplier has developed self-learning software to deal with the classification and matching of company data. This software has been rewritten and trained to work with HESA data. It utilises fuzzy logic, recognises common typos and uses spelling error algorithms to deal with the free text in the data. The software is underpinned by our supplier’s own database of UK companies and uses machine learning on both SIC and SOC from historic data to improve coding. The supplier also employs a dedicated team of manual coders who check all codes and fill gaps where the software could not apply a code.

Our supplier first loads the data into their systems and pre-processes it, tidying it up, addressing common issues and putting it into the right format ready for further automated processing. Each field has its own set of unique pre-processing tasks, which can range from keyword replacement and keyword removal to character substitution.
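The three kinds of pre-processing task named above can be sketched for a single field as follows. The rules shown are invented for illustration; the supplier's actual rule sets are not described in this document, and the function name is hypothetical.

```python
import re

# Illustrative rules only, not the supplier's real rule sets.
KEYWORD_REPLACEMENTS = {"ltd": "limited", "uni": "university"}
KEYWORDS_TO_REMOVE = {"the"}
CHARACTER_SUBSTITUTIONS = str.maketrans({"&": " and ", "-": " "})


def preprocess_company_name(raw):
    """Apply character substitution, keyword removal and keyword replacement."""
    text = raw.lower().translate(CHARACTER_SUBSTITUTIONS)
    # Keep only ASCII letters, digits and spaces (a simplification for this sketch).
    text = re.sub(r"[^a-z0-9 ]", "", text)
    words = [w for w in text.split() if w not in KEYWORDS_TO_REMOVE]
    words = [KEYWORD_REPLACEMENTS.get(w, w) for w in words]
    return " ".join(words)
```

With these example rules, "The Smith & Jones Ltd." becomes "smith and jones limited". A real pipeline would hold a separate rule set for each field, as the paragraph above notes.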

Next, industry classifications (SIC codes) are automatically added to companies that employ graduates. The manual coding team then complete an initial check of the data and fill the gaps where the system cannot apply a SIC code. The codes are then checked again by a quality control team and amended where necessary.

The data are then automatically SOC coded, and the system uses various methods to apply a SOC code to a record. It looks for keywords in both the job title and job duties fields. The system learns from data that have already been coded (including previous manually SOC coded records), so if it sees a record with similar details to one that was seen before, it can be assigned the same SOC code.
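A minimal sketch of the two ideas in the paragraph above, keyword lookup and reuse of codes from similar records already coded, might look like the following. The codes are placeholders rather than real SOC codes, the keyword table and similarity threshold are invented, and trying history before keywords is an assumption; the supplier's actual system is not specified here.

```python
from difflib import SequenceMatcher

# Placeholder codes, not real SOC codes; the keyword table is invented.
KEYWORD_CODES = {"nurse": "CODE_A", "teacher": "CODE_B"}


def similarity(a, b):
    """Return a 0-1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_code(job_title, job_duties, coded_history, min_similarity=0.85):
    """Suggest a code from history first, then keywords; None means manual coding."""
    # 1. Reuse the code of the most similar previously coded job title.
    best_code, best_score = None, 0.0
    for past_title, past_code in coded_history:
        score = similarity(job_title, past_title)
        if score > best_score:
            best_code, best_score = past_code, score
    if best_score >= min_similarity:
        return best_code
    # 2. Otherwise look for keywords in the job title and job duties.
    words = f"{job_title} {job_duties}".lower().split()
    for keyword, code in KEYWORD_CODES.items():
        if keyword in words:
            return code
    return None
```

Here `coded_history` is a list of `(job_title, code)` pairs from earlier coding. A misspelt title such as "Software Develper" would inherit the code of a previously coded "Software Developer", and a record that matches nothing returns `None` so that it falls through to the manual coders.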

Of the responses collected through telephone interviewing, any uncodable records identified by Oblong are sent back to IFF for a follow-up interview where there is a reasonable case for going back to the respondents.

Supplier-led quality assurance process

The SOC codes are manually reviewed, and the gaps filled where the automated systems could not apply a code. All records are then sent to the SOC quality checking team to be checked before being released back to HESA.

The manual coders are in constant contact with each other and the quality team, and any new/different occupations encountered are discussed with the quality team, who will then research an occupation if necessary, or discuss with HESA or the ONS if required.

For most records, the job title appears in the coding index (the list of job titles in the SOC framework) and the record can be coded directly from it. Where the job title is not in the index, the detail in the job duties is used to ascertain what the job involves and to code accordingly. Due to the international element of the data, jobs which do not appear in the index are also encountered. Coders are adept at assessing the job duties and placing the job with the appropriate code, and this is all subsequently checked by a quality checker. If a coder still cannot code a record, they raise a query with the quality checkers, who will discuss it with other team members, research the role if necessary, and advise on coding. Research is done online using reputable sources (for example the website of the company where the person works, NHS websites, or large well-known job sites, where one can see what qualifications are required and what a job involves). Where appropriate, the documentation which the coders use is subsequently amended for future reference.

Doing this exercise over multiple years, and given the volume of data, allows Oblong to refresh their databases with new jobs that did not exist before. When new jobs are encountered, a decision is made on an appropriate code and this information is disseminated to all coders via their coding indexes for future reference.

A consistency check is completed at the end of each cohort, and for many records a final data consistency/quality review takes place at the end of the collection. This involves consistency checks across employers, job titles and all cohorts to make sure no single cohort within the collection looks different from the rest. By the end of the process, every SOC and SIC code will have been manually checked multiple times.
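One simple form such a consistency check could take is to flag employer and job title combinations that have received more than one SOC code across the collection. The sketch below assumes records are dictionaries with hypothetical `employer`, `job_title` and `soc_code` keys; it illustrates the idea and does not describe the supplier's procedure.

```python
from collections import defaultdict


def find_inconsistent_codes(records):
    """Return {(employer, job_title): sorted codes} for groups coded in more than one way."""
    codes_by_key = defaultdict(set)
    for rec in records:
        # Normalise case and surrounding whitespace so trivial variants group together.
        key = (rec["employer"].strip().lower(), rec["job_title"].strip().lower())
        codes_by_key[key].add(rec["soc_code"])
    return {key: sorted(codes) for key, codes in codes_by_key.items() if len(codes) > 1}
```

A flagged group is not necessarily an error, since the same job title can cover different duties, so the output is best treated as a list for manual review.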

At the end of year one (17/18) data collection, providers had the opportunity to review their data including the draft SOC (occupation) coding and submit feedback to HESA. All of the provider feedback received was individually reviewed and tracked. Outcomes from this review have been published on the website, alongside a description of the review process itself and next steps.

The provider feedback allowed Oblong to correct and improve on coding for year one, and the learning has been fed into the systems to enhance future SOC coding.

With the introduction of SOC2020 for year two, our supplier has taken the opportunity to review all logic and associated reference data (over 50,000 sets of keywords and SOC associations) within the automated SOC coding systems, to ensure that provider feedback has been embedded in the software. The supplier has also refined and added to the guidance documentation used to manually classify the responses. The manual coders have been retrained on the new SOC2020 taxonomy, and continue to be re-briefed on an ongoing basis following changes based on provider feedback.

Oblong also provide a standardised company name, improved business postcode, Companies House registration numbers and employee size information in the final, returned data, in order to aid analysis.

For more information about the coding process for year two (18/19), please visit the operational survey information page.

Usage of salary

Graduate Outcomes collects data on annual salary from all respondents in employment. We have reviewed the case for using these data in the coding process.

Salary is an optional question as respondents can skip it to move onto the next question. As such, there is a level of missing data for this question. Furthermore, salary is one of the sensitive questions in any personal survey and is not likely to yield accurate responses all the time. In the absence of a mechanism to validate this information, using administrative data for example, we are unable to comment on the accuracy of data in this field.

The use of salary data in SOC coding poses unique challenges. Given the geographic, industry and demographic variations in earnings, it is not possible to identify a set of principles that could be applied to the coding of different occupations, consistently for all graduates. For example, there is not a single rule of thumb that dictates the salary of people working in highly skilled roles across all industries and sectors. The case of part time roles and those working under other forms of contractual arrangements makes this task even more difficult. A system that is based on a set of standard principles that are consistently applied across thousands of records using a combination of automated and manual tools cannot accommodate a variable with a high degree of variance.

After careful consideration it has been decided that salary will not be utilised in SOC coding of Graduate Outcomes data. HESA will, however, consider its use in the quality assurance of coded data.

Free text field cleaning

At the end of the collection process, data returned for questions that permit a free-text response go through a cleansing process to improve data quality. This usually applies where the respondent has not chosen a value from the drop-down list provided but has instead selected “other” and typed their own answer. The process also runs for questions seeking the postcode, city/area and country of employment, self-employment or running own business; the country in which the graduate is living; the country of further study; the provider of further study; and salary currency. Where possible, the free text is mapped to an appropriate value in the drop-down menu or to the appropriate country or region.
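The mapping step described above can be sketched with Python's standard `difflib` module. The drop-down values, the alias table and the similarity cutoff are all assumptions chosen for illustration; the actual cleansing rules are not set out in this document.

```python
from difflib import get_close_matches

# Hypothetical drop-down values and aliases.
DROPDOWN_VALUES = ["United Kingdom", "France", "Germany", "United States"]
ALIASES = {"uk": "United Kingdom", "usa": "United States"}


def map_free_text(answer, choices=DROPDOWN_VALUES, cutoff=0.8):
    """Map an "other" free-text answer to a drop-down value, or None if unmatched."""
    # Collapse repeated whitespace and ignore case before comparing.
    cleaned = " ".join(answer.split()).lower()
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    lowered = {choice.lower(): choice for choice in choices}
    matches = get_close_matches(cleaned, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None
```

Answers that return `None` stay as free text, which matches the "where possible" qualification above. A stricter cutoff reduces the risk of mapping an answer to the wrong value at the cost of leaving more answers unmapped.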

Derived fields

Further aggregation of some key fields is carried out to produce standard derived breakdowns used across HESA’s published material. Key areas of derivation include minimum response for inclusion in publication; method of response; activity (including most important activity); location of activity; grouping of standard industrial classification (SIC), standard occupational classification (SOC) and salary; and employment and study undertaken after graduation and prior to survey activity. Details of these derivations will be published within the survey results coding manual.
