Student data quality report

1. Introduction

Principle Q3 of the Code of Practice for Statistics states that producers of statistics and data should explain clearly how they assure themselves that statistics and data are accurate, reliable, coherent and timely.

Read the full Code of Practice

Each Official and National Statistics output prepared by Jisc under the HESA brand contains key quality information in respect of the specific content of the output. This information is provided in the notes, definitions or 'about' link underneath each table or chart.

In addition, Jisc publishes summary quality reports for each of the main sources of data used to compile Official and National Statistics, to provide key qualitative information on the various dimensions of quality and the methods used to produce outputs.

Official and National Statistics products which use HESA student data include ‘Higher Education Student Statistics’, ‘Higher Education Student Data’ and ‘HE Graduate Outcomes data’. Other products include historic editions of UK Performance Indicators.

2. Summary of quality

2.1 Relevance

The degree to which the statistical product meets user needs for both coverage and content.

Higher education (HE) student data is collected by Jisc under the HESA brand at an individualised level on a census basis, covering HE students at:

  • HE providers in England registered with the Office for Students (OfS) in the Approved (fee cap) or Approved categories;
  • Publicly funded HE providers in Wales, Scotland and Northern Ireland; and
  • Further education (FE) colleges in Wales.

Coverage of the individualised student collection extends to all students who are undertaking a course or programme which leads to the award of a qualification or institutional credit. Excluded from coverage are students studying overseas for the entire duration of their course even when they are formally registered at a UK-based HE provider. Students studying overseas by distance learning are similarly excluded unless they are funded by a UK HE funding body. The HE providers at which each student is registered are responsible for submitting the data about that student to HESA, drawing on information held within their internal administrative systems.

Data on the characteristics of students studying overseas for the entire duration of their course is collected within a separate aggregate data collection known as the ‘Aggregate Offshore Record’. This includes students who are either registered with a UK-based HE provider or who are studying for a qualification awarded by a UK HE provider. Students studying overseas through distance learning arrangements are also included. Summary data drawn from this aggregate collection is included alongside data drawn from the individualised student collection within the HESA-branded student statistical products.

HESA student data is used by many types of user for a range of purposes. UK HE funding and regulatory bodies, government education departments and the Devolved Administrations (‘statutory users’) use the information to support the allocation of state funding and the regulation of higher education providers, and in the formulation and monitoring of higher education policy. HE providers use it for strategic organisational planning and for benchmarking their performance and characteristics against other HE providers. Prospective students and their advisors use the information to inform their choices of provider and course, often through intermediary publications such as university guides and league tables. The media use the information to support reports and articles on UK higher education. Private companies use it for monitoring and targeting their graduate recruitment efforts. Academic researchers investigating many different aspects of higher education draw upon the information extensively. International users studying the UK HE system or monitoring the success of students from their country studying in the UK also use the information.

Jisc ensures that its HESA student data collections and products derived from them remain relevant to users in a number of ways. Regular meetings with statutory users provide the opportunity to ensure that their needs are met. Engagement with users within HE providers is undertaken via a number of groups and discussion mechanisms such as the ‘Provider Forum’. All HESA records undergo a major review cycle every few years that brings together representatives from the key user communities to ensure that the information collected remains appropriate for their needs. Other user groups exist in relation to specific HESA-branded information products and surveys of users are undertaken periodically. Feedback received through all of these channels and others helps to shape the information collected and the content of products derived from the information. In this way the needs of user communities are continuously monitored and, where appropriate and practicable, acted upon.

HESA student data serves multiple purposes and must be aggregated and refined in order to generate meaningful statistics. This includes the derivation of ‘populations’ of students by applying restriction clauses that exclude any records which are inappropriate to a particular output. For example, statistics on qualifications obtained should include only those students who were awarded a qualification in the reporting year in question. Details of all population definitions and aggregation techniques used are explained within the definitions section of each HESA-branded output.
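As an illustration of the idea of a population built from restriction clauses, the following is a minimal sketch only. The record fields, the clause names and the sample records are hypothetical and are not taken from any HESA specification.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class StudentRecord:
    # Hypothetical fields, for illustration only.
    student_id: str
    reporting_year: str
    qualification_awarded: Optional[str]  # None when no award was made in the year


# A restriction clause is a predicate: True keeps the record in the population.
Restriction = Callable[[StudentRecord], bool]


def derive_population(
    records: Iterable[StudentRecord], restrictions: List[Restriction]
) -> List[StudentRecord]:
    """Keep only the records that satisfy every restriction clause."""
    return [r for r in records if all(clause(r) for clause in restrictions)]


def in_year(year: str) -> Restriction:
    """Clause: the record belongs to the given reporting year."""
    return lambda r: r.reporting_year == year


def has_award(r: StudentRecord) -> bool:
    """Clause: a qualification was awarded."""
    return r.qualification_awarded is not None


records = [
    StudentRecord("A1", "2021/22", "First degree"),
    StudentRecord("A2", "2021/22", None),
    StudentRecord("A3", "2020/21", "Masters"),
]

# Population for statistics on qualifications obtained in 2021/22.
qualifiers = derive_population(records, [in_year("2021/22"), has_award])
print([r.student_id for r in qualifiers])
```

Expressing each clause as a separate predicate means the same records can feed several outputs, each with its own combination of restrictions.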

2.2 Accuracy

The closeness between an estimated result and the (unknown) true value.

Because the collection is a census rather than a survey, no estimates are produced from HESA student data and sampling error is not relevant. Instead, characteristics of the population in question (students) can be measured directly and comprehensively.

For census collections, the extent of missing records or data items is of more relevance to accuracy. Missing records are not considered to impact materially on the student data collected by Jisc. Data is collected annually using a ‘muster’ approach, so that no individual record may disappear from one year to the next without reaching an expected ‘end-state’ (such as leaving the HE provider or gaining the intended qualification) or an explanation from the provider as to why the record is missing. In addition, counts of students are compared annually with returns made to funding bodies in respect of state funding allocations, and any discrepancies are investigated. Student data returns are also subject to audit processes from time to time. Where Jisc becomes aware that records are missing for any given HE provider, this is noted within the relevant product.
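The logic of a ‘muster’ check can be sketched as follows. This is an illustrative sketch under stated assumptions, not the collection system's actual rule: the identifiers, the end-state labels and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Set


def unexplained_missing(
    previous_year: Dict[str, Optional[str]],
    current_year_ids: Set[str],
    explained_ids: Set[str],
) -> List[str]:
    """Return identifiers that vanished between years without a reason.

    previous_year maps each student identifier to the end-state reached
    in that year, or None if the student was still active.
    """
    return sorted(
        student_id
        for student_id, end_state in previous_year.items()
        if end_state is None                    # no end-state was reached
        and student_id not in current_year_ids  # and the record did not return
        and student_id not in explained_ids     # and the provider gave no explanation
    )


previous = {"S1": None, "S2": "qualified", "S3": None, "S4": None, "S5": "left"}
current = {"S1", "S6"}  # S6 is a new entrant
explained = {"S4"}      # the provider explained this absence

missing = unexplained_missing(previous, current, explained)
print(missing)
```

In this example only S3 is flagged: S1 reappears, S2 and S5 reached an end-state, and S4 has an explanation.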

In respect of missing data items, the majority of data items are collected for all students but some are restricted to students of a particular type or geographical location. For example, data on the ethnicity of students is only collected for students of UK domicile. In such cases this is clearly explained in the definitions provided with each product. Some data items, such as ethnicity, may include categories for ‘unknown’ or ‘information refused’. In such cases the levels of unknown values are shown in statistical tables and caveats may be included which explain that the statistics may not be representative of the population. The level of unknown entries within data items is routinely monitored during the data collection process. Any HE provider recording abnormally high levels of unknown values in key data items is strongly encouraged to reduce this level over time.
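Monitoring the level of unknown entries amounts to computing, for each provider, the share of values that fall into the ‘unknown’ or ‘information refused’ categories. The sketch below assumes a hypothetical threshold and hypothetical category labels; the source does not state what level counts as abnormally high.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical labels for the non-informative categories.
UNKNOWN_CODES = {"unknown", "information refused"}


def unknown_rate(values: List[str]) -> float:
    """Proportion of values recorded as unknown or refused."""
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(counts[code] for code in UNKNOWN_CODES) / len(values)


def providers_to_follow_up(
    values_by_provider: Dict[str, List[str]], threshold: float
) -> List[str]:
    """Providers whose unknown rate exceeds the (assumed) threshold."""
    return sorted(
        provider
        for provider, values in values_by_provider.items()
        if unknown_rate(values) > threshold
    )


# Invented data: the category labels are placeholders.
ethnicity_by_provider = {
    "Provider A": ["group 1", "group 2", "unknown", "group 3"],
    "Provider B": ["group 1"] * 9 + ["information refused"],
}

flagged = providers_to_follow_up(ethnicity_by_provider, threshold=0.2)
print(flagged)
```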

An overview of the quality checks that are made throughout the collection process is available in the support guide. These involve automated quality rules, credibility checks, the raising and closing of data quality issues by Jisc and some statutory users, and a final formal sign-off, from the head of the submitting organisation, verifying the data. Detailed information on quality rules is available in the coding manual. Information gathered from the quality assurance process is made available in data intelligence notes accompanying the main outputs.

Derived fields are generated from the raw data collected by Jisc as part of the data ingestion process. These are aggregations and derivations of the data that are then used in our publications and in onward analysis by statutory users. An example of such a field is the derivation of student age from the raw date of birth field collected from HE providers. These derived fields form an important part of the quality assurance process, and ensure quality assurance is undertaken in a way that reflects how data fields will be presented in the outputs.
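A derived age field can be sketched as below. The reference date used here is an assumption made for illustration; the actual date at which age is calculated is defined in the relevant derived field specification, not in this report.

```python
from datetime import date


def age_on(date_of_birth: date, reference: date) -> int:
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference.year - date_of_birth.year - (0 if had_birthday else 1)


# Assumed reference date, chosen only for this example.
reference_date = date(2021, 8, 31)

print(age_on(date(2003, 8, 31), reference_date))  # birthday falls on the reference date
print(age_on(date(2003, 9, 1), reference_date))   # birthday falls one day later
```

The two calls return 18 and 17 respectively, showing how a one-day difference in date of birth changes the derived value at the boundary.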

2.3 Timeliness and punctuality

Timeliness refers to the lapse of time between publication and the period to which the data refer. Punctuality refers to the time lag between the actual and planned dates of publication.

The first release of information from the HESA student data collections for any given academic year usually occurs in the January following the end of that academic year. For example, data for the 2021/22 academic year was first published in January 2023. First release occurs within the Statistical Bulletin entitled ‘Higher Education Student Statistics’. This is a National Statistics publication and is freely available from the HESA website and the National Statistics Publication Hub. A further, more detailed open data release drawing on these data, entitled ‘Higher Education Student Data’, is then published in February. This is an Official Statistics release and is available free of charge from the HESA website.

The time delay between the end of the academic year and first publication of statistics in relation to that year reflects the time required to collect, process and quality-assure the data and to prepare the statistical release itself. HESA student data is collected annually as a retrospective activity in the autumn following the academic year to which it relates. For example, the 2021/22 HESA Student data collection system opened in March 2022 with a return date of 15 September 2022. A data quality checking period ran until 1 November, with sign-off of the record occurring on 7 November. Final processing of data and delivery to statutory users covered the period to 24 November 2022. At this point preparation of the January Statistical Bulletin commenced, culminating in publication on 19 January 2023.
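The 2021/22 timetable described above can be turned into a simple worked calculation of timeliness. The sketch assumes that the academic year ends on 31 July 2022 and that the November dates fall in 2022; the punctuality helper is included for completeness and is shown with no real planned date, since none is given here.

```python
from datetime import date

# Assumption: the 2021/22 academic year ends on 31 July 2022.
academic_year_end = date(2022, 7, 31)

# Milestones quoted in the text for the 2021/22 collection.
milestones = [
    ("Return date", date(2022, 9, 15)),
    ("End of data quality checking", date(2022, 11, 1)),
    ("Sign-off", date(2022, 11, 7)),
    ("Delivery to statutory users complete", date(2022, 11, 24)),
    ("Publication", date(2023, 1, 19)),
]


def days_between(start: date, end: date) -> int:
    return (end - start).days


# Days elapsed in each stage, measured from the previous milestone.
stage_days = {}
previous = academic_year_end
for label, when in milestones:
    stage_days[label] = days_between(previous, when)
    previous = when

# Timeliness: lapse between the reference period and publication.
timeliness_days = days_between(academic_year_end, milestones[-1][1])


def punctuality_days(planned: date, actual: date) -> int:
    """Lag between planned and actual publication (0 means on time)."""
    return days_between(planned, actual)


for label, days in stage_days.items():
    print(f"{label}: {days} days")
print(f"Timeliness: {timeliness_days} days")
```

Under these assumptions the stages take 46, 47, 6, 17 and 56 days, giving a timeliness of 172 days from the end of the academic year to publication.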

All of Jisc's HESA-branded National and Official Statistics publication schedules are pre-announced on the HESA website, with the month of release normally shown six months in advance and precise dates announced four weeks prior to release. In the unlikely event of a change to a pre-announced release date, attention will be drawn to this through the HESA website together with a full explanation of the reason for the change.

2.4 Accessibility and clarity

Accessibility is the ease with which users are able to access the data, also reflecting the format(s) in which the data are available and the availability of supporting information. Clarity refers to the quality and sufficiency of the metadata, illustrations and accompanying advice.

HESA-branded statistical products are available free-of-charge from the HESA website. These are published as interactive tables and charts with machine-readable downloads.

Older editions of statistical products are available from an online publications archive.

Further extracts of data from the HESA student data collections are available on request from the Jisc data analytics team (email [email protected] or tel (0)1242 211 133). Further details can be found on Jisc's website.

Each HESA-branded publication is accompanied by full definitions and supporting information on specific aspects of quality. Further advice on aspects of any HESA publication is available from the Official Statistics team (email [email protected] or tel (0) 1242 388 513 [option 2]).

2.5 Comparability

The degree to which data can be compared over time and domain.

The specifications for the HESA student data collections are subject to a major review process every few years and minor changes may occur on a more frequent cycle. Changes are driven by the requirements of statutory users, higher education providers and other key stakeholders. Requirements in relation to the publication of Official and National Statistics are fed into that review process. However, because these are administrative data collections whose primary purposes are not related to the publication of Official and National Statistics, these requirements are of secondary importance to those of the statutory users in relation to state funding and regulation of higher education and the formulation of HE policy. As such, changes to the data collection arrangements which have implications for Official and National Statistics publications do occur from time to time.

Jisc’s statistical planning process is designed to assess the impact of any changes in data collection on statistical outputs and to determine methods for minimising that impact. In very many cases changes operate at a sufficiently low level within the microdata to permit simple adjustments of underlying aggregations which do not materially disturb the higher level statistics derived from the data. Where changes generate greater impact, methods such as re-basing historic data may be used to provide consistent time series within statistical outputs. In such cases this is made clear to users within the accompanying supporting information. Where changes are so serious as to render re-basing impossible, such as a move to an incompatible classification system, any discontinuities in the data are made clear within statistical outputs. For examples of the above treatment of changes in the data collected please see Notes in the January 2023 edition of Higher Education Student Statistics.

It is Jisc policy to use national or international data standards wherever relevant and practicable to maximise comparability over domain. There are many examples of alignments with international standards (e.g. ISO) and national standards (e.g. National Statistics or UK Population Census) within the HESA student data collections. Jisc only deviates from existing data standards if such standards are seen to be inappropriate or inadequate for UK higher education uses.

In 2019/20 we transitioned our approach to presenting the subjects studied by students on courses from a previous standard (JACS) to the Higher Education Classification of Subjects (HECoS) vocabulary. A Common Aggregation Hierarchy (CAH) was produced to aid time series analysis across both of these standards. We published our initial analysis showing how patterns of subject coding have changed during the transition, but chose not to use CAH for presenting time series at this stage. CAH was subsequently updated from version 1.2 to version 1.3.4, and as a result we extended our previous analysis to ascertain what impact the latest version of CAH has on time series analysis. When CAH v1.3.4 is applied to 2018/19 data it redistributes 72,570 students from the nearest analogous subject groups in CAH v1.2. For 2019/20 data the figure is 39,690 students. While changes between analogous subject groupings in each version of CAH are in some cases large, overall the two aggregations yield fairly similar results. CAH v1.3.4 increases the number of students mapped to HESA's grouping of science subjects by 12,765 in 2018/19 over the number produced by CAH v1.2 for that year, but there is no difference between the figures yielded by the two versions of CAH for HESA's groupings of science subjects in 2019/20. The difference between the number of students allocated to each version of CAH's nearest analogous subject groups is always less than 1% of the total student numbers reported, and the modal change is nil.
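One plausible way to compare two versions of an aggregation hierarchy is to apply both mappings to the same coded counts and examine the per-group differences. The sketch below is an assumption about how such a comparison could be set up, not a description of the analysis actually carried out; the subject codes, group names and counts are invented.

```python
from collections import Counter
from typing import Dict, Tuple


def aggregate(counts_by_code: Dict[str, int], mapping: Dict[str, str]) -> Counter:
    """Sum student counts into the groups defined by one mapping version."""
    totals: Counter = Counter()
    for code, n in counts_by_code.items():
        totals[mapping[code]] += n
    return totals


def compare_versions(
    counts_by_code: Dict[str, int],
    old_mapping: Dict[str, str],
    new_mapping: Dict[str, str],
) -> Tuple[Dict[str, int], int, float]:
    """Per-group differences, students moved, and their share of the total."""
    old = aggregate(counts_by_code, old_mapping)
    new = aggregate(counts_by_code, new_mapping)
    groups = sorted(set(old) | set(new))
    differences = {g: new[g] - old[g] for g in groups}
    # Totals are preserved, so gains in some groups equal losses in others.
    moved = sum(d for d in differences.values() if d > 0)
    share = moved / sum(counts_by_code.values())
    return differences, moved, share


# Invented counts and mappings, for illustration only.
counts = {"code1": 100, "code2": 50, "code3": 30}
mapping_old = {"code1": "Group 1", "code2": "Group 1", "code3": "Group 2"}
mapping_new = {"code1": "Group 1", "code2": "Group 2", "code3": "Group 2"}

differences, moved, share = compare_versions(counts, mapping_old, mapping_new)
print(differences, moved, round(share, 3))
```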

CAH v1.3.4 has been used in the publications from 2020/21, and we have also taken the decision to retrospectively apply CAH v1.3.4 to our published data from 2019/20, in order to maximise comparability and consistency of definitions between years for data users. CAH v1.3.4 was introduced to increase the usefulness and intelligibility of detailed subject groupings, but at the highest level of aggregation it delivers similar insights to CAH v1.2. Jisc’s decision to use CAH v1.3.4 for previous years enables us to ensure that all HECoS data is presented in a consistent manner. CAH v1.3.4 is anticipated to be the adopted standard for several years.

However, one aspect of our analysis provides evidence for a need to continue assessing the value of the CAH in providing a bridge between JACS and HECoS. The proportional difference between student numbers reported for each subject grouping in CAH v1.3.4 is higher overall between the two HECoS-coded years (2019/20 and 2020/21) than for the transition from JACS (2018/19) to HECoS (2019/20). This indicates that year-on-year variation within a consistently coded dataset can exceed that observed in the transition between underlying coding approaches. Given the high level of user interest in longitudinal data on subjects and subject groupings, we therefore intend to specify further analytical work with the goal of assessing the usefulness of CAH in providing long-term time series analyses, not only over the transition years but also by application of CAH v1.3.4 to older JACS-coded data. Further data-driven investigations into the use of a time series including both JACS and CAH are required before a final decision is made on the use of CAH for longer time series analysis.

2.6 Coherence

The degree to which data that are derived from different sources or methods, but which refer to the same phenomenon, are similar.

HESA student data is the only comprehensive source of UK-wide statistics on HE students and their study choices. There are, however, other sources of student information that are more limited in coverage, either by type of student or by geographical region of the UK. Probably the most notable of these are statistics compiled and published by UCAS using information supplied by students and HE providers as part of the course applications process. Although UK-wide, the UCAS admissions process primarily focuses on students applying to full-time first degree and some sub-degree courses such as HND (though UCAS processes limited numbers of applications to other types of courses). This presents one of the major differences between UCAS and HESA statistics, since HESA statistics are not limited to a subset of HE courses offered. There are also other key differences. UCAS statistics are based on numbers of applications and acceptances on courses, whereas HESA statistics are records of students who actually enrolled on courses. In some cases accepted applicants never actually enrol on the course on which they have been accepted. There are also cases in which students apply directly to HE providers without using the UCAS admissions process and therefore never appear in published UCAS statistics. In addition to differences in coverage, there are also differences in the methods used by UCAS and HESA to present statistics on student numbers. An example of this is the recording of subjects studied by students who are undertaking subject combination courses: UCAS allocates students to a single major subject or a combination category, whereas HESA divides student numbers across the combination subjects. These definitional differences between HESA and UCAS data must be noted when comparing statistics derived from these sources, although users will note that the overall trends over time in UCAS statistics on entry to higher education are similar to those in HESA statistics for full-time undergraduate entrants.
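The difference between the two treatments of subject combination courses can be illustrated with a small sketch. Both functions are simplified assumptions: the shares, the subject names and the rule used to decide what counts as a ‘major’ subject are hypothetical and do not reproduce either organisation's actual method.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict, List

Shares = Dict[str, Fraction]  # subject -> share of the course, summing to 1


def apportioned_counts(students: List[Shares]) -> Dict[str, Fraction]:
    """Divide each student across their subjects in proportion to the shares."""
    totals: Dict[str, Fraction] = defaultdict(Fraction)
    for shares in students:
        for subject, share in shares.items():
            totals[subject] += share
    return dict(totals)


def single_category_counts(
    students: List[Shares], major_threshold: Fraction = Fraction(1, 2)
) -> Dict[str, int]:
    """Count each student once: under the major subject, or as a combination.

    Assumption: a subject is 'major' when its share exceeds the threshold.
    """
    totals: Dict[str, int] = defaultdict(int)
    for shares in students:
        subject, share = max(shares.items(), key=lambda item: item[1])
        category = subject if share > major_threshold else "Combination"
        totals[category] += 1
    return dict(totals)


students = [
    {"History": Fraction(1)},
    {"History": Fraction(1, 2), "Politics": Fraction(1, 2)},
    {"Physics": Fraction(2, 3), "Mathematics": Fraction(1, 3)},
]

print(apportioned_counts(students))
print(single_category_counts(students))
```

Both approaches account for the same three students, but the apportioned counts attribute 1.5 students to History whereas the single-category counts attribute one student to History and one to a combination category.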

In England, student numbers can also be derived from the annual Office for Students Higher Education Students Early Statistics (‘HESES’) return. This differs in a number of ways from published HESA statistics. Coverage extends only to England-based HE providers. In addition, figures collected within this census are based on an aggregate count of students whereas the HESA student data collections contain individualised data. HESES figures are based on retrospective counts of students from 1 August to 1 December in each academic year together with a forecast of student numbers from 2 December to 31 July with an allowance made for numbers of students who are forecast to leave before completing the academic year. HESA figures are entirely based on a retrospective count with no forecasting or estimation required. In other respects, the coverage of the two census collections is intentionally similar and indeed figures from the two collections undergo a formal reconciliation process which is used as a mechanism to verify the data provided by HE providers.
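The contrast between an early count that includes a forecast and a fully retrospective count can be sketched arithmetically. The formula, the tolerance and the figures below are assumptions made for illustration; they do not reproduce the HESES method or the actual reconciliation rules.

```python
from typing import Tuple


def early_estimate(counted: int, forecast_new: int, forecast_leavers: int) -> int:
    """Counted students to 1 December, plus forecast students for the rest
    of the year, less an allowance for students forecast to leave."""
    return counted + forecast_new - forecast_leavers


def reconcile(
    early: int, retrospective: int, tolerance: float = 0.02
) -> Tuple[float, bool]:
    """Relative difference between the two counts and whether it is within
    the (assumed) tolerance. The retrospective count must be positive."""
    difference = abs(early - retrospective) / retrospective
    return difference, difference <= tolerance


# Invented figures for a single provider.
estimate = early_estimate(counted=9000, forecast_new=1200, forecast_leavers=300)
difference, within_tolerance = reconcile(estimate, retrospective=10000)
print(estimate, difference, within_tolerance)
```

In this invented case the early estimate is 9,900 against a retrospective count of 10,000, a difference of 1%, so no follow-up would be triggered at the assumed 2% tolerance.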

The Department for Education and the Devolved Administrations produce statistics outputs on Higher Education. These often draw upon data from the HESA student data collections and use common definitions. These figures should therefore normally be comparable to statistics published by Jisc.

3. Summary of methods used to compile the output

Collection of the HESA student data

Data are supplied by HE providers to Jisc via a secure web-based transfer system created and maintained by Jisc. The data supplied are subject to an extensive quality assurance process. The first stage of this includes a suite of validation checks, which ensure that the data collected meet specification, that dates fall within expected ranges and that the information provided within fields of data is consistent. Failures at this stage may cause a data return to be rejected, requiring a re-submission from the provider once corrected. The second stage of quality assurance comprises a verification process whereby frequency counts and cross tabulations are produced automatically from the data submission of each provider and fed back to the providers. A team of quality assurance analysts at Jisc also scrutinises this material. Year-on-year comparisons provide a summary of changes, and the level of change in any particular area is examined closely if it falls outside an expected range. Any issues arising from this stage of quality assurance are logged within an online system to which the submitting providers have access. Providers must respond to each issue either to confirm that anomalies are genuine or to correct the data and re-submit. The final stage of the quality assurance process is a sign-off by the head of each provider confirming that data meet required quality standards and are fit for onward use.
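The first two stages described above can be sketched as a record-level validation function and a year-on-year comparison of frequency counts. The field names, the date range and the 10% expected range are hypothetical; the actual rules are those set out in the coding manual.

```python
from datetime import date
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("student_id", "start_date")  # hypothetical field names


def validate(record: Dict[str, Optional[object]], earliest: date, latest: date) -> List[str]:
    """Stage 1: specification, date range and internal consistency checks."""
    errors = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            errors.append(f"missing {field}")
    start = record.get("start_date")
    end = record.get("end_date")
    if start is not None and not (earliest <= start <= latest):
        errors.append("start_date outside expected range")
    if start is not None and end is not None and end < start:
        errors.append("end_date before start_date")
    return errors


def year_on_year_flags(
    previous: Dict[str, int], current: Dict[str, int], max_change: float = 0.10
) -> List[str]:
    """Stage 2: categories whose count changed by more than the expected range."""
    flagged = []
    for category in sorted(set(previous) | set(current)):
        before = previous.get(category, 0)
        after = current.get(category, 0)
        if before == 0:
            changed = after != 0  # a category appearing from nothing is flagged
        else:
            changed = abs(after - before) / before > max_change
        if changed:
            flagged.append(category)
    return flagged


bad_record = {
    "student_id": "S1",
    "start_date": date(2021, 9, 20),
    "end_date": date(2021, 6, 30),
}
print(validate(bad_record, date(2021, 8, 1), date(2022, 7, 31)))

last_year = {"Full-time": 1000, "Part-time": 200}
this_year = {"Full-time": 1040, "Part-time": 260}
print(year_on_year_flags(last_year, this_year))
```

Here the record fails one consistency check, and only the part-time count (a 30% rise) falls outside the assumed range and would be raised as an issue for the provider to confirm or correct.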

Contracts in place between Jisc and statutory users require that the data be of sufficient quality for statutory users' funding, regulatory and policy purposes. Sanctions may be applied against Jisc and HE providers should these quality standards not be met. The quality standards set by statutory users are deemed more than adequate for the purposes of production of official statistics.

Production of statistics from the data collections

Once the data collections have been completed, quality assured and signed off, the resulting data sets are made available to statistical analysts at Jisc for preparation of relevant publications. The production process for Official and National Statistics occurs within a project structure with appropriate governance mechanisms in place. Preparation of the HESA National Statistics releases is undertaken in collaboration with statisticians at the Office for Students, Department for Education and the Devolved Administrations. The process includes extensive quality assurance procedures at key stages. These include peer review of data specifications for releases, peer review of interactive table and chart structures in the HESA website, checking of data tables and charts, and detailed manual checking of figures cited in statistical commentary, together with extensive proofreading of commentary. Each key stage of the production process requires senior staff sign-off and full issue-tracking is used.

Information on the methods used to generate specific elements of data in published releases can be found within the Notes and Definitions supplied with each release.