Student data quality report
Principle Q3 of the Code of Practice for Statistics states that producers of statistics and data should explain clearly how they assure themselves that statistics and data are accurate, reliable, coherent and timely.
Each Official and National Statistics output prepared by HESA contains key quality information in respect of the specific content of the output. This information is provided in the notes, definitions or 'about' link underneath each table or chart.
In addition, HESA publishes summary quality reports for each of the main sources of data used to compile Official and National Statistics, to provide key qualitative information on the various dimensions of quality and the methods used to produce outputs.
Official and National Statistics products which use HESA student data include two Statistical Bulletins (previously referred to as Statistical First Releases): ‘Higher Education Student Statistics’ and ‘Higher Education Graduate Outcomes Statistics’. Other products include the open data releases ‘Higher Education Student Data’ and ‘HE Graduate Outcomes Data’, and all UK Performance Indicators.
2. Summary of quality
2.1 Relevance
The degree to which the statistical product meets user needs for both coverage and content.
Higher education (HE) student data is collected by HESA at an individualised level on a census-basis covering HE students at:
- HE providers in England registered with the Office for Students (OfS) in the Approved (fee cap) or Approved categories;
- Publicly funded HE providers in Wales, Scotland and Northern Ireland; and
- Further education (FE) colleges in Wales.
Coverage extends to all students who are undertaking a course or programme which leads to the award of a qualification or institutional credit. Excluded from coverage are students studying overseas for the entire duration of their course even when they are formally registered at a UK-based HE provider. Students studying overseas by distance learning are similarly excluded unless they are funded by a UK HE funding body. The HE providers at which each student is registered are responsible for submitting the data about that student to HESA, drawing on information held within their internal administrative systems.
HESA student data is used by a wide variety of users for a range of purposes. UK HE funding and regulatory bodies, government education departments and the Devolved Administrations (‘statutory users’) use the information to support the allocation of state funding and the regulation of higher education providers, and in the formulation and monitoring of higher education policy. HE providers use it for strategic organisational planning and for benchmarking their performance and characteristics against other HE providers. Prospective students and their advisors use the information to inform their choices of provider and course, often through intermediary publications such as university guides and league tables. The media use the information to support reports and articles on UK higher education. Private companies use it for monitoring and targeting their graduate recruitment efforts. Academic researchers investigating many different aspects of higher education draw upon the information extensively. International users studying the UK HE system or monitoring the success of students from their country studying in the UK also use the information.
HESA ensures that its student data collections and products derived from them remain relevant to users in a number of ways. A ‘Statutory Customers Technical Group’ exists to ensure that the requirements of statutory users are met. Engagement with users within HE providers is undertaken via a number of groups and discussion mechanisms. All HESA records undergo a major review cycle every few years that brings together representatives from the key user communities to ensure that the information collected remains appropriate for their needs. Other user groups exist in relation to specific HESA information products, and surveys of users are undertaken periodically. Feedback received through all of these channels and others helps to shape the information collected and the content of products derived from the information. In this way the needs of user communities are continuously monitored and, where appropriate and practicable, acted upon.
HESA student data serves multiple purposes and must be aggregated and refined in order to generate meaningful statistics. This includes the derivation of ‘populations’ of students, based on the application of restriction clauses that exclude any records which are inappropriate to a particular output. An example is statistics on qualifications obtained: only those students who have been awarded a qualification in a given reporting year should be included in such statistics. Details of all population definitions and aggregation techniques used are explained within the definitions section of each HESA output.
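As a rough illustration of how a restriction clause defines a population, the sketch below filters a handful of made-up records down to a 'qualifications obtained' population for one reporting year. The field names (student_id, reporting_year, qualification_awarded) and the records themselves are hypothetical and are not taken from the HESA coding manual; the actual population definitions are those given in the definitions section of each output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StudentRecord:
    # Hypothetical fields, for illustration only.
    student_id: str
    reporting_year: str
    qualification_awarded: Optional[str]  # None if nothing was awarded


def qualifications_obtained_population(
    records: List[StudentRecord], reporting_year: str
) -> List[StudentRecord]:
    """Keep only records with a qualification awarded in the given year."""
    return [
        r
        for r in records
        if r.reporting_year == reporting_year
        and r.qualification_awarded is not None
    ]


records = [
    StudentRecord("A1", "2016/17", "First degree"),
    StudentRecord("A2", "2016/17", None),            # still studying
    StudentRecord("A3", "2016/17", "Masters"),
    StudentRecord("A4", "2015/16", "First degree"),  # different year
]

population = qualifications_obtained_population(records, "2016/17")
population_ids = [r.student_id for r in population]
```

Records A2 and A4 fall outside the population: one has no award and the other belongs to a different reporting year.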
2.2 Accuracy
The closeness between an estimated result and the (unknown) true value.
Because HESA's student data is collected as a census rather than a survey, no estimates are produced from it and issues of sampling error are not relevant. Instead, characteristics of the population in question (students) can be measured directly and comprehensively.
For census collections, the extent of missing records or data items is of more relevance to accuracy. Missing records are not considered to impact materially on the student data collected by HESA. Data is collected annually using a ‘muster’ approach, so that no individual record may disappear from one year to the next without reaching an expected ‘end-state’ (such as leaving the HE provider or gaining the intended qualification) or an explanation from the provider as to why the record is missing. In addition, counts of students are compared annually with returns made to funding bodies in respect of state funding allocations, and any discrepancies are investigated. Student data returns are also subject to audit processes from time to time. If any case arises where HESA becomes aware that records are missing for any given HE provider, this is noted within the relevant product.
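The 'muster' idea can be sketched as a comparison between two years of returns. The example below is a simplified illustration, not HESA's implementation: the identifiers, the end-state values and the set of provider explanations are all made up for the purpose of the sketch.

```python
from typing import Dict, List, Optional, Set


def unexplained_disappearances(
    previous_year: Dict[str, Optional[str]],
    current_year_ids: Set[str],
    explained_ids: Set[str],
) -> List[str]:
    """Return IDs that vanished without an end-state or an explanation.

    previous_year maps each student ID to the end-state recorded last
    year, or None if the student was expected to continue.
    """
    missing = []
    for student_id, end_state in previous_year.items():
        if end_state is not None:
            continue  # record reached an expected end-state last year
        if student_id in current_year_ids:
            continue  # record is present again this year
        if student_id in explained_ids:
            continue  # provider has explained the absence
        missing.append(student_id)
    return sorted(missing)


previous_year = {
    "S1": None,                    # continuing, and returned again
    "S2": "gained qualification",  # expected end-state
    "S3": None,                    # continuing, but not returned
    "S4": None,                    # continuing, absence explained
}
queries = unexplained_disappearances(
    previous_year, current_year_ids={"S1"}, explained_ids={"S4"}
)
```

Only S3 would be raised with the provider in this example.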
In respect of missing data items, the majority of data items are collected for all students but some are restricted to students of a particular type or geographical location. For example, data on the ethnicity of students is only collected for students of UK domicile. In such cases this is clearly explained in the definitions provided with each product. Some data items, such as ethnicity, may include categories for ‘unknown’ or ‘information refused’. In such cases the levels of unknown values are shown in statistical tables and caveats may be included which explain that the statistics may not be representative of the population. The level of unknown entries within data items is routinely monitored during the data collection process. Any HE provider recording abnormally high levels of unknown values in key data items is strongly encouraged to reduce this level over time.
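Monitoring of unknown values can be pictured as a per-provider rate compared against a threshold. The sketch below assumes hypothetical field names, category labels and a threshold of 50%; the source does not state what level HESA treats as abnormally high.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed labels for the 'unknown' and 'information refused' categories.
UNKNOWN_CODES = {"unknown", "information refused"}


def unknown_rates(records: Iterable[dict], field: str) -> Dict[str, float]:
    """Share of records per provider whose `field` is an unknown code."""
    totals: Dict[str, int] = defaultdict(int)
    unknowns: Dict[str, int] = defaultdict(int)
    for record in records:
        provider = record["provider"]
        totals[provider] += 1
        if record[field] in UNKNOWN_CODES:
            unknowns[provider] += 1
    return {p: unknowns[p] / totals[p] for p in totals}


def providers_above(rates: Dict[str, float], threshold: float) -> List[str]:
    """Providers whose unknown rate exceeds the (assumed) threshold."""
    return sorted(p for p, rate in rates.items() if rate > threshold)


records = [
    {"provider": "P1", "ethnicity": "white"},
    {"provider": "P1", "ethnicity": "asian"},
    {"provider": "P1", "ethnicity": "unknown"},
    {"provider": "P1", "ethnicity": "black"},
    {"provider": "P2", "ethnicity": "unknown"},
    {"provider": "P2", "ethnicity": "information refused"},
]

rates = unknown_rates(records, "ethnicity")
flagged = providers_above(rates, threshold=0.5)
```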
An overview of the quality checks that are made throughout the collection process is available in the support guide. These involve automated quality rules, credibility checks, the raising and closing of data quality issues by HESA and some statutory users, and a final formal sign-off, verifying the data, from the head of the organisation submitting it. Detailed information on quality rules is available in the coding manual. Information gathered from the quality assurance process is made available in data intelligence notes accompanying the main outputs.
Derived fields are generated from the raw data collected by HESA as part of the data ingestion process. These are aggregations and derivations of the data that are then used in our publications and in onward analysis by statutory users. An example of such a field is the derivation of student age from the raw date of birth field collected from HE providers. These derived fields form an important part of the quality assurance process, and ensure that quality assurance is undertaken in a way that reflects how data fields will be presented in the outputs.
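The age example can be sketched as a small derivation from date of birth. The function below computes age in completed years on a reference date; the choice of reference date is an assumption made for illustration, since the source does not say which date HESA uses.

```python
from datetime import date


def age_on(date_of_birth: date, reference_date: date) -> int:
    """Age in completed years on the reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    years = reference_date.year - date_of_birth.year
    return years if had_birthday else years - 1


# Hypothetical reference date, chosen only for this example.
reference = date(2016, 8, 31)

age_before_birthday = age_on(date(1998, 9, 1), reference)
age_on_birthday = age_on(date(1998, 8, 31), reference)
```

Deriving the field once, centrally, means that the same value is checked during quality assurance and shown in the outputs.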
2.3 Timeliness and punctuality
Timeliness refers to the lapse of time between publication and the period to which the data refer. Punctuality refers to the time lag between the actual and planned dates of publication.
The first release of information from HESA's student data collections for any given academic year usually occurs in the January following the end of that academic year. For example, data for the 2016/17 academic year was first published on 11 January 2018. First release occurs within the Statistical Bulletin entitled ‘Higher Education Student Statistics’. This is a National Statistics publication and is freely available from the HESA website and the National Statistics Publication Hub. A further, more detailed open data release drawing on these data, entitled ‘Higher Education Student Data’, is then published in February. This is an Official Statistics release and is available free of charge from the HESA website.
The reason for the time delay between the end of the academic year and first publication of statistics in relation to that year is the time required to collect, process and quality-assure the data and to prepare the statistical release itself. HESA student data is collected annually as a retrospective activity in the autumn following the academic year to which it relates. For example, the 2016/17 HESA Student data collection system opened in March 2017 with a return date of 15 September. A data quality checking period ran until 31 October, with sign-off of the record occurring on 6 November. Final processing of data and delivery to statutory users covered the period to 24 November 2017. At this point preparation of the January Statistical Bulletin commenced, culminating in publication on 11 January 2018.
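The two measures defined in this section reduce to simple date arithmetic, sketched below for the 2016/17 example. The end of the reference period is assumed here to be 31 July 2017, and the 'planned' date in the punctuality calculation is set equal to the actual date purely for illustration.

```python
from datetime import date


def timeliness_days(period_end: date, published: date) -> int:
    """Days between the end of the reference period and publication."""
    return (published - period_end).days


def punctuality_days(planned: date, actual: date) -> int:
    """Days between planned and actual publication (0 means on time)."""
    return (actual - planned).days


period_end = date(2017, 7, 31)  # assumption: academic year ends 31 July
published = date(2018, 1, 11)   # first release date given in the text

lag = timeliness_days(period_end, published)
delay = punctuality_days(planned=published, actual=published)
```

Under these assumptions the lag is 164 days, and a release that meets its pre-announced date has a punctuality of zero days.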
All of HESA's National Statistics publication dates are pre-announced on the HESA website and the National Statistics Publication Hub Release Calendar. In the unlikely event of a change to a pre-announced release date, attention will be drawn to this through the NS Hub Release Calendar and the HESA website together with a full explanation of the reason for the change.
In addition, since 2009 release dates for all of HESA's remaining outputs (which are predominantly Official Statistics) have been pre-announced via the HESA website with month of release normally shown six months in advance and precise dates being announced four weeks prior to release. HESA has met all such pre-announced release dates.
2.4 Accessibility and clarity
Accessibility is the ease with which users are able to access the data, also reflecting the format(s) in which the data are available and the availability of supporting information. Clarity refers to the quality and sufficiency of the metadata, illustrations and accompanying advice.
Editions of the Statistical Bulletin ‘Higher Education Student Statistics’ can be accessed free of charge from the HESA website. Each of the data tables included within the bulletin is provided in a machine-readable downloadable format, to encourage analysis and re-use. For historical Statistical First Releases these were provided in Microsoft Excel format.
‘Students in Higher Education’ is available free of charge from the HESA website. Earlier editions of this also contain data tables in Microsoft Excel format. In recent years, the contents of this have been released as interactive tables and charts with machine-readable downloads.
‘Higher Education Statistics for the United Kingdom’ is available free of charge as an online download from the HESA website.
Further extracts of data from HESA's student data collections are available on request from the Jisc data analytics team (email [email protected] or tel (0)1242 211 133). Further details can be found on Jisc's website.
Each HESA publication is accompanied by full definitions and supporting information on specific aspects of quality. Further advice on aspects of any HESA publication is available from the Official Statistics team (email [email protected] or tel (0) 1242 388 513 (option 2)).
2.5 Comparability
The degree to which data can be compared over time and domain.
The specifications for HESA's student data collections are subject to a major review process every few years, and minor changes may occur on a more frequent cycle. Changes are driven by the requirements of statutory users, higher education providers and other key stakeholders. Requirements in relation to the publication of Official and National Statistics are fed into that review process. However, because these are administrative data collections whose primary purposes are not related to the publication of Official and National Statistics, these requirements are of secondary importance to those of the statutory users in relation to state funding and regulation of higher education and the formulation of HE policy. As such, changes to the data collection arrangements which have implications for Official and National Statistics publications do occur from time to time. HESA's statistical planning process is designed to assess the impact of any changes in data collection on statistical outputs and to determine methods for minimising that impact. In very many cases changes operate at a sufficiently low level within the microdata to permit simple adjustments of underlying aggregations which do not materially disturb the higher-level statistics derived from the data. Where changes generate greater impact, methods such as re-basing historic data may be used to provide consistent time series within statistical outputs. In such cases this is made clear to users within the accompanying supporting information. Where changes are so serious as to render re-basing impossible, such as a move to an incompatible classification system, any discontinuities in the data are made clear within statistical outputs. For examples of the above treatment of changes in the data collected, please see Notes to Editors in the January 2011 SFR.
It is HESA policy to use national or international data standards wherever relevant and practicable to maximise comparability over domain. There are many examples of alignments with international standards (e.g. ISO) and national standards (e.g. National Statistics or UK Population Census) within the HESA student data collections. HESA only deviates from existing data standards if such standards are seen to be inappropriate or inadequate for UK higher education uses.
In 2019/20 we moved from the previous standard (JACS) to the Higher Education Classification of Subjects (HECoS) vocabulary for presenting the subjects studied by students. A Common Aggregation Hierarchy (CAH) was produced to aid time-series analysis across both of these standards. We published our initial analysis showing how patterns of subject coding changed during the transition, but chose not to use CAH for presenting time series at that stage. CAH was subsequently updated from version 1.2 to version 1.3.4, and as a result we extended our previous analysis to ascertain what impact the latest version of CAH has on time-series analysis. When CAH v1.3.4 is applied to 2018/19 data it redistributes 72,570 students from the nearest analogous subject groups in CAH v1.2. For 2019/20 data the figure is 39,690 students. Although changes between analogous subject groupings in the two versions of CAH are large in some cases, overall the two aggregations yield fairly similar results. CAH v1.3.4 increases the number of students mapped to HESA's grouping of science subjects by 12,765 in 2018/19 over the number produced by CAH v1.2 for that year, but there is no difference between the figures yielded by the two versions of CAH for HESA's groupings of science subjects in 2019/20. The difference between the number of students allocated to each version of CAH's nearest analogous subject groups is always less than 1% of the total student numbers reported, and the modal change is nil.
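A comparison of this kind can be sketched as a per-group difference between two aggregations of the same students, expressed as a share of the total. The group names and counts below are invented, and the calculation is a simplified illustration rather than the method HESA used to produce the figures quoted above.

```python
from typing import Dict


def group_differences(
    counts_old: Dict[str, int], counts_new: Dict[str, int]
) -> Dict[str, int]:
    """Change in student count per subject group (new minus old)."""
    groups = sorted(set(counts_old) | set(counts_new))
    return {g: counts_new.get(g, 0) - counts_old.get(g, 0) for g in groups}


def largest_share_of_total(differences: Dict[str, int], total: int) -> float:
    """Largest absolute per-group change as a share of all students."""
    return max(abs(d) for d in differences.values()) / total


# Made-up counts for the same students under two versions of a hierarchy.
counts_v1_2 = {"Group A": 1000, "Group B": 500, "Group C": 250}
counts_v1_3_4 = {"Group A": 990, "Group B": 510, "Group C": 250}

differences = group_differences(counts_v1_2, counts_v1_3_4)
total_students = sum(counts_v1_2.values())
share = largest_share_of_total(differences, total_students)
```

In this toy example ten students move from Group A to Group B, so the largest change is well under 1% of the total and the differences sum to zero because no students are added or removed.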
CAH v1.3.4 has been used in the publications from 2020/21, and we have also taken the decision to retrospectively apply CAH v1.3.4 to our published data from 2019/20, in order to maximise comparability and consistency of definitions between years for data users. CAH v1.3.4 was introduced to increase the usefulness and intelligibility of detailed subject groupings, but at the highest level of aggregation it delivers similar insights to CAH v1.2. HESA's decision to utilise CAH v1.3.4 for previous years enables us to ensure that all HECoS data is presented in a consistent manner. CAH v1.3.4 is anticipated to be the adopted standard for several years.
However, one aspect of our analysis suggests a need to continue assessing the value of CAH as a bridge between JACS and HECoS. The proportional difference between student numbers reported for each subject grouping in CAH v1.3.4 is higher overall between the two HECoS-coded years (2019/20 and 2020/21) than for the transition from JACS (2018/19) to HECoS (2019/20). This indicates that year-on-year variation within a consistently coded dataset can exceed that observed in the transition between underlying coding approaches. Given the high level of user interest in longitudinal data on subjects and subject groupings, we therefore intend to specify further analytical work to assess the usefulness of CAH in providing long-term time-series analyses, not only over the transition years but also by applying CAH v1.3.4 to older JACS-coded data. Further details of this planned work will be shared in due course.
2.6 Coherence
The degree to which data that are derived from different sources or methods, but which refer to the same phenomenon, are similar.
HESA's student data is the only comprehensive source of UK-wide statistics on HE students and their study choices. There are, however, other sources of student information that are more limited in coverage, either by type of student or by geographical region of the UK. Probably the most notable of these are statistics compiled and published by UCAS using information supplied by students and HE providers as part of the applications process. Although UK-wide, the UCAS admissions process primarily focuses on students applying to full-time first degree and some sub-degree courses such as HND (though UCAS processes limited numbers of applications to other types of courses). This presents one of the major differences between UCAS and HESA statistics, since HESA statistics are not limited to a subset of the HE courses offered. There are also other key differences. UCAS statistics are based on numbers of applications and acceptances on courses, whereas HESA statistics are records of students who actually enrolled on courses. In some cases accepted applicants never actually enrol on the course on which they have been accepted. There are also cases in which students apply directly to HE providers without using the UCAS admissions process and therefore never appear in published UCAS statistics. In addition to differences in coverage, there are also differences in the methods used by UCAS and HESA to present statistics on student numbers. An example of this is the recording of subjects studied by students who are undertaking subject combination courses: UCAS allocates students to a single major subject or a combination category, whereas HESA divides student numbers across the combination subjects.
These definitional differences between HESA and UCAS data must be noted when comparing statistics derived from these sources, although users will note that the overall trends over time in UCAS statistics on entry to higher education are similar to those in HESA statistics for full-time undergraduate entrants.
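The difference in the treatment of subject combination courses described above can be illustrated with two counting functions. This is a deliberately simplified sketch: the subjects, the weights used to split each student and the single 'Combined' label are assumptions for illustration, and neither function reproduces the actual rules used by HESA or UCAS.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict, List

Student = Dict[str, Fraction]  # subject -> share of the course (sums to 1)


def apportioned_counts(students: List[Student]) -> Dict[str, Fraction]:
    """Divide each student across the subjects of their course."""
    totals: Dict[str, Fraction] = defaultdict(Fraction)
    for subjects in students:
        for subject, weight in subjects.items():
            totals[subject] += weight
    return dict(totals)


def single_category_counts(students: List[Student]) -> Dict[str, int]:
    """Count each student once: in their subject, or under 'Combined'."""
    totals: Dict[str, int] = defaultdict(int)
    for subjects in students:
        label = next(iter(subjects)) if len(subjects) == 1 else "Combined"
        totals[label] += 1
    return dict(totals)


students = [
    {"Mathematics": Fraction(1)},
    {"Mathematics": Fraction(1, 2), "Physics": Fraction(1, 2)},
    {"History": Fraction(2, 3), "Politics": Fraction(1, 3)},
]

apportioned = apportioned_counts(students)
single = single_category_counts(students)
```

Both approaches account for the same three students, but the subject-level figures differ: Mathematics is 1.5 under apportionment and 1 under the single-category approach.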
In England, student numbers can also be derived from the annual Office for Students ‘HESES’ return (Higher Education Students Early Statistics). This differs in a number of ways from published HESA statistics. Coverage extends only to England-based HE providers. In addition, figures collected within this census are based on an aggregate count of students, whereas the HESA student data collections contain individualised data. HESES figures are based on retrospective counts of students from 1 August to 1 December in each academic year, together with a forecast of student numbers from 2 December to 31 July, with an allowance made for numbers of students who are forecast to leave before completing the academic year. HESA figures are entirely based on a retrospective count with no forecasting or estimation required. In other respects the coverage of the two census collections is intentionally similar, and indeed figures from the two collections undergo a formal reconciliation process which is used as a mechanism to verify the data provided by HE providers.
The Department for Education and the Devolved Administrations produce statistics outputs on Higher Education. These often draw upon data from the HESA student data collections and use common definitions. These figures should therefore normally be comparable to statistics published by HESA.
3. Summary of methods used to compile the output
Collection of the HESA student data
Data are supplied by HE providers to HESA via a secure web-based transfer system created and maintained by HESA. The data supplied are subject to an extensive quality assurance process. The first stage of this includes a suite of validation checks, which ensure that the data collected meet specification, that dates fall within expected ranges and that the information provided within fields of data is consistent. Failures at this stage may cause a data return to be rejected, requiring a re-submission from the provider once corrected. The second stage of quality assurance comprises a verification process whereby frequency counts and cross-tabulations are produced automatically from the data submission of each provider and fed back to the providers. A team of quality assurance analysts at HESA also scrutinises this material. Year-on-year comparisons provide a summary of changes, and the level of change in any particular area is examined closely if it falls outside an expected range. Any issues arising from this stage of quality assurance are logged within an online system to which the submitting providers have access. Providers must respond to each issue, either confirming that anomalies are genuine or correcting the data and re-submitting. The final stage of the quality assurance process is a sign-off by the head of each provider confirming that data meet required quality standards and are fit for onward use.
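The first two stages described above can be sketched as a record-level validation function and a year-on-year comparison of counts. Everything specific in the example is an assumption made for illustration: the field names, the particular rules and the 10% tolerance are hypothetical and do not reproduce the quality rules in the coding manual.

```python
from datetime import date
from typing import Dict, List


def validate_record(record: dict, period_end: date) -> List[str]:
    """Stage one: return a list of validation failures for one record."""
    errors = []
    for field in ("student_id", "start_date"):
        if record.get(field) is None:
            errors.append(f"missing {field}")
    start, end = record.get("start_date"), record.get("end_date")
    if start is not None and start > period_end:
        errors.append("start_date after reporting period")
    if start is not None and end is not None and end < start:
        errors.append("end_date before start_date")
    return errors


def flag_year_on_year(
    previous: Dict[str, int], current: Dict[str, int], tolerance: float = 0.10
) -> Dict[str, float]:
    """Stage two: proportional changes that exceed the tolerance."""
    flagged = {}
    for category, prev_count in previous.items():
        if prev_count == 0:
            continue  # no baseline to compare against in this sketch
        change = (current.get(category, 0) - prev_count) / prev_count
        if abs(change) > tolerance:
            flagged[category] = change
    return flagged


period_end = date(2017, 7, 31)
good = {"student_id": "S1", "start_date": date(2016, 9, 19), "end_date": None}
bad = {"student_id": None, "start_date": date(2016, 9, 19),
       "end_date": date(2016, 9, 1)}

good_errors = validate_record(good, period_end)
bad_errors = validate_record(bad, period_end)
issues = flag_year_on_year(
    previous={"full-time": 1000, "part-time": 200},
    current={"full-time": 1050, "part-time": 260},
)
```

Here the 5% rise in full-time numbers passes, while the 30% rise in part-time numbers would be logged as an issue for the provider to confirm or correct.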
Contracts in place between HESA and Statutory users require that the data be of sufficient quality for Statutory users' funding, regulatory and policy purposes. Sanctions may be applied against HESA and HE providers should these quality standards not be met. The quality standards set by Statutory users are deemed more than adequate for the purposes of production of Official Statistics.
Production of statistics from the data collections
Once the data collections have been completed, quality assured and signed off, the resulting data sets are made available to statistical analysts at HESA for preparation of relevant publications. The production process for Official and National Statistics occurs within a project structure with appropriate governance mechanisms in place. Preparation of the HESA National Statistics releases is undertaken in collaboration with statisticians at the Office for Students, Department for Education and the Devolved Administrations. The process includes extensive quality assurance procedures at key stages. These include peer review of data specifications for releases, peer review of interactive table and chart structures in the HESA website, checking of data tables and charts, and detailed manual checking of figures cited in statistical commentary, together with extensive proofreading of commentary. Each key stage of the production process requires senior staff sign-off, and full issue-tracking is used.
Information on the methods used to generate specific elements of data in published releases can be found within the Notes and Definitions supplied with each release.