Linked data: Details
We asked in the consultation what approach we should take to data linking.
We highlighted three basic models that could be pursued:
- We could rely entirely on linked data to determine graduate outcomes.
- We could choose not to utilise linked data, and continue collecting data (including salary) by consent as at present.
- We could take a mixed approach, collecting data from the best available data sources, whether surveys of graduates like DLHE or national datasets, and merge them to produce a composite source of information.
We set out our intention to pursue option three, and asked respondents for feedback on whether this would be the most suitable approach. 94% of respondents agreed that linked data should form a critical part of the product. We highlighted that one of the key forms of linked data we would use is the Longitudinal Education Outcomes (LEO) data, and also highlighted the potential to use linked study data from the HESA Student record.
HESA collects information on students and courses throughout HE in the UK as part of its normal work. We propose to make greater use of this resource, linking and providing data back to HE providers, instead of collecting through a survey.
Our approach will be to link records via personal identifiers such as the Unique Learner Number (ULN) and the HESA Unique Student Identifier (HUSID) in the first instance, and then to ‘fuzzy match’ any remaining records, to maximise the valid data obtainable. Because of the limited roll-out of the ULN, and some limitations on the linking of HUSIDs between study engagements, too few records would be linked using identifiers alone, at least initially. Continuity of HUSID is not always maintained from one HE provider to another, and gaps cannot be detected with complete reliability. We have looked at what information might be collected to improve fuzzy matching of records: while names and dates of birth are likely to persist, in a population as large as the HESA Student record they are not always sufficient to identify an individual. Because most of the student population is highly mobile, address information offers only a weak indicator for matching records. However, collecting the institution at which the further study is taking place narrows the pool of potential matches considerably. We therefore propose the following strategy:
- We will match on ULN where present in both Student records
- We will match on HUSID where present in both Student records
- We will undertake a fuzzy match that will catch most remaining records
- In the activity section, we will have an option for further study, and two questions to determine the level of study and the university/college name.
This will both significantly improve the accuracy of fuzzy matching, and allow the UK Performance Indicators (UKPIs) to be produced from survey data alone if necessary. It also represents an overall reduction of three questions on this topic.
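The matching strategy above can be illustrated with a minimal sketch. It is not HESA's implementation: the record fields, the function name link_record, the use of name similarity from Python's difflib, and the 0.85 threshold are all assumptions made for illustration. It shows identifiers being tried first, then a fuzzy match on name and date of birth, with the provider named in the survey used to narrow the pool of candidates.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StudentRecord:
    """Hypothetical, simplified record; the real Student record has many more fields."""
    record_id: str
    name: str
    date_of_birth: str          # ISO date string, e.g. "1995-03-14"
    provider: str
    uln: Optional[str] = None   # Unique Learner Number, if held
    husid: Optional[str] = None # HESA Unique Student Identifier, if held


def name_similarity(a: str, b: str) -> float:
    """Similarity between two names, from 0.0 to 1.0 (illustrative choice of measure)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def link_record(
    earlier: StudentRecord,
    candidates: List[StudentRecord],
    further_study_provider: Optional[str] = None,
    threshold: float = 0.85,    # assumed cut-off, not taken from the source
) -> Tuple[Optional[StudentRecord], str]:
    """Return (matched record, method), or (None, "unmatched")."""
    # Step 1: match on ULN where present in both records.
    if earlier.uln:
        for cand in candidates:
            if cand.uln and cand.uln == earlier.uln:
                return cand, "uln"

    # Step 2: match on HUSID where present in both records.
    if earlier.husid:
        for cand in candidates:
            if cand.husid and cand.husid == earlier.husid:
                return cand, "husid"

    # Step 3: fuzzy match on name and date of birth. If the survey gave the
    # provider of further study, use it to constrain the candidate pool.
    pool = [c for c in candidates if c.date_of_birth == earlier.date_of_birth]
    if further_study_provider is not None:
        pool = [c for c in pool if c.provider == further_study_provider]

    close = [c for c in pool if name_similarity(earlier.name, c.name) >= threshold]
    if len(close) == 1:
        return close[0], "fuzzy"

    # No candidate, or more than one plausible candidate: leave unlinked.
    return None, "unmatched"
```

In this sketch, two candidates with the same name and date of birth at different providers are left unmatched when no provider is supplied, but resolve to a single record once the provider reported in the survey is passed as further_study_provider. This is the sense in which the provider question improves the accuracy of fuzzy matching.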
As the use of ULNs grows, the quality of matching will improve, and the policy can be kept under review with a view to moving more fully toward linked data over time. We will also investigate the availability of linked data from other education sources, so that more can be understood about students who go on to study for qualifications outside the HESA constituency, such as those at FE level.
Providers who anticipate requiring more detailed information about further study will be able to define additional bespoke questions.
The LEO dataset contains UK-wide earnings data, expressed as both annual and daily earnings. It also provides some information on the start and end dates of each spell of employment, although this is subject to fuzzy matching and may therefore be limited. This information is used to derive whether a graduate was in a period of sustained employment during the tax year. The LEO data can be linked to student data and NewDLHE data using a method described by the Department for Education (DfE) in England. The dataset also contains flags that indicate whether a graduate had a record of self-assessment tax data.
In the consultation, we asked respondents to give case studies of how they use the current salary data, and how they planned to use the data in future. From this data, we learned that the overwhelming majority of use cases are for grouped data, particularly at course level, or by occupational group, as well as by various protected characteristics.
Our aim is to make the information available to meet the uses listed below, and we are working with the DfE to establish the mechanisms to do so. HESA sees itself holding and providing access to the parts of the LEO data that complement NewDLHE (e.g. data that refers to past students and shows their earnings several years after graduation; for mature students this might also include information from before and during their HE studies).
We are working with the DfE to agree a mechanism for accessing the UK-wide data. This would involve the DfE providing HESA with individualised data to allow HESA to publish Official Statistics (including designated National Statistics) and to provide anonymised salary information to HE providers and other data users. The uses of this data will necessarily be restricted to those permitted by the Small Business, Enterprise and Employment Act 2015, and will be subject to HESA’s charitable purposes:
- UK-wide statistical products which HESA publishes, including the NewDLHE statistical first release and NewDLHE full publication.
- Supporting the Office for Students in England and the devolved administrations in the rest of the UK in producing public information sources to enable consumer evaluation of the effectiveness of education and training, such as Unistats.
- Supporting third-party and market providers of information, advice and guidance by providing appropriate datasets to enable consumer evaluation of the effectiveness of education and training.
- Undertaking or supporting research projects, by either HESA or appropriate third parties, aimed at furthering the assessment of policy on, and the effectiveness of, education and training (novel analysis may subsequently be incorporated into HESA’s open data statistical products where this demonstrably improves public benefit).
- Supplying appropriate datasets on their graduates to the governing bodies of HE providers, to help providers evaluate the effectiveness of the training or education they provide by understanding the employment outcomes of their graduates.
HESA will seek formal authority to perform these assessment activities on behalf of the Secretary of State and the devolved administrations.
The limitations placed on the supply of data by the 2015 Act are clear, and no data could be supplied that falls outside the functions specified above. All data will be anonymised, and an appropriate implementation of the HESA rounding strategy will be applied.
As HESA is a non-profit organisation with an open data strategy, charges will only be levied where services that require analytical or other resources are provided in addition to data.
LEO data is presently only available following the end of the full tax year for graduates. We therefore envisage having the first LEO data linked into the NewDLHE in June 2020. However, much of the data from NewDLHE is likely to be ready for publication before this, in January 2020.
We also aim to develop experimental statistics publications on some of the new measures envisaged, and will publish these in due course.
Update June 2017
We consulted on the model in March/April 2017, and published a synthesis of consultation responses. We have also published a number of responses and clarifications on points raised by respondents, including points on linked data.