Data Futures: Key concepts
Key concepts that underpin the new Data Futures approach and Coding manual (Data Futures) are discussed below. When Data Futures is implemented, data collection will move from a single annual collection to in-year submission, where data can be submitted at any point during the year. In-year submissions will be grouped into three Reference periods.
Please note that it has not yet been decided how these concepts and processes will be approached during the transition year.
We have taken three separate data collections and consolidated them into a single collection, expressed in a sector-owned, observed data language.
Details of the entities and fields to be collected are presented in the Data dictionary in the Coding manual, which forms the core of this publication.
The Specification is a further development of the concepts and structures that have been refined over the past two years, and which are now largely stable.
When Data Futures is implemented we will move from a system of discrete and retrospective annual data collection, to a model of continuous data collection.
We will be moving from a system aimed at producing one ‘good’ version of a single file comprising a complete annual return, to an ongoing update approach, where fragments of data are accumulated and build towards a complete picture over time.
Fragments could be large files comprising a complete return, or could be a small correction to add, say, a missing date of birth. We encourage HE providers to move towards small, frequent updates: changes, or “deltas”.
For the new Data Futures in-year collection process, data will only be collected where it is new, changed, or a correction. Data will relate to events, so if nothing happens during a period, nothing need be returned. Student personal characteristics and course information will only need to be returned once. Three Reference periods (discussed below) per year ensure consistent, comparable data can be produced for a wide range of public purposes.
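The fragment-based model above can be sketched in a few lines of Python. This is an illustrative sketch only, under assumed names: `apply_fragment` and the field names (`student_id`, `birth_date`, etc.) are hypothetical and are not taken from the Data dictionary.

```python
def apply_fragment(record: dict, fragment: dict) -> dict:
    """Merge a small update ("delta") into the accumulated student record.

    Only fields present in the fragment are changed; everything already
    held (e.g. personal characteristics returned once) is preserved.
    """
    merged = dict(record)      # keep previously submitted data
    merged.update(fragment)    # apply only the new or changed fields
    return merged

# A missing date of birth is supplied later as a small correction,
# rather than re-submitting the complete return:
record = {"student_id": "S1", "course": "C100"}
record = apply_fragment(record, {"birth_date": "2001-05-04"})
```

The point of the sketch is that a fragment can be arbitrarily small: a one-field correction and a complete return flow through the same update path.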
In this model, HE providers have the option to smooth resourcing and increase efficiencies by adopting best practice in data management; by validating data close to the point of capture; and by sharing new/changing data with HESA as small updates, processed automatically.
We recognise that HE providers are at many different stages in improving the way they manage data, so collection can still be managed through bulk file uploads, whether as regular updates or as bulk updates towards the end of the Reference period, treated as an old-style “commit” deadline. A data entry tool will also be available.
A Reference period is a fixed period of time whose end aligns with when HESA’s statutory and public purpose customers require sector-wide data and information. The diagram below summarises the structure of a Reference period:
Key terms relating to a Reference period:
Sign-off: A formal declaration that the in-scope data submitted to HESA represents an honest, impartial, and rigorous account of the HE provider’s events up to the end of the Reference period. This is not the same as all data submitted since the last sign-off (see the discussion of In-scope periods below).
Sign-off occurs during the Sign-off period before the Dissemination point. It must occur at least once, but a provider’s Reference period data can be signed off as many times as the provider deems necessary, until the deadline (Dissemination point) is reached. Data must be signed off by the head of the HE provider – normally the Vice-Chancellor or Principal.
Dissemination point: The specified date, following the end of a Reference period, by which signed-off data will be extracted and supplied to HESA's data customers. Data disseminated at the Dissemination point will be used for official accounts of the higher education provider’s activity for statistical, regulatory, and public information purposes.
Every entity has an ‘In-scope period’, based on ‘In-scope start date’ and ‘In-scope end date’ formulas, which defines when that row of data comes into and out of scope for validation and sign-off. When a row’s ‘In-scope period’ overlaps a Reference period, that row is in scope for that Reference period. Data submitted prior to the In-scope period will not require sign-off until it becomes in scope. Submissions after this period will be subject to exception processing. Below are two examples of ‘In-scope’ periods for September and January intakes - please note, the ‘In-scope period’ will change according to an individual organisation’s timetable of activity.
In-scope periods are designed to:
- Acknowledge fluidity of data.
- Enable older data to fall out of coverage naturally.
- Allow data of interest to be identified.
- Enable future data (for instance, planned curriculum data) to be submitted without being subject to sign-off.
- With in-year reporting, timeliness becomes a quality dimension alongside more familiar concepts like validity, completeness, and uniqueness. In-scope periods enable quality assurance processes to operate in real-time.
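The overlap test that decides whether a row is in scope for a Reference period can be sketched as follows. The function name and the sample dates are illustrative assumptions, not part of the Specification:

```python
from datetime import date

def row_in_scope(in_scope_start: date, in_scope_end: date,
                 period_start: date, period_end: date) -> bool:
    """A row is in scope for a Reference period when its In-scope
    period overlaps the Reference period (inclusive at both ends)."""
    return in_scope_start <= period_end and in_scope_end >= period_start

# Hypothetical September intake: in scope from enrolment activity in
# August through to the following June.
sept = (date(2019, 8, 1), date(2020, 6, 30))
period1 = (date(2019, 8, 1), date(2019, 12, 31))  # first Reference period
row_in_scope(*sept, *period1)  # True: the two date ranges overlap
```

Data arriving before `in_scope_start` would be held without requiring sign-off; data arriving after `in_scope_end` would be routed to exception processing.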
In-scope periods for each entity will generally be set with start-dates relative to relevant curriculum-related or student-related start dates; and with end-dates generated from the arrival of new data, which has the effect of marking the end of a part of a student journey.
Historical amendments to provider data can occur through the operation of exception processing, where data whose In-scope period has now passed is returned to HESA. Since such changes amend the historic record, they will follow an appropriate authorisation path, and a judgement will be made as to whether the data should be processed. If historical amendment is permitted, HESA will coordinate the reopening of closed Reference periods to enable exception processing to occur.
There will be three Reference periods over the traditional academic year, as this common timetable reflects both the majority of activity, and the principal regulatory activities that depend on data. The flexibility of the model allows HE providers to reflect their own timetables of activity, and respect the different delivery patterns of courses with different start dates.
The Data Futures model sees data as a natural output of the HE provider’s own internal business processes and therefore submissions follow in close proximity to business events such as registration and enrolment. The Reference periods are not deterministic – the model is designed to follow the (generally annual) rhythm of course deliveries, recognising that different courses operate on different timescales. Business events occur and generate data, which is then reported to HESA in the Reference period when they occur. The collection system is always open and HE providers can continuously submit, quality assure, and view consolidated data using the Data Futures platform throughout the year. A suite of quality assurance and sign-off activities enable us to provide the sector with reliable, comparable, and consistent in-year information.
The diagram below illustrates how Reference periods will work (please note that it is presented only to illustrate the model; it does not imply that any decisions on timings have been agreed):
The diagram illustrates that when one Reference period closes the next one immediately begins. This means that the collection system is always open, even during each Reference period’s sign-off stage.
During a Reference period, data will be submitted following business events such as registration and enrolment, and then, if required, updates to this data can be made in the following Reference periods. In effect, the work of submitting data to HESA becomes one of continual submission of changes (in the form of new and updated data), and quality assurance is spread throughout the year, punctuated by Sign-off prior to Dissemination points.
There will only be one Student record, which accumulates over time. New in-scope data is signed off after each Reference period, adding to the single longitudinal dataset (and, where exception processing occurs, rectifying previous errors or omissions).
The move to continual submission and quality assurance means that HE providers will continuously submit and quality assure their data throughout the year. At the end of the Reference period the following activities will occur:
- HE providers complete their submission of data for the Reference period; for example, if a student enrols on 30th March and the End of Reference period is 31st March, the HE provider has until the Sign-off date to complete submission and quality assurance of their data. HESA does not specify the Sign-off date, only that it must occur before the Dissemination point; that is, the End of Reference period is not the last submission date for that Reference period.
- HE providers preview the consolidated data supply for the forthcoming Dissemination point (and have the opportunity to correct their data and carry out a final review of consolidated data prior to sign-off). It is anticipated that the draft consolidated data to be supplied will be available before the End of Reference period.
- Any final data issues are raised by HESA or its statutory customers and are resolved by the HE provider.
- HE providers sign-off their data.
- At the Dissemination point HESA delivers the specified outputs.
A detailed definition of the processes following the Reference period has been developed during the Detailed design phase. These processes will be piloted during the Alpha and Beta pilots.
Data will always be validated in the context of a Reference period. A Reference period always has a Specification in force for that Reference period, and the Specification defines which rules to run. Any data whose In-scope period overlaps the Reference period is in scope for that Reference period, and is therefore validated and considered in scope for Sign-off.
This approach will result in a process of continuous quality assurance: data is accepted, and feedback on its current quality state is returned rapidly. As illustrated by the diagram below, each processing cycle passes through a number of stages. The stages are indicated by the coloured boxes (see legend at top of diagram) inside each cycle.
Methods for assuring data quality
There are three methods for assuring the quality of data:
Implicit validation rules – Metadata defined in the Specification describing intrinsic constraints on the data.
Explicit validation rules – Explicit rules for indicating if a row of data is valid or not. An ‘Applicable To’ expression defines the population on which to run the rule, and is used to calculate the percentage of failing rows. These rules encompass the current concepts of Exceptions, Warnings, and Continuity Rules, but they are implemented in a generic fashion.
Credibility rules – Rules that slice and dice the data within a dataset and apply an algorithm in order to assess the credibility of the dataset as a whole, e.g. an automatic year-on-year assessment of credibility.
Implicit validation rules
Failures of implicit validation rules are shown to the user alongside failures of explicit validation rules.
The table below lists the properties that can be defined in the metadata that are used to create implicit validation rules.
The level column indicates which one of the three levels of implicit rules the property falls into. This determines the extent to which data can be processed and assured.
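As a sketch of how intrinsic metadata constraints can drive implicit validation, the fragment below derives checks from hypothetical field properties (`required`, `allowed`, `pattern`). The field names and property names here are assumptions for illustration; the actual properties are those listed in the table referenced above.

```python
import re

# Hypothetical metadata, standing in for the Specification's field
# definitions; these names are not taken from the Data dictionary.
FIELD_METADATA = {
    "GENDER": {"required": True, "allowed": {"M", "F", "O"}},
    "BIRTHDATE": {"required": False, "pattern": r"\d{4}-\d{2}-\d{2}"},
}

def implicit_failures(row: dict) -> list[str]:
    """Apply the intrinsic constraints defined in the metadata to one row."""
    failures = []
    for field, meta in FIELD_METADATA.items():
        value = row.get(field)
        if value is None:
            if meta.get("required"):
                failures.append(f"{field}: required field missing")
            continue  # optional field absent: nothing to check
        if "allowed" in meta and value not in meta["allowed"]:
            failures.append(f"{field}: value not in valid entries")
        if "pattern" in meta and not re.fullmatch(meta["pattern"], value):
            failures.append(f"{field}: malformed value")
    return failures
```

Because the checks are generated from metadata rather than written by hand, tightening a field definition in the Specification automatically tightens the corresponding implicit rule.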
Feedback to users at all levels of quality assurance
Validation failures do not necessarily prevent further processing from taking place. This allows validation results to be calculated and displayed even if other fields are invalid.
The table below shows what processing is performed and what results are displayed depending on the validation state.
Please note that the three levels of implicit rules processing may not be delivered in the first release of the system – this is under discussion at the time of writing.
Explicit validation rules
- A validation rule is defined as an expression operating on a row of data, indicating if that row is valid. An ‘Applicable To’ expression defines the population on which to run the rule, and is used to calculate the percentage of failing rows.
- Each rule defines a default tolerance percentage and/or a default tolerance row count plus an override approver role. If the number of failing rows is above the tolerance row count, the tolerance percentage is used to determine if the rule is in or out of tolerance.
- There is scope to use the rules to tighten tolerances over time, allowing a suitable period to achieve the data quality required. For example, within two months of students starting there could be a tolerance of 20% on unknown person characteristics, falling to 10% after four months.
- The tolerances can be overridden on a per-provider basis, and the overrides can have an expiry date, with multiple overrides for the same rule and provider being allowed.
- The unexpired tolerance with the soonest expiry date is the one in force.
- A review date can be specified for the tolerance to inform the approver to re-assess the override.
- Any change in tolerances will be immediately reflected in the list of validation results.
Credibility rules
- Credibility rules are reports with a set of dimensions (one of which could be time) and a measure – where the measure is checked for credibility based on changes to that measure over time. An algorithm is used to determine credibility.
- Certain credibility reports are provided for information only, and do not perform automatic credibility checking.
- Credibility reports are grouped into chapters for convenience.
- The dataset used for a Credibility report will allow the appropriate filtering and aggregation logic to be applied.
- Comparing data between consecutive Reference periods may not be meaningful, so year-on-year comparisons are likely to be the norm.
- Comparing the partial set of data in the current period, with the full set of data from the previous Reference period, will only become meaningful once the full set of data has been received.
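One possible shape for an automatic year-on-year credibility check is sketched below. The threshold and the algorithm itself are illustrative assumptions; the actual algorithm applied to each measure is not specified here.

```python
def measure_is_credible(current: float, previous: float,
                        max_change_pct: float = 25.0) -> bool:
    """Treat a measure as credible when its year-on-year movement
    stays within a tolerated percentage change of the prior value."""
    if previous == 0:
        return current == 0  # no baseline: only an unchanged zero passes
    change_pct = abs(current - previous) / previous * 100.0
    return change_pct <= max_change_pct

measure_is_credible(1100, 1000)  # a 10% rise: within the threshold
measure_is_credible(2000, 1000)  # a 100% rise: flagged for review
```

A report for information only would simply display `change_pct` per cell of the dimensional breakdown without applying the threshold.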