Data Futures: Key concepts
Key concepts that underpin the new Data Futures approach and Coding manual (Data Futures) are discussed below. When Data Futures is implemented, data collection will move from one annual collection to in-year submission where it will be possible to submit data at any point during the year. The in-year submission will be grouped into three Reference periods.
A Glossary is also available to help clarify new terminology.
Please note that it has not yet been decided how these concepts and processes will be approached during the transition year.
We have taken three separate data collections, and translated them into a single collection, expressed in a sector-owned, observed data language.
Details of the entities and fields to be collected are presented in the Data dictionary in the Coding manual, which forms the core of this publication.
The Specification is a further development of the concepts and structures that have been refined over the past two years, and which are now largely stable.
A Reference period is a fixed period of time, the end of which aligns with when HESA’s statutory and public purpose customers require sector-wide data and information. The diagram below summarises the structure of a Reference period:
Key terms relating to a Reference period:
Sign-off: A formal declaration that the in-scope data submitted to HESA represents an honest, impartial, and rigorous account of the HE provider’s events up to the end of the Reference period. This is not the same as all data submitted since the last Sign-off (see question about In-scope periods below).
Sign-off occurs during the Sign-off period before the Dissemination point. It must occur at least once, but a provider’s Reference period data can be signed off as many times as the provider deems necessary until the deadline (the Dissemination point) is reached. Data must be signed off by the head of the HE provider, normally the Vice-Chancellor or Principal.
Dissemination point: The specified date, following the end of a Reference period, by which signed-off data will be extracted and supplied to HESA's data customers. Data disseminated at the Dissemination point will be used for official accounts of the higher education provider’s activity for statistical, regulatory, and public information purposes.
There will be three Reference periods over the traditional academic year, as this common timetable reflects both the majority of activity, and the principal regulatory activities that depend on data. The flexibility of the model allows HE providers to reflect their own timetables of activity, and respect the different delivery patterns of courses with different start dates.
The Data Futures model sees data as a natural output of the HE provider’s own internal business processes and therefore submissions follow in close proximity to business events such as registration and enrolment. The Reference periods are not deterministic – the model is designed to follow the (generally annual) rhythm of course deliveries, recognising that different courses operate on different timescales. Business events occur and generate data, which is then reported to HESA in the Reference period in which they occur. The collection system is always open and HE providers can continuously submit, quality assure, and view consolidated data using the Data Futures platform throughout the year. A suite of quality assurance and sign-off activities enables us to provide the sector with reliable, comparable, and consistent in-year information.
The diagram below illustrates how Reference periods will work. Please note that it is presented only to illustrate the model and does not imply that any decisions on timings have been agreed:
The diagram illustrates that when one Reference period closes the next one immediately begins. This means that the collection system is always open, even during each Reference period’s sign-off stage.
During a Reference period, data will be submitted following business events such as registration and enrolment; if required, updates to this data can then be made in the following Reference periods. In effect, the work of submitting data to HESA becomes one of continual submission of changes (in the form of new and updated data) and quality assurance is spread throughout the year, punctuated by Sign-off prior to Dissemination points.
There will be only one Student record, which accumulates over time. New in-scope data is signed off after each Reference period, adding to the single longitudinal dataset (and, where exception processing occurs, rectifying previous errors or omissions).
Data will always be validated in the context of a Reference period. Each Reference period has a Specification which is in force for that Reference period, and the Specification defines which rules to run. Any data whose In-scope period overlaps with the Reference period is considered in scope for that Reference period, and is therefore validated and included in Sign-off.
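The overlap test described above can be sketched in a few lines of Python. This is an illustration only, not the Data Futures implementation: the `Period` class, the record keys, and all dates are hypothetical, and treating both ends of a period as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Period:
    """A date interval; both start and end are treated as inclusive (assumption)."""
    start: date
    end: date


def overlaps(a: Period, b: Period) -> bool:
    """Return True if the two intervals share at least one day."""
    return a.start <= b.end and b.start <= a.end


# Hypothetical Reference period: no decisions on real timings are implied.
reference_period = Period(date(2025, 8, 1), date(2025, 11, 30))

# Hypothetical In-scope periods for three records.
in_scope_periods = {
    "enrolment-A": Period(date(2025, 9, 15), date(2026, 6, 30)),  # starts inside
    "enrolment-B": Period(date(2024, 9, 15), date(2025, 7, 31)),  # ends before
    "enrolment-C": Period(date(2025, 12, 1), date(2026, 6, 30)),  # starts after
}

# Records whose In-scope period overlaps the Reference period are validated
# in that Reference period and considered in scope for Sign-off.
in_scope = sorted(
    key for key, period in in_scope_periods.items()
    if overlaps(period, reference_period)
)
print(in_scope)  # ['enrolment-A']
```

Only the first record is in scope here, because it is the only one whose In-scope period shares any days with the hypothetical Reference period.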
This approach will result in a process of continuous quality assurance: data is accepted, and feedback on its current quality state is returned rapidly. As illustrated by the diagram below, each processing cycle passes through a number of stages. The stages are indicated by the coloured boxes (see legend at top of diagram) inside each cycle.
Methods for assuring data quality
There are three methods for assuring the quality of data:
Implicit validation rules – Metadata defined in the Specification describing intrinsic constraints on the data.
Explicit validation rules – Explicit rules for indicating if a row of data is valid or not. An ‘Applicable To’ expression defines the population on which to run the rule, and is used to calculate the percentage of failing rows. These rules encompass the current concepts of Exceptions, Warnings, and Continuity Rules, but they are implemented in a generic fashion.
Credibility rules – Rules that slice and dice the data within a dataset and apply an algorithm in order to assess the credibility of the dataset as a whole, e.g. an automatic year-on-year assessment of credibility.
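As a rough illustration of how an explicit validation rule and its ‘Applicable To’ expression might interact, the following Python sketch computes the percentage of failing rows. The function `failing_percentage`, the row layout, and the field names are hypothetical and are not taken from the Specification.

```python
from typing import Callable, Iterable

Row = dict


def failing_percentage(
    rows: Iterable[Row],
    applicable_to: Callable[[Row], bool],
    is_valid: Callable[[Row], bool],
) -> tuple[int, int, float]:
    """Return (failing rows, population size, percentage of population failing)."""
    # The 'Applicable To' expression selects the population for the rule.
    population = [row for row in rows if applicable_to(row)]
    # The rule expression is then evaluated row by row within that population.
    failing = [row for row in population if not is_valid(row)]
    percentage = 100.0 * len(failing) / len(population) if population else 0.0
    return len(failing), len(population), percentage


# Hypothetical rows: the rule applies to undergraduate rows only and requires
# an entry qualification to be present.
rows = [
    {"level": "UG", "entry_qualification": "A-level"},
    {"level": "UG", "entry_qualification": None},
    {"level": "UG", "entry_qualification": "BTEC"},
    {"level": "UG", "entry_qualification": "A-level"},
    {"level": "PG", "entry_qualification": None},
]

failing, population, pct = failing_percentage(
    rows,
    applicable_to=lambda row: row["level"] == "UG",
    is_valid=lambda row: row["entry_qualification"] is not None,
)
print(failing, population, pct)  # 1 4 25.0
```

The postgraduate row is outside the population, so it neither fails the rule nor counts towards the denominator.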
Implicit validation rules
The results of the implicit validation rule failures are shown to the user alongside the explicit validation rule failures.
The table below lists the properties that can be defined in the metadata that are used to create implicit validation rules.
The level column indicates which one of the three levels of implicit rules the property falls into. This determines the extent to which data can be processed and assured.
Feedback to users at all levels of quality assurance
Validation failures do not necessarily prevent further processing from taking place. This allows validation results to be calculated and displayed even if other fields are invalid.
The table below shows what processing is performed and what results are displayed depending on the validation state.
Please note that the three levels of implicit rules processing may not be delivered in the first release of the system – this is under discussion at the time of writing.
Explicit validation rules
- A validation rule is defined as an expression operating on a row of data, indicating if that row is valid. An ‘Applicable To’ expression defines the population on which to run the rule, and is used to calculate the percentage of failing rows.
- Each rule defines a default tolerance percentage and/or a default tolerance row count plus an override approver role. If the number of failing rows is above the tolerance row count, the tolerance percentage is used to determine if the rule is in or out of tolerance.
- There is scope to use the rules to tighten tolerances over time, allowing a suitable period to achieve the data quality required. For example, the tolerance on unknown person characteristics could be 20% within two months of students starting, 10% after four months, and so on.
- The tolerances can be overridden on a per-provider basis, and the overrides can have an expiry date, with multiple overrides for the same rule and provider being allowed.
- Where several overrides exist, the unexpired one with the soonest expiry date is the one in force.
- A review date can be specified for the tolerance to inform the approver to re-assess the override.
- Any change in tolerances will be immediately reflected in the list of validation results.
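The tolerance behaviour for explicit validation rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's logic: the `Tolerance` class and both function names are hypothetical, a missing row count is treated as zero, a missing percentage as no percentage allowance, an override is assumed to replace the whole default tolerance, and an override without an expiry date is assumed never to expire.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Tolerance:
    row_count: Optional[int] = None     # failing rows allowed outright
    percentage: Optional[float] = None  # failing percentage allowed
    expiry: Optional[date] = None       # None means no expiry (assumption)


def tolerance_in_force(default: Tolerance, overrides: list[Tolerance], today: date) -> Tolerance:
    """Pick the unexpired override with the soonest expiry, else the default."""
    unexpired = [o for o in overrides if o.expiry is None or o.expiry >= today]
    if not unexpired:
        return default
    # Overrides without an expiry date sort after all dated ones.
    return min(unexpired, key=lambda o: (o.expiry is None, o.expiry or date.max))


def in_tolerance(failing: int, population: int, tolerance: Tolerance) -> bool:
    """Apply the row-count test first, then fall back to the percentage test."""
    if failing <= (tolerance.row_count or 0):
        return True
    if tolerance.percentage is None or population == 0:
        return False
    return 100.0 * failing / population <= tolerance.percentage


# Hypothetical default tolerance and per-provider overrides for one rule.
default = Tolerance(row_count=5, percentage=10.0)
overrides = [
    Tolerance(row_count=5, percentage=20.0, expiry=date(2025, 12, 31)),
    Tolerance(row_count=5, percentage=15.0, expiry=date(2026, 3, 31)),
    Tolerance(row_count=5, percentage=30.0, expiry=date(2025, 6, 30)),  # already expired
]

current = tolerance_in_force(default, overrides, today=date(2025, 11, 1))
print(current.percentage)                # 20.0
print(in_tolerance(30, 200, current))    # True: 15% failing is within 20%
print(in_tolerance(30, 200, default))    # False: 15% failing exceeds 10%
```

Chaining overrides with successive expiry dates is one way to model tolerances that tighten over time: as each override expires, the next one (or the default) comes into force.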
Credibility rules
- Credibility rules are reports with a set of dimensions (one of which could be time) and a measure – where the measure is checked for credibility based on changes to that measure over time. An algorithm is used to determine credibility.
- Certain credibility reports are provided for information only, and do not perform automatic credibility checking.
- Credibility reports are grouped into chapters for convenience.
- The dataset used for a Credibility report will allow the appropriate filtering and aggregation logic to be applied.
- Comparing data between consecutive Reference periods may not be meaningful, so year-on-year comparisons are likely to be the norm.
- Comparing the partial set of data in the current period with the full set of data from the previous Reference period will only become meaningful once the full set of data for the current period has been received.
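A year-on-year credibility check of the kind described above might look like the sketch below. The credibility algorithm itself is not specified here, so the percentage-change threshold, the function `year_on_year_flags`, and the subject headcounts are all assumptions made for illustration.

```python
def year_on_year_flags(
    current: dict[str, int],
    previous: dict[str, int],
    max_change_pct: float = 25.0,
) -> dict[str, bool]:
    """Flag each dimension value whose measure changed by more than max_change_pct."""
    flags = {}
    for key in sorted(set(current) | set(previous)):
        now, before = current.get(key, 0), previous.get(key, 0)
        if before == 0:
            # No baseline to compare against: flag any non-zero value for review.
            flags[key] = now != 0
        else:
            change = 100.0 * abs(now - before) / before
            flags[key] = change > max_change_pct
    return flags


# Hypothetical headcounts by subject for the same Reference period
# in two consecutive years (dimension: subject, measure: headcount).
previous = {"Biology": 200, "History": 120, "Physics": 80}
current = {"Biology": 210, "History": 60, "Physics": 84, "Law": 40}

flags = year_on_year_flags(current, previous)
print(flags)
# {'Biology': False, 'History': True, 'Law': True, 'Physics': False}
```

Comparing the same Reference period across years, rather than consecutive Reference periods, keeps the two populations at a similar stage of completeness.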