
Quality Assurance: Approach

This page explains why quality assurance (QA) is so important in a shared collection service, the issues with the current approach, the vision for the future, and how that vision could be realised when moving to an in-year collection.

It should be read alongside QA: Implementation, which sets out the logical flow of how data is proposed to be collected, integrated and output.

What is the role of QA in a shared collection system?

  • Data quality is one of the critical success factors for the collection and onward use of provider data
  • The move to an in-year collection must deliver this quality in a cost-effective and sustainable way
  • The value of a shared collection service lies primarily in the comparability and utility of a robustly quality-assured dataset, across all providers, serving many customers.

How does QA operate within the current collection system?

  • There is a ‘One size fits all’ set of quality rules
    • Many of which are triggered unnecessarily
  • However, this is necessary to deliver a high quality dataset
    • The feedback is rich but the collection systems are not joined up
      • Three systems: Data collection system, Minerva, email
      • Non-integrated output: Questions are asked in Minerva but data is elsewhere
    • Many manual interventions in the QA process
    • Process of 'what happens next' is not always clear

What is the vision for QA for future collection systems?

  • Quality is driven by outputs, not inputs: quality needs to be high enough – but no higher – to support the purpose for which the data is collected
  • Setting and maintaining the appropriate level of quality is material to the management of burden
  • Rules should be configured to suit the provider profile
  • These rules should be written in plain English so they are easy to navigate and understand
  • Providers can view quality issues from early submission – and begin to respond to them
  • High levels of automation allow tolerances to adjust to early submission of data
  • This approach should lead to very few cases where submission is rejected
  • Persistent tolerances carried through lifecycle of the student
  • Tolerance approach to all quality issues
    • No longer binary ‘errors’ and ‘warnings’; instead, tolerances are built into the concept of a quality profile. Signposting, errors and warnings will be based on context and previous data
    • May be tailored on a per-provider basis (this will require some kind of baselining activity)
    • Significant work on business rules is needed to clearly define the difference between errors and warnings and the response required
  • Quality issues should be expressed in plain English, and with relative priority based on severity
  • Customers should take a more risk-based approach to quality assurance so ‘fitness for purpose’ will be a dialogue between customer and provider
  • Shared metrics to measure quality assurance, improving data quality to the level required for outputs.
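
A minimal sketch of how a tolerance-based quality profile might replace binary errors and warnings. All names here – `QualityRule`, the severity levels, the doubling thresholds, the provider baseline – are illustrative assumptions, not part of any agreed design:

```python
from dataclasses import dataclass

@dataclass
class QualityRule:
    """A plain-English rule with a per-provider tolerance (hypothetical)."""
    name: str
    description: str   # plain-English explanation shown to providers
    tolerance: float   # proportion of records allowed to trigger the rule

def assess(rule: QualityRule, triggered: int, total: int) -> str:
    """Grade an issue by how far it exceeds the provider's tolerance,
    rather than rejecting the submission outright."""
    rate = triggered / total if total else 0.0
    if rate <= rule.tolerance:
        return "ok"
    if rate <= rule.tolerance * 2:
        return "signpost"   # early visibility, no action required yet
    if rate <= rule.tolerance * 5:
        return "warning"    # should be resolved before the reference period
    return "error"          # outside any plausible tolerance

# A provider whose baseline allows 2% of postcodes to be unresolved:
rule = QualityRule("postcode", "Postcode could not be matched to reference data", 0.02)
print(assess(rule, triggered=3, total=1000))    # within tolerance
print(assess(rule, triggered=150, total=1000))  # far outside tolerance
```

Because the thresholds are data rather than code, a baselining activity could set them per provider, and a long-term tolerance change is simply a persisted `tolerance` value rather than a rule rewrite.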

What benefits will be realised?

  • Greater and earlier visibility of quality issues
  • Management and control in the right hands
    • Providers and other stakeholders able to manage requests awaiting their approval
    • Providers and other stakeholders may approve tolerance changes for an appropriate period of time and level, or reject them
    • Providers are not reliant on HESA asking questions about their data
    • More accountability for quality is with the provider
  • Direct routing of tolerance requests from provider to appropriate stakeholder
  • Quality issues in plain English to help remove barriers
  • Burden reduction
    • Automation will lead to a more consistent approach
    • Signposting and warnings prioritised by ‘urgency’ and ‘impact’
    • Focus effort on where it’s needed, e.g. long term tolerance changes don’t need to be revisited
  • Better understanding of the requirements for the quality of the data based on the quality profile
  • Better use of the knowledge held within the HESA data to improve the service

How will this work in practice?

In Quality Assurance: Implementation we describe how the QA process is split between ‘input’, ‘integration’ and ‘output’. In this section we discuss how the six quality dimensions are applied to support this approach.

The proposed data quality profile is shown below:

First we should consider how these quality dimensions are deployed for in-year collection. These are contextualised from standard best-practice definitions, which can be found here.

The schematic above shows 'tolerances' for each dimension, bounded by a 'best practice' date and a 'last submission date'.

Timeliness - an event, such as enrolment, should be returned as close as possible to the point at which the provider feels it is stable, and before the last submission date defined by the first reference period or output it is required for. Early submission will trigger the quality checks and analytics in the in-year QA engines, thereby driving up the quality to required output levels in the pre-delivery environment. Obviously 'stable' is a subjective term, and data will have varying levels of entropy between the capture point and the first use case in the output QA cycle.

So in the schematic above, we are using timeliness almost as an 'x-axis'. Other dimensions fit within this x-axis as explained below.

Completeness - This should be considered as defined coverage based on known outputs which are harmonised to a reference period. This is the minimum requirement; over-coverage is not an issue and may provide further analytical outputs or early sight of issues for later outputs. During the Detailed Design phase, there may be an opportunity to refine the definition of complete.

The current view is that HESA should provide 'early warning' and 'signposted feedback' based on the level of completeness. Input will not be 'policed' beyond basic schema rules, and even these will be 'soft' against almost all items, so that submissions are not rejected for other quality dimensions (e.g. a postcode incorrectly coded).

Once data has passed input QA, it will be assessed for 'logical completeness' for the outputs at the next reference period.
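
As a sketch of what 'soft' input QA might look like, the following accepts a submission regardless of item-level failures and returns the failures as signposted feedback instead. The field names, the postcode pattern and the `mode` code list are illustrative assumptions only:

```python
import re

# Hypothetical 'soft' schema rules: a submission is accepted even when
# individual items fail, and the failures become feedback, not rejections.
SOFT_RULES = {
    "postcode": lambda v: bool(re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}", v or "")),
    "mode": lambda v: v in {"FT", "PT"},  # illustrative code list
}

def input_qa(record: dict) -> dict:
    """Accept the record, collecting soft-rule failures as signposted feedback."""
    feedback = [f"{field}: value {record.get(field)!r} failed soft check"
                for field, check in SOFT_RULES.items()
                if not check(record.get(field))]
    return {"accepted": True, "feedback": feedback}

result = input_qa({"postcode": "not a postcode", "mode": "FT"})
print(result["accepted"])   # True - the submission is not rejected
print(result["feedback"])   # the postcode issue is signposted instead
```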

Validity - submitted data will mainly be validated against reference data, potentially supported by machine-learned tolerances which could be provider-specific. The context of the submitted data will be assessed for validity based on the time and place in the academic cycle, using business rules that are cognisant of previously submitted data. The tolerances for validity will change depending on the completeness of the submitted data.

Uniqueness - A number of logical tests can be applied to ascertain if a record is unique, when it is required to be. This could be based on multiple and / or a hierarchy of personal IDs. Fuzzy matching is planned to be run as data is submitted triggering confidence reports around continuity. There needs to be further discussion with the sector on the action taken when a match is found and what tolerances could be applied.
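
A fuzzy-matching confidence score of the kind described might be sketched as follows. The ID hierarchy (`husid`, `ssn`), the fields compared and the weightings are all hypothetical; the sector discussion on thresholds and actions is still to happen:

```python
from difflib import SequenceMatcher

def match_confidence(a: dict, b: dict) -> float:
    """Crude confidence that two student records describe the same person,
    combining exact ID matches with fuzzy name similarity (illustrative only)."""
    # An exact match on any shared personal ID in the hierarchy is decisive.
    for id_field in ("husid", "ssn"):   # hypothetical ID hierarchy
        if a.get(id_field) and a.get(id_field) == b.get(id_field):
            return 1.0
    # Otherwise fall back to fuzzy matching on name plus date of birth.
    name_sim = SequenceMatcher(None, a.get("name", ""), b.get("name", "")).ratio()
    dob_match = 1.0 if a.get("dob") == b.get("dob") else 0.0
    return 0.7 * name_sim + 0.3 * dob_match

existing = {"name": "Jane Smith", "dob": "2001-04-09"}
incoming = {"name": "Jane Smyth", "dob": "2001-04-09"}
conf = match_confidence(existing, incoming)
print(round(conf, 2))  # high confidence -> flag for review rather than auto-merge
```

Running the confidence score as data is submitted, rather than at a census point, is what allows continuity reports to be triggered early.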

Consistency - These are the business rules which check the consistency with previous data items sharing the same primary keys. The cohorts for various outputs potentially could be supplied as part of these rules. Clearly there needs to be a simple way to interpret these rules based on what is held within the data dictionary.  Consistency in this context is essentially the student journey profiled over time. We see visualisation tools being extremely valuable here to identify outliers, etc.
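
Two consistency rules of this kind, checking a new item against previously submitted data sharing the same primary key, might look like the following. Both rules, and the field names, are invented for illustration:

```python
from datetime import date

def consistency_issues(previous: dict, current: dict) -> list[str]:
    """Check a new item against earlier data for the same student (illustrative)."""
    issues = []
    # A student cannot start a new engagement before the previous one ended.
    if current["start_date"] < previous["end_date"]:
        issues.append("start_date precedes end of previous engagement")
    # A reported continuation must keep the same course unless a transfer
    # was also reported (hypothetical rule).
    if current["status"] == "CONTINUING" and current["course"] != previous["course"]:
        issues.append("course changed without a reported transfer")
    return issues

prev = {"end_date": date(2023, 6, 30), "course": "H100", "status": "ACTIVE"}
curr = {"start_date": date(2023, 9, 1), "course": "H200", "status": "CONTINUING"}
print(consistency_issues(prev, curr))
```

Rules expressed this way are small, testable and traceable back to data dictionary definitions, which supports the 'simple way to interpret these rules' requirement.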

Accuracy – This is primarily defined by the provider context. Therefore the responsibility for meeting the criteria for this quality dimension lies primarily with the providers. HESA may be able to run some predictive calculations on expected accuracy and confidence.

The quality profile schematic uses these definitions to visually represent the tolerances. The logic is that input (schema) QA is primarily defined by validity and completeness, integration QA is focussed on quality issues around uniqueness and consistency, and accuracy sits closest to the output use case. This is summarised in the table below:

              Input         Integration                   Output
  Data Role   Operational   Management Information        Use Case
  QA Role     Schema        Credibility and Consistency   Fitness for purpose
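
The three stages in the table can be sketched as a pipeline. Everything here – the stage functions, field names and the `fit_for_purpose` flag – is an illustrative assumption about how the stages might compose, not a design commitment:

```python
def input_stage(record):
    """Input QA: soft schema checks only; never rejects the record."""
    record.setdefault("issues", [])
    if not record.get("student_id"):
        record["issues"].append("input: missing student_id")
    return record

def integration_stage(record, previously_seen):
    """Integration QA: credibility and consistency against earlier data."""
    if record.get("student_id") in previously_seen:
        record["issues"].append("integration: possible duplicate record")
    return record

def output_stage(record, required_fields):
    """Output QA: is the record fit for purpose for this specific output?"""
    missing = [f for f in required_fields if f not in record]
    record["fit_for_purpose"] = not missing and not record["issues"]
    return record

rec = input_stage({"student_id": "S1", "course": "H100"})
rec = integration_stage(rec, previously_seen=set())
rec = output_stage(rec, required_fields=["student_id", "course"])
print(rec["fit_for_purpose"])  # True
```

The point of the composition is that quality issues accumulate as context grows: nothing is rejected at input, and 'fitness for purpose' is only decided at output, per use case.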



Q: How do resubmissions work?

A: Data can be resubmitted under two scenarios: 1) the submitted data was incorrect – for example, a student coded onto the wrong course – or 2) the submitted data has been updated – for example, additional or more accurate data about one or more students. Business rules will be developed to ensure these resubmissions are reflected throughout the QA process.
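
The distinction between the two scenarios could translate into two different update semantics, sketched below. The store structure and function are hypothetical; the actual business rules are still to be developed:

```python
# Illustrative handling of the two resubmission scenarios: a correction
# replaces the earlier item outright, while an update is merged on top of it.
def apply_resubmission(store: dict, key: str, data: dict, scenario: str) -> dict:
    if scenario == "correction":
        store[key] = dict(data)                       # incorrect data: replace wholesale
    elif scenario == "update":
        store[key] = {**store.get(key, {}), **data}   # additional/more accurate data
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    # Either way, QA would be re-run against the resubmitted record.
    return store[key]

store = {"S1": {"course": "H100", "postcode": "AB1 2CD"}}
apply_resubmission(store, "S1", {"course": "H200"}, "correction")
print(store["S1"])   # the old postcode is gone: the record was replaced
apply_resubmission(store, "S1", {"postcode": "AB1 2CD"}, "update")
print(store["S1"])   # the update is merged onto the corrected record
```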

Q: How do deletions work?

A: A deletion of a course, a student, etc. will remove the item from the QA process and allow all or any validation to be rerun. If this materially affects an output that has already been run, that output can be recreated. However, the process by which the customers of this data will consume data that has passed the reference period for which it was targeted has yet to be designed, as it is outside the scope of the Collection Design project, and will be considered in the Data Futures Detailed Design phase.

Q: How will continuity rules work in first year as previously submitted data on the 'old' system won't be available?

A: This is out of scope for the Collection Design project, but the current thinking is that the in-year collection will be seeded with data submitted for the current return and/or tolerances will be relaxed to allow continuity issues to pass.

Q: How will the sign-off process work?

A: There is still work to do here. The proposal is that the reference period will be a sign-off point for multiple customer outputs. Who signs this off will, to some extent, depend on the provider and the statutory requirement. We do not currently envisage sign-off points for input or integration QA, only for when the data is transformed into an output.