Collection Design: V3 Implementation approach

Please note, version 3 pages are still available for your reference, but we have disabled links to avoid confusion: please refer to the Data Futures Resources page for project documents for your consideration. 

The concept of segments shown below was presented as a way of understanding the new collection schedule, but it has not been carried forward to the detailed design phase or to implementation and transition.

A number of points have come across very strongly in the feedback to the version 2 design. On this page we have asked the appropriate project team member to respond in detail. The scope of that response is to explain the design rationale behind making these choices, and how this rationale affects the wider design.

"You’ve made a number of references that indicate your approach has been to take the current collection mandate and translate it to be collected as in-year data, but we see new data items in the specification. Why and how is this justified?"

Our approach has been, as far as possible, to translate the current mandate into new terms defined in the Data Language. What do we mean by this? The collection mandate, or scope, has been rigidly defined for us: only the current HESA Student, AP_Student and ITT-in-year collections are in scope for the Collection Design project. The mandate for collection can be obtained from the coverage statements (at collection-, entity- and field-levels) and in the Reason required metadata (at field- and entity-levels). Notes fields are also instructive, when creating an appropriate ‘translation’.

It’s worth spelling out what we mean by translation.

First, what are we translating from and to? The ‘to’ part is easy: the HE Logical Model is written in the HE Data Language, which was constructed from observation – the Higher Education Data and Information Improvement Programme (HEDIIP) team observed the concepts that are of interest in terms of student data, and the way these concepts are linked to each other. While every HE provider is different, some core concepts apply universally (providers offer courses, which students register on, and through a process of engagement with study and academic progression over time, attain an award or other outcome, and eventually leave). The Logical Model, and the Segments, reflect these universal student processes. By contrast, the language we are translating from (i.e. the current HESA Student, AP_Student and ITT-in-year collections) represents an accumulation of concepts that relate to both the student journey, and the way it is funded and monitored. These have accumulated over time in response to the needs of the sector and its funders/regulators. There is much of value here, but its syntax and strictures reflect a set of concerns that are not native to the business of higher education.

Second, how does the process of translation work? We start by taking an attribute of the Logical Model, and asking “does this have an analogue in either the Student, AP_Student or ITT-in-year collections?” If there is an analogue, then we have marked it for collection. If not, then we have placed it out of scope. This leads to some slightly odd effects. Much of what is logically part of Segment 0: Organisations and Venues is part of HESA Provider Profile or other similar collections, and hence out of scope for collection under Data Futures – at least initially. Where there is an analogue, we have had to determine how this could be produced under Data Futures. Here we have another set of translation decisions: how closely related are the terms in the two languages? If the concepts are expressed in similar ways, then the job is easy – a field like ELQ is almost identical in both languages. Where the expression is very different, then we have a more complex job to do – we have to determine an approach that is faithful to the mandate, but express this as economically as possible in the new language, without resorting to overly cumbersome devices. It is worth looking at two examples to show how we have approached this in two different cases, STULOAD and MSTUFEE.
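Before turning to those two examples, the scoping rule described above can be sketched in a few lines of Python. This is only an illustration: the `ANALOGUES` mapping, the `Venue.VenueName` attribute, the `VENUENAME` field and the function name are hypothetical, and only the StudentLoad/STULOAD pairing comes from this page.

```python
# Collections that are in scope for the Collection Design project.
IN_SCOPE = {"Student", "AP_Student", "ITT-in-year"}

# Hypothetical mapping: Logical Model attribute -> (collection, field) analogues.
ANALOGUES = {
    "StudentCourseSession.StudentLoad": [("Student", "STULOAD")],
    # Illustrative only: an attribute whose sole analogue is in Provider Profile.
    "Venue.VenueName": [("Provider Profile", "VENUENAME")],
}


def scope_attribute(attribute, analogues=ANALOGUES):
    """Mark an attribute for collection only if it has an in-scope analogue."""
    matches = [
        (collection, field)
        for collection, field in analogues.get(attribute, [])
        if collection in IN_SCOPE
    ]
    return "collect" if matches else "out of scope"


print(scope_attribute("StudentCourseSession.StudentLoad"))
print(scope_attribute("Venue.VenueName"))
```

The sketch covers only the first decision (in scope or not). Deciding how a field with an analogue should be expressed in the new language is a design judgement that is not captured here.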

Instance.STULOAD has a reasonable analogue in StudentCourseSession.StudentLoad, and providers are welcome to supply this field based on their own calculations. However, many providers have criticised STULOAD as being derivable from MODOUTs. Other providers have rightly identified edge cases where this will not work. In translation, we have chosen to retain StudentLoad for collection, but have also defined a derived field (XSTULOAD) that produces a STULOAD-like value from other data (including ModuleInstance.Proportion). A provider *might* choose to use this value in submitting StudentLoad – but they don’t have to if it doesn’t match the way their programmes work. So in this case there is one standard, and two ways of meeting it.
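The "one standard, two ways of meeting it" idea might look something like the following sketch. The derivation rule used here (summing module proportions and expressing the total as a percentage) is an assumption made for illustration. The actual XSTULOAD derivation is not specified on this page and draws on other data as well. The function names are hypothetical.

```python
from typing import Optional, Sequence


def derive_xstuload(module_proportions: Sequence[float]) -> float:
    """Assumed derivation: sum of ModuleInstance.Proportion values, as a percentage."""
    return round(sum(module_proportions) * 100, 1)


def student_load(
    provider_value: Optional[float], module_proportions: Sequence[float]
) -> float:
    """Return the provider's own StudentLoad if supplied, else the derived value."""
    if provider_value is not None:
        return provider_value
    return derive_xstuload(module_proportions)


# A provider whose programmes do not fit the derivation supplies its own figure.
print(student_load(60.0, [0.25, 0.25]))
# A provider content with the derived figure supplies nothing extra.
print(student_load(None, [0.25, 0.25]))
```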

MSTUFEE is very different. We know from our Statutory Customers that fees and funding information is of great interest. MSTUFEE is an obviously calculated field. Such fields were, as far as possible, purged from the Data Language. They are burdensome to produce with any accuracy, and the provider is unlikely to consume the data in this format for their own purposes. Instead, the Data Language identifies simpler concepts, such as the Fee Invoice Amount. The translation was going to be difficult, however we chose to approach it. In the end, we went back to the reason why MSTUFEE is required: “To provide understanding of the various sources of student fees and the extent to which various bodies are supporting students through payment of their fees.” It’s a legitimate aim, and not necessarily best served by the way that MSTUFEE is currently constructed, which takes the granular information of who is paying the fees for students, and looks for the modal funding organisation for each course, which is then mapped on to a typology owned by HESA. Someone is having to do this heavy lifting of extracting, transforming and loading the data for HESA. It could be being done in the Student Finance Office, or Registry, or in Planning, and it could be being supported by a software provider, or not. But one way or another, the work is getting done.

The proposed model does away with this work. HE providers return the amount and which organisation was invoiced for it – information that we know is held in systems. We will be able to consume a wide range of identifiers, and provide an open register of organisations drawn from Companies House, the Charity Commission, UK Register of Learning Providers (UKRLP), the DfE, the NHS, and others, in the form of the HESAID. This will mean that we can either match the data HE providers are able to provide, or provide a look-up facility that can be fed directly into systems via an API (or both). With this in mind, even missing data is easier to tolerate, because to get an XMSTUFEE value, less than 100% valid data is required for a course. While there is undoubtedly a cost to change, the steady state is simpler and closer to actual processes, and the end data even in its raw state more fully meets the reasons why MSTUFEE is required than MSTUFEE currently can. In some sense, therefore, XMSTUFEE demonstrates perfectly the complexity and burden of the current arrangement, and its severe limitations in terms of producing useful information. This is one area where we have reasoned that the best translation requires more granular and simple data than is currently required.
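As a rough sketch of how an XMSTUFEE-like value could be derived from invoice-level data, consider the Python below. Several things are assumptions for illustration: the identifiers, the category labels, the record layout, and the choice to take the mode by number of invoices rather than by amount. None of these are taken from the specification.

```python
from collections import Counter

# Hypothetical register extract: organisation identifier -> typology category.
ORG_CATEGORY = {
    "HESAID-0001": "student loans body",
    "HESAID-0002": "employer",
}


def derive_xmstufee(invoices):
    """Return the modal payer category for a course, ignoring unmatched invoices."""
    categories = [
        ORG_CATEGORY[inv["org_id"]]
        for inv in invoices
        if inv.get("org_id") in ORG_CATEGORY
    ]
    if not categories:
        return None  # nothing could be matched, so no value is derived
    return Counter(categories).most_common(1)[0][0]


course_invoices = [
    {"org_id": "HESAID-0001", "amount": 9000},
    {"org_id": "HESAID-0001", "amount": 9000},
    {"org_id": "HESAID-0002", "amount": 4500},
    {"org_id": None, "amount": 9000},  # missing payer: tolerated, not fatal
]
print(derive_xmstufee(course_invoices))
```

The invoice with no matched organisation is simply skipped, which reflects the point above that a derived value does not need 100% valid data for a course.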

"At points the data dictionary will appear to deviate from the structures identified in the collection specification models; the question to answer is: ‘Why and how is this justified?’"

This is entirely expected, and is due to optimising the collection for ease of data supply. For example, in S0 – Organisations and Venues, the data dictionary shows that all that is needed to identify the submitting provider is a UKPRN, whereas the corresponding segment model doesn’t have UKPRN as an attribute. What is actually happening here is that, upon supply of a UKPRN, the system will match this to the appropriate ‘Organisation Identifier.Identifier’ with an OrgTypeID of UKPRN. So what looks like a single field in the dictionary utilises a wider set of entities in the model. This approach may also be developed in other areas – for example, S6 – Funding and Monitoring could use it to allow providers to return identifiers other than the HESAID when returning information about organisations involved in funding (for example a student’s employer). Whilst these other opportunities have not been modelled in this version, they may be introduced through Data Futures.
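The matching step can be pictured with a minimal Python sketch. The record layout, the register contents, the identifier values and the function name are all hypothetical. Only the idea of matching a supplied value against an identifier record with an OrgTypeID of UKPRN comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrganisationIdentifier:
    org_type_id: str  # the kind of identifier, e.g. "UKPRN"
    identifier: str   # the identifier value itself
    hesa_id: str      # the organisation this identifier belongs to


# Hypothetical register entries; real identifiers are not reproduced here.
REGISTER = [
    OrganisationIdentifier("UKPRN", "10000001", "HESAID-0001"),
    OrganisationIdentifier("CompaniesHouse", "01234567", "HESAID-0002"),
]


def resolve_organisation(identifier: str, org_type_id: str = "UKPRN") -> Optional[str]:
    """Match a supplied identifier to an organisation, or return None."""
    for record in REGISTER:
        if record.org_type_id == org_type_id and record.identifier == identifier:
            return record.hesa_id
    return None


print(resolve_organisation("10000001"))
print(resolve_organisation("01234567", org_type_id="CompaniesHouse"))
```

Because the identifier type is a parameter, the same lookup could in principle accept other identifier types, which is the extension suggested above for S6 – Funding and Monitoring.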

"Why is there so little information in S0 - Organisations and Venues, and what does this mean for Provider Profile?"

The scope of the Collection Design project, and more broadly of Data Futures, is limited to the Student, AP and ITT collections. This means that where dependencies occur between these datasets and those that are not in scope, the collection design proposal will look like it is missing data. The most notable example of this is the data currently collected within Provider Profile, which is used to add context to the student returns. The long-term ambition is to have all these collections running on the same platform, but this does mean there will be a transition period where that is not the case. Data Futures is aware of these dependencies and will be working them out through detailed design and implementation.