Quality matters: The Census and HESA Widening Participation data
HESA's Data and Innovation team use Census statistics to assess the quality of HESA's parental education and socio-economic classification data.
A primary purpose for HESA undertaking research activity, as outlined in our strategy, is to utilise external sources of information to assist with reviewing the quality of our own data. This aligns with the recommendations of the Office for Statistics Regulation (OSR), which highlights that one way to evaluate accuracy is to use other datasets where possible.
With the 2021 Census taking place in England, Wales and Northern Ireland this month, the Data and Innovation team have undertaken research to explore the potential of Census (2011) data to facilitate a further appraisal of the parental education and occupation information that HESA collects.
We hope the findings will be useful to a wide range of our key stakeholders, including widening participation (WP) practitioners within higher education providers and researchers/policymakers who utilise these two variables as part of their analysis of HESA data.
The data we hold on parental education and occupation predominantly originates from the Universities and Colleges Admissions Service (UCAS) application that prospective students are asked to submit as part of the entry process into higher education. Individuals are requested to confirm whether either of their (step-)parents or guardians holds a higher education qualification, as well as the occupation of the highest earner. The latter detail is then used to assign individuals/households into a particular group, based on the National Statistics Socio-economic Classification (NS-SEC), which subsequently offers one potential means of determining socio-economic background. [1]
WP remains a key focus of policy across the UK, with several universities using parental education and/or occupation to determine their target groups when carrying out WP activity (as evidenced, for example, by their access and participation plans). However, there continues to be uncertainty around the accuracy of the data, given the questions in the UCAS application are optional and self-reported. The Office for Students (OfS) have considered completeness through their investigation into the extent of missing data among these two variables, alongside looking at how sector and population proportions compare. [2]
We complement the work undertaken by OfS as follows:
- Using linked HESA-Census 2011 data, we provide a further assessment of the likely accuracy of these two variables.
- We explore whether there is any relationship between the amount of 'missing' data and socio-economic status (e.g. are those from more disadvantaged backgrounds less likely to supply this information?).
Our analysis suggests that individuals are reporting this information accurately, though there is some evidence that disadvantaged students are less likely to know the answer or to respond to these questions.
The Census in the UK takes place every ten years and attempts to provide a detailed picture of the entire population. Data is gathered in as consistent a manner as possible across all nations, with households obliged to supply information relating to their socio-economic background, such as the qualifications they hold and their occupation. These statistics are then aggregated at various geographical levels and made available to the general public, with the smallest level available being the output area (referred to as ‘small areas’ in Northern Ireland). An example output area report from the Census is available here.
Using this data, we developed a small UK-wide Census dataset containing the following fields:
- Output area code
- Proportion of individuals in the output area with level 4 qualifications or above (%) [3]
- Proportion of individuals in the output area classified as being in NS-SEC groups 1 or 2 (%) [4]
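As an illustration of how the NS-SEC field could be derived, the sketch below applies the formula given in the footnotes: (total in NS-SEC groups 1 or 2 / total in NS-SEC groups 1-7) * 100, with ‘not classified’ and ‘long-term unemployed or never worked’ left out of the denominator. The function name and the input layout (a dictionary of counts per group) are assumptions made for this example and do not describe HESA's actual processing; the counts are invented.

```python
def pct_nssec_1_2(counts):
    """Percentage of an output area's residents in NS-SEC groups 1 or 2.

    `counts` maps an NS-SEC group (1-7) to a number of residents. Any other
    keys (e.g. 'not classified', 'never worked') are ignored, so they are
    treated as missing rather than counted in the denominator.
    """
    total_1_to_7 = sum(counts.get(group, 0) for group in range(1, 8))
    if total_1_to_7 == 0:
        return None  # no classifiable residents, so no percentage is defined
    total_1_or_2 = counts.get(1, 0) + counts.get(2, 0)
    return 100 * total_1_or_2 / total_1_to_7


# Invented counts for one hypothetical output area.
example_counts = {1: 10, 2: 30, 3: 20, 4: 10, 5: 10, 6: 10, 7: 10,
                  "not classified": 15}
example_pct = pct_nssec_1_2(example_counts)
```

Here the 15 ‘not classified’ residents do not enter the calculation, so the result is based on the 100 residents in groups 1-7.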
Linking to HESA data
To carry out the analysis required, we use output area codes to link the compiled Census dataset to a relevant HESA data extract containing the parental education and occupation fields from the Student record. We were able to assign output areas to each individual in our database by using their postcode information. The population of interest is restricted to UK (excluding Jersey, Guernsey and the Isle of Man) domiciled full-time first degree entrants aged 20 or under at the time of starting university, with the academic years covered being 2011/12 to 2016/17. The rationale behind choosing this age group is that the occupation data UCAS gathers only relates to the parent if the individual was under 21 when beginning their degree. Please note that we also include any students who entered via a non-UCAS route in this study. While we cannot be sure that such students are asked the same questions as those that appear on the UCAS application, we find only a few percent of our total population in each academic year enrolled into university through a pathway other than UCAS and their inclusion is therefore unlikely to materially influence our conclusions.
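The linkage described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the study: the field names (`postcode`, `age_on_entry`, `pct_level4`, `pct_nssec_1_2`), the postcode-to-output-area lookup, and all of the values are hypothetical. Only the age restriction is shown; the domicile, mode and level of study filters are assumed to have been applied when the extract was taken.

```python
def normalise_postcode(postcode):
    """Remove spaces and upper-case, so 'zz1 1aa' and 'ZZ11AA' compare equal."""
    return "".join(postcode.split()).upper()


def link_to_census(students, postcode_to_oa, census_by_oa):
    """Attach output-area Census fields to each in-scope student record."""
    linked = []
    for student in students:
        if student["age_on_entry"] > 20:
            continue  # population is restricted to entrants aged 20 or under
        oa_code = postcode_to_oa.get(normalise_postcode(student["postcode"]))
        census = census_by_oa.get(oa_code)
        if census is None:
            continue  # postcode could not be matched to an output area
        linked.append({**student, "oa_code": oa_code, **census})
    return linked


# Made-up lookup tables and student records, for illustration only.
postcode_to_oa = {"ZZ11AA": "OA_X", "ZZ22BB": "OA_Y"}
census_by_oa = {
    "OA_X": {"pct_level4": 35.0, "pct_nssec_1_2": 40.0},
    "OA_Y": {"pct_level4": 12.5, "pct_nssec_1_2": 18.0},
}
students = [
    {"id": "A", "postcode": "zz1 1aa", "age_on_entry": 18},
    {"id": "B", "postcode": "ZZ2 2BB", "age_on_entry": 19},
    {"id": "C", "postcode": "ZZ1 1AA", "age_on_entry": 23},  # out of scope
    {"id": "D", "postcode": "ZZ9 9ZZ", "age_on_entry": 18},  # no match
]
linked = link_to_census(students, postcode_to_oa, census_by_oa)
```

Unmatched postcodes are simply dropped in this sketch; how such records were handled in the actual analysis is not stated here.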
With our linked HESA-Census dataset, we first partition each of the two continuous Census fields into deciles for each academic year. Decile 10 contains those areas where the proportion of individuals with a level 4 qualification or above (or, for the occupation measure, in NS-SEC groups 1 or 2) is highest, with decile 1 comprising the localities with the lowest proportions. By academic year of entry, we then cross-tabulate the HESA parental education and occupation fields against the corresponding Census decile variable. The NS-SEC field in the HESA data had to be transformed prior to doing this, so that it mimics the Census variable as closely as possible. Specifically, we begin by recoding those who fall into the ‘never worked and long-term unemployed’ or ‘not classified’ groups as missing data. Guidance from the Office for National Statistics (ONS) suggests that care should be taken over whether to amalgamate the ‘never worked and long-term unemployed’ group with other categories or to analyse it separately. In the HESA data extract, we find that approximately 0.30% of individuals are allocated to this NS-SEC category, limiting the potential to examine this group on its own. For this study, we have therefore placed them into our ‘missing information’ category in the knowledge that the addition of a tiny proportion of individuals to this group will not fundamentally modify our results. We then generate a binary marker that indicates whether or not the highest earning parent was based in an occupation that falls within NS-SEC groups 1 or 2.
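The three steps in this paragraph (recoding NS-SEC, forming deciles, cross-tabulating) might look something like the sketch below. It is illustrative only. The category codes, field names and the tie-handling rule in `assign_deciles` are assumptions, and the text does not specify exactly how the decile boundaries were drawn. The function is intended to be run separately on the records for each academic year.

```python
from bisect import bisect_left
from collections import Counter

MISSING = "missing"


def recode_nssec(group):
    """Binary marker for NS-SEC groups 1-2, with everything else outside 1-7 missing."""
    if group in (1, 2):
        return "yes"
    if group in (3, 4, 5, 6, 7):
        return "no"
    # 'never worked and long-term unemployed', 'not classified' or blank
    return MISSING


def assign_deciles(values):
    """Return a decile (1 = lowest, 10 = highest) for each value.

    Each value is placed according to how many values lie strictly below it,
    so identical values always share a decile (an assumed tie rule).
    """
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_left(ordered, v) * 10 // n + 1 for v in values]


def cross_tabulate(records, census_field, response_field):
    """Count each response within each Census decile for one academic year."""
    deciles = assign_deciles([r[census_field] for r in records])
    return Counter((d, r[response_field]) for d, r in zip(deciles, records))
```

For example, `cross_tabulate(records_2015, "pct_nssec_1_2", "nssec_1_2")` would return counts keyed by `(decile, response)`, where `records_2015` is a hypothetical list of linked records whose `nssec_1_2` field has already been produced by `recode_nssec`.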
While we are able to explore the relationship for parental education between HESA and Census data from the 2011/12 academic year, we can only begin our examination of parental occupation in 2015/16, as this was the first instance in which parental occupation in HESA records was based on SOC 2010 and therefore mirrored the coding framework used in the 2011 Census. For this piece of analysis in particular, we must therefore work under the presumption that there is not a great deal of alteration in the characteristics of individuals and households living in specific output areas over time (i.e. the Census data by output area would be very similar in 2015 and 2016 to that seen in 2011). Using the 2001 and 2011 Census, work by Rebecca Tunstall at the University of York has evaluated the extent of variation across the decade (within a lower super output area) in the proportion of residents that were based in NS-SEC groups 1 or 2. The study finds little evidence of meaningful change. Though we use a different geographic level, this research provides some support for our assumption.
One would hypothesise that as we move from decile 1 to 10 of the relevant Census variable, we should observe:
- An increase in the proportion of individuals who self-report that they have a parent who has experience of higher education.
- A rise in the percentage of individuals who self-report that their highest earning parent works in an occupation that falls within NS-SEC categories 1 or 2.
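One simple way to check a hypothesis of this kind against a cross-tabulation is sketched below. The input layout and the choice to compute the share among those answering ‘yes’ or ‘no’ (rather than among all individuals) are assumptions for illustration, and the counts are invented rather than taken from the analysis.

```python
def share_yes(counts_by_decile):
    """Percentage answering 'yes' among yes/no respondents, per decile."""
    shares = {}
    for decile, counts in sorted(counts_by_decile.items()):
        answered = counts.get("yes", 0) + counts.get("no", 0)
        if answered:  # skip deciles with no substantive responses
            shares[decile] = 100 * counts.get("yes", 0) / answered
    return shares


def rises_with_decile(shares):
    """True if the share never falls when moving from lower to higher deciles."""
    ordered = [shares[d] for d in sorted(shares)]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Invented counts for three deciles, purely to show the calculation.
toy_counts = {
    1: {"yes": 2, "no": 8},
    5: {"yes": 5, "no": 5},
    10: {"yes": 9, "no": 1},
}
toy_shares = share_yes(toy_counts)
toy_pattern_holds = rises_with_decile(toy_shares)
```

A strict monotonicity test like this is deliberately crude; small reversals between neighbouring deciles would not necessarily contradict the overall trend.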
Beginning with parental education, we see that in the first two academic years we consider (2011/12 and 2012/13), approximately 78% of individuals have responded with either a ‘yes’ or a ‘no’ to the associated question on the UCAS application, which is in line with findings reported by the OfS. We also detect the expected trend when cross-tabulating the HESA and Census variables, with the proportion of individuals who have a parent with a higher education qualification rising as we move up the deciles. A noticeable change then occurs from 2013/14 onwards, with around 85% of individuals now supplying either a ‘yes’ or a ‘no’ response, which appears to be largely driven by the decline in the proportion who refuse to provide this information. Furthermore, we continue to observe the hypothesised pattern, with the results being fairly consistent between 2014/15 and 2016/17. [5]
As with the parental education variable, we note that the extent of missing data in the HESA parental occupation field stood at approximately 15% in 2015/16 and 2016/17. Furthermore, we see that the proportion of individuals reporting that their highest earning parent is based in an occupation that sits within NS-SEC groups 1 or 2 increases as we move from decile 1 to 10, with very similar findings evident in both 2015/16 and 2016/17.
One issue that emerges from our examination is that, for both the parental education and occupation fields we collect, it is within the lower deciles of the categorical Census variables that we find a greater share of data that could be described as ‘missing’. These deciles consist of areas where the local population tends to hold fewer educational qualifications and to be employed in occupations that are more likely to be associated with lower pay. Hence, these are localities that could be considered to be more economically and socially deprived.
In the case of parental education, the ‘missing’ data seems to be driven by individuals from poorer backgrounds being less likely to know whether or not their parents hold a higher education qualification, rather than by individuals refusing to supply this information or not responding at all.
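A breakdown of this kind could be produced along the lines of the sketch below, which splits the ‘missing’ share in each decile by reason. The response labels (`dont_know`, `refused`) and the records are hypothetical stand-ins, not the coding frame or data used in the HESA Student record.

```python
from collections import Counter, defaultdict

SUBSTANTIVE = {"yes", "no"}


def missing_breakdown(records):
    """Percentage of each decile's records falling under each 'missing' reason.

    `records` is an iterable of (decile, response) pairs. Percentages are taken
    over all records in the decile, including substantive yes/no answers.
    """
    by_decile = defaultdict(Counter)
    for decile, response in records:
        by_decile[decile][response] += 1

    breakdown = {}
    for decile, counts in by_decile.items():
        total = sum(counts.values())
        breakdown[decile] = {
            reason: 100 * n / total
            for reason, n in counts.items()
            if reason not in SUBSTANTIVE
        }
    return breakdown


# Invented records, for illustration only.
toy_records = [
    (1, "yes"), (1, "no"), (1, "dont_know"), (1, "dont_know"),
    (10, "yes"), (10, "yes"), (10, "no"), (10, "refused"),
]
toy_breakdown = missing_breakdown(toy_records)
```

Comparing the reasons across deciles in this way is what allows ‘do not know’ responses to be distinguished from refusals and blanks.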
This analysis leads us to make the following overarching comments about our work:
- By 2016/17, the proportion of missing data in both the parental education and occupation fields stood at around 15%.
- The extent of ‘missing’ data is greater among those that seem to be from more disadvantaged backgrounds.
- Aggregate statistics available from the Census data have, however, helped to provide further support for the accuracy of these two variables.
The data from Census 2021 will be published over the course of the next few years, which will offer us another opportunity to further this analysis and quality assure more recent data from the HESA student record. Given the ongoing importance of WP to providers and policymakers, we shall continue to look for ways in which we can assess and/or improve the quality of the variables we collect relating to this matter. Comments and/or questions on this piece are most welcome and can be sent to [email protected].
[1] HESA also encourages providers to submit information on parental education/occupation for any full-time undergraduate entrants who enter higher education through a different route.
[2] See https://www.officeforstudents.org.uk/publications/differences-in-student-outcomes-further-characteristics/ for more information on these data quality checks.
[3] Qualifications at level 4 and above encompass various types of higher education qualifications. See https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/qualificationsandlabourmarketparticipationinenglandandwales/2014-06-18#background-notes for further information.
[4] ‘Not classified’ and those who are ‘long-term unemployed or never worked’ are coded as missing data. Consequently, the proportion in each output area in NS-SEC groups 1 or 2 is calculated as follows: (Total in NS-SEC groups 1 or 2 / Total in NS-SEC groups 1-7) * 100.
[5] 2013/14 also represented the year in which the coding manual for the parental education variable was modified to explicitly include a ‘No response given’ option. In our analysis, those in the ‘No response given’ category prior to this academic year are individuals for whom this field was blank. From 2013/14 onwards, this group incorporates both those with a ‘No response given’ entry and blank fields.