# Data analysis

## Approach to weighting

Non-response to a survey can result in estimates derived from a sample not accurately reflecting the wider population. Consequently, in year 1, HESA completed an internal investigation into whether weighting could help to alleviate this consequence of non-response. We did not find any evidence to suggest that applying weighting would be beneficial, given the minimal difference between the weighted and unweighted estimates for the proportion in employment and/or study.[1]

It was recognised that further research would be required in year 2 to examine the robustness of this conclusion. HESA therefore commissioned the Institute for Social and Economic Research (ISER) at the University of Essex to carry out this analysis for the second year. When compared with year 1, the work was extended to include the proportion in highly skilled employment and/or study as an additional outcome to analyse.

Data from year 2 of the survey, covering graduates from the 2018/19 academic year, was used in the research. The research was replicated on data from year 1 to establish the robustness of the conclusions.

Weights were produced, designed to ensure that the responding sample, once weighted, matched the full population of graduates in terms of characteristics such as subject area of study, level and class of award, provider, sex, age at entry and region of domicile. Four different sets of weights were developed, each matching the sample to a different combination of the available variables.

Each set of weights was tested to see whether it improved the accuracy of estimates of the proportion of graduates in employment and/or study or in highly skilled employment and/or study, both overall and for a number of population subgroups.

It was found that weighting – using any of the four approaches – improved accuracy for only a minority of estimates. Furthermore, when an improvement occurred it was rather small in size, making little or no practical difference to the conclusions that would be drawn from the data. In other words, the accuracy of estimates did not substantially differ between weighted and unweighted estimates.

ISER’s investigation of year 2 data therefore reached a similar conclusion to HESA’s previous investigation of year 1 data: that there is no need to use weighted estimation with the Graduate Outcomes survey data.

Users of the data may be reassured that the findings from this project indicate that there is no evidence of substantial non-response bias in the survey data.

A full technical report of the ISER research on weighting has been published on the HESA website.[2]

Following on from the recommendations provided by ISER, this year (year 3) we have undertaken a smaller-scale analysis to evaluate whether there is any evidence that weighting the Graduate Outcomes data could improve our estimates.

Models 1 and 3 tested by ISER produced very similar results (see table 8 of their report [2]). Given time constraints and the more parsimonious nature of Model 3, we decided to use this specification in our year 3 assessment. While ISER conducted an analysis for the full sample, as well as by domicile (see table 8 [2]), we decided to focus on the overall sample for our examination. The UK and non-UK models were estimated separately and replicated as closely as possible, with minor changes resulting from factors such as new providers emerging in the 2019/20 Graduate Outcomes data. Propensities to respond were then calculated, and the inverse of each propensity was taken as the weight assigned to that member of the sample.
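The last step described above, turning response propensities into weights, can be sketched in a few lines of Python. This is a minimal illustration and not HESA's or ISER's code: the propensities and outcomes are invented, the function names are hypothetical, and in practice the propensities would come from the fitted response model.

```python
def inverse_propensity_weights(propensities):
    """Return 1/p for each estimated response propensity p, with 0 < p <= 1."""
    weights = []
    for p in propensities:
        if not 0 < p <= 1:
            raise ValueError(f"propensity must be in (0, 1], got {p}")
        weights.append(1.0 / p)
    return weights


def weighted_proportion(outcomes, weights):
    """Weighted share of respondents whose outcome is 1."""
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Invented values for four respondents (illustrative only).
propensities = [0.8, 0.5, 0.5, 0.25]
outcomes = [1, 1, 0, 1]  # 1 = in highly skilled employment and/or further study

weights = inverse_propensity_weights(propensities)
unweighted = sum(outcomes) / len(outcomes)
weighted = weighted_proportion(outcomes, weights)
```

Respondents who were less likely to respond receive larger weights, so they count for more in the weighted proportion.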

Our dependent variable of interest was the proportion in highly skilled employment and/or further study. Unweighted estimates and the associated 95% confidence intervals for this outcome were therefore developed for the overall sample, as well as by subject (based on CAH), provider and subject within provider. This was followed by the creation of corresponding weighted estimates.

Weighting can affect estimates in two ways. First, it can influence bias, which occurs when a sample statistic does not accurately reflect the population value. Second, it can affect the precision of estimates. In general, weighting is likely to inflate standard errors, although if the variables used in the weighting procedure are strongly correlated with both the outcome and non-response, weighting could lead to more precise estimates. However, the concluding section of the ISER report notes that ‘despite a rich set of auxiliary variables being available for weighting, it would seem that none of them are sufficiently strongly associated with both the propensity to participate in the survey and the y-variables’. Consequently, it is improbable that a weighting procedure will, in general, lead to more precise estimates in Graduate Outcomes. Indeed, in year 1, we saw that weighting mostly led to a small increase in standard errors.

We therefore concentrated our attention this year on assessing the impact of weighting on bias, using the same approach as ISER. That is, we looked at whether the weighted estimate fell within the 95% confidence interval for the unweighted estimate. If it did, the bias reduction was assumed to be zero. If, however, the weighted estimate fell outside the confidence interval for the unweighted estimate, the extent of the bias reduction was determined by the difference between the weighted and unweighted estimates. Having applied this approach to the overall sample, as well as by subject, provider, and subject within provider, we found that the bias reduction was equal to zero in just over 99% of cases. It was therefore concluded that there was no evidence to suggest that weighting was required for Graduate Outcomes year 3 data.
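The decision rule in this paragraph can be written as a short Python sketch. Two things here are assumptions rather than details taken from the source: the confidence interval uses a simple normal approximation (the text does not say how the interval was computed for this exercise), and the "difference" is taken as an absolute value. The counts and weighted estimates are invented.

```python
from math import sqrt
from statistics import NormalDist


def unweighted_ci(successes, n, level=0.95):
    """Proportion with a normal-approximation CI (an assumed method)."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def bias_reduction(weighted, unweighted, ci_low, ci_high):
    """Zero if the weighted estimate lies inside the unweighted CI,
    otherwise the absolute difference between the two estimates."""
    if ci_low <= weighted <= ci_high:
        return 0.0
    return abs(weighted - unweighted)


# Invented cell: 700 of 1,000 respondents have the outcome of interest.
p, low, high = unweighted_ci(700, 1000)
inside = bias_reduction(0.71, p, low, high)   # weighted estimate within the CI
outside = bias_reduction(0.74, p, low, high)  # weighted estimate outside the CI
```

Applying `bias_reduction` to every subgroup and counting the zeros would give a share comparable to the "just over 99% of cases" reported above.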

## Reliability

Some statistics published from the Graduate Outcomes survey will be at a very granular level, e.g., employment rates by HE provider and subject. In some cases, the sample of respondents for such statistics may be small and/or the response rate for that sample may be lower than the overall survey response rate. In these cases, the statistics may be subject to high levels of variability and a lack of statistical precision. HESA intends to publish confidence intervals[3] on these statistics (ranges within which we have a high level of confidence that the equivalent whole-population statistic would fall, where a narrow range indicates greater precision and a wide range indicates less precision).
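Footnote [3] names Goodman (1965) as the method behind these intervals, implemented in R with the MultinomCI function. The Python sketch below re-implements the Goodman formula as it is commonly stated, purely to show how interval width depends on sample size. It is not the R function, it has not been checked against HESA's output, and the counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def goodman_ci(counts, level=0.95):
    """Goodman (1965) simultaneous confidence intervals for multinomial
    proportions. Returns one (lower, upper) tuple per category."""
    k = len(counts)
    n = sum(counts)
    if n == 0:
        raise ValueError("counts must sum to a positive number")
    alpha = 1 - level
    # Chi-square quantile (1 degree of freedom) at 1 - alpha/k, obtained
    # as the square of the matching standard normal quantile.
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    a = z * z
    intervals = []
    for x in counts:
        root = sqrt(a * (a + 4 * x * (n - x) / n))
        lower = (a + 2 * x - root) / (2 * (n + a))
        upper = (a + 2 * x + root) / (2 * (n + a))
        intervals.append((lower, upper))
    return intervals


# Invented small cell: 30 respondents with the outcome, 10 without.
small = goodman_ci([30, 10])
# Same proportions with ten times as many respondents.
large = goodman_ci([300, 100])
```

The interval for the larger cell is narrower than the interval for the smaller one, which is the loss of precision that the paragraph above warns about for granular statistics.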

In addition, for some statistics, it may be necessary to introduce publication thresholds whereby statistics based on very small sample sizes and/or lower response rates are suppressed. The actual decisions on use of these techniques will be clearly explained in each HESA statistical release.

## Salary data

Preliminary analysis of the data on salaries submitted by respondents in the Graduate Outcomes survey reveals a small number of salary outliers which are suggestive of data quality issues, such as misinterpretation of the salary question. Whilst HESA has taken steps to reduce misinterpretation, some level of irregularity in responses is expected and decisions must be made on the treatment of salary outliers for dissemination of survey data. In determining a reasonable approach, at the lower end of the range of reported salaries HESA has taken the decision to exclude those falling below the UK national minimum wage equivalent (calculated using the minimum wage rates relevant to the year of reported employment). Salaries below this level are considered implausible. For all three years, it is assumed that full-time graduates work at least 30 hours a week, and the resulting weekly pay at the minimum wage is multiplied by 52 to give an annual figure. The minimum wage changed between each year of publication. For graduates affected by the COVID pandemic (the 2018/19 and 2019/20 cohorts), the minimum wage value was multiplied by 0.8 to reflect possible deductions due to furlough. The resulting thresholds are:

Year 1 (2017/18) - £11,513

Year 2 (2018/19) - £12,012 x 0.8 = £9,610

Year 3 (2019/20) - £12,792 x 0.8 = £10,233
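The arithmetic behind these thresholds can be sketched as follows. The annual figures and the 0.8 factor come from the text, while the function names and the £10.00 hourly rate in the usage line are hypothetical. The sketch leaves the products unrounded: they come to 9,609.6 and 10,233.6, against the quoted £9,610 and £10,233, and the source does not state a rounding convention.

```python
FURLOUGH_FACTOR = 0.8  # applied to the 2018/19 and 2019/20 cohorts

# Annual minimum wage equivalents quoted in the text.
ANNUAL_MINIMUM = {
    "2017/18": 11_513,
    "2018/19": 12_012,
    "2019/20": 12_792,
}
FURLOUGH_ADJUSTED = {"2018/19", "2019/20"}


def annualise(hourly_rate, hours_per_week=30, weeks=52):
    """Annual pay implied by an hourly rate at 30 hours a week."""
    return hourly_rate * hours_per_week * weeks


def lower_threshold(year):
    """Annual salary below which a reported salary is treated as implausible."""
    base = float(ANNUAL_MINIMUM[year])
    return base * FURLOUGH_FACTOR if year in FURLOUGH_ADJUSTED else base


example_annual = annualise(10.00)  # hypothetical hourly rate, not a real one
thresholds = {year: lower_threshold(year) for year in ANNUAL_MINIMUM}
```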

At the opposite end of the salary range we see a small proportion of very high salaries reported which are worthy of additional scrutiny. In the first year of publication in 2020 (using 2017/18 data), HESA conducted statistical analysis of the data which suggested that if the top 1.5% of reported salaries were excluded, the remaining data would more closely fit a ‘normal’ statistical distribution (which would be the usual expectation for data such as this drawn from a very large sample). HESA therefore previously concluded that it would be appropriate to exclude the top 1.5% of salaries as outliers.

Some user feedback received since the 2020 publication challenged this approach on the basis that the statistical analysis did not necessarily suggest those high salaries were erroneous. In response, HESA has undertaken further analysis of all three years of data, including a review of the literature and manual scrutiny of salaries reported to be in excess of £100,000 alongside other reported characteristics of the associated employment, such as job titles. This assessment leads us to conclude that salary data is unlikely to be distributed normally and, further, that beyond a certain salary threshold the proportion of reported salaries that are not credible increases markedly. The threshold we have determined is approximately £245,000, which accounts for the top 0.1% of reported salaries. Because we cannot be confident of the data quality of salaries in this uppermost range, we have taken the decision to exclude them so that they do not have a negative impact on calculations such as overall mean salary levels. Further detail on our assessment of high salaries is available in the Quality Report.
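A small Python sketch shows why a few implausibly high values matter for the mean. The salaries are invented, the threshold is the approximate £245,000 figure from the text, and treating a salary exactly at the threshold as retained is an assumption.

```python
from statistics import mean, median

UPPER_THRESHOLD = 245_000  # approximate figure quoted in the text


def exclude_high_outliers(salaries, threshold=UPPER_THRESHOLD):
    """Keep salaries at or below the threshold (boundary handling assumed)."""
    return [s for s in salaries if s <= threshold]


# Invented reported salaries, one of which is not credible.
reported = [22_000, 24_000, 25_000, 27_000, 30_000, 2_500_000]
kept = exclude_high_outliers(reported)

mean_before, mean_after = mean(reported), mean(kept)
median_before, median_after = median(reported), median(kept)
```

In this invented example the mean falls from £438,000 to £25,600 once the outlier is removed, while the median moves only from £26,000 to £25,000. A single non-credible value can dominate a mean while barely moving the median.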

As with previous presentations of graduate salary data, HESA expects to show data only for graduates reporting themselves as in full-time paid UK employment where the currency paid is British pounds.

HESA statistical releases show numbers of graduates by salary bands which start at the national minimum wage equivalent and are divided into £3,000 bands within the most common range of graduate salaries. In the 2020 publications the highest band covered the range £39,000 or more. For 2018/19 data publications these bands have been revised to account for the fact that a reasonably sized proportion of graduates in full-time paid UK employment can be grouped into this salary band (particularly among those having graduated from some postgraduate courses). Four additional divisions have therefore been added at the higher end, meaning the highest band covers £51,000 or more. In addition to banded salary data, median salaries are published. The methodology for salary bands has remained the same for the 2019/20 release.
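The banding described here can be illustrated with a short Python sketch. The £3,000 width and the £51,000 top band come from the text. The position of the first band edge (`first_edge`, set to £15,000) is an assumption, because the source says only that bands start at the national minimum wage equivalent. The labels are also illustrative and not HESA's published wording.

```python
def salary_band(salary, lower_threshold, first_edge=15_000,
                top_edge=51_000, width=3_000):
    """Return a band label for an integer salary, or None if the salary
    is below the lower threshold.

    first_edge is an assumption: the text does not state where the
    first £3,000 band begins.
    """
    if salary < lower_threshold:
        return None
    if salary >= top_edge:
        return f"£{top_edge:,} or more"
    if salary < first_edge:
        return f"£{lower_threshold:,} to under £{first_edge:,}"
    start = first_edge + ((salary - first_edge) // width) * width
    return f"£{start:,} to under £{start + width:,}"


# Usage with the Year 3 lower threshold quoted earlier on this page.
bands = [salary_band(s, 10_233) for s in (9_000, 12_000, 25_500, 60_000)]
```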

[3] Confidence intervals are calculated at the 95% level using the method proposed by Goodman (1965), implemented in R using the MultinomCI function. Goodman, L. A. (1965) *On Simultaneous Confidence Intervals for Multinomial Proportions*. Technometrics, 7, 247-254.