DLHE Longitudinal Survey 2008/09 cohort

Guidelines for use of the DLHE Longitudinal Survey Dataset

Version 1.0 Produced 2013-07-18

Introduction

These guidelines are for use with the 2012/13 DLHE Longitudinal Survey dataset (leaving cohort 2008/09). They include guidance on:

how to generate statistics from the data using the survey weights;
how the precision of those statistics can be estimated;
how to present tables generated from the data;
the interpretation of data from the tables.

The DLHE Longitudinal Survey has a complex structure. As a result, assistance from a colleague with technical statistical knowledge, especially of statistical software packages, may be required.

Overview of the survey design

The DLHE Longitudinal Survey is a follow-up survey that looks at the destinations of leavers from higher education up to three and a half years after they qualified. The survey to which these guidelines relate involved re-contacting a sample of leavers from the 2008/09 leaving cohort who completed an Early DLHE questionnaire and inviting them to complete a follow-up questionnaire. There were 470,940 leavers eligible to take part in the census survey in 2008/09, of whom 354,730 (75.3%) took part.

The Longitudinal Survey is not a census survey; it is based on two sub-samples of the 354,730 graduates who responded to the census survey in 2008/09. For Sample A, 80,835 graduates were selected from across all institutions, with some groups over-sampled relative to others, so that the sample is intentionally skewed towards foundation degree graduates, research graduates completing a Masters or Doctoral degree, specific subject areas and non-white graduates. Sample A yielded 33,640 responses. In addition, 192,745 of the remaining 273,890 graduates for whom an email address was available were contacted (Sample B), resulting in a further 28,565 responses to the survey. Following investigation by HESA and IFF Research it was agreed that Sample A and Sample B could be combined. The total number of responses available for analysis is therefore 62,205.

The rationale for the over-sampling in Sample A was to ensure that the Longitudinal Survey has sufficient numbers of graduates in key sub-groups to allow separate statistical analysis of these groups. A consequence of this over-sampling is that statistics generated from the Longitudinal Survey dataset are unrepresentative unless they are adjusted to compensate for it. Details of how to adjust the data can be found below.

The implications of starting with a sample rather than a census

The 2008/09 Early DLHE census involved all graduates eligible to take part and a high overall response rate was achieved; the statistics generated from the survey were therefore treated as exact. Because the Longitudinal Survey is based on a sample rather than a census, it cannot be guaranteed, even after accounting for the over-sampling of sub-groups, that the sample is representative of all graduates. This lack of certainty needs to be taken into account when interpreting the data.

A census involves contacting all members of a population; a sample survey by contrast includes a selected subset of the population.

The uncertainty surrounding the statistics from a sample survey is usually presented by:

a) Computing Confidence Intervals (CIs) around single statistics:

The interpretation of a confidence interval is that it gives an estimated range of values which is likely to include the true value. Confidence intervals are usually calculated as '95% confidence intervals', which means that there is a 95% chance that the interval calculated from the sample covers the true value. It is recommended that 95% CIs are used when using data from the DLHE Longitudinal Survey. The width of the confidence interval indicates how uncertain one should be about the true value: the wider the interval, the greater the uncertainty. For the same data, an 80% confidence interval will be narrower than a 95% confidence interval, and any confidence interval will be narrower when the sample size is larger.
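As a rough illustration of how such an interval might be computed, the Python sketch below uses the normal approximation for a proportion, with Kish's effective sample size standing in for the number of respondents to allow for unequal weights. Both choices are assumptions of this example rather than a method prescribed by these guidelines, and a statistical package with complex-survey support should give more reliable estimates. The function names are hypothetical.

```python
import math

# Standard normal critical values for two-sided intervals.
Z_VALUES = {0.80: 1.2816, 0.95: 1.9600}


def effective_sample_size(weights):
    """Kish approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def proportion_ci(flags, weights, level=0.95):
    """Approximate confidence interval for a weighted proportion.

    flags   -- 1 if the respondent has the characteristic, else 0
    weights -- survey weight for each respondent, in the same order
    """
    p = sum(w * f for f, w in zip(flags, weights)) / sum(weights)
    n_eff = effective_sample_size(weights)
    half_width = Z_VALUES[level] * math.sqrt(p * (1 - p) / n_eff)
    # Clamp to [0, 1] because a proportion cannot lie outside that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

Calling `proportion_ci(flags, weights, level=0.80)` on the same data returns a narrower interval than the default 95% level, in line with the description above.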

and

b) Testing whether differences between two (or more) statistics are significant, by using formal statistical significance tests (e.g. a normal test, 't' test or a chi-squared test). In order to do this, assistance from a colleague with technical statistical knowledge, especially of statistical software packages, may be required.

Generating unbiased statistics from the survey (use of survey weights)

Because some groups of graduates were over-sampled, and are therefore over-represented in the responses, data from the Longitudinal Survey cannot be used without adjustment. The adjustment is made by weighting the cases in the data files.

To illustrate how this is done:

Black, Asian, mixed and other ethnic group graduates accounted for 22.6% of the selected Sample A.
From the initial census it is known that these graduates represent just 14.6% of all graduates.
To ensure that these graduates feature in the analysis in their correct proportion, the 'black', 'Asian', 'mixed ethnic group' and 'other ethnicity' graduates in the sample would be given a weight of 14.6/22.6 (approximately 0.65).
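The arithmetic in this illustration can be sketched in Python as follows. The function name is hypothetical, and the weight for the remaining graduates is derived here by the same logic as an extension of the example; it is not a figure quoted in these guidelines.

```python
def sampling_weight(population_share, sample_share):
    """Weight that restores a group to its share of the population."""
    if sample_share <= 0:
        raise ValueError("sample_share must be positive")
    return population_share / sample_share


# Over-sampled group: 14.6% of all graduates but 22.6% of Sample A.
over_sampled = sampling_weight(14.6, 22.6)
print(round(over_sampled, 3))  # 0.646

# Remaining graduates (same logic, derived for illustration only).
remainder = sampling_weight(100 - 14.6, 100 - 22.6)
print(round(remainder, 3))  # 1.103
```

A weight below 1 scales down an over-sampled group, while a weight above 1 scales up a group that is under-represented in the sample.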

In practice, the data also needs to be weighted for the other sampling criteria, for non-response, and to take account of the additional 28,565 responses from Sample B, so there is a wide range of weight values.

The unweighted base for any analysis is the number of respondents to the survey. The weighted base is the adjusted number of respondents obtained using the method set out above.

The appropriate weight to apply to each response is included in the data files as a separate variable. The weight to apply for analysis of the data from an individual institution is called heiweight. The data must be weighted by heiweight for all analyses of the data for individual institutions.
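To make the distinction between the unweighted and weighted base concrete, the following Python sketch computes both, together with a weighted percentage, from a handful of made-up records. Only the variable name heiweight comes from the data files; the `activity` field, the record values and the function names are assumptions for illustration.

```python
# Fictional records: only the "heiweight" name is taken from the dataset.
records = [
    {"activity": "full-time work", "heiweight": 0.5},
    {"activity": "full-time work", "heiweight": 1.5},
    {"activity": "further study", "heiweight": 2.0},
    {"activity": "unemployed", "heiweight": 1.0},
]


def unweighted_base(rows):
    """Number of respondents."""
    return len(rows)


def weighted_base(rows):
    """Sum of the survey weights across respondents."""
    return sum(r["heiweight"] for r in rows)


def weighted_percentage(rows, activity):
    """Weighted share of respondents reporting the given activity."""
    matching = sum(r["heiweight"] for r in rows if r["activity"] == activity)
    return 100.0 * matching / weighted_base(rows)


print(unweighted_base(records))                        # 4
print(weighted_base(records))                          # 5.0
print(weighted_percentage(records, "full-time work"))  # 40.0
```

In this fictional example the unweighted share in full-time work would be 50% (2 of 4 respondents), whereas the weighted share is 40%, which shows why the weights cannot be ignored.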

Note that the variable heiweight takes into account the differential non-response rates to the Longitudinal Survey by sub-groups of graduates as well as the over-sampling. This is to ensure that the final data set of graduates who completed a questionnaire is reasonably representative of all graduates from the cohort.

The heiweight includes an additional adjustment to ensure that the weighted data for each institution gives a reasonable match to the Early DLHE Survey respondents for that institution, based on a small number of key variables, as detailed below. Since adjustment of this kind is only statistically efficient for large sample sizes, the amount of adjustment per institution is dependent on its Longitudinal Survey sample size.

For institutions with 400 or more Longitudinal Survey respondents the survey data are weighted so as to give a close percentage match between the survey and the census in terms of: broad subject group (health and welfare; science and agriculture; engineering, manufacture and construction; social science, business, law and combined; humanities and arts; education); mode (the part-time/full-time split); and level (the postgraduate/undergraduate split).

For institutions with between 200 and 399 Longitudinal Survey respondents the survey data are weighted so as to give a close percentage match between the survey and the census in terms of mode (the part-time/full-time split) and level (the postgraduate/undergraduate split).

For institutions with between 100 and 199 Longitudinal Survey respondents the survey data are weighted so as to give a close percentage match between the survey and the census in terms of level (the postgraduate/undergraduate split).

For institutions with fewer than 100 Longitudinal Survey respondents no HEI-level adjustment has been made.
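The tiers above can be summarised as a simple lookup. The Python sketch below is only a restatement of the thresholds given in the text, with a hypothetical function name; it does not reproduce the weighting procedure itself.

```python
def hei_adjustment_variables(respondents):
    """Variables used for the HEI-level adjustment, by respondent count."""
    if respondents < 0:
        raise ValueError("respondents must be non-negative")
    if respondents >= 400:
        return ("broad subject group", "mode", "level")
    if respondents >= 200:
        return ("mode", "level")
    if respondents >= 100:
        return ("level",)
    # Fewer than 100 respondents: no HEI-level adjustment is made.
    return ()


for n in (450, 250, 150, 60):
    print(n, hei_adjustment_variables(n))
```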

It is important when carrying out any analysis that the weighted data is used.

Software implications

Because the survey data must be weighted, statistics must be generated using a software package that can handle weighted data. This means using a statistical software package such as SAS (version 9 or later), SPSS (version 13 or later), Stata (version 12 or later), or SUDAAN (version 9 or later) for your analysis.

The sample is a stratified random sample and the weighting must be taken into account in analysis. This is not easily achieved without using a statistical software package.

Presenting tables from the survey data

This section includes some good practice guidelines on how to present tables based on survey data.

The example table below is used to illustrate some points.

Table 1.1 Activity on longitudinal census date, by age and gender

	FICTIONAL FIGURES
	Full-time paid work (%)	Part-time paid work (%)	Voluntary / unpaid work only (%)	Work and further study (%)	Further study only (%)	Unemployed (%)	Not available for employment (%)	Other (%)	Total	Base (weighted)
Female	75.1	11.2	1.0	6.2	3.1	0.6	1.2	1.6	100.0	2485
24 years & under	83.6	5.4	0.5	6.3	2.2	0.6	0.8	0.6	100.0	1055
25 years & over	68.4	18.2	1.3	6.4	2.5	1.1	0.8	1.3	100.0	1280

Male	83.2	4.1	0.2	6.5	2.8	2.6	0.3	0.3	100.0	1130
24 years & under	83.1	4.2	0.3	7.4	3.2	0.8	0.7	0.3	100.0	465
25 years & over	83.6	4.2	0.1	6.5	1.8	3.3	0.4	0.1	100.0	765

Total	74.4	6.6	0.2	8.6	5.2	2.3	2.6	0.1	100.0	7180

Good practice guidelines

Headings should be clear and specific.
The population that the figures in the table refer to should be clearly specified.
Avoid spurious precision (percentages should be shown to one decimal place).
Percentages calculated on populations which contain 52 or fewer individuals (unweighted base) must be suppressed.
Add footnotes to the bottom of the table if they are needed to explain the data.
Avoid presenting statistics for very small sub-groups of students where there is a risk that the individual students in the sub-group could be identified.
HESA's standard rounding strategy must be used when presenting raw data. This can be summarised as follows:
- 0, 1, 2 are rounded to 0; and
- all other numbers are rounded up or down to the nearest multiple of 5.
So for example 3 is represented as 5, 22 is represented as 20, 3286 is represented as 3285 while 0, 20, 55, 3510 remain unchanged.
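The rounding and suppression rules above can be sketched in Python as follows. The function names are hypothetical, and returning `None` for a suppressed percentage is a choice made for this example; the guidelines do not specify how a suppressed cell should be displayed.

```python
SUPPRESSION_THRESHOLD = 52  # unweighted base at or below which percentages are suppressed


def hesa_round(count):
    """Round a non-negative whole-number count to the nearest multiple of 5.

    Counts of 0, 1 and 2 become 0; 3 and 4 become 5, and so on.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return 5 * ((count + 2) // 5)


def display_percentage(weighted_count, weighted_base, unweighted_base):
    """Percentage to one decimal place, or None if it must be suppressed."""
    if unweighted_base <= SUPPRESSION_THRESHOLD:
        return None
    return f"{100.0 * weighted_count / weighted_base:.1f}"


# Examples from the text: 3 -> 5, 22 -> 20, 3286 -> 3285; the rest are unchanged.
for n in (3, 22, 3286, 0, 20, 55, 3510):
    print(n, hesa_round(n))
```

Integer arithmetic is used instead of Python's built-in `round`, which rounds halves to the nearest even number and is not needed for whole-number counts.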

Interpreting the data from tables

Avoid reporting on small sample sizes where possible

Statistics based on small sample sizes will be imprecise, and comparing two (or more) statistics which are based on small sample sizes can lead to some very misleading interpretations of data if this imprecision is ignored. The best way to avoid misinterpretation is to simply accept the limitations of the data; if any sub-groups have sample sizes that are too small for meaningful analysis then do not use these sub-groups in analysis.

Use significance tests to check whether an observed difference between two groups is likely to be genuine given the size and design of the sample. A number of formal statistical tests can be used for this purpose, e.g. a normal test, 't' test or a chi-squared test. It may be necessary to take advice on which of these tests is most appropriate under the circumstances.
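As an indication of what a normal test involves, the Python sketch below compares two proportions using a pooled standard error. It treats each group as a simple random sample of the stated size, which is an assumption of this example: for weighted survey data an effective sample size, or a package with complex-survey support, would be more appropriate. The function name is hypothetical.

```python
import math


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided normal test for the difference between two proportions.

    p1, p2 -- proportions (between 0 and 1) observed in each group
    n1, n2 -- sample sizes (ideally effective sample sizes) of each group
    Returns the z statistic and the two-sided p-value.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# The same 10-point difference with large and small groups.
print(two_proportion_z_test(0.6, 1000, 0.5, 1000))
print(two_proportion_z_test(0.6, 20, 0.5, 20))
```

With groups of 1,000 the difference between 60% and 50% gives a p-value well below 0.05, whereas with groups of 20 the same difference does not, which illustrates why statistics based on small samples should be interpreted with caution.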