
Paradata Review



Paradata largely comprises system-generated logging data which is, in its own way, as rich as the survey data itself, and offers us insights into the behavioural characteristics of respondents. It requires some complex scripting to access, and, as we learn more about the capabilities of this system, we are extending the catalogue of paradata we wish to extract from the system and utilise. When combined with our data on the population characteristics, it also yields potential insights into non-respondents.

Our current paradata dictionary includes variables for the start mode, partial completion mode, completion mode, various status markers, last question viewed, number of calls made, and a range of variables relating to the sending of emails and SMS messages. We have also added new variables enabling us to record non-response by question, duration by section, stage of the survey reached and the route a respondent travels through the survey.

Over the last few years, we have used some of this paradata to inform our data collection processes. Examples include identifying the most suitable times for sending emails and SMS messages based on completion times, changing subject lines to encourage higher email open and click rates, and monitoring interviewer performance using the average number of calls.

Now that we have a better understanding of what variables we require and utilise, we have obtained a regular import of these variables (from Forsta (formerly Confirmit), the survey data collection platform) in a format that enables us to link this data to other data such as population characteristics, survey completion status and results data. It is therefore vital that a review of every paradata variable is carried out. Checks are carried out at the end of each Cohort to ensure that the paradata is accurate and present for all variables that should contain data.
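As a rough illustration of the linkage described above, the sketch below joins paradata records onto population records using a shared graduate identifier. The field names (`graduate_id`, `provider`, `start_mode`, `calls_made`) and the records are hypothetical and do not reflect the actual Forsta export or the Graduate Outcomes schema.

```python
# Hypothetical records: neither the field names nor the values come from
# the real Forsta export or the Graduate Outcomes population file.
population = [
    {"graduate_id": "G1", "provider": "A"},
    {"graduate_id": "G2", "provider": "A"},
    {"graduate_id": "G3", "provider": "B"},
]
paradata = [
    {"graduate_id": "G1", "start_mode": "online", "calls_made": 0},
    {"graduate_id": "G3", "start_mode": "CATI", "calls_made": 4},
]


def link_paradata(population, paradata, key="graduate_id"):
    """Left-join paradata onto the population so that graduates with no
    paradata (never opened a link, never called) are kept in the output."""
    by_id = {row[key]: row for row in paradata}
    linked = []
    for person in population:
        extra = by_id.get(person[key])
        merged = dict(person)
        merged["has_paradata"] = extra is not None
        if extra is not None:
            merged.update(extra)
        linked.append(merged)
    return linked


linked = link_paradata(population, paradata)
for row in linked:
    print(row["graduate_id"], row["provider"], row.get("start_mode"), row["has_paradata"])
```

Keeping every population record, rather than only those with paradata, is what allows non-respondents to be examined alongside respondents.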

Paradata available to use in Graduate Outcomes


The paradata we refer to is the data collected during the administration of the Graduate Outcomes survey. We capture data relating to numerous variables, only some of which we have so far explored in detail. The data we have utilised, and how it is used, are outlined in more detail in Table 5.

Paradata is collected for all graduates interacting with the survey (accessing the survey link online) and for those receiving calls, so is present for respondents and some non-respondents. Certain items of paradata are monitored with regularity whilst others are yet to be used. For example, paradata relating to completion dates enables us to monitor the operational running of the survey on a daily basis. This helps us to report on how our response rate is progressing versus the same time period during the previous year. In addition, variables relating to survey modes are used to make informed decisions such as identifying the effectiveness of our engagement strategy and highlighting areas for improvement.

Given the widespread collection of paradata in Graduate Outcomes and other surveys, there are many research areas that could emerge from analysis of the data that could inform both the collection of paradata and how it is used. For instance, in collaboration with our contact centre we also use data about survey interviewers or interviewer observations to monitor progress and to identify and address data collection issues. This is not in scope of this report, as the focus is largely on the standard data items which relate to all respondents.

How we use the Paradata

In Year 5 (c21072) we extended the number of paradata variables we have access to and report on. This has helped our operational running of the survey, improved our understanding of how graduates interact with the survey and informed us on how best to contact graduates via our engagement strategy.

The table below summarises how we currently use the paradata information we import daily from the survey data collection platform.

Table 5: How we use the paradata fields and some of the key findings

| Paradata field | How it is used | Key findings |
| --- | --- | --- |
| When and which survey links are accessed | Identifying which survey links are accessed enables us to judge the effectiveness of our online communications. | Over 60% of online surveys were accessed via email links and less than 40% via SMS links. |
| Start mode, first question completion mode, partial completion mode, completion mode | Enables us to view how graduates interact with our concurrent mixed mode survey. | Around 60% of our surveys are completed via CATI and 40% online. |
| Status markers including whether the graduate has answered part, the minimum requirements or all of the survey | To identify graduates completing only part of the survey. | Survey completion rates are high for those answering the first question (<10% of graduates that start the survey do not reach completion status). |
| Date and time information | Enables us to track the most popular completion times of day and week. | Weekdays (particularly midweek) and early evenings are usually the most popular completion times. |
| Survey duration | Useful to enable us to keep the survey as short but as comprehensive as possible. Enables us to see how duration differs by mode and identify ways of making efficiency savings. | The survey is significantly quicker online than over the phone, with an average duration of <10 minutes. |
| Browser and device | To ensure our survey is user friendly on the browsers and devices our target audience uses. | Chrome is the most frequently recorded browser used to take the survey. |
| Number of good and unobtainable phone numbers called during the cohort | We complete a CATI review after each cohort to analyse how effective the phone contact details supplied to us were. | Only a small proportion of graduates do not have a valid phone number. There is a link between providers with a high unobtainable rate and a low CATI response rate. |
| Status of calls, for example appointments/answer phones | To track how effective the contact centre was at obtaining useful call outcomes. | The presence of a valid phone number does not guarantee response, as several calls go unanswered. |
| Number of proxy surveys/web transfers | To judge how effective proxy surveys and web transfers are in boosting responses. | Proxy surveys account for <1% of the completed CATI surveys, and successful web transfers also account for <1% of the completed online surveys. |
| Number of surveys offered and answered in Welsh | We offer the survey in Welsh, and the volume completed is tracked for invoicing. | Only a tiny minority of surveys were completed in Welsh. Even where Welsh was selected as the preferred language by a respondent, several responses are provided in English. |
| Opt-out rates and reasons for opting out (online) | Helps us understand why graduates do not want to complete the survey and enables us to identify peak times of opting out. | Of the options provided, the most frequently observed reason for opting out was "I’m not interested in completing the survey". |
| Seen/Answered flags | These variables enable us to see which questions were seen, and subsequently answered, by graduates as they progressed through the survey. From this data we can calculate the response rate for each question. | For the majority of questions we obtain a response rate >95%. It is important to identify those questions with a low rate, as this can indicate which questions graduates are reluctant to answer or which are the reason for survey drop-out. |
| Section flow | This variable shows us the order in which the sections of the survey were asked. The ordering is determined by both the activities selected in the first question and which activity was selected as most important. | We have seen that there are differences by mode, with a higher proportion of CATI graduates undertaking section C before section B. |
| Completion status | This variable tracks what stage of the survey a graduate reached, for example whether they dropped out before the end of the survey. We can classify graduates as partial completions if they did not answer all of the core mandatory questions, and as total completions if they reached the end of the survey. | Over 80% of those starting the survey reach the end of it. |
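To make the per-question response rate derived from the Seen/Answered flags concrete, here is a minimal sketch. It assumes the rate is the number of graduates who answered a question divided by the number who saw it. The data layout, the question names and the 0.95 cut-off used to list low-rate questions are illustrative assumptions, not the production definition.

```python
# Hypothetical Seen/Answered flags: one dict per graduate, mapping a
# question identifier to a (seen, answered) pair. Question names are made up.
flags = [
    {"Q1": (True, True), "Q2": (True, True), "Q3": (True, False)},
    {"Q1": (True, True), "Q2": (True, False), "Q3": (False, False)},
    {"Q1": (True, True), "Q2": (True, True), "Q3": (True, True)},
]

LOW_RATE_THRESHOLD = 0.95  # illustrative cut-off, an assumption


def question_response_rates(flags):
    """Return {question: answered / seen}, skipping questions nobody saw."""
    seen = {}
    answered = {}
    for graduate in flags:
        for question, (was_seen, was_answered) in graduate.items():
            if was_seen:
                seen[question] = seen.get(question, 0) + 1
                if was_answered:
                    answered[question] = answered.get(question, 0) + 1
    return {q: answered.get(q, 0) / n for q, n in seen.items()}


rates = question_response_rates(flags)
low_rate_questions = sorted(q for q, r in rates.items() if r < LOW_RATE_THRESHOLD)
print(rates)
print(low_rate_questions)
```

Using "seen" rather than "all respondents" as the denominator keeps routing from depressing the rate of questions that only some graduates are asked.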

Data quality review

Summary of the data quality checks

Most of the variables appear to be accurate in terms of the coverage (data is present and correct for all graduates that should have paradata present). In some instances, the accuracy of a variable can be judged by comparing it with another similar variable, and where contradictions occur this can indicate an error in one or both fields. Similarly, errors can be found where data is present in one field but missing in another.

Missing data

A small number of errors were found by comparing data from similar fields. For example, start mode was sometimes missing even though the paradata suggested the first question was answered, because partial completion mode was present. As start mode is populated when the graduate first interacts with the survey, it should always be present for those answering the first question.
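A cross-field check of this kind could be sketched as follows. The field names and the use of `None` for a missing value are assumptions made for illustration; the actual paradata dictionary may name and encode these variables differently.

```python
# Hypothetical paradata rows; None marks a missing value.
records = [
    {"graduate_id": "G1", "start_mode": "online", "partial_completion_mode": "online"},
    {"graduate_id": "G2", "start_mode": None, "partial_completion_mode": "CATI"},
    {"graduate_id": "G3", "start_mode": None, "partial_completion_mode": None},
]


def missing_start_mode(records):
    """Return ids where partial completion mode is present but start mode is not."""
    return [
        r["graduate_id"]
        for r in records
        if r["partial_completion_mode"] is not None and r["start_mode"] is None
    ]


suspect = missing_start_mode(records)
print(suspect)
```

Here G3 is not flagged, because a graduate with no paradata in either field may simply never have interacted with the survey.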

Errors due to outliers

The survey duration is recorded in seconds and needs to be calculated only after removing outliers. This is because the data will include instances where a graduate leaves and resumes the survey at a later point, inflating the survey duration. There are also instances where the recorded duration is shorter than the time in which the survey could plausibly have been completed. For the purpose of analysis, we remove any record with a duration longer than 1 hour (3,600 seconds) or under 1 minute (60 seconds). The average survey durations we calculate by mode closely match our expectations and, for CATI, closely match the survey durations the call centre has observed.
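The trimming rule above can be sketched as follows. The 60 and 3,600 second limits come from the text; the mode labels and duration values are invented for illustration.

```python
from statistics import mean

MIN_SECONDS = 60     # under 1 minute is treated as implausibly short
MAX_SECONDS = 3600   # over 1 hour suggests the graduate left and resumed

# Hypothetical (mode, duration in seconds) pairs.
durations = [
    ("online", 45), ("online", 480), ("online", 540), ("online", 86400),
    ("CATI", 900), ("CATI", 1100), ("CATI", 5000),
]


def mean_duration_by_mode(durations):
    """Average duration per mode after dropping records outside the limits."""
    kept = {}
    for mode, seconds in durations:
        if MIN_SECONDS <= seconds <= MAX_SECONDS:
            kept.setdefault(mode, []).append(seconds)
    return {mode: mean(values) for mode, values in kept.items()}


averages = mean_duration_by_mode(durations)
print(averages)
```

Records of exactly 60 or 3,600 seconds are kept here, since the rule removes only durations under 1 minute or longer than 1 hour.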

Survey Routing complications

We have spotted some data quality concerns with the Seen/Answered flags, potentially caused by graduates seeing questions they no longer had a requirement to answer. This can occur when a graduate goes back to a previous section and changes their route through the survey. Survey scripts have been changed to better record the final route the graduates took. This should prevent inconsistencies in the data, and quality assessments of the flags will continue.
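One way such inconsistencies might be detected in analysis is to compare the questions flagged as seen with the sections on the graduate's final route. The sketch below is only an illustration of that idea and does not describe how the survey scripts were changed. The question-to-section mapping, question names and route format are all hypothetical.

```python
# Hypothetical mapping of questions to survey sections.
QUESTION_SECTION = {"A1": "A", "B1": "B", "B2": "B", "C1": "C"}


def off_route_questions(seen_questions, final_route):
    """Return questions flagged as seen whose section is not on the final route."""
    on_route = set(final_route)
    return sorted(q for q in seen_questions if QUESTION_SECTION[q] not in on_route)


# A graduate who first entered section B, went back, and ended on route A -> C.
seen = {"A1", "B1", "C1"}
stale = off_route_questions(seen, final_route=["A", "C"])
print(stale)
```

Questions returned by such a check could be excluded from the "seen" denominator so that they do not appear as unanswered.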

Future work

We have gradually increased the number of paradata variables we have access to and can analyse. These can deliver important insight into how our survey is operating, as well as how graduates interact with it. Some of the variables we have access to are recent additions, and therefore the focus is now on understanding this new data and how best we can utilise the knowledge to make survey improvements.

Below is a list of paradata items for which we are looking to improve the accuracy and/or which we intend to analyse in further detail.


  • Survey duration split by section. These variables record the duration in seconds a respondent spent on each section. This enables us not only to use a combined total of the sections to calculate the overall duration, but also to see which sections of the survey take the longest and are most burdensome. We can also calculate the average time taken to answer a question in each section, because we know which questions each respondent answered.
  • Seen/Answered flags. By looking at those who have not completed the survey, we can identify questions that cause survey drop-out. We have spotted some data quality concerns with these flags which also need to be fixed before we can fully utilise this variable.
  • Section flow. This variable shows us the order in which the sections of the survey were presented to a graduate. The ordering is determined by both the activities selected in the first question and which activity was selected as most important. We have seen that there are differences by mode, the reasons for which need to be investigated further. We also need to understand why graduates on CATI are selecting more activities than those online.
  • Inbound calling flag. This variable enables us to track the graduates that miss calls and call back our contact centre. We can subsequently see how many of these graduates go on to complete the survey.
  • Date of first call. This new paradata field tells us, for each graduate called, the date of their first call. It will enable us to identify when they first had CATI contact or, in other words, for how long they were only able to complete the survey online.
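The calculations described in the first item of the list above (survey duration split by section) can be sketched for a single respondent as follows. The section names, durations and question counts are invented, and the real variables may be structured differently.

```python
# Hypothetical per-section durations (seconds) and answered-question counts
# for a single respondent; section names are illustrative.
section_seconds = {"A": 120, "B": 300, "C": 90}
questions_answered = {"A": 6, "B": 10, "C": 3}

# Overall duration as the combined total of the sections.
total_seconds = sum(section_seconds.values())

# Section on which the respondent spent the most time.
longest_section = max(section_seconds, key=section_seconds.get)

# Average seconds per answered question, skipping sections with no answers.
seconds_per_question = {
    section: section_seconds[section] / questions_answered[section]
    for section in section_seconds
    if questions_answered.get(section)
}

print(total_seconds, longest_section, seconds_per_question)
```

Comparing seconds per question rather than raw section totals helps separate sections that are long because they contain many questions from sections whose individual questions are slow to answer.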
