
Paradata Review


Paradata largely comprises system-generated logging data which is, in its own way, as rich as the survey data itself, and offers us insights into the behavioural characteristics of respondents. It requires some complex scripting to access, and, as we learn more about the capabilities of this system, we are extending the catalogue of paradata we wish to extract from the system and utilise. When combined with our data on the population characteristics, it also yields potential insights into non-respondents.

Our current paradata dictionary includes variables for the start mode, partial completion mode, completion mode, various status markers, last question viewed, number of calls made, and a range of variables relating to the sending of emails and SMS messages. We have recently added new variables enabling us to record non-response by question, duration by section and the route a respondent travels through the survey.

Over the last few years we have used some of this paradata to inform our data collection processes. Examples include identifying the most suitable time for sending emails and SMS messages based on completion times, changing subject lines to encourage higher email open and click rates, and monitoring interviewer performance using the average number of calls.

Now that we have a better understanding of which variables we require and use, we receive a regular import of these variables from Forsta (formerly Confirmit), the survey data collection platform, in a format that enables us to link this data to other data such as population characteristics, survey completion status and results data. It is therefore vital that every paradata variable is reviewed. Checks are carried out at the end of each cohort to ensure that the paradata is accurate and present for all variables that should contain data.

Paradata available to use in Graduate Outcomes


The paradata we refer to is the data collected during the administration of the Graduate Outcomes survey. We capture data relating to numerous variables, only some of which we have so far explored in detail. The data we have utilised, and how it is used, are outlined in more detail in Table 1.

Paradata is collected for all graduates interacting with the survey (accessing the survey link online) and for those receiving calls, so it is present for respondents and some non-respondents. Certain items of paradata are monitored regularly whilst others are yet to be used. For example, paradata relating to completion dates enables us to monitor the operational running of the survey on a daily basis. This helps us to report on how our response rate is progressing versus the same time period during the previous year. In addition, variables relating to survey modes are used to make informed decisions such as identifying the effectiveness of our engagement strategy and highlighting areas for improvement.

Given the widespread collection of paradata in Graduate Outcomes and other surveys, there are many research areas that could emerge from analysis of the data and that could inform both the collection of paradata and how it is used. For instance, in collaboration with our contact centre we also use data about survey interviewers, or interviewer observations, to monitor progress and to identify and address data collection issues. This is outside the scope of this report, as the focus is largely on the standard data items which relate to all respondents.

How we use the Paradata

In Year 4 we extended the number of paradata variables we have access to and report on. This has helped our operational running of the survey, improved our understanding of how graduates interact with the survey and informed us on how best to contact graduates via our engagement strategy.

The table below summarises how we currently use the paradata information we import daily from the survey data collection platform.

Table 4: How we use the paradata fields and some of the key findings

| Paradata field | How it is used | Key findings |
|---|---|---|
| When and which survey links are accessed | Identifying which survey links are accessed enables us to judge the effectiveness of our online communications. | Approximately 70% of online surveys were accessed via email links and 30% via SMS links. |
| Start mode, first question completion mode, partial completion mode, completion mode | Enables us to view how graduates interact with our concurrent mixed mode survey. | Around 60% of our surveys are completed via CATI and 40% online. |
| Status markers including whether the graduate has answered part, the minimum requirements or all of the survey | To identify graduates completing only part of the survey. | Survey completion rates are high for those answering the first question (<10% of graduates who start the survey do not reach completion status). |
| Date and time information | Enables us to track the most popular completion times of day/week. | Weekdays (particularly midweek) and early evenings are the most popular completion times. |
| Survey duration | Helps us to keep the survey as short yet as comprehensive as possible. Enables us to see how duration differs by mode and identify ways of making efficiency savings. | The survey is significantly quicker online than over the phone, with an average duration of <10 minutes. |
| Browser and device | To ensure our survey is user-friendly on the browsers and devices our target audience uses. | Chrome is the most frequently recorded browser used to take the survey. |
| Number and type of contact details we are supplied with | We identify and contact providers falling below certain thresholds that are likely to impact response rates in the forthcoming cohort. | Missing details are rare: approximately 95% of graduates have a phone number supplied and 98% have an email address. |
| Number of good and unobtainable phone numbers called during the cohort | We complete a CATI review post cohort to analyse how effective the phone contact details we were supplied with were. | Only a small proportion of graduates do not have a valid phone number. |
| Status of calls, for example appointments/answer phones | To track how effective the contact centre was at obtaining useful call outcomes. | The presence of a valid phone number does not guarantee response, as several calls go unanswered. |
| Number of proxy surveys/web transfers | To judge how effective proxy surveys and web transfers are in boosting responses. | Proxy surveys account for <1% of the completed CATI surveys and successful web transfers account for <1% of the completed online surveys. |
| Number of surveys offered and answered in Welsh | We offer the survey in Welsh; the volume completed is tracked for invoicing. | Only a tiny minority of surveys were completed in Welsh. Even where Welsh was selected as the preferred language by a respondent, several responses were provided in English. |
| Opt-out rates and reasons for opting out (online) | Helps us understand why graduates do not want to complete the survey and enables us to identify peak times of opting out. | Of the options provided, the most frequently observed reason for opting out was "I’m not interested in completing the survey". |
| Seen/Answered flags | These variables enable us to see which questions were seen, and subsequently answered, by graduates as they progressed through the survey. From this data we can calculate the response rate for each question. | For the majority of questions we obtain a response rate >95%. It is important to identify those questions with a low rate, as this can indicate which questions graduates are reluctant to answer or which are the reason for survey drop-out. |
| Section flow | This variable shows us the order in which the sections of the survey were asked. The ordering is determined by both the activities selected in the first question and which activity was selected as most important. | We have seen that there are differences by mode, with a higher proportion of CATI graduates undertaking section C before section B. |
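
The per-question response rate derived from the Seen/Answered flags can be sketched as follows. This is a minimal illustration only: it assumes each graduate's record maps a question identifier to a pair of booleans (seen, answered). The record layout, the function names and the question identifiers are hypothetical and do not reflect the format of the platform export. The 0.95 default threshold simply mirrors the >95% figure quoted above.

```python
from collections import defaultdict


def question_response_rates(records):
    """Return {question_id: answered / seen} for questions seen at least once.

    `records` is an iterable of dicts mapping a question id to a
    (seen, answered) pair of booleans. This layout is an assumption.
    """
    seen = defaultdict(int)
    answered = defaultdict(int)
    for record in records:
        for question, (was_seen, was_answered) in record.items():
            if was_seen:
                seen[question] += 1
                # Only count answers to questions the graduate actually saw.
                if was_answered:
                    answered[question] += 1
    return {question: answered[question] / seen[question] for question in seen}


def low_response_questions(rates, threshold=0.95):
    """List the questions whose response rate falls below `threshold`."""
    return sorted(q for q, rate in rates.items() if rate < threshold)
```

Dividing by the number of graduates who saw the question, rather than by all graduates, keeps routing from depressing the rate for questions that only some routes include.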

Data quality review

Summary of the data quality checks

Most of the variables appear to be accurate in terms of the coverage (data is present and correct for all graduates that should have paradata present). In some instances, the accuracy of a variable can be judged by comparing it with another similar variable, and where contradictions occur this can indicate an error in one or both fields. Similarly, errors can be found where data is present in one field but missing in another.

Missing data

A small number of errors were found by comparing data from similar fields. For example, start mode was sometimes missing even though the presence of partial completion mode suggested the first question had been answered. As start mode is populated when the graduate first interacts with the survey, it should always be present for those answering the first question.

In addition, we identified scripting changes needed to alter the point at which we collect the hour and day the survey was completed. The two variables, hour of completion and day of completion, were missing data for those not reaching the end of the survey. This was because the data was collected at the point the survey closed, rather than the point when the graduate had answered the first question. The specification for data capture has since been revised.
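
The cross-field check on start mode described in this section could be expressed along the following lines. This is a sketch under stated assumptions: the field names `graduate_id`, `start_mode` and `partial_completion_mode` are hypothetical stand-ins for the real paradata variables, and missing values are assumed to arrive as `None` or an empty string.

```python
def find_start_mode_gaps(records):
    """Return the ids of records with a partial completion mode but no start mode.

    Start mode should always be populated once the first question has been
    answered, so this combination points to an error in one of the fields.
    Field names here are hypothetical.
    """
    flagged = []
    for record in records:
        start = record.get("start_mode")
        partial = record.get("partial_completion_mode")
        # Empty strings and None are both treated as missing.
        if partial and not start:
            flagged.append(record["graduate_id"])
    return flagged
```

The same pattern extends to any pair of variables where the presence of one implies the presence of the other.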

Dialer issues

There were instances where technical errors with the telephone dialer meant that the call count field needed to be reset for some graduates. The survey programmer manually reset this variable to 0 during the cohort so that we could resume calling the affected records. As a result, the number of calls made during the cohort was under-recorded in some instances.

Errors due to outliers

The survey duration is recorded in seconds and needs to be calculated only after removing outliers. This is because the data will include instances where a graduate leaves and resumes the survey at a later point, inflating the survey duration. There are also instances of the survey length being recorded as shorter than the time in which it would have been possible to complete the survey. For the purpose of analysis, we remove data relating to anything taking longer than 1 hour (3,600 seconds) or under 1 minute (60 seconds). The average survey durations we calculate by mode closely match our expectations and, for CATI, closely match the survey durations the call centre has observed.
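
The trimming rule above can be illustrated with a short sketch. It assumes durations are available as (mode, seconds) pairs, which is a hypothetical layout, and it uses a simple arithmetic mean, since the text does not specify which average is reported.

```python
from collections import defaultdict
from statistics import mean

MIN_SECONDS = 60    # under 1 minute is treated as implausibly short
MAX_SECONDS = 3600  # over 1 hour suggests the graduate left and resumed


def mean_duration_by_mode(records):
    """Return {mode: mean duration in seconds} after removing outliers.

    `records` is an iterable of (mode, seconds) pairs (assumed layout).
    Durations under 60 seconds or over 3,600 seconds are dropped.
    """
    kept = defaultdict(list)
    for mode, seconds in records:
        if MIN_SECONDS <= seconds <= MAX_SECONDS:
            kept[mode].append(seconds)
    return {mode: mean(values) for mode, values in kept.items()}
```

A mode with no durations inside the accepted range is omitted from the result rather than reported as zero.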

Survey Routing complications

We have spotted some data quality concerns with the Seen/Answered flags, potentially caused by graduates seeing questions they no longer had a requirement to answer. This can occur when a graduate goes back to a previous section and changes their route through the survey. Survey scripts have been changed to better record the final route the graduates took. This should prevent inconsistencies in the data, and quality assessments of the flags will continue.

Future work

We have gradually increased the number of paradata variables we have access to and can analyse. These can deliver important insight into how our survey is operating, as well as how graduates interact with it. Some of the variables we have access to are recent additions, and therefore the focus is now on understanding this new data and how best we can utilise the knowledge to make survey improvements.

Below is a list of paradata items for which we are looking to improve the accuracy and/or analyse in further detail.

  • An accurate call count variable for all graduates is vital in understanding our call management system. However, due to the practical requirement to reset records in response to technical issues, this data item suffers from a degree of under-recording. We are looking at obtaining a more accurate call count variable.
  • Survey duration split by section. These variables record the duration in seconds a respondent spent on each section. This enables us not only to use a combined total of the sections to calculate the overall duration, but also to see which sections of the survey are taking the longest and are most burdensome. We can also calculate the average time taken to answer a question in each section because we know which questions the respondent answered.
  • Seen/Answered flags. By looking at those who have not completed the survey, we can identify questions that cause survey drop-out. We have spotted some data quality concerns with these flags which also need to be fixed before we can fully utilise this variable.
  • Section flow. This variable shows us the order in which the sections of the survey were presented to a graduate. The ordering is determined by both the activities selected in the first question and which activity was selected as most important. We have seen that there are differences by mode, the reasons for which need to be investigated further. We also need to understand why graduates on CATI are selecting more activities than those online.
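
As an illustration of the section-level duration analysis listed above, the sketch below combines per-section durations with counts of answered questions for a single respondent. The input layout, the section labels and the function name are assumptions made for the example; the answered counts are assumed to have been derived from the Seen/Answered flags.

```python
def section_summary(section_seconds, answered_counts):
    """Summarise one respondent's time by section.

    section_seconds: {section: seconds spent in that section}
    answered_counts: {section: number of questions answered in it}
    Both layouts are hypothetical.

    Returns (total seconds, longest section, {section: seconds per question}).
    """
    total = sum(section_seconds.values())
    longest = max(section_seconds, key=section_seconds.get)
    # Skip sections with no answered questions to avoid dividing by zero.
    per_question = {
        section: section_seconds[section] / count
        for section, count in answered_counts.items()
        if count > 0 and section in section_seconds
    }
    return total, longest, per_question
```

For example, `section_summary({"A": 120, "B": 300}, {"A": 4, "B": 6})` gives a total of 420 seconds, identifies B as the longest section, and reports 30 and 50 seconds per question for A and B respectively.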
