
Paradata Review

Overview

Paradata largely comprises system-generated logging data which is, in its own way, as rich as the survey data itself, and offers insights into the behavioural characteristics of respondents. It requires some complex scripting to access, and, as we learn more about the capabilities of the system, we are extending the catalogue of paradata we extract and utilise. When combined with our data on population characteristics, it also yields potential insights into non-respondents.

Our current paradata dictionary includes variables for the start mode, partial completion mode, completion mode, various status markers, the last question viewed, the number of calls made, and a range of variables relating to the sending of emails and SMS messages. Over the last few years we have been using some of this paradata to inform our data collection processes: identifying the most suitable times for sending emails and SMS messages based on completion times, changing subject lines to encourage higher email open and click rates, and monitoring interviewer performance using the average number of calls, to name a few. We have gradually and regularly increased the number of paradata variables we have access to. Although data quality checks have been carried out in previous cohorts of the survey, these were mostly prompted by random errors being spotted, and checks were carried out on an ad hoc, rather than regular, basis.

Now that we have a better understanding of which variables we require and utilise, we have obtained a regular import of these variables (from Confirmit, the survey data collection platform) in a format that enables us to link them to other data such as population characteristics, survey completion status and results data. It was therefore vital that a review of every paradata variable was carried out.

Over the years we have also made incremental changes to the quality of different paradata variables. In the first of a series of reports, we report on a review using data from Cohort D of Year 3 to analyse the usefulness and accuracy of the paradata we can obtain from the Confirmit system. Cohort D is by far our largest cohort, accounting for around 70% of the population, and was therefore the most useful sample to review. The review covered nearly 280,000 records, and similar analysis will be carried out at the conclusion of every new cohort.

Paradata available to use in Graduate Outcomes

Background

The paradata we refer to is the data collected during the administration of the Graduate Outcomes survey. We capture data relating to numerous variables, only some of which we have so far explored in detail. The data we have utilised, and how it is used, is outlined in more detail in Table 4.

Paradata is collected for all graduates interacting with the survey (accessing the survey link online) and for those receiving calls, so it is present for respondents and some non-respondents. Certain items of paradata are monitored regularly whilst others are yet to be used. For example, paradata relating to completion dates enables us to monitor the operational running of the survey on a daily basis. This helps us to report on how our response rate is progressing versus the same time period in the previous year. In addition, variables relating to survey modes are used to make informed decisions, such as identifying the effectiveness of our engagement strategy and highlighting areas for improvement.

Given the widespread collection of paradata in Graduate Outcomes and other surveys, there are many research areas that could emerge from analysis of the data, informing both the collection of paradata and how it is used. For instance, in collaboration with our contact centre, we also use data about survey interviewers and interviewer observations to monitor progress and to identify and address data collection issues. This is not in the scope of this report, as the focus is largely on the standard data items which relate to all respondents. In future, we are also looking at the feasibility of incorporating paradata variables into the modelling we use for case prioritization.

How we use the paradata

In Year 3 we extended the number of paradata variables we have access to and report on. This has helped the operational running of the survey, improved our understanding of how graduates interact with the survey, and informed how best to contact graduates via our engagement strategy.

The table below summarises how we currently use the paradata information we import daily from the survey data collection platform.

Table 4: How we use the paradata fields and some of the key findings

Paradata: When and which survey links are accessed
How it is used: Identifying which survey links are accessed enables us to judge the effectiveness of our online communications.
Key findings: Around 70% of online surveys were accessed via email links and 30% via SMS links.

Paradata: Start mode, partial completion mode, completion mode
How it is used: Enables us to view how graduates interact with our concurrent mixed-mode survey.
Key findings: 60% of our surveys are completed via CATI and 40% online.

Paradata: Status markers, including whether the graduate has answered part, the minimum requirements or all of the survey
How it is used: To identify graduates completing only part of the survey.
Key findings: Survey completion rates are high for those answering the first question (<10% of graduates that start the survey do not reach completion status).

Paradata: Date and time information
How it is used: Enables us to track the most popular completion times of day/week.
Key findings: Weekdays (particularly midweek) and early evenings are the most popular completion times.

Paradata: Duration
How it is used: Helps us keep the survey as short but as comprehensive as possible, lets us see how duration differs by mode, and helps us identify ways of making efficiency savings.
Key findings: The survey is significantly quicker online than over the phone, with an average online duration of under 10 minutes.

Paradata: Browser and device
How it is used: To ensure our survey is user friendly on the browsers and devices our target audience use.
Key findings: Chrome is the most frequently recorded browser used to take the survey.

Paradata: Number and type of contact details we are supplied with
How it is used: We identify and contact Providers falling below certain thresholds that are likely to impact response rates in the forthcoming cohort.
Key findings: Missing details are rare: approximately 95% of graduates have a phone number supplied and 98% have an email address.

Paradata: Number of good and unobtainable phone numbers called during the cohort
How it is used: We complete a CATI review post cohort to analyse the effectiveness of the phone contact details we were supplied with.
Key findings: Only a small proportion of graduates do not have a valid phone number.

Paradata: Status of calls, for example appointments/answerphones
How it is used: To track how effective the contact centre was at obtaining useful call outcomes.
Key findings: The presence of a valid phone number does not guarantee response, as many calls go unanswered.

Paradata: Number of proxy surveys/web transfers
How it is used: To judge how effective proxy surveys and web transfers are in boosting responses.
Key findings: Proxy surveys and successful web transfers account for <1% of completed surveys.

Paradata: Number of surveys offered and answered in Welsh
How it is used: We offer the survey in Welsh, and tracking the volume completed is used for invoicing.
Key findings: Only a tiny minority of surveys were completed in Welsh. Even where Welsh was selected as the preferred language, many responses are provided in English.

Paradata: Opt-out rates and reasons for opting out (online)
How it is used: Helps us understand why graduates do not want to complete the survey and enables us to identify peak times for opting out.
Key findings: Of the options provided, the most frequently observed reason for opting out was "I’m not interested in completing the survey".

Potential usage of paradata in case prioritization

The use of paradata in propensity modelling to manage non-response bias is an established methodology in survey research. The paradata we collect in Graduate Outcomes may be able to help us identify groups of non-respondents who are least likely to respond despite our efforts to engage with them.

We are currently looking at the feasibility of incorporating paradata variables into the modelling for case prioritization. To date this has not been done, due to data quality concerns identified after the data was extracted. Variables for case prioritization need to be accurate at around the mid-point of the cohort (before the propensity scoring is done), and would therefore need to be reviewed for quality concerns at that point.
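As an illustration of the technique, the sketch below fits a simple logistic response-propensity model on a few paradata fields from the dictionary (CallCount, GoodNumberCount, UnobtainableCount). The toy data and the choice of model are assumptions for illustration, not the specification we will necessarily adopt.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Sketch of a response-propensity model built on paradata fields.
# The toy data and the model choice are illustrative assumptions.
df = pd.DataFrame({
    "CallCount":         [0, 3, 8, 1, 5, 12],
    "GoodNumberCount":   [1, 2, 1, 2, 0, 1],
    "UnobtainableCount": [0, 0, 1, 0, 2, 1],
    "responded":         [1, 1, 0, 1, 0, 0],  # reached completion status
})

X = df[["CallCount", "GoodNumberCount", "UnobtainableCount"]]
y = df["responded"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# A low predicted propensity at the mid-point of the cohort would flag
# graduates least likely to respond, who could be prioritised differently.
df["propensity"] = model.predict_proba(X)[:, 1]
print(df.sort_values("propensity"))
```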

One variable we are investigating records whether a graduate opened the survey link and, if so, when. With this information we can create a derived, categorical variable which either contains the number of days taken to open the link or places graduates into a final category of ‘never opened’. Not opening the questionnaire for a long period suggests poor engagement with the survey and potentially a lower probability of taking part. Analysis of this data suggests that the frequency of opening reduces over time, with the largest number of graduates who open the link doing so early in a cohort.
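A minimal sketch of how such a derived variable could be constructed is shown below, using the times_1 link-access timestamp from the paradata dictionary; the cohort start date and the category boundaries are illustrative assumptions.

```python
import pandas as pd

# Sketch: derive a categorical "days to open" variable from the link
# access timestamp (times_1). The cohort start date and the category
# boundaries are illustrative assumptions.
df = pd.DataFrame({"times_1": ["2021-12-03 10:15", "2022-01-20 19:02", None]})
cohort_start = pd.Timestamp("2021-12-01")

days_to_open = (pd.to_datetime(df["times_1"]) - cohort_start).dt.days

df["open_category"] = pd.cut(
    days_to_open,
    bins=[-1, 7, 28, float("inf")],
    labels=["opened in first week", "opened in first month", "opened later"],
).cat.add_categories("never opened")

# Graduates with no recorded link access form the final category.
df.loc[df["times_1"].isna(), "open_category"] = "never opened"
print(df)
```

A category such as open_category could then join the feature set of a propensity model like the one sketched above.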

Data quality review

Summary of the data quality checks

Most of the variables appear to be accurate in terms of coverage (data is present and correct for all graduates that should have paradata present). In some instances, the accuracy of a variable can be judged by comparing it with another, similar variable; where contradictions occur, this can indicate an error in one or both fields. Similarly, errors can be found where data is present in one field but missing in another.
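As a sketch, checks of this kind can be expressed as simple cross-field rules; the example below uses field names from Table 5, though the extract file name and the exact rules are assumptions for illustration.

```python
import pandas as pd

# Sketch of cross-field consistency checks; the rules below are
# illustrative assumptions based on the checks described in the text.
df = pd.read_csv("cohort_d_paradata.csv")  # hypothetical extract

# Coverage rule: StartMode should be present wherever the first
# question was answered (A1completionMode populated).
missing_start_mode = df[df["StartMode"].isna() & df["A1completionMode"].notna()]

# Contradiction rule: a recorded CATI outcome alongside zero call
# attempts conflicts with the CallCount field.
call_count_conflict = df[(df["CallCount"] == 0) & df["CatiExtendedStatus"].notna()]

print(len(missing_start_mode), "records unexpectedly missing StartMode")
print(len(call_count_conflict), "records with contradictory call counts")
```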

Errors were found in only 6 of the 36 fields reviewed. Where unexplained errors were present, these generally related to missing data and represented <1% of the total. Most errors could be explained by scripting or manual interventions made during the cohort. Paradata fields where errors were spotted are included in Table 5.

Table 5: List of paradata fields that have been quality checked

Paradata field | Field description | Outcome of data quality check
StartMode | Start mode of the survey | Error(s) found (4 records missing data)
PartialCompletionMode | The partial completion mode, set immediately after section D | No errors
CompletionMode | Final completion mode of the survey | Error(s) found (2 records missing data)
Completion_Status | Stage of the survey reached: 1 = Partial; 2 = Minimum requirements; 3 = Reached end of the survey; 4 = Full complete (all questions answered) | Error(s) found (49 records with an incorrect partial status)
Contact_Status | A marker determining whether a graduate has reached the minimum requirements to count as a survey completion | No errors
FurthestQn | Furthest question reached in the survey | Error(s) found (some survey questions appear not to be captured by this marker)
CallCount | Number of call attempts to each graduate across all numbers | Error(s) found (call counts contradict other paradata fields in some records)
CatiExtendedStatus | Status of the graduate’s last interaction with IFF | No errors
HideIfProxy | Whether the survey was completed by proxy | No errors
Interview_Start | Start time and date of the graduate entering the survey; overwritten if the graduate re-enters the survey | No errors
Interview_End | The time and date at which the respondent reached completion status in the survey | No errors
LastComplete | The date/time the interview was last completed; can be set before the end of the question set, depending on how far the respondent reached | No errors
StartSMS | Records which UK mobile number was used at entry via SMS | No errors
ReturnSMS1 | Records which UK mobile number was used at re-entry via SMS | No errors
ReturnSMS2 | Records which UK mobile number was used at second re-entry via SMS | No errors
StartEmail | Records which email address was used at entry | No errors
ReturnEmail1 | Records which email address was used at re-entry | No errors
ReturnEmail2 | Records which email address was used at second re-entry | No errors
BrowserType | Browser used (Safari/Chrome etc.) | No errors
DeviceType | The device used to answer the survey (Desktop/Touch) | No errors
valOfNotAns | The latest question of the survey that was not answered | No errors
A1completionMode | Mode in which the first question of the survey was completed | No errors
times_1 | The time and date at which the survey link was accessed | No errors
times_3 | The time and date at which the final question of the survey was completed | No errors
elapsedTime | The duration of the survey in seconds | No errors
UnobtainableCount | The number of invalid telephone numbers a graduate has | No errors
GoodNumberCount | The number of good telephone numbers the record has | No errors
HourOfCompletion | The hour of the day at which the survey was completed | No errors
DayOfCompletion | The day on which the survey was completed | No errors
contacttype | The type of telephone number used to start (not complete) the survey (Mobile/Landline etc.) | No errors
ScreenWidth | The screen width of the device used to take the survey; only populated for those who do not opt out | No errors
ScreenHeight | The screen height of the device used to take the survey; only populated for those who do not opt out | No errors
First_Interviewstart | The date and time of the first time a graduate enters the survey | Error(s) found (some instances where the recorded survey entry time is inaccurate)
OptOutQ | A flag indicating whether a graduate has opted out | No errors
DateUnsubscribed | Opt-out date | No errors
OptOutReasonsMS_1 to _6 | Opt-out reason | No errors

Missing data

A small number of errors were found by comparing data from similar fields: for example, records where the start mode was missing but the paradata suggested the first question had been answered. As the start mode is populated when the graduate first interacts with the survey, it should always be present for those answering the first question.

Previously there were far more instances of missing data, due to the point at which the data was being collected. For example, variables counting good and unobtainable phone numbers were being under-recorded because the data was collected only once the graduate had started the survey. To rectify this, scripting change requirements were identified in Confirmit.

In addition, scripting changes were identified to move the point at which we collect the hour and day the survey was completed. The two variables, hour of completion and day of completion, were missing data for those not reaching the end of the survey. This was because the data was collected at the point the survey closed, rather than when the graduate had answered the minimum set of questions. The specification for data capture has since been revised.
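A sketch of the revised derivation is below; it assumes the timestamp at which the minimum set of questions was answered is available in LastComplete, which is a simplification for illustration.

```python
import pandas as pd

# Sketch: derive HourOfCompletion and DayOfCompletion from the point at
# which the respondent last completed a question (LastComplete), rather
# than from survey close. Using LastComplete here is an assumption.
df = pd.DataFrame({"LastComplete": ["2022-01-18 18:42:00", "2022-01-22 09:10:00", None]})

ts = pd.to_datetime(df["LastComplete"])
df["HourOfCompletion"] = ts.dt.hour       # 0-23; missing if never set
df["DayOfCompletion"] = ts.dt.day_name()  # e.g. "Tuesday"
print(df)
```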

Dialer issues

There were instances where technical errors with the telephone dialer meant that the call count field needed to be reset for some graduates. This variable was manually reset to 0 during the cohort by the survey programmer so that we could resume calling the affected records. This meant the number of calls made during the cohort was under-recorded in some instances.

Errors due to outliers

Elapsed time (the survey duration) is recorded in seconds and needs to be calculated only after removing outliers. This is because the elapsed time field includes instances where a graduate leaves and resumes the survey at a later point, inflating the survey duration. There are also instances of the survey length being recorded as shorter than it would have been possible to complete the survey in. For the purpose of analysis, we therefore remove anything taking longer than 1 hour (3,600 seconds) or under 1 minute (60 seconds). The average survey durations we calculate by mode closely match our expectations and, for CATI, the durations the call centre has observed. In Cohort D of Year 3, the average length of an online survey was 9 minutes 5 seconds, and for CATI it was 14 minutes 17 seconds.
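A minimal sketch of this trimming rule is below, using the elapsedTime and CompletionMode fields from the paradata dictionary; the extract file name is hypothetical.

```python
import pandas as pd

# Sketch: apply the trimming rule described above (keep durations
# between 60 and 3,600 seconds) before averaging by mode.
df = pd.read_csv("cohort_d_paradata.csv")  # hypothetical extract

valid = df[df["elapsedTime"].between(60, 3600)]

# Mean duration in minutes by completion mode.
mean_minutes = valid.groupby("CompletionMode")["elapsedTime"].mean() / 60
print(mean_minutes.round(1))
```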

Future work

The following fields have been found to hold inaccurate data and are not being used extensively for reporting. Corrective measures are currently being explored and implemented where possible.

  • Furthest question records the last question of the survey reached by the graduate and would help us to quickly identify where graduates drop out of the survey. However, this paradata item appears to be inaccurate: some questions are never recorded as the last question reached, whilst other questions have higher than expected totals.
  • An accurate call count variable for all graduates is vital to understanding our call management system; it could be a useful metric to report and has potential to be used in case prioritization. However, due to the practical requirement to reset records in response to technical issues, this data item suffers from a degree of under-recording.
