Linking datasets without personal information to improve healthcare services

17 Aug 2021

To understand how best to provide healthcare for patients, we often have to bring together information from multiple datasets, as a patient’s information is typically stored in different places. However, this raises privacy concerns around the accessing and sharing of personal information. NIHR ARC North Thames researcher Helen Blake is exploring how we can link up different datasets without needing personal information, illustrated using bowel cancer patient records.

This summary and the above image were created by Dr Helen Blake in collaboration with Patient and Public Involvement (PPI) representatives, who have supported the development and delivery of this research.

In order to answer important clinical and public health questions, researchers increasingly use national clinical and administrative datasets which provide information about patients, their medical information, care they receive and what happens to them afterwards. An example of a national clinical dataset is the National Bowel Cancer Audit dataset, which collects clinical information on patients diagnosed with bowel cancer in National Health Service (NHS) hospitals in England and Wales. An example of a national administrative dataset is Hospital Episode Statistics, which records information on all English NHS hospital admissions, for administrative and reimbursement purposes. We often need to link different datasets together, because there is no single resource containing all the information we need.

A common way of linking the same patient across datasets is to match the personal information in each dataset, where personal information is defined as the information that can be used to identify patients, such as NHS number, date of birth and residential address. This approach of linking datasets usually needs this personal information to be sent to a separate organisation to carry out the linkage. Other methods that do not need personal information could link datasets without having to share this information between organisations.

In our paper, we use something called probabilistic linkage. This means that, instead of matching on personal information, we give scores of how likely it is that pairs of records belong to the same person. The scores depend on how well information in the two records agree. Any information that is in both datasets can be used to link patients across the datasets. For example, we can use age, area of residence, hospital where treatment took place, dates of admission and discharge, and details of procedures. If all of this information together agrees very well, we can be confident that the two records belong to the same patient, without needing to know who the patient is.

We developed and tested a way of linking datasets without using personal information. We give advice on which information to use for linkage, and how to calculate agreement scores. We showed how well this works by linking a clinical dataset to an administrative dataset for bowel cancer patients having emergency surgery in the NHS. We showed that this way of linking datasets works very well compared to linkage using personal information. We linked more than 81 out of every 100 patients without using personal information. The usual way of linking by matching personal information linked 83 out of every 100 patients.

Our work shows that datasets can be linked without using personal information. The use of probabilistic linkage without personal information can:

allow more analysts to link anonymous datasets, including in resource-poor settings where specialised highly secure data environments may not be available.
avoid the need for personal information to be sent between organisations. This will: protect patient data; speed up important research that needs datasets to be linked; and lower the costs of this research.
link datasets that do not have personal information.
improve the usual way of linking, for example when some personal information is missing.

Linking patient data without needing personal information will enable more rapid and lower cost research whilst protecting patient data. Better use can then be made of patient information to improve health and care services, even in settings where services are limited or harder to access.

Related links

Read the research publication: 'Probabilistic linkage without personal information successfully linked national clinical datasets' (Journal of Clinical Epidemiology).
View the research poster.
Learn more on the project web page.