UK Biobank’s data breach is part of a global arms race between research bodies and criminals seeking to exploit personal data.
Last week’s revelation that “rogue” researchers had tried to sell UK Biobank data came after 18 months of attempts by the charity to maintain the security of the biological data it holds on more than 500,000 volunteers.
After receiving an anonymous tip-off, UK Biobank discovered that researchers linked to three Chinese academic institutions, who had previously been vetted by the charity, had created listings on e-commerce websites owned by Alibaba.
It banned the three institutions and asked the government for help. Diplomats liaised with the Chinese government and Alibaba, which took down the listings rapidly.
Although the data contained no names, addresses or dates of birth, there is a risk that anonymised data can be pieced together to identify people. Naomi Allen, UK Biobank’s chief scientist, said they had been “assured” that the data had not been “sold to third parties”.
The UK Biobank incident was just one of several data security issues that emerged last week. Hackers stole 19m records from the French agency that manages driving licences, passports and ID cards, while Booking.com and ADT, the home security firm, were also hacked. The UK was hit by 8.5m cybercrimes last year.
Data breaches became a headache for the charity in 2024 after academic journals began requiring researchers to publish computer code they had used to analyse large medical datasets. Sometimes that code, usually published on code-sharing platform GitHub, has included raw data from UK Biobank, according to Allen.
“We’ve got a machine learning algorithm that does a daily trawl of all open-source repositories and we check that it doesn’t include any data,” she said. “When we do find it, we get the researcher to take it down immediately. Or if we can’t find the researcher, GitHub takes it down for us. That has been really successful.”
Ethics researchers have warned that the type of threat posed by rogue researchers – data misuse rather than data leaks or hacks – is not taken seriously enough by biobanks. Their warning was based on a 2021 analysis of BBMRI-ERIC, an association of more than 550 European biobanks.
UK Biobank was one of the first attempts to gather large amounts of medical data about individuals so that researchers could make links between different biological processes in the body. Recruitment of people aged 40 to 69 began in 2003; volunteers agreed to have blood tests and body and brain scans, and to answer pages of questions about their health and lifestyle. The details of their answers are so intimate that married couples who signed up are treated separately, said Claudine Henderson, who runs one of UK Biobank’s imaging centres in Newcastle.
UK Biobank has been enormously successful in enabling scientific advances, with 22,000 researchers accessing its data and producing 18,000 papers. Doctors now analyse heart scans in seconds with AI developed using UK Biobank data, and NHS clinics can diagnose dementia in minutes.
The problems began when journals required researchers to publish code they had used
Yet UK Biobank’s longevity, size and success are part of the reason it is more vulnerable. When it began making data available to researchers in 2012, it allowed them to download datasets. Now the database is nearly 40 petabytes (approximately 40m gigabytes) – a high-speed domestic connection would take more than 10 years to download it.
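As a rough check on that download estimate, the arithmetic can be sketched in a few lines of Python. The line speed of 1 gigabit per second is an assumption (the article does not say what counts as a high-speed domestic connection), as are the decimal definition of a petabyte and the illustrative function name `download_years`.

```python
# Back-of-envelope check of the download-time estimate.
# Assumptions: decimal units (1 PB = 10**15 bytes) and a sustained
# 1 Gbit/s domestic connection; neither figure comes from UK Biobank.

PETABYTE = 10**15  # bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def download_years(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the years needed to transfer size_bytes over the given link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_YEAR


database_bytes = 40 * PETABYTE      # "nearly 40 petabytes"
assumed_link = 1e9                  # 1 Gbit/s (assumption)

years = download_years(database_bytes, assumed_link)
print(f"{years:.1f} years")
```

At that assumed speed the transfer would take a little over 10 years of uninterrupted downloading, which is consistent with the figure quoted; a slower line would take proportionally longer.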
Most of the large health data repositories that have been launched since then, such as Finland’s FinnGen, Germany’s Nako, and All of Us and the Million Veteran Program in the US, have put data in the cloud, and forced researchers to do their analysis there instead.
UK Biobank began to transfer to a similar system in 2021, but the security measure was highly unpopular with researchers as it was more costly. Timothy Raben, a geneticist, said in 2024 that some groups would have to give up research because it had become “cost prohibitive”. China’s Kadoorie Biobank still appears to allow researchers to download data and biotech firms are increasingly turning to it and other Chinese data sources, industry sources say.
“The most secure data set is one that’s locked away and is never used,” Allen said. “We want to make rapid progress into the causes and treatments of disease. It’s a balance between making the data available and advancing science versus ensuring the security of the data.”
There is a risk in sharing data, but also in not sharing it, she said.
“Scientific progress will not be made if you don’t have the global collaborative community working on these data to make those discoveries,” Allen said. “And I think that that trade-off is difficult to get right, because the technology is moving all the time.”
Photograph by Gary Calton for The Observer


