A team of Cambridge scientists reported Thursday that they used Internet searches and genealogy websites to discern the names of nearly 50 people who had anonymously provided genetic samples listed in a publicly accessible research database, demonstrating that, like credit card and bank account numbers, genetic information is vulnerable to hacking.
The people had volunteered for research studies involving the sequencing of their DNA based on assurances that their names would not be linked to their samples, but the Whitehead Institute for Biomedical Research scientists showed how easily this sensitive health information could be revealed and possibly fall into the wrong hands. Identifying the supposedly anonymous research participants didn’t require fancy tools or expensive equipment: it took a single researcher with an Internet connection about three to seven hours per person.
In the era of Facebook and face-recognition software, the loss of privacy is part of a daily conversation, but many people don’t realize how much they’ve already sacrificed. Adding DNA to the mix raises the stakes because of how much information it carries about an individual’s disease risk and traits.
The feat, described in the journal Science, has already triggered action at the National Institutes of Health, which has removed the ages of the participants from a searchable repository of genetic material and put the information under tighter control. It has also sparked a broader discussion about how to facilitate the sharing of data among researchers while also protecting individuals.
“This is not shocking. I think this is just a moment of recognition, a reflection moment,” said Dr. Eric Green, director of the National Human Genome Research Institute, which worked with the cell repository to remove the ages of participants. “We have all these values which are totally laudable but are beginning to come into conflict. What is the best way to navigate this?”
Green said some scientists fear that restricting access to genomic databases could slow research and “in some people’s views, none of this could be completely private in this era ... therefore, what we should be doing is changing the conversation, and be very open.”
The researchers did not undertake the project with the intent of exposing individuals or violating their privacy. Yaniv Erlich, a fellow at the Whitehead Institute who led the research, was inspired by a previous job, at a computer security company. To check the robustness of a bank or credit card company’s database, he would do vulnerability testing—try to break in to identify security weaknesses.
Now, as a researcher who works with DNA, he became interested in testing the assurances given to research volunteers that they were unlikely to be identified from the DNA they provided for studies.
Erlich decided to use genealogy websites, which publicly post limited genetic information taken from men’s Y chromosomes to help people try to track down their ancestors. Specifically, he and his team examined short tandem repeats, stretches of DNA with characteristic repetitive patterns that are inherited. Genealogy websites post such information because fathers typically pass both their Y chromosome and their last name to their sons, so people with similar numbers of repeats may be related. But Erlich thought he could use those databases to figure out the last names of people whose anonymous genetic information was available for scientific research.
First, he took the publicly available genome of J. Craig Venter, the biologist who played a key role in sequencing the human genome. He took the repeating stretches of Venter’s Y chromosome and put those into the genealogy website. The top hit for the last name associated with that genetic fingerprint was the last name Venter. With just a few other pieces of information—a year of birth, a state of residence—it was easy to use an Internet search to identify the famous biologist.
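The matching step can be illustrated with a minimal Python sketch. It is not the researchers’ code or any genealogy site’s search algorithm: the surnames, the repeat counts, and the rank_surnames helper are invented for illustration, and a simple count of mismatched markers stands in for whatever scoring a real site uses. The marker labels follow the common DYS naming for Y-chromosome repeats.

```python
from typing import Dict, List, Tuple

# A Y-chromosome profile maps a marker name to its repeat count.
Profile = Dict[str, int]

# Toy genealogy database: (surname, profile). All values are made up.
GENEALOGY_DB: List[Tuple[str, Profile]] = [
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Birch", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 13}),
    ("Cedar", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 12}),
]


def rank_surnames(
    query: Profile,
    database: List[Tuple[str, Profile]],
    min_shared: int = 3,
) -> List[Tuple[str, int]]:
    """Rank surnames by how few shared markers differ from the query.

    Records that share fewer than `min_shared` markers with the query
    are skipped, since a match on very few markers says little.
    """
    hits = []
    for surname, profile in database:
        shared = query.keys() & profile.keys()
        if len(shared) < min_shared:
            continue
        mismatches = sum(1 for marker in shared if query[marker] != profile[marker])
        # Fewer mismatches first; more shared markers breaks ties.
        hits.append((mismatches, -len(shared), surname))
    hits.sort()
    return [(surname, mismatches) for mismatches, _, surname in hits]


# Repeat counts read from an "anonymous" genome (also made up).
anonymous_profile: Profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

if __name__ == "__main__":
    print(rank_surnames(anonymous_profile, GENEALOGY_DB))
    # [('Alder', 0), ('Cedar', 2), ('Birch', 3)]
```

The top-ranked surname is only a lead. As the article describes, it still has to be combined with other details before it points to one person.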
Then, he decided to extend the technique to see if it would work with truly anonymous data. He began with 10 unidentified men whose DNA sequences had been analyzed and posted online as part of the federally funded 1000 Genomes Project. The men were also part of a separate scientific research study in which their family members had provided genetic samples. The samples, and the donors’ relationship to one another, were listed on a website and publicly available from a tissue repository.
Using the same basic technique that matched Venter, along with obituaries and other searchable public records, Erlich was able to identify nearly 50 people: some of the original men, plus family members who had provided genetic samples. One man, for example, was identified because his great-great nephew had submitted a sample to a genealogy database.
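Why a surname plus a couple of ordinary details can be enough is easy to see in a small Python sketch. It assumes a hypothetical list of public records with invented names, birth years, and states. The PublicRecord type and the narrow function are illustrative only and do not describe the records or tools the researchers used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PublicRecord:
    name: str
    surname: str
    birth_year: int
    state: str


# Invented records standing in for obituaries and other public listings.
RECORDS: List[PublicRecord] = [
    PublicRecord("A. Alder", "Alder", 1950, "VT"),
    PublicRecord("B. Alder", "Alder", 1950, "OH"),
    PublicRecord("C. Alder", "Alder", 1972, "VT"),
    PublicRecord("D. Birch", "Birch", 1950, "VT"),
]


def narrow(
    records: List[PublicRecord],
    surname: str,
    birth_year: Optional[int] = None,
    state: Optional[str] = None,
) -> List[PublicRecord]:
    """Keep only records that match every detail that was supplied."""
    matches = [r for r in records if r.surname == surname]
    if birth_year is not None:
        matches = [r for r in matches if r.birth_year == birth_year]
    if state is not None:
        matches = [r for r in matches if r.state == state]
    return matches


if __name__ == "__main__":
    print(len(narrow(RECORDS, "Alder")))              # surname only
    print(len(narrow(RECORDS, "Alder", 1950)))        # plus birth year
    print(len(narrow(RECORDS, "Alder", 1950, "VT")))  # plus state
```

Each added detail shrinks the pool of candidates, which helps explain why removing ages from the repository makes this kind of search harder.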
Green said the paper would add to an ongoing discussion about how to preserve access to scientific data and also protect the people who participate in studies.
“My hope is that this will make everything more open,” said George Church, a geneticist at Harvard Medical School who runs the Personal Genome Project, a research effort that publicly displays people’s DNA sequences, traits, and in some cases, names. “I think pretending that there’s a new encryption algorithm or ... if we put the age in one database and the data in another, to fix things, that’s just sticking our heads in the sand,” Church said.
In some ways, the work echoes other efforts to draw out people’s identities from data they would never have considered identifiable.
“We release information about ourselves without thinking about where it’s going to go and what it means to us,” said Jennifer Lynch, a staff attorney at the Electronic Frontier Foundation, a nonprofit digital rights group. “And in many instances, I think we release that information for good reason. There’s a lot to be gained by giving up samples of DNA for research purposes.”
Lynch said her fear is that something a single researcher did in three to seven hours could easily be automated and used by companies or insurers to make predictions about a person’s risk for disease, for example. Although there is federal legislation, the Genetic Information Nondiscrimination Act, or GINA, to keep genetic information from being used by health insurers and employers to discriminate against people, she and others consider it insufficient.
Still, societal attitudes are evolving. Some young people may realize they can be identified by the breadcrumbs they leave on social networks and think, “So what?”
“The up and coming generations have a much different concept of privacy than past generations have,” said Dov Greenbaum, an assistant professor of molecular biophysics and biochemistry at Yale University. “Perhaps that will play out in terms of how controlling people are going to want to be over their private information.”
Carolyn Y. Johnson can be reached at email@example.com. Follow her on Twitter @carolynyjohnson.