Reposted from: http://www.36kr.com/p/219383.html
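Seeing your personal information sold to third parties, such as your phone number, name, and home address, has come to feel routine. Still, most of us trust that payment and security data such as credit card records are safe, since most sensitive personal details are stripped out. A study published in Science this week may shatter that trust.
The latest research on credit card data shows that it takes only a small amount of personal information to de-anonymize the people in a large data set. Arvind Narayanan, a computer scientist at Princeton, told the media that data custodians should carefully limit access to data and that new safeguards are needed.
Yves-Alexandre de Montjoye, an applied mathematician at MIT in Cambridge, put it more bluntly: where sensitive personal data is involved, the open sharing of raw data sets is not the future. Personal health and medical histories are perhaps the best example, a point we have stressed in earlier coverage.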
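The research team analyzed three months of credit card transactions covering the spending of 1.1 million people at 10,000 shops in a single country. The bank had stripped the names, credit card numbers, shop addresses, and exact transaction times from every record, leaving only metadata: the amount spent, the shop type, and a code representing each person. Even so, the data did not withstand the attack. (The team is tight-lipped about the source of the data, saying only that it is a major bank.)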
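How were individuals identified? Although personal details had been removed, the team found that each user's spending pattern is unique, which gives the data a very high "unicity." That makes the data vulnerable to what is called a correlation attack: matching the metadata against information about a person from an outside source is enough to reveal who they are.
The best-known correlation attack is last year's exposure of celebrity taxi rides, which drew on a data set released by the New York City Taxi and Limousine Commission.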
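How can this information be protected? The team gave one example: coarsen the data, for instance by reporting the week of a purchase rather than the exact day, and a price range rather than the exact amount. This helps, but in the study it only raised the amount of outside information an attacker needed; it did not stop the attack.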
[This article draws on the following source: sciencemag.org]
Vol. 347 no. 6221 p. 468
DOI: 10.1126/science.347.6221.468
Credit card study blows holes in anonymity
Attack suggests need for new data safeguards.
For social scientists, the age of big data carries big promises: a chance to mine demographic, financial, medical, and other vast data sets in fine detail to learn how we lead our lives. For privacy advocates, however, the prospect is alarming. They worry that the people represented in such data may not stay anonymous for long. A study of credit card data in this week’s issue of Science (p. 536) bears out those fears, showing that it takes only a tiny amount of personal information to de-anonymize people.
The result, coming on top of earlier demonstrations that personal identities are easy to pry from anonymized data sets, indicates that such troves need new safeguards. “In light of the results, data custodians should carefully limit access to data,” says Arvind Narayanan, a computer scientist at Princeton University who was not involved with the study. Or as the study’s lead author, Yves-Alexandre de Montjoye, an applied mathematician at the Massachusetts Institute of Technology (MIT) in Cambridge, puts it: When it comes to sensitive personal information, “the open sharing of raw data sets is not the future.”
De Montjoye’s team analyzed 3 months of credit card transactions, chronicling the spending of 1.1 million people in 10,000 shops in a single country. (The team is tightlipped about the data’s source—a “major bank,” de Montjoye says—and it has not disclosed which country.) The bank stripped away names, credit card numbers, shop addresses, and even the exact times of the transactions. All that remained were the metadata: amounts spent, shop type—restaurant, gym, or grocery store, for example—and a code representing each person.
But because each individual’s spending pattern is unique, the data have a very high “unicity.” That makes them ripe for what de Montjoye calls a “correlation attack.” To reveal a person’s identity, you just need to correlate the metadata with information about the person from an outside source.
One correlation attack became famous last year when the New York City Taxi and Limousine Commission released a data set of the times, routes, and cab fares for 173 million rides. Passenger names were not included. But armed with time-stamped photos of celebrities getting in and out of taxis—there are websites devoted to celebrity spotting—bloggers, after deciphering taxi driver medallion numbers, easily figured out which celebrities paid which fares.
Stealing a page from the taxi data hack, de Montjoye’s team simulated a correlation attack on the credit card metadata. They armed their computers with a collection of random observations about each individual in the data: information equivalent to a single time-stamped photo. (These clues were simulated, but people generate the real-world equivalent of this information day in and day out, for example through geolocated tweets or mobile phone apps that log location.) The computer used those clues to identify some of the anonymous spenders. The researchers then fed a different piece of outside information into the algorithm and tried again, and so on until every person was de-anonymized.
Just knowing an individual’s location on four occasions was enough to fingerprint 90% of the spenders. And knowing the amount spent on those occasions—the equivalent of a few receipts from someone’s trash—made it possible to de-anonymize nearly everyone and trace their entire transaction history with just three pieces of information per person. The findings echo the results of a 2013 Scientific Reports study in which de Montjoye and colleagues started with a trove of mobile phone metadata on subscribers’ movements and showed that knowing a person’s location on four occasions was enough to identify them.
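To make the idea of unicity concrete, here is a minimal Python sketch of how such a simulation might be set up. It is not the study's code or data: the traces are purely synthetic, a "point" is assumed to be a (shop, day) pair, and the names make_synthetic_traces and unicity are hypothetical. The function draws k points from each person's own trace, as an attacker with k outside observations would have, and counts how often those points match exactly one person in the data set.

```python
import random
from collections import defaultdict


def make_synthetic_traces(n_people, n_shops, n_days, n_tx, seed=0):
    """Build {person_id: set of (shop, day) points}. Synthetic data, for illustration only."""
    rng = random.Random(seed)
    traces = {}
    for pid in range(n_people):
        traces[pid] = {
            (rng.randrange(n_shops), rng.randrange(n_days)) for _ in range(n_tx)
        }
    return traces


def unicity(traces, k, seed=1):
    """Fraction of people singled out by k points drawn from their own trace (k >= 1)."""
    rng = random.Random(seed)
    # Inverted index: point -> everyone whose trace contains that point.
    index = defaultdict(set)
    for pid, points in traces.items():
        for point in points:
            index[point].add(pid)

    unique = 0
    for pid, points in traces.items():
        # The attacker's outside knowledge: k points known to belong to this person.
        known = rng.sample(sorted(points), min(k, len(points)))
        # Everyone in the data set who is consistent with all known points.
        candidates = set.intersection(*(index[point] for point in known))
        if candidates == {pid}:
            unique += 1
    return unique / len(traces)


if __name__ == "__main__":
    traces = make_synthetic_traces(n_people=2000, n_shops=200, n_days=90, n_tx=30)
    for k in (1, 2, 3, 4):
        print(k, round(unicity(traces, k), 3))
```

The numbers this prints depend entirely on the synthetic parameters and say nothing about real credit card data; the sketch only shows the mechanics of intersecting a few known points against an "anonymous" data set.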
One way to protect against correlation attacks is to blur the data by binning certain variables. For example, rather than revealing the exact day or price of a transaction, the public version of the data set might reveal only the week in which it occurred or a price range within which it fell. Binning did not thwart de Montjoye’s correlation attack; instead, it only increased the amount of information needed to de-anonymize each person to the equivalent of a dozen receipts.
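Binning of this kind can be sketched in a few lines of Python. The bin widths below (seven-day weeks and four price edges) and the name bin_transaction are assumptions chosen for illustration, not the values used in the study.

```python
import bisect


def bin_transaction(day, amount, days_per_bin=7, price_edges=(10, 50, 100, 500)):
    """Coarsen an exact (day, amount) pair into a (week, price bucket) pair.

    Bucket 0 is everything below the first edge, bucket 1 is the range from
    the first edge up to (not including) the second, and so on.
    The edges are hypothetical; a real release would choose its own.
    """
    week = day // days_per_bin
    bucket = bisect.bisect_right(price_edges, amount)
    return week, bucket


if __name__ == "__main__":
    # Two different purchases that become indistinguishable after binning.
    print(bin_transaction(8, 12.00))
    print(bin_transaction(12, 49.99))
```

Coarser bins make more transactions look alike, which is why an attacker needs more outside observations to single someone out, but as the article notes, this raises the cost of the attack rather than preventing it.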
These studies needn’t be the death knell for social science research using big data. “We need to bring the computation to the data, not the other way around,” de Montjoye says. Big data with sensitive information could live “in the cloud,” protected by gatekeeper software, he says. The gatekeeper would not allow access to individual records, thwarting correlation attacks, but would still let researchers ask statistical questions about the data.
The mathematics needed to run such a system, a set of standards and algorithms known as differential privacy, is one of the hottest topics in data science. “It works best when you have a large amount of data,” says Cynthia Dwork, a computer scientist at Microsoft Research in Mountain View, California, who is one of the pioneers of the technique. She admits that it is a stark departure from the traditional academic practice of open data sharing, and many scientists are resistant.
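As a rough illustration of the gatekeeper idea, the following Python sketch answers counting queries with Laplace noise, one of the standard building blocks of differential privacy. It is a simplified assumption of how such a gatekeeper might look, not the system de Montjoye or Dwork describe: the class name Gatekeeper and its methods are hypothetical, and it does not track a privacy budget across repeated queries, which a real deployment would need to do.

```python
import random


class Gatekeeper:
    """Holds records privately and answers only noisy statistical queries (illustrative)."""

    def __init__(self, records, epsilon, seed=None):
        self._records = list(records)
        self._epsilon = epsilon  # smaller epsilon means more noise and more privacy
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two independent exponentials with mean `scale`
        # is Laplace-distributed with location 0 and that scale.
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def noisy_count(self, predicate):
        """Count records matching predicate, plus Laplace noise.

        Adding or removing one person changes a count by at most 1
        (sensitivity 1), so the noise scale is 1 / epsilon.
        """
        true_count = sum(1 for record in self._records if predicate(record))
        return true_count + self._laplace(1.0 / self._epsilon)


if __name__ == "__main__":
    # Synthetic records: (person code, shop type, amount).
    records = [(i, "gym" if i % 4 == 0 else "grocery", 10 + i % 90) for i in range(1000)]
    gatekeeper = Gatekeeper(records, epsilon=0.5, seed=42)
    print(gatekeeper.noisy_count(lambda r: r[1] == "gym"))
```

The noise has a fixed scale regardless of how many records there are, so its relative effect shrinks as the data set grows, which is consistent with Dwork's remark that the approach works best with a large amount of data.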
But without such safeguards, rich databases could remain off limits. Take, for example, the data MIT has accumulated from its massive open online courses. It’s an information trove that education researchers dream of having: a record of the entire arc of the learning process for millions of students, says Salil Vadhan, a computer scientist at Harvard University. But the data are under lock and key, partly out of fears of a prospective privacy breach. “If we can provide data for research without endangering privacy,” Vadhan says, “it will do a lot of good.”
When reposting, please credit: jinglingshu's blog » With a correlation attack, even anonymized credit card data becomes precise information