The privacy risks of compiling mobility data

Merging different types of location-stamped data can make it easier to discern users’ identities, even when the data is anonymized.

MIT researchers find that the growing practice of compiling massive datasets about people’s movement patterns for urban planning and development research may, in fact, put people’s private data at risk — even if that data is anonymized.

Companies, institutions, and other entities have started collecting anonymized data that contains users’ location stamps, that is, geographic coordinates paired with timestamps. The data can be acquired from mobile phone records, credit card transactions, and social media accounts.

In doing so, they gain details about users and their behavior that can inform business expansion.

This is where privacy issues arise. Location stamps are highly specific to individuals and can be used for nefarious purposes. A recent study has shown that, using randomly selected points in mobility datasets, someone could identify and learn sensitive information about individuals. With merged mobility datasets, this becomes even easier.

Now, in a new study, MIT scientists have shown how this can happen in the first-ever analysis of so-called user “matchability” in two large-scale datasets from Singapore, one from a mobile network operator and one from a local transportation system.

The scientists used a statistical model to track users’ location stamps across both datasets. Their model matched 17 percent of individuals in one week’s worth of data, and more than 55 percent of individuals after one month of collected data.

Working with large-scale datasets can yield unprecedented insights into human society and mobility and help us plan cities better. Nevertheless, it is important to show whether identification is possible, so people can be aware of the potential risks of sharing mobility data.

Carlo Ratti, a professor of the practice in MIT’s Department of Urban Studies and Planning and director of MIT’s Senseable City Lab, said, “In publishing the results — and, in particular, the consequences of deanonymizing data — we felt a bit like ‘white hat’ or ‘ethical’ hackers. We felt that it was important to warn people about these new possibilities [of data merging] and [to consider] how we might regulate it.”

For the study, the scientists used a simple statistical approach, measuring the probability of false positives, to efficiently predict matchability among scores of users in massive datasets.

They then compiled two anonymized low-density datasets — a few records per day — about mobile phone use and personal transportation in Singapore, recorded over one week in 2011. The mobile data came from a large mobile network operator and comprised timestamps and geographic coordinates in more than 485 million records from over 2 million users. The transportation data contained over 70 million records with timestamps for individuals moving through the city.

The probability that a given user has records in both datasets increases with the size of the merged datasets, but so does the probability of false positives. The scientists’ model selects a user from one dataset and finds a user from the other dataset with a high number of matching location stamps.
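
The selection step described above can be illustrated with a minimal Python sketch. This is not the scientists’ model: the record format (a coarse grid cell plus an hourly time slot), the names `Stamp`, `count_matches` and `best_match`, and the toy data are all assumptions made up for illustration.

```python
from collections import namedtuple

# Hypothetical record format: a coarse grid cell and an hourly time slot.
Stamp = namedtuple("Stamp", ["cell", "hour"])

def count_matches(traj_a, traj_b):
    """Count the distinct location stamps that appear in both trajectories."""
    return len(set(traj_a) & set(traj_b))

def best_match(user_traj, candidates):
    """Return (candidate_id, matches) for the candidate sharing the most stamps."""
    scores = {cid: count_matches(user_traj, traj) for cid, traj in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy data, invented for illustration only.
phone_user = [Stamp("A1", 8), Stamp("B2", 9), Stamp("C3", 18), Stamp("A1", 19)]
transit_users = {
    "card_1": [Stamp("A1", 8), Stamp("B2", 9), Stamp("A1", 19)],
    "card_2": [Stamp("A1", 8), Stamp("D4", 12)],
    "card_3": [Stamp("E5", 7)],
}

match_id, n_matches = best_match(phone_user, transit_users)
print(match_id, n_matches)  # card_1 3
```

In this toy example the phone user shares three stamps with `card_1` and at most one with the others, so `card_1` is the strongest candidate. Whether that candidate is a true match or a coincidence is a separate question.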

Simply put, as the number of matching points increases, the probability of a false-positive match decreases. After matching a certain number of points along a trajectory, the model rules out the possibility that the match is a false positive.
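
A rough feel for why more matching points make a false positive less likely can be had from a toy calculation. The sketch below is not the method from the paper: it assumes that an unrelated user coincides with each point independently with the same probability `p_single`, and both that value and the candidate count are invented for illustration.

```python
import math

def chance_match_probability(p_single, k, n_candidates):
    """Probability that at least one of n_candidates unrelated users coincides
    on all k points, assuming independent coincidences with probability p_single.
    """
    p_one_candidate = p_single ** k
    # Equivalent to 1 - (1 - p)^n, computed in a numerically stable way.
    return -math.expm1(n_candidates * math.log1p(-p_one_candidate))

# Illustrative values only: 1% chance per point, 2 million candidates.
for k in (1, 3, 5, 10):
    print(k, chance_match_probability(0.01, k, 2_000_000))
```

Under these assumed numbers, a coincidental match on a single point is all but certain somewhere among millions of candidates, while a coincidental match on ten points is vanishingly unlikely.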

Focusing on typical users, they estimated a matchability success rate of 17 percent over a week of compiled data and around 55 percent for about a month. That estimate jumps to around 95 percent with data compiled over 11 weeks.

Looking at users with between 30 and 49 personal transportation records, and around 1,000 mobile records, they estimated more than 90 percent success with a week of compiled data. Additionally, by combining the two datasets with GPS traces — regularly collected actively and passively by smartphone apps — the researchers estimated they could match 95 percent of individual trajectories, using less than one week of data.

Ratti said, “All data with location stamps (which is most of today’s collected data) is potentially very sensitive and we should all make more informed decisions on who we share it with. We need to keep thinking about the challenges in processing large-scale data, about individuals, and the right way to provide adequate guarantees to preserve privacy.”

The paper is published today in IEEE Transactions on Big Data.