
Synthetic Data for Human Trafficking Research | Privacy Preserving

September 23, 2021

Understanding Human Trafficking Through Synthetic Data

Countering human trafficking effectively requires a thorough understanding of the problem, and that understanding increasingly depends on data analysis. A significant obstacle, however, is that no centralized, publicly accessible registry of trafficking victims exists, even though organizations hold a considerable amount of confidential case information.

Microsoft and the International Organization for Migration (IOM) may have overcome this obstacle by developing a novel synthetic database. It replicates the key statistical characteristics of genuine trafficking data, yet every record in it is artificial, which protects victims' privacy.

The Challenge of Data Sensitivity

While each victim’s experience is unique, broader trends – such as source and transit countries, common trafficking routes, and final destinations – can be identified through statistical analysis.

Crucially, the evidence needed to discern these patterns and prevent future exploitation is often contained within thousands of individual, sensitive stories that organizations are hesitant to make public.

Harry Cook, IOM program coordinator, highlighted the issue, stating that administrative data on identified trafficking cases is a primary data source, but is also highly sensitive. He expressed IOM’s satisfaction with the two-year collaboration with Microsoft Research to address the challenge of data sharing for analysis while safeguarding victim privacy.

Limitations of Traditional Anonymization

Conventional methods of data protection, like extensive redaction, have proven inadequate against determined attempts at data reconstruction.

The proliferation of publicly available and leaked databases, coupled with increasing computing power, makes re-identification surprisingly feasible.

A Novel Approach: Synthetic Data Generation

Microsoft Research adopted a different strategy: creating a synthetic dataset based on the original data. This new dataset preserves the statistical relationships present in the source material, but contains no personally identifiable information.

This process goes beyond simple name or location changes. Instead, data from groups of at least 10 individuals with similar characteristics are combined. This creates a statistically accurate representation without allowing for individual identification.
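The minimum-group-size idea described above can be illustrated with a small sketch. This is not Microsoft Research's actual pipeline, which the article does not detail; it only shows one simple way to keep attribute combinations shared by at least 10 records and drop the rest. The function name `reportable_groups`, the field names `region` and `category`, and the toy records are all hypothetical.

```python
from collections import Counter

K = 10  # minimum group size mentioned in the article


def reportable_groups(records, attributes, k=K):
    """Count records per attribute combination and keep only groups of size >= k.

    Illustrative assumption: a combination seen in fewer than k records is
    treated as too identifying to release and is dropped entirely.
    """
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return {combo: n for combo, n in counts.items() if n >= k}


# Made-up placeholder records, not real or synthetic trafficking data.
records = (
    [{"region": "A", "category": "X"}] * 12   # common combination
    + [{"region": "B", "category": "Y"}] * 3  # rare combination
)

groups = reportable_groups(records, ["region", "category"])
print(groups)
```

In this toy input, only the combination that appears 12 times survives; the one that appears 3 times falls below the threshold and is withheld.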

Image Credits: Microsoft Research / IOM

The Value of Accessible Data

The resulting synthetic data lacks the fine-grained detail of the original, but it is usable for analysis in ways the sensitive source data is not.

The goal isn’t necessarily to predict the next smuggling operation, but to provide a factual basis for policy and diplomatic discussions. Instead of making general accusations, stakeholders can now cite concrete data, such as “36 percent of sex trafficking victims transit through your jurisdiction.”
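A statistic of that kind is a simple share computed over the released records. The sketch below assumes a hypothetical table layout (the `exploitation` and `transit` fields, the function name `transit_share`, and the toy rows are invented for illustration and do not reflect the published dataset's schema).

```python
def transit_share(rows, exploitation, jurisdiction):
    """Return the percentage of rows of a given exploitation type whose
    transit list includes the given jurisdiction, or None if no rows match."""
    subset = [r for r in rows if r["exploitation"] == exploitation]
    if not subset:
        return None
    hits = sum(1 for r in subset if jurisdiction in r["transit"])
    return 100.0 * hits / len(subset)


# Made-up placeholder rows with a hypothetical schema.
rows = [
    {"exploitation": "type_1", "transit": ["J1", "J2"]},
    {"exploitation": "type_1", "transit": ["J2"]},
    {"exploitation": "type_1", "transit": ["J1"]},
    {"exploitation": "type_1", "transit": []},
    {"exploitation": "type_2", "transit": ["J1"]},
]

share = transit_share(rows, "type_1", "J1")
print(share)
```

Here two of the four `type_1` rows list `J1` as a transit point, so the computed share is 50 percent.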

Understanding the global trade in human exploitation as a systemic issue, rather than a series of isolated incidents, is inherently valuable.

The data is available for review and use on the program’s GitHub page, which also documents how the dataset was created.

This approach offers a pathway to informed action against human trafficking, balancing the need for data-driven insights with the paramount importance of protecting vulnerable individuals.
