Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms
| Main Authors | , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 20.01.2023 |
| DOI | 10.48550/arxiv.2301.08844 |
| Summary: | Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator. |
|---|---|
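
The summary's central step, adding random noise to a low-dimensional marginal and measuring the error in TV distance, can be illustrated with a minimal sketch. This is not the paper's algorithm: the Laplace mechanism, the add/remove-one-record neighbouring relation (L1 sensitivity 1), the clip-and-renormalise post-processing, and all function names (`laplace_noise`, `empirical_marginal`, `noisy_marginal`, `tv_distance`) are assumptions chosen for illustration. It also spends the whole budget `epsilon` on a single marginal, whereas releasing several marginals would require splitting the budget.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def empirical_marginal(records, columns, domain):
    # Non-private marginal: relative frequency of each cell in `domain`.
    counts = Counter(tuple(r[c] for c in columns) for r in records)
    n = len(records)
    return {cell: counts.get(cell, 0) / n for cell in domain}


def noisy_marginal(records, columns, domain, epsilon, rng):
    # Hypothetical helper: Laplace mechanism on the contingency table.
    # Assumption: neighbouring datasets differ by adding or removing one
    # record, so the count vector has L1 sensitivity 1 and the scale is
    # 1 / epsilon.
    counts = Counter(tuple(r[c] for c in columns) for r in records)
    scale = 1.0 / epsilon
    noisy = {cell: counts.get(cell, 0) + laplace_noise(scale, rng)
             for cell in domain}
    # Post-processing (does not affect the privacy guarantee):
    # clip negative counts to zero and renormalise to a distribution.
    clipped = {cell: max(value, 0.0) for cell, value in noisy.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # Degenerate case: fall back to the uniform distribution.
        return {cell: 1.0 / len(domain) for cell in domain}
    return {cell: value / total for cell, value in clipped.items()}


def tv_distance(p, q):
    # Total variation distance between two discrete distributions,
    # i.e. half of the L1 distance between their probability vectors.
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy dataset with two binary attributes (made up for the example).
    records = [{"a": rng.randint(0, 1), "b": rng.randint(0, 1)}
               for _ in range(1000)]
    domain = [(0, 0), (0, 1), (1, 0), (1, 1)]
    exact = empirical_marginal(records, ["a", "b"], domain)
    private = noisy_marginal(records, ["a", "b"], domain, 1.0, rng)
    print("TV error of the noisy marginal:", tv_distance(exact, private))
```

Under these assumptions, rerunning the script with a smaller `epsilon` increases the noise scale and, on average, the TV error of the released marginal.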