Di Wang's Publications
(★ my Master/PhD/Intern/Visiting students)
2021
Conference Papers
 Differentially Private Pairwise Learning Revisited. Abstract▼
Instead of learning with pointwise loss functions, learning with pairwise loss functions (pairwise learning) has received much attention recently as it is more capable of modeling the relative relationship between pairs of samples. However, most of the existing algorithms for pairwise learning fail to take into consideration the privacy issue in their design. To address this issue, previous work studied pairwise learning in the Differential Privacy (DP) model. However, their utilities (population errors) are far from optimal. To address the suboptimal utility issue, in this paper, we proposed new $(\epsilon, \delta)$ or $\epsilon$DP algorithms for pairwise learning. Specifically, when the loss functions are Lipschitz, smooth and strongly convex, we show that the output of our algorithm achieves an expected population error of $O(\frac{1}{n}+\frac{d\log\frac{1}{\delta}}{n^2\epsilon^2})$ and $O(\frac{1}{n}+\frac{d^2}{n^2\epsilon^2})$ for $(\epsilon, \delta)$DP and $\epsilon$DP, respectively, where $n$ is the sample size and $d$ is the dimension of the underlying space.
Moreover, for general convex case, the output of our algorithm achieves an expected population error of $O(\frac{1}{\sqrt{n}}+\frac{\sqrt{d \log\frac{1}{\delta}}}{n\epsilon})$ and $O(\frac{1}{\sqrt{n}}+\frac{d}{n\epsilon})$ for $(\epsilon, \delta)$DP and $\epsilon$DP, respectively. It is also notable that these upper bounds are {\bf optimal} ({\em i.e.,} match the lower bounds). We also conduct extensive experiments on realworld datasets to evaluate the proposed algorithms, experimental results support our theoretical analysis and show the priority of our algorithms.
Zhiyu Xue^{*}^{★}, Shaoyang Yang^{*}^{★}, Mengdi Huai and Di Wang. (* equal contribution)
The 30th International Joint Conference on Artificial Intelligence (IJCAI 2021).
 Estimating Smooth GLM in Noninteractive Local Differential Privacy Model with Public Unlabeled Data [Link] Abstract▼
In this paper, we study the problem of estimating smooth Generalized Linear Models (GLM) in the Noninteractive Local Differential Privacy (NLDP) model.
Different from its classical setting, our model allows the server to process
some additional public but unlabeled data.
We first show that there is an $(\epsilon, \delta)$NLDP algorithm for
GLM (under some mild assumptions), if each data record is i.i.d sampled from some subGaussian distribution with bounded $\ell_1$norm.
The sample complexity of both
public and private data, for the algorithm to achieve an $\alpha$ estimation error (in $\ell_\infty$norm), is $\tilde{O}(p^2\alpha^{2}\epsilon^{2})$ if $\alpha$ is not too small (i.e., $\alpha\geq \Omega(\frac{1}{\sqrt{p}})$), where $p$ is the dimensionality of the data. This is a significant improvement over the previously known quasipolynomial (in $\alpha$) or exponential (in $p$) complexity of convex GLM with no public data.
We then extend our idea to the nonlinear regression problem and show
a similar phenomenon for it.
Finally, we demonstrate the practicality of our algorithms through experiments on both synthethic and real world datasets.
To our best knowledge, this is the first paper showing the existence of efficient and practical
algorithms for GLM and nonlinear regression
in the NLDP model with public unlabeled data.
Di Wang^{*}, Huanyu Zhang^{*}, Marco Gaboardi and Jinhui Xu. (* equal contribution)
The 32nd International Conference on Algorithmic Learning Theory (ALT 2021).
Journal Papers
 Inferring Ground Truth From Crowdsourced Data Under Local Attribute Differential Privacy
[Link] Abstract▼
Recently, the problem of ground truth inference under local differential privacy (LDP) model has been recently studied. However, this problem is still not well understood and even some basic questions have not been solved yet. First, it is still unknown what is the average error of the private estimators to the underlying ground truth. Secondly, we do not known whether we can infer the ability of each user under LDP model and what is the estimation error w.r.t the underlying users ability. Finally, previous work only show that their methods have better performance than the private major voting algorithm through experiments. However, there is still no theoretically result which shows this priority formally or mathematically. In this paper, we partially solve these problems by studying the ground truth inference problem under local attribute differential privacy (LADP) model, and propose a new algorithm called private DawidSkene method, which is motivated by the classical DawidSkene method. Specifically, we first provide the estimation errors for both ability of users and the ground truth under some assumptions of the problem if the algorithm start with some appropriate initial vector. Moreover, we propose an explicit instance and show that the estimation error of the ground truth achieved by the private major voting algorithm is always greater than the error achieved by our method. To our best knowledge, this is the first result on showing the explicit estimation errors for both ability of users and ground truth for the problem. Also, this paper is the first result on theoretically comparing with the private major voting algorithm.
Di Wang and Jinhui Xu.
Theoretical Computer Science
Volume 865, 14 April 2021, Pages 8598.
 Differentially Private High Dimensional Sparse Covariance Matrix Estimation [Link] Abstract▼
In this paper, we study the problem of estimating the covariance matrix under differential privacy, where the underlying covariance matrix is assumed to be sparse and of high dimensions. We propose a new method, called DPThresholding, to achieve a nontrivial $\ell_2$norm based error bound,
which is significantly better than the existing ones from adding noise directly to the empirical covariance matrix. We also extend the $\ell_2$norm based error bound to a general $\ell_w$norm based one for any $1\leq w\leq \infty$, and show that they share the same upper bound asymptotically. Our approach can be easily extended to local differential privacy. Experiments on the synthetic datasets show consistent results with our theoretical claims.
Di Wang and Jinhui Xu.
Theoretical Computer Science
Volume 865, 14 April 2021, Pages 119130.
 On Sparse Linear Regression in the Local Differential Privacy Model [Link] Abstract▼
In this paper, we study the sparse linear regression problem in the Local Differential Privacy (LDP) model. We first show that polynomial dependency on the dimensionality $p$ of the space is unavoidable for the estimation error in both noninteractive and sequential interactive local models, if the privacy of the whole dataset needs to be preserved. Similar limitations also exist for other types of error measurements and in the relaxed local models. This indicates that differential privacy in high dimensional space is unlikely achievable for the problem. With the understanding of this limitation, we then present two algorithmic results. The first one is
a sequential interactive LDP algorithm for the low dimensional sparse case, called Locally Differentially Private Iterative Hard Thresholding (LDPIHT), which achieves a near optimal upper bound. This algorithm is actually rather general and can be used to solve quite a few other problems, such as (Local) DPERM with sparsity constraints and sparse regression with nonlinear measurements. The second one is for the restricted (high dimensional) case where only the privacy of the responses (labels) needs to be preserved. For this case,
we show that the optimal rate of the error estimation can be made logarithmically depending on $p$ (i.e., $\log p$) in the local model, where an upper bound is obtained by a labelprivacy version of LDPIHT. Experiments on real world and synthetic datasets confirm our theoretical analysis.
Di Wang and Jinhui Xu.
IEEE Transactions on Information Theory, Volume 67, no. 2, Pages 11821200, Feb. 2021.
2020
Conference Papers

Global Interpretation for Patient Similarity Learning [Link] Abstract▼
As an important family of learning problems, pairwise learning has received much attention in recent years. Since pairwise learning involves pairs of instances in its loss function, it is more capable of modeling the relative relationships between instances compared with traditional pointwise learning (e.g., classification). In practice, many machine learning and data mining tasks can be categorized as pairwise learning. Although pairwise learning has achieved tremendous success in many realworld applications, the lack of transparency behind the behavior of the learned pairwise model impedes users from trusting the predicted results, which hampers its further applications in the real world. To tackle this problem, in this paper, we investigate how to enable interpretation in pairwise learning and propose a global interpretation method for pairwise learning. Based on the proposed global interpretation method, we can identify a minimal sufficient subset of data features that are sufficient in themselves to justify the global predictions made by the pairwise model. The identified minimal sufficient feature subset can help us better understand the overall behaviors of the learned pairwise model across different subpopulations of instance pairs. To the best of our knowledge, this is the first work that provides global interpretation for pairwise learning. We also conduct extensive experiments on realworld datasets to evaluate the performance of the proposed global interpretation method.
Mengdi Huai, Chenglin Miao, Jinduo Liu, Di Wang, Jingyuan Chou and Aidong Zhang.
The 2020 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2020).
Selected as Regular Paper (Acceptance Rate: 19.4%).
 Escaping Saddle Points of Empirical Risk Privately and Scalably via DPTrust Region Method [Link] Abstract▼
It has been shown recently that many nonconvex objective/loss functions in machine learning and deep learning are known to be strict saddle. This means that finding a secondorder stationary point (i.e., approximate local minimum) and thus escaping saddle points are sufficient for such functions to obtain a classifier with good generalization performance. Existing algorithms for escaping saddle points, however, all fail to take into consideration a critical issue in their designs, that is, the protection of sensitive information in the training set. Models learned by such algorithms can often implicitly memorize the details of sensitive information, and thus offer opportunities for malicious parties to infer it from the learned models. In this paper, we investigate the problem of privately escaping saddle points and finding a secondorder stationary point of the empirical risk of nonconvex loss function.
Previous result on this problem is mainly of theoretical importance and has several issues ({e.g., high sample complexity and nonscalable) which hinder its applicability, especially, in big data.
To deal with these issues,
we propose in this paper a new method called Differentially Private Trust Region, and show that it outputs a secondorder stationary point with high probability and less sample complexity, compared to the existing one. Moreover, we also provide a stochastic version of our method (along with some theoretical guarantees) to make it faster and more scalable.
Experiments on benchmark datasets suggest that our methods are indeed more efficient and practical than the previous one.
Di Wang and Jinhui Xu.
The 2020 European Conference on Machine Learning (ECMLPKDD 2020).
 On Differentially Private Stochastic Convex Optimization with Heavytailed Data [Link] Abstract▼
In this paper, we consider the problem of designing Differentially Private (DP) algorithms for Stochastic Convex Optimization (SCO) on heavytailed data. The irregularity of such data violates some key assumptions used in almost all existing DPSCO and DPERM methods, resulting in failure to provide the DP guarantees. To better understand this type of challenges, we provide in this paper a comprehensive study of DPSCO under various settings. First, we consider the case where the loss function is strongly convex and smooth. For this case, we propose a method based on the sampleandaggregate framework, which has an excess population risk of $\tilde{O}(\frac{d^3}{n\epsilon^4})$ (after omitting other factors), where $n$ is the sample size and $d$ is the dimensionality of the data. Then, we show that with some additional assumptions on the loss functions, it is possible to reduce the \textit{expected} excess population risk to $\tilde{O}(\frac{ d^2}{ n\epsilon^2 })$. To lift these additional conditions, we also provide a gradient smoothing and trimming based scheme to achieve excess population risks of $\tilde{O}(\frac{ d^2}{n\epsilon^2})$ and $\tilde{O}(\frac{d^\frac{2}{3}}{(n\epsilon^2)^\frac{1}{3}})$ for strongly convex and general convex loss functions, respectively, \textit{with high probability}. Experiments on both synthetic and realworld datasets suggest that our algorithms can effectively deal with the challenges caused by data irregularity.
Di Wang^{*}, Hanshen Xiao^{*}, Srini Devadas and Jinhui Xu (* equal contribution).
The 37th International Conference on Machine Learning (ICML 2020).
 Scalable Estimating Stochastic Linear Combination of Nonlinear Regressions [Link] Abstract▼
In this paper we study the problem of estimating stochastic linear combination of nonlinear regressions, which has a closed connection with many machine learning and statistical models such as nonlinear regressions, Single Index Models, Multiindex Models, Varying Coefficient Index Models and Twolayer Neural Networks. Specifically, we first show that under some mild assumptions, if the variates are multivariate Gaussian, then there is an algorithm whose outputs have $\ell_2$norm estimation error of $O(\sqrt{\frac{p}{n}})$ with high probability, where $p$ is the dimensionality and $n$ is the size of samples. Moreover, we extend to the bounded subGaussian case by using the zerobias transformation, which could be seen as a generalization of the classical Stein's lemma. We show that with some additional assumptions there is an algorithm whose outputs have $\ell_\infty$norm estimation error of $O(\frac{1}{\sqrt{p}}+\sqrt{\frac{p}{n}})$ with high probability. Finally, for both Gaussian and subGaussian cases we propose a scalable algorithm based on subsampling method and show that when the subsample size is large enough the estimation errors will be almost the same as previous ones. Experimental results for both Gaussian and subGaussian cases support our theoretical results. To our best knowledge, this is the first paper study and provide theoretical guarantees of the stochastic linear combination of nonlinear regressions model.
Di Wang^{*} , Xiangyu Guo^{*}, Chaowen Guan, Shi Li and Jinhui Xu (* equal contribution).
The ThirtyFourth AAAI Conference on Artificial Intelligence (AAAI 2020).
 Pairwise Learning with Differential Privacy Guarantees [Link] Abstract▼
Pairwise learning has received much attention recently as it is more capable of modeling the relative relationship between pairs of samples. Many machine learning tasks can be categorized as pairwise learning, such as AUC maximization and metric learning. Existing techniques for pairwise learning all fail to take into consideration a critical issue in their design, i.e., the protection of sensitive information in the training set. Models learned by such algorithms can implicitly memorize the details of sensitive information, which offers opportunity for malicious parties to infer it from the learned models. To address this challenging issue, in this paper, we propose several differentially private pairwise learning algorithms for both online and offline settings. Specifically, for the online setting, we first introduce a differentially private algorithm (called OnPairStrC) for strongly convex loss functions. Then, we extend this algorithm to general convex loss functions and give another differentially private algorithm (called OnPairC). For the offline setting, we also present two differentially private algorithms (called OffPairStrC and OffPairC) for strongly and general convex loss functions, respectively. These proposed algorithms can not only learn the model effectively from the data but also provide strong privacy protection guarantee for sensitive information in the training set. Extensive experiments on realworld datasets are conducted to evaluate the proposed algorithms and the experimental results support our theoretical analysis.
Mengdi Huai^{*}, Di Wang^{*}, Chenglin Miao, Jinhui Xu and Aidong Zhang (* equal contribution).
ThirtyFourth AAAI Conference on Artificial Intelligence (AAAI 2020).
 Towards Interpretation of Pairwise Learning [Link] Abstract▼
Recently, there are increasingly more attentions paid to an important family of learning problems called pairwise learning, in which the associated loss functions depend on pairs of instances. Despite the tremendous success of pairwise learning in many realworld applications, the lack of transparency behind the learned pairwise models makes it difficult for users to understand how particular decisions are made by these models, which further impedes users from trusting the predicted results. To tackle this problem, in this paper, we study feature importance scoring as a specific approach to the problem of interpreting the predictions of blackbox pairwise models. Specifically, we first propose a novel adaptive Shapleyvaluebased interpretation method, based on which a vector of importance scores associated with the underlying features of a testing instance pair can be adaptively calculated with the consideration of feature correlations, and these scores can be used to indicate which features make key contributions to the final prediction. Considering that Shapleyvaluebased methods are usually computationally challenging, we further propose a novel robust approximation interpretation method for pairwise models. This method is not only much more efficient but also robust to data noise. To the best of our knowledge, we are the first to investigate how to enable interpretation in pairwise learning. Theoretical analysis and extensive experiments demonstrate the effectiveness of the proposed methods.
Mengdi Huai, Di Wang, Chenglin Miao and Aidong Zhang.
The ThirtyFourth AAAI Conference on Artificial Intelligence (AAAI 2020).
Journal Papers
 Empirical Risk Minimization in the Noninteractive Local
Model of Differential Privacy. [Link] Abstract▼
In this paper, we study the Empirical Risk Minimization (ERM) problem in the noninteractive Local Differential Privacy (LDP) model. We first show that if the loss function is $(\infty, T)$smooth, by using the Bernstein polynomial approximation we can avoid a dependency of the sample complexity, to achieve error $\alpha$, on the exponential of the dimensionality $p$ with base $1/\alpha$ (i.e., $\alpha^{p}$).
This answers a question from (Smith et.al., 2017). Then, we propose playerefficient algorithms with $1$bit communication complexity and $O(1)$ computation cost for each player. The error bound of these algorithms is asymptotically the same as the original one. With some additional assumptions, we also give an algorithm which is more efficient for the server. Based on different types of polynomial approximations, we propose (efficient) noninteractive locally differential private algorithms for learning the set of kway marginal queries and the set of smooth queries. Moreover, we study the case of $1$Lipschitz generalized linear convex loss functions and show that there is an $(\epsilon, \delta)$LDP algorithm whose sample complexity for achieving error $\alpha$ is only linear in the dimensionality $p$ and quasipolynomial in other terms. To prove this, we first show that the conclusion holds for the hinge loss function. Then, we extend the result to any $1$Lipschitz generalized linear convex loss functions by showing that every such a function can be approximated by a linear combination of hinge loss functions and some linear functions. Our results use a polynomial of inner product approximation technique. Then we apply our technique to the Euclidean median problem and show that its sample complexity needs only to be quasipolynomial in $p$, which is the first result with a subexponential sample complexity in $p$ for nongeneralized linear loss functions.
Di Wang, Marco Gaboardi, Adam Smith and Jinhui Xu.
Journal of Machine Learning Research, Volume 21, 200 (2020), Pages 139.
 Robust High Dimensional Expectation Maximization Algorithm via Trimmed Hard Thresholding. [Link] Abstract▼
In this paper, we study the problem of estimating latent variable models with arbitrarily corrupted samples in the high dimensional case, i.e., $d\gg n$, where the underlying parameter is sparse. Specifically, we propose a method called Trimmed (Gradient) Expectation Maximization which attaches a trimming gradients and hard thresholding step to the Expectation step (Estep) and Maximization step (Mstep), respectively. Particularly, under some mild assumptions, with an appropriate initialization, we show that the algorithm is corruptionfree and converges to the (near) optimal statistical rate geometrically when the fraction of corruption samples satisfies $\alpha\leq O(\frac{1}{\sqrt{n}})$. Moreover, we implement our general framework to three canonical examples: mixture of Gaussians, mixture of regressions and linear regression with missing covariates. Experiments also support our theoretical analysis.
Di Wang^{*}, Xiangyu Guo^{*}, Shi Li and Jinhui Xu (* equal contribution).
Machine Learning, 109, 22832311 (2020).
 Tight Lower Bound of Locally Differentially Private Sparse Covariance Matrix Estimation. [Link] Abstract▼
In this paper, we study the sparse covariance matrix estimation problem in the local differential privacy model, and give a lower bound of $\Omega(\frac{s^2\log p}{n\epsilon^2})$ on the $\epsilon$ noninteractive private minimax risk in the metric of squared spectral norm, where $s$ is the row sparsity of the underlying covariance matrix, $n$ is the sample size, and $p$ is the dimensionality of the data.
We show that the lower bound is actually tight, as it matches a previous upper bound.Our main technique for achieving this lower bound is a general framework, called General Private Assouad Lemma, which is a considerable generalization of the previous private Assouad lemma and can be used as a general method for bounding the private minimax risk of matrixrelated estimation problems.
Di Wang and Jinhui Xu.
Theoretical Computer Science,
Volume 815, 2 May 2020, Pages 4759.
 Estimating Stochastic Linear Combination of Nonlinear Regressions Efficiently and Scalably. [Link] Abstract▼
Recently, many machine learning and statistical models such as nonlinear regressions, the Single Index, Multiindex, Varying Coefficient Index Models and Twolayer Neural Networks can be reduced to or be seen as a special case of a new model which is called the \textit{Stochastic Linear Combination of Nonlinear Regressions} model. However, due to the high nonconvexity of the problem, there is no previous work study how to estimate the model. In this paper, we provide the first study on how to estimate the model efficiently and scalably. Specifically, we first show that with some mild assumptions, if the variate vector $x$ is multivariate Gaussian, then there is an algorithm whose output vectors have $\ell_2$norm estimation errors of $O(\sqrt{\frac{p}{n}})$ with high probability, where $p$ is the dimension of $x$ and $n$ is the number of samples. The key idea of the proof is based on an observation motived by the Stein's lemma. Then we extend our result to the case where $x$ is bounded and subGaussian using the zerobias transformation, which could be seen as a generalization of the classic Stein's lemma. We also show that with some additional assumptions there is an algorithm whose output vectors have $\ell_\infty$norm estimation errors of $O(\frac{1}{\sqrt{p}}+\sqrt{\frac{p}{n}})$ with high probability. We also provide a concrete example to show that there exists some link function which satisfies the previous assumptions. Finally, for both Gaussian and subGaussian cases we propose a faster subsampling based algorithm and show that when the subsample sizes are large enough then the estimation errors will not be sacrificed by too much. Experiments for both cases support our theoretical results.
To the best of our knowledge, this is the first work that studies and provides theoretical guarantees for the stochastic linear combination of nonlinear regressions model.
Di Wang^{*} , Xiangyu Guo^{*} , Chaowen Guan, Shi Li and Jinhui Xu (* equal contribution).
Neurocomputing, Volume 399, 25 July 2020, Pages 129140.
 Principal Component Analysis in the Local Differential Privacy Model. [Link] Abstract▼
In this paper, we study the Principal Component Analysis (PCA) problem under the (distributed) noninteractive local differential privacy model. For the low dimensional case (i.e., $p \ll n$), we show the optimal rate of $\Theta(\frac{kp}{n\epsilon^2})$ (omitting the eigenvalue terms) for the private minimax risk of the $k$dimensional PCA using the squared subspace distance as the measurement, where $n$ is the sample size and $\epsilon$ is the privacy parameter. For the high dimensional (i.e., $p\gg n$) row sparse case, we first give a lower bound of $\Omega(\frac{ks\log p }{n\epsilon^2})$ on the private minimax risk, where $s$ is the underlying sparsity parameter. Then we provide an efficient algorithm to achieve the upper bound of $O(\frac{s^2\log p}{n\epsilon^2})$. Experiments on both synthetic and real world datasets confirm our theoretical guarantees.
Di Wang and Jinhui Xu.
Theoretical Computer Science, Volume 809, 24 February 2020, Pages 296312.
2019
Conference Papers
 Facility Location Problem in Differential Privacy Model Revisited. [Link] Abstract▼
In this paper we study the uncapacitated facility location problem in the
model of differential privacy (DP) with uniform facility
cost. Specifically, we first show that, under the \emph{hierarchically wellseparated tree (HST) metrics} and the superset output setting that was introduced in (Gupta et al 2010), there is an $\epsilon$DP
algorithm that achieves an $O(\frac{1}{\epsilon})$
(expected multiplicative) approximation ratio; this implies an
$O(\frac{\log n}{\epsilon})$
approximation ratio for the general metric case, where $n$ is the size of the input metric. These bounds improve
the bestknown results given by (Gupta et al 2010). In particular, our approximation ratio for HSTmetrics is independent of $n$, and the ratio for general metrics is independent of the aspect ratio of the input metric. On the negative side, we show that the approximation ratio of any $\epsilon$DP algorithm is lower bounded by $\Omega(\frac{1}{\sqrt{\epsilon}})$, even for instances on HST metrics with uniform facility cost, under the superset output setting. The lower bound shows that the dependence of the approximation ratio for HST metrics on $\epsilon$ can not be removed or greatly improved. Our novel methods and techniques for both the upper and lower bound may find additional applications.
[alphabetic order] Yunus Esencayi, Marco Gaboardi, Shi Li and Di Wang
Conference on Neural Information Processing Systems (NIPS/NeurIPS), 2019.
 Lower Bound of Locally Differentially Private Sparse Covariance Matrix Estimation. [Link] Abstract▼
In this paper, we study the sparse covariance matrix
estimation problem in the local differential privacy
model, and give a nontrivial lower bound on the
noninteractive private minimax risk in the metric
of squared spectral norm. We show that the lower
bound is actually tight, as it matches a previous upper
bound. Our main technique for achieving this
lower bound is a general framework, called General
Private Assouad Lemma, which is a considerable
generalization of the previous private Assouad
lemma and can be used as a general method for
bounding the private minimax risk of matrixrelated
estimation problems.
Di Wang and Jinhui Xu.
The 28th International Joint Conference on Artificial Intelligence (IJCAI 2019).
 Principal Component Analysis in the Local Differential Privacy Model. [Link] Abstract▼
In this paper, we study the Principal Component
Analysis (PCA) problem under the (distributed)
noninteractive local differential privacy model.
For the low dimensional case, we show the optimal
ratefor the private minimax risk of the kdimensional
PCA using the squared subspace distance
as the measurement. For the high dimensional
row sparse case, we first give a lower bound
on the private minimax risk, . Then we provide
an efficient algorithm to achieve a near optimal upper
bound. Experiments on both synthetic and real
world datasets confirm the theoretical guarantees of
our algorithms.
Di Wang and Jinhui Xu .
The 28th International Joint Conference on Artificial Intelligence (IJCAI 2019).
 Privacyaware Synthesizing for Crowdsourced Data. [Link] Abstract▼
The prevalence of the World Wide Web enables the data collectors to easily share their data that are collected from a large crowd of users with the public. Although releasing crowdsourced data on the web brings many benefits to the data analyzers to conduct statistical analysis, it may violate crowd users' data privacy, which has been a major obstacle to release the crowdsourced data. A potential way to address this problem is to employ the traditional differential privacybased mechanisms and perturb the data with some noise before publishing them. However, such noise perturbation mechanisms offer little data utility because crowdsourced datasets contain massive amounts of conflicting data records with large domains. In particular, this degraded utility results from two respects: firstly, the originally collected crowdsourced data usually contain conflicting data; secondly, the noise needed to guarantee differential privacy may be proportional to the number of data records or the domain of the input data, which renders the released crowdsourced data useless. To address the above challenges, we propose a novel privacyaware synthesizing method for crowdsourced data. In this method, the data collectors first learn the underlying distributions of the crowdsourced data through taking each user's fine grained reliability degrees into account. Then, a set of candidate synthetics are sampled from the learned distributions. Finally, these sampled candidate synthetics are subjected to privacy tests, and only those candidate synthetics which pass privacy tests are allowed to be safely released. The proposed method not only provides strong privacy protection for individual users but also generates high utility synthetic data. The desirable performance of the proposed method is verified via theoretical analysis and extensive experiments conducted on both realworld and synthetic datasets.
Mengdi Huai, Di Wang, Chenglin Miao, Jinhui Xu, Aidong Zhang.
The 28th International Joint Conference on Artificial Intelligence (IJCAI 2019).
 Differentially Private Empirical Risk Minimization with Nonconvex Loss Functions. [Link] Abstract▼
We study the problem of Empirical Risk Minimization (ERM) with (smooth) nonconvex loss functions under the differentialprivacy (DP) model. Existing approaches for this problem mainly adopt gradient norms to measure the error, which in general cannot guarantee the quality of the solution. To address this issue,
we first study the expected excess empirical (or population) risk, which was primarily used as the utility to measure the quality for convex loss functions. Specifically, we show that
the excess empirical (or population) risk can be upper bounded by $\tilde{O}(\frac{d\log (1/\delta)}{\log n\epsilon^2})$ in the $(\epsilon, \delta)$DP settings, where $n$ is the data size and $d$ is the dimensionality of the space.
The $\frac{1}{\log n}$ term in the empirical risk bound can be further improved to $\frac{1}{n^{\Omega(1)}}$ (when $d$ is a constant) by a highly nontrivial analysis on the timeaverage error.
To obtain more efficient solutions, we also consider the connection between achieving differential privacy and finding approximate local minimum.
Particularly, we show that when the size $n$ is large enough, there are $(\epsilon, \delta)$DP algorithms which can find an approximate local minimum of the empirical risk with high probability in both the constrained and nonconstrained settings.
These results indicate that one can escape saddle points privately.
Di Wang, Changyou Chen and Jinhui Xu.
The 36th International Conference on Machine Learning (ICML 2019).
 On Sparse Linear Regression in the Local Differential Privacy Model. [Link] Abstract▼
In this paper, we study the sparse linear regression problem under the Local Differential Privacy (LDP) model. We first show that polynomial dependency on the dimensionality $p$ of the space is unavoidable for the estimation error in both noninteractive and sequential interactive local models, if the privacy of the whole dataset needs to be preserved. Similar limitations also exist for other types of error measurements and in the relaxed local models. This indicates that differential privacy in high dimensional space is unlikely achievable for the problem. With the understanding of this limitation, we then present two algorithmic results. The first one is
a sequential interactive LDP algorithm for the low dimensional sparse case, called Locally Differentially Private Iterative Hard Thresholding (LDPIHT), which achieves a near optimal upper bound. This algorithm is actually rather general and can be used to solve quite a few other problems, such as (Local) DPERM with sparsity constraints and sparse regression with nonlinear measurements. The second one is for the restricted (high dimensional) case where only the privacy of the responses (labels) needs to be preserved. For this case,
we show that the optimal rate of the error estimation can be made logarithmically depending on $p$ (i.e., $\log p$) in the local model,
where an upper bound is obtained by a labelprivacy version of LDPIHT. Experiments on real world and synthetic datasets confirm our theoretical analysis.
Di Wang and Jinhui Xu.
The 36th International Conference on Machine Learning (ICML 2019).
Selected as Long Talk(Acceptance Rate: 140/3424= 4.1%) .
 Estimating Sparse Covariance Matrix Under Differential Privacy via Thresholding. [Link] Abstract▼
In this paper, we study the problem of estimating the covariance matrix under differential privacy, where the underlying covariance matrix is assumed to be sparse and of high dimensions. We propose a new method, called DPThresholding, to achieve a nontrivial $\ell_2$norm based error bound,
which is significantly better than the existing ones from adding noise directly to the empirical covariance matrix. Experiments on the synthetic datasets show consistent results with our theoretical claims.
Di Wang, Jinhui Xu and Yang He.
The 53rd Annual Conference on Information Sciences and Systems (CISS 2019).
 Noninteractive Locally Private Learning of Linear Models via Polynomial Approximations. [Link] Abstract▼
In this paper, we study the Empirical Risk Minimization problem in the noninteractive Local Differential Privacy (LDP) model. First, we show that for the hinge loss function, there is an $(\epsilon, \delta)$LDP algorithm whose sample complexity for achieving an error of $\alpha$ is only linear in the dimensionality $p$ and quasipolynomial in other terms. Then, we extend the result to any $1$Lipschitz generalized linear convex loss functions by showing that every such function can be approximated by a linear combination of hinge loss functions and some linear functions. Finally, we apply our technique to the Euclidean median problem and show that its sample complexity needs only to be quasipolynomial in $p$, which is the first result with a subexponential sample complexity in $p$ for nongeneralized linear loss functions. Our results are based on a technique, called polynomial of inner product approximation, which may be applicable to other problems.
Di Wang, Adam Smith and Jinhui Xu.
The 30th International Conference on Algorithmic Learning Theory (ALT 2019).
 Differentially Private Empirical Risk Minimization with Smooth Nonconvex Loss Functions: A Nonstationary View. [Link] Abstract▼
In this paper, we study the Differentially Private Empirical Risk Minimization (DPERM) problem with nonconvex loss functions and give several upper bounds for the utility in different settings. We first consider the problem in lowdimensional space. For DPERM with nonsmooth regularizer, we generalize an existing work by measuring the utility using $\ell_2$ norm of the projected gradient. Also, we extend the error bound measurement, for the first time, from empirical risk to population risk by using the expected $\ell_2$ norm of the gradient. We then investigate the problem in high dimensional space, and show that by measuring the utility with FrankWolfe gap, it is possible to bound the utility by the Gaussian Width of the constraint set, instead of the dimensionality $p$ of the underlying space. We further demonstrate that the advantages of this result can be achieved by the measure of $\ell_2$ norm of the projected gradient. A somewhat surprising discovery is that although the two kinds of measurements are quite different, their induced utility upper bounds are asymptotically the same under some assumptions. We also show that the utility of some special nonconvex loss functions can be reduced to a level (i.e., depending only on $\log p$) similar to that of convex loss functions. Finally, we test our proposed algorithms on both synthetic and real world datasets and the experimental results confirm our theoretical analysis.
Di Wang and Jinhui Xu.
The ThirtyThird AAAI Conference on Artificial Intelligence (AAAI 2019).
Selected as Oral Presentation (Acceptance Rate: 460/7095=6.5%).
Journal Papers
 Faster Large Scale Constrained Linear Regression via TwoStep Preconditioning. [Link] Abstract▼
In this paper, we study the large scale constrained linear regression problem and propose a twostep preconditioning method, which is based on some recent developments on random projection, sketching techniques and convex optimization methods. Combining the method with (accelerated) minibatch SGD, we can achieve an approximate solution with a time complexity lower than that of
the stateoftheart techniques for the low precision case. Our idea can also be extended to the high precision case, which gives an alternative implementation to the Iterative Hessian Sketch (IHS) method with significantly improved time complexity.
Experiments on benchmark and synthetic datasets suggest that our methods indeed outperform existing ones considerably in both the low and high precision cases.
Di Wang and Jinhui Xu.
Neurocomputing, Volume 364, 28 October 2019, Pages 280296.
2018
Conference Papers
 Empirical Risk Minimization in Noninteractive Local Differential Privacy Revisited. [Link]
Abstract▼
In this paper, we revisit the Empirical Risk Minimization problem in the noninteractive local model of differential privacy. In the case of constant or low dimensions ($p\ll n$), we first show that if the loss function is $(\infty, T)$smooth, we can avoid a dependence of the sample complexity, to achieve error $\alpha$, on the exponential of the dimensionality $p$ with base $1/\alpha$ (i.e., $\alpha^{p}$),
which answers a question in (Smith et al 2017). Our approach is based on polynomial approximation. Then, we propose playerefficient algorithms with $1$bit communication complexity and $O(1)$ computation cost for each player. The error bound is asymptotically the same as the original one. With some additional assumptions, we also give an efficient algorithm for the server.
In the case of high dimensions ($n\ll p$),
we show that if the loss function is a convex generalized linear function, the error can be bounded by using the Gaussian width of the constrained set, instead of $p$, which improves the one in
Smith et al.
Our techniques can be extended to some related problems, such as $k$way marginal queries and smooth queries.
Di Wang, Marco Gaboardi and Jinhui Xu.
Conference on Neural Information Processing Systems (NIPS/NeurIPS), 2018.
 Differentially Private Sparse Inverse Covariance Estimation. [Link] Abstract▼
In this paper, we give the first study of sparse inverse covariance estimation problem under differential privacy. Firstly, we propose an $\epsilon$differentially private algorithm via output perturbation, which is based on the sensitivity of the optimization problem and Wishart mechanism. Based on the idea of that, we propose a general covariance perturbation method, and then for $\epsilon$differential privacy, we analyze Laplacian and Wishart mechanisms, for $(\epsilon,\delta)$differential privacy we analyze Gaussian and Wishart mechanisms. Moreover, we extend the covariance perturbation algorithm to distributed setting and local differential privacy. Experiments on synthetic and benchmark datasets are also support our theoretical analysis.
Di Wang, Mengdi Huai and Jinhui Xu.
2018 6th IEEE Global Conference on Signal and Information Processing (2018 GlobalSip).
Selected as Oral Presentation.
 Large Scale Constrained Linear Regression Revisited: Faster Algorithms via Preconditioning. [Link] Abstract▼
In this paper, we revisit the largescale constrained linear regression problem and propose faster methods based on some recent developments in sketching and optimization.
Our algorithm combines minibatch SGD with a new method called twostep preconditioning to achieve an $\epsilon$accuracy solution for the low precision case, and has a lower time complexity than the stateoftheart techniques. Our idea can also be extended to the high precision case, which gives an alternative implementation to the Iterative Hessian Sketch (IHS) method with significantly improved time complexity.
Experiments on benchmark and synthetic datasets suggest that our methods indeed outperform existing ones considerably in both the low and high precision cases.
Di Wang and Jinhui Xu.
The ThirtySecond AAAI Conference on Artificial Intelligence (AAAI 2018).
Selected as Oral Presentation (Acceptance Rate: 411/3800=10.8%).
2017
Conference Papers
 Differentially Private Empirical Risk Minimization Revisited: Faster and More General. [Link] Abstract▼
In this paper we study the differentially private Empirical Risk Minimization (ERM) problem in different settings. For smooth (strongly) convex loss function with or without (non)smooth regularization, we give algorithms that achieve either optimal or near optimal utility bounds with less gradient complexity compared with previous work. For ERM with smooth convex loss function in highdimensional ($p\gg n$) setting, we give an algorithm which achieves the upper bound with less gradient complexity than previous ones. At last, we generalize the expected excess empirical risk from convex loss functions to nonconvex ones satisfying the PolyakLojasiewicz condition and give a tighter upper bound on
the utility than the one in (Zhang et al 2017).
Di Wang, Minwei Ye and Jinhui Xu.
Conference on Neural
Information Processing Systems (NIPS/NeurIPS), 2017.
Workshop Papers
 Estimating Smooth GLM in Noninteractive Local Differential Privacy Model with Public Unlabeled Data. [Link] Abstract▼
In this paper, we study the problem of estimating smooth Generalized Linear Models (GLM) in the Noninteractive Local Differential Privacy (NLDP) model.
Different from its classical setting, our model allows the server to process
some additional public but unlabeled data.
We first show that
there is an $(\epsilon, \delta)$NLDP algorithm for
GLM (under some mild assumptions), if each data record is i.i.d sampled from some subGaussian distribution with bounded $\ell_1$norm.
The sample complexity of both
public and private data, for the algorithm to achieve an $\alpha$ estimation error (in $\ell_\infty$norm), is $\tilde{O}(p^2\alpha^{2}\epsilon^{2})$ if $\alpha$ is not too small
(i.e., $\alpha\geq \Omega(\frac{1}{\sqrt{p}})$), where $p$ is the dimensionality of the data. This is a significant improvement over the previously known quasipolynomial (in $\alpha$) or exponential (in $p$) complexity of convex GLM with no public data.
We then extend our idea to the nonlinear regression problem and show
a similar phenomenon for it.
Finally, we demonstrate the practicality of our algorithms through experiments on both synthethic and real world datasets.
To our best knowledge, this is the first paper showing the existence of efficient and practical
algorithms for GLM and nonlinear regression
in the NLDP model with public unlabeled data.
Di Wang^{*}, Huanyu Zhang^{*}, Marco Gaboardi and Jinhui Xu. (* equal contribution)
NeurIPS 2019 Workshop on Privacy in Machine Learning.
 High Dimensional Sparse Linear Regression under Local Differential Privacy: Power and Limitations. [Link] Abstract▼
In this paper, we study high dimensional sparse linear regression under the Local Differential Privacy (LDP) model, and give both negative and positive results. On the negative side, we show that polynomial dependency on the dimensionality $p$ of the space is unavoidable in the estimation error under the noninteractive local model, if the privacy of the whole dataset needs to be preserved. Similar limitations also exist for other types of error measurements and in the (sequential) interactive local or relaxed local models. This indicates that differential privacy in high dimensional space is unlikely achievable for the problem. On the positive side, we show that the optimal rate of the error estimation can be made logarithmically depending on $p$ (i.e., $\log p$) under the local model, if only the privacy of the responses (labels) is to be preserved, where the upper bound is obtained by a new method called Differentially Private Iterative Hard Thresholding (DPIHT), which is interesting in its own right. Our result can also be extended to some nonlinear monotone measurement models while keeping the responses locally private.
Di Wang, Adam Smith and Jinhui Xu.
NeurIPS 2018 Workshop on Privacy Preserving Machine Learning.