Jian Li's Publications

Machine Learning Theory (Online learning, Deep Learning Theory, Optimization, Unsupervised learning and clustering algorithms)

Efficient Algorithms for Sparse Moment Problems without Separation. Zhiyuan Fan, Jian Li. The 36th Annual Conference on Learning Theory (COLT 2023) [ArXiv]

We consider the sparse moment problem of learning a k-spike mixture in high dimensional space from its noisy moment information in any dimension. We measure the accuracy of the learned mixtures using transportation distance. Previous algorithms either assume certain separation assumptions, use more recovery moments, or run in (super) exponential time. Our algorithm for the 1-dimension problem (also called the sparse Hausdorff moment problem) is a robust version of the classic Prony's method, and our contribution mainly lies in the analysis. We adopt a global and much tighter analysis than previous work (which analyzes the perturbation of the intermediate results of Prony's method). A useful technical ingredient is a connection between the linear system defined by the Vandermonde matrix and the Schur polynomial, which allows us to provide tight perturbation bound independent of the separation and may be useful in other contexts. To tackle the high dimensional problem, we first solve the 2-dimensional problem by extending the 1-dimension algorithm and analysis to complex numbers. Our algorithm for the high dimensional case determines the coordinates of each spike by aligning a 1-d projection of the mixture to a random vector and a set of 2d-projections of the mixture. Our results have applications to learning topic models and Gaussian mixtures, implying improved sample complexity results or running time over prior work.

Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability. Zhouzi Li, Zixuan Wang, Jian Li. Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2022). [ArXiv]

Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural networks trained by full-batch gradient descent typically enter a regime called Edge of Stability (EOS). In this regime, the sharpness, i.e., the maximum Hessian eigenvalue, first increases to the value 2/(step size) (the progressive sharpening phase) and then oscillates around this value (the EOS phase). This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory. Our analysis naturally divides the GD trajectory into four phases depending on the change of the sharpness. We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics. Based on this empirical observation, we attempt to theoretically and empirically explain the dynamics of various key quantities that lead to the change of sharpness in each phase of EOS. Moreover, based on certain assumptions, we provide a theoretical proof of the sharpness behavior in EOS regime in two-layer fully-connected linear neural networks. We also discuss some other empirical findings and the limitation of our theoretical results.

Generalization Bounds for Gradient Methods via Discrete and Continuous Prior. Jian Li, Xuanyuan Luo. Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2022). [ArXiv]

Proving algorithm-dependent generalization error bounds for gradient-type optimization methods has attracted significant attention recently in learning theory. However, most existing trajectory-based analyses require either restrictive assumptions on the learning rate (e.g., fast decreasing learning rate), or continuous injected noise (such as the Gaussian noise in Langevin dynamics). In this paper, we introduce a new discrete data-dependent prior to the PAC-Bayesian framework, and prove a high probability generalization bound for Floored GD (i.e. a finite precision version of gradient descent). We remark that our bound holds for nonconvex and nonsmooth scenarios. Moreover, our theoretical results provide numerically favorable upper bounds of testing errors (e.g., 0.037 on MNIST). Using a similar technique, we can also obtain new generalization bounds for certain variants of SGD. Furthermore, we study the generalization bounds for gradient Langevin Dynamics (GLD). Using the same framework with a carefully constructed continuous prior, we show a new high probability tighter generalization bound for GLD. The new faster rate is due to the concentration of the difference between the gradient of training samples and that of the prior.

Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization. Zhize Li, Jian Li. Journal of Machine Learning Research (JMLR), 2022 (accepted). [ArXiv]

We propose and analyze several stochastic gradient algorithms for finding stationary points or local minimum in nonconvex, possibly with nonsmooth regularizer, finite-sum and online optimization problems. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction called ProxSVRG+. We provide a clean and tight analysis of ProxSVRG+, which shows that it outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, hence solves an open problem proposed in~\citet{reddi2016proximal}. Also, ProxSVRG+ uses much less proximal oracle calls than ProxSVRG (Reddi et al. 2016) and extends to the online setting by avoiding full gradient computations. Then, we further propose an optimal algorithm, called SSRGD, based on SARAH (Nguyen et al. 2017) and show that SSRGD further improves the gradient complexity of ProxSVRG+ and achieves the the optimal upper bound, matching the known lower bound. Moreover, we show that both ProxSVRG+ and SSRGD enjoy automatic adaptation with local structure of the objective function such as the Polyak-\L ojasiewicz (PL) condition for nonconvex functions in the finite-sum case, i.e., we prove that both of them can automatically switch to faster global linear convergence without any restart performed in prior work ProxSVRG. Finally, we focus on the more challenging problem of finding an $(\epsilon, \delta)$-local minimum instead of just finding an $\epsilon$-approximate (first-order) stationary point (which may be some bad unstable saddle points). We show that SSRGD can find an $(\epsilon, \delta)$-local minimum by simply adding some random perturbations. Our algorithm is almost as simple as its counterpart for finding stationary points, and achieves similar optimal rates.

Simple Combinatorial Algorithms for Combinatorial Bandits: Corruptions and Approximations. Haike Xu, Jian Li. Uncertainty in Artificial Intelligence (UAI 2021). [paper]

We consider the stochastic combinatorial semi-bandit problem with adversarial corruptions.

Improved Algorithms for Convex-Concave Minimax Optimization, Yuanhao Wang, Jian Li. 2020 Conference on Neural Information Processing Systems (NeurIPS 2020). [ArXiv]

This paper studies minimax optimization problems min_x max_y f(x,y), where f(x,y) is mx-strongly convex with respect to x, my-strongly concave with respect to y and (Lx,Lxy,Ly)-smooth. This paper proposes a new algorithm with better gradient complexity upper bound, which improves over the best known upper bound by Lin et al. Our bound achieves linear convergence rate and tighter dependency on condition numbers, especially when Lxy<L (i.e., when the interaction between x and y is weak). Via reduction, our new bound also implies improved bounds for strongly convex-concave and convex-concave minimax optimization problems. When f is quadratic, we can further improve the upper bound, which matches the lower bound up to a small sub-polynomial factor.

A Fast Anderson-Chebyshev Acceleration for Nonlinear Optimization. Zhize Li, Jian Li. The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020) [ArXiv]

Anderson acceleration (or Anderson mixing) is an efficient acceleration method for fixed point iterations in numerical analysis. It is known that Anderson acceleration is quite efficient in practice and can be viewed as an extension of Krylov subspace methods for nonlinear problems. In this paper, we show that Anderson acceleration with Chebyshev polynomial can achieve the optimal convergence rate O(\kappa\sqrt{ln1/eps}) for quadratic functions. Moreover, we provide a convergence analysis for minimizing general nonlinear problems. The experimental results demonstrate that the proposed Anderson-Chebyshev acceleration method converges significantly faster than other algorithms, e.g., vanilla gradient descent (GD), Nesterov's Accelerated GD.

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. Kaifeng Lyu, Jian Li. 2020 International Conference on Learning Representations (ICLR2020, Oral) [ArXiv]

Recent works on implicit regularization have shown that gradient descent converges to the max-margin direction for logistic regression with one-layer or multi-layer linear networks. In this paper, we generalize this result to homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient flow (gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Furthermore, we extend the above results to a large family of loss functions. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. For gradient descent with constant learning rate, we observe that the normalized margin indeed keeps increasing after the dataset is fitted, but the speed is very slow. However, if we schedule the learning rate more carefully, we can observe a more rapid growth of the normalized margin. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.

On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning. Jian Li, Xuanyuan Luo, Mingda Qiao. 2020 International Conference on Learning Representations (ICLR2020) [ArXiv]

Generalization error (also known as the out-of-sample error) measures how well the hypothesis obtained from the training data can generalize to previously unseen data. Obtaining tight generalization error bounds is central to statistical learning theory. In this paper, we study the generalization error bound in learning general non-convex objectives, which has attracted significant attention in recent years. In particular, we study the (algorithm-dependent) generalization bounds of various iterative gradient based methods.

(1)We develop a new framework, termed Bayes-Stability, for proving algorithm-dependent generalization error bounds. The new framework combines ideas from both the PAC-Bayesian theory and the notion of algorithmic stability. Applying the Bayes-Stability method, we obtain new data-dependent generalization bounds for stochastic gradient Langevin dynamics (SGLD) and several other noisy gradient methods (e.g., with momentum, mini-batch and acceleration, Entropy-SGD). Our result recovers (and is typically tighter than) a recent result in Mou et al. (2018) and improves upon the results in Pensia et al. (2018). Our experiments demonstrate that our data-dependent bounds can distinguish randomly labelled data from normal data, which provides an explanation to the intriguing phenomena observed in Zhang et al. (2017a).

(2) We also study the setting where the total loss is the sum of a bounded loss and an additional \ell_2 regularization term. We obtain new generalization bounds for the continuous Langevin dynamic in this setting by developing a new Log-Sobolev inequality for the parameter distribution at any time. Our new bounds are more desirable when the noisy level of the process is not small, and do not become vacuous even when T tends to infinity.

Stochastic Gradient Hamiltonian Monte Carlo with Variance Reduction for Bayesian Inference. Zhize Li, Tianyi Zhang, Shuyu Cheng, Jun Zhu, Jian Li. Machine Learning, 2019. [ArXiv]

Gradient-based Monte Carlo sampling algorithms, like Langevin dynamics and Hamiltonian Monte Carlo, are important methods for Bayesian inference. In large-scale settings, full-gradients are not affordable and thus stochastic gradients evaluated on mini-batches are used as a replacement. In order to reduce the high variance of noisy stochastic gradients, Dubey et al. [2016] applied the standard variance reduction technique on stochastic gradient Langevin dynamics and obtained both theoretical and experimental improvements. In this paper, we apply the variance reduction tricks on Hamiltonian Monte Carlo and achieve better theoretical convergence results compared with the variance-reduced Langevin dynamics. Moreover, we apply the symmetric splitting scheme in our variance-reduced Hamiltonian Monte Carlo algorithms to further improve the theoretical results. The experimental results are also consistent with the theoretical results. As our experiment shows, variance-reduced Hamiltonian Monte Carlo demonstrates better performance than variance-reduced Langevin dynamics in Bayesian regression and classification tasks on real-world datasets.

A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization. Zhize Li, Jian Li. Thirty-second Conference on Neural Information Processing Systems (NeurIPS 2018 spotlight) [ArXiv]

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. The algorithm is a slight variant of the ProxSVRG algorithm [Reddi et al. NIPS2016]. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations), and improves/generalizes some others. In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei at al. NIPS2017] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in \cite{reddi2016proximal}. Finally, for nonconvex functions satisfied Polyak-\L{}ojasiewicz condition, we show that ProxSVRG+ achieves global linear convergence rate without restart. ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA, and sometimes outperforms them (and generalizes the results of SCSG) in this case.

eps-Coresets for Clustering (with Outliers) in Doubling Metrics. Lingxiao Huang, Shaofeng H.-C. Jiang, Jian Li, Xuan Wu. The 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2018) [Full version in ArXiv]

We present the first efficient algorithm that constructs an eps-coreset for the a general class of clustering problems in a doubling metric. To this end, we establish the first relation between the doubling dimension of and the shattering dimension (or VC-dimension) of the range space induced by the distance. Such a relation is not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1+ -eps)-distortion of the distance function $d$ (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded. We also introduce the notion of $\tau$-error {\em probabilistic shattering dimension}, and prove a (drastically better) upper bound for the probabilistic shattering dimension for weighted doubling metrics. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications. Furthermore, we study robust coresets for (k,z)-clustering with outliers in a doubling metric. We show an improved connection between $\alpha$-approximation and robust coreset. This also leads to improvement upon the previous best known bound of the size of robust coreset for Euclidean space [Feldman and Langberg, STOC 11]. The new bound entails a few new results in clustering and property testing. As another application, we show constant-sized (\eps, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were only known for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.

Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection. Lijie Chen, Jian Li, Mingda Qiao. The 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). [full version in ArXiv]

In the Best-k-Arm problem, we are given n stochastic bandit arms, each associated with an unknown reward distribution. We are required to identify the k arms with the largest means by taking as few samples as possible. In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-k-Arm problem. On the lower bound side, we obtain a novel complexity term to measure the sample complexity that every Best-k-Arm instance requires. This is derived by an interesting and nontrivial reduction from the Best-1-Arm problem. We also provide an elimination based algorithm that matches the instancewise lower bound within doubly-logarithmic factors. The sample complexity of our algorithm is strictly better than the state-of-the-art for Best-k-Arm (module constant factors).

Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration. Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, Ruosong Wang. In the 30th Annual Conference on Learning Theory (COLT 2017) [Paper]

We study the combinatorial pure exploration problem CPE in a stochastic multi-armed bandit game. In a CPE instance, we are given $n$ stochastic arms with unknown reward distributions, as well as a family $\subsetfam$ of feasible subsets over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify the feasible subset in $\subsetfam$ with the maximum total weight, using as few samples as possible. The problem generalizes the classical best arm identification problem and the top-$k$ arm identification problem, both of which have attracted significant attention in recent years. We provide a novel \emph{instance-wise} lower bound for the sample complexity of the problem, as well as a nontrivial sampling algorithm, matching the lower bound up to a factor of $\ln|\subsetfam|$. For an important class of combinatorial families (including spanning trees, matchings, and path constraints), we also provide polynomial time implementation of the sampling algorithm, using the equivalence of separation and optimization for convex program, and the notion of approximate Pareto curves in multi-objective optimization (note that $|\subsetfam|$ can be exponential in $n$). We also show that the $\ln|\subsetfam|$ factor is inevitable in general, through a nontrivial lower bound construction utilizing a combinatorial structure resembling the Nisan-Wigderson design. Our results significantly improve several previous results for several important combinatorial constraints, and provide a tighter understanding of the general \combibandit problem. We further introduce an even more general problem, formulated in geometric terms. We are given $n$ Gaussian arms with unknown means and unit variance. Consider the $n$-dimensional Euclidean space $\mathbb{R}^n$, and a collection $\anssetcol$ of disjoint subsets. Our goal is to determine the subset in $\anssetcol$ that contains the mean profile (which is the $n$-dimensional vector of the means), using as few samples as possible. The problem generalizes most pure exploration bandit problems studied in the literature. We provide the first nearly optimal sample complexity upper and lower bounds for the problem.

Towards Instance Optimal Bounds for Best Arm Identification. Lijie Chen, Jian Li, Mingda Qiao. In the 30th Annual Conference on Learning Theory (COLT 2017) [ArXiv]

We study the best arm identification (BEST-1-ARM) problem, which is defined as follows. We are given n stochastic bandit arms. The ith arm has a reward distribution Di with an unknown mean \mu_i. Upon each play of the ith arm, we can get a reward, sampled i.i.d. from Di. We would like to identify the arm with largest mean with probability at least 1-\delta, using as few samples as possible. We obtain nearly instance optimal sample complexity for best arm, thereby making significant progress towards a complete resolution of the ( gap-entropy conjecture), from both the upper and lower bound sides. The paper is a followup of our early paper: On the Optimal Sample Complexity for Best Arm Identification. Lijie Chen, Jian Li. [ArXiv] The gap entropy conjecture - concerning the instance optimality of the best-1-arm problem. See COLT16 open problem

Combinatorial Multi-Armed Bandit with General Reward Functions, Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, Pinyan Lu. Neural Information Processing Systems (NIPS), 2016.

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that if the underlying variables have known finite supports, SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. For general arbitrary distributions, we further use a discretization technique and show an $\tilde{O}(\sqrt{T})$ regret bound. We apply our results to the $K$-MAX problem and the expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon > 0$.

K-Means Clustering with Distributed Dimension. Hu Ding, Yu Liu, Lingxiao Huang, Jian Li. The 33rd International Conference on Machine Learning (ICML 2016).

Distributed clustering has attracted significant attention in recent years. In this paper, we study the k-means problem in the distributed dimension setting, where the dimensions of the data are partitioned across multiple machines. We provide new approximation algorithms, which incur low communication costs and achieve constant approximation factors. The communication complexity of our algorithms significantly improve on existing algorithms. We also provide the first communication lower bound, which nearly matches our upper bound in a wide range of parameter setting. Our experimental results show that our algorithms outperform existing algorithms on real data sets in the distributed dimension setting.

Pure Exploration of Multi-armed Bandit Under Matroid Constraints. Lijie Chen, Anupum Gupta, Jian Li. Conference on Learning Theory (COLT 2016). [Full version]

We study the pure exploration problem subject to a matroid constraint (BEST-BASIS) in a stochastic multi-armed bandit game. In a BEST-BASIS instance, we are given n stochastic arms with unknown reward distributions, as well as a matroid M over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify a basis of M with the maximum total weight, using as few samples as possible. The problem is a significant generalization of the best arm identification problem and the top-k arm identification problem, which have attracted significant attentions in recent years. We study both the exact and PAC versions of BEST-BASIS, and provide algorithms with nearly-optimal sample complexities for these versions. Our results generalize and/or improve on several previous results for the top-k arm identification problem and the combinatorial pure exploration problem (when the combinatorial constraint is a matroid).

On Top-k Selection in Multi-Armed Bandits and Hidden Bipartite Graphs. Wei Cao, Jian Li, Zhize Li, Yufei Tao. Neural Information Processing Systems (NIPS), 2015.

This paper discusses how to efficiently choose from n unknown distributions the k ones whose means are the greatest by a certain metric, up to a small relative error. We study the topic under two standard settings—multi-armed bandits and hidden bipartite graphs. For both settings, we prove lower bounds on the total number of samples needed, and propose optimal algorithms whose sample complexities match those lower bounds.

Stochastic Online Greedy Learning with Semi-bandit Feedbacks. Tian Lin, Jian Li, Wei Chen. Neural Information Processing Systems (NIPS), 2015.

We study the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the greedy regret and eps-quasi greedy regret as learning metrics comparing with the performance of offline greedy algorithm. We propose two online greedy learning algorithms with semi-bandit feedbacks, which achieve O(log T) problem-dependent regret bound for a general class of combinatorial structures and reward functions that allow greedy solutions.

Learning Arbitrary Statistical Mixtures of Discrete Distributions. Jian Li, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy, In ACM Symposium on the Theory of Computing (STOC 2015)[ArXiv]

We study the problem of learning from unlabeled samples very general statistical mixture models on large finite sets. Specifically, the model to be learned, \mix, is a probability distribution over probability distributions p, where each such p is a probability distribution over [n] ={1,2,...,n}. When we sample from \mix, we do not observe p directly, but only indirectly and in very noisy fashion, by sampling from [n] repeatedly, independently K times from the distribution p. The problem is to infer \mix to high accuracy in transportation (earthmover) distance. We give the first efficient algorithms for learning this mixture model without making any restricting assumptions on the structure of the distribution. We bound the quality of the solution as a function of the size of the samples K and the number of samples used. Our model and results have applications to a variety of unsupervised learning scenarios, including learning topic models and collaborative filtering.

Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing, Yuan Zhou, Xi Chen, Jian Li. International Conference on Machine Learning (ICML 2014). [full version]

We study the problem of selecting K arms with the highest expected rewards in a stochastic N-armed bandit game. We propose a new PAC algorithm, which, with probability at least 1-delta , identifies a set of K arms with average reward at most \epsilon away from the top-k arm. Naive uniform sampling requires O(nlnn) samples. We show it is possible to achieve linear sample complexity. We show this is also a lower bound (meaning our upper bound is worse-case optimal).

AI+Finance+Time Series

Towards Generalizable Reinforcement Learning for Trade Execution. Chuheng Zhang, Yitong Duan, Xiaoyu Chen, Jianyu Chen, Jian Li, Li Zhao. The 32th International Joint Conference on Artificial Intelligence (IJCAI 2023)

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features. 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3\% data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting.

OpenFE: Automated Feature Generation with Expert-level Performance. Tianping Zhang, Zheyu Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, Jian Li. The 40th International Conference on Machine Learning (ICML 2023) [Github]

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features. 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3\% data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting.

AEC-GAN: Adversarial Error Correction GANs for Auto-Regressive Long Time-series Generation. Lei Wang, Liang Zeng, Jian Li. The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023) [Paper] [ Show Abstract ]

Large-scale high-quality data is critical for training modern deep neural networks. However, data acquisition can be costly or time-consuming for many time-series applications, thus researchers turn to generative models for generating synthetic time-series data. In particular, recent generative adversarial networks (GANs) have achieved remarkable success in time-series generation. Despite their success, existing GAN models typically generate the sequences in an auto-regressive manner, and we empirically observe that they suffer from severe distribution shifts and bias amplification, especially when generating long sequences. To resolve this problem, we propose Adversarial Error Correction GAN (AEC-GAN), which is capable of dynamically correcting the bias in the past generated data to alleviate the risk of distribution shifts and thus can generate high-quality long sequences. AEC-GAN contains two main innovations: (1) We develop an error correction module to mitigate the bias. In the training phase, we adversarially perturb the realistic time-series data and then optimize this module to reconstruct the original data. In the generation phase, this module can act as an efficient regulator to detect and mitigate the bias. (2) We propose an augmentation method to facilitate GAN's training by introducing adversarial examples. Thus, AEC-GAN can generate high-quality sequences of arbitrary lengths, and the synthetic data can be readily applied to downstream tasks to boost their performance. We conduct extensive experiments on six widely used datasets and three state-of-the-art time-series forecasting models to evaluate the quality of our synthetic time-series data in different lengths and downstream tasks. Both the qualitative and quantitative experimental results demonstrate the superior performance of AEC-GAN over other deep generative models for time-series generation.

Less Is More: Fast Multivariate Time Series Forecasting with Light Sampling-oriented MLP Structures. Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, Jian Li [ArXiv]

Multivariate time series forecasting has seen widely ranging applications in various domains, including finance, traffic, energy, and healthcare. To capture the sophisticated temporal patterns, plenty of research studies designed complex neural network architectures based on many variants of RNNs, GNNs, and Transformers. However, complex models are often computationally expensive and thus face a severe challenge in training and inference efficiency when applied to large-scale real-world datasets. In this paper, we introduce LightTS, a light deep learning architecture merely based on simple MLP-based structures. The key idea of LightTS is to apply an MLP-based structure on top of two delicate down-sampling strategies, including interval sampling and continuous sampling, inspired by a crucial fact that down-sampling time series often preserves the majority of its information. We conduct extensive experiments on eight widely used benchmark datasets. Compared with the existing state-of-the-art methods, LightTS demonstrates better performance on five of them and comparable performance on the rest. Moreover, LightTS is highly efficient. It uses less than 5% FLOPS compared with previous SOTA methods on the largest benchmark dataset. In addition, LightTS is robust and has a much smaller variance in forecasting accuracy than previous SOTA methods in long sequence forecasting tasks.

DeepScalper: A Risk-Aware Reinforcement Learning Framework to Capture Fleeting Intraday Trading Opportunities. Shuo Sun, Rundong Wang, Wanqi Xue, Xu He, Junlei Zhu, Jian Li and Bo An. The 31st ACM International Conference on Information and Knowledge Management (CIKM 2022).

Reinforcement learning (RL) techniques have shown great success in many challenging quantitative trading tasks, such as portfolio management and algorithmic trading. Especially, intraday trading is one of the most profitable and risky tasks because of the intraday behaviors of the financial market that reflect billions of rapidly fluctuating capitals. However, a vast majority of existing RL methods focus on the relatively low frequency trading scenarios (e.g., day-level) and fail to capture the fleeting intraday investment opportunities due to two major challenges: 1) how to effectively train profitable RL agents for intraday investment decision-making, which involves high-dimensional fine-grained action space; 2) how to learn meaningful multi-modality market representation to understand the intraday behaviors of the financial market at tick-level. Motivated by the efficient workflow of professional human intraday traders, we propose DeepScalper, a deep reinforcement learning framework for intraday trading to tackle the above challenges. Specifically, DeepScalper includes four components: 1) a dueling Q-network with action branching to deal with the large action space of intraday trading for efficient RL optimization; 2) a novel reward function with a hindsight bonus to encourage RL agents making trading decisions with a long-term horizon of the entire trading day; 3) an encoder-decoder architecture to learn multi-modality temporal market embedding, which incorporates both macro-level and micro-level market information; 4) a risk-aware auxiliary task to maintain a striking balance between maximizing profit and minimizing risk. Through extensive experiments on real-world market data spanning over three years on six financial futures (2 stock index and 4 treasury bond), we demonstrate that DeepScalper significantly outperforms many state-of-the-art baselines in terms of four financial criteria. Furthermore, we conduct a series of exploratory and ablative studies to analyze the contributions of each component in DeepScalper.

Integrating Diverse Policies for Portfolio Management via Combining Imitation Learning and Reinforcement Learning. Hui Niu, Siyuan Li and Jian Li. The 31st ACM International Conference on Information and Knowledge Management (CIKM 2022).

Portfolio management is a fundamental problem in finance. It involves periodic reallocations of assets to maximize the expected returns within an appropriate level of risk exposure. Deep reinforcement learning (RL) has been considered a promising approach to solving this problem owing to its strong ability in sequential decision making. However, due to the non-stationary nature of financial markets, applying RL techniques to portfolio optimization remains a challenging problem. Extracting trading knowledge from various expert strategies could be helpful for agents to accommodate the changing markets. In this paper, we propose \textit{MetaTrader}, a novel two-stage RL-based approach for portfolio management, which learns to integrate diverse trading policies to adapt to various market conditions. In the first stage, MetaTrader incorporates an imitation learning objective into the reinforcement learning framework. Through imitating different expert demonstrations, MetaTrader acquires a set of trading policies with great diversity. In the second stage, MetaTrader learns a meta-policy to recognize the market conditions and decide on the most proper learned policy to follow. We evaluate the proposed approach on three real-world index datasets and compare it to state-of-the-art baselines. The empirical results demonstrate that MetaTrader significantly outperforms those baselines in balancing profits and risks. Furthermore, thorough ablation studies validate the effectiveness of the components in the proposed approach.

Trade When Opportunity Comes: Price Movement Forecasting via Locality-Aware Attention and Adaptive Refined Labeling. Zeng, Liang, Lei Wang, Hui Niu, Jian Li, Ruchen Zhang, Zhonghao Dai, Dewei Zhu, and Ling Wang. [ArXiv]

FactorVAE: A Probabilistic Dynamic Factor Model Based on Variational Autoencoder for Predicting Cross-sectional Stock Returns. Yitong Duan, Lei Wang, Qizhong Zhang, Jian Li. AAAI Conference on Artificial Intelligence (AAAI 2022).

As an asset pricing model in economics and finance, factor model has been widely used in quantitative investment. Towards building more effective factor models, recent years have witnessed the paradigm shift from linear models to more flexible nonlinear data-driven machine learning models. However, due to low signal-to-noise ratio of the financial data, it is quite challenging to learn effective factor models. In this paper, we propose a novel factor model, FactorVAE, as a probabilistic model with inherent randomness for noise modeling. Essentially, our model integrates the dynamic factor model (DFM) with the variational autoencoder (VAE) in machine learning, and we propose a prior-posterior learning method based on VAE, which can effectively guide the learning of model by approximating an optimal posterior factor model with future information. Particularly, considering that risk modeling is important for the noisy stock data, FactorVAE can estimate the variances from the distribution over the latent space of VAE, in addition to predicting returns. The experiments on the real stock market data demonstrate the effectiveness of FactorVAE, which outperforms various baseline methods.

DoubleEnsemble: A New Ensemble Method Based on Sample Reweighting and Feature Selection for Financial Data Analysis. Chuhang Zhang, Yuanqi Li, Xi Chen, Yifei Jin, Pingzhong Tang, Jian Li. The IEEE International Conference on Data Mining (ICDM 2020). [ArXiv]

Modern machine learning models (such as deep neural networks and boosting decision tree models) have become increasingly popular in financial market prediction, due to their superior capacity to extract complex non-linear patterns. However, since financial datasets have very low signal-to-noise ratio and are non-stationary, complex models are often very prone to overfitting and suffer from instability issues. Moreover, as various machine learning and data mining tools become more widely used in quantitative trading, many trading firms have been producing an increasing number of features (aka factors). Therefore, how to automatically select effective features becomes an imminent problem. To address these issues, we propose DoubleEnsemble, an ensemble framework leveraging learning trajectory based sample reweighting and shuffling based feature selection. Specifically, we identify the key samples based on the training dynamics on each sample and elicit key features based on the ablation impact of each feature via shuffling. Our model is applicable to a wide range of base models, capable of extracting complex patterns, while mitigating the overfitting and instability issues for financial market prediction. We conduct extensive experiments, including price prediction for cryptocurrencies and stock trading, using both DNN and gradient boosting decision tree as base models. Our experiment results demonstrate that DoubleEnsemble achieves a superior performance compared with several baseline methods.

AutoAlpha: an Efficient Hierarchical Evolutionary Algorithm for Mining Alpha Factors in Quantitative Investment Tianping Zhang, Yuanqi Li, Yifei Jin, Jian Li. [ArXiv]

The multi-factor model is a widely used model in quantitative investment. The success of a multi-factor model is largely determined by the effectiveness of the alpha factors used in the model. This paper proposes a new evolutionary algorithm called AutoAlpha to automatically generate effective formulaic alphas from massive stock datasets. Specifically, first we discover an inherent pattern of the formulaic alphas and propose a hierarchical structure to quickly locate the promising part of space for search. Then we propose a new Quality Diversity search based on the Principal Component Analysis (PCA-QD) to guide the search away from the well-explored space for more desirable results. Next, we utilize the warm start method and the replacement method to prevent the premature convergence problem. Based on the formulaic alphas we discover, we propose an ensemble learning-to-rank model for generating the portfolio. The backtests in the Chinese stock market and the comparisons with several baselines further demonstrate the effectiveness of AutoAlpha in mining formulaic alphas for quantitative trading.

An Adaptive Master-Slave Regularized Model for Unexpected Revenue Prediction Enhanced with Alternative Data. Jin Xu, Jingbo Zhou, Jia Yongpo, Jian Li and Hui Xiong. 36th IEEE International Conference on Data Engineering (ICDE 2020).

Revenue prediction is an essential component in security analysis since the revenue of a company has a great impact on the per-formance of its stock. For investment, one of the most valuablepieces of information is the company’s unexpected revenue, which is the difference between the officially reported revenue and theconsensus estimate for revenue predicted by analysts. Since it is theunexpected revenue that indicates something exceeding or underanalysts’ expectation, it is an indispensable factor that influencesthe performance of a stock. Besides conventional trading data fromstock market and companies’ financial reports, recent years havewitnessed an extensive application of alternative data for gainingan information edge in stock investment. In this paper, we study the challenging problem to predict bet-ter unexpected revenue of a company via machine learning withalternative data. To the best of our knowledge, this is the first workstudying this problem in the literature. However, it is nontrivial toquantitatively model the relations between the unexpected revenueand the information provided by alternative data with a machinelearning approach. Thus we proposed an adaptive master-slaveregularized model, AMS for short, to effectively leverage alternative data for unexpected revenue prediction. Our AMS model firsttrains a master model upon the company graph, which captures therelations among companies, using a graph neural network (GNN). Then for a target company, the master model generates an adaptiveslave-model, which is specially optimized for this target company.Finally, we use this slave-model to predict the unexpected revenueof the target company. Besides its excellent prediction performance,another critical advantage of our AMS model lies in its superior in-terpretability, which is crucial for portfolio managers to understandthe predicted results. With extensive experiments using two real-world alternative datasets, we have demonstrated the effectivenessof our model against a set of competitors.

BRITS: Bidirectional Recurrent Imputation for Time Series. Wei Cao, Dong Wang,Jian Li, Hao Zhou, Lei Li, Yitan Li. Thirty-second Conference on Neural Information Processing Systems (NeurIPS 2018) [paper]

Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of RNN graph and can be effectively updated during the backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear dynamics underlying; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care data, and a localization data for human activity. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracies.

When Will You Arrive? Estimating Travel Time Based on Deep Neural Networks. Dong Wang, Junbo Zhang, Wei Cao, Jian Li, Yu Zheng. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018) [Paper] [Code and Data]

Estimating the travel time of any path (denoted by a sequence of connected road segments) in a city is of great importance to traffic monitoring, route planning, ridesharing, taxi/Uber dispatching, etc. However, it is a very challenging problem, affected by diverse complex factors, including spatial correlations, temporal dependencies, external conditions (e.g. weather, traffic lights). Prior work usually focuses on estimating the travel times of individual road segments or sub-paths and then summing up these times, which leads to an inaccurate estimation because such approaches do not consider road intersections/ traffic lights, and local errors may accumulate. To address these issues, we propose an end-to-end Deep learning framework for Travel Time Estimation (called DeepTTE) that estimates the travel time of the whole path directly. More specifically, we present a geo-convolution operation by integrating the geographic information into the classical convolution, capable of capturing spatial correlations. By stacking recurrent unit on the geo-convoluton layer, our DeepTTE can capture the temporal dependencies as well. A multi-task learning component is given on the top of DeepTTE, that learns to estimate the travel time of both the entire path and each local path simultaneously during the training phase. Extensive experiments on two trajectory datasets show our DeepTTE significantly outperforms the state-of-the-art methods.

DeepSD: Supply-Demand Prediction for Online Car-hailing Services using Deep Neural Networks. Dong Wang, Wei Cao, Jian Li, Jieping Ye. The 33th IEEE International Conference on Data Engineering (ICDE 2017). [paper]

The online car-hailing service has gained great popularity all over the world. As more passengers and more drivers use the service, it becomes increasingly more important for the the car-hailing service providers to effectively schedule the drivers to minimize the waiting time of passengers and maximize the driver utilization, thus to improve the overall user experience. In this paper, we study the problem of predicting the real-time car-hailing supply-demand, which is one of the most important component of an effective scheduling system. Our objective is to predict the gap between the car-hailing supply and demand in a certain area in the next few minutes. Based on the prediction, we can balance the supply-demands by scheduling the drivers in advance. We present an end-to-end framework called Deep Supply-Demand (DeepSD) using a novel deep neural network structure. Our approach can automatically discover complicated supply-demand patterns from the car-hailing service data while only requires a minimal amount hand-crafted features. Moreover, our framework is highly flexible and extendable. Based on our framework, it is very easy to utilize multiple data sources (e.g., car-hailing orders, weather and traffic data) to achieve a high accuracy. We conduct extensive experimental evaluations, which show that our framework provides more accurate prediction results than the existing methods.

LLMs (Pre-trained/Generative/Large Language/Foundation Models)

Latent Consistency Models: Synthesizing High-Resolution Images with Few-step Inference. Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao. [Paper][Github][Demo]

Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling is computationally intensive and leads to slow generation. Inspired by Consistency Models, we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768×768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference.

Generative Table Pre-training Empowers Models for Tabular Prediction. Tianping Zhang, Shaowen Wang, Shuicheng YAN, Li Jian, Qian Liu. The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). [Paper][Github]

Recently, the topic of table pre-training has attracted considerable research interest. However, how to employ table pre-training to boost the performance of tabular prediction remains an open challenge. In this paper, we propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction. After pre-training on a large corpus of real-world tabular data, TapTap can generate high-quality synthetic tables to support various applications on tabular data, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. Extensive experiments on 12 datasets demonstrate that TapTap outperforms a total of 16 baselines in different scenarios. Meanwhile, it can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer. Moreover, with the aid of table pre-training, models trained using synthetic data generated by TapTap can even compete with models using the original dataset on half of the experimental datasets, marking a milestone in the development of synthetic tabular data generation.

Not All Tasks Are Born Equal: Understanding Zero-Shot Generalization. Jing Zhou, Zongyu Lin, Yanan Zheng, Jian Li, Zhilin Yang. The Eleventh International Conference on Learning Representations (ICLR 2023). [Paper]

Recent work has achieved remarkable zero-shot performance with multi-task prompted pretraining, but little has been understood. For the first time, we show that training on a small number of key tasks beats using all the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. These novel findings combined deepen our understanding about zero-shot generalization—training on certain tasks such as QA encodes general knowledge transferable to a wide range of tasks. In addition, to automate this procedure, we devise a method that (1) identifies key training tasks without observing the test tasks by examining the pairwise generalization results and (2) resamples training tasks for better data distribution. Empirically, our approach achieves improved results across various model scales and tasks.

AEC-GAN: Adversarial Error Correction GANs for Auto-Regressive Long Time-series Generation. Lei Wang, Liang Zeng, Jian Li. The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023) [Paper] [ Show Abstract ]

Large-scale high-quality data is critical for training modern deep neural networks. However, data acquisition can be costly or time-consuming for many time-series applications, thus researchers turn to generative models for generating synthetic time-series data. In particular, recent generative adversarial networks (GANs) have achieved remarkable success in time-series generation. Despite their success, existing GAN models typically generate the sequences in an auto-regressive manner, and we empirically observe that they suffer from severe distribution shifts and bias amplification, especially when generating long sequences. To resolve this problem, we propose Adversarial Error Correction GAN (AEC-GAN), which is capable of dynamically correcting the bias in the past generated data to alleviate the risk of distribution shifts and thus can generate high-quality long sequences. AEC-GAN contains two main innovations: (1) We develop an error correction module to mitigate the bias. In the training phase, we adversarially perturb the realistic time-series data and then optimize this module to reconstruct the original data. In the generation phase, this module can act as an efficient regulator to detect and mitigate the bias. (2) We propose an augmentation method to facilitate GAN's training by introducing adversarial examples. Thus, AEC-GAN can generate high-quality sequences of arbitrary lengths, and the synthetic data can be readily applied to downstream tasks to boost their performance. We conduct extensive experiments on six widely used datasets and three state-of-the-art time-series forecasting models to evaluate the quality of our synthetic time-series data in different lengths and downstream tasks. Both the qualitative and quantitative experimental results demonstrate the superior performance of AEC-GAN over other deep generative models for time-series generation.

Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation. Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, Jian Li. Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2022). [ArXiv]

While large-scale neural language models, such as GPT2 and BART, have achieved impressive results on various text generation tasks, they tend to get stuck in undesirable sentence-level loops with maximization-based decoding algorithms (\textit{e.g.}, greedy search). This phenomenon is counter-intuitive since there are few consecutive sentence-level repetitions in human corpora (e.g., 0.02\% in Wikitext-103). To investigate the underlying reasons for generating consecutive sentence-level repetitions, we study the relationship between the probabilities of the repetitive tokens and their previous repetitions in the context. Through our quantitative experiments, we find that 1) Language models have a preference to repeat the previous sentence; 2) The sentence-level repetitions have a \textit{self-reinforcement effect}: the more times a sentence is repeated in the context, the higher the probability of continuing to generate that sentence; 3) The sentences with higher initial probabilities usually have a stronger self-reinforcement effect. Motivated by our findings, we propose a simple and effective training method \textbf{DITTO} (Pseu\underline{D}o-Repet\underline{IT}ion Penaliza\underline{T}i\underline{O}n), where the model learns to penalize probabilities of sentence-level repetitions from pseudo repetitive data. Although our method is motivated by mitigating repetitions, experiments show that DITTO not only mitigates the repetition issue without sacrificing perplexity, but also achieves better generation quality. Extensive experiments on open-ended text generation (Wikitext-103) and text summarization (CNN/DailyMail) demonstrate the generality and effectiveness of our method.

FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning. Jing Zhou, Yanan Zheng, Jie Tang, Jian Li, and Zhilin Yang. 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022). [ArXiv]

Most previous methods for text data augmentation are limited to simple tasks and weak baselines. We explore data augmentation on hard tasks (i.e., few-shot natural language understanding) and strong baselines (i.e., pretrained models with over one billion parameters). Under this setting, we reproduced a large number of previous augmentation methods and found that these methods bring marginal gains at best and sometimes degrade the performance much. To address this challenge, we propose a novel data augmentation method FlipDA that jointly uses a generative model and a classifier to generate label-flipped data. Central to the idea of FlipDA is the discovery that generating label-flipped data is more crucial to the performance than generating label-preserved data. Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness--- it substantially improves many tasks while not negatively affecting the others.

FewNLU: Benchmarking state-of-the-art methods for few-shot natural language understanding. Zheng, Yanan, Jing Zhou, Yujie Qian, Ming Ding, Jian Li, Ruslan Salakhutdinov, Jie Tang, Sebastian Ruder, and Zhilin Yang. 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022). [ArXiv]

The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and measuring progress of the field. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects, i.e., test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and relative gap of the methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) improvements of some methods diminish with a larger pretrained model; and (4) gains from different methods are often complementary and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, that implements our evaluation framework along with a number of state-of-the-art methods.

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition. Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery Data Mining (KDD 2020). [Paper]

Machine Learning (RL, NLP, GDBT, Network Embedding, meta-learning, AutoML)

GLIME: General, Stable and Local LIME Explanation. Zeren Tan, Tian Yang, Jian Li. Thirty-seventh Conference on Neural Information Processing Systems. 2023 (NeurIPS 2023, spotlight) [paper]

Although Local Interpretable Model-agnostic Explanations (LIME) \cite{ribeiro2016should} is a widely adopted method for understanding model behavior, it suffers from instability with respect to random seeds \cite{zafar2019dlime, shankaranarayana2019alime, bansal2020sam} and exhibits low local fidelity (i.e., how the explanation explains model's local behaviors) \cite{rahnama2019study, laugel2018defining}. Our study demonstrates that this instability is caused by small sample weights, resulting in the dominance of regularization and slow convergence. Additionally, LIME's sampling approach is non-local and biased towards the reference, leading to diminished local fidelity and instability to references. To address these challenges, we propose \textsc{Glime}, an enhanced framework that extends LIME and unifies several previous methods. Within the \textsc{Glime} framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, \textsc{Glime} generates explanations with higher local fidelity compared to LIME, while being independent of the reference choice. Moreover, \textsc{Glime} offers users the flexibility to choose sampling distribution based on their specific scenarios.

ImGCL: Revisiting Graph Contrastive Learning on Imbalanced Node Classification. Liang Zeng, Lanqing Li, Ziqi Gao, Pinlin Zhao, Jian Li. The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023)

Graph contrastive learning (GCL) has attracted a surge of attention due to its superior performance for learning node/graph representations without labels. However, in practice, the underlying class distribution of unlabeled nodes for the given graph is usually imbalanced. This highly imbalanced class distribution inevitably deteriorates the quality of learned node representations in GCL. Indeed, we empirically find that most state-of-the-art GCL methods cannot obtain discriminative representations and exhibit poor performance on imbalanced node classification. Motivated by this observation, we propose a principled GCL framework on Imbalanced node classification (ImGCL), which automatically and adaptively balances the representations learned from GCL without labels. Specifically, we first introduce the online clustering based progressively balanced sampling (PBS) method with theoretical rationale, which balances the training sets based on pseudo-labels obtained from learned representations in GCL. We then develop the node centrality based PBS method to better preserve the intrinsic structure of graphs, by upweighting the important nodes of the given graph. Extensive experiments on multiple imbalanced graph datasets and imbalanced settings demonstrate the effectiveness of our proposed framework, which significantly improves the performance of the recent state-of-the-art GCL methods. Further experimental ablations and analyses show that the ImGCL framework consistently improves the representation quality of nodes in under-represented (tail) classes.

Analyzing and Mitigating Interference in Neural Architecture Search. Jin Xu, Xu Tan, Kaitao Song, Renqian Luo, Yichong Leng, Tao Qin, Tie-Yan Liu, Jian Li. The 39th International Conference on Machine Learning (ICML 2022, spotlight) [ArXiv]

Weight sharing is a popular approach to reduce the cost of neural architecture search (NAS) by reusing the weights of shared operators from previously trained child models. However, the rank correlation between the estimated accuracy and ground truth accuracy of those child models is low due to the interference among different child models caused by weight sharing. In this paper, we investigate the interference issue by sampling different child models and calculating the gradient similarity of shared operators, and observe: 1) the interference on a shared operator between two child models is positively correlated with the number of different operators; 2) the interference is smaller when the inputs and outputs of the shared operator are more similar. Inspired by these two observations, we propose two approaches to mitigate the interference: 1) MAGIC-T: rather than randomly sampling child models for optimization, we propose a gradual modification scheme by modifying one operator between adjacent optimization steps to minimize the interference on the shared operators; 2) MAGIC-A: forcing the inputs and outputs of the operator across all child models to be similar to reduce the interference. Experiments on a BERT search space verify that mitigating interference via each of our proposed methods improves the rank correlation of super-pet and combining both methods can achieve better results. Our discovered architecture outperforms RoBERTa by 1.1 and 0.6 points and ELECTRA by 1.6 and 1.1 points on the dev and test set of GLUE benchmark. Extensive results on the BERT compression, reading comprehension and ImageNet task demonstrate the effectiveness and generality of our proposed methods.

Policy Search by Target Distribution Learning for Continuous Control. Chuheng Zhang, Yuanqi Li, Jian Li. The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020, Oral). [ArXiv]

We observe that several existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic (even in some very simple environments), leading to an unstable training process. To address this issue, we propose a new method, called \emph{target distribution learning} (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.

Relation Extraction with Temporal Reasoning Based on Memory Augmented Distant Supervision. Jianhao Yan, Lin He, Ruqin Huang, Jian Li and Ying Liu. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019). [paper]

Distant supervision (DS) for relation extraction is an important labor-saving method of completing knowledge base. However, DS is confronted with the wrong labeling problem. Previous work achieves a great success but ignores the value of DS as a suitable framework for temporal reasoning. This paper proposes a solution to predicting whether two given entities participate in a relation at any specific time spot. We construct a dataset called WIKI-TIME which additionally includes the valid period of a certain relation of two entities in the knowledge base. We propose a novel neural model to ensure that both the temporal information encoding and sequential reasoning can be performed. Experiment results show that, comparing with the best of existing models, our model exceeds the performance in WIKI-TIME dataset and achieves similar results in NYT-10.

NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization. Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang and Jie Tang. The 2019 Web Conference (WWW 2019). Full paper (Oral). [paper]

We study the problem of large-scale network embedding, which aims to learn latent representations for network mining applications. Previous research shows that 1) popular network embedding benchmarks, such as DeepWalk, are in essence implicitly factorizing a matrix with a closed form, and 2) the explicit factorization of such matrix generates more powerful embeddings than existing methods. However, directly constructing and factorizing this matrix—which is dense—is prohibitively expensive in terms of both time and space, making it not scalable for large networks. In this work, we present the algorithm of large-scale network embedding as sparse matrix factorization (NetSMF). NetSMF leverages theories from spectral sparsification to efficiently sparsify the aforementioned dense matrix, enabling significantly improved effi- ciency in embedding learning. The sparsified matrix is spectrally close to the original dense one with a theoretically bounded ap- proximation error, which helps maintain the representation power of the learned embeddings. We conduct experiments on networks of various scales and types. Results show that among both popular benchmarks (i.e., DeepWalk and LINE) and factorization based methods, NetSMF is the only method that achieves both high effi- ciency and effectiveness. We show that NetSMF requires only 24 hours to generate effective embeddings for a large-scale academic collaboration network with tens of millions of nodes, while it would cost DeepWalk months and is computationally infeasible for the dense matrix factorization solution.

Gradient Boosting With Piece-Wise Linear Regression Trees. Yu Shi, Jian Li, Zhize Li. The 28th International Joint Conference on Artificial Intelligence (IJCAI 2019). [ArXiv]

Gradient boosting using decision trees as base learners, so called Gradient Boosted Decision Trees (GBDT), is a very successful ensemble learning algorithm widely used across a variety of applications. Recently, various GDBT construction algorithms and implementation have been designed and heavily optimized in some very popular open sourced toolkits such as Xgboost and LightGBM. In this paper, we show that both the accuracy and efficiency of GBDT can be further enhanced by using more complex base learners. Specifically, we extend gradient boosting to use \textit{piecewise linear regression trees} (PL Trees), instead of \textit{piecewise constant regression trees}. We show PL Trees can accelerate convergence of GBDT. Moreover, our new algorithm fits better to modern computer architectures with powerful Single Instruction Multiple Data (SIMD) parallelism. We propose optimization techniques to speedup our algorithm. The experimental results show that GBDT with PL Trees can provide very competitive testing accuracy with comparable or less training time. Our algorithm also produces much concise tree ensembles, thus can often reduce testing time costs.

Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, Jie Tang. The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018). [ArXiv] [ Show Abstract ]

Since the invention of word2vec, the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE, in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) As an extension of LINE, PTE can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacian. Finally, we present the NetMF method as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.

Learning Gradient Descent: Better Generalization and Longer Horizons. Kaifeng Lv, Shunhua Jiang, Jian Li. The 34th International Conference on Machine Learning (ICML 2017). [ArXiv] [ Show Abstract ][Code]

Training deep neural networks is a highly nontrivial task, involving carefully selecting appropriate training algorithms, scheduling step sizes and tuning other hyperparameters. Trying different combinations can be quite labor-intensive and time consuming. Recently, researchers have tried to use deep learning algorithms to exploit the landscape of the loss function of the training problem of interest, and learn how to optimize over it in an automatic way. In this paper, we propose a new learning-to-learn model and some useful and practical tricks. Our optimizer outperforms generic, hand-crafted optimization algorithms and state-of-the-art learning-to-learn optimizers by DeepMind in many tasks. We demonstrate the effectiveness of our algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs.

DESTPRE : A Data-Driven Approach to Destination Prediction for Taxi Rides. Mengwen Xu, Dong Wang, Jian Li. UbiComp 2016. [ Paper ]

With the wide use of mobile devices, predicting the destination of moving vehicles has become an increasingly important problem for location based recommendation systems and destination-based advertising. Most existing approaches are based on various Markov chain models, in which the historical trajectories are used to train the model and the top-k most probable destinations are returned. We identify certain limitations of the previous approaches. Instead, we propose a new data-driven framework, called DESTPRE, which is not based on a probabilistic model, but directly operates on the trajectories and makes the prediction. We make use of only historic trajectories, without individual identity information. Our design of DESTPRE, although simple, is a result of several useful observations from the real trajectory data. DESTPRE involves an index based on Bucket PR Quadtree and Minwise hashing, for efficiently retrieving similar trajectories, and a clustering on destinations for predictions. By incorporating some additional ideas, we show that the prediction accuracy can be further improved. We have conducted extensive experiments on real Beijing Taxi dataset. The experimental results demonstrate the effectiveness of DESTPRE.

Trinary-Projection Trees for Approximate Nearest Neighbor Search. Jingdong Wang, Naiyan Wang, You Jia; Jian Li, Gang Zeng, Hongbin Zha, Xian-Sheng Hua. The IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013 [Paper]

Stochastic Optimization (stochastic combinatorial optimization problems, stochastic geometry problems, stochastic online optimization, Bandits)

Multi-token Markov Game with Switching Costs. Jian Li, Daogao Liu. ACM-SIAM Symposium on Discrete Algorithms (SODA22). [paper]

We study a general Markov game with metric switching costs: in each round, the player adaptively chooses one of several Markov chains to advance with the objective of minimizing the expected cost for at least k chains to reach their target states. If the player decides to play a different chain, an additional switching cost is incurred. The special case in which there is no switching cost was solved optimally by Dumitriu, Tetali, and Winkler [DTW03] by a variant of the celebrated Gittins Index for the classical multi-armed bandit (MAB) problem with Markovian rewards [Gittins 74, Gittins79]. However, for multi-armed bandit (MAB) with nontrivial switching cost, even if the switching cost is a constant, the classic paper by Banks and Sundaram [BS94] showed that no index strategy can be optimal. In this paper, we complement their result and show there is a simple index strategy that achieves a constant approximation factor if the switching cost is constant and k=1. To the best of our knowledge, this is the first index strategy that achieves a constant approximation factor for a general MAB variant with switching costs. For the general metric, we propose a more involved constant-factor approximation algorithm, via a nontrivial reduction to the stochastic k-TSP problem, in which a Markov chain is approximated by a random variable. Our analysis makes extensive use of various interesting properties of the Gittins index.

Algorithms and Adaptivity Gaps for Stochastic k-TSP. Haotian Jiang, Jian Li, Daogao Liu, Sahil Singla. The 11th Innovations in Theoretical Computer Science (ITCS 2020). [ArXiv]

Given a metric $(V,d)$ and a $\depot \in V$, the classical $\KTSP$ problem is to find a tour originating at $\depot$ of minimum length that visits at least $k$ nodes in $V$. In this work, motivated by applications where the input to an optimization problem is uncertain, we study two stochastic versions of $\KTSP$. In \StochKTSP, originally defined by Ene-Nagarajan-Saket, each vertex $v$ in the given metric $(V,d)$ contains a stochastic reward $R_v$. The goal is to adaptively find a tour of minimum expected length that collects at least reward $k$; Ene et al. give an $O(\log k)$-approximation adaptive algorithm for this problem, and left open if there is an $O(1)$-approximation algorithm. We totally resolve their open question, and even give an $O(1)$-approximation \emph{non-adaptive} algorithm for \StochKTSP. We also introduce and obtain similar results for the \StochKCost problem. In this problem each vertex $v$ has a stochastic cost $C_v$, and the goal is to visit and select at least $k$ vertices to minimize the expected \emph{sum} of tour length and cost of selected vertices. Besides being a natural stochastic generalization of \KTSP, this problem is also interesting because it generalizes the Price of Information framework by Singla from deterministic probing costs to metric probing costs. Our techniques are based on two crucial ideas: ``repetitions'' and ``critical scaling''. In general, replacing a random variable with its expectation leads to very poor results. We show that for our problems, if we truncate the random variables at an ideal threshold, then their expected values form a good surrogate. Here, we rely on running several repetitions of our algorithm with the same threshold, and then argue concentration using Freedman's and Jogdeo-Samuels inequalities. Unfortunately, this ideal threshold depends on how far we are from achieving our target $k$, which a non-adaptive algorithm does not know. To overcome this barrier, we truncate the random variables at various different scales and identify a ``critical'' scale.

Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems; Jian Li, and Amol Deshpande; Mathematics of Operations Research (MOR), Vol. 44, No. 1, 2018. Preliminary version in Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS 2011), Palm Springs, California, 2011. [Paper] [ArXiv].

We consider the problem of maximizing the expected utility E[u(X(S))] where X(S)=\sum_{i \in S}X_i and S is a feasible set for some combinatorial optimization problems (such as shortest path, minimum spanning tree, knapsack, etc.). (1) We present an additive PTAS for bounded Lipshitz utility function. This has implications in the VaR problem such as Pr[X(S)\leq 1](which corresponds to a step utility function). (2) We give PTAS for increasing concave functions (which are widely used to model risk-averse behaviors) and increasing function with bounded derivative.

A PTAS for a Class of Stochastic Dynamic Programs. Hao Fu, Jian Li and Pan Xu. The 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018) [ArXiv]

We develop a framework for obtaining polynomial time approximation schemes (PTAS) for a class of stochastic dynamic programs. Using our framework, we obtain the first PTAS for several stochastic combinatorial optimization problems. In particular, we obtain PTAS for the ProbeMax Probemax: We are given a set of n items, each item i has a value Xi which is an independent random variable with a known (discrete) distribution. We can probe a subset P of items sequentially. Each time after probing an item i, we observe its value realization, which follows the distribution of Xi. We can adaptively probe at most m items and each item can be probed at most once. The reward is the maximum among the m realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized. To the best of our knowledge, the best known approximation ratio is 1-1/e, due to Asadpour et al. We also obtain PTAS for several other problems: Committed Pandora’s Box, Stochastic Target, Stochastic Blackjack Knapsack.

Stochastic k-Center and j-Flat-Center Problems. Lingxiao Huang, Jian Li. ACM-SIAM Symposium on Discrete Algorithms (SODA17). [ArXiv]

Solving geometric optimization problems over uncertain data have become increasingly important in many applications and have attracted a lot of attentions in recent years. In this paper, we study two important geometric optimization problems, the k-center problem and the j-flat-center problem, over stochastic/uncertain data points in Euclidean spaces. For the stochastic k-center problem, we would like to find k points in a fixed dimensional Euclidean space, such that the expected value of the k-center objective is minimized. For the stochastic j-flat-center problem, we seek a j-flat (i.e., a j-dimensional affine subspace) such that the expected value of the maximum distance from any point to the j-flat is minimized. We consider both problems under two popular stochastic geometric models, the existential uncertainty model, where the existence of each point may be uncertain, and the locational uncertainty model, where the location of each point may be uncertain. We provide the first PTAS (Polynomial Time Approximation Scheme) for both problems under the two models. Our results generalize the previous results for stochastic minimum enclosing ball and stochastic enclosing cylinder.

Combinatorial Multi-Armed Bandit with General Reward Functions, Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, Pinyan Lu. Neural Information Processing Systems (NIPS), 2016.

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that if the underlying variables have known finite supports, SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. For general arbitrary distributions, we further use a discretization technique and show an $\tilde{O}(\sqrt{T})$ regret bound. We apply our results to the $K$-MAX problem and the expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon > 0$.

Pure Exploration of Multi-armed Bandit Under Matroid Constraints. Lijie Chen, Anupum Gupta, Jian Li. Conference on Learning Theory (COLT 2016). [Full version]

We study the pure exploration problem subject to a matroid constraint (BEST-BASIS) in a stochastic multi-armed bandit game. In a BEST-BASIS instance, we are given n stochastic arms with unknown reward distributions, as well as a matroid M over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify a basis of M with the maximum total weight, using as few samples as possible. The problem is a significant generalization of the best arm identification problem and the top-k arm identification problem, which have attracted significant attentions in recent years. We study both the exact and PAC versions of BEST-BASIS, and provide algorithms with nearly-optimal sample complexities for these versions. Our results generalize and/or improve on several previous results for the top-k arm identification problem and the combinatorial pure exploration problem (when the combinatorial constraint is a matroid).

[Survey] Approximation Algorithms for Stochastic Combinatorial Optimization Problems. Jian Li and Yu Liu. Journal of the Operations Research Society of China. (invited article) [paper]

Stochastic optimization has established itself as a major method to handle uncertainty in various optimization problems, by modeling the uncertainty by a probability distribution over possible realizations. Traditionally, the main focus in stochastic optimization has been various stochastic mathematical programming (such as linear programming, convex programming). In recent years, there has been a surge of interest in stochastic combinatorial optimization problems from the theoretical computer science community. In this article, we survey some of the recent results on various stochastic versions of classical combinatorial optimization problems. Since most problems in this domain are NP-hard (or \#P-hard, or even PSPACE-hard), we focus on the results which provide polynomial time approximation algorithms, with provable approximation guarantees. Our discussions are centered around a few representative problems, such as stochastic knapsack, stochastic matching, multi-armed bandit etc. We use these examples to introduce several popular stochastic models, such as the fixed set model, 2-stage stochastic optimization model, stochastic adaptive probing model etc, as well as some useful techniques for designing approximation algorithms for stochastic combinatorial optimization problems, including the linear programming relaxation approach, boosted sampling, content resolution schemes, Poisson approximation etc. We also provide some open research questions along the way. Our purpose is to provide the readers a quick glimpse to the models, problems and techniques in this area, and hopefully inspire new contributions.

On the Optimal Sample Complexity for Best Arm Identification. Lijie Chen, Jian Li. [ArXiv]

We study the best arm identification (BEST-1-ARM) problem, which is defined as follows. We are given n stochastic bandit arms. The ith arm has a reward distribution Di with an unknown mean μi. Upon each play of the ith arm, we can get a reward, sampled i.i.d. from Di. We would like to identify the arm with largest mean with probability at least 1-δ, using as few samples as possible. We also study an important special case where there are only two arms, which we call the sign problem. We achieve a very detailed understanding of the optimal sample complexity of sign, simplifying and significantly extending a classical result by Farrell in 1964, with a completely new proof. Using the new lower bound for sign, we obtain the first lower bound for BEST-1-ARM that goes beyond the classic Mannor-Tsitsiklis lower bound, by an interesting reduction from sign to BEST-1-ARM. To complement our lower bound, we also provide a nontrivial algorithm for BEST-1-ARM, which achieves a worst case optimal sample complexity, improving upon several prior upper bounds on the same problem.

On Top-k Selection in Multi-Armed Bandits and Hidden Bipartite Graphs. Wei Cao, Jian Li, Yufei Tao, Zhize Li. Neural Information Processing Systems (NIPS), 2015. [full paper]

This paper discusses how to efficiently choose from n unknown distributions the k ones whose means are the greatest by a certain metric, up to a small relative error. We study the topic under two standard settings—multi-armed bandits and hidden bipartite graphs. For both settings, we prove lower bounds on the total number of samples needed, and propose optimal algorithms whose sample complexities match those lower bounds.

Stochastic Online Greedy Learning with Semi-bandit Feedbacks. Tian Lin, Jian Li, Wei Chen. Neural Information Processing Systems (NIPS), 2015. [full paper]

We study the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the greedy regret and eps-quasi greedy regret as learning metrics comparing with the performance of offline greedy algorithm. We propose two online greedy learning algorithms with semi-bandit feedbacks, which achieve O(log T) problem-dependent regret bound for a general class of combinatorial structures and reward functions that allow greedy solutions.

Approximating the Expected Values for Combinatorial Optimization Problems over Stochastic Points. Lingxiao Huang, Jian Li. The 42nd International Colloquium on Automata, Languages, and Programming (ICALP 2015), to appear. [ArXiv] subsumes a previous draft: Minimum Spanning Trees, Perfect Matchings and Cycle Covers Over Stochastic Points in Metric Spaces.

We consider the stochastic geometry model where the location of each node is a random point in a given metric space, or the existence of each node is uncertain. We study the problems of computing the expected lengths of several combinatorial or geometric optimization problems over stochastic points, including closest pair, minimum spanning tree, k-clustering, minimum perfect matching, and minimum cycle cover. We also consider the problem of estimating the probability that the length of the closest pair, or the diameter, is at most, or at least, a given threshold. Most of the above problems are known to be #P-hard. We obtain FPRAS for most of them in both the existential and the locational uncertainty models.

epsilon-Kernel Coresets for Stochastic Points. Jian Li, Jeff Phillips and Haitao Wang The 24rd Annual European Symposium on Algorithms (ESA 2016). [AxXiv]
- We consider the standard stochastic geometry model in which the existence or the location of each point may be stochastic. We show there exists constant-size kernel coreset (a kernel can approximate either the expected value or the distribution of the width of the point set in every direction) and we can construct such kernel coresets in nearly linear time. We show its applications to several function extent problems, tracking moving points, and shape fitting problems in the stochastic setting.
Optimal PAC Multiple Arm Identification with Applications to Crowdsourcing, Yuan Zhou, Xi Chen, Jian Li. International Conference on Machine Learning (ICML 2014). [full version]
- We study the problem of selecting K arms with the highest expected rewards in a stochastic N-armed bandit game. We propose a new PAC algorithm, which, with probability at least 1-delta , identifies a set of K arms with average reward at most \epsilon away from the top-k arm. Naive uniform sampling requires O(nlnn) samples. We show it is possible to achieve linear sample complexity. We also establish a matching lower bound (meaning our upper bound is worse-case optimal).
Range Queries on Uncertain Data, Jian Li, Haitao Wang, International Symposium on Algorithms and Computation (ISAAC 2014). [paper]
- We are given a set of stochastic points on the real line, we consider the problem of building data structures on P to answer range queries: find the top-k points that lie in the given interval with the highest probability.
A Fully Polynomial Approximation Scheme for Approximating a Sum of Random Variables, Jian Li and Tianlin Shi. In Operation Research Letters (ORL), 2014 [ArXiv]
- We show there is an FPTAS for approximating Pr[∑X_i≤1] for a set of independent (not necessarily identically distributed) random variables X_i.
Stochastic Combinatorial Optimization via Poisson Approximation. Jian Li and Wen Yuan. In the 45th ACM Symposium on the Theory of Computing (STOC 2013), USA,2 013 [Paper] [Full version in ArXiv

We propose a new technique, called Poisson approximation, for approximating/optimizing the sum of a set of random variables. In many cases, it reduces the stochastic problem to a deterministic multidimensional packing/covering problem. We can apply it to several problems, such as (1) Expected utility maximization (we can reproduce and generalize one result in our FOCS11 paper.) (2)Stochastic knapsack (proposed in [Dean et al. MOR08]): we are given a set items. The size and/or the profit of an item may not be fixed values and only their probability distributions are known to us in advance. The actual size and profit of an item are revealed to us as soon as it is inserted into the knapsack. We want to find a policy to insert the items such that the total expected profit is maximized. We obtain a bicriterion PTAS (i.e., 1+eps approximation with 1+eps capacity), even we allow correlation and cancellation of jobs. (3) Bayesian Online Selection with knapsack constraint: it is a variant of stochastic knapsack where the size and the profit of an item are revealed before the decision whether to select the item is made.

When LP is the Cure for Your Matching Woes: Improved Bounds for Stochastic Matchings. Nikhil Bansal, Anupam Gupta, Jian Li, Julian Mestre, Viswanath Nagarajan, Atri Rudra. In the 18th Annual European Symposium on Algorithms (ESA 2010). (Best Paper Award) [Paper][Slides] Journal version in Algorithimca, 2011.[Journal Version]
- We study the stochastic matching problem, which finds applications in kidney exchange, online dating and online ads. Consider a random graph model where each possible edge e is present independently with some probability pe. We are given these numbers pe, and want to build a large/heavy matching in the randomly generated graph. However, the only way we can find out whether an edge is present or not is to query it, and if the edge is indeed present in the graph, we are forced to add it to our matching. Further, each vertex i is allowed to be queried at most ti times. How should we adaptively query the edges to maximize the expected weight of the matching? Our main result is the first constant approximation for the weighted stochastic matching.

Computational Geometry

K-Means Clustering with Distributed Dimension. Hu Ding, Yu Liu, Lingxiao Huang, Jian Li. The 33rd International Conference on Machine Learning (ICML 2016).

Distributed clustering has attracted significant attention in recent years. In this paper, we study the k-means problem in the distributed dimension setting, where the dimensions of the data are partitioned across multiple machines. We provide new approximation algorithms, which incur low communication costs and achieve constant approximation factors. The communication complexity of our algorithms significantly improve on existing algorithms. We also provide the first communication lower bound, which nearly matches our upper bound in a wide range of parameter setting. Our experimental results show that our algorithms outperform existing algorithms on real data sets in the distributed dimension setting.

Odd Yao-Yao Graphs are Not Spanners. Yifei Jin, Jian Li, Wei Zhan. In 34th International Symposium on Computational Geometry (SoCG 2018). [ArXiv]

It is a long standing open problem whether Yao-Yao graphs YYk are all spanners. Bauer and Damian showed that all YY6k for k \geq 6 are spanners. Li and Zhan generalized their result and proved that all even Yao-Yao graphs YY2k are spanners (for k \geq 42). However, their technique cannot be extended to odd Yao-Yao graphs, and whether they are spanners are still elusive. In this paper, we show that, surprisingly, for any integer k \geq 1, there exist odd Yao-Yao graph YY2k+1 instances, which are not spanners. This provides a negative answer to (Problem 70) in [http://cs.smith.edu/~orourke/TOPP/P70.html]

Almost All Even Yao-Yao Graphs Are Spanners. Jian Li, Wei Zhan. The 24rd Annual European Symposium on Algorithms (ESA 2016). [ArXiv]

It is an open problem whether Yao-Yao graphs YYk (also known as sparse-Yao graphs) are all spanners when the integer parameter k is large enough. In this paper we show that, for any integer k≥42, the Yao-Yao graph YY2k is a tk-spanner, with stretch factor tk=4.27+O(k−1) when k tends to infinity. Our result generalizes the best known result which asserts that all YY6k are spanners for k large enough [Bauer and Damian, SODA'13]. Our proof is also somewhat simpler.

A PTAS for the Weighted Unit Disk Cover Problem, Jian Li, Yifei Jin. The 42nd International Colloquium on Automata, Languages, and Programming (ICALP 2015) [ArXiv]

We are given a set of weighted unit disks and a set of points in Euclidean plane. The minimum weight unit disk cover (\UDC) problem asks for a subset of disks of minimum total weight that covers all given points. For the weighted \UDC\ problem, several constant approximations have been developed. However, whether the problem admits a PTAS has been an open question. We present the first PTAS for \UDC. Our result implies the first PTAS for the minimum weight dominating set problem in unit disk graphs, the first PTAS for the maximum lifetime coverage problem and an improved constant approximation ratio for the connected dominating set problem in unit disk graphs.

Approximating the Expected Values for Combinatorial Optimization Problems over Stochastic Points. Lingxiao Huang, Jian Li. The 42nd International Colloquium on Automata, Languages, and Programming (ICALP 2015). [ArXiv] subsumes a previous draft: Minimum Spanning Trees, Perfect Matchings and Cycle Covers Over Stochastic Points in Metric Spaces.

We consider the stochastic geometry model where the location of each node is a random point in a given metric space, or the existence of each node is uncertain. We study the problems of computing the expected lengths of several combinatorial or geometric optimization problems over stochastic points, including closest pair, minimum spanning tree, k-clustering, minimum perfect matching, and minimum cycle cover. We also consider the problem of estimating the probability that the length of closest pair, or the diameter, is at most, or at least, a given threshold. Most of the above problems are known to be #P-hard. We obtain FPRAS for most of them in both the existential and the locational uncertainty models.

Approximation Algorithms for the Connected Sensor Cover Problem. Lingxiao Huang, Jian Li, Qicai Shi. The 21st Annual International Computing and Combinatorics Conference (COCOON'15) , Journal version in TCS [paper]

We are given a set of sensors and a set of target points in the Euclidean plane. In MIN-ConnectSensorCover, our goal is to find a set of sensors of minimum cardinality, such that all target points are covered, and all sensors can communicate with each other (i.e., the communication graph is connected). In Budgeted-CSC problem, our goal is to choose a set of B sensors, such that the number of targets covered by the chosen sensors is maximized and the communication graph is connected. We obtain constant approximations for both problems.

Linear Time Approximation Schemes for Geometric Maximum Coverage, Jian Li, Haitao Wang, Bowei Zhang, Ningye Zhang. The 21st Annual International Computing and Combinatorics Conference (COCOON'15) , Journal version in TCS. [paper]

We study approximation algorithms for the following geometric version of the maximum coverage problem: Let P be a set of n weighted point in the plane. We would like to place m rectangles such that total weights of covered points in P is maximized. For any m, we present linear time approximation schemes that can find a 1+eps approximation to the optimal solution.

epsilon-Kernel Coresets for Stochastic Points. Jian Li, Jeff Phillips and Haitao Wang The 24rd Annual European Symposium on Algorithms (ESA 2016). [AxXiv]
- We consider the standard stochastic geometry model in which the existence or the location of each point may be stochastic. We show there exists constant-size kernel coreset (a kernel can approximate either the expected value or the distribution of the width of the point set in every direction) and we can construct such kernel coresets in nearly linear time. We show its applications to several function extent problems, tracking moving points, and shape fitting problems in the stochastic setting.
Range Queries on Uncertain Data, Jian Li, Haitao Wang, International Symposium on Algorithms and Computation (ISAAC 2014). [paper]
- We are given a set of stochastic points (with uniform distr) on the real line, we consider the problem of building data structures on P to answer range queries: find the top-k points that lie in the given interval with the highest probability.
Algorithms on Minimizing the Maximum Sensor Movement for Barrier Coverage of a Linear Domain; Danny Z. Chen, Yan Gu, Jian Li and Haitao Wang; In 13th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2012), Helsinki, Finland, 2012 [ArXiv] Journal version in Discrete Computational Geometry (DCG), 2013 [ Journal doi

Given a set of small intervals, and a target interval I, we would like to move the small intervals to cover the target interval. Our goal is to minimize the maximum movement of any interval. We obtain an O(n^2 log n) algorithm for this problem.

Trinary-Projection Trees for Approximate Nearest Neighbor Search. Jingdong Wang, Naiyan Wang, You Jia; Jian Li, Gang Zeng, Hongbin Zha, Xian-Sheng Hua. The IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013 Paper
Efficient Algorithms for One-Dimensional k-Center Problems. Danny Z. Chen, Jian Li, Haitao Wang. Theoretical Computer Science (TCS), 2015 [ArXiv]

We consider the problem of finding k centers for n weighted points on a real line. This (weighted) k-center problem was solved in O(n log n) time previously by using Cole's parametric search and other complicated approaches. In this paper, we present an easier O(n log n) time algorithm that avoids the parametric search, and in certain special cases our algorithm solves the problem in O(n) time.