Jian Li's Publications

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heap's and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.

We study the stochastic probing problem under a general monotone norm objective. Given a ground set , each element has an independent nonnegative random variable with known distribution. Probing an element reveals its value, and the sequence of probed elements must satisfy a prefix-closed feasibility constraint. A monotone norm determines the reward , where is the set of probed elements and is the vector with for and 0 otherwise. The goal is to design a probing strategy maximizing the expected reward . We focus on the adaptivity gap: the ratio between the expected rewards of optimal adaptive and optimal non-adaptive strategies. We resolve an open question posed in [GNS17, KMS24], showing that for general monotone norms, the adaptivity gap is at most polylog . For monotone symmetric norms, we prove a constant adaptivity gap, improving the previous bound from [PRS23].

We study the general norm optimization for combinatorial problems, initiated by Chakrabarty and Swamy (STOC 2019). We propose a general formulation that captures a large class of min-norm combinatorial problems. We present a general equivalent reduction of the norm minimization problem to a multi-criteria optimization problem with logarithmic budget constraints, up to a constant approximation factor. Leveraging this reduction, we obtain constant factor approximation algorithms for the norm minimization versions of several covering problems, such as interval cover, multi-dimensional knapsack cover, and logarithmic factor approximation for set cover. We also study the norm minimization versions for perfect matching, s-t path and s-t cut. We show the natural linear programming relaxations for these problems have a large integrality gap. To complement the negative result, we show that, for perfect matching, there is a bi-criteria result: for any constant eps , we can find in polynomial time a nearly perfect matching and its cost is at most 1+eps times of the optimum for perfect matching. Moreover, we establish the existence of a polynomial-time -loglogn approximation algorithm for the norm minimization s-t path problem.

Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. While in this paper, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed, but rather aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512 512 ImageNet images, revealing that: 1) DPS s conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) The mean of DPS s conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a light-weighted conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS s performance.

In this work, we investigate a particular implicit bias in the gradient descent training process, which we term Feature Averaging , and argue that it is one of the principal factors contributing to non-robustness of deep neural networks. Despite the existence of multiple discriminative features capable of classifying data, neural networks trained by gradient descent training exhibit a tendency to learn the average of these features, rather than distinguishing and leveraging each feature individually. In particular, we provide a detailed theoretical analysis of the training dynamics of gradient descent in a two-layer ReLU network for a binary classification task, where the data distribution consists of multiple clusters with orthogonal cluster feature vectors. We rigorously prove that gradient descent converges to the regime of feature averaging, wherein the weights associated with each hidden-layer neuron represent an average of the cluster means (each mean corresponding to a distinct feature). Furthermore, we prove that, with the provision of more granular supervised information, a two-layer neural network is capable of learning individual features. We also conduct extensive experiments using synthetic datasets, MNIST and CIFAR-10 to substantiate the phenomenon of feature averaging and its role in adversarial robustness of neural networks. We hope the theoretical and empirical insights can provide a deeper understanding of how detailed supervision can enhance model robustness.

We propose a hypergraphbased factor model with temporal residual contrastive learning (FactorGCL) that employs a hypergraph structure to better capture high-order nonlinear relationships among stock returns and factors. To mine hidden factors that supplement human-designed prior factors for predicting stock returns, we design a cascading residual hypergraph architecture, in which the hidden factors are extracted from the residual information after removing the influence of prior factors. Additionally, we propose a temporal residual contrastive learning method to guide the extraction of effective and comprehensive hidden factors by contrasting stock-specific residual information over different time periods. Our extensive experiments on real stock market data demonstrate that FactorGCL not only outperforms existing state-of-the-art methods but also mines effective hidden factors for predicting stock returns.

We propose OpenFE++, which leverages the feature interactions from both the feature and temporal dimensions to construct a substantially reduced candidate feature set, thereby enhancing efficiency and effectiveness in feature generation. In the feature dimension, OpenFE++ utilizes locally interacted features to generate meaningful candidate features without exhaustively enumerating all possibilities. In the temporal dimension, it evaluates the lagged effects among different features and generates temporally meaningful features via the representative lagged periods, eliminating the need for enumeration in sequence length. Thus, OpenFE++ can efficiently generate effective, generalizable and interpretable features to boost the forecasting performance of machine learning models. We conduct extensive experiments on fourteen widely used benchmark datasets (ten benchmarks for tabular tasks and four benchmarks for time-series tasks) to demonstrate that OpenFE++ outperforms other baseline models in both efficiency and effectiveness.

Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22), by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers, and assignment structure constraints for data points (modeled by a convex body B). Our main theorem shows that a small-sized ϵ-coreset exists as long as a complexity measure Lip(B) of the structure constraint, and the \emph{covering exponent} ϵ(X) for metric space (X,d) are bounded. The complexity measure Lip(B) for convex body B is the Lipschitz constant of a certain transportation problem constrained in B, called \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds of Lip(B) for various polytopes, including the general matroid basis polytopes, and laminar matroid polytopes (with better bound). As an application of our general theorem, we construct the first coreset for the fault-tolerant clustering problem (with or without capacity upper/lower bound) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope.

Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, as one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA reduces the computational and memory requirements significantly at each iteration, extensive empirical evidence indicates that it converges at a considerably slower rate compared to full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In our paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change of the architecture and the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low Rank Adaptation with Gradient Approximation), which aligns the gradients of low-rank matrix product with those of full fine-tuning at the first step. Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on the subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MT-bench, GSM8K, and Human-eval, respectively. Additionally, we observe up to 2-4 times convergence speed improvement compared to vanilla LoRA, validating its effectiveness in accelerating convergence and enhancing model performance.

Constructing small-sized coresets for various clustering problems in different metric spaces has attracted significant attention for the past decade. A central problem in the coreset literature is to understand what is the best possible coreset size for (k,z)-clustering in Euclidean space. While there has been significant progress in the problem, there is still a gap between the state-of-the-art upper and lower bounds. For instance, the best known upper bound for k-means (z=2) is min{O(k3/2 −2),O(k −4)} [1,2], while the best known lower bound is (k −2) [1]. In this paper, we make progress on both upper and lower bounds. For a large range of parameters (i.e., ,k), we have a complete understanding of the optimal coreset size. In particular, we obtain the following results: (1) We present a new coreset lower bound (k −z−2) for Euclidean (k,z)-clustering when šݦ (k−1/(z+2)). In view of the prior upper bound O~z(k −z−2) [1], the bound is optimal. The new lower bound is surprising since (k −2) [1] is ``conjectured" to be the correct bound in some recent works (see e.g., [1,2]]). (2) For the upper bound, we provide efficient coreset construction algorithms for (k,z)-clustering with improved or optimal coreset sizes in several metric spaces. [1] Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC'22. [2] Cohen-Addad, Larsen, Saulpic, Schwiegelshohn, Sheikh-Omar, NeurIPS'22.

Market making (MM) via Reinforcement Learning (RL) has attracted significant attention in financial trading. Most existing RL-based MM methods focus on optimizing single-price level strategies which fail at frequent order cancellations and loss of queue priority. By comparison, strategies involving multiple price levels align better with actual trading scenarios. However, given the complexity that multi-price level RL strategies involve a comprehensive trading action space, the challenge of effectively training RL persists. Inspired by the effective workflow of professional human market makers, we propose Imitative Market Maker (IMM), a novel RL framework leveraging knowledge from both suboptimal signal-based experts and direct policy interactions. Our framework starts with introducing effective state and action formulations that well encode information about multiprice level orders. Furthermore, IMM integrates a representation learning unit capable of capturing both short- and long-term market trends to mitigate adverse selection risk. Subsequently, IMM designs an expert strategy based on predictive signals, and trains the agent through the integration of RL and imitation learning techniques to achieve efficient learning. Extensive experimental results on four real-world market datasets demonstrate the superiority of IMM against current RL-based MM strategies.

Price movement forecasting aims at predicting the future trends of financial assets based on the current market conditions and other relevant information. Recently, machine learning (ML) methods have become increasingly popular and achieved promising results for price movement forecasting in both academia and industry. Most existing ML solutions formulate the forecasting problem as a classification (to predict the direction) or a regression (to predict the return) problem over the entire set of training data. However, due to the extremely low signal-to-noise ratio and stochastic nature of financial data, good trading opportunities are extremely scarce. As a result, without careful selection of potentially profitable samples, such ML methods are prone to capture the patterns of noises instead of real signals. To address this issue, we propose a novel price movement forecasting framework named LARA consisting of two main components: Locality-Aware Attention (LA-Attention) and Iterative Refinement Labeling (RA-Labeling). (1) LA-Attention automatically extracts the potentially profitable samples by attending to label information. Moreover, equipped with metric learning techniques, LA-Attention enjoys task-specific distance metrics and effectively distributes attention to potentially profitable samples. (2) RA-Labeling further iteratively refines the noisy labels of potentially profitable samples, and combines the learned predictors robust to the unseen and noisy samples. In a set of experiments on three real-world financial markets: stocks, cryptocurrencies, and ETFs.

In recent years, there has been a growing interest in applying reinforcement learning (RL) techniques to order execution owing to RL s strong sequential decision-making ability. However, realistic order execution tasks usually involve a large fne-grained action space and a long trading duration. The former hinders the RL agents from effcient exploration. The latter increases the task complexity, since the agent must capture price advantages throughout the day as well as micro changes within a few seconds on the limited order books. In addressing these challenges, we propose MacMic, a novel Hierarchical RL-based order execution approach that captures market patterns and executes orders from different temporal scales. MacMic employs a high-level agent to split the parent order into smaller slices at coarse-grained time steps. Then a low-level agent is adopted to execute these slices by placing fxed-size sub-orders at a continuous time. Besides, to balance the multifaceted objectives of the two tasks, MacMic pretrains a causal stacking hidden Markov model (SHMM) to obtain both effective macro-level and micro-level market states. Comprehensive experimental results on 200 stocks across the US and China A-share markets validate the effectiveness of the proposed method.

Feature importance scores (FIS) estimation is an important problem in many data-intensive applications. Traditional approaches can be divided into two types; model-specific methods and model-agnostic methods. In this work, we present FeatureLTE, a novel learning-based approach to FIS estimation. For the first time, as we demonstrate through extensive experiments, it is possible to build general-purpose pre-trained models for FIS estimation. Therefore, FIS estimation reduces to prediction outputs from a pre-trained FeatureLTE model. Pre-trained FeatureLTE models enjoy several desired advantages, including accuracy, robustness, efficiency, and evolvability, and FeatureLTE models really begin to shine on large datasets where traditional methods often find themselves unable to scale. We build our pre-trained models for binary classification and regression problems using observations from nearly 1,000 public datasets. We systematically evaluate various design choices of FeatureLTE model construction and carefully design meta features to make sure that they are computationally lightweight. Based on our evaluation, FeatureLTE is on par with the best existing FIS estimators in terms of FIS quality, and achieves up to 339.48x speedup without sacrificing the quality of FIS estimates on large-scale datasets. Finally, we release two pre-trained FeatureLTE models for binary classification and regression problems that are ready to use on almost all tabular datasets, along with the repository of 701 binary classification datasets and 256 regression datasets with pre-computed feature importance scores to promote future research along this direction.

Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling is computationally intensive and leads to slow generation. Inspired by Consistency Models, we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference.

Although Local Interpretable Model-agnostic Explanations (LIME) \cite{ribeiro2016should} is a widely adopted method for understanding model behavior, it suffers from instability with respect to random seeds \cite{zafar2019dlime, shankaranarayana2019alime, bansal2020sam} and exhibits low local fidelity (i.e., how the explanation explains model's local behaviors) \cite{rahnama2019study, laugel2018defining}. Our study demonstrates that this instability is caused by small sample weights, resulting in the dominance of regularization and slow convergence. Additionally, LIME's sampling approach is non-local and biased towards the reference, leading to diminished local fidelity and instability to references. To address these challenges, we propose \textsc{Glime}, an enhanced framework that extends LIME and unifies several previous methods. Within the \textsc{Glime} framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, \textsc{Glime} generates explanations with higher local fidelity compared to LIME, while being independent of the reference choice. Moreover, \textsc{Glime} offers users the flexibility to choose sampling distribution based on their specific scenarios.

Recently, the topic of table pre-training has attracted considerable research interest. However, how to employ table pre-training to boost the performance of tabular prediction remains an open challenge. In this paper, we propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction. After pre-training on a large corpus of real-world tabular data, TapTap can generate high-quality synthetic tables to support various applications on tabular data, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. Extensive experiments on 12 datasets demonstrate that TapTap outperforms a total of 16 baselines in different scenarios. Meanwhile, it can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer. Moreover, with the aid of table pre-training, models trained using synthetic data generated by TapTap can even compete with models using the original dataset on half of the experimental datasets, marking a milestone in the development of synthetic tabular data generation.

We consider the sparse moment problem of learning a k-spike mixture in high dimensional space from its noisy moment information in any dimension. We measure the accuracy of the learned mixtures using transportation distance. Previous algorithms either assume certain separation assumptions, use more recovery moments, or run in (super) exponential time. Our algorithm for the 1-dimension problem (also called the sparse Hausdorff moment problem) is a robust version of the classic Prony's method, and our contribution mainly lies in the analysis. We adopt a global and much tighter analysis than previous work (which analyzes the perturbation of the intermediate results of Prony's method). A useful technical ingredient is a connection between the linear system defined by the Vandermonde matrix and the Schur polynomial, which allows us to provide tight perturbation bound independent of the separation and may be useful in other contexts. To tackle the high dimensional problem, we first solve the 2-dimensional problem by extending the 1-dimension algorithm and analysis to complex numbers. Our algorithm for the high dimensional case determines the coordinates of each spike by aligning a 1-d projection of the mixture to a random vector and a set of 2d-projections of the mixture. Our results have applications to learning topic models and Gaussian mixtures, implying improved sample complexity results or running time over prior work.

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features. 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3\% data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting.

Recent work has achieved remarkable zero-shot performance with multi-task prompted pretraining, but little has been understood. For the first time, we show that training on a small number of key tasks beats using all the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. These novel findings combined deepen our understanding about zero-shot generalization training on certain tasks such as QA encodes general knowledge transferable to a wide range of tasks. In addition, to automate this procedure, we devise a method that (1) identifies key training tasks without observing the test tasks by examining the pairwise generalization results and (2) resamples training tasks for better data distribution. Empirically, our approach achieves improved results across various model scales and tasks.

Gradient Boosting Decision Tree (GBDT) has achieved remarkable success in a wide variety of applications. The split finding algorithm, which determines the tree construction process, is one of the most crucial components of GBDT. However, the split finding algorithm has long been criticized for its bias towards features with a larger number of potential splits. This bias introduces severe interpretability and overfitting issues in GBDT. To this end, we provide a fine-grained analysis of bias in GBDT and demonstrate that the bias originates from 1) the systematic bias in the gain estimation of each split and 2) the bias in the split finding algorithm resulting from the use of the same data to evaluate the split improvement and determine the best split. Based on the analysis, we propose unbiased gain, a new unbiased measurement of gain importance using out-of-bag samples. Moreover, we incorporate the unbiased property into the split finding algorithm during tree construction and develop UnbiasedGBM to solve the overfitting issue of GBDT. We empirically assess the performance of UnbiasedGBM and unbiased gain in a large-scale empirical study comprising 60 tabular datasets and demonstrate that: 1) UnbiasedGBM exhibits better performance than popular GBDT implementations such as LightGBM, XGBoost, and Catboost on average on the 60 datasets and 2) unbiased gain achieves better average performance in feature selection than popular feature importance methods including gain importance, permutation feature importance, and SHAP importance.

Optimized trade execution is to sell (or buy) a given amount of assets in a given time with the lowest possible trading cost. Recently, reinforcement learning (RL) has been applied to optimized trade execution to learn smarter policies from market data. However, we find that many existing RL methods exhibit considerable overfitting which prevents them from real deployment. In this paper, we provide an extensive study on the overfitting problem in optimized trade execution. First, we model the optimized trade execution as offline RL with dynamic context, where the context represents market variables that cannot be influenced by the trading policy and are collected in an offline manner. Under this framework, we derive the generalization bound and find that the overfitting issue is caused by large context space and limited context samples in the offline setting. Accordingly, we propose to learn compact representations for context to address the overfitting problem, either by leveraging prior knowledge or in an end-to-end manner. To evaluate our algorithms, we also implement a carefully designed simulator based on historical limit order book (LOB) data to provide a high-fidelity benchmark for different algorithms. Our experiments on the high-fidelity simulator demonstrate that our algorithms can effectively alleviate overfitting and achieve better performance.

Large-scale high-quality data is critical for training modern deep neural networks. However, data acquisition can be costly or time-consuming for many time-series applications, thus researchers turn to generative models for generating synthetic time-series data. In particular, recent generative adversarial networks (GANs) have achieved remarkable success in time-series generation. Despite their success, existing GAN models typically generate the sequences in an auto-regressive manner, and we empirically observe that they suffer from severe distribution shifts and bias amplification, especially when generating long sequences. To resolve this problem, we propose Adversarial Error Correction GAN (AEC-GAN), which is capable of dynamically correcting the bias in the past generated data to alleviate the risk of distribution shifts and thus can generate high-quality long sequences. AEC-GAN contains two main innovations: (1) We develop an error correction module to mitigate the bias. In the training phase, we adversarially perturb the realistic time-series data and then optimize this module to reconstruct the original data. In the generation phase, this module can act as an efficient regulator to detect and mitigate the bias. (2) We propose an augmentation method to facilitate GAN's training by introducing adversarial examples. Thus, AEC-GAN can generate high-quality sequences of arbitrary lengths, and the synthetic data can be readily applied to downstream tasks to boost their performance. We conduct extensive experiments on six widely used datasets and three state-of-the-art time-series forecasting models to evaluate the quality of our synthetic time-series data in different lengths and downstream tasks. Both the qualitative and quantitative experimental results demonstrate the superior performance of AEC-GAN over other deep generative models for time-series generation.

Graph contrastive learning (GCL) has attracted a surge of attention due to its superior performance for learning node/graph representations without labels. However, in practice, the underlying class distribution of unlabeled nodes for the given graph is usually imbalanced. This highly imbalanced class distribution inevitably deteriorates the quality of learned node representations in GCL. Indeed, we empirically find that most state-of-the-art GCL methods cannot obtain discriminative representations and exhibit poor performance on imbalanced node classification. Motivated by this observation, we propose a principled GCL framework on Imbalanced node classification (ImGCL), which automatically and adaptively balances the representations learned from GCL without labels. Specifically, we first introduce the online clustering based progressively balanced sampling (PBS) method with theoretical rationale, which balances the training sets based on pseudo-labels obtained from learned representations in GCL. We then develop the node centrality based PBS method to better preserve the intrinsic structure of graphs, by upweighting the important nodes of the given graph. Extensive experiments on multiple imbalanced graph datasets and imbalanced settings demonstrate the effectiveness of our proposed framework, which significantly improves the performance of the recent state-of-the-art GCL methods. Further experimental ablations and analyses show that the ImGCL framework consistently improves the representation quality of nodes in under-represented (tail) classes.

This paper revisits building machine learning algorithms that involve interactions between entities, such as those between financial assets in an actively managed portfolio, or interactions between users in a social network. Our goal is to forecast the future evolution of ensembles of multivariate time series in such applications (e.g., the future return of a financial asset or the future popularity of a Twitter account). Designing ML algorithms for such systems requires addressing the challenges of high-dimensional interactions and non-linearity. We propose a novel framework, which we dub as the additive influence model. Under our modeling assumption, we show that it is possible to decouple the learning of high-dimensional interactions from the learning of non-linear feature interactions. To learn the high-dimensional interactions, we leverage kernel-based techniques, with provable guarantees, to embed the entities in a low-dimensional latent space. To learn the non-linear feature-response interactions, we generalize prominent machine learning techniques, including designing a new statistically sound non-parametric method and an ensemble learning algorithm optimized for vector regressions.

Portfolio management is a fundamental problem in finance. It involves periodic reallocations of assets to maximize the expected returns within an appropriate level of risk exposure. Deep reinforcement learning (RL) has been considered a promising approach to solving this problem owing to its strong capability in sequential decision making. However, due to the non-stationary nature of financial markets, applying RL techniques to portfolio optimization remains a challenging problem. Extracting trading knowledge from various expert strategies could be helpful for agents to accommodate the changing markets. In this paper, we propose MetaTrader, a novel two-stage RL-based approach for portfolio management, which learns to integrate diverse trading policies to adapt to various market conditions. In the first stage, MetaTrader incorporates an imitation learning objective into the reinforcement learning framework. Through imitating different expert demonstrations, MetaTrader acquires a set of trading policies with great diversity. In the second stage, MetaTrader learns a meta-policy to recognize the market conditions and decide on the most proper learned policy to follow. We evaluate the proposed approach on three real-world index datasets and compare it to state-of-the-art baselines. The empirical results demonstrate that MetaTrader significantly outperforms those baselines in balancing profits and risks. Furthermore, thorough ablation studies validate the effectiveness of the components in the proposed approach.

We study the generalized load-balancing (GLB) problem, where we are given n jobs, each of which needs to be assigned to one of m unrelated machines with processing times {pij}. The load of each machine i is a symmetric monotone norm function of the processing times. Our goal is to minimize the generalized makespan, which is another symmetric monotone norm over the m-dimensional machine load vector. This problem significantly generalizes many classic optimization problems, e.g., makespan minimization, set cover, minimum-norm load-balancing, etc. We obtain a polynomial time randomized algorithm that achieves an approximation factor of O(logn), matching the lower bound of set cover up to constant factor.

While large-scale neural language models, such as GPT2 and BART, have achieved impressive results on various text generation tasks, they tend to get stuck in undesirable sentence-level loops with maximization-based decoding algorithms (\textit{e.g.}, greedy search). This phenomenon is counter-intuitive since there are few consecutive sentence-level repetitions in human corpora (e.g., 0.02\% in Wikitext-103). To investigate the underlying reasons for generating consecutive sentence-level repetitions, we study the relationship between the probabilities of the repetitive tokens and their previous repetitions in the context. Through our quantitative experiments, we find that 1) Language models have a preference to repeat the previous sentence; 2) The sentence-level repetitions have a \textit{self-reinforcement effect}: the more times a sentence is repeated in the context, the higher the probability of continuing to generate that sentence; 3) The sentences with higher initial probabilities usually have a stronger self-reinforcement effect. Motivated by our findings, we propose a simple and effective training method \textbf{DITTO} (Pseu\underline{D}o-Repet\underline{IT}ion Penaliza\underline{T}i\underline{O}n), where the model learns to penalize probabilities of sentence-level repetitions from pseudo repetitive data. Although our method is motivated by mitigating repetitions, experiments show that DITTO not only mitigates the repetition issue without sacrificing perplexity, but also achieves better generation quality. Extensive experiments on open-ended text generation (Wikitext-103) and text summarization (CNN/DailyMail) demonstrate the generality and effectiveness of our method.

Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural networks trained by full-batch gradient descent typically enter a regime called Edge of Stability (EOS). In this regime, the sharpness, i.e., the maximum Hessian eigenvalue, first increases to the value 2/(step size) (the progressive sharpening phase) and then oscillates around this value (the EOS phase). This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory. Our analysis naturally divides the GD trajectory into four phases depending on the change of the sharpness. We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics. Based on this empirical observation, we attempt to theoretically and empirically explain the dynamics of various key quantities that lead to the change of sharpness in each phase of EOS. Moreover, based on certain assumptions, we provide a theoretical proof of the sharpness behavior in EOS regime in two-layer fully-connected linear neural networks. We also discuss some other empirical findings and the limitation of our theoretical results.

Proving algorithm-dependent generalization error bounds for gradient-type optimization methods has attracted significant attention recently in learning theory. However, most existing trajectory-based analyses require either restrictive assumptions on the learning rate (e.g., fast decreasing learning rate), or continuous injected noise (such as the Gaussian noise in Langevin dynamics). In this paper, we introduce a new discrete data-dependent prior to the PAC-Bayesian framework, and prove a high probability generalization bound for Floored GD (i.e. a finite precision version of gradient descent). We remark that our bound holds for nonconvex and nonsmooth scenarios. Moreover, our theoretical results provide numerically favorable upper bounds of testing errors (e.g., 0.037 on MNIST). Using a similar technique, we can also obtain new generalization bounds for certain variants of SGD. Furthermore, we study the generalization bounds for gradient Langevin Dynamics (GLD). Using the same framework with a carefully constructed continuous prior, we show a new high probability tighter generalization bound for GLD. The new faster rate is due to the concentration of the difference between the gradient of training samples and that of the prior.

In many classic clustering problems, we seek to sketch a massive data set of n points in a metric space, by segmenting them into k categories or clusters, each cluster represented concisely by a single point in the metric space. Two notable examples are the k-center/k-supplier problem and the k-median problem. In practical applications of clustering, the data set may evolve over time, reflecting an evolution of the underlying clustering model. In this paper, we initiate the study of a dynamic version of clustering problems that aims to capture these considerations. In this version there are T time steps, and in each time step t in {1,2, ,T}, the set of clients needed to be clustered may change, and we can move the k facilities between time steps. More specifically, we study two concrete problems in this framework: the Dynamic Ordered k-Median and the Dynamic k-Supplier problem. We first consider the Dynamic Ordered k-Median problem, where the objective is to minimize the weighted sum of ordered distances over all time steps, plus the total cost of moving the facilities between time steps. We present one constant-factor approximation algorithm for T=2 and another approximation algorithm for fixed T>=3. Then we consider the Dynamic k-Supplier problem, where the objective is to minimize the maximum distance from any client to its facility, subject to the constraint that between time steps the maximum distance moved by any facility is no more than a given threshold. When the number of time steps T is 2, we present a simple constant factor approximation algorithm and a bi-criteria constant factor approximation algorithm for the outlier version, where some of the clients can be discarded. We also show that it is NP-hard to approximate the problem with any factor for T>=3.

Recent works on implicit regularization have shown that gradient descent converges to the max-margin direction for logistic regression with one-layer or multi-layer linear networks. In this paper, we generalize this result to homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient flow (gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Furthermore, we extend the above results to a large family of loss functions. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. For gradient descent with constant learning rate, we observe that the normalized margin indeed keeps increasing after the dataset is fitted, but the speed is very slow. However, if we schedule the learning rate more carefully, we can observe a more rapid growth of the normalized margin. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.

Generalization error (also known as the out-of-sample error) measures how well the hypothesis obtained from the training data can generalize to previously unseen data. Obtaining tight generalization error bounds is central to statistical learning theory. In this paper, we study the generalization error bound in learning general non-convex objectives, which has attracted significant attention in recent years. In particular, we study the (algorithm-dependent) generalization bounds of various iterative gradient based methods.

(1)We develop a new framework, termed Bayes-Stability, for proving algorithm-dependent generalization error bounds. The new framework combines ideas from both the PAC-Bayesian theory and the notion of algorithmic stability. Applying the Bayes-Stability method, we obtain new data-dependent generalization bounds for stochastic gradient Langevin dynamics (SGLD) and several other noisy gradient methods (e.g., with momentum, mini-batch and acceleration, Entropy-SGD). Our result recovers (and is typically tighter than) a recent result in Mou et al. (2018) and improves upon the results in Pensia et al. (2018). Our experiments demonstrate that our data-dependent bounds can distinguish randomly labelled data from normal data, which provides an explanation to the intriguing phenomena observed in Zhang et al. (2017a).

(2) We also study the setting where the total loss is the sum of a bounded loss and an additional \ell_2 regularization term. We obtain new generalization bounds for the continuous Langevin dynamic in this setting by developing a new Log-Sobolev inequality for the parameter distribution at any time. Our new bounds are more desirable when the noisy level of the process is not small, and do not become vacuous even when T tends to infinity.

Given a metric $(V,d)$ and a $\depot \in V$, the classical $\KTSP$ problem is to find a tour originating at $\depot$ of minimum length that visits at least $k$ nodes in $V$. In this work, motivated by applications where the input to an optimization problem is uncertain, we study two stochastic versions of $\KTSP$. In \StochKTSP, originally defined by Ene-Nagarajan-Saket, each vertex $v$ in the given metric $(V,d)$ contains a stochastic reward $R_v$. The goal is to adaptively find a tour of minimum expected length that collects at least reward $k$; Ene et al. give an $O(\log k)$-approximation adaptive algorithm for this problem, and left open if there is an $O(1)$-approximation algorithm. We totally resolve their open question, and even give an $O(1)$-approximation \emph{non-adaptive} algorithm for \StochKTSP. We also introduce and obtain similar results for the \StochKCost problem. In this problem each vertex $v$ has a stochastic cost $C_v$, and the goal is to visit and select at least $k$ vertices to minimize the expected \emph{sum} of tour length and cost of selected vertices. Besides being a natural stochastic generalization of \KTSP, this problem is also interesting because it generalizes the Price of Information framework by Singla from deterministic probing costs to metric probing costs. Our techniques are based on two crucial ideas: ``repetitions'' and ``critical scaling''. In general, replacing a random variable with its expectation leads to very poor results. We show that for our problems, if we truncate the random variables at an ideal threshold, then their expected values form a good surrogate. Here, we rely on running several repetitions of our algorithm with the same threshold, and then argue concentration using Freedman's and Jogdeo-Samuels inequalities. Unfortunately, this ideal threshold depends on how far we are from achieving our target $k$, which a non-adaptive algorithm does not know. To overcome this barrier, we truncate the random variables at various different scales and identify a ``critical'' scale.

Gradient-based Monte Carlo sampling algorithms, like Langevin dynamics and Hamiltonian Monte Carlo, are important methods for Bayesian inference. In large-scale settings, full-gradients are not affordable and thus stochastic gradients evaluated on mini-batches are used as a replacement. In order to reduce the high variance of noisy stochastic gradients, Dubey et al. [2016] applied the standard variance reduction technique on stochastic gradient Langevin dynamics and obtained both theoretical and experimental improvements. In this paper, we apply the variance reduction tricks on Hamiltonian Monte Carlo and achieve better theoretical convergence results compared with the variance-reduced Langevin dynamics. Moreover, we apply the symmetric splitting scheme in our variance-reduced Hamiltonian Monte Carlo algorithms to further improve the theoretical results. The experimental results are also consistent with the theoretical results. As our experiment shows, variance-reduced Hamiltonian Monte Carlo demonstrates better performance than variance-reduced Langevin dynamics in Bayesian regression and classification tasks on real-world datasets.

Gradient boosting using decision trees as base learners, so called Gradient Boosted Decision Trees (GBDT), is a very successful ensemble learning algorithm widely used across a variety of applications. Recently, various GDBT construction algorithms and implementation have been designed and heavily optimized in some very popular open sourced toolkits such as Xgboost and LightGBM. In this paper, we show that both the accuracy and efficiency of GBDT can be further enhanced by using more complex base learners. Specifically, we extend gradient boosting to use \textit{piecewise linear regression trees} (PL Trees), instead of \textit{piecewise constant regression trees}. We show PL Trees can accelerate convergence of GBDT. Moreover, our new algorithm fits better to modern computer architectures with powerful Single Instruction Multiple Data (SIMD) parallelism. We propose optimization techniques to speedup our algorithm. The experimental results show that GBDT with PL Trees can provide very competitive testing accuracy with comparable or less training time. Our algorithm also produces much concise tree ensembles, thus can often reduce testing time costs.

We study the problem of large-scale network embedding, which aims to learn latent representations for network mining applications. Previous research shows that 1) popular network embedding benchmarks, such as DeepWalk, are in essence implicitly factorizing a matrix with a closed form, and 2) the explicit factorization of such matrix generates more powerful embeddings than existing methods. However, directly constructing and factorizing this matrix which is dense is prohibitively expensive in terms of both time and space, making it not scalable for large networks. In this work, we present the algorithm of large-scale network embedding as sparse matrix factorization (NetSMF). NetSMF leverages theories from spectral sparsification to efficiently sparsify the aforementioned dense matrix, enabling significantly improved effi- ciency in embedding learning. The sparsified matrix is spectrally close to the original dense one with a theoretically bounded ap- proximation error, which helps maintain the representation power of the learned embeddings. We conduct experiments on networks of various scales and types. Results show that among both popular benchmarks (i.e., DeepWalk and LINE) and factorization based methods, NetSMF is the only method that achieves both high effi- ciency and effectiveness. We show that NetSMF requires only 24 hours to generate effective embeddings for a large-scale academic collaboration network with tens of millions of nodes, while it would cost DeepWalk months and is computationally infeasible for the dense matrix factorization solution.

We consider the problem of maximizing the expected utility E[u(X(S))] where X(S)=\sum_{i \in S}X_i and S is a feasible set for some combinatorial optimization problems (such as shortest path, minimum spanning tree, knapsack, etc.). (1) We present an additive PTAS for bounded Lipshitz utility function. This has implications in the VaR problem such as Pr[X(S)\leq 1](which corresponds to a step utility function). (2) We give PTAS for increasing concave functions (which are widely used to model risk-averse behaviors) and increasing function with bounded derivative.

Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of RNN graph and can be effectively updated during the backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear dynamics underlying; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care data, and a localization data for human activity. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracies.

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. The algorithm is a slight variant of the ProxSVRG algorithm [Reddi et al. NIPS2016]. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations), and improves/generalizes some others. In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei at al. NIPS2017] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in \cite{reddi2016proximal}. Finally, for nonconvex functions satisfied Polyak-\L{}ojasiewicz condition, we show that ProxSVRG+ achieves global linear convergence rate without restart. ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA, and sometimes outperforms them (and generalizes the results of SCSG) in this case.

We present the first efficient algorithm that constructs an eps-coreset for the a general class of clustering problems in a doubling metric.

To this end, we establish the first relation between the doubling dimension of and the shattering dimension (or VC-dimension) of the range space induced by the distance. Such a relation is not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1+ -eps)-distortion of the distance function $d$ (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded. We also introduce the notion of $\tau$-error {\em probabilistic shattering dimension}, and prove a (drastically better) upper bound for the probabilistic shattering dimension for weighted doubling metrics. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications.

Furthermore, we study robust coresets for (k,z)-clustering with outliers in a doubling metric. We show an improved connection between $\alpha$-approximation and robust coreset. This also leads to improvement upon the previous best known bound of the size of robust coreset for Euclidean space [Feldman and Langberg, STOC 11]. The new bound entails a few new results in clustering and property testing.

As another application, we show constant-sized (\eps, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were only known for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.

We develop a framework for obtaining polynomial time approximation schemes (PTAS) for a class of stochastic dynamic programs. Using our framework, we obtain the first PTAS for several stochastic combinatorial optimization problems. In particular, we obtain PTAS for the ProbeMax Probemax: We are given a set of n items, each item i has a value Xi which is an independent random variable with a known (discrete) distribution. We can probe a subset P of items sequentially. Each time after probing an item i, we observe its value realization, which follows the distribution of Xi. We can adaptively probe at most m items and each item can be probed at most once. The reward is the maximum among the m realized values. Our goal is to design an adaptive probing policy such that the expected value of the reward is maximized. To the best of our knowledge, the best known approximation ratio is 1-1/e, due to Asadpour et al. We also obtain PTAS for several other problems: Committed Pandora s Box, Stochastic Target, Stochastic Blackjack Knapsack.

It is a long standing open problem whether Yao-Yao graphs YYk are all spanners. Bauer and Damian showed that all YY6k for k \geq 6 are spanners. Li and Zhan generalized their result and proved that all even Yao-Yao graphs YY2k are spanners (for k \geq 42). However, their technique cannot be extended to odd Yao-Yao graphs, and whether they are spanners are still elusive. In this paper, we show that, surprisingly, for any integer k \geq 1, there exist odd Yao-Yao graph YY2k+1 instances, which are not spanners. This provides a negative answer to (Problem 70) in [http://cs.smith.edu/~orourke/TOPP/P70.html]

Estimating the travel time of any path (denoted by a sequence of connected road segments) in a city is of great importance to traffic monitoring, route planning, ridesharing, taxi/Uber dispatching, etc. However, it is a very challenging problem, affected by diverse complex factors, including spatial correlations, temporal dependencies, external conditions (e.g. weather, traffic lights). Prior work usually focuses on estimating the travel times of individual road segments or sub-paths and then summing up these times, which leads to an inaccurate estimation because such approaches do not consider road intersections/ traffic lights, and local errors may accumulate. To address these issues, we propose an end-to-end Deep learning framework for Travel Time Estimation (called DeepTTE) that estimates the travel time of the whole path directly. More specifically, we present a geo-convolution operation by integrating the geographic information into the classical convolution, capable of capturing spatial correlations. By stacking recurrent unit on the geo-convoluton layer, our DeepTTE can capture the temporal dependencies as well. A multi-task learning component is given on the top of DeepTTE, that learns to estimate the travel time of both the entire path and each local path simultaneously during the training phase. Extensive experiments on two trajectory datasets show our DeepTTE significantly outperforms the state-of-the-art methods.

The suffix array is a fundamental data structure for many applications that involve string searching and data compression. We obtain the \emph{first} in-place suffix array construction algorithms that are optimal both in time and space for (read-only) integer alphabets. We provide the first linear time in-place algorithm for read-only integer alphabets with |Sigma|=O(n) (i.e., the input string cannot be modified). This algorithm settles the open problem posed by [Franceschini and Muthukrishnan, ICALP'07]. For the read-only general alphabets (i.e., only comparisons are allowed), we present an optimal O(nlogn) time in-place suffix sorting algorithm, recovering the result obtained by Franceschini and Muthukrishnan, which was an open problem posed by [Manzini and Ferragina, ESA'02].

Since the invention of word2vec, the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE, in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) As an extension of LINE, PTE can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacian. Finally, we present the NetMF method as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.

Training deep neural networks is a highly nontrivial task, involving carefully selecting appropriate training algorithms, scheduling step sizes and tuning other hyperparameters. Trying different combinations can be quite labor-intensive and time consuming. Recently, researchers have tried to use deep learning algorithms to exploit the landscape of the loss function of the training problem of interest, and learn how to optimize over it in an automatic way. In this paper, we propose a new learning-to-learn model and some useful and practical tricks. Our optimizer outperforms generic, hand-crafted optimization algorithms and state-of-the-art learning-to-learn optimizers by DeepMind in many tasks. We demonstrate the effectiveness of our algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs.

We study the combinatorial pure exploration problem CPE in a stochastic multi-armed bandit game. In a CPE instance, we are given $n$ stochastic arms with unknown reward distributions, as well as a family $\subsetfam$ of feasible subsets over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify the feasible subset in $\subsetfam$ with the maximum total weight, using as few samples as possible. The problem generalizes the classical best arm identification problem and the top-$k$ arm identification problem, both of which have attracted significant attention in recent years. We provide a novel \emph{instance-wise} lower bound for the sample complexity of the problem, as well as a nontrivial sampling algorithm, matching the lower bound up to a factor of $\ln|\subsetfam|$. For an important class of combinatorial families (including spanning trees, matchings, and path constraints), we also provide polynomial time implementation of the sampling algorithm, using the equivalence of separation and optimization for convex program, and the notion of approximate Pareto curves in multi-objective optimization (note that $|\subsetfam|$ can be exponential in $n$). We also show that the $\ln|\subsetfam|$ factor is inevitable in general, through a nontrivial lower bound construction utilizing a combinatorial structure resembling the Nisan-Wigderson design. Our results significantly improve several previous results for several important combinatorial constraints, and provide a tighter understanding of the general \combibandit problem. We further introduce an even more general problem, formulated in geometric terms. We are given $n$ Gaussian arms with unknown means and unit variance. Consider the $n$-dimensional Euclidean space $\mathbb{R}^n$, and a collection $\anssetcol$ of disjoint subsets. Our goal is to determine the subset in $\anssetcol$ that contains the mean profile (which is the $n$-dimensional vector of the means), using as few samples as possible. The problem generalizes most pure exploration bandit problems studied in the literature. We provide the first nearly optimal sample complexity upper and lower bounds for the problem.

We study the best arm identification (BEST-1-ARM) problem, which is defined as follows. We are given n stochastic bandit arms. The ith arm has a reward distribution Di with an unknown mean \mu_i. Upon each play of the ith arm, we can get a reward, sampled i.i.d. from Di. We would like to identify the arm with largest mean with probability at least 1-\delta, using as few samples as possible. We obtain nearly instance optimal sample complexity for best arm, thereby making significant progress towards a complete resolution of the ( gap-entropy conjecture), from both the upper and lower bound sides. The paper is a followup of our early paper: On the Optimal Sample Complexity for Best Arm Identification. Lijie Chen, Jian Li. [ArXiv] The gap entropy conjecture - concerning the instance optimality of the best-1-arm problem. See COLT16 open problem

Crowdsourcing database systems have been proposed to lever- age crowd-powered operations to encapsulate the complex- ities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to se- lect an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which gener- ates the same order for different joined tuples and limits the optimization potential that different joined tuples can be optimized by different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower laten- cy, and higher quality) in crowdsourcing, and it calls for a system to enable multi-goal optimization. To address these limitations, we develop a crowd-powered database system CDB that supports crowd-based query op- timizations. CDB has the following fundamental differences compared with the existing systems. Firstly, CDB employs a graph-based query model that provides more fine-grained query optimization. Secondly, CDB adopts a unifieded frame- work to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT and CrowdFlower. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance su- periority of CDB on cost, latency and quality.

In recent years, the capacitated center problems have at- tracted a lot of research interest. Given a set of vertices V , we want to find a subset of vertices S, called centers, such that the maximum cluster radius is minimized. Moreover, each center in S should satisfy some capacity constraint, which could be an upper or lower bound on the number of vertices it can serve. Capacitated k-center problems with one-sided bounds (upper or lower) have been well studied in previous work, and a constant factor approximation was obtained. We are the first to study the capacitated center problem with both ca- pacity lower and upper bounds (with or without outliers). We assume each vertex has a uniform lower bound and a non-uniform upper bound. For the case of opening exactly k centers, we note that a generaliza- tion of a recent LP approach can achieve constant factor approximation algorithms for our problems. Our main contribution is a simple combi- natorial algorithm for the case where there is no cardinality constraint on the number of open centers. Our combinatorial algorithm is simpler and achieves better constant approximation factor compared to the LP approach.

The online car-hailing service has gained great popularity all over the world. As more passengers and more drivers use the service, it becomes increasingly more important for the the car-hailing service providers to effectively schedule the drivers to minimize the waiting time of passengers and maximize the driver utilization, thus to improve the overall user experience. In this paper, we study the problem of predicting the real-time car-hailing supply-demand, which is one of the most important component of an effective scheduling system. Our objective is to predict the gap between the car-hailing supply and demand in a certain area in the next few minutes. Based on the prediction, we can balance the supply-demands by scheduling the drivers in advance. We present an end-to-end framework called Deep Supply-Demand (DeepSD) using a novel deep neural network structure. Our approach can automatically discover complicated supply-demand patterns from the car-hailing service data while only requires a minimal amount hand-crafted features. Moreover, our framework is highly flexible and extendable. Based on our framework, it is very easy to utilize multiple data sources (e.g., car-hailing orders, weather and traffic data) to achieve a high accuracy. We conduct extensive experimental evaluations, which show that our framework provides more accurate prediction results than the existing methods.

In the Best-k-Arm problem, we are given n stochastic bandit arms, each associated with an unknown reward distribution. We are required to identify the k arms with the largest means by taking as few samples as possible. In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-k-Arm problem. On the lower bound side, we obtain a novel complexity term to measure the sample complexity that every Best-k-Arm instance requires. This is derived by an interesting and nontrivial reduction from the Best-1-Arm problem. We also provide an elimination based algorithm that matches the instancewise lower bound within doubly-logarithmic factors. The sample complexity of our algorithm is strictly better than the state-of-the-art for Best-k-Arm (module constant factors).

We study the k-regret minimizing query (k-RMS), which is a useful operator for supporting multi-criteria decision-making. Given two integers k and r, a k-RMS returns r tuples from the database which minimize the k-regret ratio, defined as one minus the worst ratio between the k-th maximum utility score among all tuples in the database and the maximum utility score of the r tuples returned. A solution set contains only r tuples, enjoying the benefits of both top-k queries and skyline queries. Proposed in 2012, the query has been studied extensively in recent years. In this paper, we advance the theory and the practice of k-RMS in the following aspects. First, we develop efficient algorithms for k-RMS (and its decision version) when the dimensionality is 2. The running time of our algorithms outperforms those of previous ones. Our experimental results show that our algorithms are more efficient than previous ones on both synthetic and real datasets up to three orders of magnitude. Second, we show that k-RMS is NP-hard even when the dimensionality is 3. This provides a complete characterization of the complexity of k-RMS, and answers an open question in previous studies. In addition, we present approximation algorithms for the problem when the dimensionality is 3 or larger.

Solving geometric optimization problems over uncertain data have become increasingly important in many applications and have attracted a lot of attentions in recent years. In this paper, we study two important geometric optimization problems, the k-center problem and the j-flat-center problem, over stochastic/uncertain data points in Euclidean spaces. For the stochastic k-center problem, we would like to find k points in a fixed dimensional Euclidean space, such that the expected value of the k-center objective is minimized. For the stochastic j-flat-center problem, we seek a j-flat (i.e., a j-dimensional affine subspace) such that the expected value of the maximum distance from any point to the j-flat is minimized. We consider both problems under two popular stochastic geometric models, the existential uncertainty model, where the existence of each point may be uncertain, and the locational uncertainty model, where the location of each point may be uncertain. We provide the first PTAS (Polynomial Time Approximation Scheme) for both problems under the two models. Our results generalize the previous results for stochastic minimum enclosing ball and stochastic enclosing cylinder.

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that if the underlying variables have known finite supports, SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. For general arbitrary distributions, we further use a discretization technique and show an $\tilde{O}(\sqrt{T})$ regret bound. We apply our results to the $K$-MAX problem and the expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon > 0$.

Choosing a good location when opening a new store is crucial for the future success of a business. Traditional methods in- clude online manual survey, analytic models based on census data, which are either unable to adapt to the dynamic mar- ket or very time consuming. The rapid increase of the avail- ability of big data from various types of mobile devices, such as online query data and online positioning data, provides us with the possibility to develop automatic and accurate data-driven prediction models for business store placemen- t. In this paper, we propose a Demand Distribution Driven Store Placement (D3SP) framework for business store place- ment by mining search query data from Baidu Maps. D3SP detects the spatial-temporal distributions of customer demands on different business services via query data from Baidu Maps, the largest online map search engine in China, and detects the gaps between demand and supply. Then we determine candidate locations via clustering such gaps. In the final stage, we solve the location optimization problem by predicting and ranking the number of customers. We not only deploy supervised regression models to predict the num- ber of customers, but also use learn-to-rank model to directly rank the locations. We evaluate our framework on various types of businesses in real-world cases, and the experiments results demonstrate the effectiveness of our methods. D3SP as the core function for store placement has already been implemented as a core component of our business analyt- ics platform and could be potentially used by chain store merchants on Baidu Nuomi.

With the wide use of mobile devices, predicting the destination of moving vehicles has become an increasingly important problem for location based recommendation systems and destination-based advertising. Most existing approaches are based on various Markov chain models, in which the historical trajectories are used to train the model and the top-k most probable destinations are returned. We identify certain limitations of the previous approaches. Instead, we propose a new data-driven framework, called DESTPRE, which is not based on a probabilistic model, but directly operates on the trajectories and makes the prediction. We make use of only historic trajectories, without individual identity information. Our design of DESTPRE, although simple, is a result of several useful observations from the real trajectory data. DESTPRE involves an index based on Bucket PR Quadtree and Minwise hashing, for efficiently retrieving similar trajectories, and a clustering on destinations for predictions. By incorporating some additional ideas, we show that the prediction accuracy can be further improved. We have conducted extensive experiments on real Beijing Taxi dataset. The experimental results demonstrate the effectiveness of DESTPRE.

We consider the standard stochastic geometry model in which the existence or the location of each point may be stochastic. We show there exists constant-size kernel coreset (a kernel can approximate either the expected value or the distribution of the width of the point set in every direction) and we can construct such kernel coresets in nearly linear time. We show its applications to several function extent problems, tracking moving points, and shape fitting problems in the stochastic setting.

It is an open problem whether Yao-Yao graphs YYk (also known as sparse-Yao graphs) are all spanners when the integer parameter k is large enough. In this paper we show that, for any integer k\geq 42, the Yao-Yao graph YY2k is a tk-spanner, with stretch factor tk=4.27+O(k−1) when k tends to infinity. Our result generalizes the best known result which asserts that all YY6k are spanners for k large enough [Bauer and Damian, SODA'13]. Our proof is also somewhat simpler.

We study the pure exploration problem subject to a matroid constraint (BEST-BASIS) in a stochastic multi-armed bandit game. In a BEST-BASIS instance, we are given n stochastic arms with unknown reward distributions, as well as a matroid M over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify a basis of M with the maximum total weight, using as few samples as possible. The problem is a significant generalization of the best arm identification problem and the top-k arm identification problem, which have attracted significant attentions in recent years. We study both the exact and PAC versions of BEST-BASIS, and provide algorithms with nearly-optimal sample complexities for these versions. Our results generalize and/or improve on several previous results for the top-k arm identification problem and the combinatorial pure exploration problem (when the combinatorial constraint is a matroid).

Distributed clustering has attracted significant attention in recent years. In this paper, we study the k-means problem in the distributed dimension setting, where the dimensions of the data are partitioned across multiple machines. We provide new approximation algorithms, which incur low communication costs and achieve constant approximation factors. The communication complexity of our algorithms significantly improve on existing algorithms. We also provide the first communication lower bound, which nearly matches our upper bound in a wide range of parameter setting. Our experimental results show that our algorithms outperform existing algorithms on real data sets in the distributed dimension setting.

Crowdsourced entity resolution has recently attracted a significant attention because it can harness the wisdom of crowds to improve the quality of entity resolution. However existing techniques either cannot achieve perfect quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution framework, which can significantly reduce the monetary cost while keeping high quality. We first define a partial order on the pairs of records. Then we select a pair as a question and ask the crowd to check whether the records in the pair refer to the same entity. After getting the answer of this pair, we infer the answers of other pairs based on the partial order. Next we iteratively select pairs without answers to ask until all pairs have answers. We devise effective algorithms to judiciously select the pairs to ask in order to reduce the number of asked pairs. To further reduce the cost, we propose a grouping technique to group the pairs such that we only ask one pair instead of all pairs in each group. We develop error-tolerant techniques to tolerate the errors introduced by the partial order and the crowd. Experimental results show that our method reduces the cost to 1.25\% of existing approaches (or existing approaches take more than 80 times money of our method) while not sacrificing the quality.

This paper discusses how to efficiently choose from n unknown distributions the k ones whose means are the greatest by a certain metric, up to a small relative error. We study the topic under two standard settings multi-armed bandits and hidden bipartite graphs. For both settings, we prove lower bounds on the total number of samples needed, and propose optimal algorithms whose sample complexities match those lower bounds.

We study the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the greedy regret and eps-quasi greedy regret as learning metrics comparing with the performance of offline greedy algorithm. We propose two online greedy learning algorithms with semi-bandit feedbacks, which achieve O(log T) problem-dependent regret bound for a general class of combinatorial structures and reward functions that allow greedy solutions.

We consider the problem of finding k centers for n weighted points on a real line. This (weighted) k-center problem was solved in O(n log n) time previously by using Cole's parametric search and other complicated approaches. In this paper, we present an simpler O(n log n) time algorithm that avoids the parametric search, and in certain special cases our algorithm solves the problem in O(n) time.

We are given a set of weighted unit disks and a set of points in Euclidean plane. The minimum weight unit disk cover (\UDC) problem asks for a subset of disks of minimum total weight that covers all given points. For the weighted \UDC\ problem, several constant approximations have been developed. However, whether the problem admits a PTAS has been an open question. We present the first PTAS for \UDC. Our result implies the first PTAS for the minimum weight dominating set problem in unit disk graphs, the first PTAS for the maximum lifetime coverage problem and an improved constant approximation ratio for the connected dominating set problem in unit disk graphs.

We consider the stochastic geometry model where the location of each node is a random point in a given metric space, or the existence of each node is uncertain. We study the problems of computing the expected lengths of several combinatorial or geometric optimization problems over stochastic points, including closest pair, minimum spanning tree, k-clustering, minimum perfect matching, and minimum cycle cover. We also consider the problem of estimating the probability that the length of the closest pair, or the diameter, is at most, or at least, a given threshold. Most of the above problems are known to be #P-hard. We obtain FPRAS for most of them in both the existential and the locational uncertainty models.

We study the problem of learning from unlabeled samples very general statistical mixture models on large finite sets. Specifically, the model to be learned, \mix, is a probability distribution over probability distributions p, where each such p is a probability distribution over [n] ={1,2,...,n}. When we sample from \mix, we do not observe p directly, but only indirectly and in very noisy fashion, by sampling from [n] repeatedly, independently K times from the distribution p. The problem is to infer \mix to high accuracy in transportation (earthmover) distance. We give the first efficient algorithms for learning this mixture model without making any restricting assumptions on the structure of the distribution. We bound the quality of the solution as a function of the size of the samples K and the number of samples used. Our model and results have applications to a variety of unsupervised learning scenarios, including learning topic models and collaborative filtering.

We are given a set of sensors and a set of target points in the Euclidean plane. In MIN-ConnectSensorCover, our goal is to find a set of sensors of minimum cardinality, such that all target points are covered, and all sensors can communicate with each other (i.e., the communication graph is connected). In Budgeted-CSC problem, our goal is to choose a set of B sensors, such that the number of targets covered by the chosen sensors is maximized and the communication graph is connected. We obtain constant approximations for both problems.

We study approximation algorithms for the following geometric version of the maximum coverage problem: Let P be a set of n weighted point in the plane. We would like to place m rectangles such that total weights of covered points in P is maximized. For any m, we present linear time approximation schemes that can find a 1+eps approximation to the optimal solution.

We are given a set of stochastic points on the real line, we consider the problem of building data structures on P to answer range queries: find the top-k points that lie in the given interval with the highest probability.

We study the problem of selecting K arms with the highest expected rewards in a stochastic N-armed bandit game. We propose a new PAC algorithm, which, with probability at least 1-delta , identifies a set of K arms with average reward at most \epsilon away from the top-k arm. Naive uniform sampling requires O(nlnn) samples. We show it is possible to achieve linear sample complexity. We also establish a matching lower bound (meaning our upper bound is worse-case optimal).

We consider the multi-shop ski rental problem. This problem generalizes the classic ski rental problem to a multi-shop setting, in which each shop has different prices for renting and purchasing a pair of skis, and a consumer has to make decisions on when and where to buy. We given optimal close-form online competitive strategy from the consumer's perspective and a linear time algorithm for computing the optimal strategy. The problem finds applications in cost management in IaaS cloud and scheduling in distributed computing.

We show there is an FPTAS for approximating Pr[∑X_i ?] for a set of independent (not necessarily identically distributed) random variables X_i.

We obtain an efficient poly-time algorithm for the pairwise kidney exchange problem proposed by Roth et al. [J. Econ.Th. 05]. Their original algorithm runs in exponential time. We also provide an alternative short proof of the fact that there exists a majorizing vector in a polymatroid (original proof due to Tamir [MOR95]).

(1) We present the first constant approximation for the fault-tolerant k-median problem even when the demands are nonuniform (each demand point should be assigned to a given number of facilities). This generalizes a previous constant approximation for the uniform case and improves on a previous logarithmic approximation for the general nonuniform case. (2) We present the first constant approximation for the fault-tolerant facility location problem even when the weight function are non-monotone. This generalizes a previous constant approximation for the monotone case.

we investigate the problem of identifying influential users in mobile social networks. Influential users are individuals with high centrality in their social-contact graphs. Traditional approaches find these users through centralized algorithms. However, the computational complexity of these algorithms is known to be very high, making them unsuitable for large-scale networks. We propose a lightweight and distributed protocol, iWander, to identify influential users through fixed length random-walk sampling. We prove that random-walk sampling with O(log n) steps, where n is the number of nodes in a graph, comes quite close to sampling vertices approximately according to their degrees. The most attractive feature of iWander is its extremely low control-message overhead, which lends itself well to mobile applications.

We propose a new technique, called Poisson approximation, for approximating/optimizing the sum of a set of random variables. In many cases, it reduces the stochastic problem to a deterministic multidimensional packing/covering problem. We can apply it to several problems, such as (1) Expected utility maximization (we can reproduce and generalize one result in our FOCS11 paper.) (2) Stochastic knapsack (proposed in [Dean et al. MOR08]): we are given a set items. The size and/or the profit of an item may not be fixed values and only their probability distributions are known to us in advance. The actual size and profit of an item are revealed to us as soon as it is inserted into the knapsack. We want to find a policy to insert the items such that the total expected profit is maximized. We obtain a bicriterion PTAS (i.e., 1+eps approximation with 1+eps capacity), even we allow correlation and cancellation of jobs. (3) Bayesian Online Selection with knapsack constraint: it is a variant of stochastic knapsack where the size and the profit of an item are revealed before the decision whether to select the item is made.

We present the first constant approximation for the fault-tolerant k-median problem (each demand point should be assigned to a given number of facilities).

Given a set of small intervals, and a target interval I, we would like to move the small intervals to cover the target interval. Our goal is to minimize the maximum movement of any interval. We obtain an O(n^2 log n) algorithm for this problem.

We study a problem referred to as the load-distance balancing (LDB) problem, where the objective is assigning a set of clients to a set of given servers. Each client suffers a delay, that is, the sum of the network delay (which is proportional to the distance to its server) and the congestion delay at this server, a nondecreasing function of the number of clients assigned to the server. We address two flavors of LDB—the first one seeking to minimize the maximum incurred delay, and the second one targeted for minimizing the average delay. We also provide bounds for the price of anarchy for the game theoretic version of the problem.

We study the problem of generating synthetic databases having declaratively specified characteristics. This problem is motivated by database system and application testing, data masking, and benchmarking. We argue that a natural, expressive, and declarative mechanism for specifying data characteristics is through cardinality constraints; a cardinality constraint specifies that the output of a query over the generated database have a certain cardinality. While the data generation problem is in tractable in general, we present efficient algorithms that can handle a large and useful class of constraints.

We extend our VLDB09 paper to continuous distributions. We give exact algorithm for polynomial PDFs and approximation algorithms for arbitrary continuous distributions using techniques in approximation theory, such as splines, quadratures, with convergence guarantees better than naive Monte Carlo.

We obtain tight approximation results for the machine activation problem proposed in our SODA10 work. We also obtain the first lnn-approximation for the universal facility location problem in non-metric spaces.

We provide constant factor approximation algorithm for the densest k-subgraph problem on several graph classes, such as chordal graphs, disk graphs etc.

We study the stochastic matching problem, which finds applications in kidney exchange, online dating and online ads. Consider a random graph model where each possible edge e is present independently with some probability pe. We are given these numbers pe, and want to build a large/heavy matching in the randomly generated graph. However, the only way we can find out whether an edge is present or not is to query it, and if the edge is indeed present in the graph, we are forced to add it to our matching. Further, each vertex i is allowed to be queried at most ti times. How should we adaptively query the edges to maximize the expected weight of the matching? Our main result is the first constant approximation for the weighted stochastic matching.

We consider the clustering with diversity problem: given a set of colored points in a metric space, partition them into clusters such that each cluster has at least ell points, all of which have distinct colors. We give a 2-approximation to this problem for any ell when the objective is to minimize the maximum radius of any cluster. We show that the approximation ratio is optimal unless P\ne NP by providing a matching lower bound. Several extensions to our algorithm have also been developed for handling outliers. This problem can be considered as a metric variant of the l-diversity problem, a popular problem for privacy-preserving data publication

The central framework we introduce considers a collection of m machines (unrelated or related) with each machine i having an activation cost of ai. There is also a collection of n jobs that need to be performed, and pi,j is the processing time of job j on machine i. Standard scheduling models assume that the set of machines is fixed and all machines are available. However, in our setting, we assume that there is an activation cost budget of A ?we would like to select a subset S of the machines to activate with total cost a(S) A and find a schedule for the n jobs on the machines in S minimizing the makespan (or any other metric).

We propose two parameterized ranking functions, called PRF and PRFe, for ranking probabilistic datasets. PRF and PRFe generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions

We address the problem of finding a “best ?deterministic query answer to a query over a probabilistic database. We propose the notion of a consensus world (or a consensus answer) which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). We consider this problem for various types of queries including SPJ queries, Top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness).

We are also given a set of queries, Q1.....Qm, with the query Qi requiring access to a subset of relations distributed in different nodes of the network. For each query, a query plan is provided, in the form of a rooted tree, which specifies the operations to be performed on the relations and the order in which to perform them. Given this input, our goal is to find a data movement plan that minimizes the total communication cost incurred while executing the queries. We provide efficient exact algorithm when the communication network is a tree and approximation algorithm for more general networks.

We have an edge weighted undirected network G(V, E) and n selfish players where player i wants to choose a low cost path from source vertex si to destination vertex ti . The cost of each edge is equally split among players who pass it. The price of stability is defined as the ratio of the cost of the best Nash equilibrium to that of the optimal solution. We present an O(logn/ log logn) upper bound on price of stability for the single sink case, i.e., ti =t for all i.

There is a large literature devoted to the problem of finding an optimal (min-cost) prefix-free code with an unequal letter-cost encoding alphabet of size. While there is no known polynomial time algorithm for solving it optimally, there are many good heuristics that all provide additive errors to optimal. The additive error in these algorithms usually depends linearly upon the largest encoding letter size.
This paper was motivated by the problem of finding optimal codes when the encoding alphabet is infinite. Because the largest letter cost is infinite, the previous analyses could give infinite error bounds. We provide a new algorithm that works with infinite encoding alphabets. When restricted to the finite alphabet case, our algorithm often provides better error bounds than the best previous ones known.