ATCS: Selected Topics in Learning, Prediction, and Optimization (with applications in Finance)

2020 Spring

Lecturer: Jian Li ( lapordge at gmail dot com)

TA: Tianping Zhang ( ztp18 at mails.tsinghua.edu.cn )

Time: every Monday, 9:50am-12:15pm

Room: TBD  (we use ZOOM for now)


We intend to cover a subset of the following topics (tentative):

(1) I assume you already know the basics (convex optimization and machine learning: stochastic gradient descent, gradient boosting, deep learning basics, CNNs, RNNs; please see my undergrad course). If you don't know any machine learning, I do not suggest you take this course. I will briefly review some concepts when necessary.

(2) online learning, multi-armed bandit, statistical learning theory, theory of deep learning

I won't strictly follow the above order. I may skip some topics mentioned above and cover some that are not. It is a graduate course.

I will talk about several applications of ML and optimization in finance (trading, pricing derivatives, etc.), and of course in typical CS areas such as vision, NLP, and social networks as well.

I will teach about 2/3 of the classes. For the rest, I will choose some topics and students need to do class presentation.

Tentative topics for class presentation: GAN, adversarial training, federated learning, AutoML, various financial applications.

Some knowledge of convex optimization may be useful. See this course (by S. Boyd) and a previous course of my own, but it is fine if you didn't take those courses. Basic machine learning knowledge is a must (see, e.g., Andrew Ng's undergrad lecture notes).

The course is a blend of theory and practice. We will cover both the underlying mathematics and interesting heuristics.

A similar previous course.


Grading:

  1. Homework (40 pts; one assignment every two or three weeks, including 2 coding assignments this semester)
  2. Taking notes (5 pts): each student should take notes for at least one lecture (maybe two), using LaTeX (use this template sample.tex algorithm2e.sty).
  3. Course project (55 pts: 10 pts for the mid-term report, 10 pts for the final presentation, 35 pts for the final report)
  4. No exam.

     


Schedule:

 

Feb 17 Basics of Online learning

The expert problem.

Multiplicative weights method

Online gradient descent

  1. Multiplicative weights method: a meta-algorithm and its applications (a survey). Sanjeev Arora, Elad Hazan, and Satyen Kale. [pdf] The method can be used to approximately solve zero-sum games and linear programs, and is also closely related to AdaBoost.
  2. Our exposition of the expert problem follows this note.
  3. Online convex programming and generalized infinitesimal gradient ascent. M. Zinkevich
  4. Selected Reading: Tracking the Best Expert (this can deal with dynamic regret)
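To make the expert setting concrete, here is a minimal sketch of the multiplicative weights (Hedge) update in Python (the course's default language). The toy loss matrix and the learning rate eta = 0.5 are illustrative choices, not values from the lecture:

```python
import numpy as np

def multiplicative_weights(loss_matrix, eta=0.5):
    """Hedge / multiplicative weights over T rounds of expert losses in [0, 1].

    loss_matrix: array of shape (T, n), loss of each of n experts per round.
    Returns the algorithm's total expected loss and the final distribution.
    """
    T, n = loss_matrix.shape
    w = np.ones(n)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                          # play the normalized weights
        total_loss += float(p @ loss_matrix[t])  # expected loss this round
        w = w * np.exp(-eta * loss_matrix[t])    # exponential down-weighting
    return total_loss, w / w.sum()

# Toy run: expert 0 is always right, expert 1 is always wrong.
losses = np.array([[0.0, 1.0]] * 50)
alg_loss, p = multiplicative_weights(losses)
best_loss = losses.sum(axis=0).min()   # loss of the best expert in hindsight
```

For losses in [0, 1], tuning eta on the order of sqrt(log(n)/T) gives the standard O(sqrt(T log n)) regret against the best expert in hindsight.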
Feb 24 Basics of Stock and Futures Markets.

Cover's universal portfolio.

2nd order bound

Online Mirror Descent: Fenchel conjugate function

(closely related to other notions of duality, such as the polar dual and the dual norm)

  1. Universal Portfolios

  2. Universal Portfolios With and Without Transaction Costs

  3. The Squint paper by Koolen and Van Erven

  4. Selected reading: Online Newton Step (see Chapter 4 of this book); it can solve the universal portfolio problem.
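As a rough illustration of Cover's idea, the following sketch approximates the universal portfolio for two assets by averaging the wealths of constant-rebalanced portfolios over a grid on the simplex. The grid size and the toy market are made up for illustration:

```python
import numpy as np

def universal_portfolio_two_assets(price_relatives, grid=101):
    """Cover's universal portfolio for two assets, by grid integration.

    price_relatives: array (T, 2) with x_t[i] = price_t[i] / price_{t-1}[i].
    Each grid point b is a constant-rebalanced portfolio (CRP) holding fraction
    b in asset 0; the universal wealth is the uniform average of CRP wealths.
    """
    bs = np.linspace(0.0, 1.0, grid)             # candidate CRPs
    wealth = np.ones(grid)                       # running wealth of each CRP
    for x in price_relatives:
        wealth *= bs * x[0] + (1.0 - bs) * x[1]  # one period of rebalanced return
    return wealth.mean()

# Toy market: asset 0 alternates x2 / x0.5 while asset 1 is cash (always 1).
x = np.array([[2.0, 1.0], [0.5, 1.0]] * 10)
w_universal = universal_portfolio_two_assets(x)
```

In this market buy-and-hold of either asset ends with wealth 1, while the best CRP (b = 0.5) compounds a gain every two periods; the universal portfolio provably tracks the best CRP's wealth up to a factor polynomial in T.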
Mar 2 Online Mirror Descent

Interval regret, Sleeping Expert, Adaptive Regret

  1. Online mirror descent (see this note or this (Chapter 5))

  2. Selected reading: OMD is nearly optimal in a certain sense
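A quick sketch of how the entropy mirror map turns online mirror descent into a multiplicative update (exponentiated gradient). The gradient sequence and step size below are hypothetical:

```python
import numpy as np

def exponentiated_gradient(grads, eta=0.1):
    """OMD on the probability simplex with the entropy mirror map.

    With Phi(x) = sum_i x_i log x_i, the mirror descent update becomes the
    multiplicative rule x_{t+1,i} proportional to x_{t,i} * exp(-eta * g_{t,i}).
    grads: array (T, n) of observed gradients; returns all iterates.
    """
    T, n = grads.shape
    x = np.full(n, 1.0 / n)         # start at the uniform distribution
    iterates = [x.copy()]
    for g in grads:
        x = x * np.exp(-eta * g)
        x = x / x.sum()             # Bregman projection back onto the simplex
        iterates.append(x.copy())
    return np.array(iterates)

# A constant gradient penalizing coordinates 1 and 2 drives the iterate to e_0.
xs = exponentiated_gradient(np.tile([[0.0, 1.0, 1.0]], (100, 1)))
```

With the squared Euclidean norm as the mirror map the same template recovers online gradient descent, which is why OMD is the unifying view of both algorithms.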

Mar 9 Bandit:
stochastic: UCB
adversarial: EXP3

Pure exploration: Median Elimination

  1. Finite-time Analysis of the Multiarmed Bandit Problem

  2. The non-stochastic multi-armed bandit problem

  3. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems

Selected Reading: OMD (Chapter 5 of this survey)
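A minimal sketch of UCB1 on Bernoulli arms, assuming the standard sqrt(2 log t / n_i) confidence radius from the Auer et al. paper above; the arm means and the horizon are toy values:

```python
import numpy as np

def ucb1(means, T, seed=0):
    """UCB1 on Bernoulli arms with true means `means`; returns pull counts."""
    rng = np.random.default_rng(seed)
    n = len(means)
    counts = np.zeros(n, dtype=int)
    sums = np.zeros(n)
    for t in range(1, T + 1):
        if t <= n:
            a = t - 1                                         # pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))                           # optimistic choice
        reward = float(rng.random() < means[a])
        counts[a] += 1
        sums[a] += reward
    return counts

counts = ucb1([0.2, 0.8], T=2000)
# The better arm (mean 0.8) receives the vast majority of the pulls; the
# suboptimal arm is pulled only O(log T / gap^2) times.
```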
Mar 16 Thompson Sampling

Contextual bandit: EXP4

epsilon-greedy

  1. A Tutorial on Thompson Sampling

  2. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits

  3. Taming the monster: A fast and simple algorithm for contextual bandits

Selected Reading:

  1. Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In NIPS, 2011.
  2. Russo, D. and B. Van Roy. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4): 1221–1243. (Regret analysis of Thompson sampling, via a connection to UCB)
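A minimal sketch of Thompson sampling for Bernoulli bandits with Beta(1,1) priors, in the spirit of the tutorial above; the arm means are illustrative:

```python
import numpy as np

def thompson_bernoulli(means, T, seed=0):
    """Thompson sampling on Bernoulli arms with independent Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    n = len(means)
    alpha = np.ones(n)                    # 1 + observed successes per arm
    beta = np.ones(n)                     # 1 + observed failures per arm
    pulls = np.zeros(n, dtype=int)
    for _ in range(T):
        theta = rng.beta(alpha, beta)     # one posterior sample per arm
        a = int(np.argmax(theta))         # play the arm whose sample is largest
        r = float(rng.random() < means[a])
        alpha[a] += r
        beta[a] += 1.0 - r
        pulls[a] += 1
    return pulls

pulls = thompson_bernoulli([0.3, 0.7], T=1000)
# Pulls concentrate on the arm with mean 0.7 as its posterior sharpens.
```

Compare with epsilon-greedy, which forces a fixed exploration rate; Thompson sampling explores automatically through posterior uncertainty.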
Mar 23 Hyperparameter optimization

Hyperband

Bayesian Optimization

Online-to-batch

Introduction to Stock and Futures Markets

Efficient Market Hypothesis

  1. Hyperband: A novel bandit-based approach to hyperparameter optimization

  2. Practical Bayesian Optimization of Machine Learning Algorithms
  3. Online to Batch

Selected Reading:

Gaussian process optimization in the bandit setting: No regret and experimental design
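The core subroutine of Hyperband, successive halving, can be sketched in a few lines. The objective below is a hypothetical stand-in for a real validation loss:

```python
def successive_halving(configs, evaluate, budget=1):
    """Successive halving, the inner loop of Hyperband (a sketch).

    configs: list of hyperparameter settings.
    evaluate(config, budget): validation loss of `config` under `budget`.
    Each round drops the worse half and doubles the survivors' budget.
    """
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[: max(1, len(configs) // 2)]  # keep the better half
        budget *= 2                                    # survivors get more resource
    return configs[0]

# Hypothetical objective: the loss of learning rate `lr` approaches |lr - 0.1|
# as the budget grows (mimicking a training curve).
best = successive_halving(
    configs=[0.001, 0.01, 0.1, 1.0],
    evaluate=lambda lr, b: abs(lr - 0.1) + 1.0 / b,
)
# best == 0.1
```

Hyperband runs this routine with several different trade-offs between the number of initial configurations and the starting budget, hedging against losses that converge at different speeds.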

Mar 30 Digression: Martingale convergence theorem

Brief Intro to Futures and Options

No-arbitrage argument

Basics of Brownian Motion

Ito Calculus

  See corresponding chapters in Options, Futures and Other Derivatives 
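As a small companion to the Brownian motion and Ito calculus material, here is a sketch that simulates geometric Brownian motion using the closed form S_t = S_0 exp((mu - sigma^2/2) t + sigma W_t) obtained from Ito's lemma. All parameter values are illustrative:

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, T, n_steps, n_paths, seed=0):
    """Simulate paths of dS = mu*S dt + sigma*S dW via the exact solution
    S_t = S_0 * exp((mu - sigma^2/2) t + sigma W_t) from Ito's lemma."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    W = np.cumsum(dW, axis=1)        # Brownian motion sampled on the grid
    t = np.linspace(dt, T, n_steps)  # time grid (excluding t = 0)
    return s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

paths = simulate_gbm(s0=100.0, mu=0.05, sigma=0.2, T=1.0,
                     n_steps=252, n_paths=20000)
# Sanity check: E[S_T] = S_0 * exp(mu * T), about 105.13 here.
```

The -sigma^2/2 drift correction is exactly the Ito term: naively exponentiating mu*t + sigma*W_t would overstate the expected growth.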
April 6 Break. No class.
Apr 13. Pricing Derivatives

Binomial Tree model

Black-Scholes Merton (BSM) model

Pricing derivative via no regret online learning

  1. See corresponding chapters in Options, Futures and Other Derivatives 
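A minimal sketch of the Cox-Ross-Rubinstein binomial tree for a European call; the parameter values are illustrative, and as the number of steps grows the price converges to the Black-Scholes value:

```python
import math

def crr_call(s0, k, r, sigma, T, n):
    """European call price in the Cox-Ross-Rubinstein binomial tree.

    u = exp(sigma*sqrt(dt)), d = 1/u, risk-neutral prob p = (e^{r dt} - d)/(u - d);
    price by backward induction from the terminal payoffs.
    """
    dt = T / n
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)
    disc = math.exp(-r * dt)
    # Terminal payoffs; node j = number of up-moves.
    values = [max(s0 * u**j * d**(n - j) - k, 0.0) for j in range(n + 1)]
    for _ in range(n):   # roll back one layer at a time
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

price = crr_call(s0=100.0, k=100.0, r=0.05, sigma=0.2, T=1.0, n=500)
# Converges to the Black-Scholes value (about 10.45 here) as n grows.
```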
Apr 20 Relation between SDE and PDE

Feynman-Kac formula

Risk neutral valuation representation of BS formula

Fokker-Planck equation

Robust pricing options using online learning

Volatility smile

  1. Online Trading Algorithms and Robust Option Pricing
  2. Learning, Regret Minimization and Option Pricing
  3. Robust option pricing: Hannan and Blackwell meet Black and Scholes

Selected Reading:

  1. How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal
  2. Minimax Option Pricing Meets Black-Scholes in the Limit
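To connect the risk-neutral valuation view with the BSM formula, here is a sketch that computes the Black-Scholes call price in closed form and checks it against a risk-neutral Monte Carlo estimate of e^{-rT} E[(S_T - K)^+]; the sample size and parameters are illustrative:

```python
import math
import random

def black_scholes_call(s0, k, r, sigma, T):
    """Black-Scholes price of a European call: the risk-neutral expectation
    e^{-rT} E[(S_T - K)^+] with S_T lognormal under the risk-neutral measure."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return s0 * N(d1) - k * math.exp(-r * T) * N(d2)

s0, k, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
bs_price = black_scholes_call(s0, k, r, sigma, T)

# Risk-neutral Monte Carlo estimate of the same discounted expectation.
random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    z = random.gauss(0.0, 1.0)
    sT = s0 * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    total += max(sT - k, 0.0)
mc_price = math.exp(-r * T) * total / n
# bs_price is about 10.45; mc_price agrees up to Monte Carlo error.
```

Note that the drift under the risk-neutral measure is r, not the real-world mu; this is the Feynman-Kac / risk-neutral representation in computational form.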

Apr 27 Quick review of Classical Statistical Learning Theory

Rademacher complexity

Maxima of (sub)Gaussian process

Chaining

Packing/Covering number

VC-dimension

There are many good references for these classical results.

I mainly followed the very well written [lecture notes] Probability in High Dimension by van Handel (Chapters 5 and 7).

[Book] Neural Network Learning: Theoretical Foundations is also a very good source.

May 4 Labor Day break; class moved to May 9.
May 9

Pseudo-dimension, Fat-Shattering dimension

Rademacher complexity

Relation between these dimensions and covering/packing numbers.

  1. [Book] Neural Network Learning: Theoretical Foundations (Chapter 11)
  2. Rademacher and Gaussian complexities: Risk bounds and structural results

  3. Scale-sensitive dimensions, uniform convergence, and learnability

Selected reading:

  1. Local Rademacher Complexities

  2. Fast rates in statistical and online learning

May 11 Classical VC Bounds of Neural Networks

(overview) generalization bounds in deep learning

Rethinking generalization

Spectral norm generalization bound

Generalization via Compression

  1. [Book] Neural Network Learning: Theoretical Foundations (Chapter 7,8)
  2. Understanding deep learning requires rethinking generalization
  3. Spectrally-normalized margin bounds for neural networks
  4. Stronger generalization bounds for deep nets via a compression approach

Selected Reading:

  1. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks

  2. Non-vacuous generalization bounds at the imagenet scale: a PAC-bayesian compression approach

  3. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data

  4. Uniform convergence may be unable to explain generalization in deep learning

May 18 RKHS

MMD, Universality, integral operator

generalization bound for kernel method

(overview) why existing (kernel) generalization bounds are not enough

May 25 student presentation group 1
Jun 1 student presentation group 2
Jun 8 Project presentation


References:

[Book] Introduction to online convex optimization

[Book] Prediction, Learning, and Games

[Book] Options, Futures and Other Derivatives  

[Book] Advances in Financial Machine Learning

[Book] Foundations of Machine Learning

[Book] Understanding Machine Learning: From Theory to Algorithms

Lecture notes for STAT928: Statistical Learning and Sequential Prediction 


Python is the default programming language we will use in the course.

If you haven't used it before, don't worry. It is very easy to learn (if you know any other programming language) and very efficient, especially for prototyping things related to scientific computing and numerical optimization. Python code is usually much shorter than C/C++ code (the language does a lot for you). It is also more flexible and generally faster than MATLAB.

A standard combination for this class is Python + numpy (a numerics library for Python) + scipy (a scientific computing library for Python) + matplotlib (for generating nice plots).

Another, somewhat easier, way is to install Anaconda (a free Python distribution that includes most popular packages).