ATCS - Learning and Optimization: Theory and Practice
2018 Spring
Lecturer: Jian Li
TA: Chuheng Zhang zhangchuheng123 at live dot com
Room: 6B108
Time: every Monday, 9:50am-12:15pm
Prerequisites: undergraduate machine learning (e.g., the Machine Learning course on Coursera by Andrew Ng) and some basic knowledge of deep learning
There is some overlap with my previous courses ATCS16 and ATCS17.
We intend to cover a subset of the following topics (tentative):
(1) Classical machine learning and prediction algorithms: AdaBoost, gradient boosting, random forests, SVD, principal component analysis (PCA), nonnegative matrix factorization (NMF), topic modeling, matrix completion, dictionary learning, spectral clustering, Gaussian processes
(2) Deep learning: review the basics very quickly (CNN, RNN, autoencoder, GAN; see my undergrad course), word2vec, attention, other GANs (e.g., WGAN), seq2seq, meta-learning, optimization and generalization issues, deep reinforcement learning
(3) Continuous optimization: gradient descent, stochastic gradient descent, mirror descent, variance reduction, proximal methods, distributed optimization (ADMM), (maybe) non-convex optimization, Frank-Wolfe, online learning, universal portfolio, the multiplicative weights method
I won't strictly follow the above order. I may skip some topics mentioned above and cover some that are not. It is a graduate course.
I will talk about several applications of ML and optimization in finance, and of course in typical CS areas such as vision, NLP, and social networks.
Some knowledge of convex optimization may be useful. See John's class http://itcs.tsinghua.edu.cn/~john/convex2013.html and a previous course of mine, but it is fine if you didn't take those courses. Basic machine learning knowledge will be very useful; if you don't know any machine learning, I suggest reading Andrew Ng's undergraduate lecture notes.
The course is a blend of theory and practice. We will cover both the underlying mathematics and interesting heuristics.
Grading:
Schedule:
Feb 26 | (Quickly) review supervised learning: additive models, AdaBoost, gradient boosting, XGBoost; random forests (not covered in class, but you need to know them). A minimal AdaBoost sketch appears after the schedule. |
Reading: Additive models, AdaBoost, gradient boosting: The Elements of Statistical Learning, Ch. 9 and 10. Random forests: The Elements of Statistical Learning, Ch. 15. XGBoost: A Scalable Tree Boosting System. Further reading: some learning theory about boosting: Foundations of Machine Learning, Ch. 6; LightGBM (another popular GBDT implementation); Awesome Random Forest: http://jiwonkim.org/awesome-random-forest/ |
Mar 5 | Online learning, the expert problem, the multiplicative weights method (with an application to zero-sum games), gradient descent for online learning, universal portfolio. A minimal multiplicative-weights sketch appears after the schedule. |
Reading: Multiplicative weights method: a meta-algorithm and its applications (a survey), Sanjeev Arora, Elad Hazan, and Satyen Kale [pdf]. The method can be used to approximately solve zero-sum games and linear programs, and is also closely related to AdaBoost. Online convex programming and generalized infinitesimal gradient ascent, M. Zinkevich. |
Mar 12 | (Cont.) Universal portfolio, online-to-batch conversion, norms, Fenchel conjugate, some basics (strong convexity, Bregman divergence) |
Reading (the proofs follow these): Universal Portfolios With and Without Transaction Costs. Online to batch: http://ttic.uchicago.edu/~tewari/lectures/lecture13.pdf |
Mar 19 | Online mirror descent, Follow the Regularized Leader (FTRL), UCB for the multi-armed bandit |
Reading: Mirror descent (equivalence of FTRL and OMD): www-stat.wharton.upenn.edu/~rakhlin/courses/stat991/papers/lecture_notes.pdf |
Mar 26 | UCB for the multi-armed bandit (cont.), the epsilon-greedy algorithm, deep learning basics (slides1). A minimal UCB1 sketch appears after the schedule. |
Reading: Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem, 2002. The backpropagation (BP) presentation follows the description at http://www.offconvex.org/ |
Apr 2 | Deep learning basics (CNN slides2, RNN slides3). Unfortunately, I will not be in town, so the TA will go over the slides (probably in Chinese). |
Again, I expect you to know basic deep learning already. |
Apr 9 | Topic modeling (nonnegative matrix factorization, the anchor word assumption), word2vec |
Reading: Learning Topic Models - Going Beyond SVD. A Practical Algorithm for Topic Modeling with Provable Guarantees (a more practical algorithm). Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. |
Apr 16 | GloVe, DeepWalk, word2vec as implicit matrix factorization, embeddings in deep learning; start attention |
Reading: Neural Word Embedding as Implicit Matrix Factorization. Further reading: Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. |
Apr 23 | Attention: slides(1), slides(2) |
Reading: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Neural Machine Translation by Jointly Learning to Align and Translate. Self-attention for machine translation; ConvS2S; Attention Is All You Need. |
May 7 | Finishing attention. Normalization (batch norm, layer norm, group norm) slides. (Very) brief introduction to transfer learning and multi-task learning slides. Start generalization (algorithmic stability). A minimal batch-norm sketch appears after the schedule. |
Reading: How transferable are features in deep neural networks? When Will You Arrive? Estimating Travel Time Based on Deep Neural Networks |
May 14 | Generalization: algorithmic stability, PAC-Bayes, generalization of SGD and SGLD (Langevin dynamics) |
Reading: Train faster, generalize better: Stability of stochastic gradient descent. Generalization of SGLD (this note gives a simple and elementary proof of the first result of the next paper, which uses techniques from SDEs). Further readings: Generalization Bounds of SGLD for Non-convex Learning; Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. |
May 21 | Convergence of GD (convex, strongly convex), stochastic GD, variance reduction. A minimal SVRG sketch appears after the schedule. |
Reading: Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Further readings: Introductory Lectures on Stochastic Optimization; Potential-Function Proofs for First-Order Methods; Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. Lower bound: Tight Complexity Bounds for Optimizing Composite Objectives. |
May 28 | Heavy ball method (ODE interpretation and convergence analysis), lower bound for first-order methods, Nesterov's acceleration (ODE interpretation and analysis) |
Further readings: A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights; Potential-Function Proofs for First-Order Methods. Optimal bounds for convex functions: (upper bound) Katyusha: The First Direct Acceleration of Stochastic Gradient Methods; (lower bound) Tight Complexity Bounds for Optimizing Composite Objectives. Nonconvex: see Tables 1 and 2 in my paper A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization. |
Jun 4 | Reinforcement learning: policy iteration, value iteration, MC, TD, actor-critic |
Reading: Berkeley Deep Reinforcement Learning course: http://rll.berkeley.edu/deeprlcourse/ |
Jun 11 (last class) | Policy gradient, advantage, actor-critic, A3C. A minimal policy-gradient (REINFORCE) sketch appears after the schedule. |
Further reading: google the following: natural policy gradient (NPG), TRPO (trust region policy optimization), PPO (proximal policy optimization). For more material, see http://rll.berkeley.edu/deeprlcourse/ |
TBA | Course Project Presentation |
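Code sketches (referenced in the schedule above):
The following are minimal, illustrative Python/numpy sketches written for this page, not the implementations used in class; the toy datasets, constants, and helper names are invented for the examples.

AdaBoost with decision stumps (Feb 26): a sketch of the reweighting scheme, assuming a toy 2-D dataset and a brute-force stump learner.

import numpy as np

def fit_stump(X, y, w):
    """Pick the decision stump minimizing the weighted error over all features/thresholds."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(X, j, thr, sign):
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=30):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # example weights D_t
    ensemble = []                            # list of (alpha_t, stump parameters)
    for _ in range(T):
        err, j, thr, sign = fit_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, j, thr, sign)
        w *= np.exp(-alpha * y * pred)       # up-weight the misclassified examples
        w /= w.sum()
        ensemble.append((alpha, (j, thr, sign)))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(X, *s) for a, s in ensemble))

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels
    ensemble = adaboost(X, y)
    print("training accuracy:", np.mean(predict(ensemble, X) == y))

Each round fits a stump under the current weights, up-weights the examples it gets wrong, and the final classifier is the sign of the alpha-weighted vote (the additive-model view of boosting).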
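Multiplicative weights for the expert problem (Mar 5): a sketch with synthetic losses in [0, 1]; eta = sqrt(ln N / T) is the standard tuning behind the O(sqrt(T ln N)) regret bound.

import numpy as np

def multiplicative_weights(loss_matrix, eta):
    """loss_matrix[t, i] = loss of expert i at round t, assumed to lie in [0, 1]."""
    T, N = loss_matrix.shape
    w = np.ones(N)
    alg_loss = 0.0
    for t in range(T):
        p = w / w.sum()                        # play the normalized weights
        alg_loss += p @ loss_matrix[t]         # expected loss this round
        w *= np.exp(-eta * loss_matrix[t])     # exponentially down-weight lossy experts
    best_expert_loss = loss_matrix.sum(axis=0).min()
    return alg_loss, best_expert_loss

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    T, N = 2000, 10
    losses = rng.uniform(size=(T, N))
    losses[:, 3] *= 0.5                        # expert 3 is better on average
    alg, best = multiplicative_weights(losses, eta=np.sqrt(np.log(N) / T))
    print("algorithm loss: %.1f, best expert: %.1f, regret: %.1f" % (alg, best, alg - best))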
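UCB1 for the multi-armed bandit (Mar 26), following the index from Auer, Cesa-Bianchi, and Fischer (2002); the Bernoulli arm means below are made up for the example.

import numpy as np

def ucb1(means, T=20000, seed=0):
    rng = np.random.RandomState(seed)
    K = len(means)
    counts = np.zeros(K)                         # number of pulls per arm
    sums = np.zeros(K)                           # total observed reward per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                          # pull every arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))            # optimism in the face of uncertainty
        reward = float(rng.rand() < means[arm])  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    regret = T * max(means) - total
    return counts, regret

if __name__ == "__main__":
    counts, regret = ucb1(means=[0.3, 0.5, 0.7])
    print("pull counts:", counts, "empirical regret: %.1f" % regret)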
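Batch normalization forward pass (May 7): a training-mode sketch that normalizes each feature over the mini-batch and then applies the learnable scale (gamma) and shift (beta). At test time one would use running averages of the batch statistics, which this sketch omits.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance per feature
    return gamma * x_hat + beta

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
    out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
    print("per-feature means:", np.round(out.mean(axis=0), 3))   # ~0
    print("per-feature stds: ", np.round(out.std(axis=0), 3))    # ~1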
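SVRG (May 21): the variance-reduction estimator from "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction", sketched on a toy least-squares objective; the step size and epoch counts are ad hoc.

import numpy as np

def svrg_least_squares(A, b, eta=0.01, epochs=20, seed=0):
    """Minimize (1/2n) * ||A w - b||^2 with the SVRG gradient estimator."""
    rng = np.random.RandomState(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = A.T @ (A @ w_snap - b) / n            # full gradient at the snapshot
        for _ in range(2 * n):
            i = rng.randint(n)
            g_i = A[i] * (A[i] @ w - b[i])                # stochastic gradient at w
            g_i_snap = A[i] * (A[i] @ w_snap - b[i])      # same sample at the snapshot
            w -= eta * (g_i - g_i_snap + full_grad)       # variance-reduced step
    return w

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    A = rng.normal(size=(500, 20))
    w_true = rng.normal(size=20)
    b = A @ w_true + 0.01 * rng.normal(size=500)
    w = svrg_least_squares(A, b)
    print("distance to w_true: %.4f" % np.linalg.norm(w - w_true))

The inner loop replaces the plain stochastic gradient by g_i(w) - g_i(w_snapshot) + full_grad(w_snapshot), an unbiased estimator whose variance shrinks as w approaches the snapshot; this is what allows a constant step size and linear convergence in the strongly convex case.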
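Policy gradient / REINFORCE (Jun 11): sketched on a toy 3-armed bandit with a softmax policy and a running-average baseline; the reward means are invented for the example.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(means, iters=5000, lr=0.1, seed=0):
    rng = np.random.RandomState(seed)
    theta = np.zeros(len(means))                  # one logit per arm
    baseline = 0.0
    for _ in range(iters):
        p = softmax(theta)
        a = rng.choice(len(means), p=p)           # sample an action from the policy
        r = means[a] + 0.1 * rng.randn()          # noisy reward
        grad_logp = -p
        grad_logp[a] += 1.0                       # gradient of log pi(a) for a softmax policy
        theta += lr * (r - baseline) * grad_logp  # REINFORCE update with a baseline (advantage)
        baseline += 0.05 * (r - baseline)         # running-average baseline
    return softmax(theta)

if __name__ == "__main__":
    probs = reinforce_bandit(means=[0.2, 0.5, 0.9])
    print("final policy (should favor the last arm):", np.round(probs, 3))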
References:
[Book] The Elements of Statistical Learning [ESL]
[Book] Convex Optimization
[Book] Introduction to Online Convex Optimization
[Book] Prediction, Learning, and Games
[Book] Lectures on Modern Convex Optimization
Ankur Moitra's (MIT) lecture notes: Algorithmic Aspects of Machine Learning
Berkeley Deep Reinforcement Learning course:
http://rll.berkeley.edu/deeprlcourse/
A blog about nonconvex optimization: https://www.offconvex.org/
Python is the default programming language we will use in the course.
If you haven't used it before, don't worry. It is very easy to learn (if you know any other programming language) and very efficient, especially
for prototyping things related to scientific computing and numerical optimization. Python code is usually much shorter than C/C++ code (the language does a lot for you), and it is more flexible and generally faster than MATLAB.
A standard combination for this class is Python + numpy (a numerical library for Python) + scipy (a scientific computing library for Python) + matplotlib (for generating nice plots).
Another, somewhat easier way is to install Anaconda (a free Python distribution that comes with the most popular packages).
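As a quick check that the combination works, here is a tiny, made-up example: plain gradient descent on a toy least-squares problem, compared against scipy's L-BFGS solver, with the objective curve plotted by matplotlib.

import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)

def f(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

def grad(w):
    return A.T @ (A @ w - b) / len(b)

# plain gradient descent
w = np.zeros(5)
history = []
for _ in range(200):
    history.append(f(w))
    w -= 0.1 * grad(w)

# scipy's quasi-Newton solver as a reference point
ref = minimize(f, np.zeros(5), jac=grad, method="L-BFGS-B")
print("GD objective: %.6f, L-BFGS objective: %.6f" % (f(w), ref.fun))

plt.semilogy(history)
plt.xlabel("iteration")
plt.ylabel("objective")
plt.title("Gradient descent on a toy least-squares problem")
plt.show()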