ATCS – Selected Topics in Learning, Prediction, and Optimization
2017 Spring
Lecturer: Jian Li
TA: Zizhe Li (liplace@126.com)
Room: 3313
Time: every Monday, 9:50am-12:15pm
We intend to cover a subset of the following topics (tentative):
(1) A quick review of basics of convex optimization and machine learning: linear/logistic regression, regularization, Newton's method, stochastic gradient descent (asynchronous, variance-reduction methods), generative vs. discriminative models, bias vs. variance
(2) Off-the-shelf machine learning and prediction algorithms: k-NN, SVM, the kernel trick, clustering, AdaBoost, gradient boosting, random forests
(3) Online learning and sequential prediction: multi-armed bandits, universal portfolios, the multiplicative weights method, online convex optimization, basics of time series
(4) Linear algebra-based learning algorithms: SVD, principal component analysis (PCA), independent component analysis (ICA), nonnegative matrix factorization (NMF), topic modeling, matrix completion, dictionary learning, tensor methods, spectral clustering
(5) Deep learning, very briefly (basics, CNNs, RNNs; please see my undergraduate course), plus some recent interesting developments such as AlphaGo, poker, and WGAN (we will see)
(Maybe) word2vec, hashing/sketching, community detection, non-convex optimization, some statistics...
I won't strictly follow the above order. I may skip some topics mentioned above and cover some that are not. It is a graduate course.
I will also talk about several applications of ML and optimization in finance, and of course in typical CS areas such as vision, NLP, and social networks.
Some knowledge of convex optimization may be useful. See John's class http://itcs.tsinghua.edu.cn/~john/convex2013.html and a previous course of my own, but it will be fine if you didn't take those courses. Basic machine learning knowledge will be very useful. If you don't know any machine learning, I suggest reading Andrew Ng's undergraduate lecture notes.
The course is a blend of theory and practice. We will cover both the underlying mathematics and interesting heuristics.
Grading:
Schedule:
Feb 20 | The expert problem. The multiplicative weights method. | Multiplicative Weights Method: A Meta-Algorithm and Its Applications (a survey), Sanjeev Arora, Elad Hazan, and Satyen Kale. [pdf] The method can be used to approximately solve zero-sum games and linear programs, and is also closely related to AdaBoost. |
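A minimal numpy sketch of the multiplicative weights (Hedge) update for the expert problem; the number of experts, the random loss sequence, and the learning rate eta below are illustrative choices, not prescribed by the survey.

    import numpy as np

    def hedge(losses, eta):
        """Multiplicative weights (Hedge) over n experts.
        losses: (T, n) array with losses[t, i] in [0, 1]."""
        T, n = losses.shape
        w = np.ones(n)                                 # start from uniform weights
        total = 0.0
        for t in range(T):
            p = w / w.sum()                            # play the normalized weights
            total += p @ losses[t]                     # expected loss this round
            w *= np.exp(-eta * losses[t])              # multiplicative update
        regret = total - losses.sum(axis=0).min()      # vs. the best single expert
        return p, regret

    # toy run: 5 experts, 1000 rounds of random losses in [0, 1]
    rng = np.random.default_rng(0)
    T, n = 1000, 5
    p, regret = hedge(rng.random((T, n)), eta=np.sqrt(np.log(n) / T))
    print(p, regret)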
Feb 27 | Gradient descent for online learning. SVD, PCA. | Online Convex Programming and Generalized Infinitesimal Gradient Ascent, M. Zinkevich. |
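A minimal sketch of online (projected) gradient descent in the spirit of the Zinkevich reference; the quadratic per-round losses, the 1/sqrt(t) step size, and the unit Euclidean ball as the feasible set are illustrative assumptions.

    import numpy as np

    def online_gradient_descent(grads, dim, radius=1.0):
        """Play x_t, observe the gradient of the t-th convex loss at x_t,
        take a gradient step, and project back onto a Euclidean ball."""
        x = np.zeros(dim)
        plays = []
        for t, grad in enumerate(grads, start=1):
            plays.append(x.copy())
            x = x - grad(x) / np.sqrt(t)      # step size eta_t = 1/sqrt(t)
            norm = np.linalg.norm(x)
            if norm > radius:                 # projection onto the ball
                x *= radius / norm
        return plays

    # toy losses f_t(x) = ||x - z_t||^2 with random targets z_t
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(100, 2))
    plays = online_gradient_descent([lambda x, z=z: 2 * (x - z) for z in targets], dim=2)
    print(plays[-1])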
Mar 6 | Linear regression, LR from SVD, probabilistic interpretation of LR. Ridge regression, Gauss-Markov theorem. Basics of MLE. | |
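A minimal numpy sketch of least squares computed from the SVD, together with the closed-form ridge regression solution; the synthetic data and the regularization weight lam are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)

    # least squares via the SVD: w = V diag(1/s) U^T y (pseudo-inverse of X applied to y)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    w_ls = Vt.T @ ((U.T @ y) / s)

    # ridge regression shrinks each singular direction by s_i / (s_i^2 + lam)
    lam = 1.0
    w_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

    print(np.linalg.norm(w_ls - w_true), np.linalg.norm(w_ridge - w_true))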
Mar 13 | Sparsity: LASSO, group lasso, elastic net. Compressive sensing basics. Matrix completion. Norms and dual norms. Nuclear norm. (Sub)differential of the nuclear norm. Nuclear norm minimization as an SDP. Matching pursuit. Orthogonal matching pursuit. | More details on the subdifferential of the nuclear norm: Characterization of the Subdifferential of Some Matrix Norms, Linear Algebra Appl., 170 (1992), pp. 33-45. |
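A minimal sketch of orthogonal matching pursuit on a toy sparse-recovery instance; the Gaussian dictionary, the sparsity level k, and the noiseless measurements are illustrative assumptions.

    import numpy as np

    def omp(A, y, k):
        """Greedily pick the column most correlated with the residual,
        then re-fit by least squares on the selected support."""
        residual, support = y.copy(), []
        x = np.zeros(A.shape[1])
        for _ in range(k):
            j = int(np.argmax(np.abs(A.T @ residual)))     # best new atom
            if j not in support:
                support.append(j)
            coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
            residual = y - A[:, support] @ coef            # update the residual
        x[support] = coef
        return x

    # recover a 3-sparse vector in R^50 from 20 random measurements
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 50)) / np.sqrt(20)
    x_true = np.zeros(50); x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
    x_hat = omp(A, A @ x_true, k=3)
    print(np.nonzero(x_hat)[0], np.round(x_hat[np.nonzero(x_hat)], 2))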
Mar 20 | Dictionary learning (k-SVD, an online learning approach). Nonnegative matrix factorization. Cramér-Rao bound. | Further readings: Computing a Nonnegative Matrix Factorization -- Provably; Learning Topic Models -- Going Beyond SVD; Factoring Nonnegative Matrices with Linear Programs |
Mar 27 | Consistency of the MLE, asymptotic normality of the MLE. Atomic norms. Coding complexity regularization. | Further readings: The Convex Geometry of Linear Inverse Problems |
Apr 3 | Convex optimization. Strong convexity. Condition number. Convergence of gradient descent (for strongly convex functions, for Lipschitz gradients). | Mentioned at the beginning of the class: graph sparsity (A Nearly-Linear Time Framework for Graph-Structured Sparsity); graph Laplacian regularization; the integral-of-second-derivative result can be found in ESL, Chapter 5; more general Tikhonov regularization (the differential-operator regularization I mentioned in class). |
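A minimal sketch of gradient descent on a strongly convex quadratic, illustrating how the contraction rate is governed by the condition number kappa = L/mu; the eigenvalues and the 1/L step size are illustrative.

    import numpy as np

    # f(x) = 0.5 x^T A x with A diagonal; mu = 1, L = 100, so kappa = 100
    eigs = np.array([1.0, 2.0, 5.0, 100.0])
    A = np.diag(eigs)
    L, mu = eigs.max(), eigs.min()

    x = np.ones(4)
    for t in range(501):
        if t % 100 == 0:
            print(t, np.linalg.norm(x))    # decays roughly like (1 - mu/L)^t
        x = x - (1.0 / L) * (A @ x)        # the gradient of f is A x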
Apr 10 | Stochastic gradient descent, variance-reduced SGD, lower bounds for first-order methods | |
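A minimal sketch of SVRG (variance-reduced SGD) for a least-squares objective; the step size, the epoch length, and the synthetic data are illustrative choices.

    import numpy as np

    def svrg(X, y, eta=0.01, epochs=30):
        """SVRG for (1/n) * sum_i 0.5 * (x_i . w - y_i)^2: each epoch recomputes
        the full gradient at a snapshot and uses it to correct (and thereby
        reduce the variance of) the per-example stochastic gradients."""
        n, d = X.shape
        w = np.zeros(d)
        rng = np.random.default_rng(0)
        grad_i = lambda v, i: (X[i] @ v - y[i]) * X[i]       # one-term gradient
        for _ in range(epochs):
            w_snap = w.copy()
            full_grad = X.T @ (X @ w_snap - y) / n           # gradient at the snapshot
            for _ in range(n):
                i = rng.integers(n)
                g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
                w = w - eta * g
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))
    w_true = rng.normal(size=10)
    print(np.linalg.norm(svrg(X, X @ w_true) - w_true))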
Apr 17 | Heavy ball method (momentum), ODE interpretation of the heavy ball method, Nesterov's accelerated GD and its ODE interpretation, convergence analysis | A very nice blog post on why momentum works (with animations!) |
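A minimal sketch comparing plain gradient descent with the heavy ball (momentum) method on an ill-conditioned quadratic; the eigenvalues are illustrative, and the parameters follow the standard tuning for quadratics.

    import numpy as np

    eigs = np.array([1.0, 10.0, 100.0])        # mu = 1, L = 100
    A = np.diag(eigs)
    L, mu = eigs.max(), eigs.min()

    def run(beta, steps=300):
        """x_{t+1} = x_t - alpha * grad f(x_t) + beta * (x_t - x_{t-1});
        beta = 0 is plain gradient descent with step 1/L."""
        alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2 if beta > 0 else 1.0 / L
        x = x_prev = np.ones(3)
        for _ in range(steps):
            x, x_prev = x - alpha * (A @ x) + beta * (x - x_prev), x
        return np.linalg.norm(x)

    beta_hb = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    print("gd:", run(beta=0.0), "heavy ball:", run(beta=beta_hb))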
Apr 24 | Conjugate gradient, Chebyshev polynomials, convex conjugate functions | |
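A minimal sketch of the conjugate gradient method for a symmetric positive definite system A x = b; the random SPD matrix below is illustrative.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10):
        """Textbook CG; in exact arithmetic it terminates within dim(A) steps."""
        x = np.zeros_like(b)
        r = b - A @ x                      # residual = -gradient of 0.5 x'Ax - b'x
        p = r.copy()                       # first search direction
        rs = r @ r
        for _ in range(len(b)):
            Ap = A @ p
            alpha = rs / (p @ Ap)          # exact line search along p
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p      # next direction, A-conjugate to the previous ones
            rs = rs_new
        return x

    rng = np.random.default_rng(0)
    M = rng.normal(size=(30, 30))
    A = M @ M.T + np.eye(30)               # a random SPD matrix
    b = rng.normal(size=30)
    print(np.linalg.norm(A @ conjugate_gradient(A, b) - b))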
May 6 | Proximal methods | Main ref: http://web.stanford.edu/~boyd/papers/prox_algs.html |
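A minimal sketch of the proximal gradient method (ISTA) applied to the lasso, where the proximal operator of the l1 norm is entrywise soft-thresholding; the data and the parameter lam are illustrative.

    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t * ||.||_1."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(X, y, lam, iters=500):
        """Gradient step on the smooth part 0.5 * ||X w - y||^2,
        followed by the prox of the nonsmooth part lam * ||w||_1."""
        L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth gradient
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w = soft_threshold(w - X.T @ (X @ w - y) / L, lam / L)
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]
    y = X @ w_true + 0.01 * rng.normal(size=100)
    print(np.round(ista(X, y, lam=1.0), 2))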
May 8 | ADMM | Main ref: http://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf |
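A minimal sketch of ADMM for the lasso in scaled form (one of the splittings discussed in the reference above); the data, lam, and the penalty parameter rho are illustrative.

    import numpy as np

    def admm_lasso(X, y, lam, rho=1.0, iters=200):
        """Minimize 0.5*||X w - y||^2 + lam*||z||_1 subject to w = z:
        a ridge-like w-update, a soft-thresholding z-update, and a dual update."""
        d = X.shape[1]
        w = z = u = np.zeros(d)
        Q = np.linalg.inv(X.T @ X + rho * np.eye(d))   # pre-factor the w-update
        Xty = X.T @ y
        for _ in range(iters):
            w = Q @ (Xty + rho * (z - u))
            z = np.sign(w + u) * np.maximum(np.abs(w + u) - lam / rho, 0.0)
            u = u + w - z                              # scaled dual variable
        return z

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]
    y = X @ w_true + 0.01 * rng.normal(size=100)
    print(np.round(admm_lasso(X, y, lam=1.0), 2))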
May 15 | Brief review of kernel methods. Random Fourier features, feature hashing, Johnson–Lindenstrauss lemma | Random Features for Large-Scale Kernel Machines; Feature Hashing for Large Scale Multitask Learning; the JL lemma (see, e.g., a blog post, or the page by Yitong Yin) |
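A minimal sketch of random Fourier features approximating the Gaussian (RBF) kernel, checked against the exact kernel matrix on toy data; the bandwidth sigma and the number of features D are illustrative.

    import numpy as np

    def random_fourier_features(X, D=2000, sigma=1.0, seed=0):
        """Map X to D features so that z(x) . z(y) approximates
        exp(-||x - y||^2 / (2 sigma^2))."""
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))   # random frequencies
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)                 # random phases
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    Z = random_fourier_features(X)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_exact = np.exp(-sq_dists / 2.0)                              # sigma = 1
    print(np.abs(Z @ Z.T - K_exact).max())                         # should be small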
May 22 | | |
May 29 | | |
Jun 6 | (presentation) |
References:
[Book] The Elements of Statistical Learning [ESL]
[Book] Convex Optimization
[Book] Introduction to Online Convex Optimization
[Book] Prediction, Learning, and Games
[Book] All of Statistics: A Concise Course in Statistical Inference
[Book] Optimization Methods in Finance
[Book] Analysis of Financial Time Series
[Book] Statistical Models and Methods for Financial Markets
Ankur Moitra's (MIT) lecture notes on algorithmic machine learning
Python is the default programming language we will use in the course.
If you haven't used it before, don't worry. It is very easy to learn (if you know any other programming language), and it is a very efficient language to work in, especially for prototyping things related to scientific computing and numerical optimization. Python code is usually much shorter than C/C++ code (the language does a lot for you). It is also more flexible and generally faster than MATLAB.
A standard combination for this class is Python + numpy (a numerical library for Python) + scipy (a scientific computing library for Python) + matplotlib (for generating nice plots).
Another, somewhat easier, way is to install Anaconda (a free Python distribution that comes with most popular packages).
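To check that the stack works, here is a tiny end-to-end example (purely illustrative, not part of any assignment) that generates noisy data with numpy, fits a line with scipy, and plots the result with matplotlib:

    import numpy as np
    from scipy import optimize
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=50)        # noisy line

    # least-squares fit of (slope, intercept) by minimizing the residuals
    res = optimize.least_squares(lambda p: p[0] * x + p[1] - y, x0=[0.0, 0.0])
    slope, intercept = res.x

    plt.scatter(x, y, s=10, label="data")
    plt.plot(x, slope * x + intercept, color="red", label="fit")
    plt.legend()
    plt.show()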