Contents:
I. Course description
II. Course textbooks
III. Course outline
IV. Course expectations
V. Grading information
The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data science principles and models to discover and extract pieces of information useful for making smart business decisions.
In this course we will study a variety of techniques for data mining and machine learning. Some emphasis will be given to approaches that scale to very large data sets, remain relatively robust when faced with a large number of predictors, and handle heterogeneous or streaming data. We will mostly be using Python (specifically, Scikit-Learn), though a few problems in R will also be given.
The central goal of this course is to convey an understanding of the pros and cons of different data-driven models, so that you can (i) make an informed decision on what approaches to consider when faced with real-life problems requiring predictive modeling, and (ii) apply models properly to real datasets so as to draw valid conclusions. This goal will be reinforced through both theory and hands-on experience.
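As a minimal sketch of what "applying a model properly" looks like in Scikit-Learn (synthetic data and an arbitrary estimator, purely for illustration): hold out a test set, tune and estimate performance by cross-validation on the training portion, and touch the test set only once at the end.

```python
# Minimal sketch of a sound evaluation workflow (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
# 5-fold CV on the training set estimates generalization error
# without ever touching the held-out test set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)  # reported once, at the end
```

The same pattern (fit on training data, report on held-out data) applies to every model covered in the course.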
More hands-on exercises are offered in the "Data Science Lab" course. Information about the course instructor and TA(s) is available on the Contact tab above.
The material for the lectures is taken from a wide variety of sources. My notes will be available via Canvas.
For help with both programming and concepts, “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by A. Géron (O’Reilly, 2017) is a great reference. The Scikit-Learn website is also reasonably well documented.
Authors: Max Kuhn and Kjell Johnson (KJ)
Title: Applied Predictive Modeling
Notes: Available through Amazon, Springer and UT Co-op.
Authors: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (JW)
Title: An Introduction to Statistical Learning with Applications in R
Notes: The authors have kindly provided a free pdf version here, though it may be worth your while to get a hardcopy as well. Relevant sections of these books are indicated next to the topics below. A more detailed listing of readings/videos/other resources is provided through Canvas. Topics without a reference to KJ or JW will be covered through lecture notes and a reading list of papers (also see supplementary texts at the bottom).
1. Overview: The data mining process; model fitting and overfitting; decision theory; probability review; types of predictive analytics; local vs. global models; tackling Big Data; multiple linear regression.
(3 lectures; KJ: Chapter 1,2; JW Chapters 1-3, B, Ch 1, 2.1-2.3; HTF Ch 1, 2.1-2.6)
Objective: Provide overview and context for this class.
2. Regression: Generalized regression; bias-variance tradeoff and overfitting; model tuning; basis function expansion; dealing with a large number of features; ridge, lasso, and stagewise approaches; non-linear methods.
(3 lectures; KJ: Chapter 4-6, 7.1,7.4, 7.5; JW Ch 6, B 3.1, 3.2; HTF Ch 2.7, 2.8, 3.1-3.4, 7.1-7.3, 11.1-11.8)
Objective: Learn to understand predictive models where the desired outcome is a numeric quantity.
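A quick sketch of the ridge/lasso contrast from this unit, on synthetic data (the dataset and penalty strength are illustrative): the lasso's L1 penalty drives some coefficients exactly to zero, while ridge only shrinks them.

```python
# Ridge vs. lasso on a synthetic regression problem with only 5 of
# 20 features actually informative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# The lasso zeroes out (most of) the uninformative coefficients;
# ridge shrinks them toward zero but not exactly to zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
```

This sparsity is why the lasso doubles as a feature-selection tool when the number of predictors is large.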
3. Classification: Bayes decision theory, Naïve Bayes and Bayesian networks; (Scaled) Logistic regression; LDA; Scaling decision trees to big data; Kernel methods and Support Vector Machines (SVMs) for classification and regression; dealing with class imbalance
(6 lectures, KJ: Chapter 11,12,13,14.1,14.2,8.1,16; JW 4, 8, 9; B 4.1-4.3.4; 6.1, 6.2, 7.1, 14.4; HTF Ch 4, 7.10, 9.2, 12, 13.3)
Objective: Learn to build and evaluate predictive models where the desired outcome is a class label.
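As a small sketch of this workflow (iris data for illustration; any of the classifiers from this unit could be swapped in): fit a naïve Bayes classifier and inspect a confusion matrix, which is more informative than accuracy alone when classes are imbalanced.

```python
# Fit a classifier and evaluate with a confusion matrix.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes; off-diagonal
# entries show exactly which classes get confused with which.
cm = confusion_matrix(y_test, y_pred)
```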
4. Ensemble Methods: Model Averaging, Bagging and Random forests, boosting, Gradient boosting; Bag of Little Bootstraps
(2 lectures; KJ: Chapter 14.3-14.8; JW 8.2; B 14.2, 14.3, HTF Ch 8.7, 8.8, 10.1-10.7, 16)
Objective: Understand the benefits of combining multiple predictive models.
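To see the benefit empirically, here is a quick synthetic-data comparison (parameters are illustrative) of a single decision tree against a random forest, i.e., a bagged ensemble of decorrelated trees:

```python
# Single tree vs. bagged ensemble, compared by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=1)

tree_acc = cross_val_score(
    DecisionTreeClassifier(random_state=1), X, y, cv=5).mean()
# Averaging over 100 bootstrapped trees reduces variance.
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=1),
    X, y, cv=5).mean()
```

On most datasets the ensemble's variance reduction yields a noticeable accuracy gain over the single tree.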
5. Data Pre-Processing (Cleaning, Reduction, Feature Extraction, and Visualization): Data quality; curse of dimensionality; transformations; imputation; sampling; outlier detection; PCA.
(3 lectures, KJ: Chapter 3, JW 10.1, 10.2, B 12.1; HTF Ch 14.5, 14.8)
Objective: Understand that good data quality is a prerequisite for effective models, and study some methods for improving data quality.
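One idiomatic way to chain pre-processing steps like these in Scikit-Learn is a Pipeline, so that transformations learned on training data are reapplied consistently to new data. A minimal sketch (synthetic data with artificially injected missing values):

```python
# Imputation -> standardization -> PCA, chained in one Pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # zero mean, unit variance
    ("pca", PCA(n_components=3)),                  # reduce 8 dims to 3
])
X_reduced = prep.fit_transform(X)
```

A pipeline also prevents a common leakage bug: statistics (medians, means, principal components) are estimated from training folds only.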
6. Clustering and Co-clustering: k-means; hierarchical methods; graph partitioning; co-clustering; semi-supervised learning; market-basket applications.
(3 lectures: JW 10.3, 10.5, B 9.1, 9.2; HTF Ch 13.1, 13.2, 14.3, 14.4)
Objective: Learn issues involved in unsupervised learning and the trade-offs among alternative approaches to clustering.
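As a minimal unsupervised-learning sketch (synthetic "blob" data, assuming Scikit-Learn), k-means partitions the points by alternately assigning them to the nearest centroid and recomputing centroids:

```python
# k-means on three well-separated synthetic clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts guard against bad local minima from random init.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each point
centers = km.cluster_centers_  # one centroid per cluster
```

Note that k (the number of clusters) is an input, not an output; choosing it is one of the trade-offs this unit examines.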
7. Streaming Data Mining, Big Data Analytics: Online learning – basic approaches; Winnow/voted perceptrons; stochastic gradient methods for large data sets. Deep learning and TensorFlow. (3 lectures)
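Online learning of this kind can be mimicked in Scikit-Learn with partial_fit, feeding the data in mini-batches as if it arrived as a stream. A hedged sketch (synthetic data; a linear classifier trained by stochastic gradient descent):

```python
# Stream-style training with SGD: one pass over 100-example chunks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
classes = np.unique(y)  # must be declared up front for partial_fit

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    # Each call updates the model incrementally; earlier batches
    # are never revisited, as in a true data stream.
    clf.partial_fit(X[batch], y[batch], classes=classes)

acc = clf.score(X, y)
```

The same partial_fit pattern scales to data sets that do not fit in memory.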
9. Specialized Topics: (coverage depends on time available and interest of the class) Deep learning; recommender systems; customer-product affinity detection; Hadoop/Spark; Azure ML
10. Term Project Presentations and Discussion (4 classes)
11. Wildcards: A couple of classes may be used for invited talks by visiting experts.
Grading is NOT based on absolute thresholds (e.g., 90+ = A).
Author: Joel Grus
Title: Data Science from Scratch
Publisher: O'Reilly
Notes: Wes McKinney's Python for Data Analysis is also helpful.
MOOC: Andrew Ng's Coursera course has some very introductory material on linear algebra, e.g. multiplying a matrix by a vector.
Author: Trevor Hastie, Robert Tibshirani, and Jerome Friedman (HTF)
Title: The Elements of Statistical Learning
Publisher: Springer (2nd edition)
Notes: Available from Amazon (about $70, and well worth it), or download the pdf from http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Author: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)
Title: Introduction to Data Mining
Publisher: Addison-Wesley (2005)
Notes: Some chapters are downloadable from this website
Author: Christopher M. Bishop (B)
Title: Pattern Recognition and Machine Learning
Author: Kevin Murphy
Title: Machine Learning: A Probabilistic Perspective
Publisher: MIT Press
Notes: Covers a very wide range of topics. Lots of examples in MATLAB, with source code available.
Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."
ACADEMIC DISHONESTY, UT Honor Code etc.: