I. Course description
II. Course textbooks
III. Course outline
IV. Grading information

Course description

The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract pieces of information useful for making smart business decisions.

In this course we will study a variety of techniques for predictive analytics. Particular emphasis will be given to approaches that are scalable to very large data sets and/or relatively robust when faced with a large number of predictors, as well as algorithms for heterogeneous or streaming data. Many of these capabilities are essential for handling big data. Connections to relevant business problems will be made via example case studies. We will mostly be using Python (especially Scikit-Learn). The central goal of this course is to convey an understanding of the pros and cons of different predictive modeling techniques, so that you can (i) make an informed decision on which approaches to consider when faced with real-life problems requiring predictive modeling, and (ii) apply models properly on real datasets so as to draw valid conclusions. This goal will be reinforced through both theory and hands-on experience.
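As a preview of the hands-on side of the course, the following is a minimal sketch of the Scikit-Learn workflow that recurs throughout the semester: load data, hold out a test set, fit a model, and evaluate it. The dataset and estimator below are illustrative placeholders, not part of any assignment.

    # Minimal predictive modeling workflow in Scikit-Learn (illustrative only).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Load a small built-in dataset as a stand-in for real business data.
    X, y = load_breast_cancer(return_X_y=True)

    # Hold out a test set so evaluation reflects generalization, not memorization.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Fit any estimator exposing fit/predict; logistic regression is just a placeholder.
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, y_train)

    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))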


Textbooks

The material for the lectures is taken from a wide variety of sources. My notes will be available via Canvas.

For help with programming and concepts, “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by A. Geron (O’Reilly, 2017) is a great reference. The Scikit-Learn website also has reasonably good documentation.

The textbook for the course is:

Author: Max Kuhn and Kjell Johnson (KJ)
Title: Applied Predictive Modeling
Publisher: Springer
ISBN: 1461468485
Year: 2013
Notes: Available through Amazon, Springer, and the UT Co-op.

Author: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (JW)
Title: An Introduction to Statistical Learning with Applications in R
Publisher: Springer
Notes: The authors have kindly provided a free PDF version online, though it may be worth your while to get a hardcopy as well.

Relevant sections of these books are indicated next to the topics below. Topics without a reference to KJ or JW will be covered through lecture notes and a reading list of papers.

Supplementary textbooks are:

Author: Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Title: The Elements of Statistical Learning
Publisher: Springer (2nd edition)
ISBN: 0387848576
Notes: Available from Amazon for about $70 (well worth it), or download the PDF from http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Author: Kevin Murphy
Title: Machine Learning: A Probabilistic Perspective
Publisher: MIT Press
ISBN: 0262018020
Notes: Covers a very wide range of topics. Lots of examples in MATLAB, with source code access.

Author: Wes McKinney
Title: Python for Data Analysis
Publisher: O'Reilly

MOOC: The Coursera course by Andrew Ng has some very introductory material on linear algebra, e.g. multiplying a matrix with a vector.
Notes: https://class.coursera.org/ml-003/lecture

Course Schedule and Topics

KJ and JW refer to the textbooks above; Pi refers to paper set i, provided through Canvas.

1. Overview and Recap: Types of predictive analytics; Local vs. global models; Role within the data mining process; Software demos; Multivariate regression; Tackling big data
(2 lectures; KJ: Chapters 1-2; JW: Chapters 1-3; P0)
Objective: Recap and revise key concepts from MIS 380, and provide context for this class.

2. Data Pre-Processing: Transformations, Imputations, Sampling, Outlier detection
(2 lectures; KJ: Chapter 3; JW: 10.1, 10.2)
Objective: Understand that good data quality is a prerequisite for effective models, and study some methods for improving data quality. (A short pre-processing sketch follows.)
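For concreteness, here is a minimal sketch of the kind of pre-processing pipeline discussed in this topic, chaining imputation and scaling in Scikit-Learn; the tiny array and the choice of median imputation are illustrative assumptions.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline

    # Toy data with a missing value (np.nan) to be imputed.
    X = np.array([[1.0, 200.0],
                  [2.0, np.nan],
                  [3.0, 180.0]])

    # Chain the pre-processing steps so they are fit once and then
    # applied consistently to new data.
    prep = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # center and scale each feature
    ])

    print(prep.fit_transform(X))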

3. Overfitting: Bias-variance tradeoff and overfitting; Model tuning (see the tuning sketch below)
(1 lecture; KJ: Chapters 4-5)
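The following sketch illustrates the model-tuning idea from topic 3: cross-validation is used to pick a complexity parameter (here, tree depth) instead of trusting training error, which rewards overfitting. The estimator and parameter grid are illustrative assumptions.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Deeper trees have lower bias but higher variance; cross-validation
    # estimates out-of-sample performance for each candidate depth.
    grid = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [2, 4, 8, None]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))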

4. Advanced Multivariate Regression (partly revision): Basis function expansion; Dealing with a large number of features; Ridge, Lasso, and Stagewise approaches; Non-linear methods; Bayesian approach; Multilevel models; User's guide
(3 lectures; KJ: Chapters 6, 7.1, 7.4, 7.5; JW: Chapter 6; P1)
Objective: Understand predictive models where the desired outcome is a numeric quantity. (A short regression sketch follows.)
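As a small illustration of the shrinkage methods in this topic, the sketch below compares Ridge and Lasso on synthetic data with many predictors; the data generator and penalty strengths are placeholder assumptions.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, Lasso
    from sklearn.model_selection import cross_val_score

    # Synthetic regression problem with many (mostly uninformative) predictors.
    X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                           noise=10.0, random_state=0)

    # Both models shrink coefficients; Lasso can also drive some exactly to zero.
    for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(type(model).__name__, round(scores.mean(), 3))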

5. Classification: Bayes decision theory, Naïve Bayes and Bayesian networks; (Scaled) Logistic regression; LDA; Scaling decision trees to big data; Kernel methods and Support Vector Machines (SVMs) for classification and regression; dealing with class imbalance
(6 lectures; KJ: Chapters 11, 12, 13, 14.1, 14.2, 8.1, 16; JW: Chapters 4, 8, 9; P2)
Objective: Understand predictive models where the desired outcome is a class label. (A short classification sketch follows.)
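The sketch below runs two of the classifiers from this topic (Naïve Bayes and an RBF-kernel SVM) through the same cross-validation loop; the dataset is an illustrative stand-in.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    X, y = load_breast_cancer(return_X_y=True)

    # SVMs are sensitive to feature scales, so the kernel SVM gets a scaler in front.
    for name, clf in [("GaussianNB", GaussianNB()),
                      ("RBF SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf")))]:
        print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))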

6. Ensemble Methods: Model averaging; Bagging and Random Forests; Boosting and gradient boosting; Bag of Little Bootstraps
(2 lectures; KJ: Chapters 14.3-14.8; JW: 8.2; P3)
Objective: Understand the benefits of combining multiple predictive models. (A short ensemble sketch follows.)
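The following sketch compares a single decision tree against a bagging-style ensemble (random forest) and a boosting-style ensemble (gradient boosting), which is the core point of this topic; the dataset and hyperparameters are illustrative.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = [
        DecisionTreeClassifier(random_state=0),                     # single-model baseline
        RandomForestClassifier(n_estimators=200, random_state=0),   # bagging + random feature subsets
        GradientBoostingClassifier(random_state=0),                 # sequential boosting
    ]
    for m in models:
        print(type(m).__name__, round(cross_val_score(m, X, y, cv=5).mean(), 3))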

7. Streaming Data Mining: Online learning – basic approaches; Winnow/Voted Perceptrons; Stochastic gradient methods for large data sets (see the online-learning sketch below)
(1 lecture; P4)
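As an illustration of topic 7, the sketch below trains a linear classifier with stochastic gradient descent via partial_fit, feeding the data in mini-batches as if it arrived as a stream; the batch size and dataset are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import SGDClassifier

    X, y = load_breast_cancer(return_X_y=True)
    classes = np.unique(y)               # partial_fit needs the full label set up front

    clf = SGDClassifier(random_state=0)  # linear model trained by stochastic gradient descent
    for start in range(0, len(X), 64):   # simulate a stream of mini-batches
        batch = slice(start, start + 64)
        clf.partial_fit(X[batch], y[batch], classes=classes)

    print("accuracy on data seen so far:", round(clf.score(X, y), 3))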

8. Semi-supervised Learning for Big Data: Learning when labeled data is scarce.
(1 lecture; P5)
Objectives for 7 and 8: Both streaming and semi-supervised settings arise commonly in big data applications, so approaches to predictive modeling in these non-traditional settings are covered. (A short semi-supervised sketch follows.)
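To illustrate topic 8, the sketch below hides most of the labels and fits a graph-based semi-supervised model (LabelSpreading), one of several possible approaches; the 90% masking rate and dataset are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelSpreading

    X, y = load_iris(return_X_y=True)

    # Hide 90% of the labels; Scikit-Learn marks unlabeled points with -1.
    rng = np.random.RandomState(0)
    y_partial = np.where(rng.rand(len(y)) < 0.9, -1, y)

    model = LabelSpreading()
    model.fit(X, y_partial)              # uses both labeled and unlabeled points

    print("accuracy vs. true labels:", round(model.score(X, y), 3))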

9. Specialized/Advanced Topics (coverage depends on the time available and the interest of the class):

(a) Multi-task Learning: Predictive analytics for (multi-)relational data; Use of auxiliary information such as networks

(b) Stochastic Block Models: Applications to next-generation recommender systems; Customer-product affinity detection

(c) Ranking; Spam filtering

(d) Distributed Data Mining: Overview; ADMM and distributed logistic regression
(3 lectures; P6)
Objective: Study some important specialized situations where predictive models are deployed, including large-scale predictive modeling using multiple machines, e.g. cloud computing.

10. Term Project Presentations and Discussion
(4 lectures)

11. Wildcards: A couple of classes may be used for invited talks by visiting experts.

Grading information

10% + 25%  Project (groups of 3-5): (project outline + 2 presentations) + term paper due at the end of the semester
25%        5 Assignments
15%        4 pop quizzes (best 3 scores counted)
20%        Written exam (in class)
5%         Class participation

Dates for quizzes and the exam will be announced on Canvas. There will be no final exam.

Quizzes will be held in class and will last 15-30 minutes. Their objective is to review key concepts introduced in class.

At the end of the course, you will receive a score out of 100 based on the percentages stated above. Your final grade will be based solely on this score. Grading is primarily on a curve, i.e. your grade is relative to how the whole class performs; however, the entire curve may shift up or down a bit depending on how this class performs relative to past classes. Grading is NOT based on absolute thresholds (e.g. 90+ = A).

ACADEMIC DISHONESTY AND POLICIES ON CHEATING: Faculty at UT are committed to detecting and responding to all instances of scholastic dishonesty and will pursue cases of scholastic dishonesty in accordance with university policy. Scholastic dishonesty, in all its forms, is a blight on our entire academic community. All parties in our community -- faculty, staff, and students -- are responsible for creating an environment that educates outstanding engineers, and this goal entails excellence in technical skills, self-giving citizenry, and ethical integrity. Industry wants engineers who are competent and fully trustworthy, and both qualities must be developed day by day throughout an entire lifetime. Scholastic dishonesty includes, but is not limited to, cheating, plagiarism, collusion, falsifying academic records, or any act designed to give an unfair academic advantage to the student. The fact that you are in this class as an engineering student is testament to your abilities. Penalties for scholastic dishonesty are severe and can include, but are not limited to, a written reprimand, a zero on the assignment/exam, re-taking the exam in question, an F in the course, or expulsion from the University. Don't jeopardize your career by an act of scholastic dishonesty. Details about academic integrity and what constitutes scholastic dishonesty can be found at the website for the UT Dean of Students Office and the General Information Catalog, Section 11-802.

Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."

NOTICES: