I. Course description
II. Course textbooks
III. Course outline
IV. Grading information

Course description

The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract pieces of information useful for making smart business decisions.

In this course we will study a variety of techniques for data mining and machine learning. Some emphasis will be given to approaches that scale to very large data sets, that remain relatively robust in the presence of a large number of predictors, and that handle heterogeneous or streaming data. Many of these capabilities are essential for handling BIG DATA. We will mostly be using Python (specifically, scikit-learn), though some problems in R will also be given.

The central goal of this course is to convey an understanding of the pros and cons of different data mining techniques, so that you can (i) make an informed decision on which approaches to consider when faced with real-life problems requiring predictive modeling, and (ii) apply models properly to real datasets so as to draw valid conclusions. This goal will be reinforced through both theory and hands-on experience.

Textbooks

The material for the lectures is taken from a wide variety of sources. My notes will be available via Canvas. The textbooks for the course are:

Author: Max Kuhn and Kjell Johnson (KJ)
Title: Applied Predictive Modeling
Publisher: Springer
ISBN: 1461468485
Year: 2013
Notes: Available through Amazon, Springer and UT Co-op.
Author: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (JW)
Title: An Introduction to Statistical Learning with Applications in R
Publisher: Springer
Notes: The authors have kindly made a free PDF version available online, though it may be worth your while to get a hardcopy as well.

Relevant sections of these books are indicated next to the topics below. Topics without a reference to KJ or JW will be covered through lecture notes and a reading list of papers.

Supplementary textbooks are:

Author: Trevor Hastie, Robert Tibshirani, and Jerome Friedman (HTF)
Title: The Elements of Statistical Learning
Publisher: Springer (2nd edition)
ISBN: 0387848576
Notes: Available from Amazon (about $70, but well worth it), or download the PDF from http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Author: C. Bishop (B)
Title: Pattern Recognition and Machine Learning
Publisher: Springer
Author: Kevin Murphy
Title: Machine Learning: A Probabilistic Perspective
Publisher: MIT Press
ISBN: 0262018020
Notes: Covers a very wide range of topics. Lots of examples in Matlab, with source code access.
Author: Wes McKinney
Title: Python for Data Analysis
Publisher: O'Reilly

(also see "Data Science from Scratch" by Joel Grus)
MOOC: Andrew Ng's Coursera course has some very introductory material on linear algebra, e.g. multiplying a matrix by a vector.
Notes: https://class.coursera.org/ml-003/lecture
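As a quick illustration of that prerequisite, here is a one-line matrix-vector product in NumPy (a minimal sketch; the matrix and vector values are arbitrary):

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])   # a 2x2 matrix
    v = np.array([5, 6])     # a length-2 vector

    # Entry i of the result is the dot product of row i of A with v
    print(A @ v)             # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]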

Course Schedule and Topics

KJ, JW, HTF and B refer to the textbooks and supplementary books listed above.

1. Overview: Data Mining process; Local vs. global models, and other trade-offs.
(2 lectures; KJ: Chapter 1,2; JW Chapters 1-3, B Ch 1, 2.1-2.3; HTF Ch 1, 2.1-2.6)
Objective: Provide overview and context for this class.

2. Regression: Multiple Linear Regression; Basis function expansion; Dealing with a large number of features; Ridge, Lasso and Stagewise approaches; Non-linear methods; Bayesian approach; and Multilevel models. A user's guide.
(3 lectures; KJ: Chapter 4-6, 7.1, 7.4, 7.5; JW Ch 6; B 3.1, 3.2; HTF Ch 2.7, 2.8, 3.1-3.4, 7.1-7.3, 11.1-11.8)
Objective: Understand predictive models where the desired outcome is a numeric quantity.
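As a taste of what we will do in class, here is a minimal scikit-learn sketch (the synthetic data and alpha values are illustrative assumptions, not prescriptions) contrasting ordinary least squares with ridge and lasso when only a few of many features are informative:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.model_selection import train_test_split

    # Synthetic data: 50 features, but only 5 actually drive the response
    X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                           noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for name, model in [("OLS", LinearRegression()),
                        ("Ridge", Ridge(alpha=1.0)),
                        ("Lasso", Lasso(alpha=1.0))]:
        model.fit(X_tr, y_tr)
        print(name, "test R^2:", round(model.score(X_te, y_te), 3))

    # The lasso penalty drives many coefficients exactly to zero,
    # performing feature selection as a side effect
    lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)
    print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0), "of 50")

In practice the penalty strength alpha would be chosen by cross-validation (e.g. with LassoCV), a theme we will return to repeatedly.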

3. Classification: Bayes decision theory, Naïve Bayes and Bayesian networks; (Scaled) Logistic regression; LDA; Scaling decision trees to big data; Kernel methods and Support Vector Machines (SVMs) for classification and regression; Deep Learning and TensorFlow; dealing with class imbalance
(6 lectures; KJ: Chapter 11, 12, 13, 14.1, 14.2, 8.1, 16; JW 4, 8, 9; B 4.1-4.3.4, 6.1, 6.2, 7.1, 14.4; HTF Ch 4, 7.10, 9.2, 12, 13.3)
Objective: Understand predictive models where the desired outcome is a class label.
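As a preview, the sketch below (assumptions: synthetic imbalanced data, default hyperparameters) compares a logistic regression and an RBF-kernel SVM under cross-validation, using class weighting as one simple response to class imbalance:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic binary data with a 9:1 class imbalance
    X, y = make_classification(n_samples=500, n_features=20,
                               weights=[0.9, 0.1], random_state=0)

    # class_weight="balanced" reweights examples inversely to class frequency
    models = [("logistic regression",
               LogisticRegression(class_weight="balanced", max_iter=1000)),
              ("RBF-kernel SVM",
               SVC(kernel="rbf", class_weight="balanced"))]

    for name, clf in models:
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
        print("%s: AUC = %.3f +/- %.3f" % (name, auc.mean(), auc.std()))

Note the use of AUC rather than raw accuracy: with a 9:1 imbalance, always predicting the majority class already achieves 90% accuracy.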

4. Data Pre-Processing: Transformations, Imputations, Sampling, Outlier detection
(3 lectures; KJ: Chapter 3; JW 10.1, 10.2; B 12.1; HTF Ch 14.5, 14.8)
Objective: Understand that good data quality is a prerequisite for effective models, and study some methods for improving data quality.
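A minimal sketch of the kind of preprocessing pipeline we will build (the tiny array and choice of median imputation are illustrative assumptions; scikit-learn 0.20 or newer, where SimpleImputer lives in sklearn.impute):

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # A tiny table with missing entries (np.nan)
    X = np.array([[1.0, 200.0],
                  [2.0, np.nan],
                  [3.0, 180.0],
                  [np.nan, 220.0]])

    pre = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill in missing values
        ("scale", StandardScaler()),                   # zero mean, unit variance
    ])
    print(pre.fit_transform(X))

Wrapping such steps in a Pipeline ensures that imputation and scaling statistics are learned from training data only, avoiding information leakage into the test set.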

5. Modern Interactive Visualization: JavaScript, Bokeh, RStudio. Visual analytics.
(Interspersed with Topics 2 and 3 as needed)
Objective: Visualize both data and model outcomes for better understanding and communication.
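For example, a minimal Bokeh sketch (the data points and output filename are made up for illustration) that writes an interactive, zoomable scatter plot to a standalone HTML file:

    from bokeh.plotting import figure, output_file, show

    x = [1, 2, 3, 4, 5]
    y = [6, 7, 2, 4, 5]

    output_file("scatter.html")  # hypothetical output file
    p = figure(title="Interactive scatter",
               tools="pan,wheel_zoom,box_zoom,reset,hover")
    p.circle(x, y, size=12)
    show(p)  # opens the plot in a web browser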

6. Ensemble Methods: Model averaging; Bagging and Random Forests; Boosting and Gradient boosting; Bag of Little Bootstraps
(2 lectures; KJ: Chapter 14.3-14.8; JW 8.2; B 14.2, 14.3; HTF Ch 8.7, 8.8, 10.1-10.7, 16)
Objective: Understand the benefits of combining multiple predictive models.
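A minimal sketch (synthetic data; default hyperparameters are assumptions) showing how bagging and boosting ensembles typically improve on a single decision tree:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    models = [("single tree", DecisionTreeClassifier(random_state=0)),
              ("random forest", RandomForestClassifier(n_estimators=200,
                                                       random_state=0)),
              ("gradient boosting", GradientBoostingClassifier(random_state=0))]

    for name, clf in models:
        acc = cross_val_score(clf, X, y, cv=5)
        print("%s: accuracy = %.3f" % (name, acc.mean()))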

7. Clustering: k-means; hierarchical methods; graph partitioning; co-clustering; semi-supervised learning. Market Basket applications
(3 lectures; JW 10.3, 10.5, B 9.1, 9.2; HTF Ch 13.1, 13.2, 14.3, 14.4)
Objective: Learn issues involved in unsupervised learning and the trade-offs among alternative approaches to clustering.
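A minimal k-means sketch (the synthetic blob data is an assumption) that uses the silhouette score as one heuristic for choosing the number of clusters:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Synthetic data drawn from 4 well-separated clusters
    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print("k = %d: silhouette = %.3f" % (k, silhouette_score(X, labels)))

On data like this the silhouette score peaks near the true number of clusters; on real data the choice is rarely so clean, which is one of the trade-offs we will discuss.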

8. Specialized/Advanced Topics: (coverage depends on time available and interest of class)

(a) Multi-task Learning: Predictive analytics for (multi)-relational data; Use of auxiliary information such as networks;

(b) Stochastic Block Models. Applications to next-generation recommender systems; customer-product affinity detection;

(c) Ranking; Spam filtering

(d) Distributed Data Mining: overview; ADMM and distributed logistic regression

(e) Stream Data Mining: more SGD-type algorithms (see the sketch below)
(as time permits)
Objective: Study some important specialized situations where predictive models are deployed, including large-scale predictive modeling using multiple machines, e.g. cloud computing.
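To illustrate the streaming theme in (e), here is a minimal sketch (the batch size and synthetic stream are assumptions) of out-of-core learning with scikit-learn's SGDClassifier, which updates the model one mini-batch at a time via partial_fit, so the full dataset never needs to fit in memory:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

    clf = SGDClassifier(loss="log")  # logistic regression fit by SGD
                                     # (newer scikit-learn spells it "log_loss")
    classes = np.unique(y)           # partial_fit must see all labels up front

    for start in range(0, len(X), 1000):   # simulate a stream of mini-batches
        xb, yb = X[start:start + 1000], y[start:start + 1000]
        clf.partial_fit(xb, yb, classes=classes)

    print("accuracy on the data seen so far:", clf.score(X, y))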

Term Project Presentations and Discussion
(4 lectures)

Wildcards: A couple of classes may be used for invited talks by visiting experts.

Grading information

30%: Project (groups of 3-5): project outline plus two presentations (10%), and a term paper (20%) due May 10th, 11:59 pm
20%: Six assignments
15%: Four pop quizzes; best 3 scores counted
30%: Two 1-hour in-class tests (3/7 and 4/13)
5%: Class participation

There will be no final exam.

Quizzes will be held in class and will last 15-30 minutes. Their objective is to review key concepts introduced in class.

At the end of the course, you will receive a score out of 100 based on the percentages stated above. Your final grade will be based solely on this score. Grading is primarily on a curve, i.e., relative to how the whole class performs; however, the entire curve may shift up or down a bit depending on how the class as a whole performs relative to past classes. Grading is NOT based on absolute thresholds, e.g. 90+ = A.

ACADEMIC DISHONESTY AND POLICIES ON CHEATING: Faculty at UT are committed to detecting and responding to all instances of scholastic dishonesty and will pursue cases of scholastic dishonesty in accordance with university policy. Scholastic dishonesty, in all its forms, is a blight on our entire academic community. All parties in our community -- faculty, staff, and students -- are responsible for creating an environment that educates outstanding engineers, and this goal entails excellence in technical skills, self-giving citizenry, and ethical integrity. Industry wants engineers who are competent and fully trustworthy, and both qualities must be developed day by day throughout an entire lifetime. Scholastic dishonesty includes, but is not limited to, cheating, plagiarism, collusion, falsifying academic records, or any act designed to give an unfair academic advantage to the student. The fact that you are in this class as an engineering student is testament to your abilities. Penalties for scholastic dishonesty are severe and can include, but are not limited to, a written reprimand, a zero on the assignment/exam, re-taking the exam in question, an F in the course, or expulsion from the University. Don't jeopardize your career by an act of scholastic dishonesty. Details about academic integrity and what constitutes scholastic dishonesty can be found at the website for the UT Dean of Students Office and the General Information Catalog, Section 11-802.

Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."

NOTICES: