I. Course description | II. Course textbooks | III. Course outline | IV. Grading information |
The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract pieces of information useful for making smart business decisions.
In this course we will study a variety of techniques for data mining and machine learning. Some emphasis will be given to approaches that scale to very large data sets and/or remain relatively robust when faced with a large number of predictors, as well as to algorithms for heterogeneous or streaming data. Many of these capabilities are essential for handling BIG DATA. We will mostly be using Python (specifically, Scikit-Learn), though some problems in R may also be given.
The central goal of this course is to convey an understanding of the pros and cons of different data mining techniques, so that you can (i) make an informed decision on what approaches to consider when faced with real-life problems requiring predictive modeling, and (ii) apply models properly to real datasets so as to draw valid conclusions. This goal will be reinforced through both theory and hands-on experience.
The material for the lectures is taken from a wide variety of sources. My notes will be available via Canvas. There is no mandatory textbook for the course. However, I will often give pointers to supplementary readings from the following 4 books (some of these pointers are listed with the course topics below):
KJ, JW, HTF and B refer to the 4 books listed above
1. Overview: Data Mining process; Local vs. global models, and other trade-offs.
(2 lectures; KJ: Chapter 1,2; JW Chapters 1-3, B Ch 1, 2.1-2.3; HTF Ch 1, 2.1-2.6)
Objective: Provide overview and context for this class.
2. Regression: Multiple Linear Regression; Basis function expansion; Dealing with large number of features; Ridge, Lasso and Stagewise Approaches; Non-linear methods; Bayesian approach; and Multilevel models. User's guide.
(3 lectures; KJ: Chapter 4-6, 7.1, 7.4, 7.5; JW Ch 6; B 3.1, 3.2; HTF Ch 2.7, 2.8, 3.1-3.4, 7.1-7.3, 11.1-11.8)
Objective: Learn to understand predictive models where the desired outcome is a numeric quantity.
3. Classification: Bayes decision theory, Naïve Bayes and Bayesian networks; (Scaled) Logistic regression; LDA; Scaling decision trees to big data; Kernel methods and Support Vector Machines (SVMs) for classification and regression; Deep Learning and TensorFlow; Dealing with class imbalance.
(6 lectures; KJ: Chapter 11, 12, 13, 14.1, 14.2, 8.1, 16; JW 4, 8, 9; B 4.1-4.3.4, 6.1, 6.2, 7.1, 14.4; HTF Ch 4, 7.10, 9.2, 12, 13.3)
Objective: Learn to understand predictive models where the desired outcome is a class label.
4. Data Pre-Processing: Transformations, Imputations, Sampling, Outlier detection
(3 lectures; KJ: Chapter 3; JW 10.1, 10.2; B 12.1; HTF Ch 14.5, 14.8)
Objective: Understand that good data quality is a pre-requisite for effective models, and study some methods for improving data quality.
5. Modern Interactive Visualization: JavaScript, Bokeh, RStudio. Visual Analytics.
(Interspersed with Topics 2 and 3 as needed)
Objective: Visualize data as well as model outcomes for better understanding and communication.
6. Ensemble Methods: Model Averaging, Bagging and Random forests, boosting, Gradient boosting; Bag of Little Bootstraps
(2 lectures; KJ: Chapter 14.3-14.8; JW 8.2; B 14.2, 14.3; HTF Ch 8.7, 8.8, 10.1-10.7, 16)
Objective: Understand the benefits of combining multiple predictive models.
7. Clustering: k-means; hierarchical methods, graph partitioning; co-clustering, semi-supervised learning. Market Basket applications
(3 lectures; JW 10.3, 10.5, B 9.1, 9.2; HTF Ch 13.1, 13.2, 14.3, 14.4)
Objective: Learn issues involved in unsupervised learning and the trade-offs among alternative approaches to clustering.
8. Specialized/Advanced Topics: (coverage depends on time available and interest of class)
(a) Multi-task Learning: Predictive analytics for (multi)-relational data; Use of auxiliary information such as networks;
(b) Stochastic Block Models. Applications to Next Generation Recommender Systems; Customer-product affinity detection;
(c) Ranking; Spam filtering
(d) Distributed Data Mining. Overview; ADMM and distributed logistic regression
(e) Stream Data Mining. More SGD-type algorithms
(as time permits)
Objective: Study some important specialized situations where predictive models are deployed, including large-scale predictive modeling using multiple machines, e.g. cloud computing.
Term Project Presentations and Discussion
(4 lectures)
10+25% | Project (groups of 3-5): (project outline + presentation) + term paper due in May |
25% | 5 Assignments |
20% | 5 pop-quizzes. Best 4 scores counted |
20% | In-class exam (April) |
There will be no final exam.
Quizzes will be held in class and will last 15-30 minutes. Their objective is to review key concepts introduced in class.
At the end of the course, you will get a score out of 100 based on the percentages stated above. Your final grade will be based solely on this score. The grade is primarily based on the curve, i.e. it is relative to how the whole class performs; however, the entire curve may shift up or down a bit depending on how the class as a whole performs relative to past classes. Grading is NOT based on absolute thresholds, e.g. 90+ = A etc.
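As a rough illustration of how the score out of 100 could be assembled from the percentages above, here is a minimal Python sketch. It rests on several assumptions not stated in the syllabus: every component is expressed as a percentage in [0, 100], the two project parts are combined into a single 35% component, and the five assignments are weighted equally. The function name `course_score` and its parameters are hypothetical.

```python
def course_score(project, assignments, quizzes, exam):
    """Combine component scores (each a percentage in [0, 100]) into a score out of 100.

    Assumptions (not from the syllabus): the project is graded as one 35% block,
    and all assignments carry equal weight within their 25% share.
    """
    # Only the best 4 of the 5 quiz scores are counted.
    best_quizzes = sorted(quizzes, reverse=True)[:4]
    quiz_avg = sum(best_quizzes) / len(best_quizzes)

    assignment_avg = sum(assignments) / len(assignments)

    # Weights from the grading table: 35% project, 25% assignments,
    # 20% quizzes, 20% in-class exam.
    return (35 * project + 25 * assignment_avg + 20 * quiz_avg + 20 * exam) / 100


# Example: the lowest quiz score (50) is dropped before averaging.
score = course_score(
    project=80,
    assignments=[90, 80, 70, 100, 60],
    quizzes=[50, 100, 100, 100, 100],
    exam=70,
)
print(score)
```

Because letter grades are curved, this number does not map to a letter on its own; it only fixes your position relative to the rest of the class.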
ACADEMIC DISHONESTY AND POLICIES ON CHEATING: Faculty at UT are committed to detecting and responding to all instances of scholastic dishonesty and will pursue cases of scholastic dishonesty in accordance with university policy. Scholastic dishonesty, in all its forms, is a blight on our entire academic community. All parties in our community -- faculty, staff, and students -- are responsible for creating an environment that educates outstanding engineers, and this goal entails excellence in technical skills, self-giving citizenry, and ethical integrity. Industry wants engineers who are competent and fully trustworthy, and both qualities must be developed day by day throughout an entire lifetime. Scholastic dishonesty includes, but is not limited to, cheating, plagiarism, collusion, falsifying academic records, or any act designed to give an unfair academic advantage to the student. The fact that you are in this class as an engineering student is testament to your abilities. Penalties for scholastic dishonesty are severe and can include, but are not limited to, a written reprimand, a zero on the assignment/exam, re-taking the exam in question, an F in the course, or expulsion from the University. Don't jeopardize your career by an act of scholastic dishonesty. Details about academic integrity and what constitutes scholastic dishonesty can be found at the website for the UT Dean of Students Office and the General Information Catalog, Section 11-802.
Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."
NOTICES: