Data Mining Seminar Abstracts


 


Geoff Gordon
Carnegie Mellon University
Thr, Feb. 23rd, 2012

Spectral learning algorithms for dynamical systems

If we hope to build an intelligent agent, we must solve (at least!) the following problem: by watching an incoming stream of sensor data, hypothesize a model of the external world that explains the data. An appealing representation for such a model is a dynamical system—a recursive rule for updating a state, i.e., a concise summary of past experience that we can use to predict future observations. Unfortunately, to discover the right dynamical system, we must solve difficult temporal and structural credit assignment problems; often, the result is a search space with a host of bad local optima, and disappointing performance in practice. However, recent work on spectral learning algorithms may provide an end run around these problems: these new spectral algorithms are computationally efficient, simple to implement, statistically consistent, and have no local optima. Perhaps even more interesting, they unify and generalize a number of other state-of-the-art learning methods, including Tomasi-Kanade structure from motion, subspace identification for Kalman filters and hidden Markov models, a novel approach to range-only SLAM, and Laplacian Eigenmaps for manifold learning.
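As a purely illustrative sketch of the spectral idea (not the algorithms from the talk), the snippet below recovers the state dimension of a linear dynamical system by taking an SVD of the empirical covariance between windows of past and future observations, the step shared by subspace identification methods; all sizes and noise levels are made up.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate a 2-dimensional latent linear dynamical system with 3-d observations.
    A = np.array([[0.9, 0.1], [-0.1, 0.9]])   # stable state transition
    C = rng.normal(size=(3, 2))               # observation map
    x, obs = np.zeros(2), []
    for _ in range(5000):
        x = A @ x + 0.1 * rng.normal(size=2)
        obs.append(C @ x + 0.1 * rng.normal(size=3))
    obs = np.array(obs)

    # Stack windows of k past and k future observations around each time step.
    k = 5
    idx = range(k, len(obs) - k)
    past = np.array([obs[t - k:t].ravel() for t in idx])
    future = np.array([obs[t:t + k].ravel() for t in idx])

    # Spectral step: SVD of the empirical past-future covariance matrix.
    U, s, Vt = np.linalg.svd(past.T @ future / len(past))
    print(np.round(s[:6], 4))   # a sharp drop after the 2nd value suggests a 2-d state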

bio: Gordon is an Associate Research Professor in the Department of Machine Learning at Carnegie Mellon University, and co-director of the Department's Ph.D. program. He works on multi-robot systems, statistical machine learning, game theory, and planning in probabilistic, adversarial, and general-sum domains. His previous appointments include Visiting Professor at the Stanford Computer Science Department and Principal Scientist at Burning Glass Technologies in San Diego. Dr. Gordon received his B.A. in Computer Science from Cornell University in 1991, and his Ph.D. in Computer Science from Carnegie Mellon University in 1999.

 

Jaideep Vaidya
Rutgers University
Thr, Feb. 9th, 2012

The Role Mining Problem - A formal perspective

Role-based access control is well accepted as the standard best practice for access control within applications and organizations. Role engineering, the task of defining roles and associating permissions with them, is essential to realizing the full benefits of the role-based access control paradigm. The essential question is how to devise a complete and correct set of roles -- the answer depends on how goodness or interestingness is defined (when is a role good or interesting?). We formulate the role mining problem (RMP) as a Boolean matrix decomposition problem: the optimal decomposition corresponds to an optimal set of roles that can describe the existing user permissions. In addition to the basic RMP, we introduce several variations, including an extended Boolean matrix decomposition with pragmatic implications. By placing role mining in the context of Boolean matrix decomposition, our results become applicable to several other domains, including text mining and knowledge discovery from databases.
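To make the decomposition concrete, here is a toy greedy heuristic (not the RMP algorithms from the talk) that factors a small user-permission matrix into Boolean user-role and role-permission factors; the data and the candidate-role choice are invented for illustration.

    import numpy as np

    # Toy user-permission assignment matrix (users x permissions).
    UPA = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 0, 1, 1],
                    [1, 1, 1, 1]], dtype=bool)

    roles, assignments = [], []
    covered = np.zeros_like(UPA)
    candidates = {tuple(row) for row in UPA}   # candidate roles: the users' own permission sets
    while not np.array_equal(covered, UPA):
        best, best_gain = None, 0
        for cand in candidates:
            role = np.array(cand, dtype=bool)
            users = np.all(UPA >= role, axis=1)              # users whose permissions contain the role
            gain = (np.outer(users, role) & ~covered).sum()  # newly covered 1-entries
            if gain > best_gain:
                best, best_gain = (users, role), gain
        users, role = best
        roles.append(role)
        assignments.append(users)
        covered |= np.outer(users, role)

    UA = np.array(assignments).T    # user-role matrix
    PA = np.array(roles)            # role-permission matrix
    # The Boolean product of the two factors reconstructs the original matrix.
    assert np.array_equal((UA.astype(int) @ PA.astype(int)) > 0, UPA)
    print(len(roles), "roles")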

bio: Jaideep Vaidya is an Associate Professor of Computer Information Systems at Rutgers University. He received his Master's and Ph.D. from Purdue University and his Bachelor's degree from the University of Mumbai. His research interests are in Data Mining, Data Management, Privacy, and Security. He has published over 60 papers in international conferences and archival journals, and has received three best paper awards from premier conferences in data mining and databases. He is also the recipient of an NSF CAREER Award and a Rutgers Board of Trustees Research Fellowship for Scholarly Excellence.

 

Deepak Agarwal
Yahoo! Research
Tue, Nov. 15th, 2011

Targeting Users in Display Advertising through Factor Models

In performance-based display advertising, campaign effectiveness is often measured in terms of conversions that represent some desired user actions, like purchases and product information requests on advertisers' web sites. Hence, identifying and targeting potential converters is of vital importance to boost campaign performance. This is often accomplished by marketers who define the user base of campaigns based on behavioral, demographic, contextual, geographic, search, social, purchase, and other characteristics. Such a process is manual and subjective, and it often fails to utilize the full potential of targeting. In this paper we show that by using past converted users of campaigns and campaign meta-data (e.g., ad creatives, landing pages), we can combine disparate user information in a principled way to effectively and automatically target converters for new/existing campaigns. At the heart of our approach is a factor model that estimates the affinity of each user feature to a campaign using historical conversion data. In fact, our approach allows building a conversion model for a brand new campaign through campaign meta-data alone, and hence targets potential converters even before the campaign is run. Through extensive experiments, we show the superiority of our factor model approach relative to several other baselines. Moreover, we show that the performance of our approach at the beginning of a campaign's life is typically better than the other models even when they are trained using all conversion data after campaign completion. This clearly shows the importance and value of using historical campaign data in constructing an effective audience selection strategy for display advertising. --This is joint work with Sandeep Pandey at Yahoo! Research.
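A minimal sketch of the kind of affinity computation described above: user features are mapped into a latent space and matched against a campaign factor, fit by logistic SGD on conversion labels. The dimensions, learning rate, and synthetic data are all invented, and this is not the paper's exact model.

    import numpy as np

    rng = np.random.default_rng(1)
    n_user_feats, n_campaigns, k = 50, 10, 5
    W = 0.01 * rng.normal(size=(n_user_feats, k))   # user-feature factors
    V = 0.01 * rng.normal(size=(n_campaigns, k))    # campaign factors

    def conversion_prob(x, c):
        """Predicted conversion propensity of a user with binary features x for campaign c."""
        return 1.0 / (1.0 + np.exp(-(x @ W) @ V[c]))

    # One pass of logistic SGD over synthetic (user features, campaign, converted?) events.
    lr = 0.1
    for _ in range(20000):
        active = rng.random(n_user_feats) < 0.1      # sparse binary user features
        x = active.astype(float)
        c = rng.integers(n_campaigns)
        y = float(rng.random() < 0.05)               # observed conversion label (random here)
        g = conversion_prob(x, c) - y                # gradient of the logistic loss wrt the score
        W[active] -= lr * g * V[c]
        V[c] -= lr * g * (x @ W)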

 

Edo Airoldi
Harvard University
Thr, Nov. 10th, 2011

Stochastic blockmodels and other models of biological and information networks

Observations consisting of measurements on pairs of objects (or conditions) arise in a number of settings: in the biological sciences (www.yeastgenome.org), in collections of scientific publications (www.jstor.org) and other hyper-linked resources (www.wikipedia.org), and in social networks (www.linkedin.com). Analyses of such data typically aim at identifying structure among the units of interest, in a low-dimensional space, to support the generation of substantive hypotheses, to partially automate semantic categorization, to facilitate browsing, and, more generally, to simplify complex data into useful patterns. In this lecture, we will survey a few exchangeable graph models. We will then focus on the stochastic blockmodel and show its utility as a quantitative tool for exploring static and dynamic networks. Within this modeling context, we discuss alternative specifications and extensions that address fundamental issues in the analysis of complex interacting systems: bridging global and local phenomena, data integration, dynamics, and scalable inference.
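For readers unfamiliar with the stochastic blockmodel, here is a minimal generative sketch (block labels, sizes, and probabilities are made up; the talk's inference and extensions are not shown):

    import numpy as np

    rng = np.random.default_rng(2)
    n, K = 90, 3
    z = rng.integers(K, size=n)                  # latent block assignment of each node
    B = np.full((K, K), 0.02) + np.eye(K) * 0.2  # block-to-block edge probabilities

    P = B[z][:, z]                               # edge probability for every node pair
    A = rng.random((n, n)) < P                   # sampled adjacency matrix (directed)
    np.fill_diagonal(A, False)

    # Within-block edges should be much denser than between-block edges.
    same = z[:, None] == z[None, :]
    print("within-block density:", A[same].mean(), "between-block density:", A[~same].mean())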

bio: Edo M. Airoldi joined the faculty of Harvard University in January 2009, as an Assistant Professor of Statistics. He is also a member of the Center for Systems Biology in the Faculty of Arts and Sciences, and a faculty member in the Systems Biology and Quantitative Genomics Ph.D. programs. Before joining Harvard, he was a postdoctoral fellow at Princeton University in the Department of Computer Science and the Lewis-Sigler Institute for Integrative Genomics. He received a bachelor's degree in mathematical statistics from Bocconi University, and a Ph.D. in computer science from Carnegie Mellon University, for which he was awarded the Leonard J. Savage Award honorable mention from the International Society for Bayesian Analysis in 2007. His research interests include statistical methodology and theory with applications to molecular biology and integrative genomics, computational social science, and statistical analysis of large biological and information networks. He is a Big Think Delphi Fellow (2011-), and received the John Van Ryzin Award (2006) from the International Biometric Society.

 

Peter Hoff
University of Washington
Fri, Nov. 4th, 2011

Separable Models for Multiway Array Data

Modern datasets are often in the form of matrices or arrays, potentially having correlations along each set of data indices. For example, data involving repeated measurements of several variables over time may exhibit temporal correlation as well as correlation among the variables. Social networks and other relational datasets may exhibit correlation among the senders and receivers of relationships. In this talk I will develop a class of statistical models for the analysis of such array-valued data. Specifically, I will extend the matrix normal model for matrix-valued data to a class of array normal distributions having separable covariance structure. We relate this model to the higher-order SVD for analysis of array data, and show how the model can be motivated in terms of a latent variable representation. Model fitting and parameter estimation for array normal distributions will be illustrated in the analysis of several examples, including international trade networks, imputation for nation-specific life tables, and estimation of high-order interactions in ANOVA models.
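To make the separable-covariance idea concrete, here is a small sketch of the two-way (matrix normal) case with arbitrary example covariances; the array normal model of the talk extends this construction to more than two index sets.

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 4, 3
    Sigma_row = 0.5 ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))  # e.g. temporal correlation
    Sigma_col = np.eye(n) + 0.3 * (np.ones((n, n)) - np.eye(n))               # e.g. correlation among variables

    # If Z has iid standard normal entries, X = A Z B^T has separable covariance:
    # flattening X row by row, Cov(vec(X)) = kron(Sigma_row, Sigma_col).
    A = np.linalg.cholesky(Sigma_row)
    B = np.linalg.cholesky(Sigma_col)
    X = A @ rng.normal(size=(m, n)) @ B.T       # one draw from the matrix normal

    # Numerical check of the Kronecker structure over many draws.
    draws = np.array([(A @ rng.normal(size=(m, n)) @ B.T).ravel() for _ in range(20000)])
    print(np.allclose(np.cov(draws, rowvar=False), np.kron(Sigma_row, Sigma_col), atol=0.05))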

 

Jeff Schneider
CMU
Fri, Oct. 28th, 2011

Learning Dynamic Models with Non-Sequenced Data

Virtually all methods of learning dynamic systems from data start with the same basic assumption: the learning algorithm will be given a time sequence of data generated from the dynamic system. We consider the case where the training data comes from the system's operation but with no temporal ordering. The data are simply drawn as individual disconnected points. While making this assumption may seem absurd at first glance, we observe that many scientific modeling tasks have exactly this property.
We propose several methods for solving this problem. We write down an approximate likelihood function that may be optimized to learn dynamic models and show how kernel methods can be used to obtain non-linear models. We propose an alternative method that focuses on achieving temporal smoothness in the learned dynamics. Finally, we consider the case where a small amount of sequenced data is available along with a large amount of non-sequenced data. We propose the use of the Lyapunov equation and the non-sequenced data to provide regularization when performing regression on the sequenced data to learn a dynamic model. We demonstrate our methods on synthetic data and describe the results of our analysis of some bioinformatics data sets.
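As a rough sketch of the last idea (not the paper's exact objective), the non-sequenced sample covariance can enter the fit through the discrete Lyapunov equation Sigma = A Sigma A' + Q, which any stable linear model must satisfy at stationarity; a penalty on its residual then regularizes the regression on the few sequenced pairs. The names, weights, and squared-error form below are assumptions.

    import numpy as np

    def lyapunov_regularized_loss(A_flat, X_prev, X_next, Sigma_hat, Q, lam):
        """Squared regression error on sequenced pairs plus a penalty on the
        discrete Lyapunov residual computed from non-sequenced data."""
        d = Sigma_hat.shape[0]
        A = A_flat.reshape(d, d)
        fit = np.mean(np.sum((X_next - X_prev @ A.T) ** 2, axis=1))   # sequenced-pair fit
        lyap = np.sum((Sigma_hat - A @ Sigma_hat @ A.T - Q) ** 2)     # stationarity residual
        return fit + lam * lyap

    # Sigma_hat would be the sample covariance of the (plentiful) non-sequenced points,
    # e.g. Sigma_hat = np.cov(X_unordered, rowvar=False); the loss can then be handed
    # to any generic optimizer such as scipy.optimize.minimize.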

Bio: Dr. Jeff Schneider is an associate research professor in the Carnegie Mellon University School of Computer Science. He received his PhD in Computer Science from the University of Rochester in 1995. He has over 15 years of experience developing, publishing, and applying machine learning algorithms in government, science, and industry. He has dozens of publications and has given numerous invited talks and tutorials on the subject.
In 1995 Dr. Schneider co-founded and became CEO of Schenley Park Research, a company dedicated to bringing new machine learning algorithms to industry. In 2004 he developed a new machine-learning based CNS drug discovery system and spent two years as the CIO of Psychogenics to commercialize the system. Through his academic, commercial, and consulting efforts, he has worked with several dozen companies and government agencies around the world.

 

Erik Sudderth
Brown University
Fri, Oct. 14th, 2011

Improving the Flexibility and Reliability of Nonparametric Models

Topic models are learned via a statistical model of variation within document collections, but are designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We first describe a doubly correlated nonparametric topic (DCNT) model which captures all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. This structure can be explored via an intuitive graphical interface. While basic inference algorithms such as the Gibbs sampler are easily applied to hierarchical nonparametric Bayesian models, including the DCNT, in practice they can fail in subtle and hard-to-diagnose ways. We explore this issue via a simpler nonparametric topic model, the hierarchical Dirichlet process (HDP). Using a carefully crafted search algorithm which finds likely configurations under a marginalized representation of the HDP, we demonstrate the poor mixing behavior of conventional HDP sampling algorithms. In the process, we illustrate unrealistic biases of the "toy" datasets commonly used to validate inference algorithms. Applied to a collection of scientific documents, our search algorithm also reveals statistical features which are poorly modeled via the HDP.
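For readers unfamiliar with the stick-breaking construction mentioned above, a minimal sketch (truncated for computation; the concentration value is arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    alpha, T = 1.0, 50                      # concentration parameter, truncation level
    v = rng.beta(1.0, alpha, size=T)        # stick-breaking proportions
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # topic weights
    print(pi[:5], pi.sum())                 # weights decay; the sum approaches 1 as T grows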

 

Alex Smola
Yahoo! Research
Mon, Sept. 19th, 2011

Scaling Machine Learning to the Internet

In this talk I will give an overview of an array of highly scalable techniques for both observed and latent variable models. This makes them well suited for problems such as classification, recommendation systems, topic modeling and user profiling. I will present algorithms for batch and online distributed convex optimization to deal with large amounts of data, and hashing to address the issue of parameter storage for personalization and collaborative filtering. Furthermore, to deal with latent variable models I will discuss distributed sampling algorithms capable of dealing with tens of billions of latent variables on a cluster of 1000 machines. The algorithms described are used for personalization, spam filtering, recommendation, document analysis, and advertising.
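As a small illustration of the hashing idea mentioned for parameter storage (not Yahoo!'s implementation), features are indexed into a fixed-size weight vector by a hash, so personalized cross-features need no dictionary; the feature names and table size below are invented.

    import numpy as np

    D = 2 ** 18                                     # fixed parameter budget
    w = np.zeros(D)

    def hashed_index(feature: str) -> int:
        return hash(feature) % D                    # in practice a stable hash such as MurmurHash

    def score(features):
        return sum(w[hashed_index(f)] for f in features)

    # e.g. a personalized score for (hypothetical) user u123 on item i456:
    feats = ["user=u123", "item=i456", "user=u123^item=i456"]
    print(score(feats))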

Bio: Alex Smola studied physics in Munich at the University of Technology, Munich, at the Universita degli Studi di Pavia and at AT&T Research in Holmdel. During this time he was at the Maximilianeum München and the Collegio Ghislieri in Pavia. In 1996 he received the Master degree at the University of Technology, Munich and in 1998 the Doctoral Degree in computer science at the University of Technology Berlin. Until 1999 he was a researcher at the IDA Group of the GMD Institute for Software Engineering and Computer Architecture in Berlin (now part of the Fraunhofer Geselschaft). After that, he worked as a Researcher and Group Leader at the Research School for Information Sciences and Engineering of the Australian National University. From 2004 onwards Alex worked as a Senior Principal Researcher and Program Leader at the Statistical Machine Learning Program at NICTA. He is currently working as Principal Research Scientist at Yahoo! Research.

 

John Langford
Yahoo! Labs
Thr., Mar 24, 2011

Learning with Exploration

"The default successful paradigm of machine learning is supervised learning. Here humans are hired (often through mechanical turk these days) to label items, and then the labeled information is fed into a learning algorithm to create a learned predictor. A more natural, less expensive and accurate approach is observe what works in practice, and use this information to learn a predictor.
For people familiar with the first approach, there are several failure modes ranging from plain inconsistency to mere suboptimality. A core issue here is the need for exploration---because if a choice is not explored, we can't optimize for it. The need for exploration implies a need to use exploration to evaluate learned solutions, to guide the optimization of learned predictors, and to control the process of exploration so that it is carried out efficiently.
I will provide an overview of the problems which crop up in this area, and how to solve them. This includes new results on policy evaluation and optimization, with the first ever optimization-based exploration control algorithms for this setting."
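A minimal sketch of exploration-based policy evaluation in the contextual bandit setting (the basic inverse-propensity estimator, not the new algorithms announced in the talk); the logged-data format, reward process, and candidate policy are synthetic.

    import numpy as np

    rng = np.random.default_rng(5)
    n_actions = 4

    # Logged tuples (context, action taken, observed reward, probability of that action).
    logs = []
    for _ in range(10000):
        x = rng.normal(size=3)
        p = np.full(n_actions, 1.0 / n_actions)        # logging policy explored uniformly
        a = rng.choice(n_actions, p=p)
        r = float(rng.random() < 0.1 + 0.2 * (a == int(abs(x[0]) > 1)))   # unknown reward process
        logs.append((x, a, r, p[a]))

    def candidate_policy(x):
        return int(abs(x[0]) > 1)                      # the new policy we want to evaluate offline

    # Inverse propensity scoring estimate of the candidate policy's expected reward.
    ips = np.mean([r * (candidate_policy(x) == a) / p_a for x, a, r, p_a in logs])
    print("estimated reward of candidate policy:", ips)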

Bio: Dr. John Langford is a Senior Researcher at Yahoo! Research. His work includes research in machine learning, game theory, steganography, and Captchas. He was previously a Research Associate Professor at the Toyota Technological Institute in Chicago. He has worked in the past at IBM's Watson Research Center in Yorktown, NY under the Goldstine Fellowship. He earned a PhD in computer science from Carnegie Mellon University in 2002 and a Physics/Computer Science double major from Caltech in 1997.

 

David Pennock
Yahoo! Research
Fri., Mar 11, 2011

Crowdsourcing Predictions

"I will describe and compare various methods obtaining predictions ranging from asking an expert to leveraging the wisdom of crowds. I will focus mainly on prediction markets, a particularly effective crowdsourcing method. I will discuss combinatorial prediction markets, or markets designed to extract exponentially many predictions with a fixed budget."

Bio: David Pennock is a Principal Research Scientist at Yahoo! Research in New York City, where he leads a group focused on algorithmic economics. He has over fifty academic publications relating to computational issues in electronic commerce and the web, including papers in PNAS, Science, IEEE Computer, Theoretical Computer Science, AAAI, EC, and WWW. Reports on his research have appeared in Time, Discover, New Scientist, CNN, the Economist, and the New York Times. In 2005, he was named to MIT Technology Review’s list of 35 "top technology innovators under age 35".

 

Amir Atiya
Cairo University
Tue., Feb 15, 2011

Gaussian Process For Classification

"The Gaussian process classifier (GPC) is a very promising machine learning concept that is based on a probabilistic model of the underlying class probabilities. The model is based on a Bayesian formulation that combines the observed class memberships with an assumed prior distribution. While its counterpart, the Gaussian process regression (GPR), has a simple closed-form solution and therefore enjoys wide applicability, the GPC has some computational issues to overcome. It is given in terms of a prohibitive multi-integral formula, that can only be solved through some approximations or using lengthy algorithms. In this talk I will review the theory behind GPC's. I will review its Bayesian formulation, and its parameter estimation methods. In addition, I will propose a new algorithm that leads to an exact evaluation of the classification."

Bio: Amir Atiya was born in Cairo, Egypt. He received his B.S. degree in 1982 from Cairo University (Egypt), and the M.S. and Ph.D. degrees in 1986 and 1991 from Caltech, Pasadena, CA, all in electrical engineering. He held positions in academia, as well as several positions in financial firms. From 1997 to 2001 he was a Visiting Associate at Caltech. On leave from Cairo University, he recently held research positions in the firms Simplex Risk Management, Hong Kong, Countrywide Corporation in Los Angeles, and Dunn Capital Management, Florida. Currently, he is a Research Scientist with Veros Systems, Texas.
He has been active in research in several fields. He received the Egyptian State Prize for Best Research in Science and Engineering in 1994. He also received the Young Investigator Award from the International Neural Network Society in 1996. Recently, he received the prestigious Kuwait Prize from the Kuwait Foundation for the Advancement of Sciences. Currently, he is an Associate Editor for IEEE Transactions on Neural Networks. He was guest editor of the special issue (July 2001) of IEEE Transactions on Neural Networks on Neural Networks in Financial Engineering, and was guest editor of the special issue (September 2005) of IEEE Transactions on Neural Networks on Adaptive Learning Systems in Communications Networks. He served as Program Co-chair of the IEEE Conference on Computational Intelligence in Financial Engineering (CIFEr'03), Hong Kong, March 2003.

 

Nan Zhang
George Washington University
Thr., Jan 27, 2011

Data Exploration and Privacy Preservation Over Hidden Web Databases

"A large number of online databases are hidden behind form-like web interfaces which allow users to execute search queries by specifying desired (ranges of) attribute values of the sought-after tuple(s). We consider the problem of approximate query processing over hidden databases, and propose novel sampling-based techniques which use a small number of queries through the web interface of a hidden database to produce unbiased estimates with small variance. We also explain the threats posed by such sampling-based techniques on sensitive aggregate information of hidden databases. The protection of sensitive aggregates stands in contrast to the traditional privacy problem where individual tuples must be protected while ensuring access to aggregating information. We propose privacy-preserving techniques to thwart bots from sampling the hidden database to infer aggregate information.
* Collaborative work with Xin Jin of George Washington University, Arjun Dasgupta, Bradley Jewell, Anirban Maiti, and Dr. Gautam Das of University of Texas at Arlington, and Dr. Surajit Chaudhuri of Microsoft Research"

Bio: Dr. Nan Zhang is an Assistant Professor of Computer Science at the George Washington University, Washington, DC, USA. Prior to joining GWU, he was an assistant professor of Computer Science and Engineering at the University of Texas at Arlington from 2006 to 2008. He received the B.S. degree from Peking University in 2001 and the Ph.D. degree from Texas A&M University in 2006, both in computer science. His current research interests span security and privacy issues in databases, data mining, and computer networks, including privacy and anonymity in data collection, publishing, and sharing, privacy-preserving data mining, and wireless network security and privacy. He received the NSF CAREER award in 2008.

 

Rajeev Rastogi
Yahoo! Labs Bangalore
Fri., Dec 3, 2010

Building Knowledge Bases from the Web

"The web is a vast repository of human knowledge. Extracting structured data from web pages can enable applications like comparison shopping, and lead to improved ranking and rendering of search results. In this talk, I will describe two efforts at Yahoo! Labs to extract records from pages at web scale. The first is a wrapper induction system that handles end-to-end extraction tasks from clustering web pages to learning XPath extraction rules to relearning rules when sites change. The system has been deployed in production within Yahoo! to extract more than 200 million records from ~200 web sites. The second effort exploits content redundancy on the web to automatically extract records without human supervision. Starting with a seed database, we determine values in the pages of each new site that match attribute values in the seed records. We devise a new notion of similarity for matching templatized attribute content, and an apriori style algorithm that exploits templatized page structure to prune spurious attribute matches."

Bio: Rajeev Rastogi is the Vice President of Yahoo! Labs Bangalore, where he directs basic and applied research in the areas of web search, advertising, and cloud computing. Previously Rajeev was at Bell Labs, where he was a Bell Labs Fellow and the founding Director of the Bell Labs Research Center in Bangalore. Rajeev is active in the fields of databases, data mining, and networking, and has served on the program committees of several conferences in these areas. He currently serves on the editorial board of the CACM, and has been an Associate Editor for IEEE Transactions on Knowledge and Data Engineering in the past. He has published over 125 papers, and filed over 70 patents of which 40 have been issued. Rajeev received his B.Tech degree from IIT Bombay, and a PhD degree in Computer Science from the University of Texas, Austin.

 


Bee-Chung Chen
Yahoo! Research
Mon., Nov 22, 2010

Latent Factor Models for Web Recommender Systems

"The ability of recommending good content items to users is key to the success of Web sites like Yahoo!. Usually, the set of candidate items are large and dynamic, and different users have different tastes. In this talk, I will give an overview of how latent factor models can be effectively used to predict the "rating" that a user would give to an unseen item, in order to recommend the items with high predicted ratings. In particular, I will begin with matrix factorization methods, which have shown excellent performance in the Netflix competition, where most users and items have sufficient past rating data. However, in many Web applications, recommendations have to be made when we have no or little past rating data for many items or users. I will address this "cold-start" problem by extending matrix factorization with feature-based regression, topic modeling and online learning."

Bio: Bee-Chung Chen is currently a research scientist at Yahoo! Research. His research interests include recommender systems, large-scale data analysis and scalable methods for modeling and mining. He received his Ph.D. from the University of Wisconsin - Madison with an outstanding graduate student research award from the Department of Computer Sciences. His work on "cube-space data mining" was recognized by the ACM SIGMOD Doctoral Dissertation Award Honorable Mention. His recent work on "explore/exploit schemes for Web content optimization" won the ICDM 2009 best paper award. He is a key designer of the recommendation algorithms that power a number of major Yahoo! sites.

 


Nanda Kambhatla
IBM Research - India
Fri., Nov 12, 2010

Building Watson – A Brief Overview of DeepQA and the Jeopardy! Challenge

"A computer system that can directly and precisely answer natural language questions over an open and broad range of knowledge has been envisioned by scientists and writers since the advent of computers themselves. While current computers can store and deliver a wealth of digital content created by humans, they are unable to operate over it in human terms. The quest for building a computer system that can do open-domain Question Answering is ultimately driven by a broader vision that sees computers operating more effectively in human terms rather than strictly computer terms. They should function in ways that understand complex information requirements, as people would express them, for example, in natural language questions or interactive dialogs. Computers should deliver precise, meaningful responses, and synthesize, integrate, and rapidly reason over the breadth of human knowledge as it is most rapidly and naturally produced -- in natural language text. The DeepQA project at IBM shapes a grand challenge in Computer Science that aims to illustrate how the wide and growing accessibility of natural language content and the integration and advancement of Natural Language Processing, Information Retrieval, Machine Learning, Knowledge Representation and Reasoning, and massively parallel computation can drive open-domain automatic Question Answering technology to a point where it clearly and consistently rivals the best human performance. A first stop along the way is the Jeopardy! Challenge, where we are planning to build an automated system that will compete with human grandchampions in the game of Jeopardy!. In this talk, we will give an overview of the DeepQA project and the Jeopardy! Challenge."

Bio: Nanda Kambhatla has nearly two decades of research experience in the areas of Natural Language Processing (NLP), text mining, information extraction, dialog systems, and machine learning. He holds 7 U.S. patents and has authored over 40 publications in books, journals, and conferences in these areas. Nanda holds a B.Tech. in Computer Science and Engineering from the Institute of Technology, Banaras Hindu University, India, and a Ph.D. in Computer Science and Engineering from the Oregon Graduate Institute of Science & Technology, Oregon, USA. Currently, Nanda is the senior manager of the Human Language Technologies department at IBM Research - India, Bangalore. He leads a group of over 20 researchers focused on research in the areas of NLP, advanced text analytics (IE, IR, sentiment mining, etc.), speech analytics and statistical machine translation. Most recently, Nanda was the manager of the Statistical Text Analytics Group at IBM's T.J. Watson Research Center, the Watson co-chair of the Natural Language Processing PIC, and the task PI for the Language Exploitation Environment (LEE) subtask for the DARPA GALE project. He has been leading the development of information extraction tools/products, and his team has achieved top-tier results in successive Automatic Content Extraction (ACE) evaluations conducted by NIST for extracting entities, events and relations from text from multiple sources, in multiple languages and genres. Earlier in his career, Nanda worked on natural language web-based and spoken dialog systems at IBM. Before joining IBM, he worked on information retrieval and filtering algorithms as a senior research scientist at WiseWire Corporation, Pittsburgh, and on image compression algorithms as a postdoctoral fellow under Prof. Simon Haykin at McMaster University, Canada. Nanda's research interests are focused on NLP and technology solutions for creating, storing, searching, and processing large volumes of unstructured data (text, audio, video, etc.), and specifically on applications of statistical learning algorithms to these tasks.

 


Jieping Ye
Arizona State University
Thr., Nov 11, 2010

Large-Scale Structured Sparse Learning

"Recent advances in high-throughput technologies have unleashed a torrent of data with a large number of dimensions. Examples include gene expression pattern images, microarray gene expression data, and neuroimages. Variable selection is crucial for the analysis of these data. In this talk, we consider the structured sparse learning for variable selection where the structure over the features can be represented as a hierarchical tree, an undirected graph, or a collection of disjoint or overlapping groups. We show that the proximal operator associated with these structures can be computed efficiently, thus accelerated gradient techniques can be applied to scale structured sparse learning to large-size problems. We demonstrate the efficiency and effectiveness of the presented algorithms using synthetic and real data."

Bio: Jieping Ye is an Associate Professor of the Department of Computer Science and Engineering at Arizona State University. He received his Ph.D. in Computer Science from University of Minnesota, Twin Cities in 2005. His research interests include machine learning, data mining, and biomedical informatics. He won the outstanding student paper award at ICML in 2004, the SCI Young Investigator of the Year Award at ASU in 2007, the SCI Researcher of the Year Award at ASU in 2009, the NSF CAREER Award in 2010, and the KDD best research paper award honorable mention in 2010.

 


Joachim M Buhmann
Departement Informatik, ETH Zurich
Wed., June 16, 2010

Visual INFORMATION: How to find it in images?

"A picture is worth a thousand words!" But how to find this knowledge in images? Visual scene analysis as well as medical image processing pose fundamental problems for computer science since they relate the epistemological question "what information can be extracted from data" to the computational challenge "how efficiently can we extract relevant information from data". An information theoretic access to model validation is proposed which will address the question of structure sensitive information. Good generalization abilities of structure hypotheses are required to select models. We will demonstrate approaches to automatically extract object models from images by learning and to find discriminative features to predict survival times from tissue microarray analysis in cancer diagnosis.

Bio: Joachim M. Buhmann leads the group of "Pattern Recognition and Machine Learning" in the Department of Computer Science at ETH Zurich. He has been a full professor of Information Science and Engineering (Informatik) since October 2003. Born in 1959 in Friedrichshafen, Germany, he studied Physics at the Technical University Munich and obtained his PhD in Theoretical Biophysics with Professor Klaus Schulten. His doctoral thesis was about pattern recognition in neural networks. He then spent three years as a research assistant and assistant professor at the University of Southern California, Los Angeles. He spent 1991 at the Lawrence Livermore National Laboratory in California. He held the chair of practical Computer Science (praktische Informatik) at the University of Bonn, Germany from 1992 to 2003. His research interests span the areas of pattern recognition and data analysis, including machine learning, statistical learning theory and applied statistics. Application areas of his research include image analysis, remote sensing and bioinformatics. He has been on the technical committee of the German Pattern Recognition Society (Deutsche Arbeitsgemeinschaft fur Mustererkennung) since 1995, including serving on the board during 2000-2003. He is associate editor for IEEE Transactions on Neural Networks and IEEE Transactions on Image Processing.

 


Shi Zhong
Click Forensics
Thu, Apr. 22, 2010

Click Fraud and Traffic Quality Modeling for Online Advertising

Internet advertising has been growing fast in the last decade and driving a lot of new research in data mining, machine learning, information retrieval, pricing, and microeconomics. Click Forensics is a local startup company that provides ad click fraud detection and traffic scoring service and technology to online advertising companies (i.e., publishers, ad networks and advertisers). Dr. Zhong will discuss several interesting data mining problems that Click Forensics is working on, including click fraud detection, conversion prediction, keywords categorization and ad targeting segmentation, highlighting the challenges and solutions.

Bio: Dr. Shi Zhong is currently a principal scientist at Click Forensics, focusing on large-scale predictive modeling for ad targeting. Prior to joining Click Forensics, he worked as a senior scientist for Zilliant Science Group (2007-2009) and Yahoo Data Mining & Research Group (2005-2007), and as an assistant professor at Florida Atlantic University (2003-2005). He received a PhD in Data Mining from UT-ECE in 2003. He has published over 30 papers and book chapters on various clustering and semi-supervised learning algorithms and applications. He is the co-inventor of one awarded and six pending patents.

 


Ryan Adams
Dept. of CS, Univ. of Toronto
Mon, Mar. 15, 2010

Dependent Probabilistic Matrix Factorization

Probabilistic matrix factorization (PMF) is a powerful method for modeling data associated with pairwise relationships. PMF is used in collaborative filtering, computational biology, and document analysis among other application areas. In many of these domains, additional information is available that might assist in the modeling of the pairwise interactions. For example, when modeling movie ratings, we might know when the rating occurred, where the user lives, or what actors appear in the movie. It has been difficult, however, to incorporate this kind of side information into the probabilistic matrix factorization model. I will discuss some recent work to develop a nonparametric Bayesian framework for incorporating side information by coupling together multiple PMF problems via Gaussian process priors. I will show how we have used this model to successfully model the outcomes of NBA basketball games, allowing for the attributes of the teams to vary over time and to include home-team advantage. This is joint work with George Dahl and Iain Murray.

Bio: Ryan Adams is a CIFAR Junior Research Fellow at the University of Toronto. He received his PhD in 2009 from the University of Cambridge as a Gates Cambridge Scholar supervised by Prof. David MacKay. He grew up in Texas before leaving to attend the Massachusetts Institute of Technology. At the International Conference on Machine Learning in 2009, two of his papers were separately awarded honorable mentions for best paper and best student paper. He was recently announced as a finalist for the Savage Prize, given annually by the International Society for Bayesian Analysis.

 


Winter Mason
Yahoo! Research
Wed, Jan. 27, 2010

Inferring Social Networks From Interpersonal Communication

In order to construct and study large social networks from communication data, one must infer unobserved ties (e.g., i is connected to j) from observed communication events (e.g., i emails j). Often overlooked, however, is the impact this tie definition has on the corresponding network, and in turn the relevance of the inferred network to the research question of interest. We studied the problem of network inference and relevance for two email data sets of different size and origin. In each case, we generated a family of networks parameterized by a threshold condition on the frequency of emails exchanged between pairs of individuals. After demonstrating that different choices of the threshold correspond to dramatically different network structures, we then defined the relevance of these networks with respect to a series of prediction tasks that depend on various network features. In general, we find: a) that prediction accuracy is maximized over a non-trivial range of thresholds; b) that for any prediction task, choosing the optimal value of the threshold yields a sizable (~30%) boost in accuracy over naive choices; and c) that the optimal threshold value appears to be (somewhat surprisingly) consistent across data sets and prediction tasks.
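A toy sketch of the thresholding step being studied (the event data and threshold values are made up):

    from collections import Counter

    # Observed communication events (sender, recipient), e.g. emails.
    events = [("alice", "bob")] * 12 + [("alice", "carol")] * 2 + [("bob", "carol")] * 7
    counts = Counter(tuple(sorted(e)) for e in events)     # emails exchanged per pair

    def infer_network(counts, threshold):
        """Keep an undirected tie only if the pair exchanged at least `threshold` emails."""
        return {pair for pair, c in counts.items() if c >= threshold}

    for t in (1, 5, 10):
        print(t, infer_network(counts, t))   # different thresholds yield very different networks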

Bio: Winter Mason received his B.S. in Psychology from the University of Pittsburgh in 1999. After a few years working for the Biostatistics Center at George Washington University, he began graduate school at Indiana University. Winter received his Ph.D. in Social Psychology and Cognitive Science in 2007. His research interests include social influence in social networks, homophily and group identity, and group problem solving and group search.

 


Charles Elkan
Professor, Dept. of CSE, UCSD
Wed, Sep. 23, 2009

Accounting for burstiness of words in text mining

A fundamental property of language is that if a word is used once in a document, it is likely to be used again. Statistical models of documents applied in text mining must take this property into account, in order to be accurate. In this talk, I will describe how to model burstiness using a probability distribution called the Dirichlet compound multinomial. In particular, I will present a new topic model based on DCM distributions. The central advantage of topic models is that they allow documents to concern multiple themes, unlike standard clustering methods that assume each document concerns a single theme. On both text and non-text datasets, the new topic model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA).
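For concreteness, the DCM (Polya) likelihood of a word-count vector can be written directly in terms of gamma functions; the sketch below, with arbitrary parameter values, shows that with small parameters it assigns higher probability to a bursty count vector than to the same number of word occurrences spread over distinct words. This is a minimal illustration, not the talk's topic model.

    import numpy as np
    from scipy.special import gammaln

    def dcm_log_likelihood(counts, alpha):
        """Log-probability of a count vector under the Dirichlet compound multinomial."""
        n, A = counts.sum(), alpha.sum()
        return (gammaln(n + 1) - gammaln(counts + 1).sum()     # multinomial coefficient
                + gammaln(A) - gammaln(n + A)
                + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

    alpha = np.full(5, 0.1)                  # small alphas make repeated words likely (bursty)
    bursty = np.array([4, 0, 0, 0, 0])       # one word used four times
    spread = np.array([1, 1, 1, 1, 0])       # four distinct words
    print(dcm_log_likelihood(bursty, alpha) > dcm_log_likelihood(spread, alpha))   # True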

Bio: Charles Elkan is a professor in the Department of Computer Science and Engineering at the University of California, San Diego. In 2005/06 he was on sabbatical at MIT, and in 1998/99 he was visiting associate professor at Harvard. Dr. Elkan is known for his research in machine learning, data mining, and computational biology. The MEME algorithm he developed with his Ph.D. student Tim Bailey has been used in over 1000 publications in biology. Dr. Elkan has won several best paper awards and data mining contests, and his Ph.D. students have held tenure-track or equivalent positions at Columbia University, the University of Washington, the University of Queensland, other universities, and IBM Research.

 


Deepak Agarwal
Principal Research Scientist, Yahoo! Research
Wed, Sep. 16, 2009

Statistical Challenges in Recommender Systems for Web Applications

In this talk, I will begin with an overview of statistical challenges that arise in recommender system problems for web applications like content optimization and online advertising. I will then describe some modeling solutions for a content optimization problem that arises in the context of the Yahoo! Front Page. In particular, I will discuss time series models to track item popularity, explore-exploit/sequential design schemes to enhance performance, and matrix factorization models to personalize content to users. For some of the methods, I will present experimental results from an actual system at Yahoo!. I will also provide examples of other applications where the techniques are useful and end with a discussion of some open problems in the area.

Bio: Dr. Deepak Agarwal is currently a principal research scientist at Yahoo! Research. Prior to joining Yahoo!, he was a member of the statistics department at AT&T Research. His current research interests are in scalable statistical models for recommender systems and online advertising. In particular, he is interested in hierarchical modeling, sequential designs and matrix factorization methods. His other research interests include mining massive graphs, statistical models for social network analysis, anomaly detection using a time series approach, and spatial scan statistics for detecting hotspots. Deepak has been a co-author on three best paper awards (Joint Statistical Meetings, 2001; SIAM Data Mining, 2004; KDD, 2007). He is currently an associate editor for the Journal of the American Statistical Association and regularly serves on the program committees of data mining and machine learning conferences like KDD, WWW, SDM, ICDM, WSDM, ICML and NIPS.

 


Raghu Ramakrishnan
Chief Scientist for Audience and Cloud Computing, Yahoo!
Wed, Apr 22, 2009

Semantics on the Web: How do we get there?

It is becoming increasingly clear that the next generation of web search and advertising will rely on a deeper understanding of user intent and task modeling, and a correspondingly richer interpretation of content on the web. How we get there, in particular, how we understand web content in richer terms than bags of words and links, is a wide open and fascinating question. I will discuss some of the options here, and look closely at the role that information extraction can play.

Bio: Dr. Raghu Ramakrishnan is Chief Scientist for Audience and Cloud Computing at Yahoo!, and is a Research Fellow, heading the Web Information Management group in Yahoo! Research. His work in database systems, with a focus on data mining, query optimization, and web-scale data management has influenced query optimization in commercial database systems and the design of window functions in SQL:1999. His paper on the Birch clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he has written the widely-used text "Database Management Systems" (with Johannes Gehrke). Ramakrishnan is a Fellow of the ACM and IEEE and has received the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship, and an NSF Presidential Young Investigator Award.

 


George Karypis
Dept. of Computer Science & Engr., University of Minnesota
Tue, Apr 7, 2009

Advancing Chemical Genetics: Mining the Target-Ligand Activity Matrix

The recent development of various government- and university-funded screening centers has provided the academic research community with access to state-of-the-art high-throughput and high-content screening facilities that until recently were only available to the pharmaceutical industry. As a result, chemical genetics, which uses small organic molecules to alter the function of proteins, has emerged as an important experimental technique for studying and understanding complex biological systems. However, the methods used to develop small-molecule modulators (chemical probes) of specific protein functions and analyze the phenotypes induced by them have not kept pace with advances in the experimental screening technologies. Developing probes for novel protein targets remains a laborious process, whereas experimental approaches to identify the proteins that are responsible for the phenotypes induced by small molecules require a large amount of time and capital expenditures. New methods for developing probes and identifying the protein targets that are being modulated by small organic compounds are needed. Lack of such tools represents an important problem, as it impedes the identification of chemical probes for various proteins and reduces our ability to effectively analyze the experimental results in order to elucidate the molecular mechanisms underlying biological processes.
In this talk we present our research in developing data mining-driven computational methods to aid in the expansion of chemical genetics-based approaches for studying and understanding complex biological systems. The key hypothesis underlying this research is that the publicly available information associated with proteins and the molecules that modulate their functions, referred to as the target-ligand activity matrix, contains a wealth of information that, if properly analyzed, can provide insights connecting the structure of the chemical compounds (chemical space) to the structure of the proteins and their functions (biological space). This talk will illustrate the benefits that can be derived by analyzing the target-ligand activity matrix by presenting novel data mining methods for improving the accuracy of target-specific structure-activity-relationship (SAR) models and for identifying the proteins being targeted by compounds in phenotypic assays. Finally, we will conclude by describing some of our ongoing research in virtual library synthesis that is explicitly designed to go after novel areas of the biological space.

Bio: Prof. Karypis is Associate Professor in the Department of Computer Science and Engineering, University of Minnesota-Twin Cities. His research interests span the areas of data mining, bioinformatics, parallel processing, CAD, and scientific computing. His research in data mining is focused on developing innovative new algorithms for a variety of data mining problems, including clustering, classification, pattern discovery, and deviation detection, with an emphasis on business applications and information retrieval. His research in bioinformatics is focused on developing algorithms for understanding the function of genes and proteins in different species using data arising from genome-wide expression profiles, using data mining techniques to analyze expression profiles of genes and find groups of genes that behave similarly, and determining the underlying genetic regulatory network.

 


Joseph Sirosh
Amazon.com
Wed, March 19, 2008

How Analytics Is Revolutionizing E-Commerce

Machine learning and data mining applications are central to eCommerce. Recommendation systems, online risk management, behavioral ad targeting, targeted web and email marketing, online self-service systems, search relevance ranking etc. are just a few of the applications that make extensive use of data mining techniques to enable a unique customer experience on the web. This talk will explore the enormous wealth of opportunities for leveraging data to enhance the eCommerce experience. Using examples drawn from leading web sites, I will provide an overview of how some of these systems work and how they generate great value for customers. We'll then look to some emerging trends and what the future holds for data mining and machine learning on the web.

Bio: Joseph Sirosh is currently Vice President of Business Intelligence and Transaction Risk Management at Amazon.com. Prior to joining Amazon.com he worked at Fair Isaac Corporation as VP of the Advanced Technology R&D group, exploring advanced analytic and data mining applications. At Fair Isaac and at HNC Software prior to that, he has led several significant R&D projects on security and fraud detection, predictive modeling, information retrieval, content management, intelligent agents and bioinformatics. He has made important contributions in the field of neural network algorithms and in understanding the fundamental principles by which information is organized and processed in the brain. Dr. Sirosh has published over 20 technical papers and one book, and has been a lead investigator of various research grants from DARPA and other Government agencies. He holds a PhD and Masters in Computer Science from the University of Texas at Austin, and a B.Tech. in Computer Science & Engineering, from the Indian Institute of Technology, Chennai.

 


Katya Scheinberg
IBM T. J. Watson Research Center
Thurs, April 05, 2007

Active set methods for large-scale support vector machines

In the past decade, one of the most popular approaches to solving classification problems in machine learning has been the support vector machine (SVM). At the core of the SVM approach lies a convex quadratic programming (QP) problem. The approach has been shown to be very efficient on certain applications in terms of the resulting generalization error, and there have been numerous implementations targeting large-scale data sets. Most of these implementations, however, use inferior optimization methods to solve the convex QP, due to the inherent computational cost of higher-order methods. We will discuss how exploiting the special structure of SVM QPs leads to a competitive implementation of a classical active set method for convex QPs. We will also discuss the advantages of active set methods over the first-order methods used previously. One of the advantages is that the method readily adapts to a parametric mode which computes the entire regularization path for the SVM. The method is similar to that described in the paper by Hastie, Rosset, Tibshirani and Zhu. We will compare this method to a dual active set method which computes just one solution on the regularization path. In theory, this is as hard as computing the entire regularization path, but in practice it is not. We will describe the challenges of the parametric active set method, present a computational comparison of the two methods on large-scale classification problems, and discuss a possible approach to reducing the computational time by computing an approximate regularization path.
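For reference, the convex QP in question is the standard C-SVM dual (kernel k, regularization parameter C, labels y_i in {-1,+1}):

    \max_{\alpha}\ \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j k(x_i,x_j)
    \quad \text{subject to} \quad 0 \le \alpha_i \le C,\qquad \sum_{i=1}^{n}\alpha_i y_i = 0.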

Bio: Dr. Katya Scheinberg was born in Moscow, Russia. After 4 years of undergraduate study in applied mathematics at Moscow State University, she joined the Ph.D. program in the Department of Industrial Engineering and Operations Research at Columbia University. She received her Ph.D. in 1997. Her thesis work was dedicated to various aspects of semidefinite programming and interior point methods. During the last two years of her Ph.D. program she was awarded an IBM Fellowship for graduate students and began collaboration on derivative free methods for general nonlinear optimization. She is currently employed as a Research Staff Member in the Mathematical Sciences Department at the T.J. Watson Research Center, IBM, where she has been since 1997. Her research interests include theoretical and implementational issues of derivative free optimization (she released an open source software package called DFO and is currently working on a book on the subject), numerical stability and efficient algorithms for linear algebra in interior point methods, parametric optimization, conic linear optimization, and large-scale support vector machines (she is the author of an open source software package called SVM-QP). She has recently been involved in an effort to bring together the optimization and machine learning communities and has organized several workshops dedicated to the subject.

 


Renata Vieira
Universidade do Vale do Rio dos Sinos (UNISINOS)
Fri, March 30, 2007

Have I mentioned it?

Coreference resolution is a well-known NLP task, relevant for the more general task of information extraction, among others. A related problem is identifying whether a referring expression introduces a new entity in the text or follows previously mentioned ones. Definite descriptions are a type of referring expression that is highly ambiguous between these two roles. In fact, they sometimes lie between the two, when mentioning a new entity whose interpretation is anchored in given ones. When developing a classifier that distinguishes between these many roles of definite descriptions, we face not only the old AI problem of grasping common sense and world knowledge but also the problem of unbalanced data. In this talk I will present some experiments dealing with these problems and I will mention some NLP tasks that the classifier is useful for.

Bio: Dr. Renata Vieira is Professor in the Computer Science Department at Universidade do Vale do Rio dos Sinos, Sao Leopoldo, Brazil. She received her PhD from the University of Edinburgh in 1998. Her research interests cover issues in computational linguistics, artificial intelligence, including natural language understanding, discourse processing, agent communication, knowledge representation, ontologies and the semantic web.

 


Rong Jin
Michigan State University
Wed, March 14, 2007

Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Maximum margin clustering extends the theory of support vector machines to unsupervised learning, and has shown promising performance in recent studies. However, it has three major problems that limit its use in real-world applications: (1) it is computationally expensive and difficult to scale to large-scale datasets; (2) it requires data preprocessing to ensure that the clustering boundary passes through the origin, which makes it unsuitable for clustering unbalanced datasets; and (3) its performance is sensitive to the choice of kernel functions. In this paper, we propose the "Generalized Maximum Margin Clustering" framework that addresses the above three problems simultaneously.
The new framework generalizes the maximum margin clustering algorithm in that (1) it allows arbitrary clustering boundaries, including those that do not pass through the origin; (2) it significantly improves computational efficiency by reducing the number of parameters; and (3) it automatically determines the appropriate kernel matrix without any labeled data. Our empirical studies demonstrate the efficiency and the effectiveness of the generalized maximum margin clustering algorithm. Furthermore, in this talk, I will show the theoretical connection among spectral clustering, maximum margin clustering and generalized maximum margin clustering.

Bio: Dr. Rong Jin has been an Assistant Professor in the Computer Science and Engineering Department of Michigan State University since 2003. He works in the areas of statistical machine learning and its application to information retrieval. Dr. Jin holds a B.A. in Engineering from Tianjin University, an M.S. in Physics from Beijing University, and an M.S. and Ph.D. from the School of Computer Science of Carnegie Mellon University.

 


Andrew Tomkins
Yahoo! Research
Wed, May 3, 2006

Routing and growth models for online social networks

In this talk, I'll discuss some results from a number of large online social networks, including the LiveJournal blog hosting site, the Flickr photo sharing and tagging network, and the Yahoo! MyWeb2.0 social search application. I'll give some intuition about the denizens of these networks, covering demographics, interests, and locations. Then I'll discuss the small-world phenomenon in the context of online networks, and describe some results extending Kleinberg's work on small worlds to more general metric spaces and to non-uniform population densities. Finally, time permitting, I'll close with a discussion of the evolution of these networks over time.

This is joint work with Ravi Kumar, David Liben-Nowell, Jasmine Novak, and Prabhakar Raghavan.

Bio: Dr. Andrew Tomkins joined Yahoo! Research in 2005 from IBM. His research over the last eight years has focused on measurement, modeling, and analysis of context, communities, and users on the World Wide Web. Prior to joining Yahoo! Research, he managed the "Information Management Principles" group at IBM's Almaden Research Center, and served as Chief Scientist on the WebFountain project. He received Bachelors degrees in Mathematics and Computer Science from MIT, and a PhD in CS from Carnegie Mellon University.

 


Lise Getoor
University of Maryland - College Park
Fri, Feb 10, 2006

Link Mining

A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records.
Link mining includes both descriptive and predictive modeling of link data. Classification and clustering in linked relational domains require new data mining models and algorithms. Furthermore, with the introduction of links, new predictive tasks come to light. Examples include predicting the numbers of links, predicting the type of link between two objects, inferring the existence of a link, inferring the identity of an object, finding co-references, and discovering subgraph patterns.
In this talk, I will give an overview of this newly emerging research area. I will describe novel aspects of the modeling, learning and inference tasks. Then, as time permits, I will describe some of my group's recent work on link-based classification and entity resolution in linked data.

Bio: Dr. Lise Getoor is an assistant professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning and representing uncertainty in structured and semi-structured data. Her work in these areas has been supported by NSF, NGA, KDD, ARL and DARPA. In July 2004, she co-organized the third in a series of successful workshops on statistical relational learning, http://www.cs.umd/srl2004. She has published numerous articles in machine learning, data mining, database and AI forums. She is a member of the AAAI Executive Council, is on the editorial boards of the Machine Learning Journal and JAIR, and has served on numerous program committees including AAAI, ICML, IJCAI, KDD, SIGMOD, UAI, VLDB, and WWW.

 


Mehran Sahami
Google Inc.
Tues, May 31, 2005

Using Machine Learning to Improve Information Finding on the Web

Web search is one of the most important applications used on the Internet, and it also poses many interesting opportunities to apply machine learning. In order to better help people find relevant information in a growing sea of data, we discuss various machine learning techniques that can be harnessed to sift, organize, and present relevant information to users. In this talk, we provide a brief background on information retrieval, and then look at some of the challenges faced in searching the Web. We specifically examine applications of machine learning to improve information retrieval, image classification, topical inference of queries, and record linkage. We show how these tasks are directly related to the overarching goal of improving various aspects of search on the web.

Bio: Mehran Sahami is a Senior Research Scientist at Google and is also a Lecturer in the Computer Science Department at Stanford University. At Google, Mehran conducts research in machine learning and information retrieval technologies to help improve information access. Prior to joining Google, Mehran was involved in a number of commercial and research machine learning projects at E.piphany, Xerox PARC, SRI International, and Microsoft Research. He received his BS, MS, and PhD in Computer Science from Stanford, and evidently still has separation anxiety with respect to completely leaving there.

 


Thomas G. Dietterich
Oregon State University
Fri, Nov 19, 2004

Three Challenges for Machine Learning Research

Over the past 25 years, machine learning research has made huge progress on the problem of supervised learning. This talk will argue that now is the time to consider three new directions. The first direction, which is already being pursued by many groups, is Structural Supervised Learning in which the input instances are no longer independent but instead are related by some kind of sequential, spatial, or graphical structure. A variety of methods are being developed, including hidden Markov support vector machines, conditional random fields, and sliding window techniques. The second new direction is Transfer Learning in which something is learned on one task that can help with a second, separate task. This includes transfer of learned facts, learned features, and learned ontologies. The third new direction is Deployable Learning Systems. Today's learning systems are primarily operated offline by machine learning experts. They provide an excellent way of constructing certain kinds of AI systems (e.g., speech recognizers, handwriting recognizers, data mining systems, etc.). But it is rare to see learning systems that can be deployed in real applications in which learning takes place on-line and without expert intervention. Deployed learning systems must deal with such problems as changes in the number, quality, and semantics of input features, changes in the output classes, and changes in the underlying probability distribution of instances. There are also difficult software engineering issues that must be addressed in order to make learning systems maintainable after they are deployed.

Bio: Dr. Thomas G. Dietterich (AB Oberlin College 1977; MS University of Illinois 1979; PhD Stanford University 1984) joined the Oregon State University faculty in January 1985. In 1987, he was named an NSF Presidential Young Investigator. In 1990, he published, with Dr. Jude Shavlik, the book Readings in Machine Learning, and he also served as the Technical Program Co-Chair of the National Conference on Artificial Intelligence (AAAI-90). From 1992 to 1998 he held the position of Executive Editor of the journal Machine Learning. The American Association for Artificial Intelligence named him a Fellow in 1994, and the Association for Computing Machinery did the same in 2003. In 2000, he co-founded a new, free electronic journal, the Journal of Machine Learning Research. He served as Technical Program Chair of the Neural Information Processing Systems (NIPS) conference in 2000 and as General Chair in 2001. He is currently President of the International Machine Learning Society and a member of the DARPA Information Science and Technology Study Group, and he also serves on the Board of Trustees of the NIPS Foundation.

 


Foster Provost
New York University
Thurs, Nov 11, 2004

Classification and Learning with Networked Data: some observations and results

As information systems record and provide access to increasing amounts of data, connections between entities become available for analysis. Customer accounts are linked by communications and other transactions. Organizations are linked by joint activities. Text documents are hyperlinked. Such networked data create opportunities for improving classification. For example, for detecting fraud, a common and successful strategy is to use transactions to link a questionable account to previous fraudulent activity. Document classification can be improved by considering hyperlink structure. Marketing can change dramatically when customer communication is taken into account. In this talk I will focus on two unique characteristics of classification with networked data: (1) knowing the classifications of some entities in the network can improve the classification of others, and (2) very-high-cardinality categorical attributes (e.g., identifiers) can be used effectively in learned models. I will discuss methods for taking advantage of these characteristics, and will demonstrate them on various real and synthetic data sets.

This is joint work with Claudia Perlich and Sofus Macskassy.
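As a minimal illustration of characteristic (1), the sketch below implements a simple weighted-vote relational-neighbor style classifier: an unlabeled node is scored by a weighted vote over the labels of its already-labeled neighbors. This is an illustrative technique in the same spirit as the talk, not necessarily the speaker's actual method, and the toy graph and labels are made up.

    from collections import defaultdict

    def relational_neighbor_scores(edges, labels):
        """Score unlabeled nodes by a weighted vote over their labeled neighbors.

        edges  : iterable of (u, v, weight) tuples for an undirected graph
        labels : dict mapping some nodes to a known class label
        Returns {node: {class: score}} for nodes without a known label.
        """
        neighbors = defaultdict(list)
        for u, v, w in edges:
            neighbors[u].append((v, w))
            neighbors[v].append((u, w))

        scores = {}
        for node, nbrs in neighbors.items():
            if node in labels:
                continue  # already labeled
            votes = defaultdict(float)
            for nbr, w in nbrs:
                if nbr in labels:
                    votes[labels[nbr]] += w
            total = sum(votes.values())
            if total > 0:
                scores[node] = {c: v / total for c, v in votes.items()}
        return scores

    # Toy example: account "c" is strongly linked to a known-fraudulent account
    # and weakly linked to a legitimate one, so it leans roughly 2/3 toward "fraud".
    edges = [("a", "c", 2.0), ("b", "c", 1.0), ("a", "b", 1.0)]
    labels = {"a": "fraud", "b": "legit"}
    print(relational_neighbor_scores(edges, labels))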

Bio: Dr. Foster Provost is Associate Professor of Information Systems and NEC Faculty Fellow at New York University's Stern School of Business. He is Editor-in-Chief of the journal Machine Learning and a founding board member of the International Machine Learning Society. Professor Provost's recent research focuses on mining networked data, economic machine learning, and applications of machine learning and data mining. Previously, at NYNEX/Bell Atlantic Science and Technology, he studied a variety of applications of machine learning to telecommunications problems, including fraud detection, network diagnosis and monitoring, and customer contact management.

 


Michael J. Pazzani
Division Director, Information and Intelligent Systems, National Science Foundation
Thurs, Sept 30, 2004

Machine Learning for Personalized Wireless Portals

People have access to vast stores of information on the World Wide Web, ranging from online publications to electronic commerce. All this information, however, used to be accessible only while users were tethered to a computer at home or in an office. Wireless data and voice access to this vast store allows unprecedented access to information from any location at any time. The presentation of this information must be tailored to the constraints of mobile devices. Although browsing and searching are acceptable methods of locating information on the wired web, those operations soon become cumbersome and inefficient in the wireless setting, and impossible in voice interfaces. Small screens, slower connections, high latency, limited input capabilities, and the serial nature of voice interfaces present new challenges. This talk focuses on personalization techniques that are essential for the usability of handheld wireless devices.

Bio: Dr. Michael J. Pazzani is the Director of the Information and Intelligent Systems Division of the National Science Foundation. He received his Ph.D. in Computer Science from UCLA and is on leave from a full professorship at the University of California, Irvine, where he also served as chair of the Department of Information and Computer Science for five years. Dr. Pazzani serves on the Board of Regents of the National Library of Medicine. He is a Fellow of the American Association for Artificial Intelligence and has published numerous papers in machine learning, personalization, information retrieval, and cognitive science.

 


Soumen Chakrabarti
Indian Institute of Technology at Bombay
Fri, Aug 20, 2004

Is question answering an acquired skill?

We present a question answering (QA) system which learns how to detect and rank answer passages by analyzing questions and their answers (QA pairs) provided as training data. Our key technique is to recover, from the question, fragments of what might have been posed as a structured query, had a suitable schema been available. One fragment comprises _selectors_: tokens that are likely to appear (almost) unchanged in an answer passage. The other fragment contains question tokens which give clues about the _answer_type_, and are expected to be _replaced_ in the answer passage by tokens which _specialize_ or _instantiate_ the desired answer type. Selectors are like constants in where-clauses in relational queries, and answer types are like column names. We propose a simple conditional exponential model over a pair of feature vectors, one derived from the question and the other derived from a candidate passage. Features are extracted using a lexical network and surface context as in named entity extraction, except that there is no direct supervision available in the form of fixed entity types and their examples. We do not need any manually-designed question type system or supervised question classification. Using the exponential model, we filter candidate passages and see substantial improvement in the mean rank at which the first answer passage is found. With TREC QA data, our system achieves mean reciprocal rank (MRR) that compares favorably with the best scores in recent years, and generalizes from one corpus to another.

This is joint work with Ganesh Ramakrishnan, Deepa Paranjpe, and Pushpak Bhattacharyya.
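To fix ideas, a conditional exponential (log-linear) model of the kind described can be sketched as follows (in LaTeX notation; the exact features and normalization used in the talk may differ). Here f(q, p) is the feature vector derived from question q and candidate passage p, and w is the learned weight vector; candidate passages are filtered and ranked by this score:

    P\big(p \text{ answers } q \,\big|\, q, p\big) \;=\; \frac{\exp\!\big(w^\top f(q, p)\big)}{1 + \exp\!\big(w^\top f(q, p)\big)} .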

Bio: Dr. Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley, in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message-passing multiprocessors. He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Clever Web search project and led the Focused Crawling project. In 1999 he moved as Assistant Professor to the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, where he has been an Associate Professor since 2003. In Spring 2004 he was a Visiting Associate Professor at Carnegie Mellon. He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences, as well as in Scientific American, IEEE Computer, the VLDB Journal and other journals. He holds eight US patents on Web-related inventions. He has served as vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and as guest editor or editorial board member for the DMKD and TKDE journals. He is also the author of a new book on Web mining. His current research interests include question answering, Web analysis, monitoring and search, mining irregular and relational data, and textual data integration.

 


Arindam Banerjee
University of Texas at Austin
Fri, Apr 16, 2004

Bregman Clustering and Co-clustering

A wide variety of distortion measures, such as squared Euclidean distance and relative entropy, have been used in the literature for parametric clustering. In recent years, several such distortion measures have also been used for co-clustering, i.e., simultaneous clustering of the rows and columns of two-dimensional data matrices. In this talk, we discuss a unification of a large class of clustering and co-clustering techniques using Bregman divergences.
The talk has two parts. In the first part, we propose and analyze parametric hard clustering with Bregman divergences. The proposed class of algorithms unifies centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise from special choices of the Bregman divergence. Further, we show an explicit bijection between Bregman divergences and exponential families that leads to a simple soft clustering algorithm for all Bregman divergences. In the second part, we discuss a general framework for co-clustering wherein loss functions corresponding to all Bregman divergences can be used and various conditional-expectation-based constraints can be considered, giving rise to different parametric co-clustering models on a wide range of data matrices. The minimum Bregman information solution, a direct generalization of maximum entropy and least squares, plays a critical role in the analysis and leads to an elegant meta-algorithm guaranteed to achieve local optimality. Empirical evidence is provided to establish the generality and efficacy of the proposed framework.

This is joint work with Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha.
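As a concrete sketch (not the authors' code), Bregman hard clustering looks just like k-means except that the assignment step may use any Bregman divergence, while the centroid update remains the plain arithmetic mean, which is optimal for every Bregman divergence. The example below plugs in squared Euclidean distance; the relative-entropy variant is suitable for nonnegative, normalized rows such as word-count profiles.

    import numpy as np

    def squared_euclidean(x, c):
        return np.sum((x - c) ** 2)

    def kl_divergence(x, c, eps=1e-12):
        # relative entropy between nonnegative, normalized vectors x and c
        x, c = x + eps, c + eps
        return np.sum(x * np.log(x / c))

    def bregman_hard_clustering(X, k, divergence, n_iter=100, seed=0):
        """k-means-style alternating minimization for an arbitrary Bregman divergence."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # assignment step: nearest center under the chosen divergence
            assign = np.array([np.argmin([divergence(x, c) for c in centers]) for x in X])
            # update step: plain arithmetic mean per cluster (optimal for any Bregman divergence)
            new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return assign, centers

    # Toy usage: two well-separated blobs in the plane
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(1.0, 0.1, (20, 2))])
    labels, centers = bregman_hard_clustering(X, k=2, divergence=squared_euclidean)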

 


Inderjit S. Dhillon
University of Texas at Austin
Fri, Mar 26, 2004

Information-Theoretic Clustering, Co-clustering and Matrix Approximations

Given a two-dimensional co-occurrence matrix, we consider the problem of co-clustering or simultaneous clustering of the rows and columns. Viewing the (scaled) co-occurrence matrix as an empirical joint probability distribution of two discrete random variables, we can pose the co-clustering problem as an optimization problem in information theory --- the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. It can be shown that this objective is identical to that of finding a "low-complexity" non-negative matrix approximation that is closest in relative entropy to the given matrix under clustering constraints. Based on this formulation, we present a co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. We apply the algorithm to some problems in text mining: distributional clustering of words applied to text classification, and simultaneous clustering of words and documents. Experimental results indicate that distributional word clustering leads to improved text classification when the training data is small in size, and co-clustering can overcome problems due to high-dimensionality and sparsity. Generalizations to varied distortion measures, such as Bregman divergences, will be discussed.

This is joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu and D. Modha.
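In symbols (a sketch of the standard formulation, in LaTeX notation): writing X, Y for the row and column variables and \hat{X}, \hat{Y} for their clustered versions, the co-clustering described above seeks

    \min_{\hat{X}, \hat{Y}} \; \Big[ I(X; Y) - I(\hat{X}; \hat{Y}) \Big]
    \quad \text{subject to} \quad |\hat{X}| = k, \;\; |\hat{Y}| = \ell ,

i.e., row and column clusterings that lose as little mutual information as possible. This loss can be expressed as the relative entropy between the original joint distribution p(x, y) and a low-complexity approximation q(x, y) determined by the co-clustering, which is the matrix-approximation view mentioned in the abstract.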

 


Daniel Boley
University of Minnesota
Fri, Mar 12, 2004

Training Support Vector Machines via Adaptive Clustering

We discuss a way to speed up the training process for support vector machines by clustering the training data and using representative points in place of the full training set. An iterative procedure is proposed to refine the resulting support vector machine. This procedure eventually converges to the true support vector machine that would be obtained using all the original training data, but often the earlier approximate support vector machines suffice. The approach is modular in that the user can plug in arbitrary methods for both the clustering and the support vector machine training. Using off-the-shelf methods for both tasks, we illustrate the approach with an artificial dataset and a "real" dataset.
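A minimal sketch of the general idea, using off-the-shelf scikit-learn components rather than the speaker's implementation: cluster each class separately, train an SVM on the cluster centroids, then refine by adding original points that fall near the current decision boundary. The function name, parameters, and refinement rule are illustrative assumptions, and the sketch assumes exactly two class labels.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def svm_via_clustering(X, y, k_per_class=10, margin=1.0, n_refine=3):
        """Train an SVM on per-class cluster centroids, then refine near the boundary."""
        reps_X, reps_y = [], []
        for label in np.unique(y):
            Xc = X[y == label]
            k = min(k_per_class, len(Xc))
            centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc).cluster_centers_
            reps_X.append(centers)
            reps_y.append(np.full(k, label))
        train_X, train_y = np.vstack(reps_X), np.concatenate(reps_y)

        svm = SVC(kernel="rbf", gamma="scale").fit(train_X, train_y)
        for _ in range(n_refine):
            # add original points lying inside (or near) the current margin
            near = np.abs(svm.decision_function(X)) < margin   # assumes a binary problem
            if not near.any():
                break
            train_X = np.vstack([train_X, X[near]])
            train_y = np.concatenate([train_y, y[near]])
            svm = SVC(kernel="rbf", gamma="scale").fit(train_X, train_y)
        return svm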

Bio: Dr. Daniel Boley received his A.B. degree Summa Cum Laude in Mathematics and with Distinction in All Subjects from Cornell University in 1974, and his M.S. and Ph.D. degrees in Computer Science from Stanford University in 1976 and 1981, respectively. Since 1981, he has been on the faculty of the Department of Computer Science and Engineering at the University of Minnesota. He has had extended visiting positions at the Los Alamos Scientific Laboratory, the IBM Research Center in Zurich (Switzerland), the Australian National University in Canberra, Stanford University, and the University of Salerno (Italy). Dr. Boley is known for his past work on numerical linear algebra methods for control problems, parallel algorithms, iterative methods for matrix eigenproblems, error correction for floating point computations, inverse problems in linear algebra, as well as his more recent work on numerical methods in robot motion planning and unsupervised document categorization in data mining. He has been an editor for the SIAM Journal of Matrix Analysis and has chaired several technical symposia.

 


Jiawei Han
Department of Computer Science
University of Illinois - Urbana-Champaign
Fri, Oct 24, 2003

Mining Dynamics of Data Streams in Multidimensional Space

Traditional data mining systems and methods assume that data resides on disks or in main memory, and that a data mining process can scan the data sets multiple times to uncover interesting patterns. However, real-time production systems and other dynamic environments often generate tremendous (potentially infinite) volumes of stream data at high speed---the volume is too huge to be stored on disks, and even if it were stored, it would be too expensive to fetch and scan multiple times. Moreover, many applications may require real-time mining of unusual patterns in data streams, including finding unusual network or telecommunication traffic, real-time pattern mining in video surveillance, detecting suspicious online transactions or terrorist activities, and so on. Recently there has been substantial and growing research interest in developing methods for querying and mining stream data. In this talk, I will first present an overview of recent developments in stream data querying and mining and outline the major challenging research problems for mining the dynamics of data streams in multi-dimensional space. In particular, I will address the following issues in detail: (1) multi-dimensional online analysis methods for discovering unusual patterns in stream data; (2) mining time-sensitive frequent patterns in stream data; (3) mining clusters and outliers in stream data; and (4) dynamic stream classification methods.

Bio: Dr. Jiawei Han is a professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been conducting research in data mining, data warehousing, database systems, spatial and multimedia databases, deductive and object-oriented databases, Web databases, bio-medical databases, etc., with over 250 journal and conference publications. He has chaired or served on many program committees of international conferences and workshops, including the 2001 and 2002 SIAM Data Mining Conferences (PC co-chair), the 2004 and 2002 International Conferences on Data Engineering (PC vice-chair), ACM SIGKDD conferences (2001 best paper award chair, 2002 student award chair), and the 2002 and 2003 ACM SIGMOD conferences (2000 demo/exhibit program chair). He has also served or is serving on the editorial boards of Data Mining and Knowledge Discovery: An International Journal, IEEE Transactions on Knowledge and Data Engineering, and the Journal of Intelligent Information Systems. His textbook "Data Mining: Concepts and Techniques" (Morgan Kaufmann, 2001) has been widely adopted for data mining courses in universities. His research is currently supported by NSF, NSF/ITR, ONR, the University of Illinois, an IBM Faculty Award, and a gift from Microsoft Research.

 


David Page, Jr.
Department of Biostatistics and Medical Informatics, Department of Computer Sciences
University of Wisconsin - Madison
Tues, Aug 19, 2003

Machine Learning in Cancer Genetics

In recent years several novel types of genetic data have become available, and a primary application area for these types of data is in the study of cancer. Two of the most widely-publicized sources of such data are gene expression microarrays, or "gene chips," and single-nucleotide polymorphisms, or "SNPs." This talk will review these types of data and prior applications of machine learning to gene chip data. It then will present novel results using machine learning to analyze gene chip and SNP data for multiple myeloma, an incurable form of blood cancer. To our knowledge, the application to SNP data is the first such machine learning application. The talk will close with a brief discussion of other novel biological data types to which machine learning also should be applicable.

This is joint work with Michael Waddell (also of UW) and with John Shaughnessy, Fenghuang Zhan and Bart Barlogie of the Lambert Laboratory for Myeloma Genetics and the Myeloma Institute at the University of Arkansas for Medical Sciences.

 


Pedro Domingos
Department of Computer Science and Engineering
University of Washington
Fri, Oct 25, 2002

Learning From Networks of Examples

Most machine learning algorithms assume that examples are independent of each other, but many (or most) domains violate this assumption. For example, in real markets customers' buying decisions are influenced by their friends and acquaintances, but data mining for marketing ignores this (as does traditional economics). In this talk I will describe how we can learn models that account for example dependences, and use them to make better decisions. For example, in the marketing domain we are able to pinpoint the most influential customers, "seed" the network by marketing to them, and unleash a wave of word of mouth. We mine these models from collaborative filtering systems and knowledge-sharing Web sites, and show that they are surprisingly robust to imperfect knowledge of the network. I will also survey other applications of learning from networks of examples we are working on, including: combining link and content information in Google-style Web search; automatically translating between ontologies on the Semantic Web; and predicting the evolution of scientific communities.

This is joint work with Matt Richardson.

Bio: Dr. Domingos received his B.A. (1988) and M.S. (1992) in EECS from IST in Lisbon, and his M.S. (1994) and Ph.D. (1997) in Information and Computer Science from the University of California at Irvine. After spending two years as an assistant professor at IST, he joined the faculty of the University of Washington in 1999. He is the author or co-author of over 80 technical publications in machine learning, data mining, and other areas.

 


Andrew McCallum
WhizBang Labs & Carnegie Mellon University
Fri, Feb 22, 2002

Information Extraction with Finite State Models

Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text. Finite state machines are the dominant model for information extraction both in research and industry. In this talk I will give several examples of information extraction tasks at WhizBang Labs, and then describe two new finite state models designed to take special advantage of the multi-faceted nature of text on the web. Maximum Entropy Markov Models (MEMMs) are discriminative sequence models that allow each observation to be represented as a collection of arbitrary overlapping features (such as word identity, capitalization, part-of-speech and formatting, plus agglomerative features of the entire sequence and features from the past and future). Conditional Random Fields (CRFs) are a generalization of MEMMs that solve a fundamental limitation of MEMMs and all other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. I will introduce both models, skim over their parameter estimation algorithms, and present experimental results on real and synthetic data.

This is joint work with Fernando Pereira, John Lafferty and Dayne Freitag.
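Schematically (a simplified sketch in LaTeX notation; the actual feature functions are richer), the two models differ mainly in where the normalization sits. An MEMM normalizes each transition locally, while a CRF normalizes over entire label sequences, which is what removes the bias toward states with few successor states:

    \text{MEMM:} \quad P(s_{1:T} \mid o_{1:T}) \;=\; \prod_{t=1}^{T} \frac{\exp\!\big(w^\top f(s_{t-1}, s_t, o_t)\big)}{\sum_{s'} \exp\!\big(w^\top f(s_{t-1}, s', o_t)\big)}

    \text{CRF:} \quad P(s_{1:T} \mid o_{1:T}) \;=\; \frac{1}{Z(o_{1:T})} \prod_{t=1}^{T} \exp\!\big(w^\top f(s_{t-1}, s_t, o_t)\big),
    \qquad Z(o_{1:T}) \;=\; \sum_{s'_{1:T}} \prod_{t=1}^{T} \exp\!\big(w^\top f(s'_{t-1}, s'_t, o_t)\big) .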

Bio: Dr. Andrew McCallum received his BA from Dartmouth College and his PhD in Computer Science from the University of Rochester, and then held a Post-Doctoral Fellowship at Carnegie Mellon University. He is currently VP of Research and Development at WhizBang! Labs, a company that does machine-learning-based information extraction from the Web. His interests include machine learning and text mining, especially information extraction, document classification, and techniques for learning from limited labeled training data.

 


Oren Etzioni
Department of Computer Science and Engineering
University of Washington
Fri, Nov 2, 2001

Adaptive Web Sites

Today's web sites are intricate but not intelligent; while web navigation is dynamic and idiosyncratic, all too often web sites are fossils cast in HTML. This talk describes our research on Adaptive Web Sites: sites that automatically improve their organization and presentation by learning from visitor access patterns. Adaptive web sites mine the data buried in web server logs to produce more easily navigable web sites.

To demonstrate the feasibility of adaptive web sites, I will describe the problem of index page synthesis and sketch a solution that relies on novel clustering and conceptual clustering techniques. Our preliminary experiments show that high-quality candidate index pages can be generated automatically, and that our techniques outperform existing methods (including the Apriori algorithm, K-means clustering, hierarchical agglomerative clustering, and COBWEB) in this domain.

 


Foster Provost
New York University
Fri, Oct 26, 2001

Reviving the Learning Curve: A critical study of machine learning performance

Off-the-shelf machine-learning algorithms now are used widely both for manual data analysis and as important components for building intelligent systems. We revive the "learning curve," which relates learning performance to training-set size, to examine two of the most popular families of off-the-shelf classifier induction algorithms: decision-tree inducers and logistic regression. This new analysis (conducted using 37 large data sets, comparing classification and ranking performance) shows some remarkable things. Most generally, one cannot safely ignore learning curves when comparing algorithms; on the same data sets, conclusions can be strikingly different depending on training-set size. This calls into question the conclusions of innumerable comparison studies that do not factor the size of the data set into the analysis. More specifically, looking carefully at the learning curves of these two popular induction algorithm families, we see clearly that each is preferable for a different type of data.

 


Sridhar Rajagopalan
IBM Almaden Research Center
Fri, Oct 5, 2001

Three applications of data mining principles

In this talk, I will describe three applications of data mining algorithms similar to Agrawal and Srikant's Apriori algorithm. As we go through the sequence, we will see increasing use of simple graph-theoretic methods, and explain why the even simpler base technology fails.
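For readers unfamiliar with the Apriori algorithm the talk builds on, the sketch below shows its core loop: size-k candidate itemsets are generated only from frequent (k-1)-itemsets, using the fact that every subset of a frequent itemset must itself be frequent. This is textbook Apriori, not the speaker's code, and the toy baskets are made up.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return all itemsets whose support (fraction of transactions containing them)
        is at least min_support. transactions is a list of sets of items."""
        n = len(transactions)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        items = {item for t in transactions for item in t}
        frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

        k = 2
        while frequent[-1]:
            prev = frequent[-1]
            # join step: unions of frequent (k-1)-itemsets that have size k
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # prune step: every (k-1)-subset of a surviving candidate must be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            frequent.append({c for c in candidates if support(c) >= min_support})
            k += 1
        return [s for level in frequent for s in level]

    # Toy market-basket example
    baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
               {"milk", "bread", "diapers"}, {"bread"}]
    print(apriori(baskets, min_support=0.5))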

Bio: Dr. Sridhar Rajagopalan received his Ph.D. from the University of California, Berkeley, in 1994. He subsequently spent two years at Princeton University as a DIMACS postdoctoral fellow. He has since been a researcher at the IBM Almaden Research Center. He is interested in algorithms, databases, randomization in computing, and information retrieval on the Web.

 


Joseph Sirosh
HNC Software Inc.
Fri, Sept 21, 2001

Unit Vector Representations of Unstructured Information

One of the frequently encountered challenges in data mining is the problem of representing unstructured information such as words in text, action codes, and similar non-numerical information. In this talk, I will introduce a technology developed at HNC called Context Vectors for representing and operating on unstructured data. Using context vectors, many operations on text data, such as search and indexing, categorization, clustering, and summarization, can be performed using vector operations. I will also review some of the applications of this technology, such as email response, personalization, search, recommendation generation, and segmentation.

Bio: Dr. Joseph Sirosh leads Advanced Technology Solutions (ATS), the R&D group of HNC Software Inc. (NASDAQ: HNCS). HNC is a leader in Customer Insight solutions incorporating decision management and customer analytics for better customer insight in the financial/banking, insurance/healthcare, telecommunications, and e-commerce industries. Sirosh oversees and directs research projects on text/multimedia retrieval, wireless web information management, information extraction and personalization, predictive modeling, intelligent agents, and genome sequence mining. During his tenure at HNC, he has won highly competitive externally funded projects in information retrieval and bioinformatics, helping HNC create breakthrough technology and establish core competencies in new fields. In his career, Sirosh has published more than 20 technical papers and one electronic book, and has been a lead investigator on various research grants from DARPA and other government agencies. Sirosh holds a PhD and Masters in Computer Sciences from the University of Texas at Austin and a BTech degree in Computer Science & Engineering from the Indian Institute of Technology, Madras.

 


Soumen Chakrabarti
Indian Institute of Technology - Bombay
Scheduled for Fri, Sept 14, 2001, but cancelled due to Sept 11

Beyond hubs and authorities: Enhanced topic distillation using text, markup tags, and hyperlinks

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem is common for narrow topics, because most hubs contain admixtures of other topics. The problem also becomes severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. As a by-product, our algorithm can extract annotated sections of hubs which are best suited to surveying the given topic. Using over 50 topics, 28 from earlier topic distillation work, we have analyzed over 700000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.

This is joint work with Mukul Joshi and Vivek Tawde.
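For background, the page-level hubs-and-authorities (HITS) iteration that topic distillation refines is sketched below; the algorithm described in the talk instead scores finer-grained units (markup subtrees) and folds in text evidence. The toy adjacency matrix is made up.

    import numpy as np

    def hits(adjacency, n_iter=50):
        """Classical HITS on a page-level link graph.

        adjacency[i, j] = 1 if page i links to page j.
        Returns (hub_scores, authority_scores), each normalized to unit length.
        """
        A = np.asarray(adjacency, dtype=float)
        hubs = np.ones(A.shape[0])
        for _ in range(n_iter):
            authorities = A.T @ hubs          # good authorities are pointed to by good hubs
            authorities /= np.linalg.norm(authorities)
            hubs = A @ authorities            # good hubs point to good authorities
            hubs /= np.linalg.norm(hubs)
        return hubs, authorities

    # Toy graph: pages 0 and 1 are link lists pointing at pages 2 and 3.
    A = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
    hub_scores, authority_scores = hits(A)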

 


Daniel Boley
University of Minnesota
Fri, Apr 27, 2001

A Scalable Hierarchical Algorithm for Unsupervised Clustering

Top-down hierarchical clustering can be done in a scalable way. Here we describe a scalable unsupervised clustering algorithm designed for large datasets from a variety of applications. The method constructs a tree of nested clusters top-down, where each cluster in the tree is split according to the leading principal direction. We use a fast principal direction solver to achieve a fast overall method. The algorithm can be applied to any dataset whose entries can be embedded in a high-dimensional Euclidean space, and takes full advantage of any sparsity present in the data. We show the performance of the method on text document data, in terms of both scalability and quality of clusters. We demonstrate the versatility of the method in different domains by showing results from text documents, human cancer gene expression data, and astrophysical data. For that last domain, we use an out-of-core variant of the underlying method which is capable of efficiently clustering large datasets using only a relatively small memory partition.
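A minimal sketch of the top-down principal-direction splitting step (not the speaker's implementation, which uses a fast principal-direction solver and a more careful rule for choosing which leaf to split): at each node, project the centered points onto the leading principal direction and split by the sign of the projection.

    import numpy as np

    def principal_direction_split(X):
        """Split one cluster in two by the sign of the projection onto the
        leading principal direction of the centered data."""
        centered = X - X.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)   # vt[0] = leading principal direction
        projections = centered @ vt[0]
        return projections < 0, projections >= 0

    def top_down_cluster(X, max_leaves=4):
        """Top-down hierarchical clustering: repeatedly split the largest leaf."""
        leaves = [np.arange(len(X))]
        while len(leaves) < max_leaves:
            largest = max(range(len(leaves)), key=lambda i: len(leaves[i]))
            idx = leaves.pop(largest)
            left, right = principal_direction_split(X[idx])
            if not left.any() or not right.any():
                leaves.append(idx)
                break
            leaves.extend([idx[left], idx[right]])
        return leaves

    # Toy usage on three 2-D blobs
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.2, (30, 2)) for m in (0.0, 2.0, 4.0)])
    clusters = top_down_cluster(X, max_leaves=3)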

Bio: Dr. Daniel Boley received his A.B. degree Summa Cum Laude in Mathematics and with Distinction in All Subjects from Cornell University in 1974, and his M.S. and Ph.D. degrees in Computer Science from Stanford University in 1976 and 1981, respectively. Since 1981, he has been on the faculty of the Department of Computer Science and Engineering at the University of Minnesota. He has had extended visiting positions at the Los Alamos Scientific Laboratory, the IBM Research Center in Zurich (Switzerland), the Australian National University in Canberra, Stanford University, and the University of Salerno (Italy). Dr. Boley is known for his past work on numerical linear algebra methods for control problems, parallel algorithms, iterative methods for matrix eigenproblems, error correction for floating point computations, inverse problems in linear algebra, as well as his more recent work on numerical methods in robot motion planning and unsupervised document categorization in data mining. He has been an editor for the SIAM Journal of Matrix Analysis and has chaired several technical symposia.

 


David Scott
Rice University
Fri, Apr 20, 2001

Nonparametric Clustering for Data Mining

The use of density estimation to find clusters in data is supplementing ad hoc hierarchical methodology. Examples include finding high-density regions, finding modes in a kernel density estimator, and the mode tree. Alternatively, a mixture model may be fit and the mixture components associated with individual clusters. However, a high-dimensional mixture model with many components is difficult to estimate in practice. Here, we survey mode and level-set methods for finding clusters. We describe a new algorithm that estimates a subset of a mixture model. In particular, we demonstrate how to fit one component at a time and how the fits may be organized to reveal the complete clustering model.

 


Ed George
Department of MSIS
University of Texas - Austin
Fri, Apr 13, 2001

Bayesian Treed Modeling

When simple parametric models such as linear regression fail to adequately approximate a relationship across an entire set of data, an alternative may be to consider a partition of the data, and then use a separate simple model within each subset of the partition. Such an alternative is provided by a treed model, which uses a binary tree to identify such a partition. However, treed models go further than conventional trees (e.g., CART, C4.5) by fitting models rather than a simple mean or proportion within each subset. In this talk, I will discuss a Bayesian approach for finding and fitting parametric treed models, focusing in particular on Bayesian treed regression. The potential of this approach is illustrated by a cross-validation comparison of predictive performance with neural nets, MARS, and conventional trees on simulated and real data sets. This is joint work with Hugh Chipman, University of Waterloo, and Rob McCulloch, University of Chicago. A paper is available at http://bevo2.bus.utexas.edu/GeorgeE/Research%20papers/treed-models.pdf.

 


Gautam Das
Microsoft Research
Fri, Mar 30, 2001

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

The popularity of data warehousing applications makes the problem of approximately answering aggregation queries important. Recently, researchers in this area have presented several sampling based schemes for approximate processing of aggregation queries. While most of these studies have acknowledged the importance of exploiting workload information in solving the problem, they still fall short in several important ways. First, many of them use ad-hoc schemes for picking samples of the data, sometimes resulting in unacceptable quality of answers even for the given workload. Second, most previous approaches ignore variance in the data distribution in determining the sample, which can also lead to poor quality. We present a comprehensive and practical solution that addresses all these problems. We use basic ideas from stratified sampling to formulate the problem as an optimization problem whose goal is to minimize the error for the given workload. We show how our solution can be implemented efficiently in a database system, and present results of extensive experiments on a commercial database management system that show the superior quality of our method compared to relevant previous work.
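To make the stratified-sampling idea concrete, the sketch below allocates a fixed sampling budget across strata in proportion to stratum size times within-stratum standard deviation (Neyman allocation), which minimizes the variance of a SUM/AVG-style estimate. The actual method in the talk is workload-driven and considerably more elaborate; the function names and toy data are illustrative.

    import numpy as np

    def neyman_allocation(strata, total_sample_size):
        """Allocate a sampling budget across strata proportionally to N_h * sigma_h."""
        weights = np.array([len(s) * np.std(s) for s in strata], dtype=float)
        if weights.sum() == 0:
            weights = np.array([len(s) for s in strata], dtype=float)
        alloc = np.maximum(1, np.round(total_sample_size * weights / weights.sum())).astype(int)
        return [int(min(a, len(s))) for a, s in zip(alloc, strata)]

    def approximate_sum(strata, allocation, seed=0):
        """Unbiased estimate of the overall SUM from per-stratum simple random samples."""
        rng = np.random.default_rng(seed)
        estimate = 0.0
        for values, n_h in zip(strata, allocation):
            sample = rng.choice(values, size=n_h, replace=False)
            estimate += len(values) * sample.mean()   # N_h * sample mean estimates the stratum sum
        return estimate

    # Toy example: a low-variance stratum and a high-variance stratum
    strata = [np.full(10_000, 5.0),
              np.concatenate([np.zeros(5_000), np.full(5_000, 100.0)])]
    alloc = neyman_allocation(strata, total_sample_size=200)   # most samples go to the noisy stratum
    print(alloc, approximate_sum(strata, alloc))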

 


Joydeep Ghosh
Electrical and Computer Engineering
University of Texas, Austin
Fri, Mar 23, 2001

Graphical Techniques for Efficient and Balanced Clustering of Very High-Dimensional Data

Segmentation or clustering of large data sets with thousands of dimensions is very challenging because of sparsity problems arising from the "curse of dimensionality". Recent graphical approaches provide novel and efficient solutions by working in similarity space rather than in the original vector space in which the data points were specified.
I shall describe one such approach, based on constrained, weighted graph partitioning, that has yielded remarkable results on real-life market basket analysis and on clustering web pages. Issues such as which similarity space should be used, and how to evaluate and visualize the clusters, will be addressed in the process.

 


Tom Mitchell
Carnegie Mellon University
and Whizbang! Labs
Fri, Feb 09, 2001

Machine Learning and Extracting Information from the Web

Today's search engines can retrieve and display over a billion web pages, but their use is limited by the fact that they don't analyze the content of these pages in any depth.
This talk will describe research that goes further, resulting in systems that extract detailed factual information automatically from millions of web pages. Our approach relies heavily on machine learning algorithms to train the system to find and extract targeted information. For example, in one case we trained our system to find and extract job postings from the web, resulting in the world's largest database of job openings (over 600,000 jobs, see www.flipdog.com). This talk will describe machine learning algorithms for classifying and extracting information from web pages, including results of recent research on using unlabeled data and other kinds of information to improve learning accuracy.

 


Prabhakar Raghavan
Chief Scientist, Verity Inc.
Fri, Nov 17, 2000

Networks and Sub-Networks in the World-Wide Web

We discuss several structural properties of graphs arising from the world-wide web, including the graph of hyperlinks and the graph induced by connections between distributed search servents. We review a number of algorithmic investigations of the structure of these networks, and conclude by proposing stochastic models for the evolution of these networks.

Bio: Dr. Prabhakar Raghavan is Chief Scientist at Verity, and is also a consulting professor at Stanford University. Prior to Verity, he worked at the IBM Almaden Research Center, where his group worked on algorithms, web mining and search, complexity theory, security and cryptography, optimization, and reasoning about knowledge. Prabhakar received his undergraduate degree in electrical engineering from IIT Madras, and his PhD in computer science from UC Berkeley.

 


Jerome H. Friedman
Stanford University
Fri, Nov 10, 2000

Predictive Data Mining with Multiple Additive Regression Trees

Predicting future outcomes based on past observational data is a common application in data mining. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model. New automated procedures for predictive data mining, based on "boosting" CART regression trees, are described. The goal is a class of fast "off-the-shelf" procedures for regression and classification that are competitive in accuracy with more customized approaches, while being fairly automatic to use (little tuning) and highly robust, especially when applied to less-than-clean data. Tools are presented for interpreting and visualizing these multiple additive regression tree (MART) models.
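A minimal sketch of the core idea for squared-error loss (assuming scikit-learn's regression trees; the procedures described in the talk cover other loss functions, classification, and further refinements): each small tree is fit to the residuals of the current model and added with a shrinkage factor.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_mart(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
        """Boosted regression trees for squared-error loss: each tree fits the residuals."""
        prediction = np.full(len(y), float(y.mean()))   # start from the constant model
        trees = []
        for _ in range(n_trees):
            residuals = y - prediction
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            trees.append(tree)
            prediction = prediction + learning_rate * tree.predict(X)
        return float(y.mean()), trees, learning_rate

    def predict_mart(model, X):
        baseline, trees, lr = model
        return baseline + lr * sum(tree.predict(X) for tree in trees)

    # Toy usage: learn a noisy sine curve
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (500, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)
    model = fit_mart(X, y)
    print(np.mean((predict_mart(model, X) - y) ** 2))   # training mean squared error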

 


Nick Street
Department of Management Sciences
The University of Iowa
Thurs, Nov 9, 2000

Feature Selection in Unsupervised Learning via Multi-Objective Evolutionary Search

Feature subset selection is an important knowledge discovery task, because of the insight gained from determining relevant modeling variables and for the improved understandability, scalability, and perhaps accuracy of the resulting models. We explore the feature selection problem in unsupervised learning (clustering) where, unlike supervised learning, there is no single adequate measure of the quality of clusters built using a given feature subset. We frame the problem as a multi-objective optimization task by selecting a number of heuristic criteria to estimate model quality. The combinatorial search for good solutions is performed using ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions approximating the multi-dimensional Pareto front. We evaluate our approach with both synthetic and real-world data sets of moderate to high dimension. Our results show promise in finding Pareto-optimal solutions through which we can identify significant feature subsets and an appropriate number of clusters.

 


Ray Mooney
Department of Computer Sciences
University of Texas - Austin
Fri, Oct 27, 2000

Text Mining with Information Extraction

Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents. An IE system is therefore capable of transforming a corpus of unstructured texts or web pages into a structured database. Our previous work has focused on using machine learning methods to automatically construct IE systems from training sets of manually annotated documents. Our current research focuses on a form of text mining that extracts a database from a document corpus using a learned IE system and then mines this database for interesting patterns using traditional rule-induction methods. The mined rules can in turn be used to improve the accuracy of information extraction. We will present encouraging results on applying this approach to a corpus of computer job postings from the Usenet newsgroup austin.jobs.

 


Raghu Ramakrishnan
Professor and Vilas Associate
Department of Computer Sciences, UW-Madison
and CTO, QUIQ
Fri, Oct 13, 2000

Exploratory Analysis of Large Datasets

Exploratory data analysis has a long and rich history, especially in the AI and Statistics communities. Recently, it has gained a lot of attention under the name "Data Mining", with the new twist being an increased emphasis on analyzing large datasets. In this talk, I will discuss applications that motivate data mining and consider some new challenges that arise as a consequence of considering large datasets.

 


Jaideep Srivastava
Department of Computer Science and Engineering
University of Minnesota
and Yodlee.com
Fri, Sept 29, 2000

Web Mining: A Challenging Frontier for Data Mining

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. The Web is revolutionizing the way businesses interact with each other (B2B) and with customers (B2C). It has introduced entirely new ways of doing commerce, including, for example, auctions and reverse auctions. It has also made it imperative for organizations and companies to optimize their electronic business. Knowledge about the Web user is fundamental to the establishment of viable e-business solutions. Web mining is the application of data mining techniques to acquire this knowledge for e-business. Typical concerns in e-business include improved cross-sells, up-sells, personalized ads, targeted assortments, improved conversion rates, and measurement of the effectiveness of actions.

This talk will provide an overview of Web mining and discuss why it remains a challenging frontier for data mining.

 


Peter Haas
IBM Almaden Research Center
Mon, Sept 25, 2000

Techniques for Online Exploration of Large Object-Relational Databases

We review some recent techniques for exploring large object-relational databases in an interactive online manner. The idea is to provide continuously updated running estimates of the final query results to the user, along with an indication of the precision of the estimates. The user can then halt the query as soon as the answer is sufficiently precise---the time required to obtain an acceptable approximate answer can be orders of magnitude less than the time needed to completely process the query. We focus on methods for online processing of complex aggregation queries and also discuss methods for online visualization and online display of a large set of result records. We briefly relate our work to more traditional OLAP (Online Analytical Processing) techniques and to static sampling in databases.
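A toy sketch of the running-estimate idea for an AVG query, using a simple CLT-based confidence interval and assuming rows are scanned in random order; the actual techniques handle joins, group-by, and more careful statistics. The names and synthetic data are illustrative.

    import math
    import random

    def online_average(values, z=1.96, report_every=1000):
        """Scan rows in random order, yielding a running mean and an approximate
        95% confidence half-width so the user can stop once it is precise enough."""
        values = list(values)
        random.shuffle(values)              # online aggregation relies on a random scan order
        n, mean, m2 = 0, 0.0, 0.0           # m2 = running sum of squared deviations (Welford)
        for x in values:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
            if n % report_every == 0:
                half_width = z * math.sqrt((m2 / (n - 1)) / n)
                yield n, mean, half_width

    # Toy usage: stop as soon as the running average is known to within +/- 0.1
    data = [random.expovariate(1 / 10.0) for _ in range(100_000)]
    for n, est, hw in online_average(data):
        if hw < 0.1:
            print(f"stopped after {n} rows: avg = {est:.2f} +/- {hw:.2f}")
            break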

 


Jude Shavlik
Departments of Computer Sciences and Biostatistics & Medical Informatics
University of Wisconsin - Madison
Fri, Aug 4, 2000

Using Diverse Evidence Sources to Uncover Coordinately Controlled Genes via Machine Learning and Dynamic Programming

Now that the complete genomes of numerous organisms have been determined, a key problem in computational molecular biology is uncovering the relationships that exist among the genes in each organism and the regulatory mechanisms that control their operation. We are developing computational methods for discovering such regulatory mechanisms and relationships. In our first project, we have developed a computational approach for predicting operons in the genomes of prokaryotic organisms. Operons are contiguous genes on a chromosome that are "turned on" or "shut off" as a unit. Our approach uses machine-learning methods to induce predictive models for this task from a rich variety of data types including DNA sequence data, gene-expression laboratory data, and handcrafted functional annotations associated with genes. We use multiple learned models that individually predict promoters, terminators, and operons themselves. A key part of our approach is a dynamic-programming method that uses our individual operon predictions to map every known and putative gene in a given genome into its most probable operon. We evaluate our approach using data from the E. coli K-12 genome.

This is joint work with Joseph Bockhorst, Mark Craven, Jeremy Glasner, and David Page. No prior knowledge of molecular biology will be assumed of the audience.

 


Haym Hirsh
Computer Science Department
Rutgers University
Thurs, Aug 3, 2000

Integrating Multiple Sources of Information in Text Classification

One popular class of supervised learning problems concerns the task of classifying items that are composed primarily of text. Such problems occur when assessing the relevance of a Web page to a given user, assigning topic labels to technical papers, and assessing the importance of a user's email. Most commonly this task is performed by extrapolating, from a corpus of labeled textual training documents, procedures for classifying further unlabeled documents. This talk presents an approach to corpus-based text classification based on WHIRL, a database system that augments traditional relational database technology with textual-similarity operations developed in the information retrieval community. Not only does the approach perform competitively on traditional text classification tasks, but we also show that it enables the incorporation of a range of sources of information into the classification process in a fairly robust and general fashion.

 


Dharmendra S. Modha
IBM Almaden Research Center
Wed, Apr 5, 2000

Clustering Hypertext with Applications to Web Searching

Clustering separates unrelated documents and groups related documents, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel clustering algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document), and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representative of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results. Anecdotally, when applied to the documents returned by AltaVista in response to the query "abduction", our algorithm separates documents about "alien abduction" from those about "child abduction."

This is joint work with W. Scott Spangler and will appear in the ACM Hypertext 2000 Conference.
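A rough sketch of one way to combine the three sources of evidence (not the authors' algorithm, which learns the relative weights automatically): build tf-idf vectors from words, out-links, and in-links, concatenate them with fixed weights, and cluster. The toy pages, links, and weights are made up.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def hypertext_features(texts, out_links, in_links, weights=(0.6, 0.2, 0.2)):
        """Weighted concatenation of word, out-link, and in-link tf-idf vectors.

        Each component is L2-normalized by TfidfVectorizer, so the weights control
        the relative importance of the three kinds of evidence."""
        vectorizers = (TfidfVectorizer(),
                       TfidfVectorizer(token_pattern=r"\S+"),
                       TfidfVectorizer(token_pattern=r"\S+"))
        parts = [w * v.fit_transform(data).toarray()
                 for data, v, w in zip((texts, out_links, in_links), vectorizers, weights)]
        return np.hstack(parts)

    # Toy usage with three tiny "pages"
    texts = ["alien abduction reports", "child abduction statistics", "ufo sightings and abduction"]
    outs  = ["ufo.org aliens.net", "police.gov missingkids.org", "ufo.org"]
    ins   = ["blog1 blog2", "news1", "blog1"]
    X = hypertext_features(texts, outs, ins)
    cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)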

 


Last updated in Spring 2006 by hyukcho@cs.utexas.edu