Activities and Findings Related to NSF grant

Small: Simultaneous Decomposition and Predictive Modeling on Large Multi-Modal Data

Research and Education Activities:

1. Espoused the philosophy of Simultaneous Decomposition and Prediction (SDaP) to a broad audience (keynote, book chapter, papers). Also gave invited keynote talks abroad (Spain, Brazil, Mexico), in addition to talks at several domestic forums.

2. Formulated a hard decomposition version of SDaP that includes model selection (the number of decomposed pieces), and devised new, effective active learning principles for this hard version.

3. Reviewed the literature on combining multiple clusterings to provide a conceptual hierarchy of the diverse ideas (review paper published, two book chapters in the works), and then formulated a novel framework called C3E that combines classifier ensembles and cluster ensembles to exploit both labelled and (subsequent) unlabelled data, even when the underlying models change over time. Showed its power via extensive experiments. The journal version of C3E is to appear in IEEE Trans. KDE. Also devised an ensemble-based approach to the imbalanced class problem (when one class is very rare) using alpha divergence. The journal paper has appeared in IEEE Trans. KDE.

4. Completed a Bayesian formulation of soft SDaP using generative models, with full variational inference. Subsequently we generalized this approach to one of Constrained Relative Entropy Minimization with Applications to Multitask Learning. This work has developed quite well and forms the core of Koyejo's PhD thesis, which he successfully defended in May 2013. Several publications have already resulted (including one that received Amazon's Best Student Paper award at UAI'13), and a journal paper has been accepted in Machine Learning Jl.

5. The problem of ranking on networks is closely related to this project. We have developed a set of tools using ideas of monotonicity and convexity, with results that beat other ranking methods such as CofiRank. This work has led to Acharyya's PhD thesis (graduated Aug 2013), and papers in UAI'12, RecSys'13 and UAI'14.

6. A fruitful collaboration with Yahoo has developed, with visits to both Mountain View and Barcelona. They have provided two very large datasets, but with restrictions on publication, especially for the one that deals with actual customer data from their Taiwanese properties. Nevertheless, these datasets give us an understanding of real-life scale and data quality issues.

7. I offered a one-month course on advanced data mining at UNICAMP, Brazil (ranked #2 in engineering/sciences in Latin America), with over 6 hours of lectures devoted to ideas and results intimately related to this project.

8. We have recently applied SDaP concepts to high-throughput phenotype extraction from large-scale EHR data, and our first work will appear in Jl. Biomedical Informatics.

The Major Results/Findings are:

1. The SDaP approach is very promising for rich, large dyadic datasets. The hard version yields good accuracy and interpretable results. However, it has difficulty in the cold-start setting, when new entities are encountered instead of new relations among existing entities. This motivates the soft version based on generative models. We have worked out a series of such formulations in great detail, including how to incorporate extra information such as a social network, concept hierarchies, time evolution and even the fact that data may not be missing at random. Initial variational methods proved impractical, so we have now implemented sampling-based methods, such as (collapsed) Gibbs sampling, as well as problem reformulations that are more scalable and hence practical.

2. We proposed a new optimization framework that takes as inputs the outputs of an ensemble of classifiers as well as the results of unsupervised learning, and produces consensus labels. It is based on Bregman divergences and is thus extremely general. The results have already been published at MCS (basic setting) and ICML (transfer learning setting) and were well received. A full-length journal paper will appear in IEEE Trans. KDE.

3. Two types of complexity in data arise a) when some classes are very rare, and b) when data is available at different levels of granularity, e.g. mortality rates may be available at district level but age at individual level. As stepping stones to SDaP formulations for a variety of complex data, we tackled these two problems. For the first, we showed that an ensemble of trees where the splitting criterion is based on alpha divergence (a generalization of KL divergence), with different values of alpha, is very robust to a range of imbalances. For the second, we formulated a generative scheme and applied it successfully to healthcare data. Both works are accepted/published in journals in addition to conference publications.

4. We have started applying the methodology to health informatics, where genes and diseases are the two modes. Both modes have several associated properties and known interactions, and the gene-disease matrix is partially observed: known relations are coded as '1', while the rest are all considered 'unknown, likely zero'. We have developed a general approach of "constrained relative entropy minimization". This is a generalization of classical Bayesian methods, but its big advantage is that it can incorporate domain constraints in a natural fashion even in situations where it is otherwise difficult to formulate a "prior" with such knowledge. Moreover, this framework can be applied to multitask learning problems. Our results beat the current "ProDiGe" methods for the challenging gene-disease association problem, and will appear in Machine Learning Jl.

5. A collaboration with Sarnoff focussed on photo recommendation, where people and pictures are the two modes. We obtained state-of-the-art results both with SCOAL and when using a method called RLFM as the predictive model within each segment. RLFM combines probabilistic matrix factorization and regression in a mixed-effects framework, and performs well in both cold-start and warm-start settings.

6. We have developed a set of tools using ideas of monotonicity and convexity to address the general problem of ranking, which is at the core of many recommender systems. Our results beat other ranking methods such as CofiRank. A spinoff from this work was a new way of determining a suitable divergence for learning GLMs, leading to an invited keynote talk at an ICML'13 workshop. A margin-enhanced version will appear in UAI'14.

7. We have started exploring how SDaP can be used for tensor data, specifically data obtained from EHR, for high-throughput phenotyping. The base model will appear in Jl. Biomedical Informatics (a top journal for this topic), and our first enhancement has been accepted for KDD'14.
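
As an illustration of Finding 3, the following is a minimal sketch of how an alpha-divergence splitting criterion might be scored. It assumes, as one plausible formulation and not necessarily the exact one used in the published work, that a split is scored by the Amari alpha divergence between the joint distribution of (branch, class) and the product of its marginals, which reduces to ordinary information gain as alpha approaches 1. The function names alpha_divergence and split_score are hypothetical.

```python
import math


def alpha_divergence(p, q, alpha):
    """Amari alpha divergence D_alpha(p || q) for discrete distributions.

    Assumes alpha > 0 and that q[i] > 0 wherever p[i] > 0.
    The limit alpha -> 1 is the KL divergence KL(p || q).
    """
    if alpha <= 0:
        raise ValueError("this sketch only handles alpha > 0")
    if abs(alpha - 1.0) < 1e-9:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Terms with p[i] == 0 contribute nothing when alpha > 0.
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return (1.0 - s) / (alpha * (1.0 - alpha))


def split_score(counts, alpha):
    """Score a candidate split given counts[branch][class].

    Returns the alpha divergence between the joint distribution of
    (branch, class) and the product of its marginals. A score of zero
    means the split carries no information about the class.
    """
    n = float(sum(sum(row) for row in counts))
    n_classes = len(counts[0])
    branch_marg = [sum(row) / n for row in counts]
    class_marg = [sum(row[c] for row in counts) / n for c in range(n_classes)]
    joint = [cell / n for row in counts for cell in row]
    product = [b * c for b in branch_marg for c in class_marg]
    return alpha_divergence(joint, product, alpha)


if __name__ == "__main__":
    # Hypothetical imbalanced node: 30 majority and 3 minority examples.
    informative = [[30, 0], [0, 3]]    # split isolates the rare class
    uninformative = [[10, 1], [20, 2]]  # class ratio identical in both branches
    for a in (0.5, 1.0, 2.0):
        print(a, split_score(informative, a), split_score(uninformative, a))
```

In an ensemble, each tree would be grown with a different value of alpha, so that the trees rank candidate splits differently; how the trees are then combined is outside the scope of this sketch.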