Activities and Findings Related to NSF grant
Small: Simultaneous Decomposition and Predictive Modeling on Large Multi-Modal Data
Research and Education
Activities:
1. Espoused the philosophy of Simultaneous Decomposition and Prediction (SDaP)
to a broad audience (keynote, book chapter, papers). Also gave invited keynote
talks abroad (Spain, Brazil, Mexico), in addition to talks at several domestic
forums.
2. Formulated a hard decomposition version of SDaP that includes model
selection (number of decomposed pieces) and devised new, effective active
learning principles for this hard version of SDaP.
3. Reviewed the literature on combining multiple clusterings to provide a
conceptual hierarchy on the diverse ideas (review paper published, two book
chapters in the works), and then formulated a novel framework called C3E that
combines classifier ensembles and cluster ensembles to handle both labelled and
(subsequent) unlabelled data, even when the underlying models change over time.
Showed its power via extensive experiments. The journal version of C3E is to
appear in IEEE Trans. KDE. Also devised an ensemble-based approach to the
imbalanced class problem (when one class is very rare) using alpha divergence;
the journal paper appeared in IEEE Trans. KDE.
4. Completed a Bayesian formulation of soft SDaP using generative models, with
full variational inference. Subsequently we generalized this approach to one of
Constrained Relative Entropy Minimization with Applications to Multitask
Learning. This line of work is now quite well developed and forms the core of
Koyejo's PhD thesis, which he successfully defended in May 13. Several
publications have already resulted (including one that received Amazon's Best
Student Paper at UAI'13), and a journal paper has been accepted in Machine
Learning Jl.
5. The problem of ranking on networks is closely related to this project. We
have developed a set of tools using ideas of monotonicity and convexity, with
results that outperform other ranking methods such as CofiRank. This work has
led to Acharyya's PhD thesis (graduated Aug 13), and papers in UAI12, Recsys13
and UAI14.
6. A fruitful collaboration with Yahoo has developed, with visits to both
Mountain View and Barcelona. They have provided two very large datasets, but
with restrictions on publication, especially for the one that deals with actual
customer data from their Taiwanese properties. However, these datasets give us
an understanding of real-life scale and data quality issues.
7. I offered a one-month course on advanced data mining at UNICAMP, Brazil
(ranked #2 in engineering/sciences in Latin America), with over 6 hours of
lectures devoted to ideas and results intimately related to this project.
8. We have recently applied SDaP concepts to high-throughput phenotype
extraction from large-scale EHR data, and our first work will appear in
Jl. Biomedical Informatics.
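
The constrained relative entropy minimization idea mentioned in activity 4 can
be illustrated on a toy discrete problem. The sketch below is only an
illustration under stated assumptions, not the formulation used in the thesis
or papers: it takes a prior over a finite set of outcomes and finds the
distribution closest to it in relative entropy (KL divergence) subject to a
single expectation constraint. The function names (tilt, min_relative_entropy),
the dice example and the bisection bounds are all hypothetical choices made for
this example.

```python
import math


def tilt(prior, f, lam):
    """Exponentially tilt `prior` by exp(lam * f) and renormalize."""
    shift = max(lam * fi for fi in f)  # subtract the max exponent for stability
    weights = [p * math.exp(lam * fi - shift) for p, fi in zip(prior, f)]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(q, f):
    """Expected value of the feature f under the distribution q."""
    return sum(qi * fi for qi, fi in zip(q, f))


def min_relative_entropy(prior, f, target, lo=-50.0, hi=50.0, iters=200):
    """Minimize KL(q || prior) subject to E_q[f] = target.

    The minimizer has the form q proportional to prior * exp(lam * f).
    E_q[f] increases monotonically with lam, so bisection on lam suffices.
    Assumes min(f) < target < max(f); the bounds lo/hi are illustrative.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expectation(tilt(prior, f, mid), f) < target:
            lo = mid
        else:
            hi = mid
    return tilt(prior, f, 0.5 * (lo + hi))


# Toy example: a uniform prior over die faces, constrained to have mean 4.5.
faces = [1, 2, 3, 4, 5, 6]
uniform_prior = [1.0 / 6.0] * 6
posterior = min_relative_entropy(uniform_prior, faces, target=4.5)
print([round(q, 4) for q in posterior])
```

The constraint plays the role that domain knowledge plays in the project: it
is imposed directly on the solution rather than being encoded in a prior.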
The Major Results/Findings are:
1. The SDaP approach is very promising for rich, large dyadic datasets. The
hard version results in good accuracy and interpretable results. However, it
has difficulty in the cold-start setting, when new entities are encountered
instead of new relations among existing entities. This motivates the soft
version based on generative models. We have worked out a series of such
formulations in great detail, including how to incorporate extra information
such as a social network, concept hierarchies, time evolution and even the fact
that data may not be missing at random. Initial variational methods proved
impractical, so we have now implemented sampling-based methods, such as
(collapsed) Gibbs sampling, as well as problem reformulations that are more
scalable and hence practical.
2. We proposed a new optimization framework that takes as input the outputs of
an ensemble of classifiers as well as the results of unsupervised learning, and
produces consensus labels. It is based on Bregman divergences and thus is
extremely general. The results have already been published at MCS (basic
setting) and ICML (transfer learning setting) and were well received. A
full-length journal paper will appear in IEEE Trans. KDE.
3. Two types of complexity in data arise a) when some classes are very rare,
and b) when data may be available at different levels of granularity, e.g.
mortality rates may be available at district level but age at individual level.
As stepping stones to SDaP formulations for a variety of complex data, we
tackled these two problems. For the first, we showed that an ensemble of trees
whose splitting criterion is based on alpha divergence (a generalization of KL
divergence), with different values of alpha, is very robust to a range of
imbalances. For the second, we formulated a generative scheme and applied it
successfully to healthcare data. Both works have been accepted or published in
journals, in addition to conference publications.
4. We have started applying the methodology to health informatics, where genes
and diseases are the two modes. Both modes have several associated properties
and known interactions, and the gene-disease matrix is partially observed:
known relations are coded as '1', while the rest are all considered 'unknown,
likely zero'. We have developed a general approach of "constrained relative
entropy minimization". This is a generalization of classical Bayesian methods,
but its big advantage is that it can incorporate domain constraints in a
natural fashion, even in situations where it is otherwise difficult to
formulate a "prior" with such knowledge. Moreover, this framework can be
applied to multitask learning problems. Our results outperform the current
"ProDiGe" methods for the challenging gene-disease association problem, and
will appear in Machine Learning Jl.
5. A collaboration with Sarnoff focussed on photo recommendation, where people
and pictures are the two modes. We obtained state-of-the-art results both with
SCOAL and with a method called RLFM used as the predictive model within each
segment. RLFM combines probabilistic matrix factorization and regression in a
mixed-effects framework, with good abilities for both cold and warm start.
6. We have developed a set of tools using ideas of monotonicity and convexity
to address the general problem of ranking, which is at the core of many
recommender systems. Our results outperform other ranking methods such as
CofiRank. A spinoff from this work was a new way of determining a suitable
divergence for learning GLMs, leading to an invited keynote talk at an ICML'13
workshop. A margin-enhanced version will appear in UAI'14.
7. We have started exploring how SDaP can be used for tensor data, specifically
tensor data obtained from EHRs, for use in high-throughput phenotyping. The
base model will appear in Jl. Biomedical Informatics (a top journal for this
topic), and our first enhancement has been accepted for KDD'14.
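
The alpha divergence splitting criterion described in finding 3 can be sketched
as follows. This is a minimal illustration, not the published implementation.
It assumes the Amari form of the alpha divergence (which tends to KL divergence
as alpha approaches 1) and assumes that a candidate split is scored by the
divergence between the joint distribution of feature and label and the product
of their marginals, in analogy with information gain. The function names and
the small imbalanced dataset are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alpha_divergence(p, q, alpha):
    """Amari alpha divergence; assumes alpha > 0, alpha != 1, q > 0 where p > 0."""
    s = sum((pi ** alpha) * (qi ** (1.0 - alpha))
            for pi, qi in zip(p, q) if pi > 0 and qi > 0)
    return (1.0 - s) / (alpha * (1.0 - alpha))


def joint_and_product(xs, ys):
    """Return the empirical joint of (x, y) and the product of its marginals."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    cells = [(x, y) for x in px for y in py]
    p = [joint[(x, y)] / n for x, y in cells]
    q = [px[x] * py[y] / (n * n) for x, y in cells]
    return p, q


def split_score(xs, ys, alpha):
    """Score a categorical feature: larger means more dependence on the label."""
    p, q = joint_and_product(xs, ys)
    return alpha_divergence(p, q, alpha)


# Hypothetical imbalanced data: 5 positives, 95 negatives.
ys = [1] * 5 + [0] * 95
informative = [1] * 5 + [1] * 5 + [0] * 90     # fires on every positive
noise = [1] + [0] * 4 + [1] * 19 + [0] * 76    # independent of the label

for alpha in (0.5, 0.999, 1.5):
    print(alpha, split_score(informative, ys, alpha), split_score(noise, ys, alpha))
```

An ensemble in the spirit of the text would grow one tree per value of alpha
and combine their predictions; only the scoring step is sketched here.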