Inter-Agency “Co-opetition”
via Hidden Web Databases
(with Prof. Gautam Das, UT Arlington and Prof. Byron Gao, Texas State Univ. San Marcos)
Abstract:
Structured
hidden databases are widely prevalent on the Web. They provide restricted form
interfaces that allow users to execute search queries by specifying attribute
values of the desired records, and the system responds by ranking and returning
a few (e.g., top-k) matching records.
For example, cars.com offers a form interface for its used-car database which
allows users to search by make, model, year, etc., and returns the top-1500
cars ranked by price. Agencies or companies that own such
databases have two conflicting goals: (i) provide
information to the target audience such as customers or 3rd party
applications, while (ii) avoiding leakage of private or sensitive information
to competitors, malicious users etc. We propose innovative data
querying and analytical techniques that can simultaneously address both goals
with minimal changes to existing interfaces. These methods will work in distributed
settings where there can be multiple data sites that cooperate or are in competition
(adversarial). Since both
effective information sharing and privacy/security issues are of central
concern in recent efforts to share health information (EHRs),
law enforcement agency collaboration, e-governance, etc., the impact and
application of the proposed research will be immediate and broad.
Distributed Cooperation: We have recently pioneered
techniques for obtaining random samples of hidden databases via their public
interfaces. Samples reveal approximate aggregate information about the
database, which is useful in the cooperative setting for powering diverse
third-party applications such as Web mash-ups and meta-search engines. We now
propose to develop sampling and data analytic techniques, with focus on data
heterogeneity, scalability and characterization of sampling bias. We show how
distributed data mining can be carried out across multiple databases, under a
variety of data sharing constraints, thereby enabling secure cooperation. Moreover, the above cooperation schemes
will be significantly generalized to cover semi-structured data as well as
unstructured (free text) queries.
Competition: Addressing adversarial
settings is even more urgent, as the recently developed methods of revealing
aggregate information from a model number of samples have exposed the vulnerability of a large number of
existing databases. We propose several promising defense mechanisms including query
auditing, answer perturbation, on-demand dummy insertion and front-end
interface redesign. Finally, both settings will be unified wherein aggregate
suppression methods are employed for sensitive parts of databases while trusted
allies are provided mechanisms to bypass specific defense mechanisms. A
multi-agent system for this purpose is proposed based on the mathematically
sound “collective intelligence” framework. A flexible testbed
demonstrating sampling and mining of hidden databases ,
as well as robust distributed cooperation
shall be developed to implement, test and validate the developed
techniques. Large, real-life databases from Microsoft, Yahoo! and other sources shall be used for these purposes.
The proposed team of three faculty will provide
complementary expertise in query processing/aggregation, database design,
distributed data mining, multi-agent systems, web interfaces and information
retrieval, needed to adequately cover all aspects of the proposed work.