Structured hidden databases are widely prevalent on the Web. They provide restricted form interfaces that allow users to execute search queries by specifying attribute values of the desired records, and the system responds by ranking and returning a few (e.g., top-k) matching records. For example, cars.com offers a form interface for its used-car database which allows users to search by make, model, year, etc., and returns the top-1500 cars ranked by price. Agencies or companies that own such databases have two conflicting goals: (i) provide information to the target audience such as customers or 3rd party applications, while (ii) avoiding leakage of private or sensitive information to competitors, malicious users etc. We propose innovative data querying and analytical techniques that can simultaneously address both goals with minimal changes to existing interfaces. These methods will work in distributed settings where there can be multiple data sites that cooperate or are in competition (adversarial). Since both effective information sharing and privacy/security issues are of central concern in recent efforts to share health information (EHRs), law enforcement agency collaboration, e-governance, etc., the impact and application of the proposed research will be immediate and broad.
Distributed Cooperation: We have recently pioneered techniques for obtaining random samples of hidden databases via their public interfaces. Samples reveal approximate aggregate information about the database, which is useful in the cooperative setting for powering diverse third-party applications such as Web mash-ups and meta-search engines. We now propose to develop sampling and data analytic techniques, with focus on data heterogeneity, scalability and characterization of sampling bias. We show how distributed data mining can be carried out across multiple databases, under a variety of data sharing constraints, thereby enabling secure cooperation. Moreover, the above cooperation schemes will be significantly generalized to cover semi-structured data as well as unstructured (free text) queries.
Competition: Addressing adversarial settings is even more urgent, as the recently developed methods of revealing aggregate information from a model number of samples have exposed the vulnerability of a large number of existing databases. We propose several promising defense mechanisms including query auditing, answer perturbation, on-demand dummy insertion and front-end interface redesign. Finally, both settings will be unified wherein aggregate suppression methods are employed for sensitive parts of databases while trusted allies are provided mechanisms to bypass specific defense mechanisms. A multi-agent system for this purpose is proposed based on the mathematically sound "collective intelligence" framework. A flexible testbed demonstrating sampling and mining of hidden databases , as well as robust distributed cooperation shall be developed to implement, test and validate the developed techniques. Large, real-life databases from Microsoft, Yahoo! and other sources shall be used for these purposes.
The proposed team of three faculty will provide complementary expertise in query processing/aggregation, database design, distributed data mining, multi-agent systems, web interfaces and information retrieval, needed to adequately cover all aspects of the proposed work.