Multi-Modal Similarity Search on Web-Scale Data

Multi-Modal Similarity Search is a research field within the HEADT Centre. Its motivation and our approach are described below.

Motivation

  • Document search (e.g., web search) is traditionally based on full-text search. Given a query document, we enable the search for documents in large collections that are similar to it in their text and in further properties.
  • We enable the filtering of search results by metadata attributes like location.
  • Relational database management systems (RDBMSs) offer operators for similarity search on text and other attributes. However, they are not suited for the dataset sizes that occur in web-generated document collections (such as web pages). MapReduce-based systems, on the other hand, are suited for web-scale data but do not offer similarity operators. We aim to enhance MapReduce-based systems with operators for multi-modal similarity search, including the generation of indexes that allow the system to perform subsequent similarity searches more efficiently.

Our Approach

  • We adapt filter-and-verification approaches from textual similarity joins. Filtering means that we index only a prefix (a subset of the words) of each document. Probing this prefix index with a query document yields potentially similar documents. These candidate documents are then verified by computing their similarity to the query under a chosen similarity measure such as Jaccard. The filters guarantee that the result is complete, i.e., no document that meets the similarity threshold is missed.
  • We extend these indexes to other domains such as location. This can be done with a hybrid indexing structure, for example an inverted index combined with a grid.
  • We use LexisNexis’ HPCC as the compute platform together with its programming language ECL, which subsumes MapReduce. HPCC is particularly well suited for the problem at hand because it consists of a batch-oriented ETL component (Thor) and an online querying component (Roxie). We use Thor for indexing and Roxie for querying.
Usage of HPCC for textual search: Thor creates the search index(es) and Roxie executes the search
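
The filter-and-verification idea described under "Our Approach" can be illustrated with a small, single-machine sketch in Python. This is not the project's ECL/HPCC implementation. The function names (build_index, search, prefix), the token order (rare words first, by document frequency) and the prefix length n - ceil(t*n) + 1 for a Jaccard threshold t are assumptions taken from the general prefix-filtering technique for set similarity, chosen here only to make the steps concrete.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens, freq, threshold):
    """Prefix of a token set under a global order (rare tokens first).

    Assumed prefix length for Jaccard threshold t: n - ceil(t * n) + 1.
    The small epsilon guards against floating-point round-up in t * n.
    """
    ranked = sorted(tokens, key=lambda tok: (freq.get(tok, 0), tok))
    n = len(ranked)
    length = n - math.ceil(threshold * n - 1e-9) + 1
    return ranked[:length]


def build_index(docs, threshold):
    """Batch step: build an inverted index over document prefixes only."""
    freq = Counter(tok for tokens in docs.values() for tok in tokens)
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in prefix(tokens, freq, threshold):
            index[tok].add(doc_id)
    return index, freq


def search(query, docs, index, freq, threshold):
    """Query step: probe the prefix index, then verify each candidate."""
    candidates = set()
    for tok in prefix(query, freq, threshold):
        candidates |= index.get(tok, set())
    results = {}
    for doc_id in candidates:
        sim = jaccard(query, docs[doc_id])
        if sim >= threshold:  # verification removes false positives
            results[doc_id] = sim
    return results


if __name__ == "__main__":
    # Hypothetical toy collection; each document is a set of words.
    docs = {
        "d1": {"similarity", "search", "on", "web", "scale", "data"},
        "d2": {"similarity", "join", "on", "web", "scale", "data"},
        "d3": {"grid", "index", "for", "location", "data"},
    }
    index, freq = build_index(docs, threshold=0.6)
    query = {"similarity", "search", "on", "web", "data"}
    print(search(query, docs, index, freq, threshold=0.6))
```

In this sketch, build_index plays the role of the batch indexing step and search the role of the online query step. Only the rarest words of each document are indexed, so the index stays small, while the verification step ensures that every returned document actually meets the threshold.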

Project Members

Prof. Johann-Christoph Freytag (PhD)
Dipl.-Inf. Fabian Fier
Eva Höfer