Authors: Prof. Johann-Christoph Freytag, Fabian Fier
Prof. Freytag and Fabian Fier gave a talk on Textual Similarity Search on Big Data at the annual HPCC Conference of LexisNexis, a subsidiary of our project partner Elsevier.
Finding similar objects in large data sets is an important database operation, used in applications such as plagiarism detection, document clustering, and duplicate removal. As data sets grow, the canonical solutions no longer suffice: with straightforward approaches, the runtime becomes very large even for moderately large data sets, even when distributed systems are used.
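To make the scaling problem concrete, the following is a minimal sketch of one straightforward approach. It assumes Jaccard similarity over sets of word tokens and a naive comparison of all pairs; the talk itself does not name a similarity measure or a baseline, so both choices, along with the function names and the sample documents, are illustrative assumptions only.

```python
from itertools import combinations


def tokenize(text):
    """Split a text into a set of lowercase word tokens (assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def naive_similarity_join(docs, threshold):
    """Return (i, j, similarity) for every pair of documents whose
    similarity reaches the threshold. Compares all n * (n - 1) / 2 pairs."""
    token_sets = [tokenize(d) for d in docs]
    result = []
    for i, j in combinations(range(len(docs)), 2):
        sim = jaccard(token_sets[i], token_sets[j])
        if sim >= threshold:
            result.append((i, j, sim))
    return result


# Hypothetical sample data, not taken from the talk.
docs = [
    "big data similarity search",
    "similarity search on big data",
    "duplicate removal in databases",
]
pairs = naive_similarity_join(docs, 0.7)
```

With n documents, this sketch performs n(n-1)/2 comparisons, so one million documents already require roughly 5 × 10^11 of them. This quadratic growth is why runtime becomes a concern even for moderately large data sets.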
We give an overview of similarity search on text data and discuss scalable solution approaches. In our research, we experimentally compare algorithmic approaches in order to optimize runtime. We show that this is a complex problem because many parameters are involved, including data properties such as skew, runtime parameters, and implementation details. We share insights from our practical findings when comparing implementations on Hadoop with implementations on HPCC. This talk addresses practitioners as well as theoreticians who are interested in similarity search, text processing, and scalable algorithmic approaches that are inspired by MapReduce and are adaptable to HPCC.