Let the trees do the work: Hands-on optimal document search with Mercury-Settrie
The digitization of archives presents infinite possible worlds as they reproduce so quickly—hundreds, thousands, millions of documents. But finding what we are looking for can be daunting. In this scenario, trees now come to our rescue in a different manner: they provide efficient solutions for document search.
In computer science, a tree data structure is a hierarchical structure connecting nodes. It resembles the branches of trees or roots. These trees have a parent node, from which children nodes grow. From the many tree data structures, we will focus on SetTrie, which efficiently performs operations on sets.
Mercury-settrie: efficient document search made open-source
BBVA AI Factory has recently released an open-source library, Mercury. Today we will focus on mercury-settrie , one of the micro-repositories we can find in this library. Arising from the necessity of improving our recommender engine’s existing implementation, we coded the algorithm in C++ and made it accessible as a Python library. Performance is 200 times faster and 16 times more memory-efficient than a pure Python implementation.
The algorithm implemented is SetTrie. The SetTrie is a data structure that efficiently searches and stores data sets. It organizes the data into a tree structure, where each path down the tree represents a set. The items are sorted to enable efficient searching.
One of the key advantages of the SetTrie algorithm is that it can be used for a wide range of applications, including text search, data compression, and more. It’s also highly adaptable and can be optimized for different data and search patterns. Here we introduce the simplest case: searching for a set of documents.
Hands-on!
Before we proceed, make sure to open a console window. In our example, we use a Linux system. The exact syntax may vary depending on the platform. Ensure you have `Python 3.8` or better installed and `pip`. Upgrade `pip` to the latest version and install `mercury-settrie`.
pip install --upgrade pip
pip install mercury-settrie
Use any folder to download the “The 20 Newsgroups Text Dataset”. We will search for any document containing a specific set of words within it. You only have to:
wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz && tar -xf 20news-18828.tar.gz
Going forward, please ensure that you have your preferred Python interpreter, whether Jupyter, an IDE, or the Python interpreter directly, open and ready to use for executing the code examples. First, we write an elementary example function that converts the content of a text file into the set of its words.
def words(file_name):
# 1. Read as a list of strings, one line per string, ignoring encoding errors.
with open(file_name, errors = 'ignore') as f:
txt = [s.rstrip('n') for s in f.readlines()]
# 2. Remove anything that is not a letter or a number.
rex = re.compile('[^a-zA-Z0-9 ]')
txt = [rex.sub('', s) for s in txt]
#3. Split by space and push everything into a set.
ret = set()
for s in txt:
ret.update(s.split(' '))
return ret
Now, we only have to create a SetTrie object and populate it with the words in each document.
import os, re, settrie
st = settrie.SetTrie()
for path, _, ff in os.walk('20news-18828'):
for f in ff:
file_name = '%s/%s' % (path, f)
st.insert(words(file_name), file_name)
That’s all! The object `st` is a SetTrie. Now, you can query the data in logarithmic time. The object is Pythonic. It can be saved and loaded as a pickle and returns an iterator as the result of any query. Here, we convert the iterator into a list using `list()` and provide the query as a set of words.
it = st.supersets({'SunView', 'XView'})
print(list(it))
If you got this far, you have a simple implementation of a very efficient document search that can be customized to suit your requirements.
Conclusions
The SetTrie algorithm is an efficient and versatile solution for searching and storing sets of data. Using this algorithm, we can make document search faster and more efficient while contributing to our forests’ conservation. To see everything the library can do, check the documentation and start using mercury-settrie!