Implementing Indexes – A Simple Index

In Pharo, we have an excellent tool to query, navigate and discover classes and methods in the system. The tool is Spotter, this tool allows us to search the image for different elements:

  • Classes
  • Methods
  • Menu entries
  • Help topics
  • and many more…

To access to it is as easy as doing Shift + Enter,

Spotter
Spotter allows us to look for classes, methods and other elements doing a text search on the name of the element.

Spotter gives us access to all the classes in the system, giving us the power to learn, discover and increase our productivity.

However, it has a problem. If the image is a large image (it contains a couple of millions of objects and/or thousand of classes — one company has just 30 millions line of code) the lookup of the elements is slow. We need to implement a text search index.

A simple “Begins With” index

Spotter allows us to do full-text searches, looking up the text we want (the needle) in any part of the system. Although, we are going to start with an easier index, one that allows us to look for the entries that begin with the needle.

If we look up for Obj, the index will answer for example Object, and ObjectLayout; but it will not include DisplayableObject.  We will see how to implement a full-text search and even fuzzy logic searches in future entries!

For implementing the index we are going to build a Trie.  From Wikipedia:

A trie also called a digital tree or prefix tree, is a kind of search tree—an ordered tree data structure used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Keys tend to be associated with leaves, though some inner nodes may correspond to keys of interest. Hence, keys are not necessarily associated with every node [Full Article].

In this case, we are going to start with the implementation that is available in Github Pharo-Containers/Containers-Trie. We thanks Benoit St-Jean for his implementation.

This implementation is based on the first iteration of Benoît St-Jean.

To install it is as easy as executing the following Metacello expression:

Metacello new
      baseline: 'ContainersTrie';
      repository: 'github://pharo-containers/Containers-Trie/src';
      load.

Once installed we will have the CTTrie class to create a new trie and use it.

Creating a new Trie is as easy as creating a new instance of CTTrie

aNewTrie := CTTrie new

Once we have an instance we can start to add elements in it and then performing a search on it.  The instances of CTTrie understand a series of messages to access, add and update the nodes in the trie.

A trie is a collection of associations. Each association has a key and a value. The Key should always be a String. The value can be of any type.

  • at: aKey – returns the stored value for this key, throws an error if not found.
  • at: aKey put: aValue – store the association of the key and the value in the trie.
  • at: aKey update: aBlock – updates the existing element in the trie with the result of executing the block passed as parameter. The block has a single argument that is the previous value associated with the key.
  • at: aKey update: anUpdateBlock initial: anInitialBlock – if the trie contains a value associated with the key, the block anUpdateBlock is used to update the value receiving as optional argument the previous value. If the trie does not contain an association for the key, a new association is created with the result of executing anInitialBlock.
  • removeKey: aKey – removes the association with the given key from the trie, throws an error if not found
  • removeKey: aKey ifAbsent: aBlock – removes the association with the given key from the trie. If the key is not found, the block is executed.

These methods allow us to access and update the trie using the full key.

As an example:

aTrie := CTTrie new.
aTrie at: #cat put: 'Meow'.
aTrie at: #dog put: 'Guau'.
aTrie at: #cat.
aTrie at: #fox put: '???'.
aTrie at: #fox.
aTrie at: #fox update: [ 'What does the fox say?' ].
aTrie at: #fox

As far as we are, the Trie is not more powerful than the Dictionary instances in Pharo.  So let’s see, which is the API to do queries to the trie.

  • allValues – it returns a collection with all the values stored in the trie.
  • allValuesBeginningWith: aString – it returns all the values that are stored in the trie and which keys start with the string passed as parameter.
  • withAllValuesDo: aBlock – it iterates all the trie and executes the block on each of the values in the trie. This method does not create the intermediate collection, so it is faster than using allValues and then iterating the collection.
  • withAllValuesBeginningWith: aString do: aBlock – it iterates all the elements that have a key starting with the passed string and executes the block on each of them. This method does not create the intermediate collection, so it is faster than using allValuesBeginningWith: and then iterating the collection.

So now, we are able to query the trie and get all the elements that have keys that begin with a given string.

Let’s index all the classes in the system.

"Let's create a new Trie"
behaviors := CTTrie new.

"We will iterate all the behaviors (classes and traits) and store
them in the trie with their name as key"
SystemNavigation default allBehaviorsDo: [ :aBehavior | 
	behaviors at: aBehavior name put: aBehavior].

"We can access the trie to make queries!"
behaviors at: #Object.
behaviors allValuesBeginningWith: 'Obj'.

What is the advantage?

Using a trie provides a fast way of performing the lookup of keys including a given prefix. Without the use of a trie, we need to iterate the full collection to compare each of the keys (for example using the message beginsWith: that is present in String).

This operation is very expensive when we have thousands and thousands of elements to iterate, in the case of Spotter, thousands and thousands of methods and classes.

Using a trie the lookup effort does not depend on the size of the indexed collection but in the size of the prefix that we are looking for. More precisely, looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string).

How is it implemented?

A trie is an unbalanced tree. We have implemented the trie using two classes CTTrie and CTTrieNode.

CTTrie is the frontend of the implementation and it is used by the user to access the associations in it. It contains a single node that is the root of the tree. This class implements all the messages that the user wants to use and hides the implementation details in the nodes of the tree.

Each node of the tree is an instance of CTTrieNode, and each of these instances has:

  • The node value
  • The children of the node
  • And the first character of the subtree that this node starts.

In a Trie, the key of an entry is encoded in the path from the root to the node containing the value.

In the following picture, we can see an example. The code to generate this example is the following:

aTrie := CTTrie new
aTrie at:'bat' put:'bat'.
aTrie at:'cat' put:'cat'.
aTrie at:'cow' put:'cow'.

trie

This example has 3 entries, for the clarity of the example, the keys and the values are the same, but you can store anything in the values.

As we can see from the picture, the keys of the associations are encoded in the path from the root of the tree. For the key bat, we start looking up the first letter of the string to select the path to get the answer, then the second letter and finally the third one.

The same lookup procedure can be executed to find the keys cow and cat.

If we perform the lookup of a key and we do not end in a leaf of the tree (a leaf node is one node with a value), we have found that the key is not present in the trie.

If we want to have all the entries for a given prefix, we will get a subtree. In this case if we want all the keys starting with $c we will get all the right subtree.

With this structure, it is only needed to perform n accesses to memory to get to the result, where n is the length of the prefix or key that we are looking for.

This is just a small introduction, for better understanding, why don’t you try with the Pharo implementation, adding elements to it and inspecting the structure of the resulting trie.

Still, we have problems…

The implementation we have just seen presents two problems:

  • It is not space-efficient, as a lot of nodes are creating and many of them do not contain nor children nor values.
  • It does not allow us to perform searches in the middle of the keys.

We will address both of these problems in the next entry of this series. Where we are going to see the optimized version of this trie and how to implement a suffix trie.

In the meanwhile, you can read the implementation, check the tests and understand how it is implemented. It is a really straight forward implementation, and why not… start reading the implementation of the optimized version!!!

Published by Pablo Tesone

I am one of the Pharo Consortium Engineers. I am an enthusiast object-oriented programming and environment!

3 thoughts on “Implementing Indexes – A Simple Index

Leave a comment