Auto-completing search queries

Our search engine (RICE) has a feature called 'auto-complete'. As you type into the search box, it will suggest terms that you might be searching for.

RICE uses auto-completion to predict your searches

To do this we use JavaScript events to respond as you type within the search box.  Each time you press a key, we send a request back to the search engine,  the search engine then responds with a set of possible queries ordered by most likely match. These are then displayed instantaneously using a technology called AJAX. Simple!

So, how do we work out which search term will be the most likely? Using a technique called n-grams we use a sub-set of the current input to predict what the user wants based on the content of our website.

What are n-grams?

Brace yourself:

A n-gram is a sequence of any number of items from a given sequence. We use the content of our own web pages to build these sequences. So, lets look at an example of an n-gram, consider the following snippet of text:

You can contact Shropshire Council by phone,
email, web, in person or in writing

To take a tri-gram (an n-gram of 3) of this text, imagine a window of three words sliding over the text one word at a time.  This give us the following tri-grams:

You can contact
can contact Shropshire
contact Shropshire Council
Shropshire Council by
Council by phone
by phone email
phone email web
email web  in
web in person
in person or
person or in
or in writing

If you imagine this happening on every page of our website, for every level of n-gram (not only tri-grams, but uni-grams, bi-grams and quad-grams) we can create a list of the most common sequences of words. As you could probably guess, the word most likely to follow 'Shropshire' on our website is 'Council'.

Tuning performance in the database (this is where is gets really complicated)

Evaluating and using n-grams is computationally expensive. If we want auto-complete to be useful it has to work in as close to real-time as possible, presenting search options as fast as you can type them.

Powering our search engine is PostgreSQL, the worlds most advanced open source relational database. PostgreSQL allows indexes to be created based upon the result of an expression, not just the raw field value, known as 'expression indexes'.

At the database level, our implementation of n-grams defines numerous expression indexes. This allows the complex matches to be executed quickly and efficiently. Most databases would require executing the function on every row within the table, that would be slow, exactly what we don't want in this situation.

The following is a snippet from the n-gram database schema, this demonstrates the expression indexes:

CREATE INDEX "1GRAM_BUILT_G1_S10_IDX" ON "1GRAMS"
    USING btree("substring"("GRAM1", 1, 10));
CREATE INDEX "1GRAM_BUILT_G1_S1_IDX" ON "1GRAMS"
    USING btree("substring"("GRAM1", 1, 1));
CREATE INDEX "1GRAM_BUILT_G1_S2_IDX" ON "1GRAMS"
    USING btree("substring"("GRAM1", 1, 2));
CREATE INDEX "1GRAM_BUILT_G1_S3_IDX" ON "1GRAMS"
    USING btree("substring"("GRAM1", 1, 3));

We then query the database using procedural SQL functions, the following is the uni-gram query function:

CREATE OR REPLACE FUNCTION query1grams(g1 text) RETURNS SETOF "1gramtype" AS
$BODY$
DECLARE
	g1l integer;
	g1q text;
	tmp "1gramtype";
BEGIN
	g1q := g1;

	g1l := char_length(g1q);

	IF g1l > 0 THEN
		IF g1l > 10 THEN
			g1l := 10;
			g1q := substring(g1q,1,g1l);
		END IF;
		FOR tmp IN SELECT g."ID",g."GRAM1",g."POPULARITY"
                 FROM "1GRAMS" g WHERE substring(g."GRAM1",1,g1l) = g1q
                 ORDER BY g."POPULARITY" DESC LOOP
			RETURN NEXT tmp;
		END LOOP;
	END IF;

	RETURN;
END;
$BODY$
LANGUAGE plpgsql VOLATILE COST 100 ROWS 1000;
ALTER FUNCTION query1grams(text) OWNER TO "search";

This function looks at the length of the n-gram and generates a database query which will correctly target our defined expression indexes.

Using pgAdmin III, we can look at the query plan, as shown below:

RICE n-grams query plan

As you can see, PostgreSQL is executing the query using an index scan, this is fast and efficient and results in very quick auto-completion for the search engine.