Our search engine (RICE) has a feature called 'auto-complete'. As you type into the search box, it will suggest terms that you might be searching for.
To do this we use JavaScript events to respond as you type within the search box. Each time you press a key, we send a request back to the search engine, the search engine then responds with a set of possible queries ordered by most likely match. These are then displayed instantaneously using a technology called AJAX. Simple!
So, how do we work out which search term will be the most likely? Using a technique called n-grams we use a sub-set of the current input to predict what the user wants based on the content of our website.
What are n-grams?
Brace yourself:
A n-gram is a sequence of any number of items from a given sequence. We use the content of our own web pages to build these sequences. So, lets look at an example of an n-gram, consider the following snippet of text:
You can contact Shropshire Council by phone, email, web, in person or in writing
To take a tri-gram (an n-gram of 3) of this text, imagine a window of three words sliding over the text one word at a time. This give us the following tri-grams:
You can contact can contact Shropshire contact Shropshire Council Shropshire Council by Council by phone by phone email phone email web email web in web in person in person or person or in or in writing
If you imagine this happening on every page of our website, for every level of n-gram (not only tri-grams, but uni-grams, bi-grams and quad-grams) we can create a list of the most common sequences of words. As you could probably guess, the word most likely to follow 'Shropshire' on our website is 'Council'.
Tuning performance in the database (this is where is gets really complicated)
Evaluating and using n-grams is computationally expensive. If we want auto-complete to be useful it has to work in as close to real-time as possible, presenting search options as fast as you can type them.
Powering our search engine is PostgreSQL, the worlds most advanced open source relational database. PostgreSQL allows indexes to be created based upon the result of an expression, not just the raw field value, known as 'expression indexes'.
At the database level, our implementation of n-grams defines numerous expression indexes. This allows the complex matches to be executed quickly and efficiently. Most databases would require executing the function on every row within the table, that would be slow, exactly what we don't want in this situation.
The following is a snippet from the n-gram database schema, this demonstrates the expression indexes:
CREATE INDEX "1GRAM_BUILT_G1_S10_IDX" ON "1GRAMS" USING btree("substring"("GRAM1", 1, 10)); CREATE INDEX "1GRAM_BUILT_G1_S1_IDX" ON "1GRAMS" USING btree("substring"("GRAM1", 1, 1)); CREATE INDEX "1GRAM_BUILT_G1_S2_IDX" ON "1GRAMS" USING btree("substring"("GRAM1", 1, 2)); CREATE INDEX "1GRAM_BUILT_G1_S3_IDX" ON "1GRAMS" USING btree("substring"("GRAM1", 1, 3));
We then query the database using procedural SQL functions, the following is the uni-gram query function:
CREATE OR REPLACE FUNCTION query1grams(g1 text) RETURNS SETOF "1gramtype" AS $BODY$ DECLARE g1l integer; g1q text; tmp "1gramtype"; BEGIN g1q := g1; g1l := char_length(g1q); IF g1l > 0 THEN IF g1l > 10 THEN g1l := 10; g1q := substring(g1q,1,g1l); END IF; FOR tmp IN SELECT g."ID",g."GRAM1",g."POPULARITY" FROM "1GRAMS" g WHERE substring(g."GRAM1",1,g1l) = g1q ORDER BY g."POPULARITY" DESC LOOP RETURN NEXT tmp; END LOOP; END IF; RETURN; END; $BODY$ LANGUAGE plpgsql VOLATILE COST 100 ROWS 1000; ALTER FUNCTION query1grams(text) OWNER TO "search";
This function looks at the length of the n-gram and generates a database query which will correctly target our defined expression indexes.
Using pgAdmin III, we can look at the query plan, as shown below:
As you can see, PostgreSQL is executing the query using an index scan, this is fast and efficient and results in very quick auto-completion for the search engine.