We are all using databases and full-text search engines on a daily basis, but what is their difference from a technical point of view?

For databases, everything is black and white: you store data and query it. Exact matches are returned, while everything else is ignored. This applies to relational databases like MySQL or PostgreSQL as well as to NoSQL systems such as MongoDB.

Full-text search engines like Elasticsearch or Solr can see in shades of grey, and return more than just exact matches. It is more about the meaning or concept than the exact form. Every query finds every similar term and then ranks the results according to their quality.

But my database can do that too: a query like `SELECT * FROM my_table WHERE my_text LIKE '%blue%'` will find more than just exact matches.

(Meme image. Robin: "I can use LIKE %term%." Batman: "No and no.")

This has two issues:

  • First, widely used B-tree indexes cannot help you with queries that start with a wildcard. You need to do a full table scan over all rows and then try to match anywhere in your column. If you have more than a few thousand entries this… will… be… slow… (the example right after this list shows this in practice).
  • Second, there is no concept of fuzziness, which helps with typos, different spellings, or even phonetic matches. When searching, this is often exactly what you want.
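
To see the first issue for yourself, you can ask PostgreSQL for its query plan for the query from above:

bash
$ psql -c "EXPLAIN SELECT * FROM my_table WHERE my_text LIKE '%blue%';"

Regardless of any plain B-tree index on `my_text`, the plan will show a sequential scan over the entire table, because the leading wildcard makes the index unusable. Specialized solutions like trigram indexes can work around this, but they are the exception rather than the rule.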

Storing

Now that we know why we want to use full-text search, how does it work? There are some additional steps we need to take when we store data:

  • Remove any formatting: If it is not already plain-text, you want to get rid of all the visual information. For example, with HTML we would strip all the tags from the input.
  • Tokenize: Extract all the words from our plain-text. This is normally done by splitting elements on spaces and punctuation marks. Additionally, convert all the extracted words to lowercase if your use case is not case-sensitive.
  • Stop words: These are very common words that add little information since they are everywhere, and would only increase your index of words. Words like *the*, *a*, *and*, *if*,… are removed. The list varies between implementations and is obviously language dependent.
  • Stemming: You are interested in the concept and not its specific form. Stemming reduces words to their root by removing plurals or other variations. Again, this is language dependent and one popular solution is Snowball.
  • Synonyms: Optionally you can define synonyms, like having `quick` in your text, but also wanting to find it when someone searches for `fast`.

So for example, with Elasticsearch the sentence:

`The two <em>lazy dogs</em>, were slower than the less lazy dog, Rover.`

becomes

`two`, `lazi`, `dog`, `were`, `slower`, `than`, `less`, `lazi`, `dog`, `rover`

Storing step-by-step

If you want to try this out for yourself, run the following request (assuming you have Elasticsearch running on the same machine on its default port 9200). We are not going to dive into the syntax details, since this article is more about the concepts, but this gives you the ability to play around and experiment.

bash
$ curl -XGET localhost:9200/_analyze?pretty -d '{
  "char_filter": ["html_strip"],
  "tokenizer" : "standard",
  "filter" : ["lowercase", "stop", "snowball"],
  "text": "The two <em>lazy dogs</em>, were slower than the less lazy dog, Rover."
}'

This will result in the following response:

json
{
  "tokens" : [
  {
    "token" : "two",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  },
  {
    "token" : "lazy",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  },
  ...
  ]
}

Running the same request for the text `These dogs are <b>fast</b>.`, you will get `dog` and `fast`.
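
For reference, that is the same request as before with only the text changed:

bash
$ curl -XGET localhost:9200/_analyze?pretty -d '{
  "char_filter": ["html_strip"],
  "tokenizer" : "standard",
  "filter" : ["lowercase", "stop", "snowball"],
  "text": "These dogs are <b>fast</b>."
}'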

Applying structure and synonyms

Now we can set up an index for storing our texts, which you can do with the following request:

bash
$ curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "quick,fast",
            "lazy,sleepy"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["html_strip"],
          "tokenizer" : "standard",
          "filter" : ["lowercase", "stop", "snowball", "my_synonym_filter"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_title": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'

If you are not familiar with the terms `index` and `type`, think of them as a way to structure your documents in Elasticsearch. The `mapping` is roughly comparable to a schema in a relational database. Basically, we are telling Elasticsearch to apply our analyzer `my_analyzer`, including the synonym filter, to the field `my_title`.
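
If you want to double-check what was created, you can retrieve the settings and the mapping of the index with a simple GET request:

bash
$ curl -XGET 'localhost:9200/my_index?pretty'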

Checking the results

To test that this has worked, run the following request, which should produce the same result as before:

bash
$ curl -XGET 'localhost:9200/my_index/_analyze?field=my_title&pretty=true' -d '{
  "text": "The two lazy dogs, were slower than the less lazy dog, Rover."
}'

Now you can add two documents with the IDs 1 and 2:

bash
$ curl -XPUT localhost:9200/my_index/my_type/1 -d '{
  "my_title": "The two lazy dogs, were slower than the less lazy dog, Rover."
}'
$ curl -XPUT localhost:9200/my_index/my_type/2 -d '{
  "my_title": "These dogs are fast."
}'

This results in a list of stemmed terms stored in Elasticsearch, the so-called inverted index. You can imagine it like the following table, with the number of occurrences per document and the positions in square brackets:

Term       ID 1       ID 2
dog        2 [3,10]   1 [1]
fast       0          1 [3]
lazi       2 [2,9]    0
less       1 [8]      0
rover      1 [11]     0
slower     1 [5]      0
than       1 [6]      0
two        1 [1]      0
were       1 [4]      0
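
You do not actually have to imagine this list: the term vectors API returns the terms, frequencies, and positions stored for a single document. Depending on your Elasticsearch version the endpoint is called `_termvector` or `_termvectors`; for document 1 the request looks like this:

bash
$ curl -XGET 'localhost:9200/my_index/my_type/1/_termvectors?fields=my_title&pretty'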

Searching

Once we have our data stored, we want to search it using a simple query like this:

bash
$ curl -XPOST localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "match": {
      "my_title": "fast"
    }
  }
}'
{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.41258186,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : 0.41258186,
        "_source" : {
          "my_title" : "These dogs are fast."
        }
      }
    ]
  }
}

Once you see that this is working as expected, try queries for `dog`, `dogs`, or `the`: the first two match both documents, while the stop word `the` matches nothing at all. This also works for our synonyms:

bash
$ curl -XPOST localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "match": {
      "my_title": "quick"
    }
  }
}'
{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.41258186,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : 0.41258186,
        "_source" : {
          "my_title" : "These dogs are fast."
        }
      }
    ]
  }
}

Scoring

The main question that remains is how the relevance `_score` is calculated, which is what makes our search engine see shades of grey instead of just black and white.

Without going into all the details (including Okapi BM25, which will become the default in Elasticsearch with the next major release), the three concepts at work here are:

  • Term Frequency (TF): The more often a document contains the search term, the more relevant this document is. Thus if a document contains the term `dog` twice and another one only once, the first document is more relevant.
  • Inverse Document Frequency (IDF): If you are searching for a term, less frequent terms are more relevant than more frequent ones. Imagine you are searching for `dog` and it appears hundreds of times while `rover` only appears a dozen times in all of your documents. Then the few documents containing `rover` have a higher IDF value than all the `dog` documents.
  • Field length: Finding your search term in a short field (like a title) makes the match more relevant than finding it in a much longer field (such as a text body).
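
For the curious: Lucene's classic TF/IDF similarity, which implements these three concepts, computes the individual pieces roughly as follows (a simplification; the full formula combines them with further normalization):

tf(t, d)  = sqrt(frequency of t in d)
idf(t)    = 1 + ln(numDocs / (docFreq(t) + 1))
norm(d)   = 1 / sqrt(number of terms in d)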

These three attributes are already calculated when you store your data and you can even query this information, which is known as the weight of a term:

bash
$ curl -XGET 'localhost:9200/my_index/my_type/_search?pretty=true&explain=true' -d '{
  "query": {
    "term": {
      "my_title": "dog"
    }
  }
}'

This gives you a rather long output with a score (weight) of 0.39291072 for the first document and 0.25811607 for the second one, together with all of the steps that lead to those numbers. For intuition: the first document contains `dog` twice (higher term frequency), while the second document is shorter (better field-length norm). Here the effect of the term frequency is stronger than that of the field length.

Scoring with multiple terms

So far, all of this has been about queries for a single term. To combine multiple terms in one query, search engines use the vector space model. This sounds very fancy, but it basically means one dimension per term, so you can calculate how far a document is from your search terms. If you are searching for `dog rover`, we have a two-dimensional space. Let's assume `dog` appears frequently and has a (made up) weight of 1, whereas `rover` appears rarely and thus has a weight of 5. Put `dog` on the x-axis and `rover` on the y-axis: a document containing both `dog` and `rover` corresponds to the vector (1,5) and is the best match. A document containing only `rover` sits at (0,5) and one containing only `dog` at (1,0). Then you can calculate the similarity: in this example, `rover` at (0,5) is much closer to (1,5) than `dog` at (1,0).
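
As a quick sanity check of those distances, using plain Euclidean distance (Lucene's actual calculation differs in the details, but the intuition carries over):

distance((0,5), (1,5)) = sqrt((1-0)^2 + (5-5)^2) = 1    rover only: close to the query
distance((1,0), (1,5)) = sqrt((1-1)^2 + (0-5)^2) = 5    dog only: far from the query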

(Figure: the vector space model, with `dog` on the x-axis and `rover` on the y-axis)

This is all you need to know to understand how scoring works.

Running fuzzy queries

For the sake of completeness, this is the syntax to run fuzzy queries, something your relational database cannot do:

bash
$ curl -XPOST localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "match": {
      "my_title": {
        "query": "rovar",
        "fuzziness": "AUTO"
      }
    }
  }
}'

Here we are misspelling `Rover` as `rovar`, and we still find it. The `AUTO` setting picks the allowed Levenshtein edit distance based on the length of the term: one edit for terms of three to five characters, two edits for longer terms. Remember, you don't want to run a query like `SELECT * FROM my_table WHERE my_title LIKE '%over' OR my_title LIKE 'R%ver' OR …`.

Full-text search conclusion

Full-text search does some additional work when storing data as well as when searching, but in exchange it is much more flexible than a conventional database. Now that you know how this works, you might find many suitable use cases for it, so go forth and search!

 


About The Author
- Philipp is part of the infrastructure team and a Developer Advocate at Elastic, spreading the love and knowledge of full-text search, analytics, and real-time data. He is a frequent speaker at conferences and meetups about all things search & analytics, databases, cloud computing, and devops. Philipp lives in Vienna where he enjoys experimenting with software, organizing meetups, and sports.
