saylornotes

The Blog of Chris Saylor

    Elasticsearch Series: Let It Work for You

    May 19, 2016 engineering

    This is the second entry in a series on Elasticsearch and how we use it in our applications. See the previous entry on Rebuilding Indices with No Downtime.

    Elasticsearch has a vast number of ways to query the data in your index. How you query your data is very much dependent upon where it is being used, so in this post we’re going to talk about how we use it in our instructor search application.

    Historically, most of our searches were performed on denormalized MySQL tables. This included using an arc distance formula as part of a SQL statement with a ton of “like” statements and rudimentary ordering based on the user’s input. As you have no doubt guessed, this wasn’t performant and the results were not very good.

    The results weren’t relevant.

    Relevancy is key when doing any sort of search, and this is incredibly difficult to accomplish with a database that wasn’t designed for search. It’s up to the engineer to figure out what is relevant to the user based on what was searched. If the user searched for an instructor by name, it seems logical to sort the results alphabetically by name. What if they searched for an instructor by name who is within 10 miles of their current location? Do you sort by distance from the user and then alphabetically?

    These are tough questions to answer, and even if you do find an answer, ordering the results in any structured or weighted way is very difficult with this kind of setup.

    Lucene Lights the Way

    Elasticsearch is built upon a search engine library called Lucene. It is the engine that takes the query you provide Elasticsearch and gets the data from the index. One very important function it serves is calculating a “score”: a number that codifies how relevant a particular result is to the original query.

    When we first started migrating our search from MySQL to Elasticsearch, it was really tempting to sort the same way we did with the original MySQL query, but we soon learned that the results were just as bad as (if not worse than) what MySQL returned. We had to learn how to shape our query to take advantage of the “score”.

    Show Me the Query!

    First things first: our application offers a distance filter. Let’s say we’re trying to find instructors in New York, NY within 5 miles of a central point. We might do something like this:

    {
      "query": {
        "bool": {
          "filter": [{
            "geo_distance": {
              "distance": "8.04672km",
              "address.geo": {
                "lat": "40.7643358",
                "lon": "-73.9849351"
              }
            }
          }]
        }
      }
    }
    

    This would find instructors within 5 miles of the center of New York, NY:

    1. Chris Saylor — 1.4 miles away
    2. Aaron Sonders — 0.5 miles away
    3. Bethany Smith — 3 miles away
    4. Chris Saylor — 0.2 miles away

    Notice the results are scattered at varying distances from that center because this query doesn’t have any relevancy. Wait…what? You just said that’s what Elasticsearch does for you! It comes down to filtering versus querying. A filter is a binary thing: a document either matches and “must” be in the results, or it is excluded. As such, no score is calculated for it. A query, on the other hand, measures how well a document matches what was searched, and this is where the score is calculated. It would be tempting (and easy) to sort by the distance from the center, but that only works if distance is the only factor.
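
    As a rough illustration, the sketch below builds the same radius filter from a distance in miles (the 8.04672km above is just 5 miles converted to kilometers) and adds the distance-only sort described above. The helper names `miles_to_km` and `distance_only_search` are made up for this example; `address.geo` is the field used in this post’s queries, and `_geo_distance` is Elasticsearch’s standard geo sort. It only builds the request body and doesn’t talk to a cluster.

```python
import json

KM_PER_MILE = 1.609344  # exact, by definition of the international mile


def miles_to_km(miles):
    """Format a distance in miles as an Elasticsearch distance string in km."""
    return f"{miles * KM_PER_MILE:.5f}km"


def distance_only_search(lat, lon, miles):
    """Filter by radius, then order purely by distance (no relevance involved)."""
    point = {"lat": lat, "lon": lon}
    return {
        "query": {
            "bool": {
                # Filter context: a yes/no match that does not affect the score.
                "filter": [{
                    "geo_distance": {
                        "distance": miles_to_km(miles),
                        "address.geo": point,
                    }
                }]
            }
        },
        # Explicit sort: nearest first, ignoring any relevance score.
        "sort": [{
            "_geo_distance": {
                "address.geo": point,
                "order": "asc",
                "unit": "km",
            }
        }],
    }


body = distance_only_search(40.7643358, -73.9849351, 5)
print(json.dumps(body, indent=2))
```

    This orders results nearest-first, but as soon as a second factor such as the instructor’s name enters the picture, an explicit sort like this throws away whatever the score could have told us.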

    Now, let’s look for an instructor named “Chris Saylor” within 5 miles of New York, NY:

    {
      "query": {
        "bool": {
          "must": [{
            "query_string": {
              "default_field": "full_name",
              "query": "chris saylor"
            }
          }],
          "filter": [{
            "geo_distance": {
              "distance": "8.04672km",
              "address.geo": {
                "lat": "40.7643358",
                "lon": "-73.9849351"
              }
            }
          }]
        }
      }
    }
    
    1. Chris Saylor — 1.4 miles away
    2. Chris Saylor — 0.2 miles away
    3. Anthony Saylor — 1.6 miles away

    The full name query has relevancy (thanks Lucene), but the distances are still scattered because the geo distance filter doesn’t contribute to the score. Let’s make distance contribute to the score using an Elasticsearch mechanism called function scoring:

    {
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": [{
                "query_string": {
                  "default_field": "full_name",
                  "query": "chris saylor"
                }
              }],
              "filter": [{
                "geo_distance": {
                  "distance": "8.04672km",
                  "address.geo": {
                    "lat": "40.7643358",
                    "lon": "-73.9849351"
                  }
                }
              }]
            }
          },
          "functions": [{
            "linear": {
              "address.geo": {
                "origin": {
                  "lat": "40.7643358",
                  "lon": "-73.9849351"
                },
                "scale": "8.04672km"
              }
            }
          }]
        }
      }
    }
    
    1. Chris Saylor — 0.2 miles away
    2. Chris Saylor — 1.4 miles away
    3. Anthony Saylor — 1.6 miles away

    What’s going on here? We introduced a linear decay function that calculates a value between 0 and 1, which shrinks the further a result is from the central point. That value is combined with the score of the name query (multiplied, by default), so the relevance of the distance is now included with the relevance of the name we searched. Since there is more than one instructor named “Chris Saylor” in that location, the instructor closer to that central point will appear higher in the results.
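
    To make the decay concrete, here is a small sketch of the linear decay formula as given in the Elasticsearch documentation, applied to the three example distances. It assumes the defaults for the parameters the query above doesn’t set (`decay` of 0.5 and `offset` of 0) and works in miles for readability, so `scale` is 5. The name scores themselves aren’t known here, so only the distance multiplier is shown.

```python
def linear_decay(distance, scale, decay=0.5, offset=0.0):
    """Linear decay multiplier for a function_score decay function.

    Returns 1.0 at the origin, `decay` at `scale` away from it, and 0.0
    from scale / (1 - decay) onward.
    """
    s = scale / (1.0 - decay)
    return max((s - max(0.0, distance - offset)) / s, 0.0)


# Distances in miles from the example results; scale is the 5 mile radius.
results = [("Chris Saylor", 0.2), ("Chris Saylor", 1.4), ("Anthony Saylor", 1.6)]
for name, miles in results:
    print(f"{name}: {linear_decay(miles, scale=5.0):.2f}")
# Chris Saylor: 0.98
# Chris Saylor: 1.4 miles -> 0.86
# Anthony Saylor: 0.84
```

    Under these assumptions the two “Chris Saylor” results get multipliers of 0.98 and 0.86, which is enough to put the closer one first when their name scores are equal. Note that with the default `decay`, a result sitting right at the edge of the 5 mile radius still keeps half of its name score rather than dropping to zero.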


    This post barely scratches the surface of the power that Elasticsearch has for complex queries of data, but I hope it has given you some insight into how we use it in a real-world application. Give it a shot and search for an instructor near you and perhaps even take one of their classes.
