tl;dr: our search system was outdated and failing. We managed to keep it alive even as we built a new, better system. It has been 403 days since search suffered a total outage.
Prologue: The problem
At the start of 2016, we started noticing a rise in tickets complaining of search-related issues. Where we once might have received 100 or so per month, we were now averaging around 500. This prompted a rational evaluation of the systems we had in place to facilitate search inside Freshdesk. It was not a rosy picture.
Some major issues with search were:
- Tags and reference numbers not working
- Search inaccurate in other languages
- Alphanumerics not working
In Gmail, if you want to look at the emails from a particular sender, you can simply search ‘from:sender-name’ and you’ll get what you want. Going further, you can search for specific text in the emails from a particular person. If you wanted to look for a meeting agenda from Dave, you’d search, ‘from:dave meeting agenda’. These are two completely different facets of search: one filters on a field, the other matches free text. The problem was that the old system couldn’t provide for these facets at scale.
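To make the two facets concrete, here is a minimal Python sketch of how a query like ‘from:dave meeting agenda’ might be split into field filters and free text. This is only an illustration, not how Gmail or our search actually parses queries; the KNOWN_FIELDS set and the parse_query helper are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field

# Hypothetical set of filterable fields, chosen only for this illustration.
KNOWN_FIELDS = {"from", "to", "subject"}


@dataclass
class ParsedQuery:
    filters: dict = field(default_factory=dict)  # field name -> value
    text: str = ""                               # remaining free-text terms


def parse_query(raw: str) -> ParsedQuery:
    """Split a raw query into field filters and free text."""
    parsed = ParsedQuery()
    free_terms = []
    for token in raw.split():
        name, sep, value = token.partition(":")
        # Treat "name:value" as a filter only when the name is a known field;
        # anything else (for example "10:30") stays in the free text.
        if sep and value and name.lower() in KNOWN_FIELDS:
            parsed.filters[name.lower()] = value
        else:
            free_terms.append(token)
    parsed.text = " ".join(free_terms)
    return parsed


query = parse_query("from:dave meeting agenda")
print(query.filters, "|", query.text)
```

The filter part can then be answered with an exact match on a field, while the free text goes through full-text matching, which is why the two facets place different demands on the index.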
Taking a hard look at our setup made one thing clear: our search just wasn’t able to keep up with the rapid growth that the product was seeing.
Chapter 1: Approach
We realized that if we were going to overhaul how search works, then we might as well do it across all our products. The requirements being similar, it didn’t make sense to have separate machines for every product. So we decided the best way to go forward was to build a comprehensive, universal infrastructure for search. This functionality could then be integrated into the different products.
Now, existing users were already accustomed to a certain way of searching. The challenge was to build all the upgrades without breaking how the system previously worked. So, much of our time was initially spent on research and development. We had to figure out a way to stabilize areas where the old system fell short, so as to keep it alive even as we conceptualized and built a new one.
To do this, we needed to know exactly what problems customers were having with search. So we trawled through every ticket that we thought was related to search. If the issue didn’t reflect the larger problem and had a quick fix, we’d fix it and resolve the ticket. Tickets that were relevant to what we were trying to tackle were saved as reference for our R&D: we added the necessary remarks as notes and assigned them a custom status.
Digging deeper, we discovered that most issues were the result of missing data: the information people were trying to search for was simply not indexed. In some cases, the text hadn’t been processed enough for words or sentences to be searchable; in others, there was a mismatch of data. Alongside our own brainstorming, the support tickets gave us the insights we needed to draft a comprehensive list of use cases.
However, the fact remained that our search was deteriorating, and we had to do something about it.
Chapter 2: TRIGmetry
Typically, when you have to debug an issue, you have access to all manner of graphs and detailed logs which can tell you exactly what went wrong. Our situation, though, wasn’t quite so typical. We barely had any metrics to tell us how our machines were performing. We didn’t even have remote access to our console. All we had was a big, flashing, red ticker, screaming at us, giving us nothing more to go on except that something was faulty. We couldn’t improve something we couldn’t measure. So, after we built our new infrastructure for search, we wanted to be able to measure how we were doing and improve on it. And that’s why TRIGmetry was born.
Using open-source tools, we put together our own comprehensive custom toolkit to help us know our machines inside out. Telegraf, a plugin-driven server agent for collecting and reporting metrics, runs on our machines and sends out information about them. We use InfluxDB, a time series database platform for metrics and events, to store this information. Grafana helps us monitor and visualize this data in the form of graphs on our dashboards, and kapacitoR helps us with analytics, detecting anomalies and sending alerts that we can take action on. Whenever kapacitoR detects an abnormality, we get notified of it immediately on Slack. With multiple types of alerts and dashboards set up, we can now identify the problem as soon as something begins to malfunction.
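To give a feel for the kind of check that sits behind such an alert, here is a minimal Python sketch of anomaly detection on a stream of metric readings. It is an illustration under stated assumptions, not our kapacitoR configuration (kapacitoR tasks are defined in its own scripting language, TICKscript). The RollingAnomalyDetector class, the window size, the threshold, and the sample readings are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag readings that deviate sharply from a rolling window of recent values."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.samples = deque(maxlen=window)  # most recent readings only
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # readings needed before judging anything

    def observe(self, value: float) -> bool:
        """Record a reading and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            # A perfectly flat window has sigma == 0; skip the check in that case.
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                is_anomaly = True
        self.samples.append(value)
        return is_anomaly


# Hypothetical CPU usage readings (percent), with one obvious spike at the end.
detector = RollingAnomalyDetector()
readings = [50, 51, 49, 50, 52, 50, 51, 95]
alerts = [r for r in readings if detector.observe(r)]
print(alerts)
```

In a real pipeline, the readings would come from the time series database and each flagged value would trigger a notification instead of being collected in a list.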
Before TRIGmetry, when we received an alert, it could have been anything from data loss to a machine outage, which could have potentially taken down other machines in a domino effect. The machines we were using were also of varying configurations and weren’t always compatible with each other. Elasticsearch, the platform we use to power our search, was not on its latest version, and the latest version did not offer backward compatibility. This meant that if we updated, we couldn’t guarantee that everything would still work.
Chapter 3: The third act
Initially, when the search team was just three people, we split the work into product and infrastructure requirements. Our infrastructure being hosted on AWS’s Elastic Compute Cloud (EC2), we ran benchmarks following our R&D exercises to narrow our choices down to a smaller range of EC2 machines. We ended up picking machines from the R4 and C4 series even as we were gathering use cases from the product to build improved functionality.
When we moved to the latest version of Elasticsearch, we reworked everything from scratch. As we revamped our entire system of search, we also had to make sure that we didn’t lose any of the existing functionality. We’ve always believed that there is no bad code, just code that has outlived its time.
While building the new infrastructure, we learnt to think more proactively as opposed to reacting when things went south. Our learnings also helped us keep the older infrastructure stable long enough for us to migrate everything into the newer one. Once we’d finished the revamp, we started a counter to keep us motivated – it’s the number of days that have gone by without search suffering a total outage. Anything we built, anything we changed, we did so keeping in mind that the counter shouldn’t read ‘zero’ the next day.
Epilogue: All’s well that ends well
It’s been just over a year since we began our quest to build a new infrastructure for search. Since then, our team has tripled in size, with people working on independent modules inside our search infrastructure and information retrieval ecosystem.
The search team at Freshworks
Today, we’re handling around 25 million requests per day, with a potential capacity of up to 100 million.
The dashboards we built over TRIGmetry have separate panels dedicated to monitoring Elasticsearch hosts, the proxies, and the platform service machines. Providing for a variety of use cases and a diverse customer base helped us dramatically improve our multilingual capabilities. What we’ve learned on this journey has been pivotal in improving the core engineering of our products.
Our current setup is more than capable of handling whatever we demand of it for another year or so. We believe that we have learnt enough to innovate constantly, evolve with the times, and keep our systems state-of-the-art. If there’s one thing this year taught us, it’s that things can go wrong at any time. All you can do is equip yourself, as best as you can, to handle it when that happens.
As of today, August 23, 2017, the counter hanging over our cubicles reads 403 days. All’s well.
If you want to get into the details of our story, ask us in the comments!