At Freshdesk, we know that our customers depend on us to support their customers, and we take our application's availability and performance very seriously.
However, we have had a very rough two days, with unexpected performance issues leading to downtime yesterday that lasted approximately 1.34 hours.
Starting at roughly 10:20 AM PST yesterday, some users began reporting serious performance issues. Our operations team immediately began investigating and discovered that the system was repeatedly hanging on a deadlock caused by a suboptimal query hitting our database.
By 11:15 AM, users were reporting that the app was completely unavailable, and we started on a workaround: we temporarily disabled the offending DB query and worked to restore service.
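To illustrate the shape of that workaround: disabling a single query at runtime is typically done with a feature flag that guards the expensive code path. This is a hypothetical sketch, not Freshdesk's actual code; the flag name and fallback behavior are assumptions.

```python
# Hypothetical kill-switch pattern for disabling an expensive query at
# runtime. Flag name and structure are illustrative, not Freshdesk's code.

FEATURE_FLAGS = {"expensive_report_query": True}

def fetch_report(run_query, flags=FEATURE_FLAGS):
    """Run the heavy query only while its flag is on.

    When the flag is off, return None so callers can degrade gracefully
    (e.g., serve cached or partial data) instead of hammering the database.
    """
    if not flags.get("expensive_report_query"):
        return None
    return run_query()
```

Flipping such a flag takes effect without a deploy, which is why it is a common first response when one query is pinning the database.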
Some users were able to access their support portals over the next twenty minutes, and by 11:54 AM service was fully restored in all locations.
As I type this, we are experiencing performance degradation and slowness again today. Our engineers are working overtime to get to the bottom of it, but we do not yet know the root cause. I want to apologize to all our customers for the trouble and frustration this has caused. I will post a follow-up once things have stabilized and returned to normal.
Please accept my sincere apologies once again.
Update: As of 8 AM PST, the slowness has been resolved and things are back to normal. The root cause was abnormally high load on our memcached servers, which caused our cache servers to crash. The abnormal load was the result of a human configuration error.
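One general lesson from this failure mode is worth spelling out: a cache should be an accelerator, not a dependency, so that a memcached crash degrades to slower database reads rather than errors. The sketch below shows that pattern under assumed names; it is not Freshdesk's implementation.

```python
# Illustrative "cache as optional" read path (hypothetical, not Freshdesk's
# code): any cache failure falls through to the authoritative data store.

def cached_get(key, cache_get, db_get):
    """Try the cache first; on a miss or any cache error, read from the DB."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except Exception:
        # Cache server is down or unreachable: serve from the database
        # (slower, but the app stays available).
        pass
    return db_get(key)
```

With this structure, a crashed cache tier shows up as a performance problem instead of an outage.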
Once again, I know how critical the app's availability is to every Freshdesk customer, and I would like to personally apologize for the disruption you may have faced during the outage. We are stepping up our efforts to maximize the reliability and performance of Freshdesk, and are taking every step to ensure such issues do not occur again.