Fixed connectivity issues on Booklet communities this morning, and taking steps to improve stability


Problem 

I woke up today to reports of connectivity issues on Booklet. I could connect from my phone, so I assumed the issue was a temporary failure. I looked deeper once I boarded a flight and could connect to wifi. (Hello from 40k feet above Greenland.) It turned out that the errors had started about 12 hours earlier, and that Booklet had been down for about half of its users.

The issue has now been fixed, and I'm taking steps to make sure this doesn't happen again. 

What went wrong

Booklet runs in two data centers: New York and Los Angeles. Your computer automatically connects to the closest data center, which makes the app faster. New York is the primary region, and Los Angeles is a secondary region. New York has multiple databases for redundancy and data backups, but Los Angeles had only a single database, which held a read-only copy of the data.

About 12 hours ago, the underlying hardware running the Los Angeles database failed. The Los Angeles servers could not connect to the database and had no backup to fall back on, so the app went down in Los Angeles. Anybody connecting to the Los Angeles data center during this time (mostly the US West Coast, South America, and Asia) could not use Booklet. Anybody connecting to the New York data center had full functionality.

I've turned off the Los Angeles data center temporarily until the problem is fixed, which has restored functionality for all users. 

The issue started with a hardware failure at our hosting provider, Fly.io. However, Booklet hasn't yet set up standard monitoring and procedures to detect and fix this semi-common error, so a small problem escalated into a big one.
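
To illustrate the kind of fallback the Los Angeles servers were missing, here is a minimal Python sketch. It is not Booklet's actual code: the function `read_with_fallback`, the `query` callback, and the database names `lax-replica` and `nyc-primary` are all hypothetical, chosen only to show the idea of trying the nearest read replica first and falling back to the primary region when it is unreachable.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllDatabasesUnavailable(Exception):
    """Raised when no database in the list could serve the read."""


def read_with_fallback(query: Callable[[str], T], databases: Sequence[str]) -> T:
    """Try each database in order and return the first successful result.

    `databases` is ordered by preference (nearest replica first, primary last).
    `query` is a hypothetical callback that runs a read against the named
    database and raises ConnectionError when that database is unreachable.
    """
    errors = []
    for name in databases:
        try:
            return query(name)
        except ConnectionError as exc:
            # Remember why this database failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise AllDatabasesUnavailable("; ".join(errors))


# Simulated outage: the (hypothetical) Los Angeles replica is down.
def fake_query(name: str) -> str:
    if name == "lax-replica":
        raise ConnectionError("host unreachable")
    return f"rows from {name}"


print(read_with_fallback(fake_query, ["lax-replica", "nyc-primary"]))
# prints "rows from nyc-primary"
```

Falling back to a distant primary makes reads slower for nearby users, but slow reads are usually preferable to a region being down entirely.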

Learning and improving

Booklet launched about 5 days ago, and quickly scaled from dozens of users to thousands. With the launch, we've had to rapidly adjust what we're working on to help scale. We've been scaling servers and implementing caching layers since launch, which has sped up Booklet significantly.

A 12-hour outage is not acceptable, so setting up monitoring and stability improvements is our top priority for the week. This will help keep Booklet stable and help us detect issues faster in the future.
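
As a rough idea of what basic monitoring can look like, here is a minimal Python sketch of a per-region health check that raises one alert after several consecutive failures. This is an assumption-laden illustration rather than the monitoring Booklet will actually use: `RegionMonitor`, its `probe` and `alert` callbacks, the region name `lax`, and the threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegionMonitor:
    region: str
    probe: Callable[[], bool]      # hypothetical: returns True when the region is healthy
    alert: Callable[[str], None]   # hypothetical: delivers a message to whoever is on call
    threshold: int = 3             # consecutive failures required before alerting
    consecutive_failures: int = 0
    alerted: bool = False

    def check_once(self) -> None:
        """Run one health check and alert once when the threshold is reached."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that crashes counts as a failed check.
            healthy = False

        if healthy:
            # Recovery resets the counter so a later outage alerts again.
            self.consecutive_failures = 0
            self.alerted = False
            return

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerted:
            self.alert(f"{self.region} failed {self.consecutive_failures} checks in a row")
            self.alerted = True


# Simulated outage: every probe of the (hypothetical) "lax" region fails.
messages = []
monitor = RegionMonitor("lax", probe=lambda: False, alert=messages.append)
for _ in range(5):
    monitor.check_once()
print(messages)
# prints ['lax failed 3 checks in a row']
```

Requiring several failures in a row avoids paging on a single dropped request, while a check run every minute would still surface an outage within minutes instead of hours.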

I'm sorry for the outage this morning. We will use the learnings to improve Booklet.
