Monitoring and Observability
From Wikipedia:
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
It is bad enough to learn about problems in your systems from your customers. Worse yet is to discover that the metric you have been watching was wrong all along.
I have been following e-commerce sites in Brazil out of curiosity about how they break. Having worked for a hosting provider and a couple of e-commerce companies, I can say I know the general design of these systems, which is not that different from other kinds of business except in volume.
Small incidents on e-commerce sites directly hurt sales, marketing strategies and customers. In recent years we have been seeing a surge of traffic on what became the "Black Friday" (in quotes because it starts on Thursday at 9 PM).
Leaving this industry for my current gig, I found myself with some spare time to set up a monitoring system to alert me how fast the most important e-commerce sites would slow down and crash. Year after year it goes like this, and I wanted to have my own monitoring, as other colleagues have theirs. But before getting more into this, an architecture intermission.
Regardless of the company, most e-commerce sites run on a mix of legacy and new systems. The core of these systems is usually built around an RDBMS and makes heavy use of stored procedures for different reasons, from grouping products to managing inventory or selecting zip codes. Without getting into the whys and hows, these systems are hard to get rid of. They grow vertically, using expensive and powerful machines, but they are limited by the maximum server size and by the way they were programmed.
Much is done to isolate them from the traffic surge: message buses, APIs, database replicas. Teams compete among themselves to bring in new tech and not become obsolete, and so life goes on. The fanciest architectures would have active and passive datacenters, adding WAN traffic to replicate data and introducing database-to-database communication, which was then used to add new services. The database was the API.
With the advent of cloud computing, some of these behemoths were migrated unsuccessfully as-is to virtual machines, but many of them still sit in cages in datacenters, communicating through links to front-end apps. Blaming the vendor became a game that no one took seriously anymore.
The cost of rewriting these systems would only sink in years later, after revenue and marketing investments were lost to outages and high latency. There is a 2009 post summarizing findings from Amazon.com and others as "each 100ms of latency cost us 1% in sales".
Just monitoring in the sense of "is my app up" or "am I serving my page" does not cut it in this complicated scenario. The sum of all internal monitoring does not say as much as what the customer is seeing. Hence the term Observability.
Most of the requests coming from the internet will follow the path of the green arrow. This is fast, distributed and cached. Inside the CDN block you have DNS, load balancers, file servers and so on, all aggressively optimized to serve content as fast as possible.
The path of the blue arrow is for stuff that changes often, but it is still content: at most it hits a load balancer and a server that might have the data cached or stored locally.
Now, the path of the red arrow is what interests us. Thanks to a mix of legacy, architecture, Conway's law and life itself, some functions of your website will have to hit the database. Best case, they read a couple of rows in a not-so-busy state; worst case, they lock tables, run table scans, execute stored procedures and reach the code that may contain the business rule of that step of the sale.
Based on that, I started monitoring availability, pointing at domains and websites as we usually do, knowing that a given site can be down but appear up if you are only navigating. Right after that I selected some calls that had to hit such a deep layer of the system that they would give me a meaningful view of how the core is behaving.
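To make this concrete, here is a minimal sketch of such a probe in Python. It assumes a hypothetical shop with a cached homepage and an ajax shipping call that has to reach the database; the URLs and endpoints are made up for illustration.

```python
import time

import requests  # third-party: pip install requests

# Hypothetical targets: a cheap CDN-served page versus a call that
# must reach the database. Endpoint names are invented.
PROBES = {
    "homepage (cached)": "https://shop.example.com/",
    "shipping (hits DB)": "https://shop.example.com/api/shipping?zip=01310-100",
}

def probe(name, url, timeout=15):
    """Time one request and report status; timeouts count as failures."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        elapsed = time.monotonic() - start
        print(f"{name}: HTTP {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        print(f"{name}: FAILED after {time.monotonic() - start:.2f}s ({exc})")

if __name__ == "__main__":
    while True:
        for name, url in PROBES.items():
            probe(name, url)
        time.sleep(60)  # one sample per minute is enough to see trends
```

The point is the contrast: when the cached probe stays flat and the deep probe climbs, the core is in trouble even while the site "looks up".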
Actually, one of the drivers for this monitoring came from load testing these systems. Once you identify these targets, you can introduce a background load on the system by calling them, and then proceed to test other parts of your system, as sketched below. It pays off to introduce a certain level of stress instead of trying to reproduce a surge of real users right away.
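A sketch of that background load, under the same assumption of a hypothetical heavy endpoint; the worker count and pacing are arbitrary starting points, not recommendations:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party: pip install requests

# Hypothetical endpoint that exercises the deep/stored-procedure path.
HEAVY_URL = "https://shop.example.com/api/shipping?zip=01310-100"

def background_load(stop_after_s=300, workers=8, pause_s=0.5):
    """Keep a steady trickle of expensive calls going while you load
    test other parts of the system."""
    deadline = time.monotonic() + stop_after_s

    def worker():
        while time.monotonic() < deadline:
            try:
                requests.get(HEAVY_URL, timeout=10)
            except requests.RequestException:
                pass  # failures are expected under stress; the probe records them
            time.sleep(pause_s)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

if __name__ == "__main__":
    background_load()
```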
Aligning these charts across a couple of hours today, I could see an extreme degradation of three e-commerce sites reflected in this probe but completely ignored by the regular monitoring (names are redacted).
Basically this check says that all sites are up and responding under 1s most of the time. I put some marks around times when I knew the site was having an outage: it would be impossible to calculate shipping or put a product in the cart. Both charts use a log10 scale to help smooth the peaks.
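If you are rolling your own charts, something like this matplotlib sketch reproduces the idea; the samples and the outage window below are invented for illustration:

```python
import matplotlib.pyplot as plt  # third-party: pip install matplotlib

# Hypothetical samples: (minute offset, response time in seconds).
samples = [(0, 0.4), (1, 0.5), (2, 0.6), (3, 2.1), (4, 9.8), (5, 14.0), (6, 0.7)]
xs, ys = zip(*samples)

plt.plot(xs, ys, marker="o")
plt.yscale("log")  # log10 scale smooths the extreme peaks
plt.axhline(1.0, linestyle="--", label="1s budget")
plt.axvline(4, color="red", alpha=0.3, label="known outage")  # mark the outage
plt.xlabel("minutes")
plt.ylabel("response time (s)")
plt.legend()
plt.show()
```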
This is one of the functional checks. I won't disclose which functional checks these are, but suffice to say that these functions on the website have to query tables and execute stored procedures to return product availability, shipping, customer data or login. Figure out the heaviest stored procedure in your system and the last-minute hack that exposed its functionality through ajax, and you will start finding good targets. Or look for the "this has to be a complex query because of business rules" kind of database workload.
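A functional check is more than a ping: it validates that the flow returns something useful within a latency budget. A sketch, again with a made-up endpoint and response fields:

```python
import requests  # third-party: pip install requests

def check_shipping(zip_code="01310-100", sku="ABC-123", budget_s=5.0):
    """Hypothetical functional check: substitute whatever ajax call
    fronts your heaviest stored procedure."""
    resp = requests.get(
        "https://shop.example.com/api/shipping",
        params={"zip": zip_code, "sku": sku},
        timeout=15,
    )
    elapsed = resp.elapsed.total_seconds()
    try:
        has_price = "price" in resp.json()  # a 200 with an empty body still fails
    except ValueError:
        has_price = False
    ok = resp.status_code == 200 and has_price and elapsed <= budget_s
    return ok, elapsed

if __name__ == "__main__":
    ok, elapsed = check_shipping()
    print(f"shipping check: {'OK' if ok else 'DEGRADED'} in {elapsed:.2f}s")
```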
This is the same data on a linear scale, for dramatic effect. Timings over 10 seconds mean it is all offline. Other vendors had problems during the day, problems that teams following only simple availability metrics, or overly complex and fragmented monitoring, would not see right away, either because the signal gets masked or because it takes a long time to compute the impact. In this case, the impact is on revenue, which is a good metric to observe and follow overall.
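If you want to put a number on it, a back-of-envelope estimate based on the 100ms-per-1% rule quoted earlier can at least set the order of magnitude. All figures below are invented:

```python
# Back-of-envelope impact estimate using the often-quoted 2009 figure
# ("each 100ms of latency cost us 1% in sales"). Numbers are made up.

def lost_sales_pct(extra_latency_ms):
    """1% of sales per extra 100ms, per the rule of thumb above."""
    return extra_latency_ms / 100.0

hourly_revenue = 50_000.0  # hypothetical BRL/hour during the event
extra_latency_ms = 900     # page went from ~0.5s to ~1.4s
pct = lost_sales_pct(extra_latency_ms)
print(f"~{pct:.0f}% of sales at risk, roughly {hourly_revenue * pct / 100:,.0f} BRL/hour")
```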
I plan to follow this Black Friday again and hope that most, if not all, services stay up and healthy, and that the ops teams behind them have a good time executing what they planned. If you have not planned much, I recommend reading AWS's and Google's whitepapers on scaling architectures for events and working your way up with some load testing.
Cheers!