Tales From The Crypt: Observability-Driven Development #2

npetrele · 10-26-2023

Chapter 1 included a ludicrously simple example of how to time a network operation and log it. This time I want to discuss one advantage of developing your apps with observability in mind. With Observability-Driven Development (ODD), you wrap every line of code that handles network activity in a timer, plus a log entry or alarm that identifies the code itself, the network operation, the source, and the destination. The goal is to make problems easy to identify without having to analyze system, Apache, and network logs that include far more information than you need.
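To make that concrete, here is a minimal Python sketch of such a wrapper. It is only an illustration, not the example from Chapter 1: the function name `timed_call`, the logger name `odd`, the log field names, and the 2-second default threshold are all assumptions made up for this sketch.

```python
import inspect
import logging
import time

# Hypothetical logger name; use whatever your application already logs to.
logger = logging.getLogger("odd")


def timed_call(func, *, operation, source, destination, max_seconds=2.0):
    """Run func(), time it, and log the call site, operation, source, and destination.

    Logs at WARNING (the "alarm") when the call exceeds max_seconds,
    otherwise at INFO. max_seconds=2.0 is an assumed default, not a recommendation.
    """
    # Record the file and line that called timed_call, so the log
    # points at the actual code that performed the network operation.
    caller = inspect.stack()[1]
    site = f"{caller.filename}:{caller.lineno}"

    start = time.perf_counter()
    try:
        return func()
    finally:
        # Runs whether func() returned or raised, so failures are timed too.
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed > max_seconds else logging.INFO
        logger.log(
            level,
            "site=%s op=%s src=%s dst=%s elapsed=%.3fs",
            site, operation, source, destination, elapsed,
        )
```

A call might look like `timed_call(lambda: fetch(url), operation="GET /orders", source="192.0.2.10", destination="198.51.100.7")`, where `fetch` stands in for whatever network call your app makes. The point is that one log line carries everything you need later: where in the code, which operation, from where, and to where.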

Now suppose the alarm is triggered by an operation that takes longer than your maximum allowed time. Some user initiated a request to or through your application. Your ODD code gives you all the information you need to analyze the problem. You know which line of code caused the alarm. You browse your log and see that this particular line of code doesn't normally set off alarms. So there's a good chance your code is fine.

The alarm seems to be triggered only by requests from certain source IP addresses. You trace those addresses to Frostbite Falls, Minnesota, and the requests are for a destination in Deepinaharta, Texas.
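Spotting that kind of pattern is mostly a matter of grouping your ODD log records. The sketch below assumes each record has already been parsed into a dict with `site`, `source`, and `elapsed` keys; those key names and the function `alarm_rate_by` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def alarm_rate_by(records, key, max_seconds):
    """Return {value of key: fraction of its records slower than max_seconds}."""
    totals = Counter()
    slow = Counter()
    for rec in records:
        totals[rec[key]] += 1
        if rec["elapsed"] > max_seconds:
            slow[rec[key]] += 1
    # Every value in totals was seen at least once, so no division by zero.
    return {value: slow[value] / totals[value] for value in totals}
```

Grouping by `"site"` tells you whether a given line of code normally sets off alarms; grouping by `"source"` shows whether the slow requests cluster around particular source addresses.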

You know which network operation is taking place. You know the source and destination IP addresses. You have everything you need to run a Cisco ThousandEyes test from a sample IP in Frostbite Falls, Minnesota, to the destination at or near Deepinaharta, Texas. The test reveals that the operation goes through a node in Jebip, Iowa, that is introducing far too much latency. Now you're armed with the information you need to get the problem solved.