It’s a nice dream. But in reality, everyone trying to work off one dashboard can turn into an inefficient nightmare.
Trying to decide what monitoring tools should be used can be a controversial topic. Here at Zopa, we’ve had many discussions about which monitoring tool should be used where and why.
Different people want different things from a monitoring tool, and monitoring needs to change depending on what you’re working on and your role within an organisation. And what you want will likely be unimportant to the other stakeholders.
Can we find the monitoring tool to rule them all?
Unfortunately, a shared dashboard is not a silver bullet.
It’s tempting to think that you can create one dashboard to rule them all, but it’s very difficult to pull off and, practically speaking, just not sensible. The consequences can range from poor selection of tools, to misinterpretation of system health and major loss of productivity.
Don’t admit defeat. There is a way.
First off, clear communication of goals is essential. Treat the requirements gathering as you would for any other user story. You need clear specifications from all stakeholders. This means identifying exactly what they need out of a prospective monitoring solution. Then we can make informed decisions about what to use where.
These specifications should be clarified at the outset before doing any research into prospective tools.
Finding the right monitoring tools
Let’s try and find a tool that analysts and an SRE (Site Reliability Engineering) team can both use.
Typically, you’ll find that an analyst will care more about the ability to visualise and aggregate periodic business data, whereas the SRE team might want a high-level up/down summary of system health.
We have their specifications, so now we can take a quick look into the tools available. We can find tools that handle each of these use cases well, but they have very little functional overlap. Womp womp…
If we take a tool like Tableau… Tableau does allow us to set alerts (a key requirement for our SRE team), and it visualises data beautifully (a key requirement for our analyst). However, Tableau requires structured data, and system health metrics often aren’t structured.
So, even though it is arguably one of the best tools available to analysts, it’s not going to satisfy all of the SRE team’s specifications.
Monitoring tools for different use cases
The problem isn’t just finding something that works for people in different business roles; you also need monitoring tools that work for different tasks.
As an engineer, I may have multiple systems that require vastly different monitoring solutions. Let’s take Kafka for example. Here at Zopa, we make extensive use of Kafka Streams and run several Kafka brokers. These brokers are where warning signs will likely first manifest if something goes wrong. Fortunately, Kafka exposes many useful metrics via JMX, which we are able to scrape with incredible ease using Prometheus, and to graph in Grafana.
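To make that chain concrete, here’s a minimal sketch of the wiring, assuming each broker runs the Prometheus JMX exporter as a Java agent. The hostnames, port 7071, and the single metric rule are illustrative assumptions, not our actual setup:

```yaml
# jmx_exporter rules file (attached to each broker as a Java agent) --
# this one rule re-exposes a standard Kafka broker MBean as a Prometheus counter.
rules:
  - pattern: "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>Count"
    name: kafka_server_messages_in_total
    type: COUNTER

# prometheus.yml -- scrape each broker's exporter endpoint
scrape_configs:
  - job_name: "kafka-brokers"
    static_configs:
      - targets: ["broker-1:7071", "broker-2:7071"]
```

Once Prometheus is scraping, Grafana just needs Prometheus added as a data source and the metrics are ready to graph.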
But what about Use Case 2…
Our pipeline is distributed across multiple microservices, and we need to keep track of transactions across these various systems. While Grafana/Prometheus works for the Kafka use case, it’s not at all useful for distributed transaction tracing. So the tool that works for use case 1 leaves us struggling on use case 2. But we do have other options.
At Zopa, we’ve explored OpenTracing with Jaeger, we pass correlation IDs between our systems, and we’ve investigated paid solutions. Each of these has advantages and disadvantages, and each of them is infinitely more suited to the job of distributed transaction tracing than Grafana. So why shoe-horn?!
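The correlation ID approach is simple enough to sketch in a few lines. This is a minimal illustration, not our actual implementation; the header name and function names are hypothetical:

```python
import uuid

# Hypothetical header name -- any agreed-upon key works,
# as long as every service in the pipeline uses the same one.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(headers: dict) -> dict:
    """Reuse the incoming correlation ID, or mint one at the edge."""
    headers = dict(headers)  # don't mutate the caller's dict
    headers.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return headers


def handle_request(headers: dict) -> dict:
    # Tag every log line with the same ID, and forward the header on
    # every downstream call, so one transaction can be stitched
    # together across services after the fact.
    headers = ensure_correlation_id(headers)
    print(f"[{headers[CORRELATION_HEADER]}] processing request")
    return headers
```

Each service applies the same rule: reuse the ID if it arrived with the request, create one only at the entry point. Grepping your aggregated logs for one ID then reconstructs the whole transaction.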
Shoe-horning into one solution isn’t the way
The requirement to shoe-horn all use cases into one solution is usually an artificial one, and it’s important to understand why the desire to do so exists. If it’s a general desire to standardise tooling across an organisation, that’s still possible. You just need to standardise based on use case, as opposed to a general “monitoring tooling” catchall.
This requirement is also often a sign that those responsible for consuming alerts and dashboards are not the ones creating them. They don’t want to jump between different tools, and they don’t understand the needs of a developer.
My take is that if we gather our requirements as outlined above, we should have all the information required to find the best solution for each stakeholder.
How to escape Mordor
Find what works best for you instead of trying to find the one tool to rule them all. There are plenty of ways to create a generalised or aggregated view, even if it’s just a web page that you write yourself.
Don’t do that though. Use Sensu or something.