How do you know if your IT infrastructure’s operating at peak performance? And how do you determine which elements need to be improved upon or even replaced? Monitoring your infrastructure’s key performance indicators (KPIs) provides crucial information about the health of your network and servers, so it’s important to know which KPIs are appropriate to monitor and how to interpret the data that monitoring produces. Below are some essential elements of monitoring your IT infrastructure, along with examples of how to monitor data and set up actionable KPIs.
1) Infrastructure KPIs
To monitor your IT infrastructure, you’ll want to set up performance metrics for every aspect you manage, including software, servers, network bandwidth, and databases. These metrics will show how various infrastructure elements perform—from memory usage on a server down to page load times across multiple websites. The important thing here is that you establish appropriate baselines across each metric so that, as time goes on, it becomes apparent when something is amiss in your environment. For example, let’s say one of your performance metrics is page load time—if all sites suddenly start loading much slower than usual, then you’ll know something is wrong and can quickly identify what needs to be fixed.
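The page-load-time example above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical baseline numbers, not a production monitor: it flags a new sample that sits well outside the established baseline.

```python
import statistics

def is_anomalous(sample_ms, baseline_ms, tolerance=2.0):
    """Flag a page-load-time sample that exceeds the baseline mean
    by more than `tolerance` standard deviations."""
    mean = statistics.mean(baseline_ms)
    stdev = statistics.pstdev(baseline_ms)
    return sample_ms > mean + tolerance * stdev

# Hypothetical baseline of recent page load times (milliseconds)
baseline = [220, 240, 210, 230, 225, 235, 215]
print(is_anomalous(2000, baseline))  # a 2-second load stands out: True
print(is_anomalous(228, baseline))   # within normal range: False
```

The key point is the baseline: without the historical samples, the 2-second load has nothing to be compared against.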
2) Common Metrics to Measure
Typically, companies will track several standard metrics to see how effective their operations are. Some of the more common metrics include uptime, latency, page load times, error rates, memory usage, and disk usage. How you choose to measure these largely depends on what kind of infrastructure and applications you have in place. For example, monitoring CPU and memory usage would be reasonable if you are running a Java application. In contrast, if you are running an ERP package backed by an Oracle database, it may make more sense to monitor, at a minimum, response time and throughput.
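As a simple illustration of collecting one of these common metrics, the sketch below reads disk usage with Python’s standard library and checks it against a threshold. The 90% limit is an assumed example value; real limits should come from your own baselines.

```python
import shutil

def disk_usage_percent(path="/"):
    """Return used-disk percentage for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return round(usage.used / usage.total * 100, 1)

def check_threshold(value, limit):
    """Simple KPI check: return True when the metric breaches its limit."""
    return value >= limit

pct = disk_usage_percent("/")
print(f"Disk usage: {pct}% (alert: {check_threshold(pct, 90)})")
```

Other metrics (latency, error rates, uptime) follow the same collect-then-compare pattern, only the data source changes.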
3) Analyze Application Performance Data
While monitoring your infrastructure performance and health is essential, watching application-level data is even more critical. Effective application-level monitoring includes looking at how individual applications are functioning. So, if you’re running a web application, you want to know things like: How many users are logging in and out on average? What page views are most common? How long do users typically stay on each page? Are most users coming from mobile devices or desktop computers? What search terms did they use to find your site? These numbers will help guide your decision-making regarding marketing and product development. They also help you maintain high uptime percentages by ensuring your infrastructure isn’t overwhelmed by traffic spikes.
As you build out your monitoring strategy and adopt KPIs around application-level performance data, it would also behoove you to consider some of the more advanced metrics, such as churn. Churn is a standard metric used in SaaS apps that looks at how many users are still actively engaged with your product versus those who have stopped using it altogether. Knowing when there’s a high churn rate can help give you better insight into what needs improvement to keep people engaged with your product or service.
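Churn is straightforward to compute once you track active users per period. Here is a minimal sketch using hypothetical numbers (1,000 users at the start of a month, 50 of whom stop using the product):

```python
def churn_rate(active_start, churned_during):
    """Churn rate = users lost during the period / users at period start."""
    if active_start == 0:
        return 0.0
    return churned_during / active_start

# Hypothetical month: 1,000 active users at the start, 50 stopped using the product
print(f"Monthly churn: {churn_rate(1000, 50):.1%}")  # Monthly churn: 5.0%
```

Tracking this rate over successive periods, rather than as a single snapshot, is what makes it actionable.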
4) Know When to Alert on a Metric
When setting up alerts in your monitoring system, it’s essential to know when to alert on a metric. Depending on what you’re trying to monitor, many different things can go wrong. Suppose you set alerts for every critical KPI and alert every time one goes down. In that case, you’ll get lots of false positives and unnecessary notifications (and no one will pay attention). The common term for this is “alert fatigue.”
On the other hand, if you don’t alert until it’s really bad, you might have a bigger problem than just a few metrics. Use your knowledge of systems and KPIs to help define what good looks like so that when something is out of whack, people are alerted before it gets worse.
There’s a balance that must be struck, with “alert fatigue” on one end of the spectrum and “only when things are really, really bad” on the other. Striking it starts with determining a baseline for every aspect of monitoring: you need to know what your metrics should look like in order to have a valid comparison point. If you’re trying to determine whether something is broken, you need to set that baseline before an incident happens, so you can tell how well the system is functioning at any given point in time.
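One common way to strike that balance is to alert only on sustained breaches rather than one-off spikes. The sketch below (with assumed threshold and streak values) fires only after three consecutive readings over the limit, which filters out the noise that causes alert fatigue:

```python
class BreachAlerter:
    """Fire an alert only after `n` consecutive threshold breaches,
    filtering out one-off spikes that cause alert fatigue."""

    def __init__(self, threshold, n=3):
        self.threshold = threshold
        self.n = n
        self.streak = 0

    def observe(self, value):
        """Record one reading; return True when an alert should fire."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.n

alerter = BreachAlerter(threshold=90, n=3)
for cpu in [95, 40, 92, 93, 96]:  # one spike, then a sustained breach
    print(cpu, alerter.observe(cpu))
```

Only the final reading triggers an alert; the isolated 95% spike is ignored. Tuning `threshold` and `n` per metric is exactly the baseline work described above.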
5) How Much Detail Is Needed?
There are times when you need to be a bit more detailed, and other times you don’t. As a general rule, measure what matters to your business or your customers. In other words, is it critical for your business to have an up-to-date picture of system availability? Then create a KPI that reflects that necessity. If it isn’t crucial for your business, don’t build a dashboard that is unnecessarily complicated or time-consuming to maintain. The bottom line: pick metrics based on their importance to you, not necessarily their importance in theory.
6) What Information Do I Collect?
Before you set out to monitor your IT infrastructure, you must first decide what information is most important. This may seem like a simple question, but it’s tough to answer. With so many possibilities for metrics and KPIs that can be tracked, you need a process for deciding on what data is most important. We recommend determining which metrics are mandatory for ensuring service availability and end-user satisfaction. Once you’ve got that list put together, you’ll have an easier time deciding which metrics can be monitored separately from your mandatory metrics; these are ideal candidates for filling in gaps when your resources are limited.
7) Know How to Determine Baseline Levels For Each Metric
If you want to be sure that you know what’s going on with your servers and network, it’s essential to have a baseline for each metric you’re tracking. A good baseline will make it easy to identify problems before they get out of hand. Not only can establishing and measuring baseline levels help you spot significant issues, but it will also make normal operations easier to understand. Diagnosing an outage is much more difficult without a solid understanding of how things are supposed to run normally.
It is generally considered best practice to run your monitoring solution in “evidence gathering” mode for 30 days. In this way, you’ll be able to capture metrics throughout the month to determine when resources may be more highly used (for example, an accounting program may be highly utilized at the end of the month as the finance team closes out the month’s financial reports).
After 30 days, it’s a good idea to log into all servers and ensure that everything looks normal; don’t forget applications required to run on a server (such as Apache) — if something looks strange, investigate further. Suppose everything looks good after your initial checkup. In that case, it will then be a matter of reviewing your evidence-gathering metrics every week or two and determining what baseline levels are acceptable in the future.
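After the evidence-gathering window, the collected samples can be summarized into concrete baseline levels. The sketch below uses hypothetical daily peak-CPU readings (including a month-end spike like the accounting example above) and reports the mean, an approximate 95th percentile, and the maximum:

```python
import statistics

def baseline_summary(samples):
    """Summarize an evidence-gathering window into baseline levels:
    mean, approximate p95 (a 'normal upper bound'), and max."""
    ordered = sorted(samples)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }

# Hypothetical 30 days of daily peak CPU readings (%), with a month-end spike
daily_peaks = [40, 42, 45, 41, 44] * 5 + [70, 72, 75, 68, 71]
print(baseline_summary(daily_peaks))
```

A percentile-based upper bound is usually a better alerting baseline than the raw maximum, because it tolerates expected periodic peaks like the month-end close.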
8) Use Statistical Analysis to Monitor Changes in Your Metrics Over Time
Having a set baseline for a metric can help you detect anomalies and outlier data. Using statistical analysis to monitor changes in your metrics over time can help you anticipate problems before they impact operations. As an example, let’s say that your average latency from requesting a customer order to fulfilling that order is typically 50 seconds with 95% confidence. One day, your latency jumps to 60 seconds, then hovers at 58–59 seconds for several days in a row. This is most likely due to something out of the ordinary occurring within your environment, like a faulty hard drive or connection problems on one of your network switches—not because of demand spikes or customer issues on those days.
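The latency example above can be expressed as a z-score test. This minimal sketch (with hypothetical fulfillment-latency samples centered on 50 seconds) flags recent readings that fall outside the ~95% confidence bound of the historical distribution:

```python
import statistics

def anomalies(history, recent, z_limit=1.96):
    """Return recent samples whose z-score against historical data
    exceeds the ~95% confidence bound (|z| > 1.96)."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return [x for x in recent if abs(x - mean) / stdev > z_limit]

# Hypothetical order-fulfillment latencies (seconds), baseline around 50s
history = [49, 50, 51, 50, 49, 51, 50, 50]
recent = [50, 60, 58, 59, 50]
print(anomalies(history, recent))  # [60, 58, 59]
```

Because the historical spread is tight, the 58–60 second readings stand out sharply, which is the statistical version of “this isn’t a demand spike, something changed.”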
It may also behoove you to leverage AI and machine learning in your monitoring solution for larger environments. With AI, a single system can monitor every aspect of your infrastructure from multiple angles simultaneously—collecting real-time as well as historical data—and flag anomalies for immediate attention before they become more significant problems.
The Bottom Line: Monitoring Is Essential
Though monitoring is an often-overlooked part of maintaining an adequate infrastructure, it’s one of the most essential aspects. Simply put, if you aren’t monitoring your infrastructure, you aren’t providing reliable service to your customers.
Developing a monitoring posture from scratch can seem like a daunting task. The experts at Axeleos have deep subject matter expertise in setting up a monitoring process that works and that will provide valuable, actionable information upon which your IT team can act. Contact us today and let us help you get your infrastructure monitoring up and running!