Why Performance Monitoring Is Your Secret Weapon for Business Efficiency

If your servers are grinding to a halt, performance monitoring can reveal the cause.

You could say that PC Pro is all about performance monitoring. Every month, we put the latest hardware and software to the test, and award prizes and recommendations for the best devices. But in business, performance monitoring is less about choosing hardware, and more about understanding what your existing systems are doing. While the idea may sound simple, it’s a subject with a lot of intricacy and specificity to it. For example, our famous PC Pro benchmarks produce a single standardised measurement of things such as CPU speed, memory bandwidth and so on. That’s anathema to business-grade performance monitoring, which is all about watching over a long period, observing how and when performance ebbs and flows, and trying to figure out why. You could think of benchmarks as a collection of snapshots, and performance monitoring as the complete motion picture.

Done properly, performance monitoring can be a very worthwhile pursuit. It can expose pinch-points or overload conditions in your systems, and addressing these can give you a big overall productivity boost. It can alert you to unexplained activity, which could be indicative of a bug in your setup or an ongoing below-the-radar security breach. And it can reveal when a particular resource is being underutilised, giving you an opportunity to consolidate your servers or scale back your cloud estate. At the same time, performance monitoring isn’t necessarily something you can deploy quickly and easily, and the information you generate may not point to any obvious conclusions. If you’re considering introducing performance monitoring into your business, you should start by considering how you plan to do it, what information you hope to obtain, and what you can realistically expect to do with that data.

Inside or out?

There are two basic approaches to performance monitoring. One is to have it built into your software; while doing its primary job, your line-of-business app can be logging its own statistics in a form that’s ready to be picked up by an orchestration package or analysis toolbox.
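To make that concrete, here’s a minimal sketch of what such built-in instrumentation can look like – a Python app exposing its own counters, assuming the open-source prometheus_client library; the metric names are invented for the example rather than taken from any particular product.

```python
# Minimal sketch of an app that logs its own statistics for an external
# dashboard or analysis toolbox to pick up. Assumes the open-source
# prometheus_client package; the metric names are invented.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

ORDERS_PROCESSED = Counter("orders_processed_total",
                           "Orders handled since startup")
ORDER_LATENCY = Histogram("order_latency_seconds",
                          "Time taken to process one order")

def process_order() -> None:
    """Stand-in for the app's real work."""
    with ORDER_LATENCY.time():       # records how long this block takes
        time.sleep(random.uniform(0.05, 0.3))
    ORDERS_PROCESSED.inc()           # bump the running total

if __name__ == "__main__":
    start_http_server(8000)          # exposes /metrics on port 8000
    while True:
        process_order()
```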

The other approach is to monitor externally, via specialist platforms with sparky names such as Datadog and Dynatrace. These deploy data collection agents to your servers, workstations and cloud resources to watch over network traffic, memory consumption, CPU load and other measurements, and present reports and alerts in a natty integrated dashboard.

Each approach has its pros and cons. Native statistics can alert you to very specific performance metrics that aren’t externally visible – for example, you can see exactly which bit of your e-commerce pipeline is gobbling up disk space. The downside is that reports don’t directly connect to whatever else might be going on on your network at the time, so you miss out on the big picture.

External monitoring, on the other hand, will show you the combined effect of multiple applications, services and virtual machines all running on one physical server. The catch here is the cost – as well as needing the expertise to deploy and manage the thing, the licensing fees for a premium monitoring suite can easily run to tens of thousands of pounds per year.

Monitoring cloud servers

The promise of the cloud is that you never need to worry about running out of resources, so it might seem slightly paradoxical that performance monitoring is a core part of cloud services. But it makes sense, because almost the whole point of a cloud of servers is that it can be scaled up or down at will, either intentionally by a human or automatically by a digital agent. For the latter to work usefully, the agent needs to know when more horsepower is needed or when capacity can be dialled down. That means tracking statistics such as transaction completion time, number of users now arriving, memory footprint of each user session, load on network channels and so forth.
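As a rough illustration of what that digital agent is doing, the sketch below shows a toy scale-up/scale-down decision driven by just those sorts of statistics; the thresholds and field names are invented for the example, not lifted from any real cloud platform.

```python
# Toy autoscaling decision: purely illustrative thresholds and metric
# names, not any real provider's API.
from dataclasses import dataclass

@dataclass
class WindowStats:
    avg_transaction_seconds: float   # how long requests are taking
    new_sessions_per_minute: int     # arrival rate of users
    memory_used_fraction: float      # 0.0 - 1.0 across the pool

def desired_instances(current: int, stats: WindowStats) -> int:
    """Return how many instances the pool should have next."""
    overloaded = (
        stats.avg_transaction_seconds > 2.0
        or stats.memory_used_fraction > 0.85
    )
    idle = (
        stats.avg_transaction_seconds < 0.5
        and stats.memory_used_fraction < 0.40
        and stats.new_sessions_per_minute < 10
    )
    if overloaded:
        return current + max(1, current // 4)   # grow by roughly 25%
    if idle and current > 1:
        return current - 1                      # shrink cautiously
    return current

# Example: a busy five-minute window pushes the pool from 8 to 10 instances.
print(desired_instances(8, WindowStats(3.1, 240, 0.9)))  # -> 10
```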

Since cloud providers don’t manage your actual workload, performance monitoring may be the only “live” service they offer, beyond the nuts and bolts of hosting. Statistics are continually collected from all points of the cloud architecture, whether that’s a racktop switch, a gold master VM server image or a webcam pointed at a mercury thermometer on the wall. Providers do this not merely to help you, but to justify their own business. If your hosting bill unexpectedly shoots up one month, or if your compute pool seems to keep gradually growing, the provider can point to the numbers that confirm a spike in demand, or that point to the part of your application that needs debugging.

How deeply to monitor?

Although the technology exists to monitor your systems in the most exquisite degree of detail, setting up something like that is no trivial matter. And in a connected world, the issue could easily be outside of your visibility anyway. A thought experiment I like to propose to neophyte performance engineers is this: what would you do if you noticed a delay at the start of every YouTube video? I can monitor and manage the device that’s exhibiting the behaviour – and I have some control over its internet connection – but I don’t have access to the upstream devices that may be involved in the delay. I can’t check the YouTube servers, nor the cloud farms that serve up the pre-roll adverts, so what do I expect performance monitoring to do?

Even if your problem is internal, it may be trickier to resolve than you’d expect. Real networks are messy places: they have Wi-Fi routers where they shouldn’t, cloud accounts shouldering business backbone loads paid for on the intern’s credit card, and a sheet of passwords in the drawer of the receptionist’s desk. Simply sticking a pricey performance monitoring system on top of all that isn’t likely to do you much good.

Rather, I recommend you start out with just one or two testable assumptions, and an appreciation of the sources and sinks of data in the system you want to monitor. Don’t underestimate this requirement: even if you try to keep things small scale, it can be dizzying to realise how much interplay is involved in an apparently simple app with a simple job to do. To get the best results you’ll need a willingness to engage at the deepest levels with all parts of your computing estate, and there’s a lot of “cognitive load” involved in boiling down all those log file entries, sensor reports and line-of-business events into something coherent.

Performance monitoring and AI

At this point you might be wondering whether this is an area where AI can help you cut through the mass of data and generate some quick, actionable insights. The short answer is yes, it can, but don’t celebrate too quickly. Do you know how ludicrously vast the computing load of some AI systems can be? We’re talking about electricity consumption bigger than that of some mid-sized countries: using 20,000 cloud VMs to run a GPT model isn’t unusual. That might not be a consideration if you’re paying a fixed £20 a month for ChatGPT, but it very quickly becomes a concern when the AI service is running on your cloud servers and you’re paying for it all.

At the same time, there’s nothing else out there that can do quite such an incisive job of boiling down the vast quantities of data created by a performance-monitoring workload. Dedicated performance-monitoring systems can help you bring together all your log files and reports, and deal with collation, trend-finding and highlighting anomalies, good and bad. But GPT-type AIs are supremely versatile in their ability to summarise arbitrary quantities of data without really needing to know anything about how that data is structured.
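To give a flavour of how little ceremony is involved, here’s a minimal sketch that hands a chunk of raw log text to a hosted model and asks for the anomalies. It assumes the openai Python package with an API key already set in the environment, and the model name is simply a placeholder.

```python
# Minimal sketch: summarising raw, unstructured log text with a hosted
# LLM. Assumes the openai package and an OPENAI_API_KEY environment
# variable; the model name is a placeholder. Mind the token costs if
# you point this at gigabytes of logs.
from openai import OpenAI

client = OpenAI()

def summarise_logs(raw_logs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever you're licensed for
        messages=[
            {"role": "system",
             "content": "You are a performance analyst. Be concise."},
            {"role": "user",
             "content": "Summarise unusual patterns, spikes or errors in "
                        "these logs:\n\n" + raw_logs[:50_000]},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("app_server.log") as fh:
        print(summarise_logs(fh.read()))
```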

I remember early experiments in this field, on a battered server with dual 133MHz Pentium CPUs. Even with that minuscule quantity of processing power, it was possible to feed the machine a database file and come back the next day to be told what the top ten data values were in the top three transaction types. Using modern AI to munch through vast amounts of information might not be the cheapest way to skip the drudge work, but it can be a hugely effective one.

Brute force performance testing

This is a bit of an aside from our main subject, but when I’m talking about server-side monitoring, I find people often get curious about how exactly you test such systems. Business and IT types have understandable jitters about measuring system responses with live user workloads, but at the same time they want a testing scenario to be as realistic as possible.

The common solution is to use VMs to simulate large fleets of desktop PCs, all hammering the server at once with plausible but fictitious transactions that approximate real-world loads. You might say accessing a cloud host from a real desktop PC is different to doing so from a virtual PC running inside that same cloud. That is indeed true, but the difference isn’t as drastic as you might think. Let’s say you’re using AWS to deploy tens of thousands of virtual single-user PCs, all going through typical patterns of browsing, data entry and order processing. You’d expect this to produce a very unbalanced load, but the architecture at Amazon is smart (or fortunate) enough to sprinkle repeat instances far away from one another, to avoid hot-spots inside its semi-secret, slightly proprietary cloud architecture. The result is real-world chokes on bandwidth, routing and so forth that make the virtual clients somewhat representative of real PCs.
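Whatever the plumbing underneath, the simulated clients themselves are usually just scripted loops firing plausible transactions. The sketch below shows the general shape, with a made-up endpoint and workload mix standing in for the real thing.

```python
# Sketch of a scripted load client: many workers firing plausible but
# fictitious transactions at a test endpoint. The URL and workload mix
# are made up; only point this at a system you're allowed to load-test.
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://staging.example.com"   # hypothetical test target
ACTIONS = ["/browse", "/search?q=widgets", "/cart/add", "/checkout"]

def simulated_user(user_id: int, transactions: int = 50) -> None:
    session = requests.Session()
    for _ in range(transactions):
        path = random.choice(ACTIONS)
        started = time.monotonic()
        resp = session.get(BASE_URL + path, timeout=10)
        elapsed = time.monotonic() - started
        print(f"user {user_id}: {path} -> {resp.status_code} in {elapsed:.2f}s")
        time.sleep(random.uniform(0.5, 3.0))   # think time between clicks

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=200) as pool:
        list(pool.map(simulated_user, range(200)))   # 200 concurrent "users"
```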

Testing in this way doesn’t just provide reassurance in terms of data protection. While simply collecting statistics ought to be harmless and unobtrusive, most engineers have had experience with systems that mostly do what they’re supposed to, but that exhibit other less obvious behaviours when working under duress. You don’t want to expose real customers to those if you can possibly avoid it; better to push the limits with a digital twin of your live system.

This only leaves unaddressed the issue of minority platforms – machines, operating systems and browsers that aren’t represented in your VM cohort, perhaps because they don’t take well to being virtualised. If they’re cheap enough, there’s a lot to be said for just throwing a few hundred of them into a rack, strapping them down with cable ties and building them into the otherwise virtual test suite.

Performance monitoring in the IT business

One last challenge I’ve repeatedly run up against as a consultant is when a cloud application doesn’t expose any internal performance data to end users. This tends to apply to “as-a-service” offerings, such as web-based email, productivity and CRM tools. The slick, user-friendly presentation doesn’t even acknowledge that the app is consuming network bandwidth, let alone report how much it’s using. Behind the scenes, however, the traffic stats, message sizes and so forth are usually being recorded – just not anywhere you can access. Your best way forward is to try to get the support team to send you whatever metrics they can. They probably won’t be able to give you anything you can log into, nor even live information about the state of the service. But they’re often willing to help out a paying customer who’s looking to do the legwork themselves.

If you want to curry favour, though, I recommend that you don’t go in at the outset demanding a vast data grab to dig about in. A small set of logs from a specific time slice may well be enough to get you into the right territory; often a very minimal diagnostic is enough to rule out a lot of potential bottlenecks or load issues. If you keep your own records of any particular times when there’s an apparent problem, you can at least establish what the supplier wants to tell you about that period.
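One cheap way to keep those records is a small probe run on a schedule, timestamping how the service feels from your side of the fence. The sketch below assumes a generic web endpoint and simply appends response times to a CSV that you can later line up against whatever the supplier sends you.

```python
# Sketch of a client-side "how does it feel from here" probe: times a
# single request to the hosted service and appends the result to a CSV.
# Run it from cron or Task Scheduler every few minutes. The URL is a
# stand-in for whatever as-a-service endpoint you actually use.
import csv
import time
from datetime import datetime, timezone

import requests

SERVICE_URL = "https://crm.example.com/login"   # hypothetical endpoint
LOG_FILE = "service_timings.csv"

def probe() -> None:
    started = time.monotonic()
    try:
        status = requests.get(SERVICE_URL, timeout=30).status_code
    except requests.RequestException as exc:
        status = f"error: {exc.__class__.__name__}"
    elapsed = round(time.monotonic() - started, 3)
    with open(LOG_FILE, "a", newline="") as fh:
        csv.writer(fh).writerow(
            [datetime.now(timezone.utc).isoformat(), status, elapsed]
        )

if __name__ == "__main__":
    probe()
```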

When it comes to looking at your internal systems, the pattern is different. I recall a case of this with one client, whose database and cash register support software were nimble and responsive right after startup, but would slow to a crawl by the end of the day. The IT officer became habituated to regular restarts to clear the problem – but this wasn’t a quick fix, as the slower the database became, the longer the restart would take.

Eventually I had to go and take a look myself. The cause didn’t take long to find, via the Windows Task Manager and Event Viewer. The server was running dozens of duplicate instances of a small logging app that had been designed to collect a few bits of diagnostic data for each transaction. Ironically, these logs might have been useful for performance monitoring; however, the system had been configured to launch fresh instances of the app at intervals sympathetic to the business workload, and had ended up spawning more than 80 overlapping instances, filling up the system RAM and putting huge queues into the storage array.
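If you suspect something similar on a box of your own, a few lines of scripting will surface it faster than eyeballing Task Manager. Here’s a sketch using the psutil package, with a made-up process name standing in for the offender.

```python
# Sketch: count duplicate instances of a suspect process and tot up the
# RAM they're holding. Uses the psutil package; "txnlogger.exe" is a
# made-up process name standing in for the real culprit.
from collections import Counter

import psutil

SUSPECT = "txnlogger.exe"

counts = Counter()
rss_bytes = 0
for proc in psutil.process_iter(["name", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    counts[name] += 1
    if name == SUSPECT and proc.info["memory_info"] is not None:
        rss_bytes += proc.info["memory_info"].rss

print(f"{SUSPECT}: {counts[SUSPECT]} instances, "
      f"{rss_bytes / 1024**2:.0f} MB resident")
for name, n in counts.most_common(5):
    print(f"{n:5d} x {name}")
```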

I didn’t wait for any consent or interaction; I purged the launch schedule and allowed the system to work its way back to normal operation. There was no complaint about this from the support company, and my client went back to running his business instead of fretting over rebooting servers. This probably says more about the tunnel vision of some business app developers than about large-scale performance-monitoring apps in general, and the takeaway here is just what we mentioned at the start: performance monitoring is a varied field, with many different designs and approaches – almost one per incident. Engaging with it is inevitable, and at the same time far from business as usual.
