It’s common for hosting and Cloud service providers to receive support calls about the slowness of their hosted applications. It is so common that many tenders for new hosted or SaaS applications now ask detailed questions about the compute, storage, and network capabilities of the hosting platform.
I have seen requests asking specifically how long it takes for a database query to be answered, as well as what the SLA is for a specific screen refresh while using the application. I also often see questions about network latency between the provider’s data centre and the customer’s location.
These types of question seek to understand how long users will have to wait for the system to complete a task. Given that users may well be on the phone to a customer, it’s clear why it is desirable for the system to respond quickly. After all, many of us have faced this type of problem at some point.
So, what’s the issue?
The problem with trying to blame the hosting provider is that they don’t control the whole process. While the best providers do everything in their power to ensure systems perform as well as possible, there are elements they can’t influence.
One issue is that the software developers and the infrastructure hosting specialists, whether they work for the same company or separate ones, are different teams, so there can be a disconnect between the way the software is designed and the way it has been implemented in the hosting environment. I’m not the only person who has spent a considerable part of my career trying to minimise this sort of confusion by consulting with both sides; it’s a major part of any Cloud architect’s role.
The next challenge is the connectivity between the hosting/Cloud provider and the customer. If that connection is via the internet, there are many variables and multiple links across the public network that could influence performance.
What can be done?
Software should now be designed with this in mind, retrying communication requests rather than passively waiting for a response. However, some applications built on legacy technology stacks still haven’t been redesigned with network unreliability in mind.
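The original doesn’t specify a technique or stack, but as a minimal sketch of the retry idea in Python (with exponential backoff and jitter, and a made-up `fetch_order_screen` call standing in for whatever remote request the application makes), it might look something like this:

```python
import random
import time

def call_with_retries(request_fn, max_attempts=4, base_delay=0.5):
    """Retry a flaky network call with exponential backoff and jitter.

    request_fn is any callable that performs the remote request and raises
    an exception (for example a timeout) when it fails. In practice you
    would catch specific timeout/connection errors rather than Exception.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Back off exponentially, with jitter so clients don't retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Usage with a hypothetical remote call (fetch_order_screen is a placeholder):
# result = call_with_retries(lambda: fetch_order_screen(order_id=123))
```

The key design point is that a failed or slow call is retried a bounded number of times, with increasing delays, rather than leaving the user staring at a frozen screen.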
If there is a direct connection between the data centre and the customer premises, the situation certainly improves, but of course that connectivity comes at an additional cost. And if the hosting provider doesn’t supply the WAN links themselves, there’s potential for them and the carrier to blame one another for poor performance.
Many customers would assume that once the above problems have been addressed, acceptable performance from the hosted or SaaS application is guaranteed. Unfortunately, I’ve been involved in a number of support escalations where everything listed above had been eliminated, yet users were still seeing unacceptable system response times.
How can this be resolved?
In a couple of specific cases, the customer was concerned enough about system performance to consider invoking non-performance termination clauses in their contract. In both cases, I went on-site and timed how long the application took to perform some standard actions, along the lines of the measurement sketched below. The same solution was performing acceptably for dozens of other hosted customers, and I was able to confirm that it was significantly underperforming in both of these locations.
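The original timings were presumably taken by hand against the application’s screens; purely as an illustration, a rough Python sketch of the same kind of measurement might time repeated requests to a single endpoint and report the spread (the URL below is invented for the example):

```python
import statistics
import time
import urllib.request

# Hypothetical endpoint standing in for one of the "standard actions" timed on-site.
ACTION_URL = "https://hosted-app.example.com/orders/search?q=smith"

def time_action(url, runs=10):
    """Request the same page several times and summarise response times in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        urllib.request.urlopen(url, timeout=30).read()
        samples.append((time.perf_counter() - start) * 1000)
    return min(samples), statistics.median(samples), max(samples)

if __name__ == "__main__":
    fastest, median, slowest = time_action(ACTION_URL)
    print(f"min {fastest:.0f} ms / median {median:.0f} ms / max {slowest:.0f} ms")
```

Comparing the same numbers captured at a healthy site and at the complaining site makes it much easier to demonstrate where the slowdown is, rather than arguing from impressions.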
Following this, the hosting company and the software provider committed to investigating the whole service, from data centre to desktop. They did this using WAN monitoring tools, performance monitoring, a review of the database, and a Viavi Observer GigaStor placed on the LAN to see whether any local problems could be identified.
If you haven’t come across the GigaStor before, it is effectively a Sky+ box for your network: it captures every packet that crosses the network and stores it for later analysis, of the kind sketched below. It was agreed that the cost of the investigation would be split between the software provider and the hosting company, unless it turned out that the problem wasn’t related to the software or the service, in which case the customer would pay for it. The in-depth investigation would run for a full month to ensure that any peaks in load were taken into account.
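The GigaStor has its own analysis front end, but as a small illustration of the sort of question a full packet capture can answer, the sketch below uses the open-source scapy library on an exported capture file (the file name is hypothetical) to list which devices on the LAN are generating the most traffic:

```python
from collections import Counter

from scapy.all import Ether, rdpcap  # third-party: pip install scapy

# Load a capture exported from whichever appliance or tool recorded the LAN
# traffic ("lan_capture.pcap" is a hypothetical file name).
packets = rdpcap("lan_capture.pcap")

# Count frames per source MAC address to see which devices talk the most.
talkers = Counter(pkt[Ether].src for pkt in packets if Ether in pkt)

for mac, frames in talkers.most_common(10):
    print(f"{mac}  {frames} frames")
```

A host that dominates this list out of all proportion to its role is a good candidate for closer inspection.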
What was the outcome?
It probably doesn’t come as any surprise to learn that in both cases the LAN was identified as the problem. After all, both the software and infrastructure support teams had spent a lot of time investigating before agreeing to look at the whole chain of communication, so each team was reasonably confident the problem didn’t lie in its own area.
The root causes were different: a NIC on a local server "chattering" in one case, and a problem with a network switch in the other. The interesting thing was that once the problems were identified, users at both sites reported that they were suddenly getting better performance from other systems too. It was a shame that the earlier reports of other systems being slow weren’t recognised as related before this point, as that would have saved everyone time. But sometimes it is difficult to get an overview of how complex systems and services interact.
What can be learnt from this?
This really emphasises the point that if you are buying business software and considering whether a Cloud solution will perform as required, you should look at the whole solution, from desktop to data centre. Advanced can help you build the desired environment and manage the end-to-end process, freeing up valuable resources. Learn more about our services here or contact managedit@oneadvanced.com to get started.