Root cause analysis (RCA) is an important step towards defining problems and enabling their resolution. It matters because, in complex systems or scenarios, many things could have gone wrong, but which of them really matter? In this blog, we’ll explore root cause analysis in the context of telecoms networks and explain why effective inventory management is an important contributor to accurate RCA.
What do we mean by root cause analysis? Root cause analysis is really concerned with discovering the primary issue that has caused a known problem. We need to distinguish root causes from others, because telecoms networks – and other systems – are complex. There may be many interlinked issues, some of which occur as a result of another. So, since one problem can trigger another issue, there may be a cascade of problems – but which one was the first in the chain? What is the underlying – or root – cause? RCA enables telecoms engineers to determine which of one or more problems was the trigger.
This matters because telecoms networks are built from systems that interact and work in sequence. When one event takes place, others are supposed to follow in a coordinated, automated flow. By the same token, when problems occur, they can spread just as rapidly.
This can have serious consequences, because a relatively minor problem can have major repercussions for the entire network and customer base. Not only does discovering the root cause of a problem take time (up to 65% in a typical data center, according to Computer Weekly) and therefore money, but the impact on customers can also be huge. Issues that cause service disruption to one customer can spread to many thousands in a short space of time. Indeed, even issues that seem relatively innocuous can turn into problems that disrupt the entire network. If they are left unresolved, they can trigger a whole series of related problems.
So, RCA is of fundamental importance. How can you ensure effective RCA for telecoms networks? Of course, the old adage “you can only manage what you can measure” remains relevant. In the context of RCA, this means that you must understand what you are trying to manage. According to the same source, this means that you need a model of the objects and the data you are seeking to manage. An important element of this is the topology of the network. What does this mean?
Network topology includes all of the elements that are to be managed. In telecoms networks (a broadband network, for example), this will include all of the switches and routers, the servers to which they connect, and the means by which they are connected. Network constituents aren’t just connected physically; they may also be connected logically. This means that the components that make up a service, such as a 100 Mbps broadband connection, are connected in a particular order to deliver that service.
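To make the distinction between physical and logical connections concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the Topology class, its method names and the element names (cpe-1, access-sw-1 and so on) are assumptions, not part of any real inventory product. It also assumes, as a simplification, that each hop of a service path must follow a physical link.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Hypothetical, simplified topology model (illustration only)."""
    # element name -> kind, e.g. "router", "switch", "server"
    elements: dict = field(default_factory=dict)
    # physical links, stored as unordered pairs of element names
    physical_links: set = field(default_factory=set)
    # service name -> ordered list of elements that deliver it (logical path)
    services: dict = field(default_factory=dict)

    def add_element(self, name, kind):
        self.elements[name] = kind

    def connect(self, a, b):
        self.physical_links.add(frozenset((a, b)))

    def define_service(self, service, path):
        # Simplifying assumption: consecutive hops must be physically linked.
        for a, b in zip(path, path[1:]):
            if frozenset((a, b)) not in self.physical_links:
                raise ValueError(f"no physical link between {a} and {b}")
        self.services[service] = list(path)

    def services_using(self, element):
        # Which services would be touched if this element had a problem?
        return sorted(s for s, p in self.services.items() if element in p)


topo = Topology()
for name, kind in [("cpe-1", "cpe"), ("access-sw-1", "switch"),
                   ("core-rtr-1", "router"), ("server-1", "server")]:
    topo.add_element(name, kind)
topo.connect("cpe-1", "access-sw-1")
topo.connect("access-sw-1", "core-rtr-1")
topo.connect("core-rtr-1", "server-1")
topo.define_service("broadband-100", ["cpe-1", "access-sw-1", "core-rtr-1", "server-1"])
```

With this in place, topo.services_using("access-sw-1") lists the services that pass through that switch, which is the kind of question a topology model has to answer.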
This also applies to the alarm and alert systems that each element offers.
In any network – not just telecoms – individual components may generate reporting information that includes information about faults. This information is generally collected and integrated into some form of management and monitoring system. As such, an alarm from a router can be traced and tracked. However, an alarm from one element may trigger a response, and a further alarm, from another.
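One simple way to picture how a cascade of alarms can be narrowed down to a root cause is sketched below. It is not how any particular management system works; it is a minimal illustration under the assumption that we already know which element depends on which. The depends_on map, the element names and the function name are all hypothetical. An alarming element is treated as a root cause candidate if nothing upstream of it is also alarming.

```python
def root_cause_candidates(depends_on, alarming):
    """Return alarming elements that have no alarming upstream dependency.

    depends_on: dict mapping an element to the elements it relies on.
    alarming:   iterable of elements currently raising alarms.
    """
    alarming = set(alarming)
    candidates = []
    for element in alarming:
        seen = set()
        stack = list(depends_on.get(element, ()))
        upstream_alarm = False
        # Walk everything upstream of this element, directly or indirectly.
        while stack:
            upstream = stack.pop()
            if upstream in seen:
                continue
            seen.add(upstream)
            if upstream in alarming:
                upstream_alarm = True
                break
            stack.extend(depends_on.get(upstream, ()))
        if not upstream_alarm:
            candidates.append(element)
    return sorted(candidates)


# Hypothetical example: two customer devices hang off one access switch.
depends_on = {
    "cpe-1": ["access-sw-1"],
    "cpe-2": ["access-sw-1"],
    "access-sw-1": ["core-rtr-1"],
    "core-rtr-1": [],
}
candidates = root_cause_candidates(depends_on, {"cpe-1", "cpe-2", "access-sw-1"})
```

Here the two customer devices and the switch are all in alarm, but only the switch has no alarming element upstream of it, so it is the single candidate. Real systems also weigh timing, alarm type and severity, which this sketch ignores.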
This isn’t the place to explain the fundamentals of network management (that would require some months!), but the point should be clear. In order to manage and assure the performance of a telecoms network (so that customers are happy and less time is spent uncovering and fixing problems), the topology of the network, in terms of both physical and logical resources and connections, must be known and clearly understood.
Location is also crucial to this understanding. If you don’t know where something is, then you cannot take remedial actions that might be required. While some things can be fixed remotely, you may, ultimately, need to make a physical repair, or even to replace something.
Today’s telecoms networks are generally equipped with sophisticated network management tools. These provide a complete overview of network, service and customer status, offer reporting interfaces for alarms and fault detection, and include tools to correct issues. Increasingly, we’re moving towards network management solutions that automate the process, taking actions automatically to prevent service loss or serious degradation. Some such systems have also automated the process of root cause analysis, allowing engineers to focus on network enhancement tasks rather than routine and remedial activities.
However, to be effective, all network management tools must be able to discover the underlying topology of the network – covering location, connections, physical and logical resources, and providing an overview of the services that are delivered. With increasing dependence on virtualized solutions, these must also be included in the topology map.
Any such tool should be able to auto-discover topology, as this can change. When new network connections are created (a new service to a customer, perhaps), or new resources are added to the network (a new customer access point and path, for example), these must be included in the topology.
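Keeping the topology current comes down to comparing what discovery finds with what is already recorded. The sketch below shows that comparison in its simplest form. The function names, the link representation (pairs of element names) and the example data are assumptions made for illustration; a real discovery process would gather the links from the network itself.

```python
def normalise(link):
    """Treat a link as unordered, so (a, b) and (b, a) compare equal."""
    a, b = link
    return tuple(sorted((a, b)))


def reconcile(stored_links, discovered_links):
    """Compare recorded links with freshly discovered ones."""
    stored = {normalise(link) for link in stored_links}
    discovered = {normalise(link) for link in discovered_links}
    return {
        "to_add": sorted(discovered - stored),     # seen in the network, not yet recorded
        "to_remove": sorted(stored - discovered),  # recorded, but no longer seen
    }


# Hypothetical example: a new customer access point has been connected.
stored = [("core-rtr-1", "access-sw-1"), ("access-sw-1", "cpe-1")]
discovered = [("access-sw-1", "core-rtr-1"), ("access-sw-1", "cpe-1"),
              ("access-sw-1", "cpe-2")]
changes = reconcile(stored, discovered)
```

In this example the only difference is the new link to cpe-2, which appears under "to_add". Whether a link that is no longer seen should be removed automatically or flagged for review is a policy decision that this sketch leaves open.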
How can this be achieved? Well, the task is made considerably easier if there is an accurate network inventory system that can handle the discovery function and which can host the entire data model of the network. That’s what we do. CROSS, from CNI, is a complete network inventory solution that maintains a consistent, accurate and single data model of the entire network topology, covering the location of physical, logical, virtual and service elements. It also includes the capacity and the consumption, so that, if a resource is at 50% capacity to meet customer demand, the surplus available for other resources and services is known.
This data model and information set is available to network management solutions, as well as to other key business processes. This means that root cause analysis can also be enabled, quickly and effectively, because the key information required to track alarms and reports to their source, along with the relationships between all elements (and thus the potential impact of any issue), can be clearly understood from a single interface.
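The “potential impact of any issue” can be illustrated with a short sketch. Given a map of which element depends on which, the elements affected by a failure are those that rely on the failed element, directly or indirectly. The map, the element names and the function name below are hypothetical and chosen only for illustration; this is not a description of how CROSS or any other product implements it.

```python
from collections import deque


def impacted_by(depends_on, failed):
    """Return every element that relies, directly or indirectly, on `failed`."""
    # Invert the map: for each element, which elements depend on it?
    dependents = {}
    for element, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, set()).add(element)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return sorted(impacted)


# Hypothetical example: two customer devices behind one access switch.
depends_on = {
    "cpe-1": ["access-sw-1"],
    "cpe-2": ["access-sw-1"],
    "access-sw-1": ["core-rtr-1"],
    "core-rtr-1": [],
}
switch_impact = impacted_by(depends_on, "access-sw-1")
router_impact = impacted_by(depends_on, "core-rtr-1")
```

A failure of the access switch affects both customer devices, while a failure of the core router affects the switch as well. The same relationship data serves both questions: tracing alarms back to their source and estimating how far a problem could spread.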