I have had many discussions with the network team about bandwidth utilization. The network team says there is no need for an upgrade, based on their analysis in Orion. However, when I check the reports in DC RUM I can see a clear need for additional bandwidth. Sometimes DC RUM reports values beyond the line capacity (for example almost 20 Mbps on a 15 Mbps access line), which makes me wonder how this is calculated.
The AMD is located only in our largest data center (with a sniffing port on the MPLS line), and all the routers send NetFlow to the AMD. So I assume that the bandwidth utilization is based on NetFlow. One of my guesses is that the AMD adds up the NetFlow statistics and the traffic captured by the sniffing port.
Can someone point out the differences between DC RUM and Orion? I would also like to understand how to generate reliable bandwidth utilization charts for capacity management.
Answer by Kris Z. ·
DC RUM de-duplicates data sourced from NetFlow and from the AMD, so double-counting should not occur. De-duplication is implemented in a way that still lets you drill down into the individual sources using the Link dimension: the link name tells you where a specific measurement came from.
With NetFlow enabled, the AMD can query the routers over SNMP for line names and their speeds. This is how the AMD learns the interface/link capacity, so that the utilization metric can be reported. Without knowing the link capacity, utilization (expressed in percent) can't be reported. The question is whether SNMP polling is working on the AMD and whether the interface capacity the router returns to the AMD is correct.
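To show why the reported capacity matters, here is a minimal sketch of how a utilization percentage can be derived from a byte count, an interval length and a link speed. This is not DC RUM's actual implementation; the function name and the sample numbers are made up for illustration, using the 15 Mbps line and the roughly 20 Mbps reading from the question.

```python
def utilization_percent(bytes_transferred, interval_seconds, link_speed_bps):
    """Average link utilization in percent over one monitoring interval.

    Hypothetical helper for illustration only, not a DC RUM or Orion API.
    """
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits = bytes_transferred * 8
    return 100.0 * bits / (interval_seconds * link_speed_bps)


# 750,000,000 bytes in a 5-minute interval is an average of 20 Mbps.
volume_bytes = 750_000_000
interval_s = 300

# Against a 15 Mbps capacity this comes out above 100 percent.
util_15 = utilization_percent(volume_bytes, interval_s, 15_000_000)

# The same traffic against a 30 Mbps capacity looks half as busy.
util_30 = utilization_percent(volume_bytes, interval_s, 30_000_000)

print(f"15 Mbps link: {util_15:.1f}%")
print(f"30 Mbps link: {util_30:.1f}%")
```

The same traffic volume gives very different percentages depending on the configured speed, so if the speed learned over SNMP does not match the contracted line rate, the utilization figures will be off by the same factor.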
Another aspect to consider is the evaluation interval over which bandwidth is calculated. DC RUM uses a 5-minute interval (or 1-minute if so configured). What is the monitoring interval in Orion?
Whatever the monitoring interval, the reported bandwidth utilization will be an average over 1 min, 5 min, 15 min or whatever the interval is. This doesn't reflect real contention at the transaction level, as a transaction may last a couple of seconds or just several milliseconds and experience bandwidth contention during that time. If no transactions happen for the rest of the monitoring interval, the average utilization will be low. But in DC RUM you will see increased response time, decreased network performance (e.g. because of packet loss) and decreased realized bandwidth experienced by the real users. Therefore average interface utilization over time can't, on its own, be seen as a KPI telling whether links are fast enough. End user experience will also be limited by link characteristics such as latency (represented by DC RUM's RTT and ACK RTT measurements). High-latency links will never deliver fast responses, especially for short transactions, and a bandwidth increase won't help.
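To illustrate the averaging effect, here is a small sketch with made-up per-second throughput samples for a 15 Mbps link: the line is saturated for 10 seconds and idle for the remaining 290 seconds of a 5-minute interval. The numbers and variable names are assumptions for illustration, not measurements from DC RUM or Orion.

```python
LINK_MBPS = 15.0  # assumed link capacity

# Hypothetical per-second throughput samples (Mbps) for one 5-minute interval:
# a 10-second burst at line rate, then nothing for the other 290 seconds.
samples_mbps = [15.0] * 10 + [0.0] * 290

average_mbps = sum(samples_mbps) / len(samples_mbps)
peak_mbps = max(samples_mbps)

average_util = 100.0 * average_mbps / LINK_MBPS
peak_util = 100.0 * peak_mbps / LINK_MBPS

print(f"5-min average: {average_mbps:.2f} Mbps ({average_util:.1f}% utilization)")
print(f"1-sec peak:    {peak_mbps:.2f} Mbps ({peak_util:.1f}% utilization)")
```

In this example the 5-minute average is only a few percent, even though every transaction inside the burst saw a fully saturated line.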
This response is getting too long. Bottom line: discuss the whole set of link characteristics and the EUE context with the network team, not only the bandwidth. EUE is the most important factor here; delivering users' transactions is why the network exists after all :-)
Answer by Kris Z. ·
Bart, to be honest I have no idea how to interpret the measurements from your NetFlow source. I can only speculate:
Router-0 represents SNMP and/or NetFlow traffic itself
Vlan2 is a virtual interface and it represents some actual VLAN traffic, I guess the VLAN with identifier 2. Some of it may be going through GigabitEthernet8.9, but potentially not all of it, and/or some of it may be going through other GigabitEthernet interfaces in addition to GigabitEthernet8.9. Hence the differences.
GigabitEthernet8.9 is a physical interface
But that's probably as far as I can take it.
Answer by Bart E. ·
Thanks for the quick response @Kris Z.. The reason for this post is the discussion with the network team. The AMD is using SNMP to connect to all the netflow sources (status successful in the RUM Console).
I do understand that there can be differences and that the answer to my question is not that easy. Maybe it is related to the configuration of the routers, but what I don't really understand is the following:
In the attached metric chart you can see the data retrieved from one particular router. There are three interfaces, but as you can see the interface named "GigabitEthernet8.9" has exactly the same Upstream as the interface "Vlan2" (most of the time, but not always). The respective Downstream values are slightly different, though, and the Router-0 interface has a Downstream only:
Just thinking 'out loud': could it be that Orion is using the NetFlow from the Vlan2 interface and DC RUM is getting the flow from GigabitEthernet8.9? On the other hand, those differences are not big enough to lead to a different recommendation. Or should I look at only one interface (e.g. Vlan2) and simply ignore the other interface as redundant?
Hope this makes sense at all ...