We are pleased to see that the ODL community has published a performance evaluation. The original report contained some interesting insights into the performance of ODL and also offered ONOS performance numbers for comparison purposes. Though the report explicitly stated that the numbers for other controllers were provided only for reference and did not necessarily represent the maximum achievable results, there were a number of issues with the way the ONOS tests were conducted, which resulted in some confusion among the ONOS partners and the broader community. Furthermore, a number of aspects critical to a comprehensive performance assessment were missing from the report.
To address this, the ONOS team reached out to the ODL team with a number of suggestions for evaluating ONOS performance and for addressing several important aspects, such as the effects of high availability and scalability on performance. The ODL team was very receptive to our suggestions and, as a result, has published a revised performance report that now includes numbers based on the ONOS 1.5.1 release and therefore serves as a better basis for comparison of a single-node controller.
The purpose of this article is to highlight two other suggestions that were proposed to the ODL team for their consideration in future versions of the performance report.
Single Instance Controller
The first such suggestion was that the report did not evaluate the effect of high availability and scalability on performance; it covered the performance of a single-node controller only. Given the vulnerability of a single-node setup, it is unlikely that an operator of a mission-critical network would use a single-instance controller in a production setting. Furthermore, while the performance of a single node is certainly a useful baseline, it does not paint a complete picture, as it ignores crucial architecture, design, and implementation aspects of a fault-tolerant clustered controller.
It is impossible to extrapolate the performance of a multi-node controller from that of a single-node one. Without an appropriate design of distributed state-management primitives and other coordination functions, one can easily end up with worse performance, or even poorer reliability, in a multi-node controller. High availability, meaning correct operation in the face of failures, certainly trumps performance and is therefore a critical dimension to consider.
High availability, scalability, and performance interact with each other in complex ways, which is why a characterization of performance alone is not all that illuminating. As a result of this feedback, the ODL team has added a section to their report identifying this as an area for future performance assessments. We are very happy about that and look forward to seeing their findings as they become available.
Sufficient Levels of Stress
The second suggestion made to the ODL team centered on whether the tests were structured to expose the controller to a sufficient level of stress. The study centers on end-to-end performance, which it defines as external flow programming of a number of OpenFlow switches via the REST API. While this does represent one mode of usage, we pointed out that this approach has limitations and that there are alternate forms of usage that are far better suited to operating environments requiring a highly dynamic control plane.
REST APIs are certainly ubiquitous and very useful, but they are not the highest-performing APIs; in this experiment, the overhead of establishing the HTTP session completely dwarfs the time required to perform a flow operation (or a similar control action). For this reason, including the REST API in the measurements masks the actual internal performance capabilities of the controller.
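To illustrate the point (not to reproduce the report's setup), here is a minimal, self-contained Python sketch that compares the per-operation cost of a fresh HTTP request against an in-process call. The handler and the `install_flow` stub are hypothetical stand-ins, not ONOS or ODL code; the absolute numbers will vary by machine, but the per-request HTTP cost reliably dominates the near-zero internal call:

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FlowHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a controller's REST flow endpoint."""
    def do_POST(self):
        # Drain the request body, then acknowledge; the "flow operation"
        # itself is effectively free here.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):
        pass  # keep per-request logging out of the timing

def install_flow():
    """Stand-in for the internal flow operation (near-zero cost)."""
    pass

server = HTTPServer(("127.0.0.1", 0), FlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/flows"

N = 200
t0 = time.perf_counter()
for _ in range(N):
    # Each call sets up a fresh TCP/HTTP exchange, as a naive REST client would.
    urllib.request.urlopen(urllib.request.Request(url, data=b"{}", method="POST")).read()
t_rest = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(N):
    install_flow()
t_direct = time.perf_counter() - t0
server.shutdown()

print(f"REST path:   {t_rest / N * 1e6:8.1f} us/op")
print(f"direct call: {t_direct / N * 1e6:8.1f} us/op")
```

Even with everything on loopback and a do-nothing handler, the REST path costs orders of magnitude more per operation than the direct call, which is the masking effect described above.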
Such “masking” is further exacerbated by limiting the experiments to only a small number of switches, which have been shown to have their own performance limit for accepting flow programming requests (OVS tops out at roughly 2K flow operations per second). The study states that this setup represents a realistic end-to-end scenario, but it should be noted that at such a small scale, the resulting numbers speak more to the performance limitations of the REST API libraries and the software switches than to the actual internal performance capabilities of the controller platform itself.
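A back-of-envelope calculation makes this concrete. The per-switch limit below comes from the ~2K fops/s figure above; the controller capacity is purely an illustrative assumption, not a measured number for either platform:

```python
# Hypothetical back-of-envelope: with only a few software switches, the
# aggregate flow-programming ceiling is set by the switches, not the controller.
OVS_LIMIT_FOPS = 2_000          # ~2K flow ops/s per OVS instance (figure cited above)
controller_capacity = 500_000   # assumed internal controller capability (illustrative only)

for n_switches in (1, 4, 16, 64, 256):
    switch_ceiling = n_switches * OVS_LIMIT_FOPS
    bottleneck = "switches" if switch_ceiling < controller_capacity else "controller"
    print(f"{n_switches:4d} switches -> test ceiling "
          f"{min(switch_ceiling, controller_capacity):>8,} fops/s ({bottleneck}-bound)")
```

Under these assumptions, a handful of OVS instances caps the measurable rate at a few thousand fops/s regardless of what the controller could actually sustain; only at much larger switch counts does the controller itself become the limiting factor.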
Why is this important? Because using these numbers to extrapolate expected results in a large production environment would be deeply flawed. In such environments, the larger number of devices creates more opportunity for parallelism and consequently more load on the subsystems responsible for distributing state or for tracking interactions with the environment. It is then that bottlenecks previously hidden in tests with little or no parallelism will be revealed.
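As a hypothetical sketch of how such a bottleneck stays hidden at low parallelism, the following snippet times the same amount of simulated work with and without a global lock; the 1 ms of "work" and the lock itself are stand-ins for a serialized internal subsystem, not actual controller code:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()

def serialized_op():
    with lock:             # every request funnels through one shared lock
        time.sleep(0.001)  # 1 ms of simulated work

def parallel_op():
    time.sleep(0.001)      # same simulated work, no shared lock

def run(op, workers, n=100):
    """Time n operations executed by a pool of `workers` threads."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: op(), range(n)))
    return time.perf_counter() - t0

for workers in (1, 10):
    print(f"{workers:2d} workers: serialized {run(serialized_op, workers):.3f}s, "
          f"parallel {run(parallel_op, workers):.3f}s")
```

With a single worker the two variants look identical, so a low-parallelism test cannot tell them apart; only at higher concurrency does the serialized path fail to scale while the lock-free path speeds up roughly with the worker count.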
For this reason, to properly assess the capabilities and limitations of a controller platform, one must take care to expose the system to adequate levels of parallelism and stress, something that was not accomplished by testing through the REST API and against a small number of relatively slow switches.
This remains an area of ongoing discussion with the ODL team, as the two platforms have slightly different models of use: ONOS focuses on ultra-high-performance Java APIs for on-platform applications, whereas ODL is oriented more toward external applications.
A year ago, the ONOS team published a comprehensive study of its controller platform, which establishes a number of important performance metrics, in terms of both throughput and latency of operations, while considering the critical dimensions of high availability and scalability. That paper continues to be the benchmark for controller performance, and we continue to measure ONOS in these areas to make sure that every release provides performance as good as, or better than, what is described in the report.
We are pleased that the authors of the ODL performance white paper were open to our suggestions and that these aspects will be considered in future editions of their report. We also want to thank them for correcting the comparative ONOS numbers by rerunning their tests using ONOS 1.5.1, which contains the REST APIs required to make a meaningful comparison.
The revised ODL performance report will be available shortly on the OpenDaylight website.