At present, there is an increasing shift in electric energy generation from centralized---often environmentally harmful---power plants to distributed renewable energy sources. In their paper on intelligence in future electric energy systems, Strasser et al.~\cite{strasser2015review} describe the new challenges that arise with this shift to sustainable electric energy systems.
\subsection{New challenges in power system simulations}
Nowadays, \glspl{drts} are most frequently used to obtain accurate models of the output waveforms of electric energy systems. In \gls{rt} simulations, the equations of one simulation time step must be solved within the corresponding time span in the actual physical world. As Faruque et al.~\cite{faruque2015real} describe, \gls{drts} can be divided into two classes: \textit{full digital} and \gls{phil} real-time simulations. While the former are completely modeled inside the simulator, the latter provide \gls{io} interfaces that allow the user to replace digital models with actual physical components.
Since power grids should be reflected in simulation models as accurately as possible, more complex grids will naturally result in more complex simulations. Hence, the shift towards distributed electric energy generation poses new challenges regarding \gls{drts} complexity. One possible solution to counteract the arising computational bottlenecks is the division of simulation systems into smaller sub-systems~\cite{faruque2015real}.
As a solution to this problem, Stevic et al.~\cite{stevic2017multi} propose a framework which enables geographically distributed laboratories to virtually integrate their off-the-shelf real-time digital simulators, thereby also enabling \gls{rt} co-simulations. Later, Mirz et al.~\cite{mirz2018distributed} summarized further important benefits of such a system: hardware and software of various laboratories can be shared; knowledge exchange among research groups is facilitated and encouraged; confidential data does not need to be shared, since every laboratory can decide to run its own simulations and only share interface variables; and laboratories that lack certain hardware can nonetheless test algorithms on it.
The following subsection presents the implementation of such a system, as presented by Vogel et al.~\cite{vogel2017open}: \textit{VILLASframework}.
\subsection{VILLASframework: distributed real-time co-simulations\label{sec:intro_villas}}
VILLASframework\footnote{\url{}} is an open-source set of tools to enable distributed real-time simulations, published under the \gls{gpl} v3.0. Within VILLASframework, \textit{VILLASnode} instances form gateways for simulation data. \Autoref{tab:villasnode_nodes} shows the currently supported interfaces, which are called \textit{node-types} in VILLASnode. Node-types can roughly be divided into three categories: node-types that can solely communicate with node-types on the same server (\textit{internal communication}), node-types that can communicate with node-types on different servers (\textit{server-server communication}), and node-types that form an interface between a simulator and a server (\textit{simulator-server communication}). An instance of a node-type is called a \textit{node}.
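To make the notion of nodes and their coupling more concrete, the following configuration fragment is a hypothetical sketch in the libconfig-style syntax used by VILLASnode; all node names, addresses, and file names are illustrative and not taken from an actual setup. It couples a UDP-based \textit{socket} node (server-server communication) to a \textit{file} node (internal communication):

```conf
# Hypothetical VILLASnode configuration sketch; all names are illustrative.
nodes = {
    sim_gateway = {
        type = "socket",            # server-server communication over UDP
        layer = "udp",
        in  = { address = "*:12000" },
        out = { address = "10.0.0.2:12000" }
    },
    logger = {
        type = "file",              # internal node-type writing samples to disk
        uri = "samples.log"
    }
}

paths = (
    { in = "sim_gateway", out = "logger" }  # forward received samples to the file
)
```

A \textit{path} connects the output of one node to the input of another; this is the mechanism by which gateways between simulators, files, and remote laboratories are formed.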
\Autoref{fig:villasframework} shows VILLASframework with its main components: VILLASnode and \textit{VILLASweb}. The figure shows nodes in laboratories that form gateways between software (e.g., a file on a host system) or hardware (e.g., a simulator). A node can also be connected to other nodes; these can be located on the same host system, on a different host system in the same laboratory, or on a host system in a remote laboratory. Within VILLASframework, a distinction must be made between the \textit{soft real-time integration layer} and the \textit{hard real-time integration layer}.
\caption{VILLASweb and VILLASnode, the main components of VILLASframework.}\label{fig:villasframework}
Although node-types that realize internal communication are able to achieve hard real-time, none of the node-types that connect different hosts with each other are able to do so. So far, all such node-types rely either on the \gls{tcp}---e.g., \textit{amqp} and \textit{mqtt}---or on the \gls{udp}---e.g., \textit{socket}. Both protocols are part of the transport layer of the Internet protocol suite, and these node-types thus rely on Ethernet as their networking technology.
Within Ethernet, a large portion of the latency between submitting a request to send data and actually receiving the data is caused by software overhead, switches between user and kernel space, and interrupts. For example, Larsen and Huggahalli~\cite{larsen2009architectural} report that, on average, it takes \SI{3}{\micro\second} on their Linux system before control is actually handed to the \gls{nic} when a host tries to send a simple ping message. For the Intel\textregistered{} 82571 \SI{1}{\gigabitethernet} controller they used, these \SI{3}{\micro\second} are \SI{72}{\percent} of the time the message spends in the sending node. Similar proportions of software and hardware latency can be seen at the receiving host. After optimizations, Larsen and Huggahalli reduced the latency of an Intel\textregistered{} 82598 \SI{10}{\gigabitethernet} controller to just over \SI{10}{\micro\second}, in which software latency was still predominant.
Another issue of Ethernet is its variability~\cite{larsen2009architectural}: real-time applications require high predictability and thus low variability of sample latency. Furthermore, \gls{qos} support is limited in Ethernet~\cite{reinemo2006overview}. Techniques to avoid and control congestion can become essential for networks with a high load, which can be caused, for example, by a high number of small samples due to real-time communication.
\subsection{Hard real-time communication between different hosts\label{sec:hard_real_time_communication_between_servers}}
Thus, in order to achieve hard real-time communication between different hosts, a technology other than Ethernet must be used. An alternative that is particularly suitable for this purpose is \textit{InfiniBand}. This technology is specifically designed as a low-latency, high-throughput inter-server communication standard. Due to its design, every process assumes that it owns the network interface controller, and the operating system does not need to multiplex it between processes. Consequently, processes do not need to invoke system calls---and thus trigger switches between user and kernel space---while transferring data. It is even possible to send data to a remote host without its software noticing that data is written into its memory. Furthermore, InfiniBand has extensive support for \gls{qos} and is a lossless architecture, which means that---unlike Ethernet---it does not rely on dropping packets to handle network congestion. Finally, the InfiniBand Architecture handles many more complex tasks, such as reliability, directly in hardware.
Because this technology seems so well suited for this purpose, the present work investigates the possibilities of implementing a VILLASnode node-type that relies upon InfiniBand as its communication technology.
\section{Related work\label{sec:related_work}}
The goal of the present work was to develop a communication channel among different host systems that is optimized regarding latency. Therefore, this section will examine previous performance studies on InfiniBand that present optimizations regarding latency.
In their work, MacArthur and Russell evaluate how certain programming decisions affect the performance of messages that are sent over an InfiniBand network~\cite{macarthur2012performance}. They examine several features that potentially affect performance:
\begin{itemize}
\item The \textbi{operation code}, which determines if a message will be sent with either channel or memory semantics.
\item The \textbi{message size}.
\item The \textbi{completion detection}, which determines whether the completion queue gets actively polled or provides notifications to the waiting application. This setting also heavily affects \acrshort{cpu} utilization.
\item \textbi{Sending data inline}, with which the \acrshort{cpu} directly copies data to the network adapter instead of relying on the adapter's \acrshort{dma}.
\item \textbi{Processing data simultaneously}, by sending data from multiple buffers instead of one.
\item Using a \textbi{work request submission list}, with which instructions are submitted to the network adapter as a list instead of one at a time.
\item Turning \textbi{completion signaling} periodically on and off for certain operations.
\item The \textbi{wire transmission speed}.
\end{itemize}
They conclude that an application should use the operation code that best suits its needs. A limiting factor here is often the need to notify the receiver about new data. When comparing the operation codes that support notifying the receive side, i.e., \textit{send} and \textit{\acrshort{rdma} write with immediate}, the performance difference is negligible.
For ``small'' messages ($\leq\SI{1024}{\kibi\byte}$), the message size had little influence on latency under normal circumstances. For ``large'' messages ($\geq\SI{1024}{\kibi\byte}$), however, they observed that latency increased with message size.
When the completion queue provided notifications on the arrival of new data, they measured a \acrshort{cpu} utilization of \SI{20}{\percent} for messages smaller than \SI{512}{\byte} and \SI{0}{\percent} for messages larger than \SI{4}{\mebi\byte}. When the queue was actively polled, the \acrshort{cpu} utilization turned out to always be \SI{100}{\percent}. Although completion detection with notifications was more resource friendly, they found that, in the case of small messages, it resulted in latencies that were almost four times higher than with active polling. For large messages this difference diminished; the latencies of messages larger than \SI{16}{\kibi\byte} showed no difference at all.
They advise sending data inline whenever this feature is supported by the network adapter in use and the message size is smaller than the cache line size of the adapter. They discovered that sending data inline required a few additional \acrshort{cpu} cycles, but resulted in a latency decrease of up to \SI{25}{\percent}. They also called attention to the fact that sending messages larger than the cache line size of the network adapter inline had a detrimental effect on latency.
With regard to the number of buffers, they found the ideal number to be around 8 for small messages and 3 for large messages. Using more buffers did not increase performance further and even resulted in slightly worse performance in some cases. By using 8 buffers and sending data inline, they detected one-way latencies as low as \SI{300}{\nano\second}. This is considerably less than the latencies for Ethernet reported by Larsen and Huggahalli~\cite{larsen2009architectural}.
Their recommendation regarding the submission of lists of instructions is to only use it when appropriate: whenever it is possible to submit an instruction individually, this should be the preferred method. In that way, the adapter can queue the instructions and is thus kept busy.
Last but not least, they examined the influence of completion signaling. Usually, after a message has been (successfully) sent, the sender is notified, for example, in order to release the buffer. MacArthur and Russell first inspected periodic signaling, in which only every $\left(\frac{n_{\mathrm{buffers}}}{2}\right)^{\mathrm{th}}$ message triggers a notification. They found that this usually had little effect on latency; it only had a larger effect when a list with multiple instructions was submitted to the adapter. However, when messages were sent inline, they found that it could be beneficial to disable signaling.
MacArthur and Russell also compared their InfiniBand setup with a contemporary \gls{roce} setup. Although they concluded that InfiniBand outperformed \gls{roce} for large messages, they also concluded that the difference for small messages was negligible. However, as Reinemo et al.\ state in their publication~\cite{reinemo2006overview}, support for \gls{qos} is limited in Ethernet and abundantly available in InfiniBand.
In a later work~\cite{liu2014performance}, Liu and Russell focused solely on throughput. Although they exclusively examined messages larger than \SI{32}{\kibi\byte}, which are uncommon in VILLASnode, they drew a few conclusions that apply generally to communication over InfiniBand. They observed that:
\begin{itemize}
\item in most cases, \acrshort{numa} affinity affects the performance of the network adapter;
\item the performance (with regard to throughput) is sensitive to message alignment;
\item the maximum number of unsignaled instructions before a signaled instruction should be sent is
\begin{equation*}
\begin{cases}
\min\left(\frac{B}{s},1\right) & \mathrm{if}~\SI{16}{\kibi\byte} < \mathrm{message~size} < \SI{128}{\kibi\byte},\\
\min\left(\frac{D_{SQ}}{2},D_{SQ}-B\right) & \mathrm{otherwise},
\end{cases}
\end{equation*}
with $B$ the number of outstanding messages and $D_{SQ}$ the depth of the send queue.
\end{itemize}
Furthermore, they preferred the \textit{\acrshort{rdma} write with immediate} over the \textit{send} operation.
\section{Structure of the present work}
\paragraph{\ref{chap:basics}~\nameref{chap:basics}} aims to give the reader an understanding of the communication architecture that lies at the heart of the VILLASnode node-type that was implemented as part of the present work. The chapter starts with an introduction to the Virtual Interface Architecture and proceeds with a section dedicated to InfiniBand. Before finishing with a section on real-time optimizations, \autoref{chap:basics} elaborates on the software libraries that are used to access InfiniBand hardware.
\paragraph{\ref{chap:architecture}~\nameref{chap:architecture}} expands on the internals of VILLASnode. After having explained the concept of VILLASnode, this chapter discusses the adaptations that had to be made to its architecture to (efficiently) support an InfiniBand node-type. These include changes to function parameters of the interface between the global VILLASnode instance and an instance of a node-type, to the memory management of VILLASnode, and to the finite-state machine of instances of node-types.
\paragraph{\ref{chap:implementation}~\nameref{chap:implementation}} first discusses the non-trivial parts of the implementation of the benchmark that was used to profile the InfiniBand hardware, the \textit{InfiniBand} node-type, and the benchmark that was used to analyze VILLASnode node-types. Then, it discusses how an additional service type was enabled in the communication manager that was used and how the acquired data from the benchmarks was processed.
\paragraph{\ref{chap:evaluation}~\nameref{chap:evaluation}} evaluates the results that were found with the help of the benchmarks that were presented in the previous chapter.
\paragraph{\ref{chap:conclusion}~\nameref{chap:conclusion}} considers whether the assumptions from \autoref{sec:motivation} (\nameref{sec:motivation}) are legitimate and thus whether the \textit{InfiniBand} node-type is a valuable addition to the VILLASframework.
\paragraph{\ref{chap:future}~\nameref{chap:future}} presents possible optimizations that were not examined in the present work. It begins with a brief examination of the possibilities the \texttt{PREEMPT\_RT} patch could bring, continues with a section on optimizations \& profiling of the VILLASnode source code, and ends with a section on \acrshort{roce}.
In addition to this brief introduction on the structure of the present work, every chapter begins with a paragraph that presents the structure of the sections within that chapter.