\chapter{Evaluation\label{chap:evaluation}}
This chapter discusses the results of the previously presented benchmarks. \Autoref{sec:evaluation_ca} starts with an evaluation of the custom one-way \gls{hca} benchmark from \autoref{sec:ca_benchmarks}. After these results have been analyzed, \autoref{sec:perftest} will compare them to the results of \texttt{ib\_send\_lat} of the \gls{ofed} Performance Test package. Subsequently, \autoref{sec:evaluation_villasnode} discusses the several VILLASnode node-types that were benchmarked.
\Autoref{tab:benchmark_testsystem} lists the hardware, the operating system, the \gls{ofed} stack version, and the VILLASnode version that were used for all benchmarks. Fedora was selected as \gls{os} because of its support for the \texttt{tuned} daemon (\autoref{sec:tuned}) and because of its easy-to-set-up support for \texttt{PREEMPT\_RT}-patched kernels (\autoref{sec:future_real_time}). At the time of writing the present work, the chosen Fedora and kernel version was the latest combination that was seamlessly supported by this version of the Mellanox\textregistered{} variant of the \gls{ofed} stack.
\input{tables/benchmark_testsystem}
The system was optimized using the techniques from \autoref{sec:optimizations}. Unless stated otherwise, all analyses that are presented in this chapter have been run under these circumstances. \Autoref{fig:configuration_system} shows the distribution of \glspl{cpu} among cpusets (\autoref{sec:cpu_isolation}). The \glspl{cpu} in the two \textit{real-time-<X>} cpusets are limited to the memory locations in their \gls{numa} node (\autoref{sec:numa}). These memory locations are also the same as those the respective \glspl{hca} will read from or write to. Finally, the system is optimized by setting the \texttt{tuned} daemon to the \textit{latency-performance} profile (\autoref{sec:tuned}).
Thus, all time-critical processes that needed to use the \gls{hca} \texttt{mlx5\_0} were run on \glspl{cpu} 16, 18, 20, and 22, and those that needed to use \texttt{mlx5\_1} on \glspl{cpu} 17, 19, 21, and 23.
\begin{figure}[ht]
\includegraphics{images/configuration_system.pdf}
\vspace{-0.5cm}
\caption{The configuration of the Dell PowerEdge T630 from \autoref{tab:benchmark_testsystem}, which was used in the present work's evaluations. \gls{numa}-specific data was acquired with \texttt{numactl}.}\label{fig:configuration_system}
\end{figure}
\section{Custom one-way host channel adapter benchmark\label{sec:evaluation_ca}}
This section examines different possible configurations of communication over an InfiniBand network using the benchmark presented in \autoref{sec:ca_benchmarks}. It is intended to help make a well-considered choice regarding the configuration of the InfiniBand VILLASnode node-type and to provide a ballpark estimate of the latency this communication technology will show in VILLASnode.
\subsection{Event based polling\label{sec:event_based_polling}}
The first analyses that were performed were meant to examine the characteristics of event based polling (\autoref{fig:event_based_polling}). Since event channels are designed to be \gls{cpu} efficient, in this case, the optimizations from \autoref{sec:cpu_isolation} (``CPU isolation \& affinity'') and \autoref{sec:irq_affinity} (``Interrupt affinity'') were not applied and \autoref{fig:configuration_system} is not relevant. Instead of improving latency, the aforementioned optimizations had an adverse effect and actually increased it. However, the \texttt{tuned} profile \textit{latency-performance} and the memory optimization techniques were applied nevertheless.
\Autoref{tab:oneway_settings_event} shows the settings that were used with the custom one-way benchmark. These settings were introduced in \autoref{sec:tests}. Gray columns in \autoref{tab:oneway_settings_event}, and in all following tables that list benchmark settings, indicate that the settings of these columns were varied during the different runs. Consequently, all settings in the white columns stayed constant whilst performing the different tests. The graphs that were generated from the resulting data are shown in \autoref{fig:oneway_event}.
\input{tables/oneway_settings_event}
In the first three subfigures of \autoref{fig:oneway_event}, $25\cdot8000$ messages of \SI{32}{\byte} were bursted for \gls{rc}, \gls{uc}, and \gls{ud}. This message size was chosen in most of the following tests because it is the minimum size of a message in the VILLASnode \textit{InfiniBand} node-type. Every sample that is sent from one VILLASnode \textit{InfiniBand} node to another contains at least one 8-byte value and always carries \SI{24}{\byte} of metadata.
The first thing that catches the eye is the relatively high median latency (\autoref{eq:latency}) of all service types: $\tilde{t}_{lat}^{RC}=\SI{3608}{\nano\second}$, $\tilde{t}_{lat}^{UC}=\SI{3598}{\nano\second}$, and $\tilde{t}_{lat}^{UD}=\SI{3389}{\nano\second}$. These latencies were caused by the event channels that were used for synchronization: with the abovementioned settings, the benchmark waits until a \texttt{read()} system call returns before it tries to poll the completion queue. Therefore, in the meantime, other processes can be scheduled onto the \gls{cpu}, and it will take a certain amount of time to wake the benchmark up again. So, event based polling results in a lower \gls{cpu} utilization compared to busy polling but, in return, yields a higher latency.
\paragraph{Maxima} The maximum latencies that can be seen were mainly caused by initial transfers immediately after the process started or after a period of hibernation. This is sometimes referred to as the \textit{warm up effect}. Potential solutions for this problem are introduced in \autoref{sec:future_real_time}.
The custom one-way benchmark includes another potential cause for latency maxima. As mentioned in~\autoref{sec:timestamps}, the function that measures and saves the receive timestamps (\autoref{lst:cq_time}) lies in the time-critical path. The worst case situation, in which the two memory regions were only initialized by \texttt{mmap()} but were not yet touched and thus allocated, was examined. This caused maxima of more than \SI{700}{\micro\second}. When the pages were present in the virtual memory, the latency of both save operations was determined to be approximately \SI{40}{\nano\second} together.
Thus, in order to make full use of the capabilities and low latencies of InfiniBand, it is important to carefully pick the operations that lie in the datapath.
\begin{figure}
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_0.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_1.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_c}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_2.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_d}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_3.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_e}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_4.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_f}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_5.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/oneway_event_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_event}. These were used to analyze latencies with event based polling.}\label{fig:oneway_event}
\end{figure}
\paragraph{Minima} The small peaks at the left side of the graphs, between approximately \SI{900}{\nano\second} and \SI{2900}{\nano\second}, were caused by how this benchmark implements event based polling. \Autoref{fig:event_based_polling} already showed that after a completion channel notifies the process that a new \gls{cqe} is available, the \gls{cq} must be polled with \texttt{ibv\_poll\_cq()} to acquire \glspl{cqe}. After polling, this benchmark does not immediately return control to \texttt{ibv\_get\_cq\_event()}; rather it tries to poll again to see if new messages arrived in the meantime. If this was the case, these messages did not have to wait until a \texttt{read()} system call returned before they got processed; for that reason, their latency was lower.
\paragraph{Sent confirmations} \Autoref{sec:qp} already discussed at which moment \glspl{cqe} at the send side are generated. In case of a reliable connection (\autoref{fig:oneway_event_a}), entries showed up in the completion queue when a message was delivered to a remote \gls{ca} and when that \gls{ca} acknowledged that it received the message. Naturally,
\begin{equation}
t_{lat}^{comp} = t_{comp}-t_{subm} > t_{recv}-t_{subm} = t_{lat}
\end{equation}
was almost certainly true for every message that was sent.
This was different for the unreliable service types (\gls{uc} and \gls{ud}, \autoref{fig:oneway_event_b} and \autoref{fig:oneway_event_c}), where the \gls{hca} is only responsible for sending a message. Hence, in these cases, the \gls{hca} generated a \gls{cqe} immediately after a message was sent. Thus, for more messages,
\begin{equation}
t_{lat}^{comp} < t_{lat}
\label{eq:tcomp_min_tsubm}
\end{equation}
was true. In \autoref{fig:oneway_event_b}, this cannot be identified yet, but the difference between the median values $\tilde{t}_{lat}^{comp}$ and $\tilde{t}_{lat}$ is getting smaller. For messages that were sent as unreliable datagrams, \autoref{eq:tcomp_min_tsubm} usually holds, and in \autoref{fig:oneway_event_c},
\begin{equation}
\tilde{t}_{lat}^{comp} < \tilde{t}_{lat}
\end{equation}
is even true.
\paragraph{Comparison of the service types} It can be seen that the median latencies of the unreliable service types were barely different from the median latency of the reliable connection. With \SI{3598}{\nano\second} and \SI{3521}{\nano\second}, the median latencies of \gls{uc} and \gls{ud} were just slightly lower than the median latency of \SI{3608}{\nano\second} of the \gls{rc} service type. As expected, this was caused by the absence of acknowledgment messages between the two channel adapters. However, the variability of the three service types differed. With regard to $t_{lat}$, \gls{ud} had the highest ($t_{lat} > \SI{10000}{\nano\second}$ in \SI{0.1665}{\percent} of the cases) and \gls{uc} the lowest ($t_{lat} > \SI{10000}{\nano\second}$ in \SI{0.0595}{\percent} of the cases) dispersion. In the remainder of this section, \SI{10000}{\nano\second} and \SI{10}{\micro\second} will be used interchangeably; they denote the same duration with a different number of significant figures.
\paragraph{Intermediate pauses} The last three subfigures of \autoref{fig:oneway_event} show the results of the same test, but with an intermediate pause of \SI{1000000000}{\nano\second} (\SI{1}{\second}) and with just $1\cdot8000$ messages per run. One can see that the latency almost doubled. The pause of \SI{1}{\second} was long enough for the \gls{os} to swap out the waiting process, and it took a considerable amount of time to re-activate the process after the \texttt{read()} system call returned. Furthermore, the peaks at the left side of the graphs completely disappeared because now there could never be a second entry in the \gls{cq} after the first entry was acquired.
\subsection{Busy polling\label{sec:busy_polling}}
Event based polling is suitable for semi-time-critical applications in which minimal \gls{cpu} utilization outweighs maximum performance and thus minimal latency. However, if minimal latency is the topmost priority, busy polling (\autoref{fig:poll_based_polling}) should be used.
To be able to compare apples to apples, the settings in \autoref{tab:oneway_settings_busy} are very similar to those in \autoref{tab:oneway_settings_event}, but with a different polling mode. Since busy polling is a \gls{cpu} intensive task, all tests were performed in the optimized environment that was presented at the beginning of this chapter. The results of the tests are displayed in \autoref{fig:oneway_busy}.
\input{tables/oneway_settings_busy}
In the first three subfigures of \autoref{fig:oneway_busy}, again, $25\cdot8000$ messages of \SI{32}{\byte} were bursted for \gls{rc}, \gls{uc}, and \gls{ud}. It is immediately visible that the median latencies $\tilde{t}_{lat}^{RC}=\SI{1269}{\nano\second}$, $\tilde{t}_{lat}^{UC}=\SI{1251}{\nano\second}$, and $\tilde{t}_{lat}^{UD}=\SI{1273}{\nano\second}$ are approximately \SI{65}{\percent} lower than the corresponding latencies for event based polling. This is in line with the findings of MacArthur and Russell~\cite{macarthur2012performance}, who reported a decrease of almost \SI{70}{\percent} in their work.
Since the completion queues on the send side were also busy polled, their latencies also went down. Now, \autoref{eq:tcomp_min_tsubm} holds for both unreliable service types. Note that, depending on the use case, it could be beneficial to busy poll the receive \gls{cq} but to rely on a completion channel that is bound to the send queue. In that way, fewer \gls{cpu} cores are fully utilized by busy polling, but low latencies are still achieved between the sending and receiving node. This approach would naturally result in:
\begin{equation}
t_{lat}^{comp} \gg t_{lat},
\end{equation}
and is suitable for applications that do not need to release the send buffers virtually instantaneously (\autoref{sec:requirements} \& \autoref{sec:proposal}).
\begin{figure}
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_0.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_1.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_c}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_2.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_d}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_3.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_e}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_4.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_f}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_5.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/oneway_busy_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_busy}. These were used to analyze latencies with busy polling.}\label{fig:oneway_busy}
\end{figure}
\paragraph{Maxima} The maximum latencies did not decrease with the same proportions as the median latencies, but still notably. With regard to $\max t_{lat}$, the results for the reliable service type decreased by approximately \SI{14}{\percent} and those for the unreliable service types by approximately \SI{36}{\percent}. The main reason for the maxima was likely the same as for event based polling: the warm up effect caused peaks at the beginning of the transmission. This conjecture is strengthened by the tests that were performed with an intermediate pause of \SI{1}{\second}: such runs are unlikely to have been subject to congestion, yet their maximum latencies were only slightly lower, which indicates that the maxima were caused by the scheduling of the polling process rather than by congestion.
\paragraph{Minima} Latency minima like those seen with event based polling could not arise here. Since the \gls{cq} is polled continuously, there are no short periods during which a different polling mode is effectively in use, which is what caused those peaks.
\paragraph{Variability} The number of messages for which it took more than \SI{10}{\micro\second} to arrive at the receiving host was almost one order of magnitude lower for the \gls{rc} and \gls{ud} service types, and approximately 5 times lower for the \gls{uc} service type. This considerably reduced variability naturally implies a higher predictability. When sending messages in an environment that is based on busy polling, the maximum latency can be estimated with more certainty.
\paragraph{Intermediate pauses} This shows another important difference between event based polling and busy polling. Whereas the runs with event based polling showed more than double the latency when intermediate pauses occurred between transfers, runs that relied on busy polling showed a much smaller difference. Latencies of tests with intermediate pauses were about \SI{20}{\percent} higher than latencies of tests without any pauses when busy polling was applied. The same comparison for tests that relied on event based polling yielded a difference of \SI{120}{\percent}.
Although the median latencies with intermediate pauses when busy polling were substantially better than when waiting for an event, they were still higher than anticipated. Since the process continuously polled the completion queue, and the operating system should thus not have suspended it, it was expected that $\tilde{t}_{lat}$ would be lower for scenarios with less traffic on the link. However, for these cases, $\tilde{t}_{lat}$ was slightly higher in \autoref{fig:oneway_busy}.
It was first suspected that \gls{aspm}, which is described in the \gls{pcie} Base Specifications~\cite{pcisig2010pciexpress}, caused this additional latency. This technique sets the \gls{pcie} link to a lower power state when the device it is connected to---which would in this case be the \gls{hca}---is not used. However, when the tests from \autoref{tab:oneway_settings_busy} were repeated with \gls{aspm} explicitly turned off, the results remained the same.
The second suspicion was related to the power saving levels of the \gls{cpu}: the so-called \textit{C-states}. After ensuring that all power savings were turned off---i.e., C0 was the only allowed state---a maximum response latency of \SI{0}{\micro\second} was written to \texttt{/dev/cpu\_dma\_latency}. This virtual file forms an interface to the \gls{pmqos}\footnote{\url{https://www.kernel.org/doc/Documentation/power/pm_qos_interface.txt}}, and writing \zero{} to it expresses to the \gls{os} that the minimum achievable \gls{dma} latency is required. However, this also did not improve $\tilde{t}_{lat}$.
Nevertheless, busy polling is still the more suitable technique for real-time applications. The next sections will explore other techniques to reduce the latency even more. For the methods that are likely to have a similar impact on the different service types, only the \gls{uc} service type was used for the sake of brevity. The unreliable connection was chosen because it showed the best results so far.
\subsection{Differences between the submit and send timestamp\label{sec:difference_timestamps}}
This subsection explores the difference between the moment a work request is submitted to the send queue and the moment the \gls{hca} actually sends the data. The feature of the benchmark that measures this difference is based on \autoref{lst:time_thread}: the sending node keeps updating the timestamp until the \gls{hca} copies the data to one of its virtual lanes.
\input{tables/oneway_settings_submit_send_comparison}
\Autoref{tab:oneway_settings_submit_send_comparison} shows the settings of the two tests that were performed. The results of both are plotted in \autoref{fig:oneway_submit_send_comparison}.
In the results of this test, and in the results of all following tests of this type, all data regarding $t_{lat}^{comp}$ is completely omitted. In the previous two subsections, it could be seen that settings that affect the receive \gls{cq} will affect the send \gls{cq} in a very similar manner. Hence, continuing to plot it would have been redundant. Rather, two similar data sets that must be compared---e.g., $(t_{recv}-t_{send})$ and $(t_{recv}-t_{subm})$---have been plotted in the same graph.
As it turns out, approximately
\begin{equation}
\left(1-\frac{\SI{726}{\nano\second}}{\SI{1253}{\nano\second}}\right)\cdot\SI{100}{\percent}\approx\SI{42}{\percent}
\end{equation}
of the time that was needed to send a message from one node to another was spent before the \gls{hca} actually copied the data. This timespan includes the notification of the \gls{hca}, but also the accessing and copying of the data from the host's main memory to the \gls{hca}'s internal buffers. Note that this test did not measure the time the data spent in the sending node's \gls{hca}, since it is not possible to update the timestamp once the data resides in the \gls{hca}'s buffers.
This relatively long timespan suggests that the memory access is a bottleneck. The next subsection will discuss a possible solution for small messages.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_submit_send_comparison_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/oneway_submit_send_comparison_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_submit_send_comparison}. These were used to analyze the difference between $t_{lat}$ and $t_{lat}^{send}$.}\label{fig:oneway_submit_send_comparison}
\end{figure}
\subsection{Inline messages\label{sec:oneway_inline}}
\Autoref{eq:delta_inline} in \autoref{sec:timestamps} already suggested that the difference between $\tilde{t}_{lat}$ and $\tilde{t}_{lat}^{send}$ could be an approximation of the latency decrease that can be achieved by using the \textit{inline} flag that some InfiniBand \glspl{hca}---among them the Mellanox\textregistered{} ConnectX\textregistered-4---support. By setting this flag, introduced in \autoref{sec:postingWRs}, relatively small messages ($\lesssim\SI{1}{\kibi\byte}$) will directly be included in a work request. Accordingly, the \gls{hca}'s \gls{dma} does not need to access the host's main memory to acquire the data when it becomes aware of the submitted \gls{wr}. This suggests that posting small messages inline will eradicate a part of the overhead that was discussed in the last subsection.
\Autoref{tab:oneway_settings_inline} shows which settings were used with the one-way benchmark to analyze this difference. They are almost identical to the settings from \autoref{tab:oneway_settings_submit_send_comparison}, but instead of varying the timestamp that was taken ($t_{subm}$/$t_{send}$), the inline mode was varied. The results are depicted in \autoref{fig:oneway_inline}.
\input{tables/oneway_settings_inline}
Being \SI{1264}{\nano\second}, the median latency for the regularly submitted case was almost identical to the latency in \autoref{fig:oneway_submit_send_comparison}, which makes it very suitable for comparison. In \autoref{sec:difference_timestamps}, it was determined that about \SI{42}{\percent} of the time was lost before the \gls{hca} actually copied the data to its own buffers. The graph shows that messages that were submitted with the inline flag had a
\begin{equation}
\left(1-\frac{\SI{906}{\nano\second}}{\SI{1264}{\nano\second}}\right)\cdot\SI{100}{\percent}\approx\SI{28}{\percent}
\label{eq:inline_decrease}
\end{equation}
lower latency than regularly submitted messages.
Thus, apparently, the additional memory access the \gls{hca} had to perform when a \SI{32}{\byte} message was not directly included in the work request was accountable for \SI{28}{\percent} of the latency. Hence, if possible, it is favorable for latency to include data directly in the work request. Furthermore, as mentioned in \autoref{sec:postingWRs}, another advantage is the fact that the used buffers can be released immediately after submitting the \gls{wr}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_inline_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_inline_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_inline}. These were used to analyze the difference between messages that are submitted regularly ($t_{subm}^{reg.}$) and messages that are submitted inline ($t_{subm}^{inl.}$).}\label{fig:oneway_inline}
\end{figure}
\subsection{RDMA write compared to the send operation}
\Autoref{tab:transport_modes} presented the different operations which are supported for the different service types. So far, all discussed tests relied on \textit{send with immediate}. The second suitable operation to transfer a message to a remote host, which likewise supports an additional 32-bit header as identifier, is \textit{\gls{rdma} write with immediate}. In the remainder of this chapter, for the sake of brevity, this operation is simply referred to as \textit{\gls{rdma} write}.
\Autoref{tab:oneway_settings_rdma} describes the settings that were used with the one-way benchmark to compare the \textit{send} operation with \textit{\gls{rdma} write}. Note that \gls{ud} is not included, since none of the \gls{rdma} operations support it. The results of the tests are depicted in \autoref{fig:oneway_rdma}.
\input{tables/oneway_settings_rdma}
In these results, the \textit{\gls{rdma} write} operation seems slower than the \textit{send} operation. However, a few remarks have to be made. First, the maximum latency and the variability of the \gls{rdma} transfers were lower. In case of the \gls{uc} service type, sending messages with \gls{rdma} resulted in $5\times$ fewer messages with a latency greater than \SI{10}{\micro\second}. (In some iterations of the tests, reductions of up to $25\times$ could be seen.) So, although the median latency was slightly higher for \gls{rdma}, the lower variability makes it the more predictable operation.
Secondly, this test relied on the \textit{\gls{rdma} write with immediate}, not \textit{\gls{rdma} write}. The actual \textit{\gls{rdma} write} operation is probably a little faster, but without synchronization there is no way for a process on the receiving side to know when data is available. Since the only other way of synchronizing would be using an additional \textit{send} operation, \textit{\gls{rdma} write with immediate} is the fastest way of sending data with \gls{rdma} and signaling to the receiving node that data is available.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_rdma_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_rdma_hist/plot_0.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_rdma_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_rdma_hist/plot_1.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_rdma_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_rdma}. These were used to analyze the difference between the \textit{\gls{rdma} write with immediate} and \textit{send with immediate} operation.}\label{fig:oneway_rdma}
\end{figure}
\subsection{Unsignaled messages compared to signaled messages}
\Autoref{sec:postingWRs} discussed that the \gls{ofed} verbs make it possible to submit \glspl{wr} to the \gls{sq} without generating a notification. Thereafter, \autoref{sec:villas_write} presented how this technique was implemented in the node-type's write-function. This was done to prevent file structures from unnecessarily rippling through the completion queue into the write-function, only to be discarded there. Since MacArthur and Russell~\cite{macarthur2012performance} only observed small performance increases but recommended sending unsignaled for inline messages, the following tests were intended to review the performance increase in the present work's environment.
\Autoref{tab:oneway_settings_unsignaled_inline} shows the settings that were used with the one-way benchmark during these tests and \autoref{fig:oneway_unsignaled_inline} shows the resulting latencies. The median latency $\tilde{t}_{lat}^{sig.}$ of the messages that were sent inline with signaling approximately corresponds to the number from \autoref{fig:oneway_inline}. Thus, since \autoref{fig:oneway_unsignaled_inline} shows that the median latency of unsignaled messages is:
\begin{equation}
\tilde{t}_{lat}^{uns.} \approx 0.87\cdot \tilde{t}_{lat}^{sig.},
\end{equation}
it can be concluded that turning signaling off yields a noteworthy performance increase. By signaling only shortly before the send queue overflows, a decrease in latency of almost \SI{13}{\percent} can be seen.
\input{tables/oneway_settings_unsignaled_inline}
Because previous works~\cite{macarthur2012performance, liu2014performance} were inclined to use \textit{\gls{rdma} write} over \textit{send} operations, the same tests as in \autoref{tab:oneway_settings_unsignaled_inline} were repeated with \textit{\gls{rdma} write} as operation mode. Similar to the results in \autoref{fig:oneway_rdma}, the latency for messages that were sent over \gls{rdma} was worse than for those that were sent normally. However, the relative increase in performance caused by disabling the signaling was, at slightly more than \SI{12}{\percent}, almost identical to the increase in \autoref{fig:oneway_unsignaled_inline}.
The settings and the results of these tests can be seen in \autorefap{a:oneway_unsignaled_rdma}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_unsignaled_inline_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_unsignaled_inline_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_unsignaled_inline} to analyze the difference in latency between messages that did and did not cause a \acrfull{cqe}. The \textit{send} operation mode was used in this test.}\label{fig:oneway_unsignaled_inline}
\end{figure}
Based on the results from the previous subsections, $\tilde{t}_{lat} = \SI{786}{\nano\second}$ seems to be the lowest achievable median latency for 32-byte messages. This confirms the implementation of the VILLASnode node-type that was presented in \autoref{sec:villas_read} and~\ref{sec:villas_write}: in the communication between \textit{InfiniBand} node-types, the \textit{send} operation mode is used, messages below a configurable size threshold are sent inline, and a \gls{cqe} for inline messages is only generated when a counter reaches a configurable threshold.
Although sub-microsecond latencies could easily be achieved in the used environment, there was still a considerable deviation from the latencies MacArthur and Russel~\cite{macarthur2012performance} observed, which had minima around \SI{300}{\nano\second}. A possible explanation for this is the number of buffers used. The objective of this benchmark was to find the best fit for a VILLASnode node-type. Because a node-type needs a relatively large pool of buffers to be able to process many small samples at high frequencies, this benchmark also assumed a large pool of buffers. MacArthur and Russel, however, observed that latencies in their environment started to increase when more than 16 buffers were used.
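The send-path strategy that these results confirm can be sketched in a few lines of C. The snippet below is a simplified illustration, not the actual VILLASnode implementation: the helper name is hypothetical, and the two flag constants stand in for \texttt{IBV\_SEND\_INLINE} and \texttt{IBV\_SEND\_SIGNALED} from \texttt{<infiniband/verbs.h>}.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for IBV_SEND_INLINE and IBV_SEND_SIGNALED
 * from <infiniband/verbs.h>. */
#define SEND_INLINE   (1u << 0)
#define SEND_SIGNALED (1u << 1)

/*
 * Decide the flags for the next work request:
 *  - payloads of up to max_inline bytes are copied into the WQE (inline),
 *  - a CQE is only requested for every threshold-th WR; the counter is
 *    reset because a signaled completion implicitly completes all older
 *    unsignaled WRs on the same queue pair.
 */
static unsigned int send_flags(size_t len, size_t max_inline,
                               unsigned int *unsignaled,
                               unsigned int threshold)
{
    unsigned int flags = 0;

    if (len <= max_inline)
        flags |= SEND_INLINE;

    if (++(*unsignaled) >= threshold) {
        flags |= SEND_SIGNALED;
        *unsignaled = 0;
    }

    return flags;
}
```

Because a signaled completion retires all older unsignaled work requests on the same queue pair, polling one \gls{cqe} per threshold interval is sufficient to keep the send queue from overflowing.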
\subsection{Variation of message size\label{sec:variation_of_message_size}}
All aforementioned tests assumed an idealized situation with 32-byte messages. Usually, the packets in a real-time co-simulation framework will be a few powers of two larger. \Autoref{tab:oneway_settings_message_size} shows the settings that were used with the one-way benchmark to explore the influence of message size on the latency.
The tests are grouped into three categories: \autoref{fig:oneway_message_size_a} exclusively shows the \gls{rc}, \autoref{fig:oneway_message_size_b} the \gls{uc}, and \autoref{fig:oneway_message_size_c} the \gls{ud} service type. Furthermore, an upward pointing triangle and a dark shade indicate a \textit{send} operation, and a downward pointing triangle and a light shade an \textit{rdma write} operation. Black shades were used for messages that were sent normally and blue shades for messages that were sent inline.
Whenever possible, tests were performed with messages ranging from \SI{8}{\byte} to \SI{32}{\kibi\byte}. However, inline work requests and the \gls{ud} service type do not support messages that big; the adjusted ranges are listed in \autoref{tab:oneway_settings_message_size}.
\input{tables/oneway_settings_message_size}
\paragraph{Constant latency (\SI{8}{\byte}--\SI{256}{\byte})} As can be seen in \autoref{fig:oneway_message_size}, all $\tilde{t}_{lat}$ of messages that were smaller than \SI{256}{\byte} were virtually the same. The only difference is that, as expected from \autoref{eq:inline_decrease}, messages that were sent inline have a median latency that is approximately \SI{28}{\percent} lower than that of messages that were sent normally. All these $\tilde{t}_{lat}$ were around the values that could be seen for \SI{32}{\byte} messages in \autoref{fig:oneway_busy},~\ref{fig:oneway_inline},~and~\ref{fig:oneway_rdma}. This is similar to MacArthur and Russel's results~\cite{macarthur2012performance}. In their publication, they found that messages smaller than \SI{1024}{\byte} have a somewhat constant latency. In the present work's findings, this only holds for messages up to approximately \SI{256}{\byte}.
For all these sizes, the variance of the latencies is minimal. The error bars in \autoref{fig:oneway_message_size} indicate the boundaries of the upper and lower \SI{10}{\percent} of the values.
\paragraph{Increasing latency (\SI{256}{\byte}--\SI{32}{\kibi\byte})} When the message size exceeded \SI{256}{\byte}, $\tilde{t}_{lat}$ began to rise gradually and the variance increased for messages that were sent normally. At \SI{256}{\byte}, $\tilde{t}_{lat}$ for messages that were sent inline even exceeded the median latency of messages that were sent normally. Because not only the message size but also the burst size changed for the blue lines, the inline tests were repeated with a fixed burst size of 2730 messages per burst (\autorefap{a:oneway_message_size_inline}). Since the steep slope between \SI{128}{\byte} and \SI{256}{\byte} is still present for fixed burst sizes, it can be concluded that---although the \gls{hca} allows it---sending data inline is not always favorable.
\begin{figure}[ht!]
\begin{subfigure}{0.351\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_median/plot_0.pdf}
\caption{\gls{rc}}\label{fig:oneway_message_size_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_median/plot_1.pdf}
\caption{\gls{uc}}\label{fig:oneway_message_size_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_median/plot_2.pdf}
\caption{\gls{ud}}\label{fig:oneway_message_size_c}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\centering
\vspace{0.15cm}
\includegraphics{plots/oneway_message_size_median/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_message_size}. These were used to analyze the influence of message size on the latency. While a triangle indicates $\tilde{t}_{lat}$ for a certain message size, the error bars indicate the upper and lower \SI{10}{\percent} of $t_{lat}$ for that message size.}\label{fig:oneway_message_size}
\end{figure}
The increasing latency of inline messages around \SI{256}{\byte} is in line with the findings of MacArthur and Russel~\cite{macarthur2012performance}. In their work, they claim that this latency step was caused by their adapter's cache line size, which happened to be \SI{256}{\byte}. The \gls{hca} that was used in the present work, however, had a cache line size of a mere \SI{32}{\byte}. Thus, according to their findings, messages that were equal to or larger than \SI{32}{\byte} should have had substantially higher latencies than messages that were smaller than \SI{32}{\byte}. However, this was not the case, as can be seen in \autoref{fig:oneway_message_size}. This leads to the conclusion that the increase is not solely caused by the cache line size.
\paragraph{Decreased variability (\SI{4096}{\byte})}The second aspect that catches the eye is located at \SI{4096}{\byte}, which also happened to be the set \gls{mtu} in these tests. For all service types---even for \gls{ud}, which only supports messages up to the \gls{mtu}---the variability of the latency decreased for messages larger than or equal to \SI{4096}{\byte}. Thus, although the median latency continued to rise, the predictability of the latency also rose.
\paragraph{Further peculiarities} There was no meaningful difference between channel semantics and memory semantics with immediate data. Although the \textit{send} operation was always slightly better in terms of median latency, the operation that best suits the requirements of the application should be used.
To make sure that the increased median latency was not caused by congestion control (\autoref{sec:congestioncontrol}), all tests from \autoref{tab:oneway_settings_message_size} were repeated with an intermediate pause of \SI{5500}{\nano\second} between calls of \texttt{ibv\_post\_send()}. As \autorefap{a:oneway_message_size_wait} shows, this did not influence the median latency.
\section{OFED's round-trip host channel adapter benchmark\label{sec:perftest}}
This section analyzes some assumptions that were made in previous sections. In the first subsection, the results of the round-trip benchmark \texttt{ib\_send\_lat} will be compared to the results from \autoref{sec:variation_of_message_size}. Then, in the second and third subsection, the influence of the \gls{mtu} and the \gls{qp} type on latency will be examined.
\subsection{Correspondence between round-trip and one-way benchmark}
\Autoref{tab:correlation_benchmarks} shows the results for the first tests that were performed with \texttt{ib\_send\_lat}. The median latencies in \autoref{tab:correlation_benchmarks} approximately correspond to the latencies for the same test in \autoref{fig:oneway_message_size}. It stands out that the same leap in latency between \SI{128}{\byte} and \SI{256}{\byte} that could be seen in \autoref{fig:oneway_message_size}, also occurred in these results. To rule out that this leap was solely caused by the fact that messages were not sent inline anymore at \SI{256}{\byte}, the test was also performed with the inline threshold set to a higher value. In this second test, the leap between \SI{128}{\byte} and \SI{256}{\byte} turned out to be even higher.
\input{tables/correlation_benchmarks}
\paragraph{Difference in maximum latencies} A substantial difference between the results of the round-trip benchmark and the custom one-way benchmark was found in the maxima. This was, in all likelihood, caused by the sample sizes used. The description in the \gls{ofed} Performance Tests' Git repository\footnote{\url{https://github.com/linux-rdma/perftest}} states that ``setting a very high number of iteration may have negative impact on the measured performance which are not related to the devices under test. If [\ldots] strictly necessary, it is recommended to use the -N flag (No Peak).'' Therefore, the round-trip benchmark was left at its default of 1000 messages per test. Since the custom one-way benchmark was meant to mimic the behavior of InfiniBand hardware in VILLASnode---which would also burst large amounts of small messages at high frequencies---this hint was ignored in the custom benchmark. Every marker in \autoref{fig:oneway_message_size} includes between \SI{27300}{} and \SI{80000}{} time deltas.
Anticipating the analysis of the \textit{InfiniBand} node-type, the one-way benchmark gave a more realistic view of the way the InfiniBand adapters would behave in VILLASnode. For example, all plots in \autorefap{a:timer_comparison} show low median latencies, but also latency peaks that are much higher than the median values.
Furthermore, the median latencies the round-trip benchmark yielded were marginally lower than the ones yielded by the custom one-way benchmark. This difference was probably caused by the abovementioned effect as well.
\subsection{Variation of the MTU}
Crupnicoff, Das, and Zahavi~\cite{crupnicoff2005deploying} report that the selected \gls{mtu} does not affect the latency. Since the \gls{mtu} can affect the latency in other technologies---such as Ethernet---this claim was examined. With \texttt{ib\_send\_lat}, it is fairly easy to change the \gls{mtu}. All results of this test are displayed in \autoref{tab:mtu_performance}. Since the \gls{uc} service type is not officially supported by the \gls{rdma} \gls{cm} \gls{qp}, only results for the \gls{rc} and \gls{ud} service type are shown.
The table shows that no extraordinary peaks occurred. The only latency that stands out is marked in red. However, since the difference is not substantial, and since this is the only occurrence of such a peak, it can be assumed that the \gls{mtu} indeed does not affect latency.
\input{tables/mtu_performance}
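For reference, such an \gls{mtu} variation could be invoked as sketched below; the device name and hostname are placeholders, and the flags are the standard perftest options (\texttt{-d} device, \texttt{-i} port, \texttt{-c} connection type, \texttt{-m} \gls{mtu}, \texttt{-s} message size, \texttt{-n} iterations):

```shell
# Server (receiver) side; placeholder device, RC service type, 4096-byte MTU:
ib_send_lat -d mlx5_0 -i 1 -c RC -m 4096 -s 32 -n 1000

# Client side, connecting to the server (placeholder hostname):
ib_send_lat -d mlx5_0 -i 1 -c RC -m 4096 -s 32 -n 1000 server-hostname
```

Repeating the run with different \texttt{-m} values then yields a table such as the one above.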
\subsection{RDMA CM queue pairs compared to regular queue pairs}
In all implementations presented in the present work, it was assumed that the performance of a regular \gls{qp} and a \gls{qp} that is managed by the \gls{rdma} \gls{cm} is almost identical. This assumption was evaluated as well.
\Autoref{tab:qp_performance} shows that the median latency for smaller messages was slightly lower for regular \glspl{qp}. For larger messages, this difference in latency diminished. This marginal difference, however, does not outweigh the convenience that comes with the \gls{rdma} communication manager. To achieve a latency decrease of less than \SI{7}{\percent} (\autoref{tab:qp_performance}'s worst case), considerable complexity would have to be added to the source code in order to efficiently manage the \glspl{qp}.
\input{tables/qp_performance}
\section{VILLASnode node-type benchmark\label{sec:evaluation_villasnode}}
Again, all runs of the benchmark in this section were performed in the optimized environment as introduced in \autoref{fig:configuration_system} on the host system from \autoref{tab:benchmark_testsystem}.\footnotemark{}
\footnotetext{A small change to the environment had to be made: all tests that are presented in the following were performed with a customized version of the \textit{latency-performance} \texttt{tuned} profile. The reason for this is discussed in the paragraph ``Optimized environment'' below.}
\paragraph{Timer of the signal node} To find the timer that was best suited for the needs of the analyses that are discussed in this section, separate tests were performed and their results are presented below. Since the ability to generate samples at high rates was a requirement for most of the analyses in the remainder of this section, a fixed, high rate of \SI{100}{\kilo\hertz} was set for the tests to analyze the timers. Four tests were prepared: two with a VILLASnode instance with a timer object that relies on a file descriptor for notifications (\texttt{timerfd}) and two with a timer that relies on the \gls{tsc}. For the former, as can be seen in \autoref{tab:timer_comparison}, more steps were missed at high rates. In the optimized environment, the file descriptor based implementation missed about \SI{0.68}{\percent} of the signals, whereas the \gls{tsc} based implementation only missed \SI{0.50}{\percent} of the steps. Since the implementation with the least missed steps is preferred---after all, when steps are missed, the actual rate that is sent to the node-type under test is lower than the set rate---the \gls{tsc} was chosen as timer for the following tests.
\Autorefap{a:timer_comparison} shows the histograms for these four tests, including the missed steps and an indicator for whether samples were not transmitted by the nodes that were tested. When comparing the median latencies of the four cases, it becomes apparent that the \texttt{timerfd} timer affected the measured $\tilde{t}_{lat}$ more than the \gls{tsc}. Since this means that the benchmark's results with the \gls{tsc} better reflect the actual performance of the node-type under test, this is another advantage of the \gls{tsc}. Furthermore, in case of the unoptimized environment, the latency's variability with the \texttt{timerfd} timer was considerably worse than in the three other cases.
In later tests, it was also discovered that the \gls{tsc} did not perform well at relatively low rates ($\leq\SI{2500}{\hertz}$). As it turned out, for the minimum rate of \SI{100}{\hertz}, approximately \SI{8}{\percent} of the steps were missed. However, using the \texttt{timerfd} timer for these low rates would noticeably skew the results, and a deviation of \SI{8}{\hertz} is unlikely to influence the latencies of the analyzed nodes. Therefore, the \gls{tsc} was also used for these low rates.
\input{tables/timer_comparison.tex}
\paragraph{Optimized environment} The tests that were done to analyze the behavior of the timers also revealed information about the effect of the optimized and unoptimized environment on latencies. As it turned out, using the \textit{latency-performance} \texttt{tuned} profile was detrimental to the latency and the overall performance, regardless of the environment used. For the cases in \autoref{fig:timer_comparison}, median latencies increased by about \SI{700}{\nano\second}, variability and maxima rose, and the \texttt{timerfd} timer missed up to \SI{15}{\percent} of the steps. Further research showed that the \texttt{force\_latency} flag (line 6, \autoref{lst:tuned_latency_performance}) caused this problem. Therefore, in all tests that are presented in the following, a customized version of the \textit{latency-performance} \texttt{tuned} profile without this flag was used.
\Autoref{fig:timer_comparison} also reveals that running VILLASnode in the optimized environment was beneficial for latency. However, the difference between both environments was not dramatic. A likely reason is that the test system from \autoref{tab:benchmark_testsystem} was fully dedicated to the tests that were run on it. In a real-life scenario, the system would be busy with other processes, and the difference in latency between processes in the shielded cpuset and in the normal pool of \glspl{cpu} would presumably be larger.
\paragraph{Configuration of the InfiniBand nodes} It was found that the number of buffers hardly influenced the performance of the \textit{InfiniBand} node-type. Even MacArthur and Russel's ``ideal'' number of buffers---although impracticable for the purposes of this real-time framework---was investigated~\cite{macarthur2012performance}. Apart from the fact that such a small number of buffers made it impossible to send samples larger than a few bytes at high frequencies, barely any difference in latency could be seen compared to cases with (far) more buffers.
A considerable difference, however, could be seen when the size of the receive queue and the number of mandatory work requests in the receive queue were varied. The lowest latency arose when the size and the number of \glspl{wr} were chosen to be just big enough to support the highest combination of generation rate and message size. For example, in case of \autoref{fig:timer_comparison_d}, latency extrema around \SI{262}{\micro\second} could be seen with this ideal setup. For an arbitrarily large number (e.g., a queue depth of 8192 and 8064 mandatory \glspl{wr} in the queue), these extrema peaked at more than \SI{3000}{\micro\second}. This effect was caused by the way the \textit{InfiniBand} node-type's read-function is implemented and probably occurred shortly after the initialization of the receiving \textit{InfiniBand} node. As presented in \autoref{fig:read_implementation}, the read-function first fills the receive queue before it starts polling the queue and processing the data. When the threshold is large, it takes a certain amount of time before data can be processed. However, it is important to keep in mind that a larger receive queue yields higher stability because overflows are less likely.
For the send queue, the opposite is true: in order to signal as little as possible, the send queue can be as large as the \gls{hca} allows it to be. The signaling threshold, which describes the maximum number of unsignaled \glspl{wr} before a signaled \gls{wr} must be sent, is determined according to \autoref{eq:signaling} in \autoref{sec:related_work}. If one sample is sent per call of the write-function, which is true for all following tests,
\begin{equation}
S = \frac{D_{SQ}}{2}
\end{equation}
follows from \autoref{eq:signaling}. Before running any of the following tests, it was verified that this threshold indeed yielded the lowest latency: any higher or lower threshold resulted in marginally worse latencies.
The settings for the sending and the receiving \textit{InfiniBand} node can be found in \autoref{a:infiniband_config}. These settings were used in all tests that are presented in this section.
\subsection{Comparison between InfiniBand service types}
This subsection presents the tests that were performed to examine how the different InfiniBand service types perform within VILLASnode. It solely focuses on the reliable connection and on unreliable datagrams since these two service types are officially supported by the \gls{rdma} \gls{cm}, and thus require no modification of the \gls{rdma} \gls{cm} library.
\paragraph{Varying the sample generation rate} In the first set of tests, the rate with which samples were generated was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}. All tests were performed until \SI{250000}{} samples were transmitted. Each sample that was sent contained 8 random 64-bit floating-point numbers. For the reliable connection, this added up to
\begin{equation}
8\cdot\SI{8}{\byte} + \SI{24}{\byte} = \SI{88}{\byte}
\end{equation}
per message, taking the 24-byte metadata into account. For unreliable datagrams, this number was
\begin{equation}
\SI{88}{\byte} + \SI{40}{\byte} = \SI{128}{\byte}
\end{equation}
because the 40-byte \gls{grh} of the sending node was attached to every message. Since the messages were relatively small, they were all sent inline.
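This size arithmetic can be captured in a pair of hypothetical helpers; the constants are the 24-byte per-sample metadata and the 40-byte \gls{grh} introduced above, and the helper names are illustration only:

```c
#include <stddef.h>

#define METADATA_SIZE 24  /* per-sample metadata (see above)           */
#define GRH_SIZE      40  /* Global Routing Header, delivered with UD  */

/* Wire size of a sample carrying 'values' 64-bit floating-point numbers
 * when sent over a reliable connection. */
static size_t msg_size_rc(size_t values)
{
    return values * sizeof(double) + METADATA_SIZE;
}

/* Over unreliable datagrams, the receiver additionally gets the 40-byte
 * GRH with every message. */
static size_t msg_size_ud(size_t values)
{
    return msg_size_rc(values) + GRH_SIZE;
}
```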
\Autoref{fig:varying_rate} shows the results the VILLASnode node-type benchmark yielded with the abovementioned settings. Both service types showed an almost identical behavior, regardless of which rate was set: for both types, $\tilde{t}_{lat}$ decreased when the rate was increased. This is in line with prior observations in \autoref{sec:busy_polling}, where latency increased when pauses between the transmission of messages were increased.
Characteristic of InfiniBand is the (almost) non-existent latency difference between messages on reliable connections and unreliable datagrams. Because, as discussed in \autoref{sec:via}, reliability is handled in the \gls{hca} rather than in the operating system, it causes less overhead.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_IB/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_IB/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yields for the \textit{InfiniBand} node-type with a fixed message size of \SI{88}{\byte} for \gls{rc} and \SI{128}{\byte} for \gls{ud}. The sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and for every rate, \SI{250000}{} samples were sent.}\label{fig:varying_rate}
\end{figure}
All tests for the \textit{InfiniBand} node-type were only performed for signal generation rates up to \SI{100}{\kilo\hertz}. At higher frequencies, the \textit{signal} node started to miss more and more steps. According to the latencies from \autoref{sec:evaluation_ca}, the sample rate needs to be much higher than \SI{100}{\kilo\hertz} before the InfiniBand hardware becomes the bottleneck. Assuming a message resides about \SI{1000}{\nano\second} in the InfiniBand stack and network, rates up to:
\begin{equation}
\frac{1}{\SI{1000}{\nano\second}} = \SI{1}{\mega\hertz}
\end{equation}
are theoretically possible with the numbers measured in the previous section.\footnotemark{} However, two problems arise:
\footnotetext{This is only based on the measured time that a message congests the InfiniBand stack and network; it is assumed that \glspl{wr} can be submitted to the \glspl{qp} with this rate.}
\begin{itemize}
\setlength\itemsep{-0.1em}
\item The refresh rate of the buffers in the receive queue is not indefinitely high. As described in \autoref{sec:villas_implementation}, for its completion queue to be cleared and its receive queue to be refilled, an \textit{InfiniBand} node depends on the rate with which the read-function is invoked. When the \gls{qp} is chosen to be big enough, a node should be able to absorb short peaks in the message rate (e.g., \SI{1}{\mega\hertz}) flawlessly. However, if the rate stays high for an extended amount of time, the buffers will overflow in the current setup.
The theoretically achievable rate is discussed further in \autoref{sec:zero_reference_comparison}.
\item \Autoref{sec:optimizations_datapath} described optimizations that were applied to the \textit{file} node-type. Even though these optimizations considerably increased the maximum signal generation rate, rates well above \SI{100}{\kilo\hertz} were still not achievable. Consequently, to increase this upper limit, the \textit{file} node-type should be optimized further, so that the share it takes in the total datapath decreases.
\end{itemize}
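The first limitation can be made concrete with a toy simulation (illustrative numbers only; it models buffer occupancy, not the actual verbs calls): a receive queue of depth $D$ absorbs a burst as long as the accumulated surplus stays below $D$, but overflows under any sustained arrival rate that exceeds the refill rate.

```c
/*
 * Toy model of receive-queue occupancy: per time slice, 'arrivals'
 * messages consume posted buffers and one read-invocation frees and
 * reposts up to 'refill' buffers. Returns 1 if the queue of the given
 * depth would overflow within 'steps' slices, 0 otherwise.
 */
static int rq_overflows(int depth, int arrivals, int refill, int steps)
{
    int occupied = 0;

    for (int i = 0; i < steps; i++) {
        occupied += arrivals;
        if (occupied > depth)
            return 1;            /* a message arrived with no buffer left */
        occupied -= (occupied < refill) ? occupied : refill;
    }
    return 0;
}
```

With a surplus of 90 buffers per slice, for instance, a queue of depth 1024 survives a 5-slice burst but overflows once the overload is sustained.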
\paragraph{Varying the sample size} In the second set of tests, the generation rate was fixed to \SI{25}{\kilo\hertz}. The message size was varied between 1 and 64 values per sample. This resulted in messages between \SI{32}{\byte} and \SI{536}{\byte} for \gls{rc} and between \SI{72}{\byte} and \SI{576}{\byte} for \gls{ud}. Based on the results from \autoref{tab:oneway_settings_message_size} and \autoref{fig:oneway_message_size}, messages smaller than or equal to \SI{188}{\byte} were sent inline.\footnotemark
\footnotetext{Inline sizes that are powers of two are not supported by the Mellanox \gls{hca} used in the present work. The \gls{hca} automatically converts it to the closest value that is larger than the set value. In this case, \SI{188}{\byte} is the closest value larger than \SI{128}{\byte}.}
The first observation to be made in \autoref{fig:varying_sample_size} is the increasing median latency once messages become larger than approximately \SI{128}{\byte}. This is in line with the findings from \autoref{sec:variation_of_message_size}. Secondly, the variability of the reliable connection was consistently lower than that of unreliable datagrams. This was not only true for high rates, but also for lower rates. Finally, it can be observed that the \gls{rc} service type had a lower median latency than \gls{ud}. A possible reason for this is that the receiving node's \gls{ah} must be added to every work request when the \gls{ud} service type is used. Furthermore, the \gls{grh} is added to every message that is sent with the \gls{ud} service type.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_sample_size_IB/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_sample_size_IB/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yields for the \textit{InfiniBand} node-type at a fixed sample generation rate of \SI{25}{\kilo\hertz} and a message size that was varied between \SI{32}{\byte} and \SI{536}{\byte} for \gls{rc} and between \SI{72}{\byte} and \SI{576}{\byte} for \gls{ud}. For every message size, \SI{250000}{} samples were sent.}\label{fig:varying_sample_size}
\end{figure}
\paragraph{Varying both the sample size and generation rate} \Autoref{fig:rate_size_3d_RC} aims to give a complete view on the influence of the several possible generation rate and message size combinations by combining the previously presented tests. Since the reliable connection shows---although only slightly---the lowest median latencies, this figure only depicts the measurements for \gls{rc}. In this test, the generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and the number of values in a sample between 1 and 64. All tests were performed until \SI{250000}{} samples were transmitted.
\Autoref{fig:rate_size_3d_RC} shows that, in accordance with \autoref{fig:varying_sample_size}, the median latency increased with the message size. Additionally, as can be seen along the rate-axis, a higher message generation rate corresponded to a lower median latency. This could also be seen in \autoref{fig:varying_rate}.
When the \textit{signal} node missed more than \SI{10}{\percent} of the steps for a particular sample rate/sample size combination, this is indicated with a red-colored percentage in \autoref{fig:rate_size_3d_RC}. From these numbers, it becomes evident that the \textit{file} node was not able to process large amounts of data. With tests that missed a substantial number of samples, a threshold $T$ can be approximated as:
\begin{equation}
T = \left(1 - \frac{P_{missed}}{\SI{100}{\percent}}\right) \cdot S_{sample} \cdot f_{signal} \qquad\qquad \mathrm{[B]\cdot[Hz]=[B/s]},
\end{equation}
where $P_{missed}$ is the percentage of missed samples, $S_{sample}$ is the sample size, and $f_{signal}$ the set signal generation rate. In case of the VILLASnode node-type benchmark, this value was approximately \SI[per-mode=symbol]{20}{\mebi\byte\per\second}. This is, nevertheless, only a rough estimation; the signal generation rate probably has a higher impact on the threshold than the sample size.
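As an illustration of this estimate, consider hypothetical but representative numbers: a sample size of $S_{sample} = \SI{536}{\byte}$ generated at $f_{signal} = \SI{100}{\kilo\hertz}$ offers roughly \SI[per-mode=symbol]{53.6}{\mega\byte\per\second} to the node. If approximately \SI{60}{\percent} of the steps are missed in such a test, the estimate evaluates to
\begin{equation}
T \approx \left(1 - \frac{\SI{60}{\percent}}{\SI{100}{\percent}}\right) \cdot \SI{536}{\byte} \cdot \SI{100}{\kilo\hertz} \approx \SI[per-mode=symbol]{21.4}{\mega\byte\per\second} \approx \SI[per-mode=symbol]{20}{\mebi\byte\per\second}.
\end{equation}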
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_IB/median_3d_graph_UD.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.2cm}
\includegraphics{plots/nodetype_3d_IB/3d_RC_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the message size and generation rate on the median latency between two \textit{InfiniBand} nodes that communicate over an \acrfull{rc}.}\label{fig:rate_size_3d_RC}
\end{figure}
The fact that up to \SI{8}{\percent} of the steps were missed at low rates with the \gls{tsc} was already mentioned at the beginning of this section. Since these rates are non-critical for the node-types that were analyzed, it is improbable that a difference of \SI{8}{\hertz} in case of a set rate of \SI{100}{\hertz}, and \SI{75}{\hertz} in case of \SI{2500}{\hertz}, will noticeably affect the median latency. Using an alternative timer, however, would have considerably skewed the latencies in that range.
\Autorefap{a:rate_size_3d_UC_UD} shows the same graphs for \gls{uc} and \gls{ud}, respectively. Both modes show a very similar behavior to the \gls{rc} service type. As observed before, \gls{ud} shows slightly higher median latencies than \gls{rc}. \gls{uc}, on the other hand, shows slightly lower median latencies. This backs the suspicion that was raised earlier on why \gls{ud} was slightly slower than \gls{rc}. Regarding latency, \gls{uc} avoids three major disadvantages of the other two types: it does not need to guarantee the delivery of messages, it does not require an \gls{ah} with every \gls{wr}, and it does not need to add the \SI{40}{\byte} \gls{grh} to every message.
Thus, the smallest median latencies among the service types that are officially supported by the \gls{rdma} \gls{cm} were observed for the reliable connection. When varying both the message size and generation rate, the minimum median latency of about \SI{1.7}{\micro\second} was observed for high rates and small message sizes. The maximum median latency, approximately \SI{4.9}{\micro\second}, was observed for low rates and large message sizes.
\subsection{Comparison to the zero-latency reference\label{sec:zero_reference_comparison}}
The first comparison to be done is between the \textit{InfiniBand} node-type and the \textit{shmem} node-type. The latter uses the \acrshort{posix} shared memory \gls{api} to enable communication between nodes over shared memory regions~\cite{kerrisk2010linux}. Because the latency between two \textit{shmem} nodes will approximately be the time it takes to access memory, its $\tilde{t}_{lat}$ can be approximated by the time $\tilde{t}_{villas}$. $\tilde{t}_{villas}$ is the amount of time that is spent by the super-node, apart from the nodes that are being tested. It thus corresponds to the time that is spent in all blocks of \autoref{fig:villas_benchmark}, minus the time that is spent in the nodes that are being tested.
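The \acrshort{posix} shared memory \gls{api} that the \textit{shmem} node-type builds upon can be sketched as follows. This is a minimal illustration of the underlying system calls only---not VILLASnode's implementation---and the object name is a placeholder:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a POSIX shared memory object, map it twice (as a producer and a
 * consumer would), and verify that a write through one mapping is visible
 * through the other. Returns 0 on success, -1 on failure. */
static int shmem_roundtrip(void)
{
    const char *name = "/shmem_sketch";          /* placeholder object name */
    const size_t size = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t) size) != 0)
        return -1;

    double *tx = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    double *rx = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (tx == MAP_FAILED || rx == MAP_FAILED)
        return -1;

    tx[0] = 3.14;                 /* "transmission" is a plain memory write */
    int ok = (rx[0] == 3.14);

    munmap(tx, size);
    munmap(rx, size);
    close(fd);
    shm_unlink(name);
    return ok ? 0 : -1;
}
```

Because the ``transmission'' is a plain memory access, the \textit{shmem} node-type serves as a near-zero-latency reference against which the InfiniBand overhead can be measured.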
In the tests that were performed, the sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}, every sample contained 8 64-bit floating-point numbers, and for every rate, \SI{250000}{} samples were sent. The results of these tests can be seen in \autoref{fig:shmem_infiniband_comparison}. Compared to previous graphs, this graph additionally contains an indication of the steps the \textit{signal} node missed for every generation rate.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_IB_shmem/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_IB_shmem/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yields for the \textit{shmem} and \textit{InfiniBand} node-type with 8 64-bit floating-point numbers per sample. The sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and for every rate, \SI{250000}{} samples were sent.}\label{fig:shmem_infiniband_comparison}
\end{figure}
\paragraph{Difference in latency} The difference between the latencies of these node-types can be seen as the additional latency that communication over InfiniBand adds. The time penalty that the implementations of the read- and write-functions add can be approximated as:
\begin{equation}
t_{\operatorname{r/w-function}}^{IB} \approx \tilde{t}_{lat}^{IB}-\tilde{t}_{lat}^{shmem}-\tilde{t}_{lat}^{HCA},
\end{equation}
with $\tilde{t}_{lat}^{IB}$ the median latency that is measured when transmitting data between two InfiniBand VILLASnode nodes, $\tilde{t}_{lat}^{shmem}$ the median latency of communication between two \textit{shmem} nodes, and $\tilde{t}_{lat}^{HCA}$ the latency that was seen for inline communication in \autoref{sec:evaluation_ca}.
With $\tilde{t}_{lat}^{IB} \approx \SI{2}{\micro\second}$, $\tilde{t}_{lat}^{shmem} \approx \SI{0.3}{\micro\second}$, and $\tilde{t}_{lat}^{HCA} \approx \SI{0.8}{\micro\second}$, this adds up to approximately \SI{0.9}{\micro\second}. Since, as could be seen in \autoref{sec:busy_polling}, up to \SI{0.3}{\micro\second} of latency was added when the send rate decreased, the values for the highest frequency from \autoref{fig:shmem_infiniband_comparison} were used. In that way, the added time should mainly be caused by the implementations from \autoref{sec:villas_read} and \autoref{sec:villas_write}.
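Substituting these measured median values into the equation above makes the estimate explicit:

```latex
\begin{equation*}
t_{\operatorname{r/w-function}}^{IB}
  \approx \SI{2.0}{\micro\second}
        - \SI{0.3}{\micro\second}
        - \SI{0.8}{\micro\second}
  = \SI{0.9}{\micro\second}
\end{equation*}
```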
\paragraph{Missed steps} The graph shows that, in most cases, the \textit{signal} node only missed slightly more steps when testing the \textit{InfiniBand} node than when testing the \textit{shmem} node. This indicates that the \textit{InfiniBand} node-type did not exert much back pressure and that its write-function returned fast enough, and therefore did not influence the signal generation at these rates. Since median latencies around \SI{2500}{\nano\second} were achieved, transmission rates up to
\begin{equation}
\frac{1}{\SI{2500}{\nano\second}}\approx\SI{400}{\kilo\hertz}
\end{equation}
should be possible. This estimate is probably pessimistic since not all of the measured latency is caused by the sending node.
Similar behavior was observed for other sample sizes and sample generation rates. \Autorefap{a:shmem_3d} shows the results the benchmark yielded when the sample generation rate and the message size were varied for the \textit{shmem} node-type. Regarding missed steps, this graph shows similarities to \autoref{fig:rate_size_3d_RC} in this chapter and to \autoref{fig:rate_size_3d_UC} and \autoref{fig:rate_size_3d_UD} in \autorefap{a:rate_size_3d_UC_UD}. Since the common denominator of these tests is the \textit{file} node-type, these results again indicate that the component that caused the most complications in the datapath of the VILLASnode node-type benchmark was the \textit{file} node-type.
Thus, since the \textit{file} node-type is currently the bottleneck in the benchmark from \autoref{sec:villas_benchmark}, this node-type should be optimized in order to bring down the number of steps the benchmark misses.
\paragraph{Increase in latency at low rates} Analogous to previous observations, the median latency of the \textit{InfiniBand} node-type increased for lower frequencies. Remarkably, however, the median latency of the \textit{shmem} node-type also increased---although only slightly---for lower frequencies. Even though this effect is not unambiguously visible in \autoref{fig:shmem_infiniband_comparison}, it is more evident in \autoref{fig:shmem_3d} in \autoref{a:results_benchmarks}.
In a previous subsection, the suspicion was raised that techniques such as \gls{aspm} caused this effect. However, since the same effect also occurred with node-types that are independent of the \gls{pcie} bus, the cause cannot solely lie within \gls{io} optimization techniques. Hence, the (scheduler of the) \gls{os} is probably also partially responsible for the increasing latency at lower rates.
\subsection{Comparison to other node-types}
The objective of the present work, as raised in \autoref{sec:hard_real_time_communication_between_servers}, was to implement hard real-time communication between different host systems that run VILLASnode. There, it was shown that none of the server-server node-types that were available at the time of writing were able to realize this (\autoref{tab:villasnode_nodes}).
This subsection examines whether the addition of the \textit{InfiniBand} node-type to the pool of available VILLASnode node-types adds value. It does so by comparing the results of two commonly used node-types for server-server communication---\textit{zeromq} and \textit{nanomsg}---with those of the \textit{InfiniBand} and \textit{shmem} node-types.
In the tests that were performed, the sample size was fixed to 8 values. The rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and every test was conducted until \SI{250000}{} messages were transmitted.
\paragraph{Loopback and physical link} First, the tests were performed in loopback mode, in which the source and target node of the \textit{zeromq} and \textit{nanomsg} node-type were both bound to \texttt{127.0.0.1}. However, to make a fair comparison to the \textit{InfiniBand} node-type tests, which were performed on an actual physical link, these tests had to be performed on a physical link as well.
To rule out that using different hardware with inferior or superior specifications would skew the results, the back-to-back connected InfiniBand \glspl{hca} were also used to perform the tests with the Ethernet-based node-types. This was done using the \acrfull{ipoib} driver (\autoref{sec:rdmacm}), which enables processes to send data over the InfiniBand network using the TCP/IP stack (\autoref{fig:openfabrics_stack}).
In order to compel processes to actually use the physical link, even though both network devices were part of the same system, Linux network namespaces were used. With namespaces\footnote{\url{http://man7.org/linux/man-pages/man7/namespaces.7.html}}, it is possible to wrap system resources in an abstraction so that they are only visible to processes within that namespace. In the case of the network namespace, processes in such a namespace use a copy of the network stack. It can be seen as a separate subsystem, with its own routes, firewall rules, and network device(s). The network namespaces were managed with \texttt{ip-netns}\footnote{\url{http://man7.org/linux/man-pages/man8/ip-netns.8.html}}.
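As a sketch of such a setup, the following \texttt{ip-netns} commands create a namespace and move one of the two IPoIB interfaces into it, so that traffic between the two addresses must traverse the physical link. The interface names (\texttt{ib0}, \texttt{ib1}), the namespace name, and the addresses are hypothetical and must be adapted; root privileges are required.

```shell
# Create a namespace and move the second IPoIB interface into it;
# the interface then disappears from the default network stack.
ip netns add villas-rx
ip link set ib1 netns villas-rx

# Configure one end of the back-to-back link in the default namespace ...
ip addr add 10.0.0.1/24 dev ib0
ip link set ib0 up

# ... and the other end inside the new namespace.
ip netns exec villas-rx ip addr add 10.0.0.2/24 dev ib1
ip netns exec villas-rx ip link set ib1 up

# Since the kernel now sees two independent network stacks, this ping
# is routed over the physical link instead of the loopback path.
ip netns exec villas-rx ping -c 1 10.0.0.1
```

A process that should use the isolated interface (e.g., one of the two VILLASnode instances) is then simply started with \texttt{ip netns exec villas-rx <command>}.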
\paragraph{Results} \Autoref{fig:nanomsg_zeromq_comparison} shows the results of these runs. For rates below \SI{25}{\kilo\hertz}, the latencies of the loopback tests were almost identical to those of the tests on the physical link. Above \SI{25}{\kilo\hertz}, the latencies of the latter started to increase. Although the \textit{zeromq} node in particular showed a dramatic latency increase, the performance of both node-types became unsuitable for real-time simulations.
The percentage of missed steps for \SI{100}{\hertz} and \SI{2500}{\hertz} was exactly the same for the \textit{nanomsg} and \textit{zeromq} node-types as for the \textit{InfiniBand} and \textit{shmem} node-types. This again indicates that this effect was caused by the \gls{tsc}. It is, however, unlikely that the relatively high median latencies around these rates were caused by the \gls{tsc}; after all, in all previously presented tests in which the \gls{tsc} was used at these rates, no such large difference was observed.
Although a considerable number of samples were never transmitted, especially for high rates, no samples were dropped after the first sequence number appeared in the out file. The percentages of missed steps of the \textit{nanomsg} and \textit{zeromq} node-type are displayed in \autorefap{a:missed_steps_nanomsg_zeromq}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_zeromq_nanomsg/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_zeromq_nanomsg/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yielded for the \textit{zeromq} and \textit{nanomsg} node-types. Both node-types were once tested in loopback mode and once over an actual physical link. Every sample contained 8 64-bit floating-point numbers and the sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}. For every rate, \SI{250000}{} samples were sent.}\label{fig:nanomsg_zeromq_comparison}
\end{figure}
\Autoref{fig:node_type_comparison} compares the results of the \textit{nanomsg} and \textit{zeromq} node-types on the physical link with the results of the \textit{InfiniBand} and \textit{shmem} node-types. It is apparent from this graph that the \textit{InfiniBand} node-type had a latency that was an order of magnitude smaller than that of the soft real-time node-types. Furthermore, the variability of the latency of the samples that were sent over InfiniBand was lower than the variability of the latency of the same samples over Ethernet. Finally, both the \textit{nanomsg} and \textit{zeromq} node unmistakably started to show performance losses when exceeding a sample generation rate of \SI{25}{\kilo\hertz}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_zeromq_nanomsg_shmem_IB/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_zeromq_nanomsg_shmem_IB/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yielded for the server-server node-types \textit{zeromq}, \textit{nanomsg}, and \textit{InfiniBand} and for the internal node-type \textit{shmem}. Every sample contained 8 64-bit floating-point numbers and the sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}. For every rate, \SI{250000}{} samples were sent.}\label{fig:node_type_comparison}
\end{figure}