
\chapter{Future Work\label{chap:future}}
\section{Real-time optimizations\label{sec:future_real_time}}
In his master's thesis~\cite{vogel2016development}, Vogel wrote that ``careful optimizations and tuning [of the Linux \gls{os}] are indispensable. The most important change is a \texttt{PREEMPT\_RT}-patched kernel.''

Real-time Linux was first presented by Barabanov and Yodaiken~\cite{barabanov1996real}, and the \texttt{PREEMPT\_RT} patch is currently maintained by Ingo Molnar and Thomas Gleixner. Its main purpose is not to increase the throughput of a Linux system or to decrease its latency, but rather to make it more predictable. It does so by:
\begin{itemize}
	\setlength\itemsep{0.2em}
    \item making previously non-preemptible parts of the kernel preemptible;
    \item adding priority inheritance to the kernel;
    \item running interrupts as threads;
    \item replacing the kernel's timers, which yields high-resolution timers that are accessible from user space.
\end{itemize}
The internals of the \gls{rt} patch are described by Rostedt and Hart~\cite{rostedt2007internals} and can also be found on the Real-Time Linux Wiki.\footnote{\url{https://rt.wiki.kernel.org}}

The Mellanox-modified \gls{ofed} stack that was used together with the Mellanox \glspl{hca} (\autoref{tab:benchmark_testsystem}) did not support \texttt{PREEMPT\_RT}-patched Linux kernels. Therefore, none of the benchmarks that were evaluated in \autoref{chap:evaluation} could be run on a real-time operating system. Consequently, the predictability of the benchmarks was not always ideal. Examples are:
\begin{itemize}
	\setlength\itemsep{0.2em}
    \item \Autoref{fig:oneway_unsignaled_inline}: $\max t_{lat} = \SI{17.4}{\micro\second}$ and \SI{0.0125}{\percent} of $t_{lat} > \SI{10}{\micro\second}$ with a median latency of only \SI{786}{\nano\second};
    \item \Autoref{fig:timer_comparison_d}: $\max t_{lat}=\SI{262.0}{\micro\second}$ and \SI{0.02}{\percent} of $t_{lat} > \SI{50}{\micro\second}$ with a median latency of only \SI{2.1}{\micro\second}.
\end{itemize}
Although the median latencies may become slightly higher with a \texttt{PREEMPT\_RT}-patched kernel, these (sometimes excessive) latency spikes should diminish and the variability should decrease. Furthermore, the rise in latency that was observed at lower transmission rates should diminish with an \gls{rt}-patched kernel.
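
As a rough illustration of the size of these spikes, the ratio of maximum to median latency can be computed from the two examples listed above. The symbol $\tilde{t}_{lat}$ is introduced here only as shorthand for the median latency:
\[
	\frac{\max t_{lat}}{\tilde{t}_{lat}} = \frac{\SI{17.4}{\micro\second}}{\SI{0.786}{\micro\second}} \approx 22.1
	\quad\mbox{and}\quad
	\frac{\max t_{lat}}{\tilde{t}_{lat}} = \frac{\SI{262.0}{\micro\second}}{\SI{2.1}{\micro\second}} \approx 124.8 .
\]
In other words, the slowest observed sample took roughly 22 and 125 times as long as a typical one, respectively.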

It would certainly be interesting for future research to examine the behavior of InfiniBand hardware in real-time optimized operating systems. Although InfiniBand is already an attractive communication solution for real-time applications, this could make it even more attractive.

\section{Optimization \& profiling\label{sec:future_profiling}}
\paragraph{Benchmark optimizations}
During the course of the present work, it turned out that the bottleneck of the benchmark from \autoref{fig:villas_benchmark} is the \textit{file} node-type. Although several optimizations, e.g., suppressing as many system calls as possible, were applied to this node-type, it remained the bottleneck for high frequencies.

Reducing the effect the \textit{file} node-type has on the benchmark would yield less distorted, more realistic indications of the latencies that can be achieved. More importantly, it would make it possible to examine the limitations of low-latency node-types such as \textit{InfiniBand} and \textit{shmem}.

Furthermore, it would be beneficial to evaluate whether the \gls{tsc} can be optimized so that it also works at low rates.

\paragraph{InfiniBand node-type optimizations}
Currently, the read and write functions of the \textit{InfiniBand} node-type add a latency penalty of roughly \SI{0.9}{\micro\second} to the transmission latency of a message. Since this is the lion's share of the total latency, it would be interesting to analyze how much time is spent in the individual functions and where the hot spots are. Profiling tools like \textit{gprof}~\cite{susan1983gprof} can be used for this kind of analysis.

Moreover, all settings were optimized for maximum rates. However, the optimal settings for lower rates probably differ from the optimal settings for high rates.

\section{RDMA over Converged Ethernet support\label{sec:roce}}
In their publication~\cite{macarthur2012performance}, MacArthur and Russell observed that \gls{roce}, which allows \gls{rdma} over conventional Ethernet networks, was just slightly outperformed by InfiniBand for small messages. Although \gls{roce}'s performance would be marginally worse than InfiniBand's and although it does not have as much support for \gls{qos} as InfiniBand, it would be a great addition to VILLASnode for cases in which existing infrastructure must be used.

With the alterations that have been made to VILLASnode in order to support InfiniBand, support for \gls{roce} would not require too many changes to the existing source code.