\chapter{Implementation\label{chap:implementation}}
The first section of this chapter (\ref{sec:ca_benchmarks}) describes the implementation of the benchmark which was used to measure latencies between InfiniBand host channel adapters. Then, \autoref{sec:villas_implementation} describes how the \textit{InfiniBand} node-type for VILLASnode was implemented. Subsequently, \autoref{sec:villas_benchmark} describes the characteristics and implementation of the benchmark that was used to analyze VILLASnode node-types. Thereafter, \autoref{sec:uc_support} describes how \gls{uc} support was added to the \gls{rdma} \gls{cm} library. Finally, \autoref{sec:processing_data} briefly describes what tools and techniques were used to process and analyze the acquired data.
If not stated otherwise, all software that is discussed in this chapter is written in the C programming language~\cite{kernighan1978c}.
\section{Host channel adapter benchmark\label{sec:ca_benchmarks}}
The developed host channel adapter benchmark was inspired by the measurements which were done by MacArthur and Russel~\cite{macarthur2012performance}, which were already presented in \autoref{sec:related_work}. Although this work will likewise analyze the influence of variations in operation modes, settings, and message sizes on latencies, it will not focus on their influence on throughput.
The objective of this benchmark is to measure---as accurately as possible---how long data resides in the actual InfiniBand Architecture when it is sent from one host channel adapter to another host channel adapter. So, if latency is defined as
\begin{equation}
t_{lat} = t_{recv} - t_{subm},
\label{eq:latency}
\end{equation}
the time data actually spends in the \gls{iba} can be approximated by setting $t_{subm}$ to the moment at which the send \gls{wr} is submitted, and $t_{recv}$ to the moment at which the receiving node becomes aware of the \gls{cqe} in the completion queue that is bound to the receive queue.
\Autoref{sec:timestamps} first introduces how and where in the source code the timestamps $t_{subm}$ and $t_{recv}$ are measured. Then, \autoref{sec:tests} describes what tests the benchmark is capable of running.
\subsection{Definition of measurement points\label{sec:timestamps}}
Many benchmarks actually measure the round-trip latency and divide it by two in order to approximate the one-way time between two host channel adapters. This is necessary if the \glspl{hca} are not part of the same host system. The latency of messages between InfiniBand \glspl{hca} is usually under \SI{5}{\micro\second}; there are even reports of one-way times as small as \SI{300}{\nano\second}~\cite{macarthur2012performance}. Hence, if both \glspl{hca} are part of different systems, even small deviations between the endnodes' system clocks could cause significant skews in $t_{lat}$ and make the results useless. This problem is nonexistent if both timestamps $t_{subm}$ and $t_{recv}$ are generated by the same system clock.
A possible disadvantage of using the round-trip delay to approximate the one-way delay is the additional (software) overhead. Let us assume that a message is sent from node \textit{A} to node \textit{B} and back to node \textit{A}. Then, an additional time penalty can be introduced by the software on node \textit{B} that is necessary to submit a work request in order to return the received message.
Furthermore, it is possible that latency benchmarks---e.g., \texttt{ib\_send\_lat} and \linebreak\texttt{ib\_write\_lat} in the \gls{ofed} Performance Tests\footnote{\url{https://github.com/linux-rdma/perftest}}---yield distorted, possibly idealized, results. Although these are well suited for hardware and software tuning, the results can deviate from the actual latencies that can be seen when implementing an application with the \gls{ofed} verbs.
The present work therefore implements a custom benchmark that assumes two \glspl{hca} in the same host system. It thereby prevents the skew that is caused by deviations between different endnodes' system clocks, as well as the additional software overhead of round-trip measurements. Furthermore, it makes sure that the measured latencies correspond to the latencies that can be seen in actual applications.
\paragraph{Generation of timestamps} In this benchmark, \texttt{clock\_gettime()}~\cite{kerrisk2010linux} is used to generate timestamps. Its parameters are a variable of type \texttt{clockid\_t} and a reference to an instance of \texttt{struct timespec} (\autoref{lst:timespec}) to which the function will write the current time.
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=The composition of \texttt{struct timespec}.,
label=lst:timespec,
style=customc]{listings/timespec.c}
\vspace{-0.2cm}
\end{figure}
The former parameter, \texttt{clockid\_t}, is particularly interesting. Usually, this is set to \texttt{CLOCK\_REALTIME}, in which case \texttt{clock\_gettime()} returns the system's best guess of the current time. This clock can change during operation because it is adjusted by the \gls{ntp}. Therefore, this timestamp is not suitable for the calculation of time differences with nanosecond resolution. However, if \texttt{CLOCK\_MONOTONIC} is requested, \texttt{clock\_gettime()} returns a strictly monotonically increasing timestamp that starts at an unspecified point in the past. Since monotonicity is guaranteed between timestamps for this \texttt{clockid\_t}, it is best suited to calculate $t_{lat}$ from \autoref{eq:latency}.
\paragraph{Location of timestamps in code} This benchmark takes timestamps on three different locations in the code:
\begin{itemize}
\setlength\itemsep{0.2em}
\item $t_{subm}$ is acquired right before an already prepared work request is submitted to the send queue with \texttt{ibv\_post\_send()}. The timestamp will be the message's payload. For that reason, it is important that the address to which the scatter/gather element points remains valid until the message is actually sent and that the timestamp is not overwritten in a subsequent iteration. The pseudocode for this case is displayed in \autoref{lst:send_time}.
\item $t_{recv}$ is measured on the receiving node. It is acquired right after \texttt{ibv\_poll\_cq()} on the completion queue that is bound to the receive queue returns with a positive value. The pseudocode for this case is displayed in \autoref{lst:cq_time}.
The function that is displayed in \autoref{lst:cq_time} lies in the datapath, and the moment at which the timestamp and the identifier of the message are saved for later evaluation is time-critical. For one, this is optimized by using \SI{2}{\mebi\byte} hugepages instead of conventional \SI{4}{\kibi\byte} pages. For example, when 8000 messages are received, 8000 8-byte timestamps (\SI{64}{\kibi\byte}) and 8000 4-byte identifiers (\SI{32}{\kibi\byte}) must be saved. These \SI{96}{\kibi\byte} fit into a single hugepage, whereas they would require 24 conventional pages and thus up to 24 potential page faults. For the sake of readability of the code, the timestamps and message identifiers are spread across two hugepages.
Furthermore, it is made sure that the pages are immediately touched after initialization with \texttt{mmap()} to prevent page faults from happening in the datapath. After allocating the memory, the pages are locked with \texttt{mlockall()}. More information on memory optimization can be found in \autoref{sec:mem_optimization}.
\item $t_{comp}$ is measured in the same fashion as $t_{recv}$, but on the sending node. \texttt{ibv\_poll\_cq()} polls the completion queue that is bound to the send queue. It gives an indication of the time that passes before the sending node gets a confirmation that the message has been sent. Similar to \autoref{eq:latency}, the latency before a confirmation of transmission is available can be defined as:
\begin{equation}
t_{lat}^{comp} = t_{comp} - t_{subm}.
\label{eq:latency_completion}
\end{equation}
This timespan is relevant because buffers in the main memory cannot be reused as long as there is no confirmation that the \gls{hca} copied the data from the host's main memory to its internal buffers. (This is not the case for data that is sent inline, see \autoref{sec:postingWRs}.)
\end{itemize}
\begin{figure}[ht!]
\vspace{0.2cm}
\lstinputlisting[caption=Pseudocode which records the moment a messages is submitted to the \acrfull{sq}.,
label=lst:send_time,
style=customc]{listings/send_time.c}
\vspace{-0.2cm}
\end{figure}
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Pseudocode which records the moment a \acrfull{cqe} becomes available in the \acrfull{cq}.,
label=lst:cq_time,
style=customc]{listings/cq_time.c}
\vspace{-0.2cm}
\end{figure}
There is one special case which has not been discussed yet. $t_{subm}$ is set to the time right before the \gls{wr} is submitted to the send queue. Since a certain amount of time passes before the \gls{hca} copies the data (i.e., the timestamp) from the host's main memory to its internal buffers, it is possible to continue altering the value after the work request has been posted. This benchmark offers a function to measure $t_{send}$, which approximates the moment the \gls{hca} copies the data to its internal buffer. The delta
\begin{equation}
\Delta t_{inline} \approx \tilde{t}_{lat} - \tilde{t}_{lat}^{send},
\label{eq:delta_inline}
\end{equation}
approximates the amount of time which will be saved by sending the data inline. In \autoref{eq:delta_inline}, $\tilde{t}_{lat}^{send}$ is the median latency measured with \textit{send}-timestamps, and $\tilde{t}_{lat}$ is the median latency measured with \textit{submit}-timestamps. The pseudocode of \autoref{lst:send_time} must be replaced with the pseudocode of \autoref{lst:time_thread} to transmit $t_{send}$ instead of $t_{subm}$.
\begin{figure}[ht!]
\vspace{1cm}
\lstinputlisting[caption=Pseudocode which continues to update an instance of the \texttt{timespec} C structure in a separate thread\comma whilst a pointer to this instance has already been submitted to the \acrfull{sq}.,
label=lst:time_thread,
style=customc]{listings/time_thread.c}
\vspace{-0.2cm}
\end{figure}
\subsection{Supported tests\label{sec:tests}}
The list below provides an overview of the different settings that can be applied. Later, \autoref{sec:evaluation_ca} will present the results for different combinations of these settings.
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{The service type} (\autoref{tab:service_types}) can be varied between \gls{rc}, \gls{uc}, and \gls{ud}.
\item \textbf{The poll mode} (\autoref{fig:poll_event_comparison}) can be set to \textit{busy polling} or \textit{wait for event}. The poll mode can be set independently for $t_{recv}$ and $t_{comp}$.
\item \textbf{Inline mode} (\autoref{sec:postingWRs}) can be turned on for small messages.
\item \textbf{Unsignaled completion} can be enabled. When this switch is set, send \glspl{wr} will not generate \glspl{wqe} when the \gls{hca} has processed them.
\item \textbf{The operation} (\autoref{tab:transport_modes}) can be set to \textit{send with immediate} or \textit{\gls{rdma} write with immediate}. Both operations are only supported \textit{with immediate} in this benchmark since the \acrshort{imm} header is used to identify the order of the messages at the receive side.
\item \textbf{The burst size} represents the number of messages that will be sent during one test and is limited to the maximum size of a \gls{qp} in the \gls{hca}. The benchmark is built in a way that it will continuously send messages, until this value is reached. It can be varied between 1 and 8192.
\item \textbf{An intermediate pause} (in nanoseconds) can be set. The benchmark will sleep for this amount of time in between the \texttt{ibv\_post\_send()} calls.
\item \textbf{Either the send or submit time} can be measured. This switch determines whether $t_{subm}$ or $t_{send}$ is measured.
\item \textbf{The message size} $S_M$ can be set to
\begin{equation}
S_M = \SI[parse-numbers=false]{8\cdot2^i}{\byte},\ i \in [0,12],
\end{equation}
where \SI{8}{\byte} is the minimum size of a message with a timestamp. A maximum of \SI{32768}{\byte} (\SI{32}{\kibi\byte}) is chosen because messages in VILLASnode are unlikely to be bigger than \SI{32}{\kibi\byte}.
\end{itemize}
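The thirteen resulting message sizes can be enumerated with a one-line helper (the function name is illustrative):

```c
#include <stdint.h>

/* The benchmarked message sizes: S_M = 8 * 2^i bytes, i in [0, 12]. */
static uint32_t message_size(unsigned i)
{
    return UINT32_C(8) << i;
}
```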
Although the possibility to submit linked lists of scatter/gather elements and work requests to the send queue will be used in the VILLASframework \textit{InfiniBand} node-type, its influence on latency will not be examined in this benchmark. Linking scatter/gather elements can come in handy if data from different locations in memory must be sent. Submitting combined work requests can be convenient if a whole batch of \glspl{wr} has to be posted and it is not necessary that a \gls{wr} is posted immediately after its generation (e.g., creating a set of receive \glspl{wr} in a loop and posting the linked list right after the closing bracket of the loop). However, the lowest latency is achieved by passing only one memory location to the \gls{hca} and by sending a message immediately after generation of the timestamp.
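The chaining described above can be illustrated without the verbs headers; \texttt{fake\_wr} is a stand-in for \texttt{struct ibv\_send\_wr}, which links work requests in the same way through its \texttt{next} pointer before the head of the list is handed to \texttt{ibv\_post\_send()} in a single call:

```c
#include <stddef.h>

/* Illustrative stand-in for struct ibv_send_wr. */
struct fake_wr {
    unsigned long   wr_id;
    struct fake_wr *next;
};

/* Link an array of WRs into one batch and return its head; posting the
 * head then submits the entire list at once. */
static struct fake_wr *link_batch(struct fake_wr *wrs, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        wrs[i].next = &wrs[i + 1];
    if (n > 0)
        wrs[n - 1].next = NULL;
    return n > 0 ? wrs : NULL;
}

/* Count the WRs in a chain (what the provider driver effectively walks). */
static size_t batch_len(const struct fake_wr *head)
{
    size_t n = 0;
    for (; head; head = head->next)
        n++;
    return n;
}
```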
\section{VILLASframework InfiniBand node-type\label{sec:villas_implementation}}
\Autoref{chap:architecture} already introduced the architecture of node-types in VILLASframework and concepts to enable compatibility of \glspl{via}---and in particular the \gls{iba}---with VILLASframework. The key objective of the development of an \textit{InfiniBand} node-type was the implementation of all functions in \autoref{a:nodetype_functions} with as few alterations as possible to the pre-existing architecture. Other than the proposed changes from \autoref{sec:proposal}, the VILLASframework architecture was not modified with regards to the node-type interface and the memory management.
The implementation of the more apparent functions, e.g., \texttt{parse()}, \texttt{check()}, \texttt{reverse()}, \texttt{print()}, \texttt{destroy()}, and \texttt{stop()}, will not be discussed. This section mainly focuses on non-obvious functions, which are either InfiniBand specific (i.e., the start-function in \autoref{sec:villas_start}) or had to be optimized to make full use of the kernel bypass InfiniBand offers (i.e., the read- and write-functions in \autoref{sec:villas_read} and~\ref{sec:villas_write}, respectively). The complete source code of the \textit{InfiniBand} node-type can be found on VILLASnode's public Git repository.\footnote{\url{https://git.rwth-aachen.de/acs/public/villas/VILLASnode/}}
\subsection{Start-function\label{sec:villas_start}}
After a configuration file, which is set by a user, is interpreted by the parse-function and reviewed by the check-function, the super-node will invoke the start-function to initialize all necessary structures. It starts with the creation of a communication event channel with \texttt{rdma\_create\_event\_channel()} and the initialization of an \gls{rdma} communication identifier with \texttt{rdma\_create\_id()}. The latter is bound to both a local InfiniBand device that was defined in the configuration file and the event channel.
Before the node allocates the protection domain with \texttt{ibv\_alloc\_pd()}, the communication identifier tries to resolve the remote address with \texttt{rdma\_resolve\_addr()} (in case of an active node) or places itself into a listening state with \texttt{rdma\_listen()} (in case of a passive node). Whether the node becomes an active or passive node depends on the presence of a remote host address to connect to in the configuration file. Finally, the start-function creates a separate thread with \texttt{pthread\_create()}~\cite{kerrisk2010linux} to monitor all asynchronous events on the \texttt{rdma\_cm\_id}.
When everything is set up successfully, the start-function returns 0 to indicate success. The super-node then moves the node to the \textit{started} state (\autoref{fig:villasnode_states}).
\subsection{Communication management thread\label{sec:comm_management}}
The function that is executed by the thread spawned by the start-function busy-waits in a while loop until the node has been moved to the \textit{started} state. This avoids races and ensures that the state transitions from \autoref{fig:villasnode_states} are obeyed.
The remainder of this function consists of a while loop that monitors the communication identifier in a blocking manner with \texttt{rdma\_get\_cm\_event()} (\autoref{sec:rdmacm}). Within this loop, the different events are handled by a switch statement. The loop, the switch statement, and a short description of what happens for every case are displayed in \autoref{lst:cm_switch}. Before expanding on the different operations of every case, a note on the blocking characteristics of \texttt{rdma\_get\_cm\_event()} has to be made. This function enables the \gls{os} to suspend further execution of the thread for an indefinite amount of time, which usually results in difficulties when trying to cancel (or kill) the thread. However, \texttt{read()}, which lies at the heart of \texttt{rdma\_get\_cm\_event()}, is a required cancellation point. A thread, for which cancelability is enabled, only acts upon cancellation requests when it reaches a cancellation point~\cite{kerrisk2010linux}. Furthermore, as defined in IEEE Std 1003.1\texttrademark-2017~\cite{posix2018}: ``[when] a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon.'' Thus, even though the thread is suspended, it can be canceled with \texttt{pthread\_cancel()} if necessary.
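The cited cancellation behavior can be demonstrated with a stand-in thread that blocks in \texttt{read()} on a pipe that never delivers data, much like the management thread blocks inside \texttt{rdma\_get\_cm\_event()}; per the quoted POSIX requirement, \texttt{pthread\_cancel()} wakes it up (all names are illustrative):

```c
#include <pthread.h>
#include <unistd.h>

/* Stand-in for the communication management thread: it blocks in read()
 * - the call at the heart of rdma_get_cm_event() - on a descriptor on
 * which no event will ever arrive. */
static void *event_loop(void *arg)
{
    int fd = *(int *) arg;
    char buf;

    for (;;)
        (void) read(fd, &buf, 1); /* read() is a required cancellation point */

    return NULL;
}

/* Cancel the blocked thread and report whether it acted on the request. */
static int cancel_blocked_thread(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pthread_t tid;
    pthread_create(&tid, NULL, event_loop, &fds[0]);

    /* The thread is (or soon will be) suspended in read(); per POSIX it
     * shall be awakened and the cancellation request acted upon. */
    pthread_cancel(tid);

    void *res;
    pthread_join(tid, &res);

    close(fds[0]);
    close(fds[1]);

    return res == PTHREAD_CANCELED ? 0 : -1;
}
```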
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=The events that are monitored by the communication management thread. Although not explicitly stated in this listing\comma every case block ends with a \texttt{break}.,
label=lst:cm_switch,
style=customc]{listings/cm_switch.c}
\vspace{-0.2cm}
\end{figure}
\paragraph{Active node} As defined in the previous subsection, an active node is a node that tries to connect to another node. The first event that should appear after the start-function has been called is \texttt{RDMA\_CM\_EVENT\_ADDR\_RESOLVED}. This event denotes that the address has been resolved and that the \gls{qp} and two \glspl{cq}---one for the receive and one for the send queue---can be created. These instances are created using \texttt{rdma\_create\_qp()} and \texttt{ibv\_create\_cq()}, respectively. It is important for the functioning of the \textit{InfiniBand} node-type's write-function (\autoref{sec:villas_write}) that the \gls{qp}'s initialization attribute \texttt{sq\_sig\_all} is set to \zero.
After all necessary structures have been initialized, \texttt{rdma\_resolve\_route()} will be invoked. Then, when the route has successfully been resolved, the event channel will unblock again and return \texttt{RDMA\_CM\_EVENT\_ROUTE\_RESOLVED}. This means that everything is set up, and \texttt{rdma\_connect()} may be called to invoke a connection request. The state of the active node is then set to \textit{pending connect}.
When the remote node accepts the connection, \texttt{RDMA\_CM\_EVENT\_ESTABLISHED} occurs and the state of the node is set to \textit{connected}.
If the node operates with the \gls{ud} service type, the last mentioned event structure contains the \acrfull{ah}, which includes information to reach the remote node. This value is saved because in \gls{ud} mode it has to be defined in every work request (\autoref{sec:postingWRs}). Although the node is not really connected---after all, \gls{ud} is an unconnected service type---the node is transitioned to the \textit{connected} state. In the context of VILLASnode, this state implies that data can be sent, either because the \glspl{qp} are connected or because the remote \gls{ah} is known.
\paragraph{Passive node} As mentioned before, a passive node listens on the communication identifier and waits until another node reaches out to it. If another node calls \texttt{rdma\_connect()} on it, the channel will unblock and return the event \linebreak\texttt{RDMA\_CM\_EVENT\_CONNECT\_REQUEST}. Thereon, the node will build its \gls{qp}, its \glspl{cq}, and accept the connection with \texttt{rdma\_accept()}. If the service type of the node is a connected service type (i.e., \gls{uc} or \gls{rc}), the node will move to the \textit{pending connect} state. If the service type is unconnected (i.e., \gls{ud}), it will move directly to the \textit{connected} state.
In case of a connected service type, the \texttt{RDMA\_CM\_EVENT\_ESTABLISHED} event occurs when the connection has successfully been established. The state is then set to \textit{connected}.
\paragraph{Error events} Error events that occur because a remote node could not be reached are not necessarily fatal for the complete node. In this case, a fallback function is invoked which sets the node into listening mode instead of active mode. This behavior is configurable; if a user sets the appropriate flag in the configuration file, these errors can be made fatal.
\subsection{Read-function\label{sec:villas_read}}
This subsection focuses on the implementation of the read-function which was previously proposed in \autoref{sec:readwrite_interfaces}. Contrary to the functioning principle in \autoref{fig:villas_read}, which suggests that all samples that are passed to the read-function will definitely be submitted and must thus be held, there is a chance that some samples will not be submitted successfully. These samples must be released again.
\Autoref{fig:read_implementation} shows a decision graph for the algorithm that is implemented by the read-function. The example case, depicted by the red path, assumes that 5 empty samples are passed to the read-function, and that there are at least \textit{threshold} \glspl{wqe} in the \gls{rq}. This threshold, which is set in the configuration file, is necessary to ensure that a node can always receive samples because there are always at least \textit{threshold} pointers in the \gls{rq}. If this threshold has not yet been reached, all passed samples are submitted to the receive queue and \texttt{*release} is set to 0 (depicted by the black path). Then, the function returns with $ret=0$, without ever polling the completion queue.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics{images/read_implementation.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{-0.6cm}
\includegraphics{images/read_write_implementation_legend.pdf}
\vspace{-1.4cm}
\end{subfigure}
\caption{The decision graph for the read-function in the \textit{InfiniBand} node. Prior to invoking the read-function, \texttt{*release} is always set to \texttt{cnt} by the super-node.}\label{fig:read_implementation}
\end{figure}
If the threshold has been reached, the red path is followed. The completion queue is polled in a while loop until at least one, but not more than \texttt{cnt}, \glspl{cqe} are available. This blocks further execution of the read-function, which is the intended behavior. After all, when a certain number of \glspl{wqe} resides in the \gls{rq}, it is undesired to continue submitting new \glspl{wr}. At a certain moment, the queue would be full and it would no longer be possible to submit new addresses that the node got from the super-node. However, this is necessary to free places in \texttt{*smps[]}, which can only hold up to \texttt{cnt} values. So, if this blocking behavior were not in place, the super-node would keep passing new addresses until the receive queue overflowed and no addresses from \glspl{cqe} could be returned to the super-node anymore.
Because \texttt{ibv\_poll\_cq()} does not rely on any of the system calls that are listed in~\cite{posix2018}, the thread that contains this while loop would not notice if a cancellation request is sent. Therefore, \texttt{pthread\_testcancel()}~\cite{kerrisk2010linux} should regularly be called within this loop.
The addresses in \texttt{*smps[]} are not immediately swapped with the \textit{X} addresses that are returned with \texttt{ibv\_poll\_cq()}. First, after the poll-function has indicated that \textit{X} \glspl{cqe} with addresses are available, \textit{X} addresses from \texttt{*smps[]} are submitted to the \gls{rq}. This ensures that the \gls{rq} does not drain and it makes room for the addresses from the \glspl{cqe} in \texttt{*smps[]}. Finally, the polled addresses are swapped with the addresses that were posted to the receive queue, and the read-function returns with $ret = X$. Note that \texttt{*release} remains untouched: all values in \texttt{*smps[]} are either received or not used, and must thus be released.
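The final swap step can be sketched with plain pointer arrays (the names are illustrative): the \textit{X} addresses returned by the poll go back to the super-node in \texttt{*smps[]}, while the addresses that were just posted to the \gls{rq} stay with the node.

```c
#include <stddef.h>

/* After X addresses came back from the CQ (polled[]) and X fresh
 * addresses from smps[] were posted to the RQ, exchange them: received
 * samples are returned to the super-node via smps[], the posted ones
 * remain with the node. */
static void swap_polled(void *smps[], void *polled[], size_t x)
{
    for (size_t i = 0; i < x; i++) {
        void *tmp = smps[i];
        smps[i]   = polled[i];
        polled[i] = tmp;
    }
}
```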
\subsection{Write-function\label{sec:villas_write}}
The write-function, depicted in \autoref{fig:write_implementation}, is a bit more complex than the read-function. This time, the algorithm depicted in the decision graph includes four example cases.
Immediately after the write-function is invoked by the super-node, it tries to submit all \texttt{cnt} samples to the \gls{sq}. While going through \texttt{*smps[]}, the node dynamically checks whether the data can be sent inline (\autoref{sec:postingWRs}) and whether an \gls{ah} must be added. The node has to distinguish among four cases:
\begin{itemize}
\setlength\itemsep{0.2em}
\item the samples will be submitted normally and may thus not be released by the super-node until a \gls{cqe} with the address appears;
\item the samples will be submitted normally, but some samples will be immediately marked as \textit{bad} and must thus be released by the super-node;
\item the samples will be sent inline and, because the \gls{cpu} directly copies them to the \gls{hca}'s memory, must thus be released by the super-node;
\item an arbitrary combination of all abovementioned cases.
\end{itemize}
For samples that are sent normally, the \gls{wr}'s \texttt{send\_flags} (\autoref{lst:ibv_send_wr}) must be set to \texttt{IBV\_SEND\_SIGNALED}. These samples may only be released after the \gls{hca} has processed them, which does not necessarily happen in the same call of the write-function. The only way for the \gls{hca} to let the node know that it is done with a sample is through a completion queue entry. Since the \gls{qp} is created with \texttt{sq\_sig\_all=0}, the generation of \glspl{cqe} for samples must explicitly be requested.
When a sample is sent inline, \texttt{send\_flags} must only be set to \texttt{IBV\_SEND\_INLINE}. It is not desired to get a \gls{cqe} for an inline \gls{wr} since it can be---and thus will be---released immediately after being submitted to the \gls{sq}. After all, it is not possible to release a sample twice.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics{images/write_implementation.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{-0.6cm}
\includegraphics{images/read_write_implementation_legend.pdf}
\vspace{-1.4cm}
\end{subfigure}
\caption{The decision graph for the write-function in the \textit{InfiniBand} node. Prior to invoking the write-function, \texttt{*release} is always set to \texttt{cnt} by the super-node.}
\label{fig:write_implementation}
\end{figure}
There is one exception to this, however. Although no notifications will be generated if the \textit{signaled} flag is not set, the send queue starts to fill up nonetheless. Therefore, when many consecutive \glspl{wr} are submitted with the \textit{inline} flag set, occasionally a \gls{wr} with the \textit{signaled} flag must be submitted. For this reason, the write-function contains a counter which, when reaching a configurable threshold, changes an \texttt{IBV\_SEND\_INLINE} to an \texttt{IBV\_SEND\_SIGNALED}.
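A minimal sketch of such a counter, with illustrative stand-in flag values instead of the real \texttt{IBV\_SEND\_*} constants:

```c
/* Stand-in flag values; the real node would use IBV_SEND_INLINE and
 * IBV_SEND_SIGNALED from <infiniband/verbs.h>. */
enum { FLAG_INLINE = 1, FLAG_SIGNALED = 2 };

static unsigned unsignaled_cnt;

/* Decide the send flags for the next inline-eligible WR: normally the
 * inline flag, but every `threshold`-th WR is flagged signaled instead,
 * so the resulting CQE drains the otherwise filling send queue. */
static int send_flags_for_next_wr(unsigned threshold)
{
    if (++unsignaled_cnt >= threshold) {
        unsignaled_cnt = 0;
        return FLAG_SIGNALED;
    }
    return FLAG_INLINE;
}
```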
When all samples have been submitted to the \gls{sq}, the value \textit{ret}, which will be returned to the super-node when the write-function returns, is set to the total number of samples that were successfully posted to the send queue.
Now, because the node can only use \texttt{*release} to communicate how many samples to release, \texttt{*smps[]} must be reordered. All samples that must be released, i.e., samples that were not successfully submitted to the send queue or samples that were sent inline, must be placed at the top of the list.
In the next step, the write-function shall try to poll
\begin{equation}
C_{poll} = \texttt{cnt} - C_{release}
\end{equation}
\glspl{cqe}, which corresponds to the number of places in \texttt{*smps[]} that are still free. Here, \texttt{cnt} is the total number of samples in \texttt{*smps[]} and $C_{release}$ the number of samples that have already been marked to be released when the write-function returns. It is certain that all addresses that return from the \gls{cq} must be released, since samples that were sent inline will not generate a \gls{cqe}.
\subsection{Overview of the InfiniBand node-type\label{sec:overview}}
\Autoref{fig:villasnode_implementation} summarizes all components in the VILLASnode \textit{InfiniBand} node-type. Every component that is marked with an asterisk is listed in \autoref{tab:infiniband_node_components}. Here, the sections that describe the respective basics (\autoref{chap:basics}), architecture (\autoref{chap:architecture}), and implementation (\autoref{chap:implementation}) are summarized.
\input{tables/infiniband_node_components}
\begin{figure}[ht!]
\includegraphics{images/villasnode_implementation.pdf}
\caption{An overview of the VILLASnode \textit{InfiniBand} node-type and its components.}
\label{fig:villasnode_implementation}
\end{figure}
\newpage
\section{VILLASnode node-type benchmark\label{sec:villas_benchmark}}
The VILLASnode node-type benchmark is intended to compare different node-types with each other. The structure of the benchmark is depicted in \autoref{fig:villas_benchmark}. The node-type under test could be, for example, the \textit{InfiniBand} node-type. The benchmark is completely based on existing mechanisms within VILLASnode.
First, a \textit{signal} node generates samples which, as aforementioned, also include timestamps. These samples are then sent to a \textit{file} node, which in turn writes them to a \gls{csv} file, here called \textit{in}. Simultaneously, the samples are sent to a sending instance of the node-type that is being tested. Eventually, a receiving instance of that node-type adds a receive timestamp and sends the samples to a second \textit{file} node. This node writes the samples to a \gls{csv} file called \textit{out}.
\begin{figure}[ht!]
\includegraphics{images/villas_benchmark.pdf}
\vspace{-0.2cm}
\caption{The VILLASnode node-type benchmark is formed by connecting a \textit{signal} node, two \textit{file} nodes, and two instances of the node-type that shall be tested.}
\label{fig:villas_benchmark}
\end{figure}
Although the \textit{out} log file will contain both the generation timestamp and the receive timestamp, the \textit{in} log file is necessary to monitor and analyze lost samples. This benchmark is meant to analyze the latencies of the different node-types, but also to discover their limits. Because it is possible that the signal generation misses steps at high frequencies (more on that in the next subsection), a missing sample in the \textit{out} log file does not necessarily mean that something went wrong within the nodes that were tested. By comparing the \textit{in} and \textit{out} log files, the benchmark can decide which samples were missed by the \textit{signal} node, and which samples were missed by the node that was tested.
\subsection{Signal generation rate\label{sec:signal_generation}}
In order for the benchmark to create an environment similar to the real use cases of VILLASnode, the \textit{signal} node must be time-aware and insert samples at a given rate. This injection rate must be adjustable. Although this work only focused on rates between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}, lower and higher rates are theoretically possible.
\Autoref{lst:signal_generation} displays a simplified version of the \textit{signal} node-type's read-function. When a super-node that holds a \textit{signal} node tries to acquire samples from it, it calls its read-function. This function blocks further execution until a function \texttt{task\_wait()} returns. Assuming that the super-node would usually call the read-function at an infinitely high frequency, the wait-function ensures that it now only returns after a fixed amount of time.
The wait-function returns an integer \texttt{steps}, which indicates the number of steps between the timestamps of the samples. Let us assume that
\begin{equation}
t_{\texttt{task\_wait()}}^{i+1} > t_{sample}^{i} + \SI[parse-numbers = false]{\frac{1}{f_{signal}}}{\second},
\label{eq:timing_violation}
\end{equation}
when attempting to generate the sample with the timestamp $t^{i+1}$. Here, $t_{\texttt{task\_wait()}}^{i+1}$ is the moment \texttt{task\_wait()} is called, $t_{sample}^i$ the moment the last sample was generated, $i$ the iteration of \texttt{signal\_generator\_read()}, and $f_{signal}$ the frequency the \textit{signal} node is set to. When the condition in \autoref{eq:timing_violation} holds, \texttt{task\_wait()} cannot wait until $t_{sample}^{i+1}$ since that time has already passed. Hence, the function must wait until
\begin{equation}
t_{sample}^{i+2} = t_{sample}^{i} + 2\cdot\SI[parse-numbers = false]{\frac{1}{f_{signal}}}{\second},
\end{equation}
in order to stay synchronized with the set frequency. Now, instead of 1 step, 2 steps have passed since the last call of the wait-function; in other words, 1 step is missed.
After the missed steps have been counted and the timestamp has been calculated, the actual samples are generated. These will be returned to the super-node through the \texttt{*smps[]} parameter of the read-function. This behavior is similar to that of the read-function of the \textit{InfiniBand} node-type.
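The missed-step bookkeeping described above reduces to an integer division. The following minimal sketch (function and parameter names are hypothetical, not taken from the VILLASnode sources) illustrates how the number of elapsed steps follows from two timestamps and the configured frequency:

```c
#include <stdint.h>

/* Minimal sketch (hypothetical names): the number of whole signal periods
 * that have elapsed between the previous sample and the current call of
 * the wait-function. A result greater than 1 means that steps were missed. */
uint64_t elapsed_steps(uint64_t t_now_ns, uint64_t t_last_sample_ns,
		       uint64_t f_signal_hz)
{
	uint64_t period_ns = 1000000000ULL / f_signal_hz;

	return (t_now_ns - t_last_sample_ns) / period_ns;
}
```

For example, at $f_{signal}=\SI{100}{\kilo\hertz}$ the period is \SI{10000}{\nano\second}; if \SI{25000}{\nano\second} pass between two calls, 2 steps have elapsed and thus 1 step was missed.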
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Simplified version of the read-function of the \textit{signal} node-type.,
label=lst:signal_generation,
style=customc]{listings/signal_generation.c}
\vspace{-0.2cm}
\end{figure}
This subsection will expand on two different methods to implement \texttt{task\_wait()} and thus to monitor the rate with which samples are sent. Although the first method is the easier and preferred one, it does not work for high frequencies such as \SI{100}{\kilo\hertz}, at which the \textit{InfiniBand} node can operate. For these frequencies to work, the second method is introduced.
\paragraph{Timer expiration notifications via a file descriptor}
Linux provides an \gls{api} for timers. The function \texttt{timerfd\_create()} creates a new timer object and returns a file descriptor that refers to that timer. Once the timer's period is set with \texttt{timerfd\_settime()}, the file descriptor can be read with \texttt{read()}~\cite{kerrisk2010linux}.
\Autoref{lst:timerfd_wait} shows the implementation of \texttt{task\_wait()} with a Linux timer object. When \texttt{read()} is called on the timer's file descriptor (line 6, \autoref{lst:timerfd_wait}), it will write the number of elapsed periods since the last modification of the timer or since the last read to \texttt{steps}. If no complete period has gone by when \texttt{read()} is called, the function will block until this is the case.
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Implementation of \texttt{task\_wait()} by waiting on timer expiration notifications via a file descriptor.,
label=lst:timerfd_wait,
style=customc]{listings/timerfd_wait.c}
\vspace{-0.2cm}
\end{figure}
Although Linux's \gls{api} for timer notifications via a file descriptor offers a convenient way of keeping track of elapsed time periods, it is not suited for high-frequency signals. On the one hand, \texttt{read()} causes a system call, which is relatively expensive since it causes a switch between user and kernel mode. On the other hand, the operating system is inclined to suspend the process when the read-function blocks. Since it takes a certain amount of time to wake up the process when a period has elapsed, this can cause a timing violation for the next sample according to \autoref{eq:timing_violation}.
\paragraph{Busy polling the x86 Time Stamp Counter}
All x86 \glspl{cpu} since the Pentium era contain a 64-bit register called \gls{tsc}. Since the Pentium 4 era, this counter increments at a constant rate which depends on the maximum core-clock to bus-clock ratio or the maximum resolved frequency at which the processor is booted~\cite{guide2018intelc3b}. The nominal frequency can be calculated using:
\begin{equation}
f_{nominal}^{TSC} = \mathtt{\frac{CPUID.15H.ECX[31:0]\cdot CPUID.15H.EBX[31:0]}{CPUID.15H.EAX[31:0]}}.
\end{equation}
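The equation above can be sketched as a small helper function. The function name is hypothetical, and obtaining the actual register values with the \texttt{cpuid} instruction is omitted here:

```c
#include <stdint.h>

/* Sketch of the equation above (hypothetical helper): EAX holds the
 * denominator and EBX the numerator of the TSC to core-crystal-clock
 * ratio, and ECX holds the crystal frequency in Hz. On a real system,
 * these register values would be obtained by executing the cpuid
 * instruction with leaf 15H. */
uint64_t tsc_nominal_hz(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	if (eax == 0 || ecx == 0)
		return 0; /* leaf not (fully) enumerated on this CPU */

	return (uint64_t) ecx * ebx / eax;
}
```

With a \SI{24}{\mega\hertz} crystal and a ratio of $250/2$, for instance, this yields a nominal \gls{tsc} frequency of \SI{3}{\giga\hertz}.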
In his white paper~\cite{paoloni2010benchmark}, Paoloni describes how the \gls{tsc} can be used to measure elapsed time during code execution. In his work, the \gls{rdtsc} and \gls{rdtscp} instructions that are described in \cite{guide2018intelb2b} are used to read the \gls{tsc}. \Autoref{lst:tsc} shows the inline assembler that was used in VILLASnode to acquire the timestamp.
The functioning of both instructions is largely the same. After the \texttt{rdtsc}/\texttt{rdtscp} instruction is invoked, the 32 \gls{msb} of the timestamp are placed in \texttt{rdx} and the 32 \gls{lsb} in \texttt{rax}. To obtain a valid 64-bit variable, \texttt{rdx} is shifted left by \SI{32}{\bit} and subsequently combined with \texttt{rax} using a bitwise OR. The resulting value is set as output variable \texttt{tsc}, which is also returned by both functions in \autoref{lst:tsc}. During this operation, the high-order \SI{32}{\bit} of \texttt{rax}, \texttt{rdx}, and \texttt{rcx} are cleared. When hard-coded registers are clobbered as a result of the inline assembly code, this must be declared to the compiler up front (line 12, \autoref{lst:tsc}).
\begin{listing}[ht!]
\refstepcounter{lstlisting}
\noindent\begin{minipage}[b]{.46\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/rdtsc.h}
\captionof{sublisting}{\gls{rdtsc}.}\label{lst:tsc_a}
\end{minipage}%
\hfill
\begin{minipage}[b]{.46\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/rdtscp.h}
\captionof{sublisting}{\gls{rdtscp}.}\label{lst:tsc_b}
\end{minipage}
\addtocounter{lstlisting}{-1}
\captionof{lstlisting}{The \gls{rdtsc} instruction with fencing and the \gls{rdtscp} instruction, written in inline assembler. Both functions must be placed inline and thus be preceded by \texttt{\_\_attribute\_\_((unused,always\_inline))}.}
\label{lst:tsc}
\end{listing}
The main difference between \gls{rdtsc} and \gls{rdtscp} is that, unlike the former, the latter waits until all previous instructions have been executed and all previous loads are globally visible. One consequence of this, among others, was described by Paoloni~\cite{paoloni2010benchmark}. He demonstrated that \gls{rdtsc} showed a standard deviation of 6.9 cycles, whereas \gls{rdtscp} only showed a standard deviation of 2 cycles.
Since not all x86 processors support \gls{rdtscp}, VILLASnode also includes \gls{rdtsc}. However, to improve its behavior, the \gls{lfence} instruction~\cite{guide2018intelb2a} is executed prior to the actual read instruction. This type of fence serializes all load-from-memory instructions prior to its call. Furthermore, no instruction that is placed after the load fence executes until the fence has completed.
\Autoref{lst:rdtscp_wait} shows the implementation of \texttt{task\_wait()} based on the \gls{tsc}. During the $(i+1)^{th}$ call of \texttt{task\_wait()}, the counter is busy polled until the desired timestamp $t_{sample}^{i+1}$ is reached. Then, it updates the next timestamp $t_{sample}^{i+2}$ and simultaneously calculates whether $t_{sample}^{i+1}$ is actually only one step after $t_{sample}^{i}$ or if some steps were missed. The period can be calculated according to:
\begin{equation}
    T = \frac{f_{nominal}^{TSC}}{\mathtt{rate}}.
\end{equation}
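The bookkeeping of the busy-poll loop can be sketched as follows. This is a simplified illustration with hypothetical names, not the actual VILLASnode implementation: once the loop observes that the \gls{tsc} has passed the current deadline, the deadline is advanced by whole periods and the number of steps since the previous sample is reported.

```c
#include <stdint.h>

/* Simplified sketch (hypothetical names): tsc_now is the TSC value at
 * which the busy-poll loop left its polling, *next_deadline the TSC
 * value of the sample that was just due, and period the step length in
 * TSC ticks (T from the equation above). Returns the number of steps
 * since the previous sample; a value > 1 means steps were missed. */
uint64_t advance_deadline(uint64_t tsc_now, uint64_t *next_deadline,
			  uint64_t period)
{
	uint64_t steps = 1 + (tsc_now - *next_deadline) / period;

	*next_deadline += steps * period;
	return steps;
}
```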
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Implementation of \texttt{task\_wait()} by busy polling the x86 \acrfull{tsc}.,
label=lst:rdtscp_wait,
style=customc]{listings/rdtscp_wait.c}
\vspace{-0.2cm}
\end{figure}
The advantage of this implementation of \texttt{task\_wait()} is that given periods can be approximated very accurately ($\sigma=2$ clock cycles,~\cite{paoloni2010benchmark}). Complications are now more likely to arise when \texttt{signal\_generator\_read()} is not called frequently enough because the datapath is too long.
\subsection{Further optimizations of the benchmark's datapath\label{sec:optimizations_datapath}}
Before the \textit{signal} node from \autoref{fig:villas_benchmark} generates a sample, it checks whether steps were missed. Then, after it has generated a sample, the super-node has to write it to the file node and an instance of the node-type that is being tested. Only then, the \textit{signal} node can generate the next sample. Both the time that is spent on this check and the time that is spent in the file node are part of the datapath and affect the time it takes before \texttt{task\_wait()} is invoked again. Increasing $t_{\texttt{task\_wait()}}^{i+1}$ accordingly increases the chance of a timing violation according to \autoref{eq:timing_violation}. It is thus desirable that the time that is spent on the check and in the file node is minimized.
\paragraph{Suppressing information to the standard output} Originally, a file node always kept track of the total number of missed steps, and wrote a message to the standard output as soon as one or more steps were missed. Especially the latter is relatively expensive since \texttt{printf()}~\cite{kerrisk2010linux} invokes a system call. For high rates, it can cause a snowball effect: this situation only occurs when the generation rate is already too high and timing requirements are not met; now, additionally, the time that is spent in the datapath is increased even further by system calls that write to the standard output. Since the missed steps can also be derived from the \textit{in} and \textit{out} log files, the internal logging of missed steps was made configurable. Now, when minimal latency is required, as in the case of the VILLASnode node-type benchmark, it can be disabled with a flag in the configuration file.
\paragraph{Buffering the file stream} Usually, each call to the \textit{stdio} library---which is used by the file node-type to read from and write to files---results in a system call. Although it is not possible to get rid of these system calls completely---after all, they are necessary to write to the \textit{in} and \textit{out} log files---they should be reduced to an absolute minimum in the datapath. To achieve this, the file node-type was modified so that the buffering of the file stream can be configured. Now, a user can define the size of a buffer in the configuration file. Buffering is controlled with \texttt{setvbuf()} \cite{kerrisk2010linux}, which enables an instance of the file node-type to read or write data in units equal to the size of that buffer.
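The buffering approach can be illustrated with a minimal sketch; the file path, buffer size, and function name below are illustrative, not taken from the VILLASnode sources. A user-supplied buffer is attached to the stream with \texttt{setvbuf()}, so writes are collected in user space and only flushed to the kernel when the buffer fills, or when the stream is closed.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: write n comma-separated lines to path through a
 * fully buffered stream. Returns 0 on success and -1 on error. */
int write_buffered(const char *path, int n)
{
	size_t bufsz = 1 << 20; /* 1 MiB, as it might appear in a configuration file */
	FILE *f = fopen(path, "w");
	char *buf;

	if (!f)
		return -1;

	buf = malloc(bufsz);
	if (!buf || setvbuf(f, buf, _IOFBF, bufsz) != 0) {
		fclose(f);
		free(buf);
		return -1;
	}

	for (int i = 0; i < n; i++)
		fprintf(f, "%d,%d\n", i, 2 * i); /* buffered; no system call per line */

	fclose(f); /* flushes the remaining buffered data */
	free(buf); /* the buffer must outlive the stream */
	return 0;
}
```

Note that the buffer passed to \texttt{setvbuf()} must remain valid until the stream is closed, which is why it is freed only after \texttt{fclose()}.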
\section{Enabling UC support in the RDMA CM\label{sec:uc_support}}
The \gls{rdma} \gls{cm} does not officially support unreliable connections. However, by modifying small parts of the \texttt{librdmacm} library and by re-compiling it, it is possible to facilitate \gls{uc} anyway. This enables the present work to also analyze the unreliable connection with the custom and the VILLASnode node-type benchmark.
To enable support, the \texttt{rdma\_create\_id2()} function of the \texttt{librdmacm} has to be made non-static. As a result, this function can be accessed directly, whereas it is normally only accessible through the wrapper \texttt{rdma\_create\_id()}. The \gls{qp} type can now also be passed on to the \gls{rdma} \gls{cm} library; by passing \texttt{RDMA\_PS\_IPOIB} as \texttt{port\_space} and \texttt{IBV\_QPT\_UC} as \texttt{qp\_type}, a managed \gls{uc} \gls{qp} is created.
\section{Processing data\label{sec:processing_data}}
In order to analyze the generated comma-separated value dumps, several Python~3.7 scripts were developed in Jupyter Notebook.\footnote{\url{https://python.org}}\footnote{\url{https://jupyter.org}} Jupyter Notebook (formerly IPython Notebook) is part of Project Jupyter and allows a user to interactively explore Python scripts. On the one hand, it enables (stepwise) execution of Python code in a web browser, based on IPython~\cite{perez2007ipython}. On the other hand, rich text documentation, written in Markdown\footnote{\url{https://daringfireball.net/projects/markdown/}}, can directly be included in the document. The documentation, together with the source code, can be exported to several formats, e.g., to \texttt{.py}, \texttt{.tex}, \texttt{.html}, \texttt{.md}, and \texttt{.pdf}.
Jupyter Notebook's command line \gls{api} also makes it highly suitable for the automatic analysis of large datasets of timestamps. It is, for example, included in the \acrshort{cicd} pipeline of VILLASnode to automatically analyze the performance impact of certain changes in the source code and to compare node-types against each other. Furthermore, the scripts are included in the present work's build automation, which makes it possible to easily convert raw data from the benchmarks into convenient graphs.
Besides several standard libraries, NumPy\footnote{\url{http://numpy.org}}---which adds support for numerical calculations in Python---and matplotlib\footnote{\url{https://matplotlib.org}}---which adds a comprehensive toolset to create 2D plots---were used.
\subsection{Processing the host channel adapter benchmark's results\label{sec:processing_hca}}
\paragraph{Histograms} The first type of graph that is used in \autoref{chap:evaluation} and \autoref{a:results_benchmarks} is a histogram. The Python script that generates this graph first needs the path that contains the timestamps. This can be passed on through the command line or directly in the notebook. Then, the script loads the \acrshort{json} file that must be present in every data directory. It contains settings on how to process the data, but also information about the plots, e.g., dimensions of the figure and labels.
When all preparatory work is done, the Python script loads all timestamps as defined in \autoref{sec:timestamps}. To keep the minimum message size as low as possible, this benchmark only sends the 8-byte \texttt{long tv\_nsec} from \autoref{lst:timespec}. However, the complication with only sending this long integer is that it overflows from \SI{999999999}{\nano\second} to \SI{0}{\nano\second}. Since transmissions cannot take longer than \SI{1}{\second}---assuming no severe errors occur---this overflow is resolved by adding \SI{1}{\second} to $t_{recv}$ and $t_{comp}$ if they are smaller than $t_{subm}$/$t_{send}$.
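The overflow correction can be sketched as follows; the helper name is hypothetical, and the actual correction is performed by the Python analysis scripts:

```c
#include <stdint.h>

/* Sketch of the overflow correction (hypothetical helper): since only the
 * 8-byte tv_nsec field is transmitted, a receive timestamp that wrapped
 * past 999999999 ns appears smaller than the submission timestamp.
 * Assuming no transmission takes longer than one second, adding 1 s
 * restores the ordering. */
int64_t unwrap_ns(int64_t t_subm_ns, int64_t t_recv_ns)
{
	if (t_recv_ns < t_subm_ns)
		t_recv_ns += 1000000000LL; /* add 1 s */

	return t_recv_ns;
}
```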
Subsequently, all data is displayed in a histogram. To be able to see differences in the distribution of latencies at a glance and thus to make the comparison of the results easier, all histograms range from \SI{0}{\nano\second} to \SI{10000}{\nano\second}. A small box in the top left or top right corner then provides information on the percentage of values above this limit and about the maximum value. A red, vertical line indicates the median value of the data set.
This script is able to compare data sets from the same run---for example, $t_{lat}$ and $t_{lat}^{comp}$---or data sets from different runs---for example, $t_{lat}$ from various runs with distinct settings. In the present work, the former and the latter first occur in \autoref{fig:oneway_event} and \autoref{fig:oneway_inline}, respectively.
\paragraph{Median plot with variability indication} Histograms are great for getting a more comprehensive view of the distribution of latencies and the effect specific changes have on this distribution. However, this type of plot is not suitable for displaying many different setups in one comprehensible graph. Therefore, a simple line chart is used to display the median values of several data sets. In order to add information about dispersion of latency, error bars are added to every marker. In the present work's line charts, these indicate an \SI{80}{\percent} interval around the median value. Thus, for every marker, \SI{10}{\percent} of the values are bigger than the upper limit of the error bar and \SI{10}{\percent} of the values are smaller than the lower limit.
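The bounds of these error bars are percentiles of the latency data set. A minimal sketch (in C for consistency with the rest of this chapter; the actual analysis is done in Python, and the nearest-rank method with rounding is an assumption) could look as follows:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the error-bar computation: the 80 % interval is bounded by
 * the 10th and 90th percentiles (nearest-rank with rounding). */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *) a, y = *(const double *) b;

	return (x > y) - (x < y);
}

double percentile(const double *data, size_t n, double p)
{
	double *s = malloc(n * sizeof(*s));
	size_t idx;
	double v;

	memcpy(s, data, n * sizeof(*s));
	qsort(s, n, sizeof(*s), cmp_double);

	idx = (size_t) (p / 100.0 * (n - 1) + 0.5); /* rounded rank */
	v = s[idx];
	free(s);
	return v;
}
```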
In the present work, this type of graph first occurs in \autoref{fig:oneway_message_size}.
\subsection{Processing the VILLASnode node-type benchmark's results}
As discussed in \autoref{sec:villas_benchmark} and depicted in \autoref{fig:villas_benchmark}, the VILLASnode node-type benchmark results in two files with data: an \textit{in} and \textit{out} file. For every sample, the former includes a generation timestamp, a sequence number, and the actual values of the sample. Additionally, the latter includes a receive timestamp which is computed by the receiving instance of the node-type that is being benchmarked.
The VILLASnode node-type benchmark serves two purposes. On the one hand, there must be a graph that shows the performance of all node-types at a glance and makes comparison of node-types easy. For this purpose, the line chart from the previous subsection is well suited.
On the other hand, the benchmark should give a comprehensive insight in the latency distribution and the maxima of a certain node-type. For this purpose, the histogram from \autoref{sec:processing_hca} is better suited. However, as described in \autoref{sec:villas_benchmark}, these graphs should also provide information about the limitations of node-types. Not all node-types will be limited to the same maximum frequency. Therefore, the graph should provide additional information about the missing samples in the \textit{in} and \textit{out} file. By comparing these files, it can be determined if samples were not transmitted by the node-type that was tested.
\paragraph{3D surface plot} To be able to wrap all information up in one plot, a third type of graph is introduced: the 3D surface plot. With this type of graph, it is possible to vary both the message size and the sample generation rate, whilst still displaying all data in a comprehensible manner. In addition to the median latencies of the size/generation rate combinations, an indication of the percentage of missed steps is plotted. In that way, it is easy to identify which combinations were detrimental to the sample generation.
In the present work, this type of graph first occurs in \autoref{fig:rate_size_3d_RC}.