\chapter{Basics\label{chap:basics}}
The first section of this chapter (\ref{sec:via}) introduces the Virtual Interface Architecture, of which the InfiniBand Architecture is a descendant. After this brief introduction on InfiniBand's origins, \autoref{sec:infiniband} is completely devoted to the InfiniBand Architecture itself. Subsequently, \autoref{sec:iblibs} introduces the software libraries that are used to operate the InfiniBand hardware in the present work's benchmarks and in the implementation of the VILLASnode \textit{InfiniBand} node-type. Finally, \autoref{sec:optimizations} discusses real-time optimizations in Linux, which is the operating system VILLASnode is most frequently operated on.
\section{The Virtual Interface Architecture\label{sec:via}}
InfiniBand is rooted in the \gls{via}~\cite{pfister2001introduction}, which was originally introduced by Compaq, Intel, and Microsoft~\cite{compaq1997microsoft}. Although InfiniBand does not completely adhere to the original \gls{via} specifications, it is important to understand the architecture's basics: with this background, some design decisions in the InfiniBand Architecture become more comprehensible. This section therefore elaborates on the characteristics of the \gls{via}\@.
The lion's share of the Internet protocol suite, also known as \acrshort{tcpip}, is implemented by the \gls{os}~\cite{kozierok2005tcp}. Even though the concept of the \acrshort{tcpip} stack allows the interface between a \gls{nic} and an \gls{os} to be relatively simple, a drawback is that the \gls{nic} is not directly accessible to consumer processes, but only through this stack. Since the \acrshort{tcpip} stack resides in the operating system's kernel, communication operations result in \textit{trap} machine instructions (or, on more recent x86 architectures, \textit{sysenter} instructions), which cause the \gls{cpu} to switch from user to kernel mode~\cite{kerrisk2010linux}. This back-and-forth between both modes is relatively expensive and thus adds a certain amount of latency to the communication operation that caused the switch. Furthermore, since the \acrshort{tcpip} stack also includes reliability protocols and the (de)multiplexing of the \gls{nic} to processes, the operating system has to take care of these rather expensive tasks as well~\cite{kozierok2005tcp}. \Autoref{sec:motivation} already described Larsen and Huggahalli's~\cite{larsen2009architectural} research on the proportions of the latency in the Internet protocol suite. This overhead resulted in the need for---and thus the development of---a new architecture which would provide each process with a directly accessible interface to the \gls{nic}\@: the Virtual Interface Architecture was born.
In their publication, Dunning et al.~\cite{dunning1998virtual} describe that the most important characteristics of the \gls{via} are:
\begin{itemize}
\setlength\itemsep{-0.2em}
\item data transfers are realized through zero-copy;
\item system calls are avoided whenever possible;
\item the \gls{nic} is not multiplexed between processes by a driver;
\item the number of instructions needed to initiate data transport is minimized;
\item no interrupts are required when initiating or completing data transport;
\item there is a simple set of instructions for sending and receiving data;
\item it can be both emulated in software and implemented natively in hardware.
\end{itemize}
Accordingly, several tasks which are handled in software in the Internet protocol suite---e.g., multiplexing the \gls{nic} to processes, data transfer scheduling, and preferably reliability of communication---must be handled by the \gls{nic} in the \gls{via}\@.
\subsection{Basic components}
A model of the \gls{via} is depicted in \autoref{fig:via_model}. At the top of the stack are the processes and applications that want to communicate over the network controller. Together with \gls{os} communication protocols and a special set of instructions which are called the \textit{\acrshort{vi} User Agent}, they form the \textit{\acrshort{vi} Consumer}. The VI consumer is colored light gray in \autoref{fig:via_model} and resides completely in the operating system's user space. The user agent provides the upper layer applications and communication protocols with an interface to the \textit{\acrshort{vi} Provider} and a direct interface to the \glspl{vi}.
\begin{figure}[ht]
\hspace{0.4cm}
\includegraphics{images/via_model.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{via} model.}\label{fig:via_model}
\end{figure}
The VI provider, colored dark gray in \autoref{fig:via_model}, is responsible for the instantiation of the virtual interfaces and completion queues, and consists of the \textit{kernel agent} and the \gls{nic}\@. In the \gls{via}, the \gls{nic} implements and manages the virtual interfaces and completion queues---which will both be further elaborated upon in \autoref{sec:data_transfer}---and is responsible for performing data transfers. The kernel agent is part of the operating system and is responsible for resource management, e.g., creation and destruction of \glspl{vi}, management of memory used by the \gls{nic}, and interrupt management. Although communication between consumer and kernel agent requires switches between user and kernel mode, this does not influence the latency of data transfers because no data is actually transferred via this interface.
\subsection{Data transfer\label{sec:data_transfer}}
One of the most distinctive elements of the \gls{via}, compared to the Internet protocol suite, is the \acrfull{vi}. Because of this direct interface to the \gls{nic}, each process assumes that it owns the interface and there is no need for system calls when performing data transfers. Each virtual interface consists of a send and a receive work queue which can hold \textit{descriptors}. These contain all information necessary to transfer data, for example, destination addresses, transfer mode to be used, and the location of data to be transferred in the main memory. Hence, both send and receive data transfers are initiated by writing a descriptor memory structure to a \gls{vi}, and subsequently notifying the VI provider about the submitted structure. This notification happens with the help of a \textit{doorbell}, which is directly implemented in the \gls{nic}\@. As soon as the \gls{nic}'s doorbell has been rung, it starts to asynchronously process the descriptors.
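To make the descriptor mechanism more tangible, the sketch below outlines how a consumer could post a send descriptor. All structure, field, and function names are hypothetical and serve purely as an illustration; the actual descriptor format and programming interface are defined in the \gls{via} specification~\cite{dunning1998virtual}.
\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical VIA-style send descriptor; the real binary layout is
 * defined by the VIA specification. */
struct via_descriptor {
    uint64_t local_addr;  /* location of the data in main memory      */
    uint32_t length;      /* number of bytes to transfer              */
    uint16_t mode;        /* transfer model, e.g., send or RDMA write */
    uint16_t status;      /* marked by the NIC upon completion        */
};

/* Post a descriptor: write it into the send work queue, then ring the
 * NIC's memory-mapped doorbell so that it asynchronously starts
 * processing the descriptor. */
static void post_send(struct via_descriptor *send_queue_slot,
                      volatile uint32_t *doorbell,
                      const struct via_descriptor *desc)
{
    *send_queue_slot = *desc;
    *doorbell = 1;
}
\end{lstlisting}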
When a transfer has been completed---successfully or with an error---the descriptors are marked by the \gls{nic}\@. Usually, it is the consumer's responsibility to remove completed descriptors from the work queues. Alternatively, on creation, a \gls{vi} can be bound to a \gls{cq}; notifications on completed transfers are then directed to this queue. A \gls{cq} has to be bound to at least one work queue, but completion notifications of several work queues can also be redirected to one single completion queue. Hence, in an environment with $N$ virtual interfaces, each with two work queues, there can be
\begin{equation}
0 \leq M \leq 2\cdot N
\end{equation}
completion queues.
The Virtual Interface Architecture supports two asynchronously operating data transfer models: the \textit{send and receive messaging} model and the \gls{rdma} model. The characteristics of both models are described below.
\paragraph{Send and receive messaging model (channel semantics)} This model is the concept behind various other popular data transfer architectures. First, a receiving node explicitly specifies where data which will be received shall be saved in its local memory. In the \gls{via}, this is done by submitting a descriptor to the receive work queue. Subsequently, a sending node specifies the address of the data to be sent to that receiving node in its own memory. This location is then submitted to its send work queue, analogous to the procedure for the receive work queue.
\paragraph{Remote Direct Memory Access model (memory semantics)} This approach is lesser-known. When using the \gls{rdma} model, one node, the active node, specifies both the local and the remote memory region. There are two possible operations in this model: \textit{\gls{rdma} write} and \textit{\gls{rdma} read}. In the former, the active node specifies a local memory region which contains data to be sent and a remote memory region to which the data shall be written. In the latter, the active node specifies a remote memory region which contains data it wants to acquire and a local memory region to which the data shall be written. To initiate an \gls{rdma} transfer, the active node has to specify the local and remote memory addresses and the operation mode in a descriptor and submit it to the send work queue. The operating system and software on the passive node are not aware of either \gls{rdma} operation. Hence, there is no need to submit descriptors to the receive work queue at the passive side.
\subsection{The virtual interface finite-state machine}
The original \gls{via} proposal defines four states in which a virtual interface can reside: \textit{idle}, \textit{pending connect}, \textit{connected}, and \textit{error}. Transitions between states are handled by the VI provider and are invoked by the VI consumer or events on the network. The four states and all possible state transitions are depicted in the finite-state machine in \autoref{fig:via_diagram}. A short clarification on every state is given in the list below:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{\textit{Idle}}: A \gls{vi} resides in this state after its creation and before it gets destroyed. Receive descriptors may be submitted but will not be processed. Send descriptors will immediately complete with an error.
\item \textbf{\textit{Pending connect}}: An active \gls{vi} can move to this state by invoking a connection request to a passive \gls{vi}\@. A passive \gls{vi} will transition to this state when it attempts to accept a connection. In both cases, it stays in this state until the connection is completely established. If the connection request times out or is rejected, or if one of the \glspl{vi} disconnects, the \gls{vi} will return to the \textit{idle} state. If a hardware or transport error occurs, a transition to the \textit{error} state will be made. Descriptors which are submitted to either work queue in this state are treated in the same fashion as they are in the \textit{idle} state.
\item \textbf{\textit{Connected}}: A \gls{vi} resides in this state if a connection request it has submitted has been accepted or after it has successfully accepted a connection request. The \gls{vi} will transition to the \textit{idle} state if it itself or the remote \gls{vi} disconnects. It will transition to the \textit{error} state on hardware or transport errors or, depending on the reliability level of the connection, on other connection-related errors. All descriptors which have been submitted in previous states and did not result in an immediate error, as well as all descriptors which are submitted in this state, are processed.
\item \textbf{\textit{Error}}: If the \gls{vi} transitions to this state, all descriptors present in both work queues are marked as erroneous. The VI consumer must handle the error, transition the \gls{vi} to the \textit{idle} state, and restart the connection if desired.
\end{itemize}
\begin{figure}[ht]
\hspace{0.5cm}
\includegraphics{images/via_states.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{via} state diagram.}\label{fig:via_diagram}
\end{figure}
\section{The InfiniBand Architecture\label{sec:infiniband}}
After a brief introduction on the Virtual Interface Architecture in \autoref{sec:via}, this section will further elaborate upon \gls{ib}. Because the \gls{via} is an abstract model, the purpose of the previous section was not to provide the reader with its exact specification, but rather to give a general idea of the \gls{via} design decisions. Since the exact implementation of various parts of the Virtual Interface Architecture is left open, the \gls{iba} does not completely correspond to the \gls{via}\@. Therefore, a more comprehensive breakdown of the \gls{iba} will be given in this section.
The \gls{ibta} was founded by more than 180 companies in August 1999 to create a new industry standard for inter-server communication. After 14 months of work, this resulted in a collection of manuals of which the first volume describes the architecture~\cite{infinibandvol1} and the second the physical implementation of InfiniBand~\cite{infinibandvol2}. In addition, Pfister~\cite{pfister2001introduction} wrote an excellent summary of the \gls{iba}.
\subsection{Basics of the InfiniBand Architecture\label{sec:iba}}
\paragraph{Network stack}
Like most modern network technologies, the \gls{iba} can be described as a network stack, which is depicted in \autoref{fig:iba_network_stack}. The stack consists of a physical, link, network, and transport layer.
\begin{figure}[ht!]
\includegraphics{images/network_stack.pdf}
\caption{The network stack of the \acrfull{iba}.}\label{fig:iba_network_stack}
\end{figure}
The \gls{iba} implementations of the different layers are displayed in the right column of \autoref{fig:iba_network_stack}. Although the present work attempts to separate the different layers into different subsections, some features cannot be explained without referring to features in other layers. Hence, the subsections do not directly correspond with the different layers.
First, this subsection gives some basic definitions for InfiniBand. It also includes some information about segmentation \& reassembly of messages (although that is part of the transport layer). The main component of the transport layer, the queue pair, is presented in \autoref{sec:qp}. That subsection also points out some similarities and differences between the \gls{via} and the \gls{iba}\@. Then, after the basics of the \gls{iba} subnet, the subnet manager, and managers in general are described in \autoref{sec:networking}, routing within and between subnets will be elaborated upon in \autoref{sec:addressing}. Subsequently, \autoref{sec:vlandsl} clarifies InfiniBand's virtual lanes and service levels. \Autoref{sec:congestioncontrol} and~\ref{sec:memory} go further into flow control and memory management in the \gls{iba}, respectively. Finally, \autoref{sec:communication_management} explains how communication is established, managed, and destroyed.
An overview of the implementation of the physical link will not be given in the present work. The technical details on this can be found in the second volume of the InfiniBand\texttrademark~Architecture Specification~\cite{infinibandvol2}. The implementation of consumer operations will be elaborated upon later, in \autoref{sec:iblibs}.
\paragraph{Message segmentation} Communication on InfiniBand networks is divided into messages between \SI{0}{\byte} and $\SI[parse-numbers=false]{2^{31}}{\byte}$ (\SI{2}{\gibi\byte}) for all service types, except for unreliable datagram. The latter supports---depending on the \gls{mtu}---messages between \SI{0}{\byte} and \SI{4096}{\byte}.
Messages that are bigger than the \gls{mtu}, which describes the maximum size of a packet, are segmented into smaller packets by \gls{ib} hardware. The \gls{mtu} can be---depending on the hardware that is used---256, 512, 1024, 2048, or \SI{4096}{\byte}. Since segmentation and reassembly of packets are handled by hardware, the \gls{mtu} should not affect performance~\cite{crupnicoff2005deploying}. \Autoref{fig:message_segmentation} depicts the principle of breaking a message down into packets. An exact breakdown of the composition of packets will be given in \autoref{sec:addressing}.
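As a short worked example, a message of \SI{1}{\mebi\byte} that is sent over a path with an \gls{mtu} of \SI{4096}{\byte} is segmented by the sending channel adapter into
\begin{equation}
\ceil[\big]{2^{20}/4096} = 256
\end{equation}
packets, which the receiving channel adapter reassembles into the original message.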
\begin{figure}[ht!]
\includegraphics{images/message_segmentation.pdf}
\vspace{-0.5cm}
\caption{The segmentation of a message into packets.}\label{fig:message_segmentation}
\end{figure}
\paragraph{Endnodes and channel adapters} Ultimately, all communication on an InfiniBand network happens between \textit{endnodes} (also referred to as nodes in the present work). Such an endnode could be a host computer, but also, for example, a storage system.
A \gls{ca} forms the interface between the software and hardware of an endnode and the physical link which connects the endnode to a network. A channel adapter can either be a \gls{hca} or a \gls{tca}. The former is most commonly used, and distinguishes itself from the latter by implementing so-called \textit{verbs}. Verbs form the interface between processes on a host computer and the InfiniBand fabric; they are the implementation of the user agent from \autoref{fig:via_model}.
\paragraph{Service types} InfiniBand supports several types of communication services which are introduced in \autoref{tab:service_types}. Every channel adapter must implement \gls{ud}, which is conceptually comparable to \gls{udp}\@. \glspl{hca} must implement \glspl{rc}; this is optional for \glspl{tca}. The reliable connection is similar to \gls{tcp}\@. Neither of the channel adapter types is required to implement \glspl{uc} and \gls{rd}.
\Autoref{tab:service_types} describes the service types at a very abstract level. More information on the implementation, for example, on the different headers which are used in \gls{iba} data packets, will be given later on. Furthermore, \autoref{tab:service_types} already contains references to the abbreviation \acrshort{qp}, which stands for queue pair and is InfiniBand's equivalent to a virtual interface (\autoref{sec:via}). This will be elaborated upon in the next subsection.
\input{tables/service_types}
\subsection{Queue pairs \& completion queues\label{sec:qp}}
As mentioned before, the InfiniBand Architecture is inspired by the Virtual Interface Architecture. \Autoref{fig:iba_model}, which is derived from \autoref{fig:via_model}, depicts an abstract model of the InfiniBand Architecture. In order to simplify this picture, the consumer and kernel agent are omitted. In the following, the functioning principle of this model will be explained.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/iba_model.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{iba} model.}\label{fig:iba_model}
\end{figure}
Virtual interfaces are called \glspl{qp} in the \gls{iba} and also consist of \glspl{sq} and \glspl{rq}. They are the highest level of abstraction and enable processes to directly communicate with the \gls{hca}\@. After everything has been initialized, a process will perform most operations on queue pairs while communicating over an InfiniBand network.
Similarly to a descriptor in the \gls{via}, a \gls{wr} has to be submitted to the send or receive queue in order to send or receive messages. Submitting a \gls{wr} results in a \gls{wqe} in the respective queue. Among other things, a \gls{wqe} holds the address of a location in the host's main memory. In case of a send \gls{wqe}, this memory location contains the data to be sent to a remote host. In case of a receive \gls{wqe}, the contained memory address points to the location in the main memory to which received data shall be written. Not every \gls{qp} can access all memory locations; this protection is handled by specific memory management mechanisms. These also handle which locations may be accessed by remote hosts and by the \gls{hca}\@. More information on memory management can be found in \autoref{sec:memory}.
A work queue element in the send queue also contains the network address of the remote endnode and the transfer model, e.g., the send messaging model or an \gls{rdma} model. Besides initiating data transmissions, a work request can also be used to bind a memory window to a memory region. This is further elaborated upon in \autoref{sec:memory}. A more comprehensive overview of the composition of \glspl{wr} in general will be provided in \autoref{sec:iblibs}.
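Although the verbs interface will only be introduced in \autoref{sec:iblibs}, a minimal sketch based on the \texttt{libibverbs} API may already illustrate how a send work request is composed and submitted. It assumes an already initialized queue pair \texttt{qp} and a registered memory region \texttt{mr} (\autoref{sec:memory}) that contains the message.
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <infiniband/verbs.h>

/* Minimal sketch: post one send work request to the send queue of qp.
 * The scatter/gather element points to the message inside the
 * registered memory region mr. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) mr->addr,  /* message location in memory */
        .length = len,                   /* message size in bytes      */
        .lkey   = mr->lkey,              /* local key of the region    */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,                 /* consumer-chosen identifier */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* channel semantics          */
        .send_flags = IBV_SEND_SIGNALED, /* request a completion entry */
    };
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &wr, &bad_wr);  /* returns 0 on success */
}
\end{lstlisting}
For an \gls{rdma} write (memory semantics), the opcode would be \texttt{IBV\_WR\_RDMA\_WRITE} and the remote address and key would additionally be set in \texttt{wr.wr.rdma}.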
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/qp_communication.pdf}
\vspace{-0.5cm}
\caption{Three \acrfullpl{sq} on a sending node communicate with three \acrfullpl{rq} on a receiving node. Every queue pair has both a send and a receive queue, but the unused queues have been omitted for the sake of clarity.}\label{fig:qp_communication}
\end{figure}
\paragraph{Example} \autoref{fig:qp_communication} shows an example with three queue pairs in one node---in this example called \textit{sending node}---that communicate with three queue pairs of another node---here, \textit{receiving node}. Note that a queue pair is always initialized with a send and a receive queue; for the sake of clarity, the unused queues have been omitted in this depiction. Hence, the image shows no receive queues for the sending node and no send queues for the receiving node.
First, before any message can be transmitted between the two nodes, the receiving node has to prepare receive \glspl{wqe} by submitting receive work requests to the receive queues. Every receive \gls{wr} includes a pointer to a local memory region, which provides the \gls{hca} with a memory location to save received messages to. In the picture, the consumer is submitting a \gls{wr} to the red receive queue.
Secondly, send work requests may be submitted, which will then be processed by the channel adapter. Although the processing order of the queues depends on the priority of the services (\autoref{sec:vlandsl}), on congestion control (\autoref{sec:congestioncontrol}), and on the manufacturer's implementation of the \gls{hca}, \glspl{wqe} in a single queue will always obey the \gls{fifo} principle. In this image, the consumer is submitting a send work request to the red send queue, and the \gls{hca} is processing a \gls{wqe} from the blue send queue.
After the \gls{hca} has processed a \gls{wqe}, it places a \gls{cqe} in the completion queue. This entry contains, among other things, information about the \gls{wqe} which was processed, but also about the status of the operation. The status could indicate a successful transmission, but also an error, e.g., if insufficient receive work queue elements were available in the receive queue. A \gls{cqe} is posted when a \gls{wqe} is completely processed, so the exact moment that it is posted depends on the service type that is used. E.g., if the service type is unreliable, the \gls{wqe} will complete as soon as the channel adapter has processed it and sent the data. However, if a reliable service type is used, the \gls{wqe} will not complete until the message has successfully been received by the remote host.
Obviously, after the message has been sent over the physical link, the receiving node's \gls{hca} will receive that same message. Then, it will acquire the destination \gls{qp} from the packets' base transport headers---more on that in \autoref{sec:addressing}---and grab the first available element from that \gls{qp}'s receive queue. In the case of this example, the channel adapter is consuming a \gls{wqe} from the blue receive queue. After retrieving a work queue element, the \gls{hca} will read the memory address from the \gls{wqe} and write the message to that memory location. When it is done doing so, it will post a completion queue entry to the completion queue. If the consumer of the sending node included immediate data in the message, that will be available in the \gls{cqe} at the receive side.
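The receive side of this example can be sketched analogously. The snippet below---again based on \texttt{libibverbs} and assuming initialized objects \texttt{qp}, \texttt{cq}, and \texttt{mr}---posts one receive work request and busy-polls the completion queue for the resulting \gls{cqe}.
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <infiniband/verbs.h>

/* Minimal sketch of the receive side: provide a buffer through a
 * receive work request, then busy-poll the completion queue. */
static int receive_one(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) mr->addr,   /* where the HCA writes data */
        .length = (uint32_t) mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr;
    struct ibv_wc wc;
    int n;

    if (ibv_post_recv(qp, &wr, &bad_wr))  /* enqueue the receive WQE */
        return -1;

    do {
        n = ibv_poll_cq(cq, 1, &wc);      /* non-blocking poll */
    } while (n == 0);

    /* wc.status indicates success or an error; immediate data sent by
     * the remote consumer would be available in wc.imm_data. */
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
\end{lstlisting}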
\paragraph{Processing WQEs}After a process has submitted a work request to one of the queues, the channel adapter starts processing the resulting \gls{wqe}. As can be seen in \autoref{fig:iba_model}, an internal \gls{dma} engine will access the memory location which is included in the work queue element, and will copy the data from the host's main memory to a local buffer of the \gls{hca}. Every port of an \gls{hca} has several of these buffers which are called \glspl{vl}. Subsequently, separately for every port, an arbiter decides from which virtual lane packets will be sent onto the physical link. How packets are distributed among the virtual lanes and how the arbiter decides from which virtual lane to send is explained in \autoref{sec:vlandsl}.
\paragraph{Queue pair state machine} Like the virtual interfaces in \autoref{sec:via}, queue pairs can reside in several states as depicted in \autoref{fig:qp_states}. All black lines are normal transitions and have to be explicitly initiated by a consumer with a \textit{modify queue pair verb}. Red lines are transitions to error states, which usually happen automatically. Because this diagram is more extensive than the state machine of the \gls{via} (\autoref{fig:via_diagram}), the descriptions of the state transitions are omitted in this figure. All states, their characteristics, and the way to enter the state are summarized in the list below. Every list item has a sublist which provides information on how work requests, received messages, and messages to be sent are handled.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/qp_states.pdf}
\vspace{-0.5cm}
\caption{The state diagram of a \acrfull{qp} in the \acrfull{iba}.}\label{fig:qp_states}
\end{figure}
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{\textit{Reset}}: When a \gls{qp} is created, it enters this state. Although this is not depicted, a transition from all other states to this state is possible.
\begin{itemize}
\setlength\itemsep{0.0em}
\item Submitting \textbf{work requests} will return an immediate error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be silently dropped.
\item No \textbf{messages are sent} from this \gls{qp}\@.
\end{itemize}
\item \textbf{\textit{Initialized}}: This state can be entered if the modify queue pair verb is called from the \textit{reset} state.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to the receive queue but they will not be processed in this state. Submitting a \gls{wr} to the send queue will return an immediate error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be silently dropped.
\item No \textbf{messages are sent} from this \gls{qp}\@.
\end{itemize}
\item \textbf{\textit{Ready to receive}}: This state can be entered if the modify queue pair verb is called from the \textit{initialized} state. The \gls{qp} can reside in this state if it only needs to receive, and thus not to send, messages.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to the receive queue and they will be processed. Submitting a \gls{wr} to the send queue will return an immediate error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item No \textbf{messages are sent} from this \gls{qp}. The queue will respond to received packets, e.g., with acknowledgments.
\end{itemize}
\item \textbf{\textit{Ready to send}}: This state can be entered if the modify queue pair verb is called from the \textit{ready to receive} or \textit{\gls{sq} drain} state. Mostly, \glspl{qp} reside in this state because the queue pair is able to receive and send messages and is thus fully operational.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to both queues; \glspl{wqe} in both queues will be processed.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item \textbf{Messages are sent} for every \gls{wr} that is submitted to the send queue.
\end{itemize}
\item \textbf{\textit{\gls{sq} drain}}: This state can be entered if the modify queue pair verb is called from the \textit{ready to send} state. This state drains the send queue, which means that all send \glspl{wqe} that are present in the queue when entering the state will be processed, but all \glspl{wqe} that are submitted afterwards will not be processed. The state has two internal states: \textit{draining} and \textit{drained}. While residing in the former, there are still work queue elements that are being processed. While residing in the latter, there are no more work queue elements that will be processed. When the \gls{qp} transitions from the \textit{draining} to the \textit{drained} internal state, it generates an affiliated asynchronous event.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to both queues. \glspl{wqe} in the receive queue will be processed. \glspl{wqe} in the send queue will only be processed if they were present when entering the \textit{\gls{sq} drain} state.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item \textbf{Messages are sent} only for \glspl{wr} that were submitted before the \gls{qp} entered this state.
\end{itemize}
\item \textbf{\textit{\gls{sq} error}}: When a completion error occurs while the \gls{qp} resides in the \textit{ready to send} state, a transition to this state happens automatically for all \gls{qp} types except the \gls{rc} \gls{qp}. Since an error in a \gls{wqe} can cause the local or remote buffers to become undefined, all \glspl{wqe} subsequent to the erroneous \gls{wqe} will be flushed from the queue. The consumer can put the \gls{qp} back into the \textit{ready to send} state by calling the modify queue pair verb.
\begin{itemize}
\item \textbf{Work requests} may be submitted to the receive queue and will be processed in this state. \glspl{wr} that are submitted to the send queue will be flushed with an error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item No \textbf{messages are sent} from this \gls{qp}\@. The queue will respond to received packets, e.g., acknowledgments.
\end{itemize}
\item \textbf{\textit{Error}}: Every state may transition to the \textit{error} state. This can happen automatically---when a send \gls{wr} in an \gls{rc} \gls{qp} completes with an error or when a receive \gls{wr} in any \gls{qp} completes with an error---or explicitly---when the consumer calls the modify queue pair verb. All outstanding and newly submitted \glspl{wr} will be flushed with an error.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} to both queues will be flushed immediately with an error.
\item \textbf{Packets that are received} by the \gls{hca} and targeted to this \gls{qp} will be silently dropped.
\item No \textbf{packets are sent}.
\end{itemize}
\end{itemize}
State transitions that are marked with black lines, which must be explicitly invoked by the consumer, will not succeed if the wrong arguments are passed to the modify queue pair verb. The first volume of the InfiniBand\texttrademark~Architecture Specification~\cite{infinibandvol1} provides a list of all state transitions and the required and optional attributes that can be passed on to the verb. The present work will not provide the complete list of all transitions with their attributes, and will in the following only provide some examples of important states.
Queue pairs are not immediately ready to establish a connection after they have been initialized to the \textit{reset} state. To perform the transition \textit{reset}$\,\to\,$\textit{initialized}, the partition key index and, in case of unconnected service types, the queue key have to be provided. Furthermore, \gls{rdma} and atomic operations have to be enabled or disabled in this transition. A second important transition is \textit{initialized}$\,\to\,$\textit{ready to receive} because here, in case of a connected service, the \gls{qp} will connect to another \gls{qp}\@. The consumer has to provide the modify \gls{qp} verb with, among other things, the remote node address vector and the destination \gls{qpn} before it can perform the transition. If the \gls{qp} must operate in loopback mode, this has to be defined here as well.
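As an illustration, the sketch below performs the transition \textit{reset}$\,\to\,$\textit{initialized} for a reliable connection \gls{qp} with the \texttt{libibverbs} modify queue pair verb. The bitmask passed to the verb announces which attributes are valid; a wrong or incomplete mask makes the transition fail, as described above.
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch of the reset -> initialized transition for an RC queue pair. */
static int qp_reset_to_init(struct ibv_qp *qp, uint8_t port)
{
    struct ibv_qp_attr attr = {
        .qp_state        = IBV_QPS_INIT,
        .pkey_index      = 0,    /* partition key index               */
        .port_num        = port, /* physical port of the HCA          */
        /* Enable RDMA writes on this QP; atomics stay disabled. */
        .qp_access_flags = IBV_ACCESS_REMOTE_WRITE,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                         IBV_QP_PORT  | IBV_QP_ACCESS_FLAGS);
}
\end{lstlisting}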
\subsection{The InfiniBand Architecture subnet\label{sec:networking}}
The smallest entity in the InfiniBand Architecture is a \textit{subnet}. It is defined as a network of at least two endnodes, connected by physical links and optionally connected by one or more switches. Every subnet is managed by a \gls{sm}.
One task of switches is to route packets from their source to their destination, based on the packet's \gls{lid} (\autoref{sec:addressing}). The local identifier is a 16-bit wide address of which 48K values can be used to address endnodes in the subnet and 16K addresses are reserved for multicast. Switches support multiple service levels on several virtual lanes, which will be elaborated upon in \autoref{sec:vlandsl}.
It is possible to route between different subnets with a 128-bit long \gls{gid} (\autoref{sec:addressing}).
\paragraph{Subnet manager} In order for endnodes on a subnet to communicate properly with each other and for the operation of the subnet to be guaranteed, at least one managing entity has to be present to coordinate the network. Such an entity is called \acrfull{sm} and can either be located on an endnode, a switch, or a router. Tasks of the \gls{sm} are:
\begin{itemize}
\setlength\itemsep{0.2em}
\item discovering the topology of the subnet (e.g., information about switches and nodes, including, for example, the \gls{mtu});
\item assigning \glspl{lid} to \glspl{ca};
\item establishing possible paths and loading switches' routing tables;
\item regularly scanning the network for (topology) changes.
\end{itemize}
\begin{figure}[ht]
\hspace{0.4cm}
\includegraphics{images/sm_states.pdf}
\vspace{-0.5cm}
\caption{The state machine for the initialization of a \acrfull{sm}. \textit{AttributeModifiers} from the \acrfull{mad} header (\autoref{fig:MAD}) are completely written in capital letters.}\label{fig:sm_states}
\end{figure}
A subnet can contain more than one manager but only one of them may be the \textit{master} \gls{sm}\@. All others must be in standby mode. \Autoref{fig:sm_states} depicts the state machine a subnet manager goes through to determine whether it should be master or not. An \gls{sm} starts in the \textit{discovering} state in which it scans the network. As soon as it discovers another \gls{sm} with a higher priority, it transitions into \textit{standby} mode in which it keeps polling the newly found manager. If the polled manager fails to respond (\textit{polling time-out}), the \gls{sm} goes back to the \textit{discovering} state. If the node completes the discovery without finding a master or a manager with a higher priority, it transitions into the \textit{master} state and starts to initialize the subnet. A master can put other \glspl{sm} which are currently in standby mode and have a lower priority into the \textit{non-active} mode by sending a \textit{DISABLE} datagram. If it detects an \gls{sm} in standby mode with a higher priority, it will hand over the mastership. To do so, it will send a \textit{HANDOVER} datagram, which will transition the newly found \gls{sm} into the \textit{master} state. If that \gls{sm} responds with an \textit{ACKNOWLEDGE} datagram, the old master will move to the \textit{standby} state.
\paragraph{Subnet management agents} Every endnode has to contain a passively acting \gls{sma}. Although agents can send a trap to the \gls{sm}---for example if the \gls{guid} changes at runtime---they usually only respond to messages from the manager. Messages from the \gls{sm} to an \gls{sma} can, for example, include the endnode's \gls{lid} or the location to send traps to.
\paragraph{Subnet administration} Besides \glspl{sm} and \glspl{sma}, the subnet also contains a \gls{sa}. The \gls{sa} is closely connected to the \gls{sm} and often even a part of it. Through \textit{subnet administration class management datagrams}, endnodes can request information to operate on the network from the administrator. This information can, for example, contain data on paths, but also non-algorithmic data such as \textit{service level to virtual lane mappings}.
\paragraph{Management datagrams} \glspl{mad} are used to communicate management instructions. They are always \SI{256}{\byte}---the exact size of the minimal \gls{mtu}---and are divided into several subclasses. There are two types of \glspl{mad}: one for general services and subnet administration, and one for subnet management. The subnet management \gls{mad} is used for communication between managers and agents, and is also referred to as \gls{smp}. The subnet administration \gls{mad} is used to receive from and send to the subnet administration, and falls under the category of \glspl{gmp}. Besides the \gls{sa}, general services like performance management, baseboard management, device management, SNMP tunneling, communication management (\autoref{sec:communication_management}), and some vendor and application specific protocols make use of \glspl{gmp}.
\begin{figure}[ht!]
\includegraphics{images/MAD.pdf}
\vspace{-0.5cm}
\caption{The composition of a \acrfull{mad}. The first \SI{24}{\byte} are reserved for the common \acrshort{mad} header. The header is followed by up to \SI{232}{\byte} of \acrshort{mad} class specific data.}\label{fig:MAD}
\end{figure}
\Autoref{fig:MAD} shows the management datagram base format. It is made up of a common header (between byte 0 and 23) which is used by all management packets; both \glspl{smp} and \glspl{gmp} use this header. The header is followed by a \SI{232}{\byte} data field which is different for every management datagram class.
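As a sketch, the common header can be expressed as a C structure. The field names below loosely follow \autoref{fig:MAD}; the authoritative layout is given in the first volume of the specification~\cite{infinibandvol1}. All multi-byte fields are transmitted in network byte order.
\begin{lstlisting}[language=C]
#include <stdint.h>

/* Sketch of the 24-byte common MAD header. */
struct mad_common_hdr {
    uint8_t  base_version;       /* MAD base format version          */
    uint8_t  mgmt_class;         /* e.g., subnet management          */
    uint8_t  class_version;      /* version of the management class  */
    uint8_t  method;             /* e.g., Get, Set, Trap             */
    uint16_t status;             /* status of the operation          */
    uint16_t class_specific;     /* meaning depends on the class     */
    uint64_t transaction_id;     /* matches responses to requests    */
    uint16_t attribute_id;       /* attribute that is accessed       */
    uint16_t reserved;
    uint32_t attribute_modifier; /* e.g., DISCOVER, DISABLE, ...     */
};
\end{lstlisting}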
\glspl{smp} have some particular characteristics. To ensure their transmission, $\mathrm{\acrshort{vl}}_{15}$ is exclusively reserved for \glspl{smp}. This lane is not subjected to flow control restriction (\autoref{sec:congestioncontrol}) and it is passed through the subnet ahead of all other virtual lanes. Furthermore, \glspl{smp} can make use of directed routing, which means that the ports of the switch it should exit can be defined instead of a local identifier. \glspl{smp} are always received on $\mathrm{\gls{qp}}_0$.
Usually, \glspl{gmp} may use any virtual lane but $\mathrm{\acrshort{vl}}_{15}$ and any queue pair, but this is different for \gls{sa} \glspl{mad}. Although they can use any virtual lane but $\mathrm{\acrshort{vl}}_{15}$, they have to be sent to $\mathrm{\gls{qp}}_1$.
\subsection{Data packet format \& addressing\label{sec:addressing}}
\Autoref{fig:iba_packet_format} shows the composition of a complete InfiniBand data packet. Blocks with a dashed border are optional---e.g., the \gls{grh} is not necessary if the packet does not leave the subnet from which it originated---and blocks with continuous borders are mandatory---e.g., the \glspl{crc} have to be computed for every packet.
In order to send data to non-\gls{iba} subnets, the architecture supports raw packets in which the InfiniBand specific transport headers and the invariant \gls{crc} are omitted. The present work will not go into detail on raw packets; more information on these packets can be found in the \gls{iba} specification~\cite{infinibandvol1}.
Important information about the different kinds of transport headers, immediate data, the payload, and the two kinds of \glspl{crc} can be found in \autoref{tab:packet_abbreviations}. Because of their importance, information on the local and global routing header will be given in a separate section below.
\begin{figure}[ht!]
\includegraphics{images/iba_packet_format.pdf}
\vspace{-0.5cm}
\caption{The composition of a complete packet in the \acrfull{iba}.}\label{fig:iba_packet_format}
\end{figure}
\input{tables/packet_abbreviations}
\paragraph{Local routing header} The \gls{lrh} contains all necessary information for a packet to be correctly passed on within a subnet. \Autoref{fig:LRH} depicts the composition of the \gls{lrh}\@.
The most crucial fields of this header are the 16-bit source and destination \textit{local identifier} fields. A channel adapter's port can be uniquely identified within a subnet by its \gls{lid}, which the subnet manager assigns to every port in the subnet. Besides an identifier, the subnet manager also provides \glspl{ca} with an \gls{lmc}. This value, which can range from 0 to 7, indicates how many low order bits of the \gls{lid} can be ignored by the \gls{ca} in order to determine if a received packet is targeted to that \gls{ca}. These bits are also called \textit{don't care bits}. Since switches do not ignore them, up to 128 different paths can lead to a single port in a subnet, which benefits path diversity. Consequently, with this mask, it is possible to reach one single port with up to 128 different unicast \glspl{lid}. As mentioned earlier, the 16-bit \gls{lid} can hold approximately 48K unicast entries and 16K multicast entries.
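The effect of the \gls{lmc} can be illustrated with a small sketch: a port accepts a received destination \gls{lid} if it matches the port's base \gls{lid} in all bits except the \gls{lmc} low-order ones. The function below is purely illustrative and not taken from the specification.
\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdint.h>

/* Sketch: does a received destination LID address this port? With a
 * LID mask control of lmc, the lmc low-order bits are "don't care"
 * bits, so up to 2^lmc unicast LIDs map onto the same port. */
static bool lid_matches_port(uint16_t dlid, uint16_t base_lid, uint8_t lmc)
{
    uint16_t mask = (uint16_t) ~((1u << lmc) - 1);  /* clear don't cares */

    return (dlid & mask) == (base_lid & mask);
}
\end{lstlisting}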
The 11-bit \textit{packet length} field indicates the length of the complete packet in 4-byte words. This not only includes the length of the payload, but also of all headers. The \textit{VL} and \textit{SL} fields indicate which virtual lane and service level are used, respectively. Later, in \autoref{sec:vlandsl}, virtual lanes, service levels, and their connection will be explained in more detail.
The 4-bit \textit{LVer} field indicates which link level protocol is used. \textit{LNH} stands for \textit{Link Next Header} and this 2-bit field indicates the header that follows the mandatory local routing header. The LNH's \gls{msb} indicates if the packet uses \gls{iba} transport or raw transport. The second bit indicates if an optional \gls{grh} is present.
\begin{figure}[ht!]
\includegraphics{images/LRH.pdf}
\vspace{-0.5cm}
\caption{The composition of the \acrfull{lrh}.}\label{fig:LRH}
\end{figure}
\paragraph{Global routing header} The \acrfull{grh} contains all necessary information for a packet to be correctly passed on by a router between subnets. \Autoref{fig:GRH} depicts the composition of the \gls{grh}.
\begin{figure}[ht!]
\includegraphics{images/GRH.pdf}
\vspace{-0.5cm}
\caption{The composition of the \acrfull{grh}.}\label{fig:GRH}
\end{figure}
The most crucial fields of this header are the 128-bit source and destination \textit{global identifier} fields. \Autoref{fig:GID} shows the possible compositions of a \gls{gid}\@. \Autoref{fig:GID_a} shows the composition of the unicast \gls{gid}; it consists of a \gls{gid} prefix---more on that later---and a \gls{guid}. The \gls{guid} is an IEEE \gls{eui64} and uniquely identifies each element in a subnet~\cite{eui64}. The \gls{guid} is always present in a unicast global identifier. The 24 \glspl{msb} of the \gls{guid} are reserved for the company identifier, which is assigned by the IEEE Registration Authority. The 40 \glspl{lsb} are assigned to a device by the said company, to uniquely identify it. The subnet manager may change the \gls{guid} if the scope is set to local; more on that below.
\begin{figure}[ht!]
\vspace{0.7cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/GID_unicast.pdf}
\vspace{-0.7cm}
\caption{The three possible compositions of a unicast \acrshort{gid}.}\label{fig:GID_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/GID_multicast.pdf}
\vspace{-0.7cm}
\caption{The composition of a multicast \acrshort{gid}.}\label{fig:GID_b}
\end{subfigure}
\caption{The possible structures of \acrfullpl{gid}.}\label{fig:GID}
\end{figure}
The composition of the 64-bit prefix depends on the scope in which packets will be sent. It comes down to three cases, which are listed below. The enumeration of the list below corresponds to the enumeration in \autoref{fig:GID_a}. Each port will have at least one unicast \gls{gid}, which is referred to as \textit{GID index 0}. This \gls{gid} can be created using the first or the second option from the list below. Both options are based on the default \gls{gid} prefix \texttt{0xFE80::0}. Packets that are constructed using the default \gls{gid} prefix and a valid \gls{guid} must always be accepted by an endnode, but must never be forwarded by a router. This means that packets addressed only by \textit{GID index 0} are always restricted to the local subnet.
\begin{enumerate}
\setlength\itemsep{0.2em}
\item \textbf{Link-local}: The global identifier only consists of the default \gls{gid} prefix \texttt{0xFE80::0} and the device's \gls{eui64} and is only unique within the local subnet. Routers will not forward packets with this global identifier. \texttt{0x3FA} in \autoref{fig:GID_a} is another representation of the default \gls{gid} prefix:
\begin{equation}
\texttt{0x3FA} = (\texttt{0xFE8} \gg 2).
\end{equation}
It is used to clarify the extra bit which has to be set in the second option of this list. The two \glspl{lsb} of \texttt{0xFE8} which are eliminated by the right shift are zero and are absorbed by the 54-bit \texttt{0x0::0} block.
\item \textbf{Site-local}: The global identifier consists of the default \gls{gid} prefix with the 54th bit of the \gls{gid} prefix set to \one. In the representation of \autoref{fig:GID_a}, this corresponds to:
\begin{equation}
\texttt{0x3FB} = (\texttt{0xFE8} \gg 2) + 1 = \texttt{0x3FA} + 1.
\end{equation}
The 16-bit \textit{subnet prefix} is set to a value chosen by the subnet manager. This \gls{gid} is unique in a collection of connected subnets, but not necessarily globally.
\item \textbf{Global}: This is the only \gls{gid} type which is forwarded by routers, since it is guaranteed to be globally unique.
\end{enumerate}
Multicast \glspl{gid}, as depicted in \autoref{fig:GID_b}, are fundamentally different from unicast \glspl{gid}. To indicate that it is a multicast packet, the 8 \glspl{msb} are all set to \one. The \gls{lsb} of the \textit{flags} field indicates whether it is a permanently assigned multicast \gls{gid} (\zero) or not (\one). The remaining three bits of the flags block are always \zero. The 4-bit \textit{scope} field indicates the scope of the packet. E.g., if scope equals \texttt{0x2}, a packet will be link-local and if scope equals \texttt{0xE}, a packet will be global. The complete multicast address scope is described in the \gls{iba} specification~\cite{infinibandvol1}. The 122 \glspl{lsb} are reserved for the actual multicast \gls{gid}\@.
Although the source and destination identifiers account for \SI{80}{\percent} of the global routing header (\autoref{fig:GRH}), there are some other fields. The 4-bit \textit{IPVer} field indicates the version of the header and the 8-bit \textit{TClass} field indicates the global service level, which will be elaborated upon in \autoref{sec:vlandsl}. The 20-bit \textit{flow label} field helps to identify groups of packets that must be delivered in order. The 8-bit \textit{NxtHdr} field identifies the header which follows the \gls{grh} in the \gls{iba} packet. This is, in case of a normal \gls{iba} packet, the \gls{iba} transport header. The only remaining block is the 8-bit \textit{HopLmt}, which limits the number of hops a packet can make between subnets, before being dropped.
\subsection{Virtual lanes \& service levels\label{sec:vlandsl}}
\acrfullpl{vl} are independent sets of receive and transmit packet buffers. A channel adapter can be seen as a collection of multiple logical fabrics---lanes---which share a port and physical link.
As introduced in \autoref{sec:iba} and in particular in \autoref{fig:message_segmentation}, after a \gls{wqe} appears in the send queue, the channel adapter segments the message (i.e., the data the \gls{wqe} points to) into smaller chunks of data and forms \gls{iba} packets, based on the information present in the \gls{wqe}. Subsequently, a \gls{dma} engine copies them to a virtual lane.
Every switch and channel adapter must implement $\mathrm{\gls{vl}}_{15}$ because it is used for subnet management packets (\autoref{sec:networking}). Furthermore, between 1 and 15 additional virtual lanes $\mathrm{\gls{vl}}_{0\ldots14}$ must be implemented for data transmission. The actual number of \glspl{vl} that is used by a port (1, 2, 4, 8, or 15) is determined by the subnet manager. Until the \gls{sm} has determined how many \glspl{vl} are supported on both ends of a connection and until it has programmed the port's \acrshort{sl} to \gls{vl} mapping table, the mandatory data lane $\mathrm{\gls{vl}}_{0}$ is used.
To understand \gls{qos} in InfiniBand, which signifies the ability of a network technology to prioritize selected traffic, it is essential to understand how packets are scheduled onto the \glspl{vl}. Crupnicoff, Das, and Zahvai~\cite{crupnicoff2005deploying} describe the functioning of \gls{qos} in the \gls{iba} in detail. This section will first explain how packets are scheduled onto the \glspl{vl}. Then, it will describe how the virtual lanes are arbitrated onto the physical link that is connected to the channel adapter's port.
\paragraph{Scheduling packets onto virtual lanes} The \gls{iba} defines 16 \glspl{sl}. The 4-bit field that represents the \gls{sl} is present in the local routing header (\autoref{fig:LRH}) and stays constant during the packet's path through the subnet. The \gls{sl} depends on the service type which is used (\autoref{tab:service_types}). The first volume of the \gls{iba} specification~\cite{infinibandvol1} describes how the level is acquired for the different types. Besides the \gls{sl} field, there is also the \gls{vl} field in the \gls{lrh}\@. This is set to the virtual lane the packet is sent from, and, as will be discussed below, may change during its path through the subnet.
Although the architecture does not specify a relationship between certain \glspl{sl} and forwarding behavior---this is left open as a fabric administration policy---there is a specification for \gls{sl} to \gls{vl} mapping in switches. If a packet arrives in a switch, the switch may, based on a programmable \textit{SLtoVLMappingTable}, change the lane the packet is on. This also changes the corresponding field in the \gls{lrh}\@. It may happen that a packet on a certain \gls{vl} passes a packet on another \gls{vl} while transitioning through a switch. Service level to virtual lane mapping in switches allows, among other things, interoperability between \glspl{ca} with different numbers of lanes.
There is a similar mechanism to service levels for global routing: the \textit{traffic class} (TClass) field in the \gls{grh} (\autoref{fig:GRH}). The present work will not further elaborate upon traffic classes.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/iba_arbiter.pdf}
\vspace{-0.5cm}
\caption{Functional principle of the arbiter.}\label{fig:iba_arbiter}
\end{figure}
\paragraph{Arbitrating the virtual lanes}
The arbitration of virtual lanes to an output port has yet to be discussed. \Autoref{fig:iba_arbiter} depicts the logic in the arbiters which were previously depicted in \autoref{fig:iba_model} as a black box. The arbitration is implemented as a \gls{dpwrr} scheme. It consists of a \textit{high priority-\acrshort{wrr}} table, a \textit{low priority-\acrshort{wrr}} table, and a \textit{limit high priority} counter. Both tables are lists with a field to indicate the index of a virtual lane and a weight with a value between 0 and 255. The counter keeps track of the number of high priority packets that were sent and whether that number exceeds a certain threshold.
If at least one entry is available in the high priority table and the counter has not exceeded its threshold, this table is active and a packet from this table will be sent. Which packet is sent depends on the weighted round robin scheme. E.g., assume two lanes, $\mathrm{\gls{vl}}_0$ and $\mathrm{\gls{vl}}_1$, are listed in a table with weights of 2 and 3, respectively. When the table is active, a packet from $\mathrm{\gls{vl}}_0$ will be sent in $\frac{2}{2+3}\cdot\SI{100}{\percent}=\SI{40}{\percent}$ of the cases and a packet from $\mathrm{\gls{vl}}_1$ in $\frac{3}{2+3}\cdot\SI{100}{\percent}=\SI{60}{\percent}$ of the cases.
If the counter reaches its threshold, a packet from a low priority lane will be sent and the counter is reset to 0. If the high priority table is empty, the low priority table will be checked immediately.
$\mathrm{\gls{vl}}_{15}$ is not subjected to these rules and always has the highest priority. A \gls{vl} may be listed in either one or in both tables at the same time. There may be more than one entry of the same \gls{vl} in one table.
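A highly simplified sketch of one such weighted round-robin table is given below; a real arbiter additionally maintains the limit high priority counter, skips empty lanes, and alternates between the two tables as described above.
\begin{lstlisting}[language=C]
#include <stdint.h>

struct wrr_entry {
    uint8_t vl;      /* index of the virtual lane                   */
    uint8_t weight;  /* 0..255, packets sent per visit of the entry */
};

/* Simplified sketch of one weighted round-robin table: return the
 * virtual lane the next packet is taken from. Assumes at least one
 * entry with a non-zero weight. */
static uint8_t wrr_next(const struct wrr_entry *tbl, int n,
                        int *cur, int *left)
{
    while (*left == 0) {           /* current entry is used up */
        *cur  = (*cur + 1) % n;
        *left = tbl[*cur].weight;  /* refill from the next entry */
    }
    (*left)--;
    return tbl[*cur].vl;
}
\end{lstlisting}
Applied to the example above, a table with weights 2 and 3 yields two packets from $\mathrm{\gls{vl}}_0$ and three from $\mathrm{\gls{vl}}_1$ in every five consecutive calls.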
The bottom of \autoref{fig:iba_arbiter} shows how packets are distributed among virtual lanes based on their \gls{sl}\@. This is similar to the mapping in switches, as described above. \Autoref{fig:iba_arbiter} does not depict a switch and assumes a direct connection between two channel adapters.
\subsection{Congestion control\label{sec:congestioncontrol}}
InfiniBand is a lossless fabric, which means that congestion control does not rely on dropping packets. Packets will only be dropped during severe errors, e.g., during hardware failures. InfiniBand supports several mechanisms to deal with congestion without dropping packets. In the following, two control mechanisms will be described.
\paragraph{Link-level flow control} The first mechanism, \gls{llfc}, prevents the loss of packets caused by a receive buffer overflow. This is done by synchronizing the state of the receive buffer between source and target node with \glspl{fcpacket}, of which the composition is depicted in \autoref{fig:flow_control_packet}. Flow control packets coexist with the data packets which were presented in \autoref{sec:addressing}.
Flow control packets for a certain virtual lane shall be sent during the initialization of the physical link and prior to the passing of 65,536 \textit{symbol times} since the last time such a packet was sent for that \gls{vl}\@. A symbol time is defined as the time it takes to transmit an \SI{8}{\bit} data quantity onto a physical lane. If the physical link is in initialization state (referred to as \textit{LinkInitialize} in the IBA specification~\cite{infinibandvol1}), \textit{Op} shall be \one in the flow control packet. If the packet is sent when the link is up and not in failure (\textit{LinkArm} or \textit{LinkActive}), \textit{Op} shall be \zero.
\begin{figure}[ht!]
\includegraphics{images/flow_control_packet.pdf}
\vspace{-0.5cm}
\caption{The structure of a \acrfull{fcpacket}.}\label{fig:flow_control_packet}
\end{figure}
The flow for a complete synchronization---from a source node with a sending queue, to a target node with a receiving queue, back to the sending queue---is described in the list below and depicted in \autoref{fig:flow_control_diagram}. Flow control packets are sent on a per-virtual-lane basis; the 4-bit \textit{VL} field is used to indicate the index of $\mathrm{\gls{vl}}_i$. $\mathrm{\gls{vl}}_{15}$ is excluded from flow control.
\begin{figure}[ht!]
\includegraphics{images/flow_control_diagram.pdf}
\vspace{-0.5cm}
\caption{Working principle of \acrfull{llfc} in the \acrfull{iba}.}\label{fig:flow_control_diagram}
\end{figure}
\begin{enumerate}
\setlength\itemsep{0.2em}
\item \textbf{Set FCTBS \& send FC packet}: Upon transmission of an \gls{fcpacket}, the 12-bit \gls{fctbs} field of the \gls{fcpacket} is set to the total number of blocks transmitted since the \gls{vl} was initialized. The \textit{block size} of a packet $i$ is defined as
\begin{equation}
B_{packet,i} = \ceil[\big]{S_i/64},
\end{equation}
with $S_i$ the size of a packet, including all headers, in bytes. Hence, the total number of blocks transmitted at a certain time is defined as:
\begin{equation}
\mathrm{\gls{fctbs}} = B_{total} = \sum_{i} B_{packet,i}.
\end{equation}
\item \textbf{Set and update ABR}: Upon receipt of an \gls{fcpacket}, a 12-bit \gls{abr} field is set to:
\begin{equation}
\mathrm{\gls{abr}} = \mathrm{\gls{fctbs}}.
\end{equation}
Every time a data packet is received and not discarded due to lack of receive capacity, the value is updated according to:
\begin{equation}
\mathrm{\gls{abr}} = (\mathrm{\gls{abr}} + B_{packet}) \bmod 4096,
\end{equation}
with $B_{packet}$ the block size of the received data packet.
\item \textbf{Set FCCL \& send FC packet}: Upon transmission of an \gls{fcpacket}, the 12-bit \gls{fccl} has to be generated. If the receive buffer could permit the receipt of 2048 or more blocks of every possible combination of data packets in the current state, the credit limit is set to:
\begin{equation}
\mathrm{\gls{fccl}} = (\mathrm{\gls{abr}} + 2048) \bmod 4096.
\end{equation}
Otherwise, it is set to:
\begin{equation}
\mathrm{\gls{fccl}} = (\mathrm{\gls{abr}} + N_{B}) \bmod 4096,
\end{equation}
with $N_B$ the number of blocks the buffer could receive in the current state.
\item \textbf{Use FCCL for data packet transmission}: After a valid \gls{fccl} is received, it can be used to decide whether a data packet can be received by a remote node and thus whether it should be sent. To make this decision, a variable $C$ is defined:
\begin{equation}
C = (B_{total} + B_{packet}) \bmod 4096,
\end{equation}
with $B_{total}$ the total blocks sent since initialization and $B_{packet}$ the block size of the packet which will potentially be transmitted. If the condition
\begin{equation}
(\mathrm{\gls{fccl}} - C) \bmod 4096 \leq 2048
\end{equation}
holds, the data packet may be sent (a C sketch of this credit arithmetic follows this list).
\end{enumerate}
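To make the modular credit arithmetic of steps 1 to 4 more concrete, \autoref{lst:llfc_sketch} shows a minimal C sketch of the sender-side credit check. It is purely illustrative: the structure and function names are assumptions and not part of any real implementation, and the 12-bit counters are modeled as integers that wrap modulo 4096.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A minimal, illustrative sketch of the \acrshort{llfc} credit check.}, label=lst:llfc_sketch, style=customc]
#include <stdint.h>
#include <stdbool.h>

#define FC_MOD 4096u /* 12-bit counters wrap modulo 4096 */

/* Illustrative per-VL state; names are not taken from a real implementation. */
struct vl_fc_state {
    uint32_t fctbs; /* blocks sent since VL initialization (mod 4096) */
    uint32_t fccl;  /* last credit limit received from the remote end */
};

/* Blocks occupied by a packet of s bytes, headers included (step 1). */
static uint32_t blocks(uint32_t s)
{
    return (s + 63) / 64;
}

/* Step 4: may a packet of 'size' bytes be transmitted? */
static bool may_send(const struct vl_fc_state *st, uint32_t size)
{
    uint32_t c = (st->fctbs + blocks(size)) % FC_MOD;

    /* Add FC_MOD before subtracting to keep the difference positive. */
    return ((st->fccl + FC_MOD - c) % FC_MOD) <= 2048;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}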
\paragraph{Feedback based control architecture}
\Autoref{fig:congestion_control} illustrates how the \gls{cca} works. Similar to link-level flow control, the \gls{cca} only controls data \glspl{vl}; $\mathrm{\gls{vl}}_{15}$ is excluded and thus \glspl{smp} will never be restricted.
\begin{figure}[ht!]
\includegraphics{images/congestion_control.pdf}
\vspace{-0.5cm}
\caption{Working principle of the \acrfull{cca}. The \acrfull{cct}, \acrfull{tmr}, and threshold value are initialized by the \acrfull{ccm}.}\label{fig:congestion_control}
\end{figure}
The control consists of five steps; the enumeration of the list below corresponds to the numbers in \autoref{fig:congestion_control}.
\begin{enumerate}
\setlength\itemsep{0.2em}
\item \textbf{Detection}: The first step is the actual detection of congestion. This is done by monitoring a virtual lane of a given port and reviewing whether its throughput exceeds a certain threshold. This threshold is set by the \gls{ccm} and must always be between 0 and 15, where a value of 0 will turn off congestion control completely and a value of 15 corresponds to a very low threshold and thus aggressive congestion control on that virtual lane.
If the threshold is reached, the \gls{fecn} flag in the base transport header is set before the packet is forwarded to its destination.
\item \textbf{Response}: When an endnode receives a packet where the \gls{fecn} flag in the \acrshort{bth} is set, it sends a \gls{becn} back to the node the packet came from. In the case of connected communication (e.g., reliable connection, unreliable connection), the response might be carried in an ACK packet. If communication is unconnected (e.g., unreliable datagram), an additional \textit{congestion notification packet} has to be sent.
\item \textbf{Determine injection rate reduction}: When a node receives a packet with the \gls{becn} flag set, an index (illustrated as $i$ in \autoref{fig:congestion_control}) will be increased by a preset value. This index is used to read from the \gls{cct}. This table is set by the \gls{ccm} during initialization and contains inter-packet delay values. The higher the index $i$, the higher the delay value it points to.
\item \textbf{Set injection rate reduction}: The value from the \gls{cct} will be used to reduce the injection rate of packets onto the physical link. The reduction can either be applied to the \gls{qp} that caused the packet which got an \gls{fecn} flag, or to all \glspl{qp} that use a particular service level (and thus virtual lane).
\item \textbf{Injection rate recovery}: After a certain time, which is set by the \gls{ccm} as well, the index $i$, and thus also the inter-packet delay, is reduced again. If no more \gls{becn} flags are received, $i$ and the delay will go to zero. If they do not go to zero, the channel adapter will probably settle into an equilibrium at a certain point. In this equilibrium, the \gls{hca} will send packets with an inter-packet delay which is just above or just under the threshold that causes new \gls{fecn} flags to be generated. (A condensed C sketch of steps 3 to 5 follows this list.)
\end{enumerate}
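\Autoref{lst:cca_sketch} condenses steps 3 to 5 into an illustrative C sketch. All names, the table size, and the scalar state are assumptions: the actual mechanism is implemented in hardware and parametrized by the \gls{ccm}.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={An illustrative sketch of the injection rate control in the \acrshort{cca}.}, label=lst:cca_sketch, style=customc]
#include <stdint.h>

#define CCT_SIZE 128 /* illustrative table size; configured by the CCM */

/* Illustrative congestion control state; initialized by the CCM. */
struct cc_state {
    uint16_t cct[CCT_SIZE]; /* inter-packet delay values */
    unsigned idx;           /* current index into the CCT */
    unsigned idx_increase;  /* increase applied per received BECN */
};

/* Steps 3 and 4: a BECN arrived; select a larger inter-packet delay. */
static uint16_t on_becn(struct cc_state *s)
{
    s->idx += s->idx_increase;
    if (s->idx >= CCT_SIZE)
        s->idx = CCT_SIZE - 1;

    return s->cct[s->idx];
}

/* Step 5: the recovery timer expired; relax towards full injection rate. */
static uint16_t on_timer(struct cc_state *s)
{
    if (s->idx > 0)
        s->idx--;

    return s->cct[s->idx];
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}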
\subsection{Memory management\label{sec:memory}}
An \gls{hca}'s access to a host's main memory is managed and protected with three primary objects: \glspl{mr}, \glspl{mw}, and \glspl{pd}. The relationship between queue pairs and these objects is depicted in \autoref{fig:memory_iba}.
\begin{figure}[ht!]
\includegraphics{images/memory_iba.pdf}
\vspace{-0.5cm}
\caption{The relationship between \acrfullpl{qp}, \acrfullpl{mw}, \acrfullpl{mr}, and the host's main memory.}\label{fig:memory_iba}
\end{figure}
\paragraph{Memory regions} A memory region is a registered set of memory locations. A process can register a memory region with a verb, which provides the \gls{hca} with the virtual-to-physical mapping of that region. Furthermore, it returns a \gls{lkey} and \gls{rkey} to the calling process. Every time a work request which has to access a virtual address within a local memory region is submitted to a queue, the local key has to be provided within the work request. The region in the main memory is pinned on registration, which means that the operating system is prohibited from swapping that region out (\autoref{sec:mem_optimization}).
When a work request tries to access a remote memory region on a target node, e.g., with an \textit{\gls{rdma} read} or \textit{write} operation, the remote key of the memory region on the target host has to be provided. Hence, before an \gls{rdma} operation can be performed, the source node has to acquire the \gls{rkey} of the remote memory region it wants to access. This can, for example, be done with a regular \textit{send} operation which only requires local keys.
\paragraph{Protection domains} Protection domains associate memory regions and queue pairs and are specific to each \gls{hca}\@. During creation of memory regions and queue pairs, both have to be associated with exactly one \gls{pd}\@. Multiple memory regions and queue pairs may be part of one protection domain.
A \gls{qp}, which is associated with a certain \gls{pd}, cannot access a memory region in another \gls{pd}\@. E.g., a \gls{qp} in protection\undershort{}domain\undershort{}X in \autoref{fig:memory_iba} can access memory\undershort{}region\undershort{}A and memory\undershort{}region\undershort{}B, but not memory\undershort{}region\undershort{}C.
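In the user verbs, these concepts map onto a few straightforward calls. \autoref{lst:reg_mr_sketch} is a minimal sketch, assuming an already opened device context \texttt{ctx} and a previously allocated buffer \texttt{buf} of \texttt{size} bytes (both hypothetical names): it allocates a protection domain and registers a memory region in it, after which \texttt{mr->lkey} and \texttt{mr->rkey} hold the local and remote key, respectively.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A minimal sketch of allocating a \acrshort{pd} and registering an \acrshort{mr}.}, label=lst:reg_mr_sketch, style=customc]
#include <stdio.h>
#include <infiniband/verbs.h>

/* Assumptions: ctx is an opened device context and buf points to an
 * allocated buffer of size bytes. */
int register_buffer(struct ibv_context *ctx, void *buf, size_t size)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd)
        return -1;

    /* Pins buf and provides the HCA with the virtual-to-physical
     * mapping; the flags additionally allow remote RDMA access. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
            IBV_ACCESS_LOCAL_WRITE |
            IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    return 0;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}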
\paragraph{Memory windows} If a reliable connection, unreliable connection, or a reliable datagram is used, memory windows can be used for memory management. First, memory windows are allocated, and then they are bound to a memory region. Although allocation and deallocation of a memory window requires a system call---and is thus time-consuming and not suitable for use in a datapath---binding a memory window to (a subset of) a memory region is done through a work request submitted to a send queue. A memory window can be bound to a memory region if both are situated in the same protection domain, if local write access for the memory region is enabled, and if the region was enabled for windowing at initialization.
The \gls{rkey} that the \gls{mw} returns on allocation is just a dummy key. Every time the window is (re)bound to (a subset of) a memory region, the \gls{rkey} is regenerated. Memory windows can be quite handy for dynamically managing remote memory access. A memory window with remote rights can be bound to a memory region without remote rights, and enable remote access this way. Furthermore, remote access can be granted and revoked dynamically without using system calls.
There are two types of memory windows: Type 1 and Type 2. Whereas the former are addressed only through virtual addresses, the latter can be addressed through either virtual addresses or zero-based virtual addresses. More information on the types is given in the first volume of the \gls{iba} specification~\cite{infinibandvol1}.
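\Autoref{lst:bind_mw_sketch} sketches how a Type 1 window could be allocated and bound with the user verbs; the objects \texttt{pd}, \texttt{qp}, \texttt{mr}, \texttt{buf}, and \texttt{size} are assumed to exist, and the exact structure layout may differ between verbs versions.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of binding a Type 1 \acrshort{mw} to an \acrshort{mr}.}, label=lst:bind_mw_sketch, style=customc]
#include <stdint.h>
#include <infiniband/verbs.h>

/* Assumptions: pd, qp, mr, and buf of size bytes exist already. */
int bind_window(struct ibv_pd *pd, struct ibv_qp *qp,
        struct ibv_mr *mr, void *buf, size_t size)
{
    /* Allocation requires a system call; do this outside the datapath. */
    struct ibv_mw *mw = ibv_alloc_mw(pd, IBV_MW_TYPE_1);
    if (!mw)
        return -1;

    struct ibv_mw_bind bind = {
        .wr_id      = 1,
        .send_flags = IBV_SEND_SIGNALED,
        .bind_info  = {
            .mr              = mr,
            .addr            = (uintptr_t) buf,
            .length          = size,
            .mw_access_flags = IBV_ACCESS_REMOTE_WRITE,
        },
    };

    /* The bind is posted to the send queue as a work request; the
     * regenerated remote key is afterwards available in mw->rkey. */
    return ibv_bind_mw(qp, mw, &bind);
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}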
\paragraph{Examples} The list below provides some examples regarding memory regions, protection domains, and memory windows. The enumerations in the list correspond with the numbers in \autoref{fig:memory_iba}.
\begin{enumerate}
\setlength\itemsep{0.2em}
\item A send work request with a pointer to \texttt{0x0C} was submitted. Since memory\undershort{}region\undershort{}A is bound to the address range this address lies in, the \gls{wr} has to include memory\undershort{}region\undershort{}A's local key. This is necessary so that the \gls{hca} will be able to access the data when it starts processing the \gls{wr}\@. A \gls{wr} submitted to $\mathrm{\gls{qp}}_1$ can only access memory\undershort{}region\undershort{}A and memory\undershort{}region\undershort{}B---and thus only memory with addresses between \texttt{0x0A} and \texttt{0x11} in the current configuration---since these regions share the protection domain with $\mathrm{\gls{qp}}_1$.
Note that, although a memory window is bound to memory\undershort{}region\undershort{}A, $\mathrm{\gls{qp}}_1$ can access the region directly by providing the local key.
\item This case is similar to case 1, but for $\mathrm{\gls{qp}}_2$. Like $\mathrm{\gls{qp}}_1$, $\mathrm{\gls{qp}}_2$ can access all memory regions in the same protection domain as long as the work request that tries to access the memory region contains the right local key.
\item This case is similar to case 1 and 2, but for memory\undershort{}region\undershort{}C since $\mathrm{\gls{qp}}_3$ resides in protection\undershort{}domain\undershort{}Y. It is thus only possible to access memory locations in the main memory in the address range from \texttt{0x12} to \texttt{0x15} with the current configuration. To access other addresses, memory\undershort{}region\undershort{}C would have to be rebound.
\item This case illustrates the reception of an \textit{\gls{rdma} write}. \textbf{Important note}: If a remote host writes into the local memory with an \textit{\gls{rdma} write}, this will not actually consume a receive \gls{wr}\@. It is processed entirely by the \gls{hca}; the \glspl{qp} and \glspl{cq}, and thus the \gls{os} and processes, do not even notice it. Displaying (4) like this was done for the sake of simplicity and clarity.
If a remote host wants to access \texttt{0x0A} or \texttt{0x0B} it can use the remote key of memory\undershort{}window\undershort{}1 to access it. Note that remote access does not necessarily have to be turned on for memory\undershort{}region\undershort{}A; only local write access is necessary.
\end{enumerate}
\subsection{Communication management\label{sec:communication_management}}
The \gls{cm} provides protocols to establish, maintain, and release channels. It is used for all service types which were introduced in \autoref{sec:iba}. In the following, a brief introduction on the establishment and termination of communication will be given. As aforementioned, the present work will ignore special cases for the reliable datagram service type, since it is not supported by the \acrshort{ofed} stack.
Since the communication manager is a general service, it makes use of \glspl{gmp} for communication (see ``Management datagrams'' in \autoref{sec:networking} and the composition of \glspl{mad} in \autoref{fig:MAD}). The \gls{cm} defines a set of messages; the message type is encoded in the \textit{AttributeID} field of the common \gls{mad} header. A short summary of communication management related messages which are mandatory for \gls{iba} hosts that support \gls{rc}, \gls{uc}, and \gls{rd} can be found in \autoref{tab:required_cm_messages}. Conditionally required messages for \gls{iba} hosts that support \gls{ud} can be found in \autoref{tab:conditionally_required_cm_messages}. Every message type needs different additional information which is set in the \gls{mad} data field. The exact content of this data for all message types can be found in the \gls{iba} specification~\cite{infinibandvol1}.
As mentioned in \autoref{sec:qp}, the queue pair gets all necessary information in order to reach a remote node as arguments while transitioning \textit{initialized}$\,\to\,$\textit{ready to receive}.
\input{tables/required_cm_messages}
\input{tables/conditionally_required_cm_messages}
\paragraph{Communication establishment sequences} There are various sequences of messages to establish or terminate a connection. \Autoref{fig:communication_manager} introduces three commonly used sequences. In all cases, the communication is established between an active client (\textit{A}) and a passive server (\textit{B}). It is also possible to establish communication between two active clients. If two active clients send a \acrshort{req}, they will compare their \gls{guid} (or, if both clients share a \gls{guid}, their \gls{qpn}), and the client with the smaller \gls{guid} (or \gls{qpn}) will get assigned the passive role. A client can make its reply to a communication request conditional, e.g., rejecting the connection if it gets assigned the passive role.
\begin{figure}[ht!]
\vspace{-0.3cm}
\begin{subfigure}{0.31\textwidth}
\includegraphics[width=\linewidth, page=1]{images/communication_manager.pdf}
\caption{Communication establishment sequence for \gls{rc}, \gls{uc}, and \gls{rd}.}\label{fig:communication_manager_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.31\textwidth}
\includegraphics[width=\linewidth, page=2]{images/communication_manager.pdf}
\caption{Communication release sequence for \gls{rc}, \gls{uc}, and \gls{rd}.}\label{fig:communication_manager_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.31\textwidth}
\includegraphics[width=\linewidth, page=3]{images/communication_manager.pdf}
\caption{Service ID Request for \gls{ud}.\newline}\label{fig:communication_manager_c}
\end{subfigure}
\caption{Several Communication Management sequences. All depicted sequences take place between an active and a passive \acrshort{iba} host.}\label{fig:communication_manager}
\end{figure}
\paragraph{Communication establishment} \Autoref{fig:communication_manager_a} depicts the communication establishment sequence for connected service types and for reliable datagram. First, the active host \textit{A} sends a \gls{req}. If \textit{B} wants to accept the communication it replies with \gls{rep}. If it does not want to accept the communication request, it replies with \gls{rej}. If it is not able to reply within the time-out that is specified in the received \gls{req}, it answers with \gls{mra}.
As soon as \textit{A} has received the \gls{rep}, it sends a \gls{rtu} to indicate that transmission can start.
\paragraph{Communication release} \Autoref{fig:communication_manager_b} depicts the communication release sequence for \gls{rc}, \gls{uc}, and \gls{rd}\@. The active host takes the initiative and sends a \gls{dreq}. The passive node acknowledges this with a \gls{drep}. These messages travel out of band, so if there are still operations in progress, it cannot be predicted how they will be completed.
\paragraph{Service ID request} \Autoref{fig:communication_manager_c} illustrates how \textit{A} sends a \gls{sidrreq} in order to receive all necessary information from \textit{B} to communicate over unreliable datagram. This information is sent from \textit{B} to \textit{A} over a \gls{sidrrep}.
\section{OpenFabrics software libraries\label{sec:iblibs}}
Although the \gls{iba} specification~\cite{infinibandvol1} defines the InfiniBand Architecture and abstract characteristics of functions which should be included, it does not define a complete \gls{api}. Initially, the \gls{ibta} planned to leave the exact \gls{api} implementation open to the several vendors. However, in 2004, the non-profit OpenIB Alliance (since 2005: OpenFabrics Alliance) was founded and released the \gls{ofed} under the \gls{gpl} v2.0 or BSD license~\cite{allianceofed}. The \gls{ofed} stack includes, i.a., software drivers, core kernel-code, and user-level interfaces (verbs) and is publicly available online.\footnote{\url{https://github.com/linux-rdma/rdma-core}}\footnote{\url{https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git}} Most InfiniBand vendors fetch this code, sometimes make small enhancements and modifications, and ship it with their hardware.
\begin{figure}[ht!]
\hspace{0.5cm}
\includegraphics{images/openfabrics_stack.pdf}
\vspace{-0.5cm}
\caption{A simplified overview of the \gls{ofed} stack.}\label{fig:openfabrics_stack}
\end{figure}
\Autoref{fig:openfabrics_stack} shows a simplified sketch of the \gls{ofed} stack. This illustration is based on a depiction of Mellanox' \gls{ofed} stack~\cite{mellanox2018linux}. In this picture, the SCSI \gls{rdma} Protocol (SRP), all example applications, and all \acrshort{iwarp} related stack components are omitted. The present work will mainly concentrate on the interface for the user space: the OpenFabrics user verbs (in the remainder of the present work, simply referred to as \textit{verbs}) and the \gls{rdma} \gls{cm}.
To a reader familiar with \autoref{sec:infiniband}, the names of most verbs are self-explanatory (e.g., \texttt{ibv\_create\_qp()}, \texttt{ibv\_alloc\_pd()}, \texttt{ibv\_modify\_qp()}, and \texttt{ibv\_poll\_cq()}). This section will highlight some functions which often reoccur in the implementations in \autoref{chap:implementation}---i.e., the structure of work requests and how to submit them in \autoref{sec:postingWRs}---or functions which are defined sparsely or not at all in the \gls{iba}---i.e., event channels in \autoref{sec:eventchannels} and the \gls{rdma} communication manager in \autoref{sec:rdmacm}. A complete, alphabetically ordered list of all verbs with a brief description of them can be found in \autoref{a:openfabrics}.
\subsection{Submitting work requests to queues\label{sec:postingWRs}}
\paragraph{Scatter/gather elements} Submitting work requests is a crucial part of the datapath and enables processes to commission data transfers to the host channel adapter without kernel intervention. As presented in \autoref{sec:qp}, both send and receive work queue elements contain one or several memory location(s), which the \gls{hca} will use to read data from, or write data to. Work requests include a pointer to a list of at least one \gls{sge}. This is a simple structure that includes the memory address, the length, and, in order for the \gls{hca} to be able to actually access the memory location, the local key. The structure of a scatter/gather element is displayed in \autoref{lst:ibv_sge}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_sge}.,
label=lst:ibv_sge,
style=customc]{listings/ibv_sge.h}
\vspace{-0.2cm}
\end{figure}
\paragraph{Receive work requests} A receive work request, which is used to inform the \gls{hca} about the main memory location where received data should be written to, is a rather simple structure as well. The structure, which is shown in \autoref{lst:ibv_recv_wr}, includes a pointer to the first element of a scatter/gather list (\texttt{*sg\_list}) and an integer to define the number of elements in the list (\texttt{num\_sge}). Passing a list with several memory locations can be handy if data should be written to different locations, rather than to one big coherent memory block. The \texttt{*next} pointer can be used to link a list of receive work requests together. This is helpful if a process first prepares all work requests, and subsequently wants to call \texttt{ibv\_post\_recv()} just once, on the first work request in the list. The \gls{hca} will automatically retrieve all following \glspl{wr}. The unsigned integer \texttt{wr\_id} is optional and can be used to identify the resulting completion queue entry.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_recv\_wr}.,
label=lst:ibv_recv_wr,
style=customc]{listings/ibv_recv_wr.h}
\vspace{-0.2cm}
\end{figure}
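\Autoref{lst:post_recv_sketch} gives a brief sketch of the linking mechanism: two receive work requests are chained via \texttt{next} and handed to the \gls{hca} with a single call to \texttt{ibv\_post\_recv()}. The queue pair \texttt{qp}, the memory region \texttt{mr}, and the buffers are assumed to exist.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of posting two chained receive work requests.}, label=lst:post_recv_sketch, style=customc]
#include <stdint.h>
#include <infiniband/verbs.h>

/* Assumptions: qp, mr, and the buffers buf0 and buf1 exist already. */
int post_two_recvs(struct ibv_qp *qp, struct ibv_mr *mr,
        void *buf0, void *buf1, uint32_t len)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t) buf0, .length = len, .lkey = mr->lkey },
        { .addr = (uintptr_t) buf1, .length = len, .lkey = mr->lkey },
    };

    struct ibv_recv_wr wr[2] = {
        { .wr_id = 0, .next = &wr[1], .sg_list = &sge[0], .num_sge = 1 },
        { .wr_id = 1, .next = NULL,   .sg_list = &sge[1], .num_sge = 1 },
    };

    /* On failure, bad_wr points to the first WR that was not posted. */
    struct ibv_recv_wr *bad_wr;

    return ibv_post_recv(qp, &wr[0], &bad_wr);
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}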
\paragraph{Send work requests} A send work request, displayed in \autoref{lst:ibv_send_wr}, is a larger structure and reveals a lot about the various options that (some) InfiniBand adapters offer. The first four elements are identical to those of the \texttt{ibv\_recv\_wr} C structure. They provide a way to match a \gls{cqe} with a \gls{wr}, offer the possibility to create a list of \glspl{wr}, and enable the user to specify a pointer to and the length of a list of scatter/gather elements.
The fifth element, \texttt{opcode}, defines the operation which is used to send the message. Which operations are allowed depends on the type of the queue pair the present work request will be sent to; \autoref{tab:transport_modes} shows all possible operations together with the service types they are allowed in. \texttt{send\_flags} can be set to a bitmap of the following flags:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \texttt{IBV\_SEND\_FENCE}: The \gls{wr} will not be processed until all previous \textit{\gls{rdma} read} and \textit{atomic} \glspl{wr} in the send queue have been completed.
\item \texttt{IBV\_SEND\_SIGNALED}: If a \gls{qp} is created with \texttt{sq\_sig\_all=1}, completion queue entries will be generated for every work request that has been submitted to the \gls{sq}. Otherwise, \glspl{cqe} will only be generated for \glspl{wr} with this flag explicitly set.
This only applies to the send queue. Signaling cannot be turned off for the receive queue.
\item \texttt{IBV\_SEND\_SOLICITED}: This flag must be set if the remote node is waiting for an event (\autoref{sec:eventchannels}), rather than actively polling the completion queue. This flag is valid for \textit{send} and \textit{\gls{rdma} write} operations and will wake up the remote node if it is waiting for a solicited message.
\item \texttt{IBV\_SEND\_INLINE}: If this flag is set, the data to which the scatter/gather element points is directly copied into the \gls{wqe} by the \gls{cpu}\@. That means that the \gls{hca} does not need to independently copy the data from the host's main memory to its own internal buffers. Consequently, this saves an additional main memory access operation and, since the \gls{hca}'s \gls{dma} engine will not access the main memory, the local key that is defined in the scatter/gather element will not be checked. Sending data inline is not defined in the original \gls{iba} and thus not all \gls{rdma} devices support it. Before sending a message inline, the maximum supported inline size has to be checked by querying the \gls{qp} attributes using \texttt{ibv\_query\_qp()}.
This flag is frequently used in the remainder of the present work because it offers a potential latency decrease and the buffers can immediately be released for re-use after the send \gls{wr} got submitted.
\end{itemize}
\input{tables/transport_modes}
The 32-bit \texttt{imm\_data} variable is used with operations that send data \textit{with immediate} (\autoref{tab:transport_modes}). The data will be sent in the data packet's \acrshort{imm} field (\autoref{tab:packet_abbreviations}). Besides sending \SI{32}{\bit} of data to the remote's completion queue---for example, as identifier---the immediate data field can also be used for notification of \textit{\gls{rdma} writes}. Usually, the remote host does not know whether an \textit{\gls{rdma} write} message is written to its memory and thus also does not know when the transfer has finished. Since \textit{\gls{rdma} write with immediate} consumes a receive \gls{wqe} and subsequently generates a \gls{cqe} on the receive side, this operation can be used as a way to synchronize and thus make the receiving side aware of the received data.
The fields \texttt{rdma}, \texttt{atomic}, and \texttt{ud} are part of a union, hence, mutually exclusive. The first two structs are used together with the operations with the same name from \autoref{tab:transport_modes}. The content of the \texttt{rdma} C structure defines the remote address and the remote key, which first have to be acquired through a normal \textit{send} operation. The \texttt{atomic} C structure includes the remote address and key, but also a compare and swap operand. The \texttt{ud} structure is used for unreliable datagram. As mentioned before, \glspl{qp} in \gls{ud} mode are not connected and the consumer has to explicitly define the \gls{ah} of the remote \gls{qp} in every \gls{wr}\@. The \gls{ah} is included in the \gls{wr} through the \texttt{*ah} pointer, and can, for example, be acquired with the \gls{rdma} communication manager which is presented in \autoref{sec:rdmacm}. The \texttt{remote\_qpn} and \texttt{remote\_qkey} variables are used for the queue pair number and queue pair key of the remote \gls{qp}, respectively.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_send\_wr}.,
label=lst:ibv_send_wr,
style=customc]{listings/ibv_send_wr.h}
\vspace{-0.2cm}
\end{figure}
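To tie the above together, \autoref{lst:post_send_sketch} shows a minimal sketch that submits a signaled, inline \textit{send} work request. The queue pair \texttt{qp} and the buffer \texttt{buf} are assumed to exist, and \texttt{len} must not exceed the maximum inline size mentioned above.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A minimal sketch of posting a signaled, inline send work request.}, label=lst:post_send_sketch, style=customc]
#include <stdint.h>
#include <infiniband/verbs.h>

/* Assumptions: qp and buf exist; len does not exceed the maximum
 * inline size reported for the QP. */
int post_inline_send(struct ibv_qp *qp, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = 0, /* not checked for inline data */
    };

    struct ibv_send_wr wr = {
        .wr_id      = 42, /* arbitrary identifier for the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE,
    };

    /* Since the CPU copies the data into the WQE, buf may be re-used
     * as soon as ibv_post_send() returns. */
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}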
\subsection{Event channels\label{sec:eventchannels}}
Usually, completion queues (\autoref{sec:qp}) are checked for new entries by actively polling them with \texttt{ibv\_poll\_cq()}; this is called \textit{busy polling}. In order for this to return a \gls{cqe} as soon as one appears in the completion queue, polling has to be done continuously. Although this is the fastest way to get to know if a new \gls{cqe} is available, it is very processor intensive: a \gls{cpu} core with a thread which continuously polls the completion queue will always be utilized \SI{100}{\percent}. If minimal \gls{cpu} utilization outweighs performance, the \gls{ofed} user verbs collection offers \glspl{cc}. Here, an instance of the \texttt{ibv\_comp\_channel} C structure is created with \texttt{ibv\_create\_comp\_channel()} and is, on creation of the completion queue, bound to that queue. After creation and every time after an event is generated, the completion queue has to be armed with \texttt{ibv\_req\_notify\_cq()} in order for it to notify the \gls{cc} about new \glspl{cqe}. To prevent races, events have to be acknowledged using \texttt{ibv\_ack\_cq\_events()}. Events do not have to be acknowledged before new events can be received, but all events have to be acknowledged before the completion queue is destroyed. Since this operation is relatively expensive, and since it is possible to acknowledge several events with one call to \texttt{ibv\_ack\_cq\_events()}, acknowledgments should be done outside of the datapath.
The completion channel is realized with the help of the Linux system call\linebreak \texttt{read()}~\cite{kerrisk2010linux}. In default mode, \texttt{read()} tries to read a file descriptor \texttt{fd} and blocks the process until it can return. Hence, as long as \texttt{fd} is not available, the operating system hibernates the process, which enables it to schedule other processes to the \gls{cpu}\@. Because \texttt{read()} is used, the C structure of the channel, displayed in \autoref{lst:ibv_comp_channel}, is not much more than a mere file descriptor and a reference counter. The blocking function which is used to wait for a channel is \texttt{ibv\_get\_cq\_event()}; this function is a wrapper around \texttt{read()}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_comp\_channel}.,
label=lst:ibv_comp_channel,
style=customc]{listings/ibv_comp_channel.h}
\vspace{-0.2cm}
\end{figure}
\Autoref{fig:poll_event_comparison} depicts a comparison between busy polling and polling after an event channel returns (\textit{event based polling}). \Autoref{fig:poll_based_polling} depicts busy polling, in which \texttt{ibv\_poll\_cq()} is placed in an endless loop and continuously polls the completion queue. In order to achieve low latencies---in other words in order to poll as often as possible---this takes place in a separate thread. If \texttt{ibv\_poll\_cq()} returns a value \texttt{ret > 0}, it was able to retrieve \texttt{ret} completion queue entries. These can now be processed, for example, to release the buffers they are pointing to.
Event based polling, depicted in \autoref{fig:event_based_polling}, is somewhat more complex. As described above, first, a completion channel is created and is bound to the completion queue during initialization. Then, the \gls{cq} must be informed with\linebreak\texttt{ibv\_req\_notify\_cq()} about the fact that it should notify the completion channel whenever a \gls{cqe} arrives. After initialization, the completion channel will be read with \texttt{ibv\_get\_cq\_event()}. This happens again in a separate thread, this time because \texttt{ibv\_get\_cq\_event()} will block the thread as long as no \gls{cqe} arrives in the completion queue. Whenever the function returns, it also returns a pointer to the original \gls{cq}, which in turn can be used to busy poll the queue for a limited amount of time. However, there are two important differences to regular busy polling: when the \gls{cq} is polled the first time, it is ensured that it will return at least one \gls{cqe}\@. Furthermore, after it has been polled the first time, the thread will continue to poll it, but as soon as \texttt{ibv\_poll\_cq()} returns 0, the process will re-arm the \gls{cq} and return to the blocking function. (Acknowledging with \texttt{ibv\_ack\_cq\_events()} is omitted from this example for the sake of simplicity; it has to be called at least once before the completion queue is destroyed.)
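\Autoref{lst:event_polling_sketch} condenses this event based flow into a brief sketch. The completion channel \texttt{channel} is assumed to be bound to a completion queue that was already armed once; contrary to the advice above, acknowledgments are done inside the loop for brevity.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of event based polling on a completion channel.}, label=lst:event_polling_sketch, style=customc]
#include <infiniband/verbs.h>

/* Assumption: channel is bound to a CQ that has been armed once with
 * ibv_req_notify_cq(). */
int event_loop(struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    for (;;) {
        /* Block until the CQ reports a new entry. */
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
            return -1;

        /* For brevity, acknowledge immediately; in a datapath this
         * should be batched and moved out of the hot loop. */
        ibv_ack_cq_events(ev_cq, 1);

        /* Re-arm before polling to avoid missing new entries. */
        if (ibv_req_notify_cq(ev_cq, 0))
            return -1;

        /* Drain the CQ; the first poll returns at least one CQE. */
        while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
            /* process wc, e.g., release the buffer it points to */
        }
    }
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}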
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/poll_based_polling.pdf}
\vspace{-0.7cm}
\caption{The working principle of busy polling.}\label{fig:poll_based_polling}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/event_based_polling.pdf}
\vspace{-0.7cm}
\caption{The working principle of event based polling.}\label{fig:event_based_polling}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/poll_event_comparison_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A comparison between busy polling and polling after an event channel returns.}\label{fig:poll_event_comparison}
\end{figure}
\subsection{RDMA communication manager library\label{sec:rdmacm}}
Because communication management can be quite cumbersome in the \gls{iba}, Annex A11 of the \gls{iba} specification~\cite{infinibandvol1} proposes the \gls{rdma} IP connection manager, which is implemented by the OpenFabrics Alliance. It offers a socket-like connection model and encodes the connection 5-tuple (i.e., protocol, source and destination IP and ports) into the private data of the \gls{cm} \gls{req} field (\autoref{sec:communication_management}).
\paragraph{RDMA CM over IPoIB} The \gls{ofed} \texttt{librdmacm}\footnote{\url{https://github.com/linux-rdma/rdma-core/blob/master/librdmacm}} library makes use of \gls{ipoib} in its implementation of this communication manager. \gls{ipoib} uses an unreliable datagram queue pair to drive communication because this is the only mode which must be implemented by \glspl{hca} and because of its multicast support~\cite{ipoib}. As can be seen in \autoref{fig:openfabrics_stack}, the Linux \gls{ipoib} driver enables processes to access the InfiniBand \gls{hca} over the \acrshort{tcpip} stack. On one hand, this forfeits InfiniBand's advantages like kernel bypass. On the other hand, it offers an easy-to-set-up interface to other InfiniBand nodes. All tools capable of working with the \acrshort{tcpip} stack are also able to work with the \acrshort{tcpip} stack on top of the \gls{ipoib} driver. Because of this, the \gls{rdma} communication manager is able to send \gls{arp} requests to other nodes which support \gls{ipoib} on the InfiniBand network. The \gls{arp} response will---assuming that a node with the requested IP address is present in the network---include a \SI{20}{\byte} \textit{MAC address}. This address consists of---listed from the \gls{msb} to the \gls{lsb}---1 reserved byte, a 3-byte \gls{qpn} field, and a 16-byte \gls{gid} field. It is important to note that some applications or operating systems may have problems with the length of \gls{ipoib}'s MAC addresses since an \acrshort{eui48}---which has a length of \SI{6}{\byte} instead of \SI{20}{\byte}---is mostly used in IEEE~802~\cite{eui64}.
Thus, after the \gls{ipoib} drivers are loaded and the interface is properly configured using tools like \texttt{ifconfig} or \texttt{ip}, the \gls{rdma} \gls{cm} is able to retrieve the queue pair number and global identifier of a remote queue pair with the help of a socket-like construct.
\paragraph{Communication identifier \& events} The abovementioned socket-like construct is realized through so-called \textit{communication identifiers} (\texttt{struct rdma\_cm\_id}). Unlike conventional sockets, these identifiers must be bound to a local \gls{hca} before they can be used. During creation of the identifier with \texttt{rdma\_create\_id()}, an event channel, conceptually similar to the channels presented in \autoref{sec:eventchannels}, can be bound to the identifier. If such a channel is present, all results of operations (e.g., resolve address, connect to remote \gls{qp}) are reported asynchronously, otherwise the identifier will operate synchronously. In the latter case, calls to functions that usually cause an event on the channel will block until the operation completes. The former case makes use of a function similar to \texttt{ibv\_get\_cq\_event()}: \texttt{rdma\_get\_cm\_event()} also implements a blocking function that only returns when an event occurs on the channel. This function can be used in a separate thread to monitor events that occur on the identifier and to act on them. It is possible to switch between synchronous and asynchronous mode.
Queue pairs can be allocated to an \texttt{rdma\_cm\_id}. Because the identifier keeps track of the different communication events that occur, it will automatically transition the \gls{qp} through its different states; explicitly invoking \texttt{ibv\_modify\_qp()} is no longer necessary.
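\Autoref{lst:rdmacm_sketch} outlines the active side of a connection setup with \texttt{librdmacm} in synchronous mode, i.e., without an event channel, so each call blocks until its operation completes. Error paths are condensed, and \texttt{pd} and \texttt{qp\_attr} are assumed to be prepared by the caller.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of the active side of an \acrshort{rdma} \acrshort{cm} connection setup.}, label=lst:rdmacm_sketch, style=customc]
#include <rdma/rdma_cma.h>

/* Assumptions: pd and qp_attr are prepared by the caller; host and
 * port identify an IPoIB address of the remote node. */
int connect_active(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_attr,
        char *host, char *port)
{
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP };
    struct rdma_addrinfo *res;
    struct rdma_cm_id *id;
    struct rdma_conn_param param = { .retry_count = 7 };

    /* Resolve the remote IPoIB address, analogous to getaddrinfo(). */
    if (rdma_getaddrinfo(host, port, &hints, &res))
        return -1;

    /* A NULL event channel makes the identifier synchronous. */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return -1;

    if (rdma_resolve_addr(id, NULL, res->ai_dst_addr, 2000) ||
        rdma_resolve_route(id, 2000))
        return -1;

    /* The identifier now drives the QP state transitions itself. */
    if (rdma_create_qp(id, pd, qp_attr))
        return -1;

    return rdma_connect(id, &param);
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}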
\section{Real-time optimizations in Linux\label{sec:optimizations}}
This section introduces optimizations that can be applied to systems running on the Linux operating system. It covers techniques that were applied to the Linux environment on which all benchmarks and VILLASnode instances were executed, as well as memory optimizations in the code. Of course, the optimizations in this section are a mere subset of all possibilities. The first subsection (\ref{sec:mem_optimization}) elaborates on memory optimization, the second subsection (\ref{sec:numa}) specifically on non-uniform memory access, the third subsection (\ref{sec:cpu_isolation}) on \gls{cpu} isolation and affinity, the fourth subsection (\ref{sec:irq_affinity}) on interrupt affinity, and finally, the last subsection (\ref{sec:tuned}) elaborates on the \texttt{tuned} daemon.
This section will not expand on the \texttt{PREEMPT\_RT} patch~\cite{rostedt2007internals} because it could not be used together with the current \gls{ofed} stack. Possible opportunities of this real-time optimization with regards to InfiniBand applications are further expanded upon in \autoref{sec:future_real_time}.
\subsection{Memory optimizations\label{sec:mem_optimization}}
There are many factors that determine how efficiently memory is used: they can be on a high level---e.g., the different techniques that are supported by the \gls{os}---but also on a low level---e.g., by changing the order of certain memory accesses in the actual algorithm. Exploring all these different techniques is beyond the scope of the present work; rather, some techniques that are used in the benchmarks and in the implementation of the \textit{InfiniBand} node-type are discussed in this subsection. The interested reader is referred to Drepper's publication~\cite{drepper2007every}, which provides a comprehensive overview of methods that can be applied to optimize memory access in Linux.
\paragraph{Hugepages} Most modern operating systems---with Linux being no exception---support \textit{demand-paging}. In this method, every process has its own \textit{virtual memory} which appears to the process as a large contiguous block of memory. The \gls{os} maps the \textit{physical addresses} of the actual physical memory (or even of a disk) to \textit{virtual addresses}. This is done through a combination of software and the \gls{mmu} which is located in the \gls{cpu}.
Memory is divided into \textit{pages}; a page is the smallest block of memory that can be accessed in virtual memory. For most modern operating systems, the smallest page size is \SI{4}{\kibi\byte}; in a 64-bit architecture these \SI{4}{\kibi\byte} can hold up to 512 words. If a process tries to access data at a certain address in the virtual memory which is not yet available, a \textit{page fault} is generated. This exception is detected by the \gls{mmu}, which in turn tries to map the complete page from the physical memory (or from a disk) into the virtual memory.
Page faults are quite expensive and it is beneficial for performance to cause as few page faults as possible~\cite{drepper2007every}. One possible solution to achieve this is to increase the size of the pages: Linux supports so-called \textit{hugepages}. Although there are several possible sizes for hugepages, on x86-64 architectures they are usually \SI{2}{\mebi\byte}~\cite{guide2018intelc3a}. Compared to the 512 words that can fit into a \SI{4}{\kibi\byte} page, a hugepage can fit \num{262144} words into one page, which is 512 times as much. Since more data can be accessed with fewer page faults, this will increase performance; Drepper~\cite{drepper2007every} reports performance gains up to \SI{57}{\percent} (for a working set of $\SI[parse-numbers=false]{2^{20}}{\byte}$).
Additionally, with hugepages, more memory can be mapped with a single entry in the \gls{tlb}. This buffer is part of the \gls{mmu} and caches the most recently used page table entries. If a page is present in the \gls{tlb} (a \textit{\gls{tlb} hit}), resolution of a page in the physical memory is instantaneous. Otherwise (a \textit{\gls{tlb} miss}), up to four memory accesses are required on x86-64 architectures~\cite{gandhi2016range}. Since the \gls{tlb} size is limited, larger pages result in the instantaneous resolution of a larger range of addresses with the same size \gls{tlb}.
Using hugepages is not a cure-all; it has some disadvantages that have to be considered. As page sizes grow, it becomes harder for the \gls{os} to find contiguous physical memory sectors of this size. This goes hand in hand with external fragmentation of the memory. Furthermore, the size of hugepages makes them more prone to internal fragmentation, which means that more memory is allocated than is actually needed.
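On Linux, a buffer backed by hugepages can be requested explicitly with \texttt{mmap()} and the \texttt{MAP\_HUGETLB} flag, as the sketch in \autoref{lst:hugepage_sketch} shows. It assumes that hugepages have been reserved beforehand, e.g., via \texttt{/proc/sys/vm/nr\_hugepages}.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of explicitly allocating one \SI{2}{\mebi\byte} hugepage.}, label=lst:hugepage_sketch, style=customc]
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HP_SIZE (2UL * 1024 * 1024) /* one 2 MiB hugepage on x86-64 */

int main(void)
{
    /* Assumes hugepages were reserved beforehand, e.g., via
     * /proc/sys/vm/nr_hugepages. */
    void *buf = mmap(NULL, HP_SIZE, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... use buf: a single TLB entry now covers the whole 2 MiB ... */

    munmap(buf, HP_SIZE);

    return 0;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}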
\paragraph{Alignment} A memory address $a$ is \textit{n-byte aligned} when
\begin{equation}
a = C\cdot n=C\cdot 2^i, \qquad \quad \mathrm{with}~i\geq0,\ \; C\in\mathbb{Z}.
\label{eq:alignment}
\end{equation}
Equivalently, an address is n-byte aligned exactly when the $\log_2(n)$ \glspl{lsb} of the address are \zero.
\begin{listing}[ht!]
\refstepcounter{lstlisting}
\noindent\begin{minipage}[b]{.34\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/memory_alignment_a.h}
\captionof{sublisting}{Struct with padding.}\label{lst:memory_alignment_a}
\end{minipage}%
\hfill
\begin{minipage}[b]{.58\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/memory_alignment_b.h}
\captionof{sublisting}{Packed struct without padding.}\label{lst:memory_alignment_b}
\end{minipage}
\addtocounter{lstlisting}{-1}
\captionof{lstlisting}{Two C structures with a 1-byte character, a 4-byte integer, and a 2-byte short.}
\label{lst:memory_alignment}
\end{listing}
\Autoref{fig:memory_alignment} shows a simple example for a 32-bit system with the three primitive C data types from \autoref{lst:memory_alignment}. In \autoref{fig:memory_alignment_a} the data is \textit{naturally aligned}: the compiler added padding between the data types to ensure alignment to the memory word boundaries. In the structure definition of \autoref{lst:memory_alignment}\hyperref[lst:memory_alignment]{b}, the compiler is compelled to omit additional padding: the data types are not aligned to word boundaries. Note that \autoref{eq:alignment} holds in \autoref{fig:memory_alignment_a}, but not in \autoref{fig:memory_alignment_b}. Furthermore, for \autoref{fig:memory_alignment_a}, additional 1-byte characters could be placed at \texttt{0x0001}, \texttt{0x0002}, \texttt{0x0003}, and \texttt{0x000A} in this example. Additional 2-byte shorts could be placed at \texttt{0x0002} and \texttt{0x000A}.
\begin{figure}[ht!]
\vspace{-0.2cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth,page=1]{images/memory_alignment.pdf}
\vspace{-0.7cm}
\caption{An aligned struct (\autoref{lst:memory_alignment}\hyperref[lst:memory_alignment]{a}).}\label{fig:memory_alignment_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/memory_alignment.pdf}
\vspace{-0.7cm}
\caption{An unaligned struct (\autoref{lst:memory_alignment}\hyperref[lst:memory_alignment]{b}).}\label{fig:memory_alignment_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/memory_alignment_legend.pdf}
\end{subfigure}
\vspace{-1.2cm}
\caption{An example of a 1-byte character, a 4-byte integer, and a 2-byte short from \autoref{lst:memory_alignment} in memory with a word size of \SI{32}{\bit}.}\label{fig:memory_alignment}
\end{figure}
Similar to pages, a system can access only one whole word at a time. In \autoref{fig:memory_alignment_a}, this translates to one memory access per data type. In \autoref{fig:memory_alignment_b}, however, this is no longer possible. To access the integer, the processor first has to access the word at address \texttt{0x0000} and then the word at address \texttt{0x0004}. Subsequently, the value in the first word must be shifted by one byte and the value in the second word by three bytes. Finally, both words have to be merged. These additional operations cause additional delay when trying to access the memory. Moreover, atomicity becomes more difficult to guarantee, since two memory locations have to be accessed to read a single data type.
Alignment is not only relevant for memory words. Not aligning allocated memory to cache lines significantly slows down memory access~\cite{drepper2007every}. Furthermore, due to the way the \gls{tlb} works, alignment can speed up resolution of addresses in the physical memory.
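In practice, buffers with a stricter alignment than \texttt{malloc()} guarantees can be obtained with \texttt{posix\_memalign()}. The sketch in \autoref{lst:memalign_sketch} allocates a buffer aligned to \SI{64}{\byte}, a common cache line size on x86-64.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of allocating a cache line aligned buffer.}, label=lst:memalign_sketch, style=customc]
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *buf;

    /* The alignment must be a power of two and a multiple of
     * sizeof(void *); posix_memalign() returns an error number
     * instead of setting errno. */
    int ret = posix_memalign(&buf, 64, 4096);
    if (ret != 0) {
        fprintf(stderr, "posix_memalign: error %d\n", ret);
        return 1;
    }

    /* ... buf now starts on a cache line boundary ... */

    free(buf);

    return 0;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}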
\paragraph{Pinning memory} The process of preventing the operating system from swapping out (parts of) the virtual address space of a process is called \textit{pinning memory}. It is invoked by calling \texttt{mlock()} to prevent parts of the address space from being swapped out, or \texttt{mlockall()} to prevent the complete address space from being swapped out.\footnote{\url{http://man7.org/linux/man-pages/man2/mlock.2.html}}
Explicitly pinning buffers that are allocated to use as source or sink for data by an \gls{hca} is not necessary: when registering a memory region (\autoref{sec:memory}), the registration process automatically pins the memory pages~\cite{mellanox2015RDMA}.
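For memory that is not registered with the \gls{hca}, e.g., the remaining address space of a time-critical process, pinning can be requested explicitly; \autoref{lst:mlockall_sketch} shows a minimal sketch.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A minimal sketch of pinning the complete address space.}, label=lst:mlockall_sketch, style=customc]
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Pin all current and future pages of this process; this requires
     * CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* ... time-critical work without the risk of page-outs ... */

    return 0;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}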
\subsection{Non-uniform memory access\label{sec:numa}}
If different memory locations in the address space show different access times, this is called \gls{numa}. A common example of a \gls{numa} system is a computer system with multiple \gls{cpu} sockets and thus also multiple system buses. A \gls{numa} node is defined as memory with the same access characteristics; in such a system, this is the memory which is closest to the respective \gls{cpu}. Accessing memory on a remote \gls{numa} node adds up to 50 percent to the latency for a memory access~\cite{lameter2013numa}.
\Autoref{fig:numa_nodes} depicts an example with two \gls{numa} nodes and the interconnect between them. It is beneficial for the performance of processes to access only memory which is closest to the processor that executes the process. Furthermore, regarding the InfiniBand applications later presented in the present work, it is beneficial to run processes that need to access a certain \gls{hca} on the same \gls{numa} node as the \gls{hca}. An \gls{hca} is connected to the system bus through the \gls{pcie} bus, hence, access of memory in the same \gls{numa} node will be faster than access of memory on a remote \gls{numa} node. Thus, in case of \autoref{fig:numa_nodes}, if a process needs to access \gls{hca} 0, it should be scheduled on one or more cores on processor 0 and should be restricted to memory locations of memory 0.
\begin{figure}[ht]
\includegraphics{images/numa_nodes.pdf}
\vspace{-0.5cm}
\caption{Two \acrfull{numa} nodes with \acrshortpl{hca} on the respective \acrshort{pcie} buses.}\label{fig:numa_nodes}
\end{figure}
To set the memory policy of processes, tools like \texttt{numactl}\footnote{\url{http://man7.org/linux/man-pages/man8/numactl.8.html}}, which are based on the system call \texttt{set\_mempolicy()}\footnote{\url{http://man7.org/linux/man-pages/man2/set_mempolicy.2.html}}, can be used. These tools will not be further elaborated upon here since the next subsection will introduce a more general tool to constrain both \gls{cpu} cores and \gls{numa} nodes to processes.
\subsection{CPU isolation \& affinity\label{sec:cpu_isolation}}
\paragraph{Isolcpus} It is beneficial for the performance of a process if one or more \gls{cpu} cores (in the remainder of the present work often simply referred to as \textit{cores} or \textit{\glspl{cpu}}) are completely dedicated to its execution. Historically, the \texttt{isolcpus}\footnote{\url{https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt}} kernel parameter has been used to exclude processor cores from the general balancing and scheduler algorithms on symmetric multiprocessing architectures. With this exclusion, processes will only be moved to excluded cores if their affinity is explicitly set to these cores with the system call \texttt{sched\_setaffinity()}~\cite{kerrisk2010linux}. The tool \texttt{taskset}\footnote{\url{http://man7.org/linux/man-pages/man1/taskset.1.html}}, which relies on the aforementioned system call, is often used to set the \gls{cpu} affinity of running processes or to set the affinity of new commands.
The major advantage of \texttt{isolcpus} is at the same time its biggest disadvantage: the exclusion of cores from the scheduling algorithms causes threads that are created by a process to always be executed on the same core as the process itself. Take the example of busy polling: if a thread that must busy poll a completion queue is created and is executed on the same core as the primary thread, this has an adverse effect on the performance of the latter. So, it is desirable to isolate \gls{cpu} cores that are dedicated to certain explicitly defined processes, but simultaneously enable efficient scheduling of threads of these processes among the isolated cores.
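\Autoref{lst:affinity_sketch} sketches how a process can set its own affinity with \texttt{sched\_setaffinity()}; the isolated cores 16 and 18 are arbitrary examples.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch of pinning the calling process to two cores.}, label=lst:affinity_sketch, style=customc]
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(16, &set); /* exemplary isolated cores */
    CPU_SET(18, &set);

    /* pid 0 refers to the calling process. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... the scheduler now only uses cores 16 and 18 ... */

    return 0;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}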
\paragraph{Cpusets} A possible solution to this problem is offered by \textit{cpusets}~\cite{derr2004cpusets}, which build on the generic \textit{control group} (cgroup)~\cite{menage2004cgroups} subsystem. If this mechanism is used, requests by a task to include \glspl{cpu} in its \gls{cpu} affinity or requests to include memory nodes are filtered through the task's cpuset. That way, the scheduler will not schedule a task on a core that is not in its \texttt{cpuset.cpus} list and will not use memory on \gls{numa} nodes which are not in the \texttt{cpuset.mems} list.
Cpusets are managed through the \textit{cgroup virtual file system} and each cpuset is represented by a directory in this file system. The root cpuset is located under \texttt{/sys/fs/cgroup/cpuset} and includes all memory nodes and \gls{cpu} cores. A new cpuset is generated by creating a directory within the root directory. Every newly created directory automatically includes similar files to the root directory. These files shall be used to write the cpuset's configuration to (e.g., with \texttt{echo}\footnote{\url{http://man7.org/linux/man-pages/man1/echo.1.html}}) or to read the current configuration from (e.g., with \texttt{cat}\footnote{\url{http://man7.org/linux/man-pages/man1/cat.1.html}}). The following settings are available for every cpuset~\cite{derr2004cpusets}:
\begin{itemize}
\setlength\itemsep{-0.1em}
\item \texttt{cpuset.cpus}: list of \glspl{cpu} in that cpuset;
\item \texttt{cpuset.mems}: list of memory nodes in that cpuset;
\item \texttt{cpuset.memory\_migrate}: if set, pages are moved to cpuset's nodes;
\item \texttt{cpuset.cpu\_exclusive}: if set, cpu placement is exclusive;
\item \texttt{cpuset.mem\_exclusive}: if set, memory placement is exclusive;
\item \texttt{cpuset.mem\_hardwall}: if set, memory allocation is hardwalled;
\item \texttt{cpuset.memory\_pressure}: measure of how much paging pressure in cpuset;
\item \texttt{cpuset.memory\_pressure\_enabled}\footnote{exclusive to root cpuset}: if set, memory pressure is computed;
\item \texttt{cpuset.memory\_spread\_page}: if set, page cache is spread evenly on nodes;
\item \texttt{cpuset.memory\_spread\_slab}: if set, slab cache is spread evenly on nodes;
\item \texttt{cpuset.sched\_load\_balance}: if set, load is balanced among CPUs;
\item \texttt{cpuset.sched\_relax\_domain\_level}: searching range when migrating tasks.
\end{itemize}
Once all desired cpusets are created and everything is set up by writing settings to the abovementioned files, tasks can be assigned by writing their \gls{pid} to \texttt{/sys/fs/cgroup/cpuset/<name\_cpuset>/tasks}.
\paragraph{Cpuset tool} Since the process of manually writing tasks to the \textit{tasks-file} can be quite cumbersome, there are several tools and mechanisms to manage which processes are bound to which cgroups.\footnote{Although the libcgroup package was used in the past, systemd is nowadays the preferred method for managing control groups.} A rudimentary tool that is used in the present work is called \textit{cpuset}.\footnote{\url{https://github.com/lpechacek/cpuset}} It was developed by Alex Tsariounov and is a Python wrapper around the file system operations to manage cpusets. The following examples on how to create cpusets, how to move threads between cpusets, and how to execute applications in a cpuset are all based on this tool. However, the exact same settings can also be achieved by writing values manually to the virtual file system.
\Autoref{lst:cset_create} shows how to create different subsets. (Here, and in the remainder of the present work, the octothorpe indicates that the commands must be executed by a superuser. A dollar sign indicates that the command can be executed by a normal user.) In this example, an arbitrary machine with 24 cores and two \gls{numa} nodes (\autoref{sec:numa}) is assumed. The first cpuset, \textit{system}, may use 16 of these cores exclusively and may use memory in both \gls{numa} nodes. This will become the default cpuset for non-time-critical applications. The second and third cpuset, called \textit{real-time-0} and \textit{real-time-1} in this example, may use four cores each. These are exclusively reserved for time-critical applications. In this example, it is assumed that the \glspl{cpu} 16, 18, 20, and 22 reside in \gls{numa} node 0 and the \glspl{cpu} 17, 19, 21, and 23 in \gls{numa} node 1; the real-time cpusets are thus constrained to their respective nodes.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Creating cpusets for system tasks and real-time tasks.,
label=lst:cset_create,
style=customconfig]{listings/cset_create.sh}
\vspace{-0.2cm}
\end{figure}
The exclusiveness of a \gls{cpu} to a cpuset only applies to its siblings; tasks in the cpuset's parent may still use the \gls{cpu}. Therefore, \autoref{lst:cset_move} shows how to move threads and movable kernel threads from the root cpuset to the newly created \textit{system} cpuset. Now, the execution of these tasks and of all their children exclusively takes place on \glspl{cpu} that range from 0 to 15.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Moving all tasks\comma threads\comma and moveable kernel threads to \textit{system}.,
label=lst:cset_move,
style=customconfig]{listings/cset_move.sh}
\vspace{-0.2cm}
\end{figure}
This leaves the two real-time cpusets exclusively for high-priority applications. \Autoref{lst:cset_exec} shows how new applications with their arguments can be started within the real-time cpusets.
To ensure that the load is balanced among the \glspl{cpu} in a cpuset---a feature that is not supported by \texttt{isolcpus}---\texttt{cpuset.sched\_load\_balance} must be \one. It is not necessary to explicitly set this value since its default value is already \one.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Execute \texttt{<application>} with the arguments \texttt{<args>} in the real-time cpusets.,
label=lst:cset_exec,
style=customconfig]{listings/cset_exec.sh}
\vspace{-0.2cm}
\end{figure}
\paragraph{Non-movable kernel threads} Kernel threads are background operations performed by the kernel. They do not have an address space, are created on system boot, and can only be created by other kernel threads \cite{love2010linux}. Although some of them may be moved from one \gls{cpu} to another, this is not generally the case. Some kernel threads are pinned to a \gls{cpu} on creation. Although it is not possible to completely exclude kernel threads from getting pinned to cores which will be shielded, there is a workaround which might minimize this chance.
By setting the kernel parameter \texttt{maxcpus}\footnote{\url{https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt}} to a number smaller than the total amount of \gls{cpu} cores in the system, some cores will not be brought up during bootup. Hence, these processors will not be used to schedule kernel threads. Later, when all movable kernel threads are moved to a shielded cpuset, the remaining \glspl{cpu} can be activated with the command from \autoref{lst:activate_cpu}. Then, these \glspl{cpu} can be added to an exclusive cpuset. Although it is inevitable that some necessary threads will be spawned on these cores once they are brought up, most of the non-movable kernel threads cannot move from a processor that was available during bootup to a processor that was activated after bootup.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Bring up a \gls{cpu} \texttt{<cpuX>} which was disabled during bootup.,
label=lst:activate_cpu,
style=customconfig]{listings/activate_cpu.sh}
\vspace{-0.2cm}
\end{figure}
\subsection{Interrupt affinity\label{sec:irq_affinity}}
In most computer systems, hardware \textit{interrupts} provide a mechanism for \gls{io} hardware to notify the \gls{cpu} when it has finished the work it was assigned. When an \gls{io} device wants to inform the \gls{cpu}, it asserts a signal on the bus line it has been assigned to. The signal is then detected by the \textit{interrupt controller} which decides if the targeted \gls{cpu} core is busy. If this is not the case, the interrupt is immediately forwarded to the \gls{cpu}, which in turn ceases its current activity to handle the interrupt. If the \gls{cpu} is busy, for example, because another interrupt with a higher priority is being processed, the controller ignores the interrupt for the moment and the device keeps asserting a signal to the line until the \gls{cpu} is not busy\linebreak anymore \cite{tanenbaum2014modern}.
Hence, if a \gls{cpu} is busy performing time-critical operations---e.g., busy polling (\autoref{fig:poll_based_polling})---too many interrupts are detrimental for the performance. Thus, it can be advantageous to re-route interrupts to \glspl{cpu} that do not perform time-critical applications.
\Autoref{lst:get_irq_affinity} shows how to obtain the \gls{irq} affinity of a certain interrupt request \texttt{<irqX>}. The value \texttt{smp\_affinity} is a hexadecimal \textit{bitmap}, in which every bit that is set represents a \gls{cpu} that is allowed to handle the interrupt~\cite{bowden2009proc}. E.g., when \texttt{smp\_affinity} for a certain \gls{irq} is \texttt{2} (binary \texttt{10}), only \gls{cpu} 1 is allowed; if the affinity is \texttt{3} (binary \texttt{11}), \glspl{cpu} 0 and 1 are allowed.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Get the \gls{irq} affinity of interrupt \texttt{<irqX>}.,
label=lst:get_irq_affinity,
style=customconfig]{listings/get_irq_affinity.sh}
\vspace{-0.2cm}
\end{figure}
\Autoref{lst:set_irq_affinity} demonstrates how the \gls{irq} affinity of a certain interrupt request can be set. In the case of \autoref{lst:set_irq_affinity}, it is set to \glspl{cpu} 0--15, which corresponds to the \textit{system} cpuset from the previous subsection. \texttt{<irqX>} will no longer bother the \glspl{cpu} 16--23. To re-route all \glspl{irq}, a small program that loops through \texttt{/proc/irq} can be used; a C sketch of this approach follows \autoref{lst:set_irq_affinity}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Set the \gls{irq} affinity of interrupt \texttt{<irqX>} to \gls{cpu} 0--15.,
label=lst:set_irq_affinity,
style=customconfig]{listings/set_irq_affinity.sh}
\vspace{-0.2cm}
\end{figure}
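Instead of a Bash script, the loop can also be written in C, consistent with the other listings in this chapter; \autoref{lst:irq_loop_sketch} is a sketch that re-routes every \gls{irq} to \glspl{cpu} 0--15 and silently skips interrupts that cannot be re-routed.
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{lstlisting}[caption={A sketch that re-routes all \glspl{irq} to \glspl{cpu} 0--15.}, label=lst:irq_loop_sketch, style=customc]
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/proc/irq");
    struct dirent *entry;
    char path[64];

    if (!dir) {
        perror("opendir");
        return 1;
    }

    while ((entry = readdir(dir)) != NULL) {
        /* Only the numbered entries are IRQs. */
        if (!isdigit((unsigned char) entry->d_name[0]))
            continue;

        snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity",
                entry->d_name);

        /* Writing requires root; some IRQs reject a new affinity. */
        FILE *f = fopen(path, "w");
        if (!f)
            continue;

        fputs("ffff", f); /* hexadecimal bitmap for CPUs 0-15 */
        fclose(f);
    }

    closedir(dir);

    return 0;
}
\end{lstlisting}
\vspace{-0.2cm}
\end{figure}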
\subsection{Tuned daemon\label{sec:tuned}}
Red Hat based systems support the \texttt{tuned} daemon\footnote{\url{https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/chap-red_hat_enterprise_linux-performance_tuning_guide-tuned}}, which uses \texttt{udev}~\cite{kroah2003udev} to monitor devices and, on the basis of its findings, adjusts system settings to increase performance according to a selected profile. The daemon consists of two types of plugins: monitoring and tuning plugins. The former can, at the moment of writing the present work, monitor the disk load, network load, and \gls{cpu} load. The tuning plugins currently supported are: cpu, eeepc\undershort{}she, net, sysctl, usb, vm, audio, disk, mounts, script, sysfs, and video.
Although it is possible to define custom profiles, \texttt{tuned} offers a wide range of predefined profiles, of which \textit{latency-performance} is eminently suitable for low-latency applications. Among other things, this profile disables power saving mechanisms, sets the \gls{cpu} governor to \textit{performance}, and locks the \gls{cpu} to a low C-state. A complete overview of all settings in the \textit{latency-performance} profile can be found in \autoref{a:tuned_profile}.
Management of different tuning profiles can be done with the command line tool \texttt{tuned-adm}.\footnote{\url{https://linux.die.net/man/1/tuned-adm}}