
\chapter{Architecture\label{chap:architecture}}
The first section of this chapter (\ref{sec:villasbasics}) explains the concept and internals of a VILLASnode instance. The second section (\ref{sec:configuration}) gives a brief introduction to the configuration of node-type instances. Then, \autoref{sec:readwrite_interfaces},~\ref{sec:memorymanagement}, and~\ref{sec:villas_fsm} explain the adaptations that had to be made to the interface of node-types, the memory management of VILLASnode, and the finite-state machine of nodes, respectively.

\section{Concept\label{sec:villasbasics}}
The functioning principles and general structure of VILLASframework, of which VILLASnode is a sub-project, were already presented in \autoref{sec:intro_villas}. This section solely focuses on the structure of VILLASnode.

\Autoref{tab:villasnode_nodes} presented the different \textit{node-types} that VILLASnode supported at the time of writing the present work. One VILLASnode instance---in the remainder of the present work often referred to as \textit{super-node}---may have several \textit{nodes} which act as source and/or sink of simulation data. A node is defined as an instance of a node-type. Accordingly, a super-node can serve as a gateway for simulation data. Node-types can roughly be divided into three categories:
\begin{itemize}
	\setlength\itemsep{0.2em}
    \item \textit{internal node-types}, which enable communication with node-types on the same host (e.g., writing data to a file descriptor through a \textit{file} node);
    \item \textit{server-server node-types}, which enable communication with nodes on different hosts (e.g., communicating with a \textit{socket} node on a remote host);
    \item \textit{simulator-server node-types}, which enable communication with simulators (e.g., acquiring data from an OPAL-RT simulator).
\end{itemize}
(In the remainder of this work, names of node-types and nodes are set in italics, for example, \textit{file} node, \textit{socket} node, or \textit{InfiniBand} node-type.)

Within a super-node, so-called \textit{paths} connect different nodes. A path starts at a node from which it acquires data. Immediately after data is obtained, it is optionally sent through a \textit{hook}, which can be seen as an extension to manipulate the data (e.g., to filter or transform it). Then, the data is written into a \gls{fifo} (also called a \textit{queue}), which holds it until it can be passed on. Subsequently, the data is sent through a \textit{register}, which can multiplex and mask it. Before the data is placed into the output queue, and again right before the sending node obtains it, it can be manipulated by more hooks. Finally, if the output node is ready, the data is moved from the output queue to the output node, which then sends it to a given destination node.

Data is transmitted in \textit{samples}, which store the simulation data for a given point in time, send and receive timestamps, and a sequence number. The sample structure is deliberately kept simple because it is the smallest common denominator of all supported simulators.

\begin{figure}[ht!]
	\includegraphics{images/villasnode.pdf}
	\vspace{-0.5cm}
    \caption{The internal VILLASnode architecture~\cite{vogel2017open}. Depicted is one VILLASnode instance (\textit{super-node}) that includes three \textit{paths}, which connect five node-type instances (\textit{nodes}) with each other.}
    \label{fig:villasnode}
\end{figure}

\Autoref{fig:villasnode} depicts the internal connections of an example super-node. This VILLASnode instance includes five node-type instances: \textit{opal} ($n_1$), \textit{file} ($n_2$), \textit{socket} ($n_3$), \textit{mqtt} ($n_4$), and a yet to be implemented \textit{InfiniBand} ($n_5$) node. On receive, data from the \textit{opal} node $n_1$ is modified by hook $h_1$ before it is placed in queue $q_{i,1}$. Path 1 continues through register $r_1$, hook $h_2$, and hook $h_3$, before it enters the output queue $q_{o,1}$. Before the \textit{socket} node $n_3$ sends the data from the queue to another \textit{socket} node, it is modified one last time by hook $h_4$.

Path 2 connects a \textit{socket} node ($n_3$), an \textit{mqtt} node ($n_4$), and an \textit{InfiniBand} node ($n_5$) with an \textit{opal} node ($n_1$). In this path, the register $r_2$ determines the forwarding conditions for $q_{i,2}$, $q_{i,3}$, and $q_{i,4}$; it could, for example, mask the queues depending on the data available in them. Before the data is placed in the output queue $q_{o,2}$ and right before the \textit{opal} node sends the data, it is modified by hooks $h_5$ and $h_6$, respectively.

Path 3 connects a \textit{file} node $n_2$, which reads data from a local file, with an \textit{mqtt} node $n_4$ and \textit{InfiniBand} node $n_5$.

\section{Configuration of nodes\label{sec:configuration}}
\Autoref{lst:node_config} shows an example of a stripped down VILLASnode configuration file. The first part of the configuration consists of a list of nodes to be initialized (comparable with $n_{1\ldots5}$ in \autoref{fig:villasnode}). In this example, an instance of a \textit{file} node-type (\texttt{node\_1}) and an instance of an \textit{InfiniBand} node-type (\texttt{node\_2}) would be instantiated. Besides the type, a user can specify a range of settings for every node. These can be divided into global settings for the complete instance, settings only for the input part of the node, and settings only for the output part. The supported settings for every node-type can be found on the VILLASframework documentation pages.\footnote{\url{https://villas.fein-aachen.org/doc/node-types.html}}

\begin{figure}[ht!]
    \vspace{0.5cm}
    \lstinputlisting[caption=Structure of the configuration file of a \textit{file} node and an \textit{InfiniBand} node with a path connecting them.,
            		 label=lst:node_config,
                     style=customconfig]{listings/node_config.conf}
    \vspace{-0.2cm}
\end{figure}

The \textit{paths} section describes how nodes are connected within the super-node (compare with path 1, path 2, and path 3 in \autoref{fig:villasnode}). In this case, there is a path between \texttt{node\_1} and \texttt{node\_2}. This means that data is read from a file, which would be specified in the in-section of \texttt{node\_1}, and then placed in a buffer in the super-node. Then, after it is sent through possible hooks---which are not defined in this configuration file---it is copied to the memory that is allocated as output buffer for the \textit{InfiniBand} node. The super-node then sends these samples to the write-function of that node, which in turn sends the samples to a remote node as specified in its out-section.

\section{Interface of node-types\label{sec:readwrite_interfaces}}
To ensure interoperability between different node-types and VILLASnode, the VILLASframework specification defines an interface to use between the super-node and node-types. It is realized as a fixed set of functions with a given set of parameters that every node-type can implement. These functions have to be registered with the framework by passing it the pointers of the respective functions. Examples of functions to be implemented are \texttt{start()} and \texttt{stop()}, as well as \texttt{read()} and \texttt{write()}. Since their parameters had to be changed to efficiently support an \textit{InfiniBand} node-type, this section will expand upon the latter.

Not every function is mandatory; some functions will simply be ignored if they are not implemented. A complete list of all functions a node-type should implement, together with a brief description, is presented in \autoref{a:nodetype_functions}.

\subsection{Original implementation of the read- and write-function}
\Autoref{lst:read_write_original} shows the variables which were originally used in the \texttt{node\_type} C structure (\autorefap{a:sec:structnodetype}) to save the function pointers to the read- and write-function. Since this listing shows the initial parameters, it helps to understand the working principles of both functions and their weaknesses.

For both functions, \texttt{*n} is a pointer to a C structure that holds information about the node-type instance. It contains, among other things, information about the state, the number of generated or received samples, the configuration of the node, and a field for node-type specific virtual data. The node structure is displayed in \autorefap{a:sec:structnode}; the present work will not expand further upon this struct.

\begin{figure}[ht!]
    \vspace{0.5cm}
    \lstinputlisting[caption=Original parameters of \texttt{read()} and \texttt{write()},
            		 label=lst:read_write_original,
                     style=customc]{listings/read_write_original.h}
    \vspace{-0.2cm}
\end{figure}

\paragraph{Read-function} The working principle of the read-function is displayed in \autoref{fig:villas_read}. The \textit{\undershort read()} box represents the function to which the \texttt{(*read)} pointer (line 1 in \autoref{lst:read_write_original}) of a given node-type points and is often simply referred to as \textit{read-function} in the remainder of the present work. The box thus depicts a part of the interface between the super-node and the node.

In order to retrieve data from a node, the super-node starts by allocating $\mathtt{cnt} \geq 1$ empty samples. A sample contains fields for, among others, an origin timestamp, a receive timestamp, a sequence number, a reference counter, and a field to save the actual signal. The signal can contain unsigned 64-bit integers, 64-bit floating-point numbers, booleans, or complex numbers. \Autorefap{a:sec:structsample} presents the \texttt{sample} C structure. Since this structure contains some host-specific information, it contains more data than will actually be sent.

After samples have been allocated, their reference counter (\textit{refcnt}) is increased by one. A sample in VILLASnode is only destroyed if its reference counter is 1 when the release-function is called. When $\mathit{refcnt} > 1$, other instances within VILLASnode still rely on the sample; calling the release-function on such a sample merely decrements the reference counter by 1. In the remainder of the present work, \textit{releasing a sample} and \textit{decreasing the reference counter of a sample by one} are used interchangeably.
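
The release semantics described above can be illustrated with a minimal sketch. The names \texttt{sample\_sketch}, \texttt{sample\_alloc()}, \texttt{sample\_get()}, and \texttt{sample\_put()} are assumptions made for this illustration and are not necessarily the identifiers used in VILLASnode. The sketch models only the reference counter and, for brevity, does not use atomic operations.

\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced sample: only the reference counter is modelled. */
struct sample_sketch {
    int refcnt;
};

/* Allocate a sample; the caller holds the first reference. */
static struct sample_sketch *sample_alloc(void)
{
    struct sample_sketch *s = calloc(1, sizeof(*s));

    if (s != NULL)
        s->refcnt = 1;

    return s;
}

/* Take an additional reference and return the new count. */
static int sample_get(struct sample_sketch *s)
{
    return ++s->refcnt;
}

/* Release one reference. The sample is destroyed only if the caller
 * held the last reference. Returns the remaining count. */
static int sample_put(struct sample_sketch *s)
{
    int remaining = --s->refcnt;

    assert(remaining >= 0);
    if (remaining == 0)
        free(s);

    return remaining;
}
\end{lstlisting}

In this sketch, a sample that is still referenced elsewhere survives a call to \texttt{sample\_put()}; only the call made by the last holder frees the memory.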

\begin{figure}[ht!]
	\vspace{-0.5cm}
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=1]{images/villas_read.pdf}
	    \vspace{-0.8cm}
        \caption{Invoking the read-function.}\label{fig:villas_read_a}
    \end{subfigure}
    \hspace*{\fill} % separation between the subfigures
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=2]{images/villas_read.pdf}
	    \vspace{-0.8cm}
        \caption{Return of the read-function.}\label{fig:villas_read_b}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \includegraphics[width=\linewidth]{images/villas_read_legend.pdf}
    \end{subfigure}
    \vspace{-1.5cm}
    \caption{A depiction of the working principle of the read-function in VILLASnode. This function is part of the interface between a super-node and a node.}\label{fig:villas_read}
\end{figure}

After memory to hold the samples has been allocated, a pointer to the first sample (\texttt{*smps[]}) and the total number of allocated samples (\texttt{cnt}) are passed to the node by calling the read-function (\autoref{fig:villas_read_a}). The node then tries to receive at most \texttt{cnt} values and subsequently copies them to the allocated memory.

The return of the read-function is depicted in \autoref{fig:villas_read_b}. After the receive module, which is treated as a black box here, has filled $\mathit{ret} \leq \mathtt{cnt}$ samples, it lets the read-function return with \textit{ret}. The super-node then processes \textit{ret} samples (e.g., by sending them through several hooks before sending them to another node). Finally, all \texttt{cnt} samples, and thus not only the \textit{ret} filled ones, are released. So, after a read cycle, the reference counter of all samples is decreased by 1, and in that way the samples are usually destroyed.
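
The read cycle described above can be summarized in a short sketch. All identifiers (\texttt{sample\_sketch}, \texttt{node\_sketch}, \texttt{read\_sketch()}, \texttt{read\_cycle\_sketch()}) are hypothetical, the receive module is reduced to a local array of values, and the samples come from a caller-provided pool instead of a real allocator. The sketch is only meant to show that \textit{ret} samples are processed while all \texttt{cnt} samples are released.

\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>

#define MAX_CNT 8

/* Hypothetical, reduced sample. */
struct sample_sketch {
    int refcnt;
    double value;
};

/* Hypothetical node whose receive module is a local buffer. */
struct node_sketch {
    const double *pending; /* values waiting in the receive module */
    unsigned available;    /* number of waiting values */
};

/* Read-function with the original parameters: copy at most cnt values
 * into the passed samples and return the number of filled samples. */
static int read_sketch(struct node_sketch *n,
                       struct sample_sketch *smps[], unsigned cnt)
{
    unsigned ret = n->available < cnt ? n->available : cnt;

    for (unsigned i = 0; i < ret; i++)
        smps[i]->value = n->pending[i];

    n->pending += ret;
    n->available -= ret;

    return (int) ret;
}

/* One read cycle as seen from the super-node. Processing is reduced
 * to summing up the received values. */
static int read_cycle_sketch(struct node_sketch *n,
                             struct sample_sketch pool[], unsigned cnt,
                             double *sum)
{
    struct sample_sketch *smps[MAX_CNT];

    assert(cnt >= 1 && cnt <= MAX_CNT);

    for (unsigned i = 0; i < cnt; i++) {
        pool[i].refcnt++; /* take a reference on every sample */
        smps[i] = &pool[i];
    }

    int ret = read_sketch(n, smps, cnt);

    for (int i = 0; i < ret; i++) /* process only the filled samples */
        *sum += smps[i]->value;

    for (unsigned i = 0; i < cnt; i++) /* release all cnt samples */
        smps[i]->refcnt--;

    return ret;
}
\end{lstlisting}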

\paragraph{Write-function} The write-function works in a similar fashion to the read-function and has identical parameters (line 2 in \autoref{lst:read_write_original}). The working principle of this function is depicted in \autoref{fig:villas_write}. When a super-node's path needs to write data to a node, it calls the write-function (\autoref{fig:villas_write_a}) and passes the total number of samples and the pointer to the first sample as arguments.


\begin{figure}[ht!]
	\vspace{-0.5cm}
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=1]{images/villas_write.pdf}
	    \vspace{-0.8cm}
        \caption{Invoking the write-function.}\label{fig:villas_write_a}
    \end{subfigure}
    \hspace*{\fill} % separation between the subfigures
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=2]{images/villas_write.pdf}
	    \vspace{-0.8cm}
        \caption{Return of the write-function.}\label{fig:villas_write_b}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \includegraphics[width=\linewidth]{images/villas_write_legend.pdf}
    \end{subfigure}
    \vspace{-1.5cm}
    \caption{A depiction of the working principle of the write-function in VILLASnode. This function is part of the interface between a super-node and a node.} \label{fig:villas_write}
\end{figure}

When the write-function is called, the node starts processing the samples by copying \texttt{cnt} samples to its send module and instructing it to send the data. The send module does not return until all samples have been copied to it and, for many node-types, not until the data has been sent successfully. When the send module is done (\autoref{fig:villas_write_b}), it lets the write-function return with the number of samples that have been sent successfully. Ideally, the returned value \textit{ret} is equal to the number of passed samples \texttt{cnt}. If this is not the case, the super-node will detect this and act upon a possible error. In all cases, the reference counter of all \texttt{cnt} samples is decremented by~1.

\subsection{Requirements for the read- and write-function of an InfiniBand node\label{sec:requirements}}
As discussed in the previous section, the reference counters of all samples that have been sent into the read- or write-functions are decreased after the functions return. For nodes with either a receive module that has a local buffer or with a send module which does not return until it has made a copy of the data or actually sent the data, this approach works exactly as intended. But, as soon as the modules are implemented by an architecture which is based on the \gls{via}---in this particular case the \gls{iba}---the implementation causes problems. To adhere to the zero-copy principle of the \gls{via}, data should not be copied from the super-node's buffer to a local buffer or the other way around. Rather, a pointer to, and the length of, a memory location should be passed to the network adapter, which then independently copies the data from the host's memory to its local buffers or the other way around.

In the following, the ideal situation for a read and write operation for the InfiniBand Architecture is presented. Although this approach is specific to the \gls{iba}, it can relatively easily be translated to other \glspl{via}. After the desired approach has been discussed, the next subsection will discuss the shortcomings of the parameters in \autoref{lst:read_write_original} that obstruct the implementation of this approach.

\paragraph{Read-function}
\Autoref{fig:villas_read_iba} depicts a super-node that reads from a node-type instance whose communication is based on the \gls{iba}. The receive module in this figure relies on the receive queue of an InfiniBand \gls{qp}. As explained in \autoref{sec:qp}, a queue pair cannot receive data unless its \gls{rq} holds receive \glspl{wqe}. Hence, work requests that point to buffers of the super-node have to be submitted.

\begin{figure}[ht!]
	\vspace{-0.4cm}
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=1]{images/villas_read_iba.pdf}
	    \vspace{-0.8cm}
        \caption{Invoking the read-function.}\label{fig:villas_read_iba_a}
    \end{subfigure}
    \hspace*{\fill} % separation between the subfigures
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=2]{images/villas_read_iba.pdf}
	    \vspace{-0.8cm}
        \caption{Return of the read-function.}\label{fig:villas_read_iba_b}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \includegraphics[width=\linewidth]{images/villas_read_iba_legend.pdf}
    \end{subfigure}
    \vspace{-1.5cm}
    \caption{A depiction of the working principle of the read-function in an \textit{InfiniBand} node. The \acrshort{rq} is part of a complete \acrshort{qp}, but the \acrshort{sq} is omitted for the sake of simplicity.} \label{fig:villas_read_iba}
\end{figure}

An important requirement for this node-type was that it should be compatible with the original node-type interface, or at least that the changes would be minimal. Hence, in order to acquire pointers to samples from the super-node, the \texttt{*smps[]} parameter from the read-function is used. Like the super-node in \autoref{fig:villas_read_a}, the super-node in \autoref{fig:villas_read_iba_a} starts by allocating $\mathtt{cnt} \geq 1$ empty samples, increasing their reference counters, and passing their pointers to the node's read-function. The node, in turn, takes the addresses of the samples, wraps them up in scatter/gather elements, places them in work requests, and submits them to the \gls{rq}. Now, when the \gls{hca} receives a message, it will write the data directly into the allocated memory within the super-node. In this way, an additional copy between the node and the super-node is avoided.

Since the receive module of an \textit{InfiniBand} node does not copy data to the passed samples, the return of the function in \autoref{fig:villas_read_iba_b} works fundamentally differently from the return of the function in \autoref{fig:villas_read_b}. If there are no \glspl{cqe} in the completion queue, i.e., if the \gls{hca} did not receive any data, the return value \textit{ret} of the node shall be 0. In that way, the super-node knows that the set of previously allocated \texttt{smps[]} does not hold any data. None of the buffers' reference counters shall be decreased since all buffers have been submitted to the \gls{rq} and the \gls{hca} will thus write data to them.

If \glspl{cqe} are available, pointers to samples which are submitted to the \gls{rq} (light gray in \autoref{fig:villas_read_iba}) are replaced by the pointers to the buffers that are filled by the HCA (dark gray in \autoref{fig:villas_read_iba}). The return value \textit{ret} shall be the number of pointers that have been replaced since these buffers now contain valid data that was sent to this node. The reference counters of these buffers must be decreased after they have been processed by the super-node.

Consequently, in order for the \textit{InfiniBand} node to be able to receive data, the super-node has to invoke the read-function at least once without acquiring any data. To store the pointers to the buffers in the \glspl{cqe}, the \gls{wr} C structure member \texttt{wr\_id} can be used (see \autoref{sec:postingWRs}).
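
The pointer exchange described above can be sketched without actual InfiniBand hardware. In the following illustration, the receive queue and the completion queue are replaced by simple ring buffers, and the \gls{hca} is emulated by the function \texttt{hca\_receive\_sketch()}. All identifiers are hypothetical and none of them belongs to the Verbs API; only the idea of carrying the buffer address in a \texttt{wr\_id} field is taken from the text. The sketch assumes that the queues are large enough for the workload.

\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 16

struct sample_sketch {
    int refcnt;
    double value;
};

/* Stand-in for a receive or completion queue. As with wr_id in a work
 * request, every entry carries the address of a buffer. */
struct ring_sketch {
    uint64_t wr_id[QUEUE_DEPTH];
    unsigned head, tail; /* pop at head, push at tail */
};

static int ring_push(struct ring_sketch *r, uint64_t id)
{
    if (r->tail - r->head == QUEUE_DEPTH)
        return -1; /* full */
    r->wr_id[r->tail++ % QUEUE_DEPTH] = id;
    return 0;
}

static int ring_pop(struct ring_sketch *r, uint64_t *id)
{
    if (r->head == r->tail)
        return -1; /* empty */
    *id = r->wr_id[r->head++ % QUEUE_DEPTH];
    return 0;
}

struct ib_node_sketch {
    struct ring_sketch rq, cq;
};

/* Emulates the HCA: consume one receive WQE, write the payload directly
 * into the posted buffer, and generate a completion queue entry. */
static int hca_receive_sketch(struct ib_node_sketch *n, double payload)
{
    uint64_t id;

    if (ring_pop(&n->rq, &id) != 0)
        return -1; /* no receive WQE available */

    ((struct sample_sketch *) (uintptr_t) id)->value = payload;
    return ring_push(&n->cq, id);
}

/* Read-function: post all passed buffers to the RQ, then replace the
 * first ret pointers by buffers that were filled in the meantime. */
static int ib_read_sketch(struct ib_node_sketch *n,
                          struct sample_sketch *smps[], unsigned cnt)
{
    unsigned ret = 0;
    uint64_t id;

    if (QUEUE_DEPTH - (n->rq.tail - n->rq.head) < cnt)
        return -1; /* not enough room in the RQ; nothing is posted */

    for (unsigned i = 0; i < cnt; i++)
        ring_push(&n->rq, (uint64_t) (uintptr_t) smps[i]);

    while (ret < cnt && ring_pop(&n->cq, &id) == 0)
        smps[ret++] = (struct sample_sketch *) (uintptr_t) id;

    return (int) ret;
}
\end{lstlisting}

In this sketch, the first call of \texttt{ib\_read\_sketch()} returns 0 because it only posts buffers. Only the first \textit{ret} entries of \texttt{smps[]} may be processed and released after a later call; the remaining entries still belong to the receive queue.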

\paragraph{Write-function} The write-function, depicted in \autoref{fig:villas_write_iba}, has to adhere to similar conventions as the read-function in order to realize zero-copy. Again, the addresses of the samples are passed to the node as arguments of the write-function, to be subsequently submitted to the \gls{sq}. The \gls{hca} will then process the submitted work requests and take care of the necessary memory operations.

\begin{figure}[ht!]
	\vspace{-0.5cm}
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=1]{images/villas_write_iba.pdf}
        \vspace{-0.8cm}
        \caption{Invoking the write-function.}\label{fig:villas_write_iba_a}
    \end{subfigure}
    \hspace*{\fill} % separation between the subfigures
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth, page=2]{images/villas_write_iba.pdf}
        \vspace{-0.8cm}
        \caption{Return of the write-function.}\label{fig:villas_write_iba_b}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \includegraphics[width=\linewidth]{images/villas_write_iba_legend.pdf}
    \end{subfigure}
    \vspace{-1.5cm}
    \caption{A depiction of the working principle of the write-function in an \textit{InfiniBand} node. The \acrshort{sq} is part of a complete \acrshort{qp}, but the \acrshort{rq} is omitted for the sake of simplicity.} \label{fig:villas_write_iba}
\end{figure}

When the pointers are successfully submitted to the \gls{sq}, the function shall return the total number of submitted pointers \textit{ret}. If the completion queue is empty, none of these pointers may be released because the HCA has yet to access the memory locations. If the completion queue contains entries, that means that previously submitted send \glspl{wr} are finished; these pointers can be released. So, in order to release them, the initial pointers to the data to be sent (light gray in \autoref{fig:villas_write_iba}) are replaced by pointers to buffers which were submitted to the \gls{sq} in a previous call of the write-function. The super-node has to be notified that it must only decrease the reference counter of pointers that were yielded by the \glspl{cqe}.

\subsection{Proposal for a new read- and write-function\label{sec:proposal}}
Evidently, the major shortcoming of the functions from \autoref{lst:read_write_original} is the lack of an interface to pass the number of samples to be released to the super-node. There is no way the super-node can predict how many samples may be released; this becomes even more difficult when taking into account that some samples may be sent inline, and can thus be released immediately after the \gls{wr} has been submitted, and that some work requests may not be successfully submitted to the \gls{sq}.

Therefore, new parameters for the read- and write-function are proposed in \autoref{lst:read_write_proposal}. The additional parameter in each function lets a node decide how many items of \texttt{smps[]} should actually be released. The several distinctions which have to be considered are further elaborated upon in \autoref{sec:villas_implementation}.

\begin{figure}[ht!]
    \vspace{0.5cm}
    \lstinputlisting[caption=Proposal for an additional parameter in \texttt{read()} and \texttt{write()}.,
            		 label=lst:read_write_proposal,
                     style=customc]{listings/read_write_proposal.h}
    \vspace{-0.2cm}
\end{figure}
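
How such an additional parameter could be used is sketched below for the write-function. The sketch assumes an output parameter named \texttt{release}; this name, its type, and all other identifiers are assumptions made for this illustration, and the authoritative parameters are those in \autoref{lst:read_write_proposal}. As before, the send queue and the completion queue are emulated by ring buffers, and inline sends as well as partially failed submissions are left out.

\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 16

struct sample_sketch {
    int refcnt;
};

struct ring_sketch {
    uint64_t wr_id[QUEUE_DEPTH];
    unsigned head, tail; /* pop at head, push at tail */
};

static unsigned ring_used(const struct ring_sketch *r)
{
    return r->tail - r->head;
}

static void ring_push(struct ring_sketch *r, uint64_t id)
{
    assert(ring_used(r) < QUEUE_DEPTH);
    r->wr_id[r->tail++ % QUEUE_DEPTH] = id;
}

static uint64_t ring_pop(struct ring_sketch *r)
{
    assert(ring_used(r) > 0);
    return r->wr_id[r->head++ % QUEUE_DEPTH];
}

struct ib_node_sketch {
    struct ring_sketch sq, cq;
};

/* Emulates the HCA finishing the oldest send work request. */
static int hca_send_complete_sketch(struct ib_node_sketch *n)
{
    if (ring_used(&n->sq) == 0)
        return -1;
    ring_push(&n->cq, ring_pop(&n->sq));
    return 0;
}

/* Write-function with an assumed additional parameter: on return,
 * smps[0] .. smps[*release - 1] are the samples that may be released. */
static int ib_write_sketch(struct ib_node_sketch *n,
                           struct sample_sketch *smps[], unsigned cnt,
                           unsigned *release)
{
    if (QUEUE_DEPTH - ring_used(&n->sq) < cnt) {
        *release = cnt; /* nothing submitted: release all passed samples */
        return 0;
    }

    for (unsigned i = 0; i < cnt; i++)
        ring_push(&n->sq, (uint64_t) (uintptr_t) smps[i]);

    /* Hand back buffers of send work requests that have completed. */
    *release = 0;
    while (*release < cnt && ring_used(&n->cq) > 0)
        smps[(*release)++] =
            (struct sample_sketch *) (uintptr_t) ring_pop(&n->cq);

    return (int) cnt;
}

/* Super-node side: release exactly the samples the node handed back. */
static void release_sketch(struct sample_sketch *smps[], unsigned release)
{
    for (unsigned i = 0; i < release; i++)
        smps[i]->refcnt--;
}
\end{lstlisting}

With this separation, the return value keeps its original meaning (the number of samples accepted for sending), whereas \texttt{release} tells the super-node how many entries of \texttt{smps[]} it may release.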

\section{Memory management\label{sec:memorymanagement}}
Originally, memory that was allocated within the framework could be allocated with a fixed set of settings called \textit{memory-types}. The VILLASnode internal \texttt{alloc()} could be called, for example, with \texttt{memory\_hugepage}, which pins memory and maps it to hugepages (see \autoref{sec:mem_optimization}), or with \texttt{memory\_heap}, which allocates aligned memory on the heap. These embedded memory-types are not sufficient for the \textit{InfiniBand} node-type. \Autoref{sec:requirements} already showed that the \gls{hca} will directly access the memory that is allocated by the super-node. Thus, as follows from \autoref{sec:memory}, the buffer must be registered with a memory region and the \glspl{wr} that are submitted to either queue of the \gls{qp} must contain the local key.

Since embedding a memory-type for every node-type in the VILLASnode source code would go against the principle of modularity, this is not an option. Consequently, the most obvious solution is to allow every node-type to register its own memory-type if necessary. In that way, every node-type can define exactly what the \texttt{alloc()} and \texttt{free()} functions implement. For \texttt{alloc()}, a node-type can, for example, define how memory should be allocated, whether the pages should be aligned, how big the pages should be, and whether the memory should be registered with a memory region. It is also possible for a node-type to implement certain functions which interact with the memory that is allocated by the memory-type; this can, for example, be used within the \textit{InfiniBand} node to acquire the local key of a sample that is passed as an argument of the read- or write-function.

With this method, every node-type may define a \texttt{memory\_type} C structure, which it must register in the same fashion as it registers the interface functions with the super-node (line 39, \autoref{lst:struct_nodetype}). By enabling node-types to register their own memory-type, the super-node knows what type of memory to use for input and/or output buffers that are connected to nodes of this type ($q_{i,x}$ and $q_{o,x}$ in \autoref{fig:villasnode}).

If no memory-type is specified, the super-node will assume \texttt{memory\_hugepage}.
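
The registration mechanism can be illustrated with the following sketch. The structure layout and all identifiers (\texttt{memory\_type\_sketch}, \texttt{alloc\_fn}, \texttt{free\_fn}, \texttt{select\_memory\_type\_sketch()}) are assumptions made for this illustration and do not reproduce the actual \texttt{memory\_type} C structure. The example memory-type merely allocates aligned memory on the heap; a memory-type for the \textit{InfiniBand} node would additionally register the allocated buffer with a memory region.

\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical memory-type descriptor with node-type specific
 * allocation and deallocation functions. */
struct memory_type_sketch {
    const char *name;
    size_t alignment;
    void *(*alloc_fn)(struct memory_type_sketch *m, size_t len);
    int (*free_fn)(struct memory_type_sketch *m, void *ptr, size_t len);
};

/* Example memory-type: aligned memory on the heap (C11 aligned_alloc).
 * The length is rounded up to a multiple of the alignment. */
static void *heap_alloc_sketch(struct memory_type_sketch *m, size_t len)
{
    size_t rounded = (len + m->alignment - 1) / m->alignment * m->alignment;

    return aligned_alloc(m->alignment, rounded);
}

static int heap_free_sketch(struct memory_type_sketch *m, void *ptr,
                            size_t len)
{
    (void) m;
    (void) len;
    free(ptr);
    return 0;
}

static struct memory_type_sketch memory_aligned_sketch = {
    "aligned-heap", 64, heap_alloc_sketch, heap_free_sketch
};

/* Hypothetical excerpt of a node-type: the memory-type is optional. */
struct node_type_sketch {
    const char *name;
    struct memory_type_sketch *memory_type; /* NULL: use the default */
};

/* The super-node falls back to a default memory-type if the node-type
 * did not register one. */
static struct memory_type_sketch *
select_memory_type_sketch(const struct node_type_sketch *vt,
                          struct memory_type_sketch *fallback)
{
    return vt->memory_type != NULL ? vt->memory_type : fallback;
}
\end{lstlisting}

In VILLASnode, the fallback would be \texttt{memory\_hugepage}; in the sketch it is passed as an argument to keep the example self-contained.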

\section{VILLASnode finite-state machine\label{sec:villas_fsm}}
Initially, a node could reside in one of the six states displayed in \autoref{lst:states}. The super-node transitions the node through the states depending on the results of functions from \autoref{a:nodetype_functions}. For example, when the super-node calls a node's start-function, the transition \textit{checked}$\,\to\,$\textit{started} is performed if the function returns successfully.

\begin{figure}[ht!]
    \vspace{0.5cm}
    \lstinputlisting[caption=The six states a node could originally reside in.,
            		 label=lst:states,
                     style=customc]{listings/states.h}
    \vspace{-0.2cm}
\end{figure}

These states were sufficient for the node-types which existed up to now (\autoref{tab:villasnode_nodes}); when a node resided in \textit{started}, this meant it was ready to send and receive data. This is not the case for node-types that are based on (descendants of) the Virtual Interface Architecture. Here, a node can be initiated (for which the \textit{started} state can be used) but not yet connected, and thus not able to send data to another node. Accordingly, the introduction of a new state \textit{connected} would be appropriate. Furthermore, architectures that are based on the \gls{via} rely on descriptors (called work requests in the \gls{iba}) in a send and receive queue. Hence, in order to be able to receive data directly after the connection has been established, descriptors have to be present in the \gls{rq} at this moment. For this reason, in (descendants of) the \gls{via}, it is possible to prepare elements in the receive queue prior to the actual connection.

These considerations yield the finite-state machine in \autoref{fig:villasnode_states}. The states which are indicated with dashed borders, \textit{pending connect} and \textit{connected}, may be set by the node after the super-node has transitioned the instance to the \textit{started} state. The use of both states is not mandatory. If a node is in one of these two states, the super-node treats it as if it were in the \textit{started} state. However, they can be used within the node itself to distinguish between a node being started, being in a pending connect state, or actually being connected. This state machine shows similarities with the \gls{via}'s finite-state machine in \autoref{fig:via_diagram}. It can therefore also be used for future node-types that are based on the \gls{via}, other than the \textit{InfiniBand} node-type that is presented in the present work.

Although it is necessary to execute the transition \textit{checked}$\,\to\,$\textit{started}, it is possible to transition to \textit{stopped} and \textit{destroyed} from any of the three states in the dashed square.

\begin{figure}[ht]
    \vspace{-0.65cm}
	\hspace{0.4cm}
	\includegraphics{images/villasnode_states.pdf}
	\vspace{-0.45cm}
    \caption{The VILLASnode state diagram with the two newly introduced states \textit{pending connect} and \textit{connected}.}
    \label{fig:villasnode_states}
\end{figure}
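
The relation between the original and the two new states can be expressed compactly in code. The following sketch uses hypothetical identifiers; the names of the states \textit{initialized} and \textit{parsed} in particular are assumptions, since only \textit{checked}, \textit{started}, \textit{stopped}, and \textit{destroyed} are named in the text. It covers only the transitions discussed in this section.

\begin{lstlisting}[style=customc]
#include <stdbool.h>

enum state_sketch {
    STATE_SKETCH_DESTROYED,
    STATE_SKETCH_INITIALIZED,     /* assumed name */
    STATE_SKETCH_PARSED,          /* assumed name */
    STATE_SKETCH_CHECKED,
    STATE_SKETCH_STARTED,
    STATE_SKETCH_STOPPED,
    STATE_SKETCH_PENDING_CONNECT, /* new, set by the node itself */
    STATE_SKETCH_CONNECTED        /* new, set by the node itself */
};

/* View of the super-node: both new states count as started. */
static bool is_started_sketch(enum state_sketch s)
{
    return s == STATE_SKETCH_STARTED ||
           s == STATE_SKETCH_PENDING_CONNECT ||
           s == STATE_SKETCH_CONNECTED;
}

/* The start-function may only be called on a checked node. */
static bool may_start_sketch(enum state_sketch s)
{
    return s == STATE_SKETCH_CHECKED;
}

/* A node may be stopped from any of the three started-like states. */
static bool may_stop_sketch(enum state_sketch s)
{
    return is_started_sketch(s);
}
\end{lstlisting}

Because the super-node only evaluates \texttt{is\_started\_sketch()} in this sketch, node-types that never set one of the two new states behave exactly as before.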