Initial commit of master's thesis

This is the version I submitted to RWTH Aachen University on November 9,
2018.
Dennis Potter 2018-11-12 12:56:59 +01:00
parent ffbcce77f9
commit af25b4b828
1136 changed files with 127398252 additions and 2 deletions
.gitignore
Makefile
README.md
abstract.tex
abstract/
appendices.tex
appendices/
biblatex.cfg
bibliography.bib
chapters.tex
chapters/
glossary/
images/
listings/

256
.gitignore vendored Normal file

@ -0,0 +1,256 @@
## VS Code files
.vscode/
## Autogenerated files
*/build/
## Python
__pycache__
plots/*.py
*.ipynb_checkpoints
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2
.*.lb
## Intermediate documents:
*.dvi
*.xdv
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex(busy)
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Build tool directories for auxiliary files
# latexrun
latex.out/
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# achemso
acs-*.bib
# amsthm
*.thm
# beamer
*.nav
*.pre
*.snm
*.vrb
# changes
*.soc
# cprotect
*.cpt
# elsarticle (documentclass of Elsevier journals)
*.spl
# endnotes
*.ent
# fixme
*.lox
# feynmf/feynmp
*.mf
*.mp
*.t[1-9]
*.t[1-9][0-9]
*.tfm
# (r)(e)ledmac/(r)(e)ledpar
*.end
*.?end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
*.glsdefs
# gnuplottex
*-gnuplottex-*
# gregoriotex
*.gaux
*.gtex
# htlatex
*.4ct
*.4tc
*.idv
*.lg
*.trc
*.xref
# hyperref
*.brf
# knitr
*-concordance.tex
# TODO Comment the next line if you want to keep your tikz graphics files
*.tikz
*-tikzDictionary
# listings
*.lol
# makeidx
*.idx
*.ilg
*.ind
*.ist
# minitoc
*.maf
*.mlf
*.mlt
*.mtc[0-9]*
*.slf[0-9]*
*.slt[0-9]*
*.stc[0-9]*
# minted
_minted*
*.pyg
# morewrites
*.mw
# nomencl
*.nlg
*.nlo
*.nls
# pax
*.pax
# pdfpcnotes
*.pdfpc
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# scrwfile
*.wrt
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# pdfcomment
*.upa
*.upb
# pythontex
*.pytxcode
pythontex-files-*/
# thmtools
*.loe
# TikZ & PGF
*.dpth
*.md5
*.auxlock
# todonotes
*.tdo
# easy-todo
*.lod
# xmpincl
*.xmpi
# xindy
*.xdy
# xypic precompiled matrices
*.xyc
# endfloat
*.ttt
*.fff
# Latexian
TSWLatexianTemp*
## Editors:
# WinEdt
*.bak
*.sav
# Texpad
.texpadtmp
# LyX
*.lyx~
# Kile
*.backup
# KBibTeX
*~[0-9]*
# auto folder when using emacs and auctex
./auto/*
*.el
# expex forward references with \gathertags
*-tags.tex
# standalone packages
*.sta
# PDF
*.pdf

40
Makefile Normal file

@ -0,0 +1,40 @@
MAINTEX := thesis.tex
LATEX   := lualatex
FLAGS   := -quiet -shell-escape

.PHONY : default help pdf verbose clean veryclean

default :
	cd scripts && $(MAKE)
	cd images && $(MAKE)
	cd plots && $(MAKE)
	latexmk -$(LATEX) $(FLAGS) $(MAINTEX)

help :
	@echo ""
	@echo "This Makefile creates the PDF of the thesis by using 'latexmk'"
	@echo "  make           : Generate PDF of the thesis"
	@echo "  make pdf       : Generate PDF of the thesis (forced mode)"
	@echo "  make verbose   : Show output from latex compiler"
	@echo "  make clean     : Delete temporary files"
	@echo "  make veryclean : Delete temporary files including PDF"
	@echo ""

pdf :
	latexmk -g -$(LATEX) $(FLAGS) $(MAINTEX)

verbose :
	latexmk -g -$(LATEX) $(FLAGS) -verbose $(MAINTEX)

clean :
	latexmk -c
	cd scripts && $(MAKE) clean
	cd images && $(MAKE) clean
	cd plots && $(MAKE) clean
	rm -f appendices/*.aux chapters/*.aux
	rm -f *.lol *.fls thesis-blx.bib *.xml *.bbl *.nlo *.nls *.acn *.acr *.alg *.glo *.ist *.tdo

veryclean : clean
	latexmk -C

README.md

@ -1,2 +1,8 @@
# masters-thesis
## Setup
Add this to `~/.latexmkrc`:
```perl
add_cus_dep('acn', 'acr', 0, 'makeacn2acr');
sub makeacn2acr {
system("makeindex -s \"$_[0].ist\" -t \"$_[0].alg\" -o \"$_[0].acr\" \"$_[0].acn\"");
}
```
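If the thesis also builds a main glossary via `makeindex`, latexmk needs an analogous custom dependency for the `.glo`/`.gls` pair. The following is a sketch modeled on the `acn`/`acr` rule above; it assumes the glossary uses the same `makeindex` backend and the default `.ist` style file (the rule and log-file extension `.glg` are the usual `glossaries`-package conventions, not confirmed by this repository):

```perl
# Hypothetical companion rule: build the main glossary (.glo -> .gls).
# Assumes the makeindex-based 'glossaries' workflow used above.
add_cus_dep('glo', 'gls', 0, 'makeglo2gls');
sub makeglo2gls {
    system("makeindex -s \"$_[0].ist\" -t \"$_[0].glg\" -o \"$_[0].gls\" \"$_[0].glo\"");
}
```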

13
abstract.tex Normal file

@ -0,0 +1,13 @@
% English abstract
\begin{otherlanguage}{english}
\begin{abstract}
\input{abstract/english}
\end{abstract}
\end{otherlanguage}
% German abstract
\begin{otherlanguage}{ngerman}
\begin{abstract}
\input{abstract/german}
\end{abstract}
\end{otherlanguage}

5
abstract/english.tex Normal file

@ -0,0 +1,5 @@
The present work evaluates the feasibility and added value of InfiniBand-based communication in the co-simulation framework VILLASframework and its simulation data gateway VILLASnode. InfiniBand is characterized by its high throughput and low latencies, which make it particularly suitable for the hard real-time requirements of VILLASnode. It allows applications on different host systems to communicate with each other without many of the latency bottlenecks that are present in other technologies such as Ethernet.
The present work shows that---with some optimizations---sub-microsecond latencies were achievable in a benchmark that mimics the characteristics of the co-simulation framework. After presenting how InfiniBand was integrated into the framework with only minor adjustments to the existing communication \acrshort{api}, it shows how the newly implemented interface performs compared to the existing ones.
The results showed that, regarding latency, the InfiniBand interface performed more than one order of magnitude better than VILLASnode's other interfaces that enable server-server communication. Furthermore, much higher transmission rates could be achieved, and the predictability of the latency improved substantially. Its latencies, which lie between \SI{1.7}{\micro\second} and \SI{4.9}{\micro\second}, were only 1.5--\SI{2.5}{\micro\second} worse than the zero-latency reference, in which VILLASnode uses the \textit{\acrshort{posix} shared memory} \acrshort{api} to communicate. However, since the shared memory interface is only supported when the different VILLASnode instances are located on the same computer, the InfiniBand interface turned out to have the lowest latency of the currently implemented server-server interfaces.

5
abstract/german.tex Normal file

@ -0,0 +1,5 @@
Die vorliegende Arbeit thematisiert die Realisierbarkeit und den Mehrwert einer auf InfiniBand basierenden Kommunikation in dem Co-Simulationsframework VILLASframework und insbesondere seiner Simulationsdatenschnittstelle VILLASnode. Charakteristisch für die Datenübertragungstechnik InfiniBand sind hohe Durchsatzraten und niedrige Latenzzeiten, welche sie für die harten Echtzeitanforderungen von VILLASnode besonders geeignet machen. Die Technik ermöglicht es Anwendungen auf verschiedenen Hostrechnern, miteinander zu kommunizieren, ohne dabei die Engpässe anderer Datenübertragungstechniken, wie zum Beispiel Ethernet, zu spüren.
Ein Mess- und Bewertungsverfahren, welches das Verhalten des \linebreak Co-Simulationsframeworks nachahmt und im Rahmen dieser Arbeit entwickelt wurde, zeigt, dass nach Optimierung Latenzen im Submikrosekundenbereich möglich waren. Nachdem die Arbeit sich damit auseinandergesetzt hat, wie InfiniBand mit minimalen Änderungen der Programmierschnittstelle in das Framework integriert wurde, stellt sie die implementierte Technik den existierenden Techniken gegenüber.
Wie sich herausstellt, sind die Latenzzeiten der InfiniBand-Übertragungstechnik in VILLASnode um mehr als eine Grö\ss enordnung niedriger als die Latenzzeiten der existierenden Techniken, die Kommunikation zwischen verschiedenen Hostrechnern ermöglichen. Au\ss erdem ermöglicht InfiniBand eine höhere Prognostizierbarkeit der Latenzen, und es können erheblich höhere Übertragungsraten bewältigt werden. Darüber hinaus sind die Latenzzeiten, die zwischen \SI{1.7}{\micro\second} und \SI{4.9}{\micro\second} liegen, lediglich 1.5--\SI{2.5}{\micro\second} grö\ss er als die der Null-Latenz-Referenz, die jedoch die \textit{\acrshort{posix} shared memory} Programmierschnittstelle zur Datenübertragung nutzt. Da diese Schnittstelle nur zur Kommunikation zwischen VILLASnode-Instanzen auf demselben Rechner genutzt werden kann, kann gefolgert werden, dass die InfiniBand-Schnittstelle die niedrigste Latenz der gegenwärtigen Rechner-Rechner-Schnittstellen aufweist.

20
appendices.tex Normal file

@ -0,0 +1,20 @@
\bookmarksetupnext{level=part}
\appendices
\addtocontents{toc}{\protect\setcounter{tocdepth}{2}}
\makeatletter
\addtocontents{toc}{%
\begingroup
\let\protect\l@chapter\protect\l@section
\let\protect\l@section\protect\l@subsection
}
\makeatother
\input{appendices/verbs}
\input{appendices/tuned}
\input{appendices/nodetype_interface}
\input{appendices/villas_structs}
\input{appendices/infiniband_configuration}
\input{appendices/results_benchmarks}
\bookmarksetupnext{startatroot}
\addtocontents{toc}{\endgroup}
\endappendices
\backmatter

9
appendices/infiniband_configuration.tex Normal file

@ -0,0 +1,9 @@
\chapter{InfiniBand node configuration\label{a:infiniband_config}}
\begin{figure}[ht!]
\vspace{-0.0cm}
\lstinputlisting[caption=The configuration that was used to examine the InfiniBand node-type with the benchmark from \autoref{fig:villas_benchmark}. The bash variables were replaced by a script that controlled the benchmark.,
label=lst:infiniband_config,
style=customconfig]{listings/infiniband.conf}
\vspace{-1.4cm}
\end{figure}

3
appendices/nodetype_interface.tex Normal file

@ -0,0 +1,3 @@
\chapter{VILLASnode node-type interface\label{a:nodetype_functions}}
\input{scripts/build/nodetype_functions}

204
appendices/results_benchmarks.tex Normal file

@ -0,0 +1,204 @@
\chapter{Results benchmarks\label{a:results_benchmarks}}
\section{Influence of CQEs on latency of RDMA write\label{a:oneway_unsignaled_rdma}}
\input{tables/oneway_settings_unsignaled_rdma}
\begin{figure}[ht!]
\vspace{1.5cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_unsignaled_rdma_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_unsignaled_rdma_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_unsignaled_rdma}. These were used to analyze the difference in latency between messages that did and did not cause a \acrfull{cqe}. The \textit{\gls{rdma} write} operation mode was used in this test.}\label{fig:oneway_unsignaled_rdma}
\end{figure}
\newpage
\section{Influence of constant burst size on latency\label{a:oneway_message_size_inline}}
\input{tables/oneway_settings_message_size_inline}
\begin{figure}[ht!]
\begin{subfigure}{0.351\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_inline_median/plot_0.pdf}
\caption{\gls{rc}}\label{fig:oneway_message_size_inline_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_inline_median/plot_1.pdf}
\caption{\gls{uc}}\label{fig:oneway_message_size_inline_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_inline_median/plot_2.pdf}
\caption{\gls{ud}}\label{fig:oneway_message_size_inline_c}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\centering
\vspace{0.15cm}
\includegraphics{plots/oneway_message_size_inline_median/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_message_size_inline}. While a triangle indicates $\tilde{t}_{lat}$ for a certain message size, the error bars indicate the upper and lower 10\% of $t_{lat}$ for that message size.}\label{fig:oneway_message_size_inline}
\end{figure}
\newpage
\section{Influence of intermediate pauses on latency\label{a:oneway_message_size_wait}}
\input{tables/oneway_settings_message_size_wait}
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{subfigure}{0.351\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_wait_median/plot_0.pdf}
\caption{\gls{rc}}\label{fig:oneway_message_size_wait_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_wait_median/plot_1.pdf}
\caption{\gls{uc}}\label{fig:oneway_message_size_wait_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_wait_median/plot_2.pdf}
\caption{\gls{ud}}\label{fig:oneway_message_size_wait_c}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\centering
\vspace{0.15cm}
\includegraphics{plots/oneway_message_size_wait_median/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_message_size_wait}. While a triangle indicates $\tilde{t}_{lat}$ for a certain message size, the error bars indicate the upper and lower 10\% of $t_{lat}$ for that message size.}\label{fig:oneway_message_size_wait}
\vspace{-0.5cm}
\end{figure}
\newpage
\section{Comparison of timer functions\label{a:timer_comparison}}
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_wo_optimizations/infiniband_RC_0i_0j.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_wo_optimizations/infiniband_RC_1i_0j.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_c}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_w_optimizations/infiniband_RC_0i_0j.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_d}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_w_optimizations/infiniband_RC_1i_0j.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_timer_comparison_w_optimizations/histogram_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Comprehensive plots of the results from \autoref{tab:timer_comparison}. Subfigures (a) and (b) show the results in the unoptimized environment with \texttt{timerfd} and \gls{tsc}, respectively. Subfigures (c) and (d) show the results for the same settings, but in the optimized environment.}\label{fig:timer_comparison}
\vspace{-3.0cm}
\end{figure}
\newpage
\section{3D plots InfiniBand nodes (UC \& UD)\label{a:rate_size_3d_UC_UD}}
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_IB/median_3d_graph_UC.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\vspace{0.2cm}
\centering
\includegraphics{plots/nodetype_3d_IB/3d_UC_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the message size and generation rate on $\tilde{t}_{lat}$ between two InfiniBand nodes that communicate over an \acrfull{uc}.}\label{fig:rate_size_3d_UC}
\end{figure}
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_IB/median_3d_graph_UD.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\vspace{0.2cm}
\centering
\includegraphics{plots/nodetype_3d_IB/3d_UD_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the message size and generation rate on $\tilde{t}_{lat}$ between two InfiniBand nodes that communicate over \acrfull{ud}.}\label{fig:rate_size_3d_UD}
\end{figure}
\newpage
\section{3D plot shmem node\label{a:shmem_3d}}
\begin{figure}[ht!]
\vspace{5.5cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_shmem/median_3d_graph_XX.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\vspace{0.2cm}
\centering
\includegraphics{plots/nodetype_3d_shmem/3d_XX_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the signal generation rate and the message size on the median latency between two \textit{shmem} nodes.}\label{fig:shmem_3d}
\end{figure}
\newpage
\section{Missed steps nanomsg and zeromq nodes\label{a:missed_steps_nanomsg_zeromq}}
\input{tables/missed_steps_nanomsg_zeromq}

10
appendices/tuned.tex Normal file

@ -0,0 +1,10 @@
\chapter{Tuned daemon profile\label{a:tuned_profile}}
This appendix shows the \textit{latency-performance} \texttt{tuned} profile that was used for the benchmarks run on the \glspl{hca} and VILLASnode.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The \texttt{tuned} default profile \textit{latency-performance}. Comments are omitted for the sake of brevity.,
label=lst:tuned_latency_performance,
style=customconfig]{listings/tuned_latency_performance.conf}
\vspace{-0.2cm}
\end{figure}

12
appendices/verbs.tex Normal file

@ -0,0 +1,12 @@
\chapter{OpenFabrics Verbs\label{a:openfabrics}}
Experimental functions are not included in this appendix. Furthermore, the \gls{rdma} verbs \gls{api} is omitted because it is not used in the present work. Comprehensive documentation of all verbs can be found in the \gls{rdma} Aware Networks Programming User Manual~\cite{mellanox2015RDMA}.
\section{IB verbs API}
This section presents the default InfiniBand verbs API.
\input{scripts/build/ib_verbs}
\newpage
\section{RDMA CM API}
This section presents the RDMA communication manager API, as presented in \autoref{sec:rdmacm}.
\input{scripts/build/rdma_cm_verbs}

28
appendices/villas_structs.tex Normal file

@ -0,0 +1,28 @@
\chapter{VILLASnode structs\label{a:villas_structs}}
This appendix presents a few structures which help to understand the VILLASnode architecture from \autoref{chap:architecture}. A full overview of all header files can be found on the VILLASnode Git repository\footnote{\url{https://git.rwth-aachen.de/acs/public/villas/VILLASnode}}.
\section{\texttt{struct sample}\label{a:sec:structsample}}
\begin{figure}[ht!]
\lstinputlisting[caption=The C structure of a VILLASnode sample.,
label=lst:struct_sample,
style=customc]{listings/struct_sample.h}
\vspace{-0.2cm}
\end{figure}
\newpage
\section{\texttt{struct node}\label{a:sec:structnode}}
\begin{figure}[ht!]
\lstinputlisting[caption=The C structure of a VILLASnode node.,
label=lst:struct_node,
style=customc]{listings/struct_node.h}
\vspace{-0.2cm}
\end{figure}
\newpage
\section{\texttt{struct node\_type}\label{a:sec:structnodetype}}
\begin{figure}[ht!]
\lstinputlisting[caption=The C structure of a VILLASnode node-type.,
label=lst:struct_nodetype,
style=customc]{listings/struct_nodetype.h}
\vspace{-0.2cm}
\end{figure}

19
biblatex.cfg Normal file

@ -0,0 +1,19 @@
\NewBibliographyString{noauthor}
\NewBibliographyString{noeditor}
\NewBibliographyString{nodate}
\NewBibliographyString{notitle}
\NewBibliographyString{nolocation}
\NewBibliographyString{nopublisher}
\DefineBibliographyStrings{english}{%
noauthor = {s\adddot a\adddot},
noeditor = {s\adddot ed\adddot},
nodate = {s\adddot a\adddot},
notitle = {s\adddot t\adddot},
nolocation = {s\adddot l\adddot},
nopublisher = {s\adddot n\adddot},
}
\newcommand*\nosomethings{noauthor,noeditor,nodate,notitle,nolocation,nopublisher}
\@for \xx:=\nosomethings \do {%
\expandafter\ifcsname\xx\endcsname\relax\else
\expandafter\expandafter\expandafter\expandafter\edef\csname\xx\endcsname{\noexpand\bibstring{\xx}}%
\fi}
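The loop above turns each bibliography-string name into a macro of the same name, so a `.bib` entry can reference the string directly. A hypothetical entry illustrating this (the entry key, title, and organization are invented for illustration; with the definitions above, `author={\noauthor}` makes biblatex print the `noauthor` string, "s.a.", in place of the author):

```bibtex
% Hypothetical entry; \noauthor expands to the 'noauthor'
% bibliography string defined in biblatex.cfg when typeset.
@manual{example2018anon,
  author       = {\noauthor},
  title        = {{Some Specification Without a Named Author}},
  organization = {Example Organization},
  year         = {2018}
}
```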

482
bibliography.bib Normal file

@ -0,0 +1,482 @@
%
% $Description: ACS Thesis Bibliography$
%
% $Author: pickartz $
% $Date: 2015/04/23 $
% $Revision: 0.1 $
%
@manual{compaq1997microsoft,
author={\noauthor},
title={{Virtual Interface Architecture Specification}},
organization={Compaq, Intel, Microsoft},
note={Version 1.0},
month={12},
year={1997}
}
@article{dunning1998virtual,
title={{The Virtual Interface Architecture}},
author={Dunning, Dave and Regnier, Greg and McAlpine, Gary and Cameron, Don and Shubert, Bill and Berry, Frank and Merritt, Anne Marie and Gronke, Ed and Dodd, Chris},
journal={IEEE micro},
volume={18},
number={2},
pages={66--76},
year={1998},
month={3},
publisher={IEEE},
ISSN={0272-1732},
doi={10.1109/40.671404}
}
@article{pfister2001introduction,
title={{An Introduction to the Infiniband™ Architecture}},
author={Pfister, Gregory F},
journal={High Performance Mass Storage and Parallel I/O},
volume={42},
pages={617--632},
year={2001},
publisher={chapter42}
}
@book{tanenbaum2014modern,
title={{Modern Operating Systems}},
author={Tanenbaum, Andrew S and Bos, Herbert},
year={2014},
isbn={978-0-13-359162-0},
edition={4},
publisher={Pearson Education, Inc}
}
@book{kozierok2005tcp,
title={{The TCP/IP-Guide: A Comprehensive, Illustrated Internet Protocols Reference}},
author={Kozierok, Charles M},
year={2005},
isbn={978-1593270476},
publisher={No Starch Press}
}
@manual{infinibandvol1,
author={\noauthor},
title={{InfiniBand\texttrademark~Architecture Specification, Volume 1}},
organization={InfiniBand Trade Association and others},
note={Release 1.2.1},
month={11},
year={2007}
}
@manual{infinibandvol2,
author={\noauthor},
title={{InfiniBand\texttrademark~Architecture Specification Volume 2}},
organization={InfiniBand Trade Association and others},
note={Release 1.3.1},
month={11},
year = {2016}
}
@article{grun2010introduction,
title={{Introduction to InfiniBand for End Users}},
author={Grun, Paul},
organization={InfiniBand Trade Association},
year={2010}
}
@techreport{crupnicoff2005deploying,
title={{Deploying Quality of Service and Congestion Control in InfiniBand-based Data Center Networks}},
author={Crupnicoff, Diego and Das, Sujal and Zahavi, Eitan},
organization={{Mellanox Technologies}},
number={2379},
year={2005},
}
@manual{eui64,
author={\noauthor},
title={{Guidelines for Use of Extended Unique Identifier (EUI), Organizationally Unique Identifier (OUI), and Company ID (CID)}},
organization={Institute of Electrical and Electronics Engineers},
month={8},
year={2017}
}
%%%%%%%%%%%%%%%%%%%% INTRODUCTION %%%%%%%%%%%%%%%%%%%
@article{strasser2015review,
title={{A Review of Architectures and Concepts for Intelligence in Future Electric Energy Systems}},
author={Strasser, Thomas and Andr{\'e}n, Filip and Kathan, Johannes and Cecati, Carlo and Buccella, Concettina and Siano, Pierluigi and Leitao, Paulo and Zhabelova, Gulnara and Vyatkin, Valeriy and Vrba, Pavel and others},
journal={IEEE Transactions on Industrial Electronics},
volume={62},
number={4},
pages={2424--2438},
year={2015},
month={4},
publisher={IEEE},
ISSN={0278-0046},
doi={10.1109/TIE.2014.2361486}
}
@article{faruque2015real,
title={{Real-Time Simulation Technologies for Power Systems Design, Testing, and Analysis}},
author={Faruque, MD Omar and Strasser, Thomas and Lauss, Georg and Jalili-Marandi, Vahid and Forsyth, Paul and Dufour, Christian and Dinavahi, Venkata and Monti, Antonello and Kotsampopoulos, Panos and Martinez, Juan A and others},
journal={IEEE Power and Energy Technology Systems Journal},
volume={2},
number={2},
pages={63--73},
year={2015},
month={6},
publisher={IEEE},
ISSN={2332-7707},
doi={10.1109/JPETS.2015.2427370}
}
@article{larsen2009architectural,
title={{Architectural breakdown of end-to-end latency in a TCP/IP network}},
author={Larsen, Steen and Sarangam, Parthasarathy and Huggahalli, Ram and Kulkarni, Siddharth},
journal={{International Journal of Parallel Programming}},
year={2009},
month={12},
volume={37},
number={6},
pages={556--571},
issn={1573-7640},
doi={10.1007/s10766-009-0109-6},
publisher={Springer}
}
@article{reinemo2006overview,
author={S. Reinemo and T. Skeie and T. Sødring and O. Lysne and O. Trudbakken},
journal={IEEE Communications Magazine},
title={{An Overview of QoS Capabilities in InfiniBand, Advanced Switching Interconnect, and Ethernet}},
year={2006},
volume={44},
number={7},
pages={32-38},
doi={10.1109/MCOM.2006.1668378},
ISSN={0163-6804},
month={09}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%% OFED %%%%%%%%%%%%%%%%%%%%%%%%
@misc{allianceofed,
author={\noauthor},
year={2018},
title={{OFA Overview}},
organization={{OpenFabric Alliance}},
url={https://www.openfabrics.org/ofa-overview/},
urldate = {2018-08-22},
}
@manual{mellanox2018linux,
author={\noauthor},
title={{Mellanox OFED for Linux User Manual}},
organization={{Mellanox Technologies}},
year={2018},
month={3},
number={2877},
note={Rev 4.3}
}
@manual{mellanox2015RDMA,
author={\noauthor},
title={{RDMA Aware Networks Programming User Manual}},
organization={{Mellanox Technologies}},
year={2015},
month={5},
edition={Rev 1.7}
}
@manual{ipoib,
title={{IP over InfiniBand (IPoIB) Architecture}},
author={{Kashyap, V}},
organization={Internet Engineering Task Force},
year={2015},
month={5},
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%% VILLAS NODE %%%%%%%%%%%%%%%%%%%%
@article{stevic2017multi,
title={{Multi-site European framework for real-time co-simulation of power systems}},
author={Stevic, Marija and Estebsari, Abouzar and Vogel, Steffen and Pons, Enrico and Bompard, Ettore and Masera, Marcelo and Monti, Antonello},
journal={IET Generation, Transmission \& Distribution},
volume={11},
number={17},
pages={4126--4135},
year={2017},
publisher={IET},
ISSN={1751-8687},
doi={10.1049/iet-gtd.2016.1576}
}
@inproceedings{vogel2017open,
title={{An Open Solution for Next-generation Real-time Power System Simulation}},
author={Vogel, Steffen and Mirz, Markus and Razik, Lukas and Monti, Antonello},
booktitle={{Energy Internet and Energy System Integration (EI2), 2017 IEEE Conference on}},
pages={1--6},
year={2017},
month={11},
publisher={IEEE},
doi={10.1109/EI2.2017.8245739}
}
@inproceedings{mirz2018distributed,
title={{Distributed Real-Time Co-Simulation as a Service}},
author={Mirz, Markus and Vogel, Steffen and Sch{\"a}fer, Bettina and Monti, Antonello},
booktitle={Industrial Electronics for Sustainable Energy Systems (IESES), 2018 IEEE International Conference on},
pages={534--539},
year={2018},
month={2},
doi={10.1109/IESES.2018.8349934},
publisher={IEEE},
address={Hamilton, New Zealand}
}
@mastersthesis{vogel2016development,
title={{Development of a modular and fully-digital PCIe-based interface to Real-Time Digital Simulator}},
author={Vogel, Steffen},
year=2016,
month=8,
school={RWTH Aachen University},
institution={Institute for Automation of Complex Power Systems}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%% PERFORMANCE STUDIES %%%%%%%%%%%%%%%%%%%%
@inproceedings{macarthur2012performance,
title={{A Performance Study to Guide RDMA Programming Decisions}},
author={MacArthur, Patrick and Russell, Robert D},
booktitle={{High Performance Computing and Communication \& 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on}},
pages={778--785},
year={2012},
publisher={IEEE},
doi={10.1109/HPCC.2012.110}
}
@inproceedings{liu2014performance,
author = {Liu, Qian and Russell, Robert D.},
title = {{A Performance Study of InfiniBand Fourteen Data Rate (FDR)}},
booktitle = {{Proceedings of the High Performance Computing Symposium}},
series = {HPC '14},
year = {2014},
location = {Tampa, Florida},
pages = {1--10},
articleno = {16},
numpages = {10},
acmid = {2663526},
publisher = {Society for Computer Simulation International},
address = {San Diego, CA, USA},
keywords = {InfiniBand, NUMA, RDMA, RDMA_WRITE_WITH_IMM, fourteen data rate},
}
%%%%%%%%%%%%%%%%%%%%%%%% LINUX BOOKS %%%%%%%%%%%%%%%%%%%%%%%%
@book{kerrisk2010linux,
title={{The Linux Programming Interface: a Linux and UNIX System Programming Handbook}},
author={Kerrisk, Michael},
year={2010},
isbn={978-1-59327-220-3},
publisher={No Starch Press}
}
@manual{posix2018,
author={\noauthor},
title={{IEEE Standard for Information Technology---Portable Operating System Interface (POSIX\textregistered)}},
organization={Institute of Electrical and Electronics Engineers},
note={Base Specifications, Issue 7},
isbn={978-1-5044-4542-9},
month={1},
year={2018},
doi={10.1109/IEEESTD.2018.8277153}
}
@book{kernighan1978c,
title={{The C Programming Language}},
author={Kernighan, Brian W and Ritchie, Dennis M},
note={1st ed.},
isbn={0-13-110163-3},
month={2},
year={1978}
}
@article{barabanov1996real,
title={{Real-Time Linux}},
author={Barabanov, Michael and Yodaiken, Victor},
journal={Linux journal},
volume={23},
number={4.2},
pages={1},
year={1996}
}
@inproceedings{rostedt2007internals,
title={{Internals of the RT Patch}},
author={Rostedt, Steven and Hart, Darren V},
booktitle={Proceedings of the Linux symposium},
volume={2},
pages={161--172},
month={6},
year={2007}
}
@book{love2010linux,
title={{Linux Kernel Development}},
author={Love, Robert},
month={6},
year={2010},
publisher={Pearson Education, Inc.},
isbn={978-0-672-32946-3}
}
@article{lameter2013numa,
author = {Lameter, Christoph},
title = {{NUMA (Non-Uniform Memory Access): An Overview}},
journal = {Queue},
issue_date = {July 2013},
volume = {11},
number = {7},
month = jul,
year = {2013},
issn = {1542-7730},
pages = {40--51},
articleno = {40},
numpages = {12},
doi = {10.1145/2508834.2513149},
publisher = {ACM},
address = {New York, NY, USA},
}
@misc{derr2004cpusets,
title={Cpusets},
author={Derr, Simon and Jackson, P and Lameter, C and Menage, P and Seto, H},
year={2004},
url={https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt},
urldate={2018-09-16}
}
@misc{menage2004cgroups,
title={Cgroups},
author={Menage, Paul and Jackson, Paul and Lameter, Christoph},
year={2008},
url={https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt},
urldate={2018-09-16}
}
@inproceedings{kroah2003udev,
title={udev--A Userspace Implementation of devfs},
author={Kroah-Hartman, Greg},
booktitle={Proceedings of the Linux symposium},
pages={263--271},
month={7},
year={2003},
}
@misc{drepper2007every,
title={{What Every Programmer Should Know About Memory}},
author={Drepper, Ulrich},
organization={Red Hat, Inc.},
note={Version 1.0},
month={11},
year={2007}
}
@manual{guide2018intelc3a,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developers Manual}},
organization={Intel},
note={Volume 3A: System Programming Guide, Part 1},
year={2018},
month={5}
}
@manual{guide2018intelc3b,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developers Manual}},
organization={Intel},
note={Volume 3B: System Programming Guide, Part 2},
year={2018},
month={5}
}
@manual{guide2018intelb2a,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developers Manual}},
organization={Intel},
note={Volume 2A: Instruction Set Reference, A-L},
year={2018},
month={5}
}
@manual{guide2018intelb2b,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developers Manual}},
organization={Intel},
note={Volume 2B: Instruction Set Reference, M-U},
year={2018},
month={5}
}
@techreport{paoloni2010benchmark,
title={{How to Benchmark Code Execution Times on Intel\textregistered{} IA-32 and IA-64 Instruction Set Architectures}},
author={Paoloni, Gabriele},
organization={Intel},
year={2010},
month={9}
}
@article{gandhi2016range,
title={Range Translations for Fast Virtual Memory.},
author={Gandhi, Jayneel and Karakostas, Vasileios and Ayar, Furkan and Cristal, Adri{\'a}n and Hill, Mark D and McKinley, Kathryn S and Nemirovsky, Mario and Swift, Michael M and Unsal, Osman S},
journal={IEEE Micro},
volume={36},
number={3},
pages={118--126},
doi={10.1109/MM.2016.10},
ISSN={0272-1732},
month={5},
year={2016}
}
@misc{bowden2009proc,
title={The /proc Filesystem},
author={Bowden, Terrehon and Bauer, Bodo and Nerin, Jorge and Feng, Shen and Seibold, Stefani},
year={2009},
month={6},
note={Version 1.3},
url={https://www.kernel.org/doc/Documentation/filesystems/proc.txt},
urldate={2018-09-19}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@article{perez2007ipython,
title={{IPython: a System for Interactive Scientific Computing}},
author={P\'erez, Fernando and Granger, Brian E.},
journal={Computing in Science and Engineering},
volume={9},
number={3},
pages={21--29},
month={5},
year={2007},
url={https://ipython.org},
ISSN={1521-9615},
doi={10.1109/MCSE.2007.53},
publisher={IEEE Computer Society}
}
@manual{pcisig2010pciexpress,
author={\noauthor},
title={{PCI Express\textregistered{} Base Specification}},
organization={{PCI-SIG}},
year={2010},
month={11},
note={Revision 3.0}
}
@inproceedings{susan1983gprof,
title={{gprof: A Call Graph Execution Profiler}},
author={Graham, Susan L. and Kessler, Peter B. and McKusick, Marshall K.},
booktitle={Proceedings: USENIX Association [and] Software Tools Users Group Summer Conference},
pages={81--88},
year={1983},
address={Toronto, Ontario, Canada}
}

% chapters.tex
\include{chapters/introduction}
\include{chapters/basics}
\include{chapters/architecture}
\include{chapters/implementation}
\include{chapters/evaluation}
\include{chapters/conclusion}
\include{chapters/future}

% chapters/architecture.tex
\chapter{Architecture\label{chap:architecture}}
The first section of this chapter (\ref{sec:villasbasics}) explains the concept and internals of a VILLASnode instance. In the second section (\ref{sec:configuration}), a brief introduction to the configuration of node-type instances is given. Then, in \autoref{sec:readwrite_interfaces},~\ref{sec:memorymanagement}, and~\ref{sec:villas_fsm}, the adaptations that had to be made to the interface of node-types, the memory management of VILLASnode, and the finite-state machine of nodes are explained, respectively.
\section{Concept\label{sec:villasbasics}}
The functioning principles and general structure of VILLASframework, of which VILLASnode is a sub-project, were already presented in \autoref{sec:intro_villas}. This section solely focuses on the structure of VILLASnode.
\Autoref{tab:villasnode_nodes} presented the different \textit{node-types} that VILLASnode supported at the time of writing the present work. One VILLASnode instance---in the remainder of the present work often referred to as \textit{super-node}---may have several \textit{nodes} which act as source and/or sink of simulation data. A node is defined as an instance of a node-type. Accordingly, a super-node can serve as a gateway for simulation data. Node-types can roughly be divided into three categories:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textit{internal node-types}, which enable communication with node-types on the same host (e.g., writing data to a file descriptor through a \textit{file} node);
\item \textit{server-server node-types}, which enable communication with nodes on different hosts (e.g., communicating with a \textit{socket} node on a remote host);
\item \textit{simulator-server node-types}, which enable communication with simulators (e.g., acquiring data from an OPAL-RT simulator).
\end{itemize}
(In the remainder of this work, names of node-types and nodes are written in an italic font, for example, \textit{file} node, \textit{socket} node, or \textit{InfiniBand} node-type.)
Within a super-node, so-called \textit{paths} connect different nodes. A path starts at a node from which it acquires data. Immediately after data is obtained, it is optionally sent through a \textit{hook}, which can be seen as an extension to manipulate the data (e.g., to filter or transform it). Then, the data is written into a \gls{fifo} (also called a \textit{queue}), which holds it until it can be passed on. Subsequently, the data is sent through a \textit{register}, which can multiplex and mask it. Before the data is placed into the output queue and right before the sending node obtains it, it can be manipulated by more hooks. Finally, if the output node is ready, the data is moved from the output queue to the output node, which then sends it to a given destination node.
Data is transmitted in \textit{samples}, which store the simulation data for a given point in time, send and receive timestamps, and a sequence number. The sample structure is deliberately kept simple because it is the smallest common denominator of all supported simulators.
\begin{figure}[ht!]
\includegraphics{images/villasnode.pdf}
\vspace{-0.5cm}
\caption{The internal VILLASnode architecture~\cite{vogel2017open}. Depicted is one VILLASnode instance (\textit{super-node}) that includes three \textit{paths}, which connect five node-type instances (\textit{nodes}) with each other.}
\label{fig:villasnode}
\end{figure}
\Autoref{fig:villasnode} depicts the internal connections of an example super-node. This VILLASnode instance includes five node-type instances: \textit{opal} ($n_1$), \textit{file} ($n_2$), \textit{socket} ($n_3$), \textit{mqtt} ($n_4$), and a yet to be implemented \textit{InfiniBand} ($n_5$) node. On receive, data from the \textit{opal} node $n_1$ is modified by hook $h_1$ before it is placed in queue $q_{i,1}$. Path 1 continues through register $r_1$, hook $h_2$, and hook $h_3$, before it enters the output queue $q_{o,1}$. Before the \textit{socket} node $n_3$ sends the data from the queue to another \textit{socket} node, it is modified one last time by hook $h_4$.
Path 2 connects a \textit{socket} node ($n_3$), an \textit{mqtt} node ($n_4$), and an \textit{InfiniBand} node ($n_5$) with an \textit{opal} node $n_1$. In this path, the register $r_2$ determines the forwarding conditions for $q_{i,2}$, $q_{i,3}$, and $q_{i,4}$; it could, for example, depending on the data available in the queues, mask them. Before the data is placed in the output queue $q_{o,2}$ and right before the \textit{opal} node sends the data, it is modified by hook $h_5$ and $h_6$, respectively.
Path 3 connects a \textit{file} node $n_2$, which reads data from a local file, with an \textit{mqtt} node $n_4$ and \textit{InfiniBand} node $n_5$.
\section{Configuration of nodes\label{sec:configuration}}
\Autoref{lst:node_config} shows an example of a stripped-down VILLASnode configuration file. The first part of the configuration consists of a list of nodes to be initialized (comparable with $n_{1\ldots5}$ in \autoref{fig:villasnode}). In this example, an instance of a \textit{file} node-type (\texttt{node\_1}) and an instance of an \textit{InfiniBand} node-type (\texttt{node\_2}) would be instantiated. Besides the type, a user can specify a range of settings for every node. These can be divided into global settings for the complete instance, settings only for the input part of the node, and settings only for the output part. The supported settings for every node-type can be found on the VILLASframework documentation pages.\footnote{\url{https://villas.fein-aachen.org/doc/node-types.html}}
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Structure of the configuration file of a \textit{file} node and an \textit{InfiniBand} node with a path connecting them.,
label=lst:node_config,
style=customconfig]{listings/node_config.conf}
\vspace{-0.2cm}
\end{figure}
The \textit{paths} section describes how nodes are connected within the super-node (compare with path 1, path 2, and path 3 in \autoref{fig:villasnode}). In this case, there is a path between \texttt{node\_1} and \texttt{node\_2}. This means that data is read from a file, which would be specified in the in-section of \texttt{node\_1}, and then placed in a buffer in the super-node. Then, after it is sent through possible hooks---which are not defined in this configuration file---it is copied to the memory that is allocated as output buffer for the \textit{InfiniBand} node. The super-node then sends these samples to the write-function of that node, which in turn sends the samples to a remote node as specified in its out-section.
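The configuration file itself is included from an external listing. Based on the description above, a minimal configuration connecting the two nodes might look roughly as follows; all field names and values here are illustrative assumptions, not the verbatim listing (the documentation pages give the exact syntax):

```
nodes = {
	node_1 = {
		type = "file";

		in = {
			# settings for reading samples, e.g., the input file
		}
	},
	node_2 = {
		type = "infiniband";

		out = {
			# settings describing the remote InfiniBand endpoint
		}
	}
}

paths = (
	{
		in  = "node_1";    # source of the path
		out = "node_2";    # sink of the path
	}
)
```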
\section{Interface of node-types\label{sec:readwrite_interfaces}}
To ensure interoperability between different node-types and VILLASnode, the VILLASframework specification defines an interface to use between the super-node and node-types. It is realized as a fixed set of functions with a given set of parameters that every node-type can implement. These functions have to be registered with the framework by passing it the pointers of the respective functions. Examples of functions to be implemented are \texttt{start()} and \texttt{stop()}, as well as \texttt{read()} and \texttt{write()}. Since their parameters had to be changed to efficiently support an \textit{InfiniBand} node-type, this section will expand upon the latter.
Not every function is mandatory; some functions will simply be ignored if they are not implemented. A complete list of all functions a node-type should implement, together with a brief description, is presented in \autoref{a:nodetype_functions}.
\subsection{Original implementation of the read- and write-function}
\Autoref{lst:read_write_original} shows the variables which were originally used in the \texttt{node\_type} C structure (\autorefap{a:sec:structnodetype}) to save the function pointers to the read- and write-function. Since this listing shows the initial parameters, it helps to understand the working principles of both functions and their weaknesses.
For both functions, \texttt{*n} is a C structure that holds information about the node-type instance. It contains, among other things, information about the state, the number of generated or received samples, the configuration of the node, and a field for node-type specific virtual data. The node structure is displayed in \autorefap{a:sec:structnode}; the present work will not expand further upon this struct.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Original parameters of \texttt{read()} and \texttt{write()},
label=lst:read_write_original,
style=customc]{listings/read_write_original.h}
\vspace{-0.2cm}
\end{figure}
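The listing is included from an external file; based on the surrounding description, the two function pointers in the \texttt{node\_type} structure presumably had roughly the following shape. This is a sketch with simplified placeholder types, not the verbatim VILLASnode declaration:

```c
#include <assert.h>

/* Simplified stand-ins for VILLASnode's node and sample structures. */
struct node   { void *_vd; };   /* node-type specific virtual data */
struct sample { int seq; };

/* Sketch of the original interface: both functions receive the node
 * instance, an array of sample pointers, and the number of samples,
 * and return how many samples were actually read or written. */
struct node_type {
	int (*read)(struct node *n, struct sample *smps[], unsigned cnt);
	int (*write)(struct node *n, struct sample *smps[], unsigned cnt);
};

/* Dummy implementation: pretends every passed sample was handled. */
static int dummy_rw(struct node *n, struct sample *smps[], unsigned cnt)
{
	(void) n; (void) smps;
	return (int) cnt;
}
```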
\paragraph{Read-function} The working principle of the read-function is displayed in \autoref{fig:villas_read}. The \textit{\undershort read()} box represents the function to which the \texttt{(*read)} pointer (line 1 in \autoref{lst:read_write_original}) of a given node-type points and is often simply referred to as \textit{read-function} in the remainder of the present work. The box thus depicts a part of the interface between the super-node and the node.
In order to retrieve data from a node, the super-node starts by allocating $\mathtt{cnt} \geq 1$ empty samples. A sample contains fields for, i.a., an origin timestamp, a receive timestamp, a sequence number, a reference counter, and a field to save the actual signal. The signal can contain unsigned 64-bit integers, 64-bit floating-point numbers, booleans, or complex numbers. \Autorefap{a:sec:structsample} presents the \texttt{sample} C structure. Since this structure contains some host specific information, it contains more data than will actually be sent.
After samples have been allocated, their reference counter (\textit{refcnt}) is increased by one. Samples in VILLASnode cannot be destroyed unless the reference counter is 1 when the release-function is called. When $refcnt>1$, other instances within VILLASnode still rely on the sample; calling the release-function on such a sample will cause the reference counter to be decremented by 1. In the remainder of the present work, \textit{releasing a sample} and \textit{decreasing the reference counter of a sample by one} are used interchangeably.
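The release semantics described above can be sketched as follows. This is a minimal illustration with an assumed helper name; the real VILLASnode implementation is more involved (it, e.g., returns samples to a memory pool):

```c
#include <assert.h>
#include <stdbool.h>

struct sample {
	int refcnt;    /* number of users still holding this sample */
};

/* Release one reference. The sample is only destroyed when the last
 * reference is given up, i.e., when the counter was 1 on entry. */
static bool sample_release(struct sample *s)
{
	s->refcnt--;
	if (s->refcnt == 0) {
		/* here the real implementation would free the sample */
		return true;    /* destroyed */
	}
	return false;       /* still referenced elsewhere */
}
```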
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_read.pdf}
\vspace{-0.8cm}
\caption{Invoking the read-function.}\label{fig:villas_read_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_read.pdf}
\vspace{-0.8cm}
\caption{Return of the read-function.}\label{fig:villas_read_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_read_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the read-function in VILLASnode. This function is part of the interface between a super-node and a node.}\label{fig:villas_read}
\end{figure}
After memory to hold the samples has been allocated, a pointer to the first sample (\texttt{*smps[]}) and the total number of allocated samples (\texttt{cnt}) is passed to the node by calling the read-function (\autoref{fig:villas_read_a}). The node then tries to receive a maximum of \texttt{cnt} values to subsequently copy them to the allocated memory.
The return of the read-function is depicted in \autoref{fig:villas_read_b}. After the receive module, which is shown as a black box here, has filled up $ret \leq \mathtt{cnt}$ samples, it lets the read-function return with \textit{ret}. The super-node then processes \textit{ret} samples (e.g., sending them through several hooks before forwarding them to another node). Finally, all \texttt{cnt}---thus not only \textit{ret}---samples are released. Hence, after a read cycle, the reference counter of every sample is decreased by 1, which usually destroys the samples.
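The complete read cycle on the super-node side can be summarized in a short sketch. All names here are illustrative; \texttt{toy\_read} stands in for a node's read-function, and the acquire/release steps stand in for the real VILLASnode helpers:

```c
#include <assert.h>

#define CNT 8

struct sample { int refcnt; int valid; };

/* Toy read-function: fills the first half of the samples and
 * returns how many samples it filled (ret <= cnt). */
static int toy_read(struct sample *smps[], unsigned cnt)
{
	unsigned ret = cnt / 2;
	for (unsigned i = 0; i < ret; i++)
		smps[i]->valid = 1;
	return (int) ret;
}

/* One read cycle: acquire references, call the read-function, process
 * the ret filled samples, then release all cnt samples (not only ret). */
static int read_cycle(struct sample *smps[], unsigned cnt)
{
	for (unsigned i = 0; i < cnt; i++)
		smps[i]->refcnt++;             /* acquire */

	int ret = toy_read(smps, cnt);

	/* ... process ret samples (hooks, forwarding to other nodes) ... */

	for (unsigned i = 0; i < cnt; i++)
		smps[i]->refcnt--;             /* release all cnt samples */

	return ret;
}
```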
\paragraph{Write-function} The write-function works in a similar fashion as the read-function and has identical parameters (line 2 in \autoref{lst:read_write_original}). The working principle of this function is depicted in \autoref{fig:villas_write}. When a super-node's path needs to write data to a node, it calls the write-function (\autoref{fig:villas_write_a}) and passes the total number of samples and the pointer to the first sample as arguments.
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_write.pdf}
\vspace{-0.8cm}
\caption{Invoking the write-function.}\label{fig:villas_write_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_write.pdf}
\vspace{-0.8cm}
\caption{Return of the write-function.}\label{fig:villas_write_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_write_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the write-function in VILLASnode. This function is part of the interface between a super-node and a node.} \label{fig:villas_write}
\end{figure}
When the write-function is called, the node starts processing the samples by copying \texttt{cnt} samples to its send module and instructing it to send the data. The send module does not return until all samples have been copied, and in the case of many nodes, not until the data has been successfully sent. When the send module is done, as depicted in \autoref{fig:villas_write_b}, it lets the write-function return with the number of samples that have been successfully sent. Ideally, the returned value \textit{ret} is equal to the number of passed samples \texttt{cnt}. If this is not the case, the super-node will detect this and act upon a possible error. In all cases, the reference counter of all \texttt{cnt} samples is decremented by~1.
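The super-node's check after a write thus reduces to comparing the return value with the number of passed samples; a small sketch (the helper name is an assumption, not part of VILLASnode):

```c
#include <assert.h>

/* Sketch of the super-node's check after calling a node's write-function:
 * ideally all cnt samples were sent; a short count signals an error. */
static int check_write_result(int ret, unsigned cnt)
{
	if (ret < 0)
		return -1;                            /* hard error in the node */
	if ((unsigned) ret != cnt)
		return (int) (cnt - (unsigned) ret);  /* number of unsent samples */
	return 0;                                 /* everything was sent */
}
```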
\subsection{Requirements for the read- and write-function of an InfiniBand node\label{sec:requirements}}
As discussed in the previous section, the reference counters of all samples that have been sent into the read- or write-functions are decreased after the functions return. For nodes with either a receive module that has a local buffer or with a send module which does not return until it has made a copy of the data or actually sent the data, this approach works exactly as intended. But, as soon as the modules are implemented by an architecture which is based on the \gls{via}---in this particular case the \gls{iba}---the implementation causes problems. To adhere to the zero-copy principle of the \gls{via}, data should not be copied from the super-node's buffer to a local buffer or the other way around. Rather, a pointer to, and the length of, a memory location should be passed to the network adapter, which then independently copies the data from the host's memory to its local buffers or the other way around.
In the following, the ideal situation for a read and write operation for the InfiniBand Architecture is presented. Although this approach is specifically for the \gls{iba}, it can relatively easily be translated to other \glspl{via}. After the desired approach has been discussed, the next subsection will discuss the shortcomings of the parameters in \autoref{lst:read_write_original} that obstruct the implementation of this approach.
\paragraph{Read-function}
\Autoref{fig:villas_read_iba} depicts a super-node that reads from a node-type instance whose communication is based on the \gls{iba}. The receive module in this figure relies on the receive queue of an InfiniBand \gls{qp}. As explained in \autoref{sec:qp}, a queue pair cannot receive data unless its \gls{rq} holds receive \glspl{wqe}. Hence, work requests that point to buffers of the super-node have to be submitted.
\begin{figure}[ht!]
\vspace{-0.4cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_read_iba.pdf}
\vspace{-0.8cm}
\caption{Invoking the read-function.}\label{fig:villas_read_iba_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_read_iba.pdf}
\vspace{-0.8cm}
\caption{Return of the read-function.}\label{fig:villas_read_iba_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_read_iba_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the read-function in an \textit{InfiniBand} node. The \acrshort{rq} is part of a complete \acrshort{qp}, but the \acrshort{sq} is omitted for the sake of simplicity.} \label{fig:villas_read_iba}
\end{figure}
An important requirement for this node-type was that it should be compatible with the original node-type interface; or at least that the changes would be minimal. Hence, in order to acquire pointers to samples from the super-node, the \texttt{*smps[]} parameter from the read-function is used. Like the super-node in \autoref{fig:villas_read_a}, the super-node in \autoref{fig:villas_read_iba_a} starts by allocating $\mathtt{cnt} \geq 1$ empty samples, increasing their reference counters, and passing their pointers to the node's read-function. The node, in turn, takes the addresses of the samples, wraps them up in scatter/gather elements, places them in work requests, and submits them to the \gls{rq}. Now, when the \gls{hca} receives a message, it will write the data directly into the allocated memory within the super-node. In this way, an additional copy between the node and the super-node is avoided.
Since the receive module of an \textit{InfiniBand} node does not copy data to the passed samples, the return of the function in \autoref{fig:villas_read_iba_b} works fundamentally differently from the return of the function in \autoref{fig:villas_read_b}. If there are no \glspl{cqe} in the completion queue, thus if the HCA did not receive any data, the return value \textit{ret} of the node shall be 0. In that way, the super-node knows that the set of previously allocated \texttt{smps[]} does not hold any data. None of the buffers' reference counters shall be decreased, since all buffers are submitted to the \gls{rq} and the \gls{hca} will thus write data to them.
If \glspl{cqe} are available, pointers to samples which are submitted to the \gls{rq} (light gray in \autoref{fig:villas_read_iba}) are replaced by the pointers to the buffers that are filled by the HCA (dark gray in \autoref{fig:villas_read_iba}). The return value \textit{ret} shall be the number of pointers that have been replaced since these buffers now contain valid data that was sent to this node. The reference counters of these buffers must be decreased after they have been processed by the super-node.
Consequently, in order for the \textit{InfiniBand} node to be able to receive data, the super-node has to invoke the read-function at least once without acquiring any data. To store the pointers to the buffers in the \glspl{cqe}, the \gls{wr} C structure member \texttt{wr\_id} can be used (see \autoref{sec:postingWRs}).
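The \texttt{wr\_id} bookkeeping can be illustrated as follows: the sample's address is stashed in the work request when posting, and recovered from the completion entry afterwards. This sketch replaces the real verbs structures and calls (\texttt{ibv\_post\_recv}, \texttt{ibv\_poll\_cq}) with toy stand-ins, since only the pointer round-trip matters here:

```c
#include <assert.h>
#include <stdint.h>

struct sample { int data; };

/* Toy stand-ins for struct ibv_recv_wr and struct ibv_wc:
 * only the 64-bit wr_id field is relevant for this illustration. */
struct toy_wr { uint64_t wr_id; };
struct toy_wc { uint64_t wr_id; };

/* Posting: stash the sample's address in the work request's wr_id. */
static void post_recv(struct toy_wr *wr, struct sample *smp)
{
	wr->wr_id = (uintptr_t) smp;
}

/* Completion: recover the sample pointer from the completion entry,
 * so the read-function can swap it into the caller's smps[] array. */
static struct sample *complete_recv(const struct toy_wc *wc)
{
	return (struct sample *) (uintptr_t) wc->wr_id;
}
```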
\paragraph{Write-function} The write-function, depicted in \autoref{fig:villas_write_iba}, has to adhere to similar conventions as the read-function in order to realize zero-copy. Again, the addresses of the samples are passed to the node as arguments of the write-function, to be subsequently submitted to the \gls{sq}. The \gls{hca} will then process the submitted work requests and take care of the necessary memory operations.
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_write_iba.pdf}
\vspace{-0.8cm}
\caption{Invoking the write-function.}\label{fig:villas_write_iba_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_write_iba.pdf}
\vspace{-0.8cm}
\caption{Return of the write-function.}\label{fig:villas_write_iba_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_write_iba_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the write-function in an \textit{InfiniBand} node. The \acrshort{sq} is part of a complete \acrshort{qp}, but the \acrshort{rq} is omitted for the sake of simplicity.} \label{fig:villas_write_iba}
\end{figure}
When the pointers are successfully submitted to the \gls{sq}, the function shall return the total number of submitted pointers \textit{ret}. If the completion queue is empty, none of these pointers may be released because the HCA has yet to access the memory locations. If the completion queue contains entries, that means that previously submitted send \glspl{wr} are finished; these pointers can be released. So, in order to release them, the initial pointers to the data to be sent (light gray in \autoref{fig:villas_write_iba}) are replaced by pointers to buffers which were submitted to the \gls{sq} in a previous call of the write-function. The super-node has to be notified that it must only decrease the reference counter of pointers that were yielded by the \glspl{cqe}.
\subsection{Proposal for a new read- and write-function\label{sec:proposal}}
Evidently, the major shortcoming of the functions from \autoref{lst:read_write_original} is the lack of an interface to pass the number of samples to be released to the super-node. There is no way the super-node can predict how many samples may be released; this becomes even more difficult if it is taken into account that some samples may be sent inline---thus can be released immediately after submitting the \gls{wr}---and that some work requests may not be successfully submitted to the \gls{sq}.
Therefore, new parameters for the read- and write-function are proposed in \autoref{lst:read_write_proposal}. The additional parameter in each function lets a node decide how many items of \texttt{smps[]} should actually be released. The several distinctions which have to be considered are further elaborated upon in \autoref{sec:villas_implementation}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Proposal for an additional parameter in \texttt{read()} and \texttt{write()}.,
label=lst:read_write_proposal,
style=customc]{listings/read_write_proposal.h}
\vspace{-0.2cm}
\end{figure}
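The proposed listing is again included externally; following the description, the new interface adds an out-parameter through which the node reports how many entries of \texttt{smps[]} may be released. The parameter name \texttt{release} and the placeholder types are assumptions for illustration:

```c
#include <assert.h>

struct node   { void *_vd; };
struct sample { int refcnt; };

/* Sketch of the proposed interface: the extra out-parameter tells the
 * super-node how many of the smps[] entries it may actually release. */
struct node_type {
	int (*read)(struct node *n, struct sample *smps[], unsigned cnt,
	            unsigned *release);
	int (*write)(struct node *n, struct sample *smps[], unsigned cnt,
	             unsigned *release);
};

/* Dummy InfiniBand-style node: nothing received yet, so all buffers
 * stay posted to the receive queue and none may be released. */
static int dummy_read(struct node *n, struct sample *smps[], unsigned cnt,
                      unsigned *release)
{
	(void) n; (void) smps; (void) cnt;
	*release = 0;
	return 0;
}
```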
\section{Memory management\label{sec:memorymanagement}}
Originally, memory within the framework could only be allocated with a fixed set of settings called \textit{memory-types}. The VILLASnode internal \texttt{alloc()} could be called, for example, with \texttt{memory\_hugepage}, which pins memory and maps it to hugepages (see \autoref{sec:mem_optimization}), or with \texttt{memory\_heap}, which allocates aligned memory on the heap. These built-in memory-types are not sufficient for the \textit{InfiniBand} node-type. \Autoref{sec:requirements} already showed that the \gls{hca} will directly access the memory that is allocated by the super-node. Thus, as follows from \autoref{sec:memory}, the buffer must be registered with a memory region, and the \glspl{wr} that are submitted to either queue of the \gls{qp} must contain the local key.
Since embedding a memory-type for every node-type in the VILLASnode source code would go against the principle of modularity, this is not an option. Consequently, the most obvious solution is to allow every node-type to register its own memory-type if necessary. In that way, every node-type can exactly define what the \texttt{alloc()} and \texttt{free()} functions implement. For \texttt{alloc()}, a node-type can, for example, define how memory should be allocated, whether the pages should be aligned, how big the pages should be, and if the memory should be registered with a memory region. It is also possible for a node-type to implement certain functions which interact with the memory that is allocated by the memory-type; this can, for example, be used within the \textit{InfiniBand} node to acquire the local key of a sample that is passed as an argument of the read- or write-function.
With this method, every node-type may define a \texttt{memory\_type} C structure, which it must register in the same fashion as it registers the interface functions with the super-node (line 39, \autoref{lst:struct_nodetype}). By enabling node-types to register their own memory-type, the super-node knows what type of memory to use for input and/or output buffers that are connected to nodes of this type ($q_{i,x}$ and $q_{o,x}$ in \autoref{fig:villasnode}).
If no memory-type is specified, the super-node will assume \texttt{memory\_hugepage}.
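A node-type-specific memory-type as described could be registered through a structure of function pointers along these lines. The names are illustrative and the actual VILLASnode structure differs in detail; a trivial heap-backed variant serves as the example implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch of a pluggable memory-type: each node-type may provide its own
 * allocation strategy (hugepages, verbs memory regions, ...). */
struct memory_type {
	const char *name;
	void *(*alloc)(size_t len);
	int   (*free)(void *ptr, size_t len);
};

/* Trivial heap-backed memory-type used purely as illustration. */
static void *heap_alloc(size_t len)           { return malloc(len); }
static int   heap_free(void *ptr, size_t len) { (void) len; free(ptr); return 0; }

static const struct memory_type memory_heap_example = {
	.name  = "heap",
	.alloc = heap_alloc,
	.free  = heap_free,
};
```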
\section{VILLASnode finite-state machine\label{sec:villas_fsm}}
Initially, a node could reside in one of the six states displayed in \autoref{lst:states}. The super-node transitions the node through the states depending on the results of functions from \autoref{a:nodetype_functions}. For example, when the super-node calls a node's start-function, the transition \textit{checked}$\,\to\,$\textit{started} is performed if the function returns successfully.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The six states a node could originally reside in.,
label=lst:states,
style=customc]{listings/states.h}
\vspace{-0.2cm}
\end{figure}
These states were sufficient for the node-types which existed up to now (\autoref{tab:villasnode_nodes}); when a node resided in \textit{started}, this meant it was ready to send and receive data. This is not the case for node-types that are based on (descendants of) the Virtual Interface Architecture. Here, a node can be initiated---for which the \textit{started} state can be used---but not connected and thus not able to send data to another node. Accordingly, the introduction of a new state \textit{connected} would be appropriate. Furthermore, architectures that are based on the \gls{via} rely on descriptors (called work requests in the \gls{iba}) in a send and receive queue. Hence, in order to be able to receive data directly after the connection has been established, descriptors have to be present in the \gls{rq} at this moment. For this reason, in (descendants of) the \gls{via}, it is possible to prepare elements in the receive queue prior to the actual connection.
These considerations yield the finite-state machine in \autoref{fig:villasnode_states}. The states which are indicated with dashed borders, \textit{pending connect} and \textit{connected}, may be set by the node after the super-node transitioned the instance to the \textit{started} state. The use of both states is not mandatory. If a node is in one of these two states, the super-node interprets it as if the node were in the \textit{started} state. But, they can be used within the node itself to distinguish between a node being started, being in a pending connect state, or actually being connected. This state machine shows similarities with the \gls{via}'s finite-state machine in \autoref{fig:via_diagram}. It can therefore be used for future node-types that are based on the \gls{via}---other than the \textit{InfiniBand} node-type that is presented in the present work---as well.
Although it is necessary to execute the transition \textit{checked}$\,\to\,$\textit{started}, it is possible to transition to \textit{stopped} and \textit{destroyed} from any of the three states in the dashed square.
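The state set could be expressed as an enumeration along the following lines. Only \textit{checked}, \textit{started}, \textit{stopped}, and \textit{destroyed} are named explicitly in the text; the remaining identifiers are assumptions for illustration:

```c
#include <assert.h>

/* Sketch of the node states including the two newly introduced
 * connection-related states (identifier names are assumptions). */
enum state {
	STATE_DESTROYED,
	STATE_INITIALIZED,
	STATE_PARSED,
	STATE_CHECKED,
	STATE_STARTED,
	STATE_STOPPED,
	/* New: set by the node itself after the super-node performed
	 * the checked -> started transition. */
	STATE_PENDING_CONNECT,
	STATE_CONNECTED,
};

/* From the super-node's point of view, the two new states are
 * interpreted exactly like STATE_STARTED. */
static int state_is_started(enum state s)
{
	return s == STATE_STARTED ||
	       s == STATE_PENDING_CONNECT ||
	       s == STATE_CONNECTED;
}
```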
\begin{figure}[ht]
\vspace{-0.65cm}
\hspace{0.4cm}
\includegraphics{images/villasnode_states.pdf}
\vspace{-0.45cm}
\caption{The VILLASnode state diagram with the two newly introduced states \textit{pending connect} and \textit{connected}.}
\label{fig:villasnode_states}
\end{figure}

% chapters/basics.tex
\chapter{Basics\label{chap:basics}}
The first section of this chapter (\ref{sec:via}) introduces the Virtual Interface Architecture, of which the InfiniBand Architecture is a descendant. After this brief introduction on InfiniBand's origins, \autoref{sec:infiniband} is completely devoted to the InfiniBand Architecture itself. Subsequently, \autoref{sec:iblibs} introduces the software libraries that are used to operate the InfiniBand hardware in the present work's benchmarks and in the implementation of the VILLASnode \textit{InfiniBand} node-type. Finally, \autoref{sec:optimizations} goes on to discuss real-time optimizations in Linux, which is the operating system VILLASnode is most frequently operated on.
\section{The Virtual Interface Architecture\label{sec:via}}
InfiniBand is rooted in the \gls{via}~\cite{pfister2001introduction}, which was originally introduced by Compaq, Intel, and Microsoft~\cite{compaq1997microsoft}. Although InfiniBand does not completely adhere to the original \gls{via} specifications, it is important to understand its basics. In that way, some design decisions in the InfiniBand Architecture will be more comprehensible. This section will therefore elaborate on the characteristics of the \gls{via}\@.
The lion's share of the Internet protocol suite, also known as \acrshort{tcpip}, is implemented by the \gls{os}~\cite{kozierok2005tcp}. Even though the concept of the \acrshort{tcpip} stack allows the interface between a \gls{nic} and an \gls{os} to be relatively simple, a drawback is that the \gls{nic} is not directly accessible to consumer processes, but only over this stack. Since the \acrshort{tcpip} stack resides in the operating system's kernel, communication operations result in \textit{trap} machine instructions (or, on more recent x86 architectures, \textit{sysenter} instructions), which cause the \gls{cpu} to switch from user to kernel mode~\cite{kerrisk2010linux}. This back-and-forth between both modes is relatively expensive and thus adds a certain amount of latency to the communication operation that caused the switch. Furthermore, since the \acrshort{tcpip} stack also includes reliability protocols and the (de)multiplexing of the \gls{nic} to processes, the operating system has to take care of these rather expensive tasks as well~\cite{kozierok2005tcp}. \Autoref{sec:motivation} already described Larsen and Huggahalli's~\cite{larsen2009architectural} research on the proportions of the latency in the Internet protocol suite. This overhead resulted in the need---and thus the development---of a new architecture which would provide each process with a directly accessible interface to the \gls{nic}\@: the Virtual Interface Architecture was born.
In their publication, Dunning et al.~\cite{dunning1998virtual} describe that the most important characteristics of the \gls{via} are:
\begin{itemize}
\setlength\itemsep{-0.2em}
\item data transfers are realized through zero-copy;
\item system calls are avoided whenever possible;
\item the \gls{nic} is not multiplexed between processes by a driver;
\item the number of instructions needed to initiate data transport is minimized;
\item no interrupts are required when initiating or completing data transport;
\item there is a simple set of instructions for sending and receiving data;
\item it can be both mimicked in software and synthesized to hardware.
\end{itemize}
Accordingly, several tasks which are handled in software in the Internet protocol suite---e.g., multiplexing the \gls{nic} to processes, data transfer scheduling, and preferably reliability of communication---must be handled by the \gls{nic} in the \gls{via}\@.
\subsection{Basic components}
A model of the \gls{via} is depicted in \autoref{fig:via_model}. At the top of the stack are the processes and applications that want to communicate over the network controller. Together with \gls{os} communication protocols and a special set of instructions called the \textit{\acrshort{vi} User Agent}, they form the \textit{\acrshort{vi} Consumer}. The VI consumer is colored light gray in \autoref{fig:via_model} and resides completely in the operating system's user space. The user agent provides the upper layer applications and communication protocols with an interface to the \textit{\acrshort{vi} Provider} and a direct interface to the \glspl{vi}.
\begin{figure}[ht]
\hspace{0.4cm}
\includegraphics{images/via_model.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{via} model.}\label{fig:via_model}
\end{figure}
The VI provider, colored dark gray in \autoref{fig:via_model}, is responsible for the instantiation of the virtual interfaces and completion queues, and consists of the \textit{kernel agent} and the \gls{nic}\@. In the \gls{via}, the \gls{nic} implements and manages the virtual interfaces and completion queues---which will both be further elaborated upon in \autoref{sec:data_transfer}---and is responsible for performing data transfers. The kernel agent is part of the operating system and is responsible for resource management, e.g., creation and destruction of \glspl{vi}, management of memory used by the \gls{nic}, and interrupt management. Although communication between consumer and kernel agent requires switches between user and kernel mode, this does not influence the latency of data transfers because no data is actually transferred via this interface.
\subsection{Data transfer\label{sec:data_transfer}}
One of the most distinctive elements of the \gls{via}, compared to the Internet protocol suite, is the \acrfull{vi}. Because of this direct interface to the \gls{nic}, each process assumes that it owns the interface and there is no need for system calls when performing data transfers. Each virtual interface consists of a send and a receive work queue which can hold \textit{descriptors}. These contain all information necessary to transfer data, for example, destination addresses, transfer mode to be used, and the location of data to be transferred in the main memory. Hence, both send and receive data transfers are initiated by writing a descriptor memory structure to a \gls{vi}, and subsequently notifying the VI provider about the submitted structure. This notification happens with the help of a \textit{doorbell}, which is directly implemented in the \gls{nic}\@. As soon as the \gls{nic}'s doorbell has been rung, it starts to asynchronously process the descriptors.
When a transfer has been completed---successfully or with an error---the descriptors are marked by the \gls{nic}\@. Usually, it is the consumer's responsibility to remove completed descriptors from the work queues. Alternatively, a \gls{vi} can be bound to a \gls{cq} on creation; notifications on completed transfers are then directed to this queue. A \gls{cq} has to be bound to at least one work queue, and conversely, the completion notifications of several work queues can be directed to one single completion queue. Hence, in an environment with $N$ virtual interfaces with two work queues each, there can be
\begin{equation}
0 \leq M \leq 2\cdot N
\end{equation}
completion queues.
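The relation above can be illustrated with a small sketch (this is a model of the binding rule only, not VIA API code; all names are illustrative):

```python
# Sketch: with N virtual interfaces there are 2*N work queues. Every
# completion queue must be bound to at least one work queue, so between
# 0 and 2*N distinct completion queues can exist.

def max_completion_queues(num_vis: int) -> int:
    """Upper bound on completion queues for num_vis virtual interfaces."""
    return 2 * num_vis  # each VI contributes a send and a receive work queue

# Example: three VIs share one CQ for all send queues and use one CQ per
# receive queue, giving 4 CQs, which respects 0 <= 4 <= 2 * 3.
bindings = {"cq0": ["vi0.sq", "vi1.sq", "vi2.sq"],
            "cq1": ["vi0.rq"], "cq2": ["vi1.rq"], "cq3": ["vi2.rq"]}
assert 0 <= len(bindings) <= max_completion_queues(3)
```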
The Virtual Interface Architecture supports two asynchronously operating data transfer models: the \textit{send and receive messaging} model and the \gls{rdma} model. The characteristics of both models are described below.
\paragraph{Send and receive messaging model (channel semantics)} This model is the concept behind various other popular data transfer architectures. First, a receiving node explicitly specifies where received data shall be stored in its local memory. In the \gls{via}, this is done by submitting a descriptor to the receive work queue. Subsequently, a sending node specifies, in its own memory, the address of the data to be sent to that receiving node. This location is then submitted to its send work queue, analogous to the procedure for the receive work queue.
\paragraph{Remote Direct Memory Access model (memory semantics)} This approach is lesser-known. When using the \gls{rdma} model, one node, the active node, specifies both the local and the remote memory region. There are two possible operations in this model: \textit{\gls{rdma} write} and \textit{\gls{rdma} read}. In the former, the active node specifies a local memory region which contains data to be sent and a remote memory region to which the data shall be written. In the latter, the active node specifies a remote memory region which contains data it wants to acquire and a local memory region to which the data shall be written. To initiate an \gls{rdma} transfer, the active node has to specify the local and remote memory addresses and the operation mode in a descriptor and submit it to the send work queue. The operating system and software on the passive node are not aware of either \gls{rdma} operation. Hence, there is no need to submit descriptors to the receive work queue at the passive side.
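The kind of information a descriptor carries, and the asymmetry of the \gls{rdma} model (only the active side posts a descriptor), can be sketched as follows. The field and function names are illustrative, not the actual VIA structure layout:

```python
# Sketch of a VIA-style descriptor (illustrative fields, not the real layout).
from dataclasses import dataclass

@dataclass
class Descriptor:
    op: str               # "send", "recv", "rdma_write" or "rdma_read"
    local_addr: int       # location of the data (or receive buffer) locally
    length: int
    remote_addr: int = 0  # only meaningful for the RDMA operations

def post_rdma_write(send_queue: list, local_addr: int,
                    remote_addr: int, length: int) -> None:
    """Active node submits an RDMA write; the passive side posts nothing."""
    send_queue.append(Descriptor("rdma_write", local_addr, length, remote_addr))

sq = []
post_rdma_write(sq, local_addr=0x1000, remote_addr=0x8000, length=4096)
assert sq[0].op == "rdma_write" and sq[0].remote_addr == 0x8000
```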
\subsection{The virtual interface finite-state machine}
The original \gls{via} proposal defines four states in which a virtual interface can reside: \textit{idle}, \textit{pending connect}, \textit{connected}, and \textit{error}. Transitions between states are handled by the VI provider and are invoked by the VI consumer or events on the network. The four states and all possible state transitions are depicted in the finite-state machine in \autoref{fig:via_diagram}. A short clarification on every state is given in the list below:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{\textit{Idle}}: A \gls{vi} resides in this state after its creation and before it gets destroyed. Receive descriptors may be submitted but will not be processed. Send descriptors will immediately complete with an error.
\item \textbf{\textit{Pending connect}}: An active \gls{vi} can move to this state by invoking a connection request to a passive \gls{vi}\@. A passive \gls{vi} will transition to this state when it attempts to accept a connection. In both cases, it stays in this state until the connection is completely established. If the connection request times out or is rejected, or if one of the \glspl{vi} disconnects, the \gls{vi} will return to the \textit{idle} state. If a hardware or transport error occurs, a transition to the \textit{error} state will be made. Descriptors which are submitted to either work queue in this state are treated in the same fashion as they are in the \textit{idle} state.
\item \textbf{\textit{Connected}}: A \gls{vi} resides in this state if a connection request it has submitted has been accepted or after it has successfully accepted a connection request. The \gls{vi} will transition to the \textit{idle} state if it itself or the remote \gls{vi} disconnects. It will transition to the \textit{error} state on hardware or transport errors, or, depending on the reliability level of the connection, on other connection-related errors. All descriptors which have been submitted in previous states and did not result in an immediate error and all descriptors which are submitted in this state are processed.
\item \textbf{\textit{Error}}: If the \gls{vi} transitions to this state, all descriptors present in both work queues are marked as erroneous. The VI consumer must handle the error, transition the \gls{vi} to the \textit{idle} state, and restart the connection if desired.
\end{itemize}
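The four states and the transitions just listed can be condensed into a small transition table. This is a sketch of the state machine as described above; the event names are illustrative:

```python
# Sketch of the VIA virtual interface state machine. Unknown (state, event)
# pairs leave the state unchanged.
TRANSITIONS = {
    ("idle", "connect_request"): "pending_connect",   # active side
    ("idle", "accept"): "pending_connect",            # passive side
    ("pending_connect", "established"): "connected",
    ("pending_connect", "timeout"): "idle",
    ("pending_connect", "rejected"): "idle",
    ("pending_connect", "disconnect"): "idle",
    ("pending_connect", "hw_or_transport_error"): "error",
    ("connected", "disconnect"): "idle",
    ("connected", "hw_or_transport_error"): "error",
    ("error", "consumer_handled_error"): "idle",
}

def next_state(state: str, event: str) -> str:
    return TRANSITIONS.get((state, event), state)

assert next_state("idle", "connect_request") == "pending_connect"
assert next_state("pending_connect", "timeout") == "idle"
assert next_state("connected", "hw_or_transport_error") == "error"
```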
\begin{figure}[ht]
\hspace{0.5cm}
\includegraphics{images/via_states.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{via} state diagram.}\label{fig:via_diagram}
\end{figure}
\section{The InfiniBand Architecture\label{sec:infiniband}}
After a brief introduction to the Virtual Interface Architecture in \autoref{sec:via}, this section will further elaborate upon \gls{ib}. Because the \gls{via} is an abstract model, the purpose of the previous section was not to provide the reader with its exact specification, but rather to give a general idea of the \gls{via} design decisions. Since the exact implementation of various parts of the Virtual Interface Architecture is left open, the \gls{iba} does not completely correspond to the \gls{via}\@. Therefore, a more comprehensive breakdown of the \gls{iba} will be given in this section.
The \gls{ibta} was founded by more than 180 companies in August 1999 to create a new industry standard for inter-server communication. After 14 months of work, this resulted in a collection of manuals of which the first volume describes the architecture~\cite{infinibandvol1} and the second the physical implementation of InfiniBand~\cite{infinibandvol2}. In addition, Pfister~\cite{pfister2001introduction} wrote an excellent summary of the \gls{iba}.
\subsection{Basics of the InfiniBand Architecture\label{sec:iba}}
\paragraph{Network stack}
Like most modern network technologies, the \gls{iba} can be described as a network stack, which is depicted in \autoref{fig:iba_network_stack}. The stack consists of a physical, link, network, and transport layer.
\begin{figure}[ht!]
\includegraphics{images/network_stack.pdf}
\caption{The network stack of the \acrfull{iba}.}\label{fig:iba_network_stack}
\end{figure}
The \gls{iba} implementations of the different layers are displayed in the right column of \autoref{fig:iba_network_stack}. Although the present work attempts to separate the different layers into different subsections, some features cannot be explained without referring to features in other layers. Hence, the subsections do not directly correspond with the different layers.
First, this subsection gives some basic definitions for InfiniBand. It also includes some information about segmentation \& reassembly of messages (although that is part of the transport layer). The main component of the transport layer, the queue pair, is presented in \autoref{sec:qp}. That subsection also points out some similarities and differences between the \gls{via} and the \gls{iba}\@. Then, after the basics of the \gls{iba} subnet, the subnet manager, and managers in general are described in \autoref{sec:networking}, routing within and between subnets will be elaborated upon in \autoref{sec:addressing}. Subsequently, \autoref{sec:vlandsl} clarifies InfiniBand's virtual lanes and service levels. \Autoref{sec:congestioncontrol} and~\ref{sec:memory} go further into flow control and memory management in the \gls{iba}, respectively. Finally, \autoref{sec:communication_management} explains how communication is established, managed, and destroyed.
An overview of the implementation of the physical link will not be given in the present work. The technical details on this can be found in the second volume of the InfiniBand\texttrademark~Architecture Specification~\cite{infinibandvol2}. The implementation of consumer operations will be elaborated upon later, in \autoref{sec:iblibs}.
\paragraph{Message segmentation} Communication on InfiniBand networks is divided into messages between \SI{0}{\byte} and $\SI[parse-numbers=false]{2^{31}}{\byte}$ (\SI{2}{\gibi\byte}) for all service types, except for unreliable datagram. The latter supports---depending on the \gls{mtu}---messages between \SI{0}{\byte} and \SI{4096}{\byte}.
Messages that are bigger than the \gls{mtu}, which describes the maximum size of a packet, are segmented into smaller packets by \gls{ib} hardware. The \gls{mtu} can be---depending on the hardware that is used---256, 512, 1024, 2048, or \SI{4096}{\byte}. Since segmentation and reassembly of packets is handled by hardware, the \gls{mtu} should not affect performance~\cite{crupnicoff2005deploying}. \Autoref{fig:message_segmentation} depicts the principle of breaking a message down into packets. An exact breakdown of the composition of packets will be described in \autoref{sec:addressing}.
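The segmentation rule can be sketched as follows. This is a simplified model of what the hardware does (payload sizes only; packet headers are ignored):

```python
# Sketch: segmenting a message into MTU-sized packets, as IB hardware does
# transparently for the consumer.
def segment(message_len: int, mtu: int) -> list:
    """Return the payload length of every packet for one message."""
    assert mtu in (256, 512, 1024, 2048, 4096), "invalid IBA MTU"
    if message_len == 0:
        return [0]            # a zero-length message still yields one packet
    full, rest = divmod(message_len, mtu)
    return [mtu] * full + ([rest] if rest else [])

assert segment(10_000, 4096) == [4096, 4096, 1808]
assert sum(segment(10_000, 256)) == 10_000
```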
\begin{figure}[ht!]
\includegraphics{images/message_segmentation.pdf}
\vspace{-0.5cm}
\caption{The segmentation of a message into packets.}\label{fig:message_segmentation}
\end{figure}
\paragraph{Endnodes and channel adapters} Ultimately, all communication on an InfiniBand network happens between \textit{endnodes} (also referred to as nodes in the present work). Such an endnode could be a host computer, but also, for example, a storage system.
A \gls{ca} forms the interface between the soft- and hardware of an endnode and the physical link which connects the endnode to a network. A channel adapter can either be a \gls{hca} or a \gls{tca}. The former is most commonly used, and distinguishes itself from the latter by implementing so-called \textit{verbs}. Verbs form the interface between processes on a host computer and the InfiniBand fabric; they are the implementation of the user agent from \autoref{fig:via_model}.
\paragraph{Service types} InfiniBand supports several types of communication services which are introduced in \autoref{tab:service_types}. Every channel adapter must implement \gls{ud}, which is conceptually comparable to \gls{udp}\@. \glspl{hca} must implement \glspl{rc}; this is optional for \glspl{tca}. The reliable connection is similar to \gls{tcp}\@. Neither of the channel adapter types is required to implement \glspl{uc} or \gls{rd}.
\Autoref{tab:service_types} describes the service levels on a very abstract level. More information on the implementation, for example, on the different headers which are used in \gls{iba} data packets, will be given later on. Furthermore, \autoref{tab:service_types} already contains references to the abbreviation \acrshort{qp}, which stands for queue pair and is InfiniBand's equivalent to a virtual interface (\autoref{sec:via}). This will be elaborated upon in the next subsection.
\input{tables/service_types}
\subsection{Queue pairs \& completion queues\label{sec:qp}}
As mentioned before, the InfiniBand Architecture is inspired by the Virtual Interface Architecture. \Autoref{fig:iba_model}, which is derived from \autoref{fig:via_model}, depicts an abstract model of the InfiniBand Architecture. In order to simplify this picture, the consumer and kernel agent are omitted. In the following, the functioning principle of this model will be explained.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/iba_model.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{iba} model.}\label{fig:iba_model}
\end{figure}
Virtual interfaces are called \glspl{qp} in the \gls{iba} and likewise consist of \glspl{sq} and \glspl{rq}. They are the highest level of abstraction and enable processes to directly communicate with the \gls{hca}\@. After everything has been initialized, a process will perform most operations on queue pairs while communicating over an InfiniBand network.
Similarly to a descriptor in the \gls{via}, a \gls{wr} has to be submitted to the send or receive queue in order to send or receive messages. Submitting a \gls{wr} results in a \gls{wqe} in the respective queue. Among other things, a \gls{wqe} holds the address of a location in the host's main memory. In the case of a send \gls{wqe}, this memory location contains the data to be sent to a remote host. In the case of a receive \gls{wqe}, the contained memory address points to the location in main memory to which received data shall be written. Not every \gls{qp} can access all memory locations; this protection is handled by specific memory management mechanisms. These also handle which locations may be accessed by the remote hosts and by the \gls{hca}\@. More information on memory management can be found in \autoref{sec:memory}.
A work queue element in the send queue also contains the network address of the remote endnode and the transfer model, e.g., the send messaging model or an \gls{rdma} model. Besides initiating data transmissions, a work request can also be used to bind a memory window to a memory region. This is further elaborated upon in \autoref{sec:memory}. A more comprehensive overview of the composition of \glspl{wr} in general will be provided in \autoref{sec:iblibs}.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/qp_communication.pdf}
\vspace{-0.5cm}
\caption{Three \acrfullpl{sq} on a sending node communicate with three \acrfullpl{rq} on a receiving node. Both nodes have both a send and a receive queue, but the unused queues have been omitted for the sake of clarity.}\label{fig:qp_communication}
\end{figure}
\paragraph{Example} \autoref{fig:qp_communication} shows an example with three queue pairs in one node---in this example called \textit{sending node}---that communicate with three queue pairs of another node---here, \textit{receiving node}. Note that a queue pair is always initialized with a send and a receive queue; for the sake of clarity, the unused queues have been omitted in this depiction. Hence, the image shows no receive queues for the sending node and no send queues for the receiving node.
First, before any message can be transmitted between the two nodes, the receiving node has to prepare receive \glspl{wqe} by submitting receive work requests to the receive queues. Every receive \gls{wr} includes a pointer to a local memory region, which provides the \gls{hca} with a memory location to save received messages to. In the picture, the consumer is submitting a \gls{wr} to the red receive queue.
Second, send work requests may be submitted, which will then be processed by the channel adapter. Although the processing order of the queues depends on the priority of the services (\autoref{sec:vlandsl}), on congestion control (\autoref{sec:congestioncontrol}), and on the manufacturer's implementation of the \gls{hca}, \glspl{wqe} in a single queue will always obey the \gls{fifo} principle. In this image, the consumer is submitting a send work request to the red send queue, and the \gls{hca} is processing a \gls{wqe} from the blue send queue.
After the \gls{hca} has processed a \gls{wqe}, it places a \gls{cqe} in the completion queue. This entry contains, among other things, information about the \gls{wqe} which was processed, but also about the status of the operation. The status could indicate a successful transmission, but also an error, e.g., if insufficient receive work queue elements were available in the receive queue. A \gls{cqe} is posted when a \gls{wqe} is completely processed, so the exact moment at which it is posted depends on the service type that is used. For example, if the service type is unreliable, the \gls{wqe} will complete as soon as the channel adapter has processed it and sent the data. However, if a reliable service type is used, the \gls{wqe} will not complete until the message has been successfully received by the remote host.
Obviously, after the message has been sent over the physical link, the receiving node's \gls{hca} will receive that same message. Then, it will acquire the destination \gls{qp} from the packets' base transport headers---more on that in \autoref{sec:addressing}---and grab the first available element from that \gls{qp}'s receive queue. In the case of this example, the channel adapter is consuming a \gls{wqe} from the blue receive queue. After retrieving a work queue element, the \gls{hca} will read the memory address from the \gls{wqe} and write the message to that memory location. When it is done doing so, it will post a completion queue entry to the completion queue. If the consumer of the sending node included immediate data in the message, that will be available in the \gls{cqe} at the receive side.
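The complete round trip of the example---pre-posting a receive, posting a send, the channel adapter moving the data and posting completions on both sides---can be sketched as a minimal model. This is not the verbs API; all names and structures are illustrative:

```python
# Sketch of the send/receive flow between two queue pairs (simplified model).
from collections import deque

memory = {}                      # address -> data, stands in for host memory

def post_recv(rq: deque, addr: int) -> None:
    rq.append({"addr": addr})    # receive WQE: where to store incoming data

def post_send(sq: deque, addr: int) -> None:
    sq.append({"addr": addr})    # send WQE: where the outgoing data lives

def hca_transfer(sq: deque, rq: deque, send_cq: list, recv_cq: list) -> None:
    """Process one send WQE; fails if no receive WQE is available."""
    wqe = sq.popleft()
    if not rq:
        send_cq.append({"status": "error: receiver not ready"})
        return
    dst = rq.popleft()
    memory[dst["addr"]] = memory[wqe["addr"]]   # the actual data movement
    send_cq.append({"status": "ok"})
    recv_cq.append({"status": "ok", "addr": dst["addr"]})

sq, rq, scq, rcq = deque(), deque(), [], []
memory[0x10] = b"hello"
post_recv(rq, 0x20)              # the receiver prepares a buffer first
post_send(sq, 0x10)
hca_transfer(sq, rq, scq, rcq)
assert memory[0x20] == b"hello" and scq[0]["status"] == "ok"
```

Note that reversing the order (posting the send before any receive \gls{wqe} exists) yields the error completion described above.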
\paragraph{Processing WQEs}After a process has submitted a work request to one of the queues, the channel adapter starts processing the resulting \gls{wqe}. As can be seen in \autoref{fig:iba_model}, an internal \gls{dma} engine will access the memory location which is included in the work queue element, and will copy the data from the host's main memory to a local buffer of the \gls{hca}. Every port of an \gls{hca} has several of these buffers which are called \glspl{vl}. Subsequently, separately for every port, an arbiter decides from which virtual lane packets will be sent onto the physical link. How packets are distributed among the virtual lanes and how the arbiter decides from which virtual lane to send is explained in \autoref{sec:vlandsl}.
\paragraph{Queue pair state machine} Like the virtual interfaces in \autoref{sec:via}, queue pairs can reside in several states as depicted in \autoref{fig:qp_states}. All black lines are normal transitions and have to be explicitly initiated by a consumer with a \textit{modify queue pair verb}. Red lines are transitions to error states, which usually happen automatically. Because this diagram is more extensive than the state machine of the \gls{via} (\autoref{fig:via_diagram}), the descriptions of the state transitions are omitted in this figure. All states, their characteristics, and the way to enter the state are summarized in the list below. Every list item has a sublist which provides information on how work requests, received messages, and messages to be sent are handled.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/qp_states.pdf}
\vspace{-0.5cm}
\caption{The state diagram of a \acrfull{qp} in the \acrfull{iba}.}\label{fig:qp_states}
\end{figure}
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{\textit{Reset}}: When a \gls{qp} is created, it enters this state. Although this is not depicted, a transition from all other states to this state is possible.
\begin{itemize}
\setlength\itemsep{0.0em}
\item Submitting \textbf{work requests} will return an immediate error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be silently dropped.
\item No \textbf{messages are sent} from this \gls{qp}\@.
\end{itemize}
\item \textbf{\textit{Initialized}}: This state can be entered if the modify queue pair verb is called from the \textit{reset} state.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to the receive queue but they will not be processed in this state. Submitting a \gls{wr} to the send queue will return an immediate error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be silently dropped.
\item No \textbf{messages are sent} from this \gls{qp}\@.
\end{itemize}
\item \textbf{\textit{Ready to receive}}: This state can be entered if the modify queue pair verb is called from the \textit{initialized} state. The \gls{qp} can reside in this state if it only needs to receive, and thus not to send, messages.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to the receive queue and they will be processed. Submitting a \gls{wr} to the send queue will return an immediate error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item No \textbf{messages are sent} from this \gls{qp}. The queue will, however, respond to received packets, e.g., with acknowledgments.
\end{itemize}
\item \textbf{\textit{Ready to send}}: This state can be entered if the modify queue pair verb is called from the \textit{ready to receive} or \textit{\gls{sq} drain} state. Most of the time, \glspl{qp} reside in this state, since a queue pair in it is able to receive and send messages and is thus fully operational.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to both queues; \glspl{wqe} in both queues will be processed.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item \textbf{Messages are sent} for every \gls{wr} that is submitted to the send queue.
\end{itemize}
\item \textbf{\textit{\gls{sq} drain}}: This state can be entered if the modify queue pair verb is called from the \textit{ready to send} state. This state drains the send queue, which means that all send \glspl{wqe} that are present in the queue when entering the state will be processed, but all \glspl{wqe} that are submitted after it entered this state will not be processed. The state has two internal states: \textit{draining} and \textit{drained}. While residing in the former, there are still work queue elements that are being processed. While residing in the latter, there are no more work queue elements that will be processed. When the \gls{qp} transitions from the \textit{draining} to the \textit{drained} internal state, it generates an affiliated asynchronous event.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} may be submitted to both queues. \glspl{wqe} in the receive queue will be processed. \glspl{wqe} in the send queue will only be processed if they were present when entering the \textit{\gls{sq} drain} state.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item \textbf{Messages are sent} only for \glspl{wr} that were submitted before the \gls{qp} entered this state.
\end{itemize}
\item \textbf{\textit{\gls{sq} error}}: When a completion error occurs while the \gls{qp} resides in the \textit{ready to send} state, a transition to this state happens automatically for all \gls{qp} types except the \gls{rc} \gls{qp}. Since an error in a \gls{wqe} can cause the local or remote buffers to become undefined, all \glspl{wqe} subsequent to the erroneous \gls{wqe} will be flushed from the queue. The consumer can put the \gls{qp} back into the \textit{ready to send} state by calling the modify queue pair verb.
\begin{itemize}
\item \textbf{Work requests} may be submitted to the receive queue and will be processed in this state. \glspl{wr} that are submitted to the send queue will be flushed with an error.
\item \textbf{Messages that are received} by the \gls{hca} and targeted to this \gls{qp} will be processed as defined in the receive \glspl{wqe}.
\item No \textbf{messages are sent} from this \gls{qp}\@. The queue will, however, respond to received packets, e.g., with acknowledgments.
\end{itemize}
\item \textbf{\textit{Error}}: Every state may transition to the \textit{error} state. This can happen automatically---when a send \gls{wr} in an \gls{rc} \gls{qp} completes with an error or when a receive \gls{wr} in any \gls{qp} completes with an error---or explicitly---when the consumer calls the modify queue pair verb. All outstanding and newly submitted \glspl{wr} will be flushed with an error.
\begin{itemize}
\setlength\itemsep{0.0em}
\item \textbf{Work requests} to both queues will be flushed immediately with an error.
\item \textbf{Packets that are received} by the \gls{hca} and targeted to this \gls{qp} will be silently dropped.
\item No \textbf{packets are sent}.
\end{itemize}
\end{itemize}
State transitions that are marked with black lines, which must be explicitly invoked by the consumer, will not succeed if the wrong arguments are passed to the modify queue pair verb. The first volume of the InfiniBand\texttrademark~Architecture Specification~\cite{infinibandvol1} provides a list of all state transitions and the required and optional attributes that can be passed on to the verb. The present work will not provide the complete list of all transitions with their attributes, but will only give some examples of important transitions in the following.
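The consumer-initiated transitions listed above can be condensed into a small table; a minimal sketch (abbreviated state names, not the verbs API):

```python
# Sketch of the QP state machine: consumer-initiated modify-QP transitions.
MODIFY_QP = {
    "reset": {"init"},
    "init": {"rtr"},    # ready to receive
    "rtr": {"rts"},     # ready to send
    "rts": {"sqd"},     # send queue drain
    "sqd": {"rts"},
    "sqe": {"rts"},     # consumer recovers from the SQ error state
}
# Any state may additionally be moved to "reset" or "error" explicitly;
# completion errors move QPs to "sqe" or "error" automatically.

def modify_qp(state: str, target: str) -> str:
    if target in ("reset", "error"):
        return target
    if target in MODIFY_QP.get(state, set()):
        return target
    raise ValueError(f"illegal transition {state} -> {target}")

qp = "reset"
for s in ("init", "rtr", "rts"):    # the usual bring-up sequence
    qp = modify_qp(qp, s)
assert qp == "rts"
```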
Queue pairs are not immediately ready to establish a connection after they have been initialized to the \textit{reset} state. To perform the transition \textit{reset}$\,\to\,$\textit{initialized}, the partition key index and, in the case of unconnected service types, the queue key have to be provided. Furthermore, \gls{rdma} and atomic operations have to be enabled or disabled in this transition. A second important transition is \textit{initialized}$\,\to\,$\textit{ready to receive} because here, in case of a connected service, the \gls{qp} will connect to another \gls{qp}\@. The consumer has to provide the modify \gls{qp} verb with, among others, the remote node address vector and the destination \gls{qpn} before it can perform the transition. If the \gls{qp} must operate in loopback mode, this has to be defined here as well.
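The idea that each transition requires certain attributes, and fails without them, can be sketched as follows. The attribute names are illustrative placeholders, not the actual verb arguments:

```python
# Sketch: every modify-QP transition has required attributes; the transition
# fails if any of them is missing (illustrative names only).
REQUIRED_ATTRS = {
    ("reset", "init"): {"pkey_index", "access_flags"},    # plus qkey for
                                                          # unconnected types
    ("init", "rtr"): {"remote_addr_vector", "dest_qpn"},  # connected types
}

def modify_qp(state: str, target: str, attrs: set) -> str:
    missing = REQUIRED_ATTRS.get((state, target), set()) - attrs
    if missing:
        raise ValueError(f"{state} -> {target} failed, missing {sorted(missing)}")
    return target

qp = modify_qp("reset", "init", {"pkey_index", "access_flags"})
qp = modify_qp(qp, "rtr", {"remote_addr_vector", "dest_qpn"})
assert qp == "rtr"
```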
\subsection{The InfiniBand Architecture subnet\label{sec:networking}}
The smallest entity in the InfiniBand Architecture is a \textit{subnet}. It is defined as a network of at least two endnodes, connected by physical links and optionally connected by one or more switches. Every subnet is managed by a \gls{sm}.
One task of switches is to route packets from their source to their destination, based on the packet's \gls{lid} (\autoref{sec:addressing}). The local identifier is a 16-bit wide address of which 48K values can be used to address endnodes in the subnet and 12K addresses are reserved for multicast. Switches support multiple service levels on several virtual lanes, which will be elaborated upon in \autoref{sec:vlandsl}.
It is possible to route between different subnets with a 128-bit long \gls{gid} (\autoref{sec:addressing}).
\paragraph{Subnet manager} In order for endnodes on a subnet to communicate properly with each other and for the operation of the subnet to be guaranteed, at least one managing entity has to be present to coordinate the network. Such an entity is called \acrfull{sm} and can either be located on an endnode, a switch, or a router. Tasks of the \gls{sm} are:
\begin{itemize}
\setlength\itemsep{0.2em}
\item discovering the topology of the subnet (e.g., information about switches and nodes, including, for example, the \gls{mtu});
\item assigning \glspl{lid} to \glspl{ca};
\item establishing possible paths and loading switches' routing tables;
\item regularly scanning the network for (topology) changes.
\end{itemize}
\begin{figure}[ht]
\hspace{0.4cm}
\includegraphics{images/sm_states.pdf}
\vspace{-0.5cm}
\caption{The state machine for the initialization of a \acrfull{sm}. \textit{AttributeModifiers} from the \acrfull{mad} header (\autoref{fig:MAD}) are completely written in capital letters.}\label{fig:sm_states}
\end{figure}
A subnet can contain more than one manager but only one of them may be the \textit{master} \gls{sm}\@. All others must be in standby mode. \Autoref{fig:sm_states} depicts the state machine a subnet manager goes through to identify whether it should be master or not. An \gls{sm} starts in the \textit{discovering} state in which it scans the network. As soon as it discovers another \gls{sm} with a higher priority, it transitions into \textit{standby} mode in which it keeps polling the newly found manager. If the polled manager fails to respond (\textit{polling time-out}), the \gls{sm} goes back to the \textit{discovering} state. If the node completes the discovery without finding a master or a manager with a higher priority, it transitions into the \textit{master} state and starts to initialize the subnet. A master can put other \glspl{sm} which are currently in standby mode and have a lower priority in the \textit{non-active} mode by sending a \textit{DISABLE} datagram. If it detects an \gls{sm} in standby mode with a higher priority, it will hand over mastership. To do so, it sends a \textit{HANDOVER} datagram, which will transition the newly found \gls{sm} into the \textit{master} state. If that \gls{sm} responds with an \textit{ACKNOWLEDGE} datagram, the old master will move to the \textit{standby} state.
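The outcome of the election can be sketched as follows. This is a strongly simplified model of the discovery result only---the highest-priority \gls{sm} becomes master, all others end up in standby; the polling, handover, and non-active mechanics are omitted, and tie-breaking (by \gls{guid} in the real protocol) is not modeled:

```python
# Sketch: master subnet manager election by priority (simplified).
def elect_master(sm_priorities: dict) -> str:
    """Return the name of the SM that ends up in the 'master' state."""
    return max(sm_priorities, key=lambda sm: sm_priorities[sm])

def states_after_discovery(sm_priorities: dict) -> dict:
    master = elect_master(sm_priorities)
    return {sm: ("master" if sm == master else "standby")
            for sm in sm_priorities}

states = states_after_discovery({"sm0": 3, "sm1": 7, "sm2": 5})
assert states == {"sm0": "standby", "sm1": "master", "sm2": "standby"}
```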
\paragraph{Subnet management agents} Every endnode has to contain a passively acting \gls{sma}. Although agents can send a trap to the \gls{sm}---for example if the \gls{guid} changes at runtime---they usually only respond to messages from the manager. Messages from the \gls{sm} to an \gls{sma} can, for example, include the endnode's \gls{lid} or the location to send traps to.
\paragraph{Subnet administration} Besides \glspl{sm} and \glspl{sma}, the subnet also contains a \gls{sa}. The \gls{sa} is closely connected to the \gls{sm} and often even a part of it. Through \textit{subnet administration class management datagrams}, endnodes can request information to operate on the network from the administrator. This information can, for example, contain data on paths, but also non-algorithmic data such as \textit{service level to virtual lane mappings}.
\paragraph{Management datagrams} \glspl{mad} are used to communicate management instructions. They are always \SI{256}{\byte}---the exact size of the minimal \gls{mtu}---and are divided into several subclasses. There are two types of \glspl{mad}: one for general services and subnet administration, and one for subnet management. The subnet management \gls{mad} is used for communication between managers and agents, and is also referred to as \gls{smp}. The subnet administration \gls{mad} is used to send requests to and receive responses from the subnet administration, and falls under the category of \glspl{gmp}. Other than the \gls{sa}, general services like performance management, baseboard management, device management, SNMP tunneling, communication management (\autoref{sec:communication_management}), and some vendor and application specific protocols make use of \glspl{gmp}.
\begin{figure}[ht!]
\includegraphics{images/MAD.pdf}
\vspace{-0.5cm}
\caption{The composition of a \acrfull{mad}. The first \SI{24}{\byte} are reserved for the common \acrshort{mad} header. The header is followed by up to \SI{232}{\byte} of \acrshort{mad} class specific data.}\label{fig:MAD}
\end{figure}
\Autoref{fig:MAD} shows the management datagram base format. It is made up of a common header (between byte 0 and 23) which is used by all management packets; both \glspl{smp} and \glspl{gmp} use this header. The header is followed by a \SI{232}{\byte} data field which is different for every management datagram class.
\glspl{smp} have some particular characteristics. To ensure their transmission, $\mathrm{\acrshort{vl}}_{15}$ is exclusively reserved for \glspl{smp}. This lane is not subjected to flow control restriction (\autoref{sec:congestioncontrol}) and it is passed through the subnet ahead of all other virtual lanes. Furthermore, \glspl{smp} can make use of directed routing, which means that the switch ports a packet should exit can be specified instead of a local identifier. \glspl{smp} are always received on $\mathrm{\gls{qp}}_0$.
\glspl{gmp} may generally use any virtual lane except $\mathrm{\acrshort{vl}}_{15}$ and any queue pair. \gls{sa} \glspl{mad} are an exception: although they, too, can use any virtual lane except $\mathrm{\acrshort{vl}}_{15}$, they must be sent to $\mathrm{\gls{qp}}_1$.
\subsection{Data packet format \& addressing\label{sec:addressing}}
\Autoref{fig:iba_packet_format} shows the composition of a complete InfiniBand data packet. Blocks with a dashed border are optional---e.g., the \gls{grh} is not necessary if the packet does not leave the subnet from which it originated---and blocks with continuous borders are mandatory---e.g., the \glspl{crc} have to be computed for every packet.
In order to send data to non-\gls{iba} subnets, the architecture supports raw packets in which the InfiniBand specific transport headers and the invariant \gls{crc} are omitted. The present work will not go into detail on raw packets; more information on these packets can be found in the \gls{iba} specification~\cite{infinibandvol1}.
Important information about the different kinds of transport headers, immediate data, the payload, and the two kinds of \glspl{crc} can be found in \autoref{tab:packet_abbreviations}. Because of their importance, information on the local and global routing header will be given in a separate section below.
\begin{figure}[ht!]
\includegraphics{images/iba_packet_format.pdf}
\vspace{-0.5cm}
\caption{The composition of a complete packet in the \acrfull{iba}.}\label{fig:iba_packet_format}
\end{figure}
\input{tables/packet_abbreviations}
\paragraph{Local routing header} The \gls{lrh} contains all necessary information for a packet to be correctly passed on within a subnet. \Autoref{fig:LRH} depicts the composition of the \gls{lrh}\@.
The most crucial fields of this header are the 16-bit source and destination \textit{local identifier} fields. A channel adapter's port can be uniquely identified within a subnet by its \gls{lid}, which the subnet manager assigns to every port in the subnet. Besides an identifier, the subnet manager also provides \glspl{ca} with an \gls{lmc}. This value, which can range from 0 to 7, indicates how many low order bits of the \gls{lid} the \gls{ca} may ignore when determining whether a received packet is targeted at it. Since switches do \textit{not} ignore these \textit{don't care bits}, up to $2^7=128$ different paths can lead to a single port in a subnet, which is a large benefit. Consequently, with this mask, it is possible to reach one single port with up to 128 different unicast \glspl{lid}. As mentioned earlier, the 16-bit \gls{lid} space can hold approximately 48K unicast entries and 16K multicast entries.
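The don't-care-bit rule can be sketched as a simple mask check. This is a minimal model assuming only the \gls{lid}/\gls{lmc} semantics just described; the example values are illustrative:

```python
def lid_matches_port(dlid, port_lid, lmc):
    """A CA accepts a packet whose destination LID matches its own LID
    in all but the low `lmc` bits (lmc in 0..7), so 2**lmc unicast LIDs
    reach the same port."""
    assert 0 <= lmc <= 7
    return (dlid >> lmc) == (port_lid >> lmc)
```

With a base \gls{lid} of \texttt{0x40} and an \gls{lmc} of 3, the eight destination \glspl{lid} \texttt{0x40}--\texttt{0x47} all address the same port, while switches may route each of them along a different path.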
The 11-bit \textit{packet length} field indicates the length of the complete packet in 4-byte words. This not only includes the length of the payload, but also of all headers. The \textit{VL} and \textit{SL} fields indicate which virtual lane and service level are used, respectively. Later, in \autoref{sec:vlandsl}, virtual lanes, service levels, and their connection will be explained in more detail.
The 4-bit \textit{LVer} field indicates which link level protocol is used. \textit{LNH} stands for \textit{Link Next Header} and this 2-bit field indicates the header that follows the mandatory local routing header. The LNH's \gls{msb} indicates if the packet uses \gls{iba} transport or raw transport. The second bit indicates if an optional \gls{grh} is present.
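The two LNH bits can be decoded as follows. This is a hedged toy decoder based only on the bit semantics stated in this paragraph (\gls{msb}: \gls{iba} vs.\ raw transport, second bit: \gls{grh} present):

```python
def decode_lnh(lnh):
    """Decode the 2-bit Link Next Header field: bit 1 (MSB) selects IBA
    transport over raw transport, bit 0 signals an optional GRH."""
    assert 0 <= lnh <= 0b11
    return {
        "iba_transport": bool(lnh & 0b10),
        "grh_present": bool(lnh & 0b01),
    }
```

A value of \texttt{0b11} therefore denotes an \gls{iba} transport packet that carries a global routing header, i.e., a packet that may leave the subnet.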
\begin{figure}[ht!]
\includegraphics{images/LRH.pdf}
\vspace{-0.5cm}
\caption{The composition of the \acrfull{lrh}.}\label{fig:LRH}
\end{figure}
\paragraph{Global routing header} The \acrfull{grh} contains all necessary information for a packet to be correctly passed on by a router between subnets. \Autoref{fig:GRH} depicts the composition of the \gls{grh}.
\begin{figure}[ht!]
\includegraphics{images/GRH.pdf}
\vspace{-0.5cm}
\caption{The composition of the \acrfull{grh}.}\label{fig:GRH}
\end{figure}
The most crucial fields of this header are the 128-bit source and destination \textit{global identifier} fields. \Autoref{fig:GID} shows the possible compositions of a \gls{gid}\@. \Autoref{fig:GID_a} shows the composition of the unicast \gls{gid}; it consists of a \gls{gid} prefix---more on that later---and a \gls{guid}. The \gls{guid} is an IEEE \gls{eui64} and uniquely identifies each element in a subnet~\cite{eui64}. The \gls{guid} is always present in a unicast global identifier. The 24 \glspl{msb} of the \gls{guid} are reserved for the company identifier, which is assigned by the IEEE Registration Authority. The 40 \glspl{lsb} are assigned to a device by said company to uniquely identify it. The subnet manager may change the \gls{guid} if the scope is set to local; more on that below.
\begin{figure}[ht!]
\vspace{0.7cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/GID_unicast.pdf}
\vspace{-0.7cm}
\caption{The three possible compositions of a unicast \acrshort{gid}.}\label{fig:GID_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/GID_multicast.pdf}
\vspace{-0.7cm}
\caption{The composition of a multicast \acrshort{gid}.}\label{fig:GID_b}
\end{subfigure}
\caption{The possible structures of \acrfullpl{gid}.}\label{fig:GID}
\end{figure}
The composition of the 64-bit prefix depends on the scope in which packets will be sent. It all comes down to three cases, which are listed below. The enumeration of the list below corresponds to the enumeration in \autoref{fig:GID_a}. Each port will have at least one unicast \gls{gid}, which is referred to as \textit{GID index 0}. This \gls{gid} can be created using the first or the second option from the list below. Both options are based on the default \gls{gid} prefix \texttt{0xFE80::0}. Packets that are constructed using the default \gls{gid} prefix and a valid \gls{guid} must always be accepted by an endnode, but must never be forwarded by a router. That means that packets with only a GID index 0 are always restricted to the local subnet.
\begin{enumerate}
\setlength\itemsep{0.2em}
\item \textbf{Link-local}: The global identifier only consists of the default \gls{gid} prefix \texttt{0xFE80::0} and the device's \gls{eui64} and is only unique within the local subnet. Routers will not forward packets with this global identifier. \texttt{0x3FA} in \autoref{fig:GID_a} is another representation of the default \gls{gid} prefix:
\begin{equation}
\texttt{0x3FA} = (\texttt{0xFE8} \gg 2).
\end{equation}
It is used to clarify the extra bit which has to be set in the second option of this list. The two \glspl{lsb} of \texttt{0xFE8} which are eliminated by the right shift are zero and are absorbed by the 54-bit \texttt{0x0::0} block.
\item \textbf{Site-local}: The global identifier consists of the default \gls{gid} prefix with the 54th bit of the \gls{gid} prefix set to \one. In the representation of \autoref{fig:GID_a}, this corresponds to:
\begin{equation}
\texttt{0x3FB} = (\texttt{0xFE8} \gg 2) + 1 = \texttt{0x3FA} + 1.
\end{equation}
The 16-bit \textit{subnet prefix} is set to a value chosen by the subnet manager. This \gls{gid} is unique in a collection of connected subnets, but not necessarily globally.
\item \textbf{Global}: This is the only \gls{gid} type which is forwarded by routers, since it is guaranteed to be globally unique.
\end{enumerate}
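The prefix arithmetic from the list above can be verified with a few lines of integer arithmetic. The sketch below is illustrative; the \gls{guid} value is a made-up placeholder, not a real device identifier:

```python
# Check the 10-bit prefix forms of the default GID prefix 0xFE80::0.
DEFAULT_PREFIX_TOP12 = 0xFE8              # top 12 bits of 0xFE80::0
assert DEFAULT_PREFIX_TOP12 >> 2 == 0x3FA           # link-local form
assert (DEFAULT_PREFIX_TOP12 >> 2) + 1 == 0x3FB     # site-local form

def unicast_gid(prefix64, guid64):
    """Assemble a 128-bit unicast GID from a 64-bit prefix and an EUI-64 GUID."""
    return (prefix64 << 64) | guid64

LINK_LOCAL_PREFIX = 0xFE80 << 48          # 0xFE80:0000:0000:0000
gid = unicast_gid(LINK_LOCAL_PREFIX, 0x0002_C903_0012_3456)  # GUID illustrative
```

The two \glspl{lsb} dropped by the right shift of \texttt{0xFE8} are zero, which is why they can be absorbed by the following 54-bit zero block without loss of information.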
Multicast \glspl{gid}, as depicted in \autoref{fig:GID_b}, are fundamentally different from unicast \glspl{gid}. To indicate that it is a multicast packet, the 8 \glspl{msb} are all set to \one. The \gls{lsb} of the \textit{flags} field indicates whether it is a permanently assigned multicast \gls{gid} (\zero) or not (\one). The remaining three bits of the flags block are always \zero. The 4-bit \textit{scope} field indicates the scope of the packet. E.g., if scope equals \texttt{0x2}, a packet will be link-local and if scope equals \texttt{0xE}, a packet will be global. The complete multicast address scope is described in the \gls{iba} specification~\cite{infinibandvol1}. The 122 \glspl{lsb} are reserved for the actual multicast \gls{gid}\@.
Although the source and destination identifiers account for \SI{80}{\percent} of the global routing header (\autoref{fig:GRH}), there are some other fields. The 4-bit \textit{IPVer} field indicates the version of the header and the 8-bit \textit{TClass} field indicates the global service level, which will be elaborated upon in \autoref{sec:vlandsl}. The 20-bit \textit{flow label} field helps to identify groups of packets that must be delivered in order. The 8-bit \textit{NxtHdr} field identifies the header which follows the \gls{grh} in the \gls{iba} packet. This is, in case of a normal \gls{iba} packet, the \gls{iba} transport header. The only remaining block is the 8-bit \textit{HopLmt}, which limits the number of hops a packet can make between subnets, before being dropped.
\subsection{Virtual lanes \& service levels\label{sec:vlandsl}}
\acrfullpl{vl} are independent sets of receive and transmit packet buffers. A channel adapter can be seen as a collection of multiple logical fabrics---lanes---which share a port and physical link.
As introduced in \autoref{sec:iba} and in particular in \autoref{fig:message_segmentation}, after a \gls{wqe} appears in the send queue, the channel adapter segments the message (i.e., the data the \gls{wqe} points to) into smaller chunks of data and forms \gls{iba} packets, based on the information present in the \gls{wqe}. Subsequently, a \gls{dma} engine copies them to a virtual lane.
Every switch and channel adapter must implement $\mathrm{\gls{vl}}_{15}$ because it is used for subnet management packets (\autoref{sec:networking}). Furthermore, between 1 and 15 additional virtual lanes $\mathrm{\gls{vl}}_{0\ldots14}$ must be implemented for data transmission. The actual number of \glspl{vl} that is used by a port (1, 2, 4, 8, or 15) is determined by the subnet manager. Until the \gls{sm} has determined how many \glspl{vl} are supported on both ends of a connection and until it has programmed the port's \acrshort{sl} to \gls{vl} mapping table, the mandatory data lane $\mathrm{\gls{vl}}_{0}$ is used.
To understand \gls{qos} in InfiniBand, which signifies the ability of a network technology to prioritize selected traffic, it is essential to understand how packets are scheduled onto the \glspl{vl}. Crupnicoff, Das, and Zahvai~\cite{crupnicoff2005deploying} describe the functioning of \gls{qos} in the \gls{iba} in great detail. This section will first explain how packets are scheduled onto the \glspl{vl}. Then, it will describe how the virtual lanes are arbitrated onto the physical link that is connected to the channel adapter's port.
\paragraph{Scheduling packets onto virtual lanes} The \gls{iba} defines 16 \glspl{sl}. The 4-bit field that represents the \gls{sl} is present in the local routing header (\autoref{fig:LRH}) and stays constant during the packet's path through the subnet. The \gls{sl} depends on the service type which is used (\autoref{tab:service_types}). The first volume of the \gls{iba} specification~\cite{infinibandvol1} describes how the level is acquired for the different types. Besides the \gls{sl} field, there is also the \gls{vl} field in the \gls{lrh}\@. This is set to the virtual lane the packet is sent from, and, as will be discussed below, may change during its path through the subnet.
Although the architecture does not specify a relationship between certain \glspl{sl} and forwarding behavior---this is left open as a fabric administration policy---there is a specification for \gls{sl} to \gls{vl} mapping in switches. If a packet arrives in a switch, the switch may, based on a programmable \textit{SLtoVLMappingTable}, change the lane the packet is on. This also changes the corresponding field in the \gls{lrh}\@. It may happen that a packet on a certain \gls{vl} passes a packet on another \gls{vl} while transitioning through a switch. Service level to virtual lane mapping in switches allows, among others, interoperability between \glspl{ca} with different numbers of lanes.
There is a similar mechanism to service levels for global routing: the \textit{traffic class} (TClass) field in the \gls{grh} (\autoref{fig:GRH}). The present work will not further elaborate upon traffic classes.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/iba_arbiter.pdf}
\vspace{-0.5cm}
\caption{Functional principle of the arbiter.}\label{fig:iba_arbiter}
\end{figure}
\paragraph{Arbitrating the virtual lanes}
The arbitration of virtual lanes to an output port has yet to be discussed. \Autoref{fig:iba_arbiter} depicts the logic in the arbiters which were previously depicted in \autoref{fig:iba_model} as a black box. The arbitration is implemented as a \gls{dpwrr} scheme. It consists of a \textit{high priority-\acrshort{wrr}} table, a \textit{low priority-\acrshort{wrr}} table, and a \textit{limit high priority} counter. Both tables are lists with a field to indicate the index of a virtual lane and a weight with a value between 0 and 255. The counter keeps track of the number of high priority packets that were sent and whether that number exceeds a certain threshold.
If at least one entry is available in the high priority table and the counter is not exceeded, this table is active and a packet from this table will be sent. Which packet depends on the weighted round robin scheme. Assume, for example, that two lanes, $\mathrm{\gls{vl}}_0$ and $\mathrm{\gls{vl}}_1$, are listed in a table with weights 2 and 3, respectively. When the table is active, in $\frac{2}{2+3}\cdot\SI{100}{\percent}=\SI{40}{\percent}$ of the cases a packet from $\mathrm{\gls{vl}}_0$ and in $\frac{3}{2+3}\cdot\SI{100}{\percent}=\SI{60}{\percent}$ of the cases a packet from $\mathrm{\gls{vl}}_1$ will be sent.
If the counter reaches its threshold, a packet from a low priority lane will be sent and the counter is reset to 0. If the high priority table is empty, the low priority table will be checked immediately.
$\mathrm{\gls{vl}}_{15}$ is not subjected to these rules and always has the highest priority. A \gls{vl} may be listed in either one or in both tables at the same time. There may be more than one entry of the same \gls{vl} in one table.
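The weighting within a single table can be sketched as follows. This minimal model deliberately omits $\mathrm{\gls{vl}}_{15}$, the split into high and low priority tables, and the limit high priority counter; it only illustrates how the per-entry weights produce the 40/60 split from the example above:

```python
from collections import deque

def wrr_order(table, queues):
    """Weighted round robin over one table: each (vl, weight) entry may
    send up to `weight` packets per pass. Returns transmission order."""
    out = []
    while any(queues.values()):
        for vl, weight in table:
            for _ in range(weight):
                if queues[vl]:
                    out.append(queues[vl].popleft())
    return out
```

With weights 2 and 3, each full pass over the table transmits two packets from $\mathrm{\gls{vl}}_0$ and three from $\mathrm{\gls{vl}}_1$, matching the percentages computed above.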
The bottom of \autoref{fig:iba_arbiter} shows how packets are distributed among virtual lanes based on their \gls{sl}\@. This is similar to the mapping in switches, as described above. \Autoref{fig:iba_arbiter} does not depict a switch and assumes a direct connection between two channel adapters.
\subsection{Congestion control\label{sec:congestioncontrol}}
InfiniBand is a lossless fabric, which means that congestion control does not rely on dropping packets. Packets will only be dropped during severe errors, e.g., during hardware failures. InfiniBand supports several mechanisms to deal with congestion without dropping packets. In the following, two control mechanisms will be described.
\paragraph{Link-level flow control} The first mechanism, \gls{llfc}, prevents the loss of packets caused by a receive buffer overflow. This is done by synchronizing the state of the receive buffer between source and target node with \glspl{fcpacket}, of which the composition is depicted in \autoref{fig:flow_control_packet}. Flow control packets coexist with data packets, which were presented in \autoref{sec:addressing}.
Flow control packets for a certain virtual lane shall be sent during the initialization of the physical link and prior to the passing of 65,536 \textit{symbol times} since the last time such a packet was sent for that \gls{vl}\@. A symbol time is defined as the time it takes to transmit an \SI{8}{\bit} data quantity onto a physical lane. If the physical link is in initialization state (referred to as \textit{LinkInitialize} in the IBA specification~\cite{infinibandvol1}), \textit{Op} shall be \one in the flow control packet. If the packet is sent when the link is up and not in failure (\textit{LinkArm} or \textit{LinkActive}), \textit{Op} shall be \zero.
\begin{figure}[ht!]
\includegraphics{images/flow_control_packet.pdf}
\vspace{-0.5cm}
\caption{The structure of a \acrfull{fcpacket}.}\label{fig:flow_control_packet}
\end{figure}
The flow for a complete synchronization---from a source node with a sending queue, to a target node with a receiving queue, back to the sending queue---is described in the list below and depicted in \autoref{fig:flow_control_diagram}. Flow control packets are sent on a per-virtual-lane basis; the 4-bit \textit{VL} field is used to indicate the index of $\mathrm{\gls{vl}}_i$. $\mathrm{\gls{vl}}_{15}$ is excluded from flow control.
\begin{figure}[ht!]
\includegraphics{images/flow_control_diagram.pdf}
\vspace{-0.5cm}
\caption{Working principle of \acrfull{llfc} in the \acrfull{iba}.}\label{fig:flow_control_diagram}
\end{figure}
\begin{enumerate}
\setlength\itemsep{0.2em}
\item \textbf{Set FCTBS \& send FC packet}: Upon transmission of an \gls{fcpacket}, the 12-bit \gls{fctbs} field of the \gls{fcpacket} is set to the total number of blocks transmitted since the \gls{vl} was initialized. The \textit{block size} of a packet $i$ is defined as
\begin{equation}
B_{packet,i} = \ceil[\big]{S_i/64},
\end{equation}
with $S_i$ the size of a packet, including all headers, in bytes. Hence, the total number of blocks transmitted at a certain time is defined as:
\begin{equation}
\mathrm{\gls{fctbs}} = B_{total} = \sum_{i} B_{packet,i}.
\end{equation}
\item \textbf{Set and update ABR}: Upon receipt of an \gls{fcpacket}, a 12-bit \gls{abr} field is set to:
\begin{equation}
\mathrm{\gls{abr}} = \mathrm{\gls{fctbs}}.
\end{equation}
Every time a data packet is received and not discarded due to lack of receive capacity, the value is updated according to:
\begin{equation}
\mathrm{\gls{abr}} = (\mathrm{\gls{abr}} + B_{packet}) \bmod 4096,
\end{equation}
with $B_{packet}$ the block size of the received data packet.
\item \textbf{Set FCCL \& send FC packet}: Upon transmission of an \gls{fcpacket}, the 12-bit \gls{fccl} has to be generated. If the receive buffer could permit the receipt of 2048 or more blocks of every possible combination of data packets in the current state, the credit limit is set to:
\begin{equation}
\mathrm{\gls{fccl}} = (\mathrm{\gls{abr}} + 2048) \bmod 4096.
\end{equation}
Otherwise, it is set to:
\begin{equation}
\mathrm{\gls{fccl}} = (\mathrm{\gls{abr}} + N_{B}) \bmod 4096,
\end{equation}
with $N_B$ the number of blocks the buffer could receive in the current state.
\item \textbf{Use FCCL for data packet transmission}: After a valid \gls{fccl} is received, it can be used to decide whether a data packet can be received by a remote node and thus whether it should be sent. To make this decision, a variable $C$ is defined:
\begin{equation}
C = (B_{total} + B_{packet}) \bmod 4096,
\end{equation}
with $B_{total}$ the total blocks sent since initialization and $B_{packet}$ the block size of the packet which will potentially be transmitted. If the condition
\begin{equation}
(\mathrm{\gls{fccl}} - C) \bmod 4096 \leq 2048
\end{equation}
holds, the data packet may be sent.
\end{enumerate}
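The credit arithmetic of the four steps above can be sketched in a few functions. This is a simplified model under the definitions just given; the actual buffer accounting of a \gls{ca} is more involved:

```python
def block_size(packet_bytes):
    """Block size of a packet: ceil(S / 64), S including all headers."""
    return -(-packet_bytes // 64)

def fccl(abr, free_blocks):
    """Steps 2-3: derive the Flow Control Credit Limit from ABR and the
    number of blocks the receive buffer could still accept."""
    credit = 2048 if free_blocks >= 2048 else free_blocks
    return (abr + credit) % 4096

def may_send(fccl_value, b_total, b_packet):
    """Step 4: transmission is permitted iff (FCCL - C) mod 4096 <= 2048,
    with C = (B_total + B_packet) mod 4096."""
    c = (b_total + b_packet) % 4096
    return (fccl_value - c) % 4096 <= 2048
```

Note that all quantities live in a 12-bit space, which is why every sum is reduced modulo 4096; the condition in step 4 remains correct even when the counters wrap around.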
\paragraph{Feedback based control architecture}
\Autoref{fig:congestion_control} illustrates how the \gls{cca} works. Similar to link-level flow control, the \gls{cca} only controls data \glspl{vl}; $\mathrm{\gls{vl}}_{15}$ is excluded and thus \glspl{smp} will never be restricted.
\begin{figure}[ht!]
\includegraphics{images/congestion_control.pdf}
\vspace{-0.5cm}
\caption{Working principle of the \acrfull{cca}. The \acrfull{cct}, \acrfull{tmr}, and threshold value are initialized by the \acrfull{ccm}.}\label{fig:congestion_control}
\end{figure}
The control consists of five steps that are listed below. The enumeration of the list below corresponds to the numbers in \autoref{fig:congestion_control}.
\begin{enumerate}
\setlength\itemsep{0.2em}
\item \textbf{Detection}: The first step is the actual detection of congestion. This is done by monitoring a virtual lane of a given port and reviewing whether its throughput exceeds a certain threshold. This threshold is set by the \gls{ccm} and must always be between 0 and 15, where a value of 0 will turn off congestion control completely and a value of 15 corresponds to a very low threshold and thus aggressive congestion control on that virtual lane.
If the threshold is reached, the \gls{fecn} flag in the base transport header is set before the packet is forwarded to its destination.
\item \textbf{Response}: When an endnode receives a packet where the \gls{fecn} flag in the \acrshort{bth} is set, it sends a \gls{becn} back to the node the packet came from. In the case of connected communication (e.g., reliable connection, unreliable connection), the response might be carried in an ACK packet. If communication is unconnected (e.g., unreliable datagram) an additional \textit{congestion notification packet} has to be sent.
\item \textbf{Determine injection rate reduction}: When a node receives a packet with the \gls{becn} flag set, an index (illustrated as $i$ in \autoref{fig:congestion_control}) will be increased by a preset value. This index is used to read from the \gls{cct}. This table is set by the \gls{ccm} during initialization and contains inter-packet delay values. The higher the index $i$, the higher the delay value it points to.
\item \textbf{Set injection rate reduction}: The value from the \gls{cct} will be used to reduce the injection rate of packets onto the physical link. The reduction can either be applied to the \gls{qp} that caused the packet which got an \gls{fecn} flag, or to all \glspl{qp} that use a particular service level (and thus virtual lane).
\item \textbf{Injection rate recovery}: After a certain time, which is set by the \gls{ccm} as well, the index $i$, and thus also the inter-packet delay, is reduced again. If no more \gls{becn} flags are received, $i$ and the delay will go to zero. If they do not go to zero, the channel adapter will likely settle into an equilibrium at some point. In this equilibrium, the \gls{hca} will send packets with an inter-packet delay which is just above or just under the threshold that causes new \gls{fecn} flags to be generated.
\end{enumerate}
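Steps 3 to 5 can be sketched as a small model. The \gls{cct} contents, the increase step, and the class name below are illustrative assumptions; the real table and timer values are programmed by the \gls{ccm}:

```python
class CongestionControl:
    """Toy model of the injection rate control: BECNs raise the CCT index
    (longer inter-packet delay), timer events lower it again."""

    def __init__(self, cct, increase=1):
        self.cct = cct            # congestion control table: index -> delay
        self.increase = increase  # index increment per received BECN
        self.index = 0

    def on_becn(self):
        self.index = min(self.index + self.increase, len(self.cct) - 1)

    def on_timer(self):
        self.index = max(self.index - 1, 0)

    @property
    def delay(self):
        return self.cct[self.index]
```

Because the table is monotonically increasing, repeated \glspl{becn} ramp the delay up quickly, while the timer decrements it one entry at a time, producing the equilibrium behavior described in step 5.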
\subsection{Memory management\label{sec:memory}}
An \gls{hca}'s access to a host's main memory is managed and protected with three primary objects: \glspl{mr}, \glspl{mw}, and \glspl{pd}. The relationship between queue pairs and these objects is depicted in \autoref{fig:memory_iba}.
\begin{figure}[ht!]
\includegraphics{images/memory_iba.pdf}
\vspace{-0.5cm}
\caption{The relationship between \acrfullpl{qp}, \acrfullpl{mw}, \acrfullpl{mr}, and the host's main memory.}\label{fig:memory_iba}
\end{figure}
\paragraph{Memory regions} A memory region is a registered set of memory locations. A process can register a memory region with a verb, which provides the \gls{hca} with the virtual-to-physical mapping of that region. Furthermore, it returns a \gls{lkey} and \gls{rkey} to the calling process. Every time a work request which has to access a virtual address within a local memory region is submitted to a queue, the local key has to be provided within the work request. The region in the main memory is pinned on registration, which means that the operating system is prohibited from swapping that region out (\autoref{sec:mem_optimization}).
When a work request tries to access a remote memory region on a target node, e.g., with an \textit{\gls{rdma} read} or \textit{write} operation, the remote key of the memory region on the target host has to be provided. Hence, before an \gls{rdma} operation can be performed, the source node has to acquire the \gls{rkey} of the remote memory region it wants to access. This can, for example, be done with a regular \textit{send} operation which only requires local keys.
\paragraph{Protection domains} Protection domains associate memory regions and queue pairs and are specific to each \gls{hca}\@. During creation of memory regions and queue pairs, both have to be associated with exactly one \gls{pd}\@. Multiple memory regions and queue pairs may be part of one protection domain.
A \gls{qp}, which is associated with a certain \gls{pd}, cannot access a memory region in another \gls{pd}\@. E.g., a \gls{qp} in protection\undershort{}domain\undershort{}X in \autoref{fig:memory_iba} can access memory\undershort{}region\undershort{}A and memory\undershort{}region\undershort{}B, but not memory\undershort{}region\undershort{}C.
\paragraph{Memory windows} If a reliable connection, unreliable connection, or a reliable datagram is used, memory windows can be used for memory management. First, memory windows are allocated, and then they are bound to a memory region. Although allocation and deallocation of a memory window requires a system call---and is thus time-consuming and not suitable for use in a datapath---binding a memory window to (a subset of) a memory region is done through a work request submitted to a send queue. A memory window can be bound to a memory region if both are situated in the same protection domain, if local write access for the memory region is enabled, and if the region was enabled for windowing at initialization.
The \gls{rkey} that the \gls{mw} returns on allocation is just a dummy key. Every time the window is (re)bound to (a subset of) a memory region, the \gls{rkey} is regenerated. Memory windows can be very useful for dynamic management of remote memory access. A memory window with remote rights can be bound to a memory region without remote rights, and enable remote access this way. Furthermore, remote access can be granted and revoked dynamically without using system calls.
There are two types of memory windows: Type 1 and Type 2. Whereas the former are addressed only through virtual addresses, the latter can be addressed through either virtual addresses or zero based virtual addresses. More information on the types is given in the first volume of the \gls{iba} specification~\cite{infinibandvol1}.
\paragraph{Examples} The list below provides some examples regarding memory regions, protection domains, and memory windows. The enumerations in the list correspond with the numbers in \autoref{fig:memory_iba}.
\begin{enumerate}
\setlength\itemsep{0.2em}
\item A send work request with a pointer to \texttt{0x0C} was submitted. Since memory\undershort{}region\undershort{}A is bound to the address range this address lies in, the \gls{wr} has to include memory\undershort{}region\undershort{}A's local key. This is necessary so that the \gls{hca} will be able to access the data when it starts processing the \gls{wr}\@. A \gls{wr} submitted to $\mathrm{\gls{qp}}_1$ can only access memory\undershort{}region\undershort{}A and memory\undershort{}region\undershort{}B---and thus only memory with addresses between \texttt{0x0A} and \texttt{0x11} in the current configuration---since these regions share the protection domain with $\mathrm{\gls{qp}}_1$.
Note that, although a memory window is bound to memory\undershort{}region\undershort{}A, $\mathrm{\gls{qp}}_1$ can access the region directly by providing the local key.
\item This case is similar to case 1, but for $\mathrm{\gls{qp}}_2$. Like $\mathrm{\gls{qp}}_1$, $\mathrm{\gls{qp}}_2$ can access all memory regions in the same protection domain as long as the work request that tries to access the memory region contains the right local key.
\item This case is similar to case 1 and 2, but for memory\undershort{}region\undershort{}C since $\mathrm{\gls{qp}}_3$ resides in protection\undershort{}domain\undershort{}Y. It is thus only possible to access memory locations in the main memory in the address range from \texttt{0x12} to \texttt{0x15} with the current configuration. To access other addresses, memory\undershort{}region\undershort{}C would have to be rebound.
\item This case illustrates the reception of an \textit{\gls{rdma} write}. \textbf{Important note}: If a remote host writes into the local memory with an \textit{\gls{rdma} write}, this does not actually consume a receive \gls{wr}\@. It is processed entirely by the \gls{hca}, without the \glspl{qp} and \glspl{cq}---and thus the \gls{os} and processes---even noticing. Case (4) is displayed like this for the sake of simplicity and clarity.
If a remote host wants to access \texttt{0x0A} or \texttt{0x0B} it can use the remote key of memory\undershort{}window\undershort{}1 to access it. Note that remote access does not necessarily have to be turned on for memory\undershort{}region\undershort{}A; only local write access is necessary.
\end{enumerate}
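The access rule that cases 1 to 3 above share---a key must match \emph{and} the region must lie in the queue pair's protection domain---can be sketched as follows. The addresses and key names loosely follow \autoref{fig:memory_iba} but are illustrative placeholders:

```python
class MemoryRegion:
    """A registered address range with its protection domain and local key."""
    def __init__(self, pd, start, length, lkey):
        self.pd, self.start, self.length, self.lkey = pd, start, length, lkey

def can_access(qp_pd, regions, lkey, addr):
    """A QP may access an address iff some region with the given L_Key
    covers it AND shares the QP's protection domain."""
    return any(
        r.pd == qp_pd and r.lkey == lkey and r.start <= addr < r.start + r.length
        for r in regions
    )
```

A queue pair in protection domain X can thus reach an address inside a region registered in X, but the same key is useless from a queue pair in protection domain Y, mirroring case 3.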
\subsection{Communication management\label{sec:communication_management}}
The \gls{cm} provides protocols to establish, maintain, and release channels. It is used for all service types which were introduced in \autoref{sec:iba}. In the following, a brief introduction on the establishment and termination of communication will be given. As aforementioned, the present work will ignore special cases for the reliable datagram service type, since it is not supported by the \acrshort{ofed} stack.
Since the communication manager is a general service, it makes use of \glspl{gmp} for communication (see ``Management datagrams'' in \autoref{sec:networking} and the composition of \glspl{mad} in \autoref{fig:MAD}). The \gls{cm} defines a set of message types, which are indicated in the \textit{AttributeID} field of the common \gls{mad} header. A short summary of communication management messages which are mandatory for \gls{iba} hosts that support \gls{rc}, \gls{uc}, and \gls{rd} can be found in \autoref{tab:required_cm_messages}. Conditionally required messages for \gls{iba} hosts that support \gls{ud} can be found in \autoref{tab:conditionally_required_cm_messages}. Every message type requires different additional information, which is carried in the \gls{mad} data field. The exact content of this data for all message types can be found in the \gls{iba} specification~\cite{infinibandvol1}.
As mentioned in \autoref{sec:qp}, the queue pair gets all necessary information in order to reach a remote node as arguments while transitioning \textit{initialized}$\,\to\,$\textit{ready to receive}.
\input{tables/required_cm_messages}
\input{tables/conditionally_required_cm_messages}
\paragraph{Communication establishment sequences} There are various sequences of messages to establish or terminate a connection. \Autoref{fig:communication_manager} introduces three commonly used sequences. In all cases, the communication is established between an active client (\textit{A}) and a passive server (\textit{B}). It is also possible to establish communication between two active clients. If two active clients send a \acrshort{req}, they will compare their \gls{guid} (or, if both clients share a \gls{guid}, their \gls{qpn}), and the client with the smaller \gls{guid} (or \gls{qpn}) will get assigned the passive role. A client can make its reply to a communication request conditional, e.g., rejecting the connection if it gets assigned the passive role.
\begin{figure}[ht!]
\vspace{-0.3cm}
\begin{subfigure}{0.31\textwidth}
\includegraphics[width=\linewidth, page=1]{images/communication_manager.pdf}
\caption{Communication establishment sequence for \gls{rc}, \gls{uc}, and \gls{rd}.}\label{fig:communication_manager_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.31\textwidth}
\includegraphics[width=\linewidth, page=2]{images/communication_manager.pdf}
\caption{Communication release sequence for \gls{rc}, \gls{uc}, and \gls{rd}.}\label{fig:communication_manager_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.31\textwidth}
\includegraphics[width=\linewidth, page=3]{images/communication_manager.pdf}
\caption{Service ID Request for \gls{ud}.\newline}\label{fig:communication_manager_c}
\end{subfigure}
\caption{Several Communication Management sequences. All depicted sequences take place between an active and a passive \acrshort{iba} host.}\label{fig:communication_manager}
\end{figure}
\paragraph{Communication establishment} \Autoref{fig:communication_manager_a} depicts the communication establishment sequence for connected service types and for reliable datagram. First, the active host \textit{A} sends a \gls{req}. If \textit{B} wants to accept the communication it replies with \gls{rep}. If it does not want to accept the communication request, it replies with \gls{rej}. If it is not able to reply within the time-out that is specified in the received \gls{req}, it answers with \gls{mra}.
As soon as \textit{A} has received the \gls{rep}, it sends a \gls{rtu} to indicate that transmission can start.
\paragraph{Communication release} \Autoref{fig:communication_manager_b} depicts the communication release sequence for \gls{rc}, \gls{uc}, and \gls{rd}\@. The active host takes the initiative and sends a \gls{dreq}. The passive node acknowledges this with a \gls{drep}. These messages travel out of band, so if there are still operations in progress, it cannot be predicted how they will be completed.
\paragraph{Service ID request} \Autoref{fig:communication_manager_c} illustrates how \textit{A} sends a \gls{sidrreq} in order to receive all necessary information from \textit{B} to communicate over unreliable datagram. This information is sent from \textit{B} to \textit{A} over a \gls{sidrrep}.
\section{OpenFabrics software libraries\label{sec:iblibs}}
Although the \gls{iba} specification~\cite{infinibandvol1} defines the InfiniBand Architecture and abstract characteristics of functions which should be included, it does not define a complete \gls{api}. Initially, the \gls{ibta} planned to leave the exact \gls{api} implementation open to the several vendors. However, in 2004, the non-profit OpenIB Alliance (since 2005: OpenFabrics Alliance) was founded and released the \gls{ofed} under the \gls{gpl} v2.0 or BSD license~\cite{allianceofed}. The \gls{ofed} stack includes, among other things, software drivers, core kernel code, and user-level interfaces (verbs) and is publicly available online.\footnote{\url{https://github.com/linux-rdma/rdma-core}}\footnote{\url{https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git}} Most InfiniBand vendors fetch this code, sometimes make small enhancements and modifications, and ship it with their hardware.
\begin{figure}[ht!]
\hspace{0.5cm}
\includegraphics{images/openfabrics_stack.pdf}
\vspace{-0.5cm}
\caption{A simplified overview of the \gls{ofed} stack.}\label{fig:openfabrics_stack}
\end{figure}
\Autoref{fig:openfabrics_stack} shows a simplified sketch of the \gls{ofed} stack. This illustration is based on a depiction of Mellanox's \gls{ofed} stack~\cite{mellanox2018linux}. In this illustration, the SCSI \gls{rdma} Protocol (SRP), all example applications, and all \acrshort{iwarp} related stack components are omitted. The present work will mainly concentrate on the interface for the user space: the OpenFabrics user verbs (in the remainder of the present work, simply referred to as \textit{verbs}) and the \gls{rdma} \gls{cm}.
For readers familiar with \autoref{sec:infiniband}, the names of most verbs are self-explanatory (e.g., \texttt{ibv\_create\_qp()}, \texttt{ibv\_alloc\_pd()}, \texttt{ibv\_modify\_qp()}, and \texttt{ibv\_poll\_cq()}). This section will highlight some functions which often reoccur in the implementations in \autoref{chap:implementation}---i.e., the structure of work requests and how to submit them in \autoref{sec:postingWRs}---as well as functions which are not, or only barely, defined in the \gls{iba}---i.e., event channels in \autoref{sec:eventchannels} and the \gls{rdma} communication manager in \autoref{sec:rdmacm}. A complete, alphabetically ordered list of all verbs with a brief description of each can be found in \autoref{a:openfabrics}.
\subsection{Submitting work requests to queues\label{sec:postingWRs}}
\paragraph{Scatter/gather elements} Submitting work requests is a crucial part of the datapath and enables processes to commission data transfers to the host channel adapter without kernel intervention. As presented in \autoref{sec:qp}, both send and receive work queue elements contain one or several memory location(s), which the \gls{hca} will use to read data from, or write data to. Work requests include a pointer to a list of at least one \gls{sge}. This is a simple structure that includes the memory address, the length, and, in order for the \gls{hca} to be able to actually access the memory location, the local key. The structure of a scatter/gather element is displayed in \autoref{lst:ibv_sge}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_sge}.,
label=lst:ibv_sge,
style=customc]{listings/ibv_sge.h}
\vspace{-0.2cm}
\end{figure}
\paragraph{Receive work requests} A receive work request, which is used to inform the \gls{hca} about the main memory location where received data should be written to, is a rather simple structure as well. The structure, which is shown in \autoref{lst:ibv_recv_wr}, includes a pointer to the first element of a scatter/gather list (\texttt{*sg\_list}) and an integer to define the number of elements in the list (\texttt{num\_sge}). Passing a list with several memory locations can be handy if data should be written to different locations, rather than to one big coherent memory block. The \texttt{*next} pointer can be used to link a list of receive work requests together. This is helpful if a process first prepares all work requests, and subsequently wants to call \texttt{ibv\_post\_recv()} just once, on the first work request in the list. The \gls{hca} will automatically retrieve all following \glspl{wr}. The unsigned integer \texttt{wr\_id} is optional and can be used to identify the resulting completion queue entry.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_recv\_wr}.,
label=lst:ibv_recv_wr,
style=customc]{listings/ibv_recv_wr.h}
\vspace{-0.2cm}
\end{figure}
\paragraph{Send work requests} A send work request, displayed in \autoref{lst:ibv_send_wr}, is a larger structure and reveals a lot about the various options (some) InfiniBand adapters offer. The first four elements are identical to those of the \texttt{ibv\_recv\_wr} C structure. They provide a way to match a \gls{cqe} with a \gls{wr}, offer the possibility to create a list of \glspl{wr}, and enable the user to specify a pointer to and the length of a list of scatter/gather elements.
The fifth element, \texttt{opcode}, defines the operation which is used to send the message. Which operations are allowed depends on the type of the queue pair the present work request will be sent to; \autoref{tab:transport_modes} shows all possible operations together with the service types they are allowed in. \texttt{send\_flags} can be set to a bitmap of the following flags:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \texttt{IBV\_SEND\_FENCE}: The \gls{wr} will not be processed until all previous \textit{\gls{rdma} read} and \textit{atomic} \glspl{wr} in the send queue have been completed.
\item \texttt{IBV\_SEND\_SIGNALED}: If a \gls{qp} is created with \texttt{sq\_sig\_all=1}, completion queue entries will be generated for every work request that has been submitted to the \gls{sq}. Otherwise, \glspl{cqe} will only be generated for \glspl{wr} with this flag explicitly set.
This only applies to the send queue. Signaling cannot be turned off for the receive queue.
\item \texttt{IBV\_SEND\_SOLICITED}: This flag must be set if the remote node is waiting for an event (\autoref{sec:eventchannels}), rather than actively polling the completion queue. This flag is valid for \textit{send} and \textit{\gls{rdma} write} operations and will wake up the remote node if it is waiting for a solicited message.
\item \texttt{IBV\_SEND\_INLINE}: If this flag is set, the data to which the scatter/gather element points is directly copied into the \gls{wqe} by the \gls{cpu}\@. That means that the \gls{hca} does not need to independently copy the data from the host's main memory to its own internal buffers. Consequently, this saves an additional main memory access operation and, since the \gls{hca}'s \gls{dma} engine will not access the main memory, the local key that is defined in the scatter/gather element will not be checked. Sending data inline is not defined in the original \gls{iba} and thus not all \gls{rdma} devices support it. Before sending a message inline, the maximum supported inline size has to be checked by querying the \gls{qp} attributes using \texttt{ibv\_query\_qp()}.
This flag is frequently used in the remainder of the present work because it offers a potential latency decrease and the buffers can immediately be released for re-use after the send \gls{wr} got submitted.
\end{itemize}
\input{tables/transport_modes}
The 32-bit \texttt{imm\_data} variable is used with operations that send data \textit{with immediate} (\autoref{tab:transport_modes}). The data will be sent in the data packet's \acrshort{imm} field (\autoref{tab:packet_abbreviations}). Besides sending \SI{32}{\bit} of data to the remote's completion queue---for example, as identifier---the immediate data field can also be used for notification of \textit{\gls{rdma} writes}. Usually, the remote host does not know that an \textit{\gls{rdma} write} is being written to its memory and thus also does not know when it has finished. Since \textit{\gls{rdma} write with immediate} consumes a receive \gls{wqe} and subsequently generates a \gls{cqe} on the receive side, this operation can be used as a way to synchronize and thus make the receiving side aware of the received data.
The fields \texttt{rdma}, \texttt{atomic}, and \texttt{ud} are part of a union and hence mutually exclusive. The first two structs are used together with the operations with the same name from \autoref{tab:transport_modes}. The content of the \texttt{rdma} C structure defines the remote address and the remote key, which first have to be acquired through a normal \textit{send} operation. The \texttt{atomic} C structure includes the remote address and key, but also compare and swap operands. The \texttt{ud} structure is used for unreliable datagram. As mentioned before, \glspl{qp} in \gls{ud} mode are not connected and the consumer has to explicitly define the \gls{ah} of the remote \gls{qp} in every \gls{wr}\@. The \gls{ah} is included in the \gls{wr} through the \texttt{*ah} pointer, and can, for example, be acquired with the \gls{rdma} communication manager which is presented in \autoref{sec:rdmacm}. The \texttt{remote\_qpn} and \texttt{remote\_qkey} variables are used for the queue pair number and queue pair key of the remote \gls{qp}, respectively.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_send\_wr}.,
label=lst:ibv_send_wr,
style=customc]{listings/ibv_send_wr.h}
\vspace{-0.2cm}
\end{figure}
\subsection{Event channels\label{sec:eventchannels}}
Usually, completion queues (\autoref{sec:qp}) are checked for new entries by actively polling them with \texttt{ibv\_poll\_cq()}; this is called \textit{busy polling}. In order for this to return a \gls{cqe} as soon as one appears in the completion queue, polling has to be done continuously. Although this is the fastest way to get to know if a new \gls{cqe} is available, it is very processor intensive: a \gls{cpu} core with a thread which continuously polls the completion queue will always be utilized \SI{100}{\percent}. If minimal \gls{cpu} utilization outweighs performance, the \gls{ofed} user verbs collection offers \glspl{cc}. Here, an instance of the \texttt{ibv\_comp\_channel} C structure is created with \texttt{ibv\_create\_comp\_channel()} and is, on creation of the completion queue, bound to that queue. After creation and every time after an event is generated, the completion queue has to be armed with \texttt{ibv\_req\_notify\_cq()} in order for it to notify the \gls{cc} about new \glspl{cqe}. To prevent races, events have to be acknowledged using \texttt{ibv\_ack\_cq\_event()}. Events do not have to be acknowledged before new events can be received, but all events have to be acknowledged before the completion queue is destroyed. Since this operation is relatively expensive, and since it is possible to acknowledge several events with one call to \texttt{ibv\_ack\_cq\_event()}, acknowledgments should be done outside of the datapath.
The completion channel is realized with the help of the Linux system call\linebreak \texttt{read()}~\cite{kerrisk2010linux}. In default mode, \texttt{read()} tries to read from a file descriptor \texttt{fd} and blocks the process until it can return. Hence, as long as \texttt{fd} is not ready, the operating system suspends the process, which enables it to schedule other processes to the \gls{cpu}\@. Because \texttt{read()} is used, the C structure of the channel, displayed in \autoref{lst:ibv_comp_channel}, is not much more than a mere file descriptor and a reference counter. The blocking function which is used to wait for a channel is \texttt{ibv\_get\_cq\_event()}; this function is a wrapper around \texttt{read()}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The composition of \texttt{struct ibv\_comp\_channel}.,
label=lst:ibv_comp_channel,
style=customc]{listings/ibv_comp_channel.h}
\vspace{-0.2cm}
\end{figure}
\Autoref{fig:poll_event_comparison} depicts a comparison between busy polling and polling after an event channel returns (\textit{event based polling}). \Autoref{fig:poll_based_polling} depicts busy polling, in which \texttt{ibv\_poll\_cq()} is placed in an endless loop and continuously polls the completion queue. In order to achieve low latencies---in other words in order to poll as often as possible---this takes place in a separate thread. If \texttt{ibv\_poll\_cq()} returns a value \texttt{ret > 0}, it was able to retrieve \texttt{ret} completion queue entries. These can now be processed, for example, to release the buffers they are pointing to.
Event based polling, depicted in \autoref{fig:event_based_polling}, is a little more complex. As described above, a completion channel is first created and bound to the completion queue during initialization. Then, the \gls{cq} must be informed with\linebreak\texttt{ibv\_req\_notify\_cq()} about the fact that it should notify the completion channel whenever a \gls{cqe} arrives. After initialization, the completion channel is read with \texttt{ibv\_get\_cq\_event()}. This happens again in a separate thread, this time because \texttt{ibv\_get\_cq\_event()} blocks the thread as long as no \gls{cqe} arrives in the completion queue. Whenever the function returns, it also returns a pointer to the original \gls{cq}, which in turn can be used to busy poll the queue for a limited amount of time. However, there are two important differences from regular busy polling: when the \gls{cq} is polled the first time, it is ensured that it will return at least one \gls{cqe}\@. Furthermore, after it has been polled the first time, the thread will continue to poll it, but as soon as \texttt{ibv\_poll\_cq()} returns 0, the process will re-arm the \gls{cq} and return to the blocking function. (Acknowledging with \texttt{ibv\_ack\_cq\_event()} is omitted from this example for the sake of simplicity; it has to be called at least once before the completion queue is destroyed.)
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/poll_based_polling.pdf}
\vspace{-0.7cm}
\caption{The working principle of busy polling.}\label{fig:poll_based_polling}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/event_based_polling.pdf}
\vspace{-0.7cm}
\caption{The working principle of event based polling.}\label{fig:event_based_polling}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/poll_event_comparison_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A comparison between busy polling and polling after an event channel returns.}\label{fig:poll_event_comparison}
\end{figure}
\subsection{RDMA communication manager library\label{sec:rdmacm}}
Because communication management can be quite cumbersome in the \gls{iba}, Annex A11 of the \gls{iba} specification~\cite{infinibandvol1} proposes the \gls{rdma} IP connection manager, which is implemented by the OpenFabrics Alliance. It offers a socket-like connection model and encodes the connection 5-tuple (i.e., protocol, source and destination IP addresses and ports) into the private data field of the \gls{cm} \gls{req} (\autoref{sec:communication_management}).
\paragraph{RDMA CM over IPoIB} The \gls{ofed} \texttt{librdmacm}\footnote{\url{https://github.com/linux-rdma/rdma-core/blob/master/librdmacm}} library makes use of \gls{ipoib} in its implementation of this communication manager. \gls{ipoib} uses an unreliable datagram queue pair to drive communication because this is the only mode which must be implemented by \glspl{hca} and because of its multicast support~\cite{ipoib}. As can be seen in \autoref{fig:openfabrics_stack}, the Linux \gls{ipoib} driver enables processes to access the InfiniBand \gls{hca} over the \acrshort{tcpip} stack. On the one hand, this negates InfiniBand's advantages, such as kernel bypass. On the other hand, it offers an easy-to-set-up interface to other InfiniBand nodes. All tools capable of working with the \acrshort{tcpip} stack are also able to work with the \acrshort{tcpip} stack on top of the \gls{ipoib} driver. Because of this, the \gls{rdma} communication manager is able to send \gls{arp} requests to other nodes on the InfiniBand network which support \gls{ipoib}. The \gls{arp} response will---assuming that a node with the requested IP address is present in the network---include a \SI{20}{\byte} \textit{MAC address}. This address consists of---listed from the \gls{msb} to the \gls{lsb}---1 reserved byte, a 3-byte \gls{qpn} field, and a 16-byte \gls{gid} field. It is important to note that some applications or operating systems may have problems with the length of \gls{ipoib}'s MAC addresses since an \acrshort{eui48}---which has a length of \SI{6}{\byte} instead of \SI{20}{\byte}---is mostly used in IEEE~802~\cite{eui64}.
Thus, after the \gls{ipoib} drivers are loaded and the interface is properly configured using tools like \texttt{ifconfig} or \texttt{ip}, the \gls{rdma} \gls{cm} is able to retrieve the queue pair number and global identifier of a remote queue pair with the help of a socket-like construct.
\paragraph{Communication identifier \& events} The abovementioned socket-like construct is realized through so-called \textit{communication identifiers} (\texttt{struct rdma\_cm\_id}). Unlike conventional sockets, these identifiers must be bound to a local \gls{hca} before they can be used. During creation of the identifier with \texttt{rdma\_create\_id()}, an event channel, conceptually similar to the channels presented in \autoref{sec:eventchannels}, can be bound to the identifier. If such a channel is present, all results of operations (e.g., resolve address, connect to remote \gls{qp}) are reported asynchronously; otherwise, the identifier will operate synchronously. In the latter case, calls to functions that usually cause an event on the channel will block until the operation completes. The former case makes use of a function similar to \texttt{ibv\_get\_cq\_event()}: \texttt{rdma\_get\_cm\_event()} also implements a blocking function that only returns when an event occurs on the channel. This function can be used in a separate thread to monitor events that occur on the identifier and to act on them. It is possible to switch between synchronous and asynchronous mode.
Queue pairs can be allocated to an \texttt{rdma\_cm\_id}. Because the identifier keeps track of the different communication events that occur, it will automatically transition the \gls{qp} through its different states; explicitly invoking \texttt{ibv\_modify\_qp()} is no longer necessary.
\section{Real-time optimizations in Linux\label{sec:optimizations}}
This section introduces optimizations that can be applied to systems running the Linux operating system. It expands upon techniques that were applied to the Linux environment on which all benchmarks and VILLASnode instances were executed, as well as upon memory optimizations in the code. Of course, the optimizations in this section are a mere subset of all possibilities. The first subsection (\ref{sec:mem_optimization}) elaborates on memory optimization, the second (\ref{sec:numa}) specifically on non-uniform memory access, the third (\ref{sec:cpu_isolation}) on \gls{cpu} isolation and affinity, the fourth (\ref{sec:irq_affinity}) on interrupt affinity, and finally, the last subsection (\ref{sec:tuned}) elaborates on the \texttt{tuned} daemon.
This section will not expand on the \texttt{PREEMPT\_RT} patch~\cite{rostedt2007internals} because it could not be used together with the current \gls{ofed} stack. Possible opportunities of this real-time optimization with regards to InfiniBand applications are further expanded upon in \autoref{sec:future_real_time}.
\subsection{Memory optimizations\label{sec:mem_optimization}}
Many factors determine how efficiently memory is used: they can be high-level---e.g., the different techniques that are supported by the \gls{os}---but also low-level---e.g., the order of certain memory accesses in the actual algorithm. Exploring all these different techniques is beyond the scope of the present work; rather, some techniques that are used in the benchmarks and in the implementation of the \textit{InfiniBand} node-type are discussed in this subsection. The interested reader is referred to Drepper's publication~\cite{drepper2007every}, which provides a comprehensive overview of methods that can be applied to optimize memory access in Linux.
\paragraph{Hugepages} Most modern operating systems---with Linux being no exception---support \textit{demand-paging}. In this method, every process has its own \textit{virtual memory} which appears to the process as a large contiguous block of memory. The \gls{os} maps the \textit{physical addresses} of the actual physical memory (or even of a disk) to \textit{virtual addresses}. This is done through a combination of software and the \gls{mmu} which is located in the \gls{cpu}.
Memory is divided into \textit{pages}; a page is the smallest block of memory that can be accessed in virtual memory. For most modern operating systems, the smallest page size is \SI{4}{\kibi\byte}; on a 64-bit architecture, these \SI{4}{\kibi\byte} can hold up to 512 words. If a process tries to access data at an address in the virtual memory which is not yet available, a \textit{page fault} is generated. This exception is detected by the \gls{mmu}, whereupon the complete page is mapped from the physical memory (or from a disk) into the virtual memory.
Page faults are quite expensive, so it is beneficial for performance to cause as few page faults as possible~\cite{drepper2007every}. One possible way to achieve this is to increase the size of the pages: Linux supports so-called \textit{hugepages}. Although there are several possible sizes for hugepages, on x86-64 architectures they are usually \SI{2}{\mebi\byte}~\cite{guide2018intelc3a}. Compared to the 512 words that can fit into a \SI{4}{\kibi\byte} page, a hugepage can fit \num{262144} words into one page, which is 512 times as much. Since more data can be accessed with fewer page faults, this increases performance; Drepper~\cite{drepper2007every} reports performance gains of up to \SI{57}{\percent} (for a working set of $\SI[parse-numbers=false]{2^{20}}{\byte}$).
Additionally, with hugepages, more memory can be mapped with a single entry in the \gls{tlb}. This buffer is part of the \gls{mmu} and caches the most recently used page table entries. If a page is present in the \gls{tlb} (\textit{\gls{tlb} hit}), resolution of a page in the physical memory is instantaneous. Otherwise (\textit{\gls{tlb} miss}) up to four memory accesses in x86-64 architectures are required \cite{gandhi2016range}. Since the \gls{tlb} size is limited, larger pages result in the instantaneous resolution of a larger range of addresses with the same size \gls{tlb}.
Using hugepages is not a cure-all; it has some disadvantages that have to be considered. As pages become bigger, it gets harder for the \gls{os} to find contiguous physical memory sectors of this size. This goes hand in hand with external fragmentation of the memory. Furthermore, the size of hugepages makes them more prone to internal fragmentation, which means that more memory is allocated than is actually needed.
\paragraph{Alignment} A memory address $a$ is \textit{n-byte aligned} when
\begin{equation}
a = C\cdot n=C\cdot 2^i, \qquad \quad \mathrm{with}~i\geq0,\ \; C\in\mathbb{Z}.
\label{eq:alignment}
\end{equation}
Equivalently, an n-byte aligned address is one whose $\log_2(n)$ \glspl{lsb} are \zero.
\begin{listing}[ht!]
\refstepcounter{lstlisting}
\noindent\begin{minipage}[b]{.34\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/memory_alignment_a.h}
\captionof{sublisting}{Struct with padding.}\label{lst:memory_alignment_a}
\end{minipage}%
\hfill
\begin{minipage}[b]{.58\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/memory_alignment_b.h}
\captionof{sublisting}{Packed struct without padding.}\label{lst:memory_alignment_b}
\end{minipage}
\addtocounter{lstlisting}{-1}
\captionof{lstlisting}{Two C structures with a 1-byte character, a 4-byte integer, and a 2-byte short.}
\label{lst:memory_alignment}
\end{listing}
\Autoref{fig:memory_alignment} shows a simple example for a 32-bit system with the three primitive C data types from \autoref{lst:memory_alignment}. In \autoref{fig:memory_alignment_a} the data is \textit{naturally aligned}: the compiler added padding between the data types to ensure alignment to the memory word boundaries. In the structure definition of \autoref{lst:memory_alignment}\hyperref[lst:memory_alignment]{b}, the compiler is compelled to omit additional padding: the data types are not aligned to word boundaries. Note that \autoref{eq:alignment} holds in \autoref{fig:memory_alignment_a}, but not in \autoref{fig:memory_alignment_b}. Furthermore, for \autoref{fig:memory_alignment_a}, additional 1-byte characters could be placed at \texttt{0x0001}, \texttt{0x0002}, \texttt{0x0003}, and \texttt{0x000A} in this example. Additional 2-byte shorts could be placed at \texttt{0x0002} and \texttt{0x000A}.
\begin{figure}[ht!]
\vspace{-0.2cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth,page=1]{images/memory_alignment.pdf}
\vspace{-0.7cm}
\caption{An aligned struct (\autoref{lst:memory_alignment}\hyperref[lst:memory_alignment]{a}).}\label{fig:memory_alignment_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/memory_alignment.pdf}
\vspace{-0.7cm}
\caption{An unaligned struct (\autoref{lst:memory_alignment}\hyperref[lst:memory_alignment]{b}).}\label{fig:memory_alignment_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/memory_alignment_legend.pdf}
\end{subfigure}
\vspace{-1.2cm}
\caption{An example of a 1-byte character, a 4-byte integer, and a 2-byte short from \autoref{lst:memory_alignment} in memory with a word size of \SI{32}{\bit}.}\label{fig:memory_alignment}
\end{figure}
Similar to pages, a system can only access one whole word at a time. In \autoref{fig:memory_alignment_a}, this translates to one memory access per data type. In \autoref{fig:memory_alignment_b}, however, this is no longer possible. To access the integer, the operating system first has to access the word at address \texttt{0x0000} and then the word at address \texttt{0x0004}. Subsequently, the value in the first word must be shifted by one byte position and the value in the second word by three byte positions. Finally, both words have to be merged. These additional operations cause additional delay when accessing the memory. Moreover, atomicity becomes harder to guarantee, since the \gls{os} needs to access two memory locations to access one data type.
Alignment is not only relevant for memory words. Not aligning allocated memory to cache lines significantly slows down memory access~\cite{drepper2007every}. Furthermore, due to the way the \gls{tlb} works, alignment can speed up resolution of addresses in the physical memory.
\paragraph{Pinning memory} The process of preventing the operating system from swapping out (parts of) the virtual address space of a process is called \textit{pinning memory}. It is invoked by calling \texttt{mlock()} to prevent parts of the address space from being swapped out, or \texttt{mlockall()} to prevent the complete address space from being swapped out.\footnote{\url{http://man7.org/linux/man-pages/man2/mlock.2.html}}
Explicitly pinning buffers that are allocated to use as source or sink for data by an \gls{hca} is not necessary: when registering a memory region (\autoref{sec:memory}), the registration process automatically pins the memory pages~\cite{mellanox2015RDMA}.
\subsection{Non-uniform memory access\label{sec:numa}}
If different memory locations in the address space show different access times, this is called \gls{numa}. A common example of a \gls{numa} system is a computer system with multiple \gls{cpu} sockets and thus also multiple system buses. Since a \gls{numa} node is defined as a region of memory with uniform access characteristics, each node in such a system consists of the memory that is closest to the respective \gls{cpu}. Accessing memory on a remote \gls{numa} node can add up to 50 percent to the latency of a memory access~\cite{lameter2013numa}.
\Autoref{fig:numa_nodes} depicts an example with two \gls{numa} nodes and the interconnect between them. It is beneficial for the performance of a process to only access memory that is closest to the processor executing it. Furthermore, regarding the InfiniBand applications presented later in the present work, it is beneficial to run processes that need to access a certain \gls{hca} on the same \gls{numa} node as that \gls{hca}. Since an \gls{hca} is connected to the system bus through the \gls{pcie} bus, accessing memory in the same \gls{numa} node is faster than accessing memory on a remote \gls{numa} node. Thus, in the case of \autoref{fig:numa_nodes}, if a process needs to access \gls{hca} 0, it should be scheduled on one or more cores of processor 0 and should be restricted to memory locations of memory 0.
\begin{figure}[ht]
\includegraphics{images/numa_nodes.pdf}
\vspace{-0.5cm}
\caption{Two \acrfull{numa} nodes with \acrshortpl{hca} on the respective \acrshort{pcie} buses.}\label{fig:numa_nodes}
\end{figure}
To set the memory policy of processes, tools like \texttt{numactl}\footnote{\url{http://man7.org/linux/man-pages/man8/numactl.8.html}}, which are based on the system call \texttt{set\_mempolicy()}\footnote{\url{http://man7.org/linux/man-pages/man2/set_mempolicy.2.html}}, can be used. These tools will not be further elaborated upon here since the next subsection will introduce a more general tool to constrain both \gls{cpu} cores and \gls{numa} nodes to processes.
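Purely as an illustration, a typical invocation of \texttt{numactl} looks as follows; the node number is an assumption based on \autoref{fig:numa_nodes}:
\begin{lstlisting}[style=customconfig]
$ numactl --hardware               # show the NUMA topology of the system

# Run <application> on the cores of node 0 and restrict its
# allocations to the memory of node 0
$ numactl --cpunodebind=0 --membind=0 <application> <args>
\end{lstlisting}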
\subsection{CPU isolation \& affinity\label{sec:cpu_isolation}}
\paragraph{Isolcpus} It is beneficial for the performance of a process if one or more \gls{cpu} cores (in the remainder of the present work often simply referred to as \textit{cores} or \textit{\glspl{cpu}}) are completely dedicated to its execution. Historically, the \texttt{isolcpus}\footnote{\url{https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt}} kernel parameter has been used to exclude processor cores from the general balancing and scheduler algorithms on symmetric multiprocessing architectures. With this exclusion, processes are only moved to excluded cores if their affinity is explicitly set to these cores with the system call \texttt{sched\_setaffinity()}~\cite{kerrisk2010linux}. The tool \texttt{taskset}\footnote{\url{http://man7.org/linux/man-pages/man1/taskset.1.html}}, which relies on the aforementioned system call, is often used to set the \gls{cpu} affinity of running processes or to launch new commands with a given affinity.
The major advantage of \texttt{isolcpus} is at the same time its biggest disadvantage: the exclusion of cores from the scheduling algorithms causes threads that are created by a process to always be executed on the same core as the process itself. Take the example of busy polling: if a thread that must busy poll a completion queue is created and executed on the same core as the primary thread, this has an adverse effect on the performance of the latter. Hence, it is desirable to isolate \gls{cpu} cores that are dedicated to certain explicitly defined processes, but to simultaneously enable efficient scheduling of the threads of these processes among the isolated cores.
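By way of example, affinities can be queried and set with \texttt{taskset} as sketched below; the core numbers and the \texttt{<pid>} placeholder are exemplary:
\begin{lstlisting}[style=customconfig]
$ taskset -cp <pid>                   # query the CPU affinity of <pid>
# taskset -cp 16,18 <pid>             # restrict <pid> to cores 16 and 18
# taskset -c 16 <application> <args>  # start <application> on core 16
\end{lstlisting}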
\paragraph{Cpusets} A possible solution to this problem is offered by \textit{cpusets}~\cite{derr2004cpusets} which uses the generic \textit{control group} (cgroup)~\cite{menage2004cgroups} subsystem. If this mechanism is used, requests by a task to include \glspl{cpu} in its \gls{cpu} affinity or requests to include memory nodes are filtered through the task's cpuset. That way, the scheduler will not schedule a task on a core that is not in its \texttt{cpuset.cpus} list and not use memory on \gls{numa} nodes which are not in the \texttt{cpuset.mems} list.
Cpusets are managed through the \textit{cgroup virtual file system} and each cpuset is represented by a directory in this file system. The root cpuset is located under \texttt{/sys/fs/cgroup/cpuset} and includes all memory nodes and \gls{cpu} cores. A new cpuset is generated by creating a directory within the root directory. Every newly created directory automatically includes files similar to those in the root directory. These files are used to write the cpuset's configuration (e.g., with \texttt{echo}\footnote{\url{http://man7.org/linux/man-pages/man1/echo.1.html}}) or to read the current configuration (e.g., with \texttt{cat}\footnote{\url{http://man7.org/linux/man-pages/man1/cat.1.html}}). The following settings are available for every cpuset~\cite{derr2004cpusets}:
\begin{itemize}
\setlength\itemsep{-0.1em}
\item \texttt{cpuset.cpus}: list of \glspl{cpu} in that cpuset;
\item \texttt{cpuset.mems}: list of memory nodes in that cpuset;
\item \texttt{cpuset.memory\_migrate}: if set, pages are moved to cpuset's nodes;
\item \texttt{cpuset.cpu\_exclusive}: if set, cpu placement is exclusive;
\item \texttt{cpuset.mem\_exclusive}: if set, memory placement is exclusive;
\item \texttt{cpuset.mem\_hardwall}: if set, memory allocation is hardwalled;
\item \texttt{cpuset.memory\_pressure}: measure of the paging pressure in the cpuset;
\item \texttt{cpuset.memory\_pressure\_enabled}\footnote{exclusive to root cpuset}: if set, memory pressure is computed;
\item \texttt{cpuset.memory\_spread\_page}: if set, page cache is spread evenly on nodes;
\item \texttt{cpuset.memory\_spread\_slab}: if set, slab cache is spread evenly on nodes;
\item \texttt{cpuset.sched\_load\_balance}: if set, load is balanced among CPUs;
\item \texttt{cpuset.sched\_relax\_domain\_level}: searching range when migrating tasks.
\end{itemize}
Once all desired cpusets are created and everything is set up by writing settings to the abovementioned files, tasks can be assigned by writing their \gls{pid} to \texttt{/sys/fs/cgroup/cpuset/<name\_cpuset>/tasks}.
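As a sketch, a cpuset analogous to the \textit{real-time-0} set used in this chapter could be set up manually through the virtual file system as follows; the \gls{cpu} and node numbers are taken from the running example:
\begin{lstlisting}[style=customconfig]
# mkdir /sys/fs/cgroup/cpuset/real-time-0
# echo 16,18,20,22 > /sys/fs/cgroup/cpuset/real-time-0/cpuset.cpus
# echo 0 > /sys/fs/cgroup/cpuset/real-time-0/cpuset.mems
# echo 1 > /sys/fs/cgroup/cpuset/real-time-0/cpuset.cpu_exclusive
# echo <pid> > /sys/fs/cgroup/cpuset/real-time-0/tasks
\end{lstlisting}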
\paragraph{Cpuset tool} Since the process of manually writing tasks to the \textit{tasks-file} can be quite cumbersome, there are several tools and mechanisms to manage which processes are bound to which cgroups.\footnote{Although the libcgroup package was used in the past, systemd is nowadays the preferred method for managing control groups.} A rudimentary tool that is used in the present work is called \textit{cpuset}.\footnote{\url{https://github.com/lpechacek/cpuset}} It was developed by Alex Tsariounov and is a Python wrapper around the file system operations to manage cpusets. The following examples on how to create cpusets, how to move threads between cpusets, and how to execute applications in a cpuset are all based on this tool. However, the exact same settings can also be achieved by writing values manually to the virtual file system.
\Autoref{lst:cset_create} shows how to create different subsets. (Here, and in the remainder of the present work, the octothorpe indicates that the commands must be executed by a superuser. A dollar sign indicates that the command can be executed by a normal user.) In this example, an arbitrary machine with 24 cores and two \gls{numa} nodes (\autoref{sec:numa}) is assumed. The first cpuset, \textit{system}, may use 16 of these cores exclusively and may use memory in both \gls{numa} nodes. This will become the default cpuset for non-time-critical applications. The second and third cpuset, called \textit{real-time-0} and \textit{real-time-1} in this example, may use four cores each. These are exclusively reserved for time-critical applications. In this example, it is assumed that the \glspl{cpu} 16, 18, 20, and 22 reside in \gls{numa} node 0 and the \glspl{cpu} 17, 19, 21, and 23 in \gls{numa} node 1; the real-time cpusets are thus constrained to their respective nodes.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Creating cpusets for system tasks and real-time tasks.,
label=lst:cset_create,
style=customconfig]{listings/cset_create.sh}
\vspace{-0.2cm}
\end{figure}
The exclusiveness of a \gls{cpu} to a cpuset only applies to the cpuset's siblings; tasks in the cpuset's parent may still use the \gls{cpu}. Therefore, \autoref{lst:cset_move} shows how to move tasks, threads, and movable kernel threads from the root cpuset to the newly created \textit{system} cpuset. Now, the execution of these tasks and of all their children takes place exclusively on \glspl{cpu} 0 to 15.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Moving all tasks\comma threads\comma and moveable kernel threads to \textit{system}.,
label=lst:cset_move,
style=customconfig]{listings/cset_move.sh}
\vspace{-0.2cm}
\end{figure}
This leaves the two real-time cpusets exclusively for high-priority applications. \Autoref{lst:cset_exec} shows how new applications with their arguments can be started within the real-time cpusets.
To ensure that the load is balanced among the \glspl{cpu} in a cpuset---a feature that is not supported by \texttt{isolcpus}---\texttt{cpuset.sched\_load\_balance} must be \one. It is not necessary to explicitly set this value since its default value is already \one.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Execute \texttt{<application>} with the arguments \texttt{<args>} in the real-time cpusets.,
label=lst:cset_exec,
style=customconfig]{listings/cset_exec.sh}
\vspace{-0.2cm}
\end{figure}
\paragraph{Non-movable kernel threads} Kernel threads are processes that perform background operations for the kernel. They do not have an address space, are created on system boot, and can only be created by other kernel threads \cite{love2010linux}. Although some of them may be moved from one \gls{cpu} to another, this is not generally the case: some kernel threads are pinned to a \gls{cpu} on creation. Although it is not possible to completely prevent kernel threads from getting pinned to cores that will be shielded, there is a workaround which might minimize this chance.
By setting the kernel parameter \texttt{maxcpus}\footnote{\url{https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt}} to a number smaller than the total number of \gls{cpu} cores in the system, some cores are not brought up during bootup. Hence, these processors are not used to schedule kernel threads. Later, when all movable kernel threads have been moved to a shielded cpuset, the remaining \glspl{cpu} can be activated with the command from \autoref{lst:activate_cpu}. Then, these \glspl{cpu} can be added to an exclusive cpuset. Although it is inevitable that some necessary threads are spawned on these cores once they are brought up, most non-movable kernel threads cannot move from a processor that was available during bootup to a processor that was activated after bootup.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Bring up a \gls{cpu} \texttt{<cpuX>} which was disabled during bootup.,
label=lst:activate_cpu,
style=customconfig]{listings/activate_cpu.sh}
\vspace{-0.2cm}
\end{figure}
\subsection{Interrupt affinity\label{sec:irq_affinity}}
In most computer systems, hardware \textit{interrupts} provide a mechanism for \gls{io} hardware to notify the \gls{cpu} when it has finished the work it was assigned. When an \gls{io} device wants to inform the \gls{cpu}, it asserts a signal on the bus line it has been assigned to. The signal is then detected by the \textit{interrupt controller}, which determines whether the targeted \gls{cpu} core is busy. If it is not, the interrupt is immediately forwarded to the \gls{cpu}, which in turn ceases its current activity to handle the interrupt. If the \gls{cpu} is busy, for example, because another interrupt with a higher priority is being processed, the controller ignores the interrupt for the moment and the device keeps asserting a signal on the line until the \gls{cpu} is no longer busy \cite{tanenbaum2014modern}.
Hence, if a \gls{cpu} is busy performing time-critical operations---e.g., busy polling (\autoref{fig:poll_based_polling})---too many interrupts are detrimental for the performance. Thus, it can be advantageous to re-route interrupts to \glspl{cpu} that do not perform time-critical applications.
\Autoref{lst:get_irq_affinity} shows how to obtain the \gls{irq} affinity of a certain interrupt request \texttt{<irqX>}. The value \texttt{smp\_affinity} is a \textit{bitmap} in which the bit positions that are set represent the allowed \glspl{cpu}~\cite{bowden2009proc}. E.g., if \texttt{smp\_affinity} for a certain \gls{irq} is \texttt{10}, \gls{cpu} 1 is allowed; if the affinity is \texttt{11}, \glspl{cpu} 1 and 0 are allowed.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Get the \gls{irq} affinity of interrupt \texttt{<irqX>}.,
label=lst:get_irq_affinity,
style=customconfig]{listings/get_irq_affinity.sh}
\vspace{-0.2cm}
\end{figure}
\Autoref{lst:set_irq_affinity} demonstrates how the \gls{irq} affinity of a certain interrupt request can be set. In the case of \autoref{lst:set_irq_affinity}, it is set to \gls{cpu} 0--15, which corresponds to the \textit{system} cpuset from the previous paragraph. \texttt{<irqX>} will no longer bother the \glspl{cpu} 16--23. To re-route all \glspl{irq}, a script (e.g., in Bash) that loops through \texttt{/proc/irq} can be used.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Set the \gls{irq} affinity of interrupt \texttt{<irqX>} to \gls{cpu} 0--15.,
label=lst:set_irq_affinity,
style=customconfig]{listings/set_irq_affinity.sh}
\vspace{-0.2cm}
\end{figure}
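The re-routing of all \glspl{irq} with a loop over \texttt{/proc/irq}, as mentioned above, could, for instance, be sketched in Bash as follows; the bitmask \texttt{ffff} corresponds to \glspl{cpu} 0--15:
\begin{lstlisting}[style=customconfig]
# Route every IRQ to CPUs 0-15 (bitmask ffff); IRQs that
# cannot be re-routed are silently skipped
for irq in /proc/irq/[0-9]*; do
    echo ffff > "${irq}/smp_affinity" 2>/dev/null || true
done
\end{lstlisting}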
\subsection{Tuned daemon\label{sec:tuned}}
Red Hat based systems support the \texttt{tuned} daemon\footnote{\url{https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/chap-red_hat_enterprise_linux-performance_tuning_guide-tuned}}, which uses \texttt{udev}~\cite{kroah2003udev} to monitor devices and, on the basis of its findings, adjusts system settings to increase performance according to a selected profile. The daemon consists of two types of plugins: monitoring and tuning plugins. The former can, at the moment of writing the present work, monitor the disk load, network load, and \gls{cpu} load. The tuning plugins currently supported are: cpu, eeepc\undershort{}she, net, sysctl, usb, vm, audio, disk, mounts, script, sysfs, and video.
Although it is possible to define custom profiles, \texttt{tuned} offers a wide range of predefined profiles, of which \textit{latency-performance} is eminently suitable for low-latency applications. This profile does, among others, disable power saving mechanisms, set the \gls{cpu} governor to \textit{performance}, and lock the \gls{cpu} to a low C-state. A complete overview of all settings in the \textit{latency-performance} profile can be found in \autoref{a:tuned_profile}.
Management of different tuning profiles can be done with the command line tool \texttt{tuned-adm}.\footnote{\url{https://linux.die.net/man/1/tuned-adm}}
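By way of example, profiles are managed with the \texttt{list}, \texttt{profile}, and \texttt{active} subcommands of \texttt{tuned-adm}:
\begin{lstlisting}[style=customconfig]
# tuned-adm list                         # list all available profiles
# tuned-adm profile latency-performance  # activate a profile
# tuned-adm active                       # show the currently active profile
\end{lstlisting}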

\chapter{Conclusion\label{chap:conclusion}}
The present work shows that the \acrfull{iba} enables the transmission of small messages at high rates with sub-microsecond latencies. Together with the presented performance optimizations, the \gls{iba} is eminently suitable as communication technology for applications with hard real-time requirements.
With only a few adaptations to the existing node-type interface and the buffer management of VILLASnode, it was possible to create an \textit{InfiniBand} node-type that makes full use of the \gls{iba}'s zero-copy capabilities and can initiate data transfers without needing system calls. The adaptations were necessary because complications emerged that were non-existent with prior node-types. Eventually, since the InfiniBand Architecture is rooted in the \acrfull{via}, these alterations provide VILLASnode with an interface that is---with minimal adaptations---compatible with other \acrshortpl{via} as well.
In the custom benchmark that was used, the InfiniBand node-type showed median latencies between approximately \SI{1.7}{\micro\second} (for high rates and small message sizes) and \SI{4.9}{\micro\second} (for low rates and large message sizes). Compared to the \textit{shmem} node-type, which can be seen as a zero-latency reference, these median latencies were only roughly 1.5--\SI{2.5}{\micro\second} higher. This is an excellent result since the latter solely allows communication between nodes on the same host system, whereas the former also allows communication between nodes on different host systems. For comparison: prior node-types that allow communication between different host systems and rely on Ethernet as communication technology showed median latencies that were one order of magnitude larger than those of the \textit{InfiniBand} node-type. Furthermore, with the new node-type, much higher transmission rates could be achieved and the predictability of the latency improved substantially.
It can thus be concluded that the \textit{InfiniBand} node-type is a valuable extension to the pool of existing VILLASnode node-types. Although the node-type is not suitable for inter-laboratory communication, it enables hard real-time scalability of simulation power within laboratories.

\chapter{Evaluation\label{chap:evaluation}}
This chapter discusses the results of the previously presented benchmarks. \Autoref{sec:evaluation_ca} starts with an evaluation of the custom one-way \gls{hca} benchmark from \autoref{sec:ca_benchmarks}. After these results have been analyzed, \autoref{sec:perftest} will compare them to the results of \texttt{ib\_send\_lat} of the \gls{ofed} Performance Test package. Subsequently, \autoref{sec:evaluation_villasnode} discusses the several VILLASnode node-types that were benchmarked.
\Autoref{tab:benchmark_testsystem} lists the hardware, the operating system, the \gls{ofed} stack version, and the VILLASnode version that were used for all benchmarks. Fedora was selected as \gls{os} because of its support for the \texttt{tuned} daemon (\autoref{sec:tuned}) and because of its easy-to-set-up support for \texttt{PREEMPT\_RT}-patched kernels (\autoref{sec:future_real_time}). At the time of writing the present work, the chosen Fedora and kernel versions were the latest combination that was seamlessly supported by this version of the Mellanox\textregistered{} variant of the \gls{ofed} stack.
\input{tables/benchmark_testsystem}
The system was optimized using the techniques from \autoref{sec:optimizations}. Unless stated otherwise, all analyses that are presented in this chapter have been run under these circumstances. \Autoref{fig:configuration_system} shows the distribution of \glspl{cpu} among cpusets (\autoref{sec:cpu_isolation}). The \glspl{cpu} in the two \textit{real-time-<X>} cpusets are limited to the memory locations in their \gls{numa} node (\autoref{sec:numa}). These memory locations are also the same as those the respective \glspl{hca} will read from or write to. Finally, the system is optimized by setting the \texttt{tuned} daemon to the \textit{latency-performance} profile (\autoref{sec:tuned}).
Thus, all time-critical processes that needed to use the \glspl{hca} \texttt{mlx5\_0} and \texttt{mlx5\_1} were run on the \glspl{cpu} 16, 18, 20, and 22 and 17, 19, 21, and 23, respectively.
\begin{figure}[ht]
\includegraphics{images/configuration_system.pdf}
\vspace{-0.5cm}
\caption{The configuration of the Dell PowerEdge T630 from \autoref{tab:benchmark_testsystem}, which was used in the present work's evaluations. \gls{numa} specific data is acquired with \texttt{numactl}.}\label{fig:configuration_system}
\end{figure}
\section{Custom one-way host channel adapter benchmark\label{sec:evaluation_ca}}
This section examines different possible configurations of communication over an InfiniBand network using the benchmark presented in \autoref{sec:ca_benchmarks}. It is intended to help make a well-considered choice regarding the configuration of the InfiniBand VILLASnode node-type and to get a ballpark estimate of the latency this communication technology will show in VILLASnode.
\subsection{Event based polling\label{sec:event_based_polling}}
The first analyses that were performed were meant to examine the characteristics of event based polling (\autoref{fig:event_based_polling}). Since event channels are designed to be \gls{cpu} efficient, in this case, the optimizations from \autoref{sec:cpu_isolation} (``CPU isolation \& affinity'') and \autoref{sec:irq_affinity} (``Interrupt affinity'') were not applied and \autoref{fig:configuration_system} is not relevant. In fact, these optimizations had an adverse effect and increased latency here. The \texttt{tuned} profile \textit{latency-performance} and the memory optimization techniques were applied nevertheless.
\Autoref{tab:oneway_settings_event} shows the settings that were used with the custom one-way benchmark. These settings were introduced in \autoref{sec:tests}. Gray columns in \autoref{tab:oneway_settings_event}, and in all following tables that list benchmark settings, indicate that the settings of these columns were varied during the different runs. Consequently, all settings in the white columns stayed constant whilst performing the different tests. The graphs that were generated from the resulting data are shown in \autoref{fig:oneway_event}.
\input{tables/oneway_settings_event}
In the first three subfigures of \autoref{fig:oneway_event}, $25\cdot8000$ messages of \SI{32}{\byte} were bursted for \gls{rc}, \gls{uc}, and \gls{ud}. This message size was chosen in most of the following tests because it is the minimum size of a message in the VILLASnode \textit{InfiniBand} node-type. Every sample that is sent from one VILLASnode \textit{InfiniBand} node to another contains at least one 8-byte value and always carries \SI{24}{\byte} of metadata.
The first thing that catches the eye is the relatively high median latency (\autoref{eq:latency}) of all service types: $\tilde{t}_{lat}^{RC}=\SI{3608}{\nano\second}$, $\tilde{t}_{lat}^{UC}=\SI{3598}{\nano\second}$, and $\tilde{t}_{lat}^{UD}=\SI{3389}{\nano\second}$. These latencies were caused by the event channels that were used for synchronization: with the above-mentioned settings, the benchmark waits until a \texttt{read()} system call returns before it tries to poll the completion queue. Therefore, in the meantime, other processes can be scheduled onto the \gls{cpu} and it will take a certain amount of time to wake the benchmark up again. So, event based polling results in a lower \gls{cpu} utilization compared to busy polling, but, in return, yields a higher latency.
\paragraph{Maxima} The maximum latencies that can be seen were mainly caused by initial transfers immediately after the process started or after a period of hibernation. This is sometimes referred to as the \textit{warm up effect}. Potential solutions for this problem are introduced in \autoref{sec:future_real_time}.
The custom one-way benchmark includes another potential cause for latency maxima. As mentioned in~\autoref{sec:timestamps}, the function that measures and saves the receive timestamps (\autoref{lst:cq_time}) lies in the time-critical path. The worst-case situation, in which the two memory regions had only been initialized by \texttt{mmap()} but were not yet touched and thus allocated, was examined. This caused maxima of more than \SI{700}{\micro\second}. When the pages were present in the virtual memory, the combined latency of both save operations was determined to be approximately \SI{40}{\nano\second}.
Thus, in order to make full use of the capabilities and low latencies of InfiniBand, it is important to carefully pick the operations that lie in the datapath.
\begin{figure}
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_0.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_1.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_c}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_2.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_d}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_3.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_e}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_4.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_event_f}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_event_hist/plot_5.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/oneway_event_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_event}. These were used to analyze latencies with event based polling.}\label{fig:oneway_event}
\end{figure}
\paragraph{Minima} The small peaks at the left side of the graphs, between approximately \SI{900}{\nano\second} and \SI{2900}{\nano\second}, were caused by the way this benchmark implements event based polling. \Autoref{fig:event_based_polling} already showed that after a completion channel notifies the process that a new \gls{cqe} is available, the \gls{cq} must be polled with \texttt{ibv\_poll\_cq()} to acquire \glspl{cqe}. After polling, this benchmark does not immediately return control to \texttt{ibv\_get\_cq\_event()}; rather, it tries to poll again to see if new messages arrived in the meantime. If this was the case, these messages did not have to wait until a \texttt{read()} system call returned before they got processed; for that reason, their latency was lower.
\paragraph{Sent confirmations} \Autoref{sec:qp} already discussed at which moment \glspl{cqe} at the send side are generated. In case of a reliable connection (\autoref{fig:oneway_event_a}), entries showed up in the completion queue when a message was delivered to a remote \gls{ca} and when that \gls{ca} acknowledged that it received the message. Naturally,
\begin{equation}
t_{lat}^{comp} = t_{comp}-t_{subm} > t_{recv}-t_{subm} = t_{lat}
\end{equation}
was almost certainly true for every message that was sent.
This was different for the unreliable service types (\gls{uc} and \gls{ud}, \autoref{fig:oneway_event_b} and \autoref{fig:oneway_event_c}), where the \gls{hca} is only responsible for sending a message. Hence, in these cases, the \gls{hca} generated a \gls{cqe} immediately after a message was sent. Thus, for more messages,
\begin{equation}
t_{lat}^{comp} < t_{lat}
\label{eq:tcomp_min_tsubm}
\end{equation}
was true. In \autoref{fig:oneway_event_b}, this cannot be identified yet, but the difference between the median values $\tilde{t}_{lat}^{comp}$ and $\tilde{t}_{lat}$ is getting smaller. For messages that were sent as unreliable datagrams, \autoref{eq:tcomp_min_tsubm} usually holds, and in \autoref{fig:oneway_event_c},
\begin{equation}
\tilde{t}_{lat}^{comp} < \tilde{t}_{lat}
\end{equation}
is even true.
\paragraph{Comparison of the service types} It can be seen that the median latencies of the unreliable service types were barely different from the median latency of the reliable connection. With \SI{3598}{\nano\second} and \SI{3521}{\nano\second}, the median latencies of \gls{uc} and \gls{ud} were just slightly lower than the \SI{3608}{\nano\second} of the \gls{rc} service type. As expected, this was caused by the absence of acknowledgment messages between the two channel adapters. However, the variability of the three service types differed: with regard to $t_{lat}$, \gls{ud} had the highest dispersion ($t_{lat} > \SI{10000}{\nano\second}$ in \SI{0.1665}{\percent} of the cases) and \gls{uc} the lowest ($t_{lat} > \SI{10000}{\nano\second}$ in \SI{0.0595}{\percent} of the cases). In the remainder of this section, \SI{10000}{\nano\second} and \SI{10}{\micro\second} will be used interchangeably.
\paragraph{Intermediate pauses} The last three subfigures of \autoref{fig:oneway_event} show the results of the same test, but with an intermediate pause of \SI{1000000000}{\nano\second} (\SI{1}{\second}) and with just $1\cdot8000$ messages per run. One can see that the latency almost doubled. The pause of \SI{1}{\second} was long enough for the \gls{os} to swap out the waiting process, and it took a considerable amount of time to re-activate the process after the \texttt{read()} system call returned. Furthermore, the peaks at the left side of the graphs completely disappeared because now there could never be a second entry in the \gls{cq} after the first entry was acquired.
\subsection{Busy polling\label{sec:busy_polling}}
Event based polling is suitable for semi-time-critical applications in which minimal \gls{cpu} utilization outweighs maximum performance and thus minimal latency. However, if minimal latency is the topmost priority, busy polling (\autoref{fig:poll_based_polling}) should be used.
To compare like with like, the settings in \autoref{tab:oneway_settings_busy} closely resemble those in \autoref{tab:oneway_settings_event}, but with a different polling mode. Since busy polling is a \gls{cpu} intensive task, all tests were performed in the optimized environment that was presented at the beginning of this chapter. The results of the tests are displayed in \autoref{fig:oneway_busy}.
\input{tables/oneway_settings_busy}
In the first three subfigures of \autoref{fig:oneway_busy}, again, $25\cdot8000$ messages of \SI{32}{\byte} were bursted for \gls{rc}, \gls{uc}, and \gls{ud}. It is immediately visible that the median latencies $\tilde{t}_{lat}^{RC}=\SI{1269}{\nano\second}$, $\tilde{t}_{lat}^{UC}=\SI{1251}{\nano\second}$, and $\tilde{t}_{lat}^{UD}=\SI{1273}{\nano\second}$ are approximately \SI{65}{\percent} lower than the same latencies for event based polling. This is in line with the findings of MacArthur and Russel~\cite{macarthur2012performance}, who reported a decrease of almost \SI{70}{\percent} in their work.
Since the completion queues on the send side were also busy polled, their latencies decreased as well. Now, \autoref{eq:tcomp_min_tsubm} holds for both unreliable service types. Note that, depending on the use case, it could be beneficial to busy poll the receive \gls{cq} but to rely on a completion channel that is bound to the send queue. In that way, fewer \gls{cpu} cores are fully utilized by busy polling, while low latencies between the sending and receiving node are achieved anyway. This approach would naturally result in:
\begin{equation}
t_{lat}^{comp} \gg t_{lat},
\end{equation}
and is suitable for applications that do not need to release the send buffers virtually instantaneously (\autoref{sec:requirements} \& \autoref{sec:proposal}).
\begin{figure}
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_0.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_1.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_c}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_2.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_d}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_3.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_e}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_4.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_busy_f}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_busy_hist/plot_5.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/oneway_busy_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_busy}. These were used to analyze latencies with busy polling.}\label{fig:oneway_busy}
\end{figure}
\paragraph{Maxima} The maximum latencies did not decrease in the same proportion as the median latencies, but still notably. With regard to $\max t_{lat}$, the results for the reliable service type decreased by approximately \SI{14}{\percent} and for the unreliable service types by approximately \SI{36}{\percent}. The main reason for the maxima was likely the same as for event-based polling: the warm-up effect caused peaks at the beginning of the transmission. This conjecture is strengthened by the tests that were performed with an intermediate pause of \SI{1}{\second}. For these runs, the maximum latencies were only slightly lower, which indicates that the maxima were not caused by congestion but rather by the scheduling of the polling process. After all, the tests that were performed with an intermediate pause of \SI{1}{\second} between transmissions are unlikely to have been subject to congestion.
\paragraph{Minima} Latency minima like those seen with event-based polling could not arise here: since this mode polls the completion queue continuously, there are no short periods during which a different polling behavior takes over and causes downward peaks.
\paragraph{Variability} The number of messages that took more than \SI{10}{\micro\second} to arrive at the receiving host was almost one order of magnitude lower for the \gls{rc} and \gls{ud} service types, and approximately 5 times lower for the \gls{uc} service type. This considerably reduced variability implies higher predictability: when sending messages in an environment that is based on busy polling, the maximum latency can be estimated with more certainty.
\paragraph{Intermediate pauses} This reveals another important difference between event-based polling and busy polling. Whereas the runs with event-based polling showed more than double the latency when intermediate pauses occurred between transfers, runs that relied on busy polling showed a much smaller difference. With busy polling, latencies of tests with intermediate pauses were about \SI{20}{\percent} higher than latencies of tests without any pauses. The same comparison for tests that relied on event-based polling yielded a difference of \SI{120}{\percent}.
Although the median latencies with intermediate pauses were substantially better with busy polling than when waiting for an event, they were still higher than anticipated. Since the process continuously polled the completion queue, and the operating system should thus not have suspended it, it was expected that $\tilde{t}_{lat}$ would be lower for scenarios with less traffic on the link. However, for these cases, $\tilde{t}_{lat}$ was slightly higher in \autoref{fig:oneway_busy}.
It was first suspected that \gls{aspm}, which is described in the \gls{pcie} Base Specifications~\cite{pcisig2010pciexpress}, caused this additional latency. This technique sets the \gls{pcie} link to a lower power state when the device it is connected to---which would in this case be the \gls{hca}---is not used. However, when the tests from \autoref{tab:oneway_settings_busy} were repeated with \gls{aspm} explicitly turned off, the results remained the same.
The second suspicion was related to power saving levels of the \gls{cpu}: the so-called \textit{C-states}. After ensuring that all power savings were turned off---i.e., C0 was the only allowed state---a maximum response latency of \SI{0}{\micro\second} was written to \texttt{/dev/cpu\_dma\_latency}. This virtual file forms an interface to the \gls{pmqos}\footnote{\url{https://www.kernel.org/doc/Documentation/power/pm_qos_interface.txt}}, and writing \zero{} to it expresses to the \gls{os} that the minimum achievable \gls{dma} latency is required. However, this did not improve $\tilde{t}_{lat}$ either.
Nevertheless, busy polling is still the more suitable technique for real-time applications. The next sections will explore other techniques to reduce the latency even further. For the methods that are likely to have a similar impact on the different service types, only the \gls{uc} service type was used for the sake of brevity. The unreliable connection was chosen because it showed the best results so far.
\subsection{Differences between the submit and send timestamp\label{sec:difference_timestamps}}
This subsection explores the difference between the moment a work request is submitted to the send queue and the moment the \gls{hca} actually sends the data. The feature of the benchmark that measures this difference is based on \autoref{lst:time_thread}: the sending node keeps updating the timestamp until the \gls{hca} copies the data to one of its virtual lanes.
\input{tables/oneway_settings_submit_send_comparison}
\Autoref{tab:oneway_settings_submit_send_comparison} shows the settings of the two tests that were performed. The results of both are plotted in \autoref{fig:oneway_submit_send_comparison}.
In the results of this test, and in the results of all following tests of this type, all data regarding $t_{lat}^{comp}$ is completely omitted. In the previous two subsections, it could be seen that settings that affect the receive \gls{cq} will affect the send \gls{cq} in a very similar manner. Hence, continuing to plot it would have been redundant. Rather, two similar data sets that must be compared---e.g., $(t_{recv}-t_{send})$ and $(t_{recv}-t_{subm})$---have been plotted in the same graph.
As it turns out, approximately
\begin{equation}
\left(1-\frac{\SI{726}{\nano\second}}{\SI{1253}{\nano\second}}\right)\cdot\SI{100}{\percent}\approx\SI{42}{\percent}
\end{equation}
of the time that was needed to send a message from one node to another was spent before the \gls{hca} actually copied the data. This timespan includes the notification of the \gls{hca}, but also the accessing and copying of the data from the host's main memory to the \gls{hca}'s internal buffers. Note that this test did not measure the time the data spent in the sending node's \gls{hca}, since it is not possible to update the timestamp once the data resides in the \gls{hca}'s buffers.
This relatively long timespan suggests that the memory access is a bottleneck. The next subsection will discuss a possible solution for small messages.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_submit_send_comparison_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/oneway_submit_send_comparison_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_submit_send_comparison}. These were used to analyze the difference between $t_{lat}$ and $t_{lat}^{send}$.}\label{fig:oneway_submit_send_comparison}
\end{figure}
\subsection{Inline messages\label{sec:oneway_inline}}
\Autoref{eq:delta_inline} in \autoref{sec:timestamps} already suggested that the difference between $\tilde{t}_{lat}$ and $\tilde{t}_{lat}^{send}$ could be an approximation of the latency decrease that can be achieved by using the \textit{inline} flag that some InfiniBand \glspl{hca}---among them the Mellanox\textregistered{} ConnectX\textregistered-4---support. By setting this flag, introduced in \autoref{sec:postingWRs}, relatively small messages ($\lesssim\SI{1}{\kibi\byte}$) are directly included in the work request. Accordingly, the \gls{hca}'s \gls{dma} does not need to access the host's main memory to acquire the data when it becomes aware of the submitted \gls{wr}. This suggests that posting small messages inline will eliminate a part of the overhead that was discussed in the last subsection.
\Autoref{tab:oneway_settings_inline} shows which settings were used with the one-way benchmark to analyze this difference. They are almost identical to the settings from \autoref{tab:oneway_settings_submit_send_comparison}, but instead of varying the timestamp that was taken ($t_{subm}$/$t_{send}$), the inline mode was varied. The results are depicted in \autoref{fig:oneway_inline}.
\input{tables/oneway_settings_inline}
At \SI{1264}{\nano\second}, the median latency for the regularly submitted case was almost identical to the latency in \autoref{fig:oneway_submit_send_comparison}, which makes the two results well suited for comparison. In \autoref{sec:difference_timestamps}, it was determined that about \SI{42}{\percent} of the time was lost before the \gls{hca} actually copied the data to its own buffers. The graph shows that messages that were submitted with the inline flag had a
\begin{equation}
\left(1-\frac{\SI{906}{\nano\second}}{\SI{1264}{\nano\second}}\right)\cdot\SI{100}{\percent}\approx\SI{28}{\percent}
\label{eq:inline_decrease}
\end{equation}
lower latency than regularly submitted messages.
Thus, apparently, the additional memory access the \gls{hca} had to perform when a \SI{32}{\byte} message was not directly included in the work request accounted for \SI{28}{\percent} of the latency. Hence, if possible, it is favorable for latency to include data directly in the work request. Furthermore, as mentioned in \autoref{sec:postingWRs}, another advantage is that the buffers can be released immediately after submitting the \gls{wr}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_inline_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_inline_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_inline}. These were used to analyze the difference between messages that are submitted regularly ($t_{subm}^{reg.}$) and that are submitted inline ($t_{subm}^{inl.}$).}\label{fig:oneway_inline}
\end{figure}
\subsection{RDMA write compared to the send operation}
\Autoref{tab:transport_modes} presented the different operations which are supported for the different service types. So far, all discussed tests relied on \textit{send with immediate}. The second suitable operation to transfer a message to a remote host which also supports an additional 32-bit header as identifier is \textit{\gls{rdma} write with immediate}. In the remainder of this chapter, for the sake of brevity, this operation is simply referred to as \textit{\gls{rdma} write}.
\Autoref{tab:oneway_settings_rdma} describes the settings that were used with the one-way benchmark to compare the \textit{send} operation with \textit{\gls{rdma} write}. Note that \gls{ud} is not included, since none of the \gls{rdma} operations support it. The results of the tests are depicted in \autoref{fig:oneway_rdma}.
\input{tables/oneway_settings_rdma}
In these results, the \textit{\gls{rdma} write} operation appears slower than the \textit{send} operation. However, a few remarks have to be made. First, the maximum latency and the variability of the \gls{rdma} transfers were lower. In case of the \gls{uc} service type, sending messages with \gls{rdma} resulted in $5\times$ fewer messages with a latency greater than \SI{10}{\micro\second}. (In some iterations of the tests, reductions of up to $25\times$ could be seen.) So, although the median latency was slightly higher for \gls{rdma}, the lower variability makes it a more predictable operation.
Secondly, this test relied on \textit{\gls{rdma} write with immediate}, not on the plain \textit{\gls{rdma} write} operation. The actual \textit{\gls{rdma} write} operation is probably a little faster, but without synchronization there is no way for a process on the receiving side to know when data is available. Since the only other way of synchronizing would be an additional \textit{send} operation, \textit{\gls{rdma} write with immediate} is the fastest way of sending data with \gls{rdma} while signaling to the receiving node that data is available.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_rdma_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_rdma_hist/plot_0.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:oneway_rdma_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_rdma_hist/plot_1.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_rdma_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_rdma}. These were used to analyze the difference between the \textit{\gls{rdma} write with immediate} and \textit{send with immediate} operation.}\label{fig:oneway_rdma}
\end{figure}
\subsection{Unsignaled messages compared to signaled messages}
\Autoref{sec:postingWRs} discussed that the \gls{ofed} verbs allow submitting \glspl{wr} to the \gls{sq} without generating a notification. Thereafter, \autoref{sec:villas_write} presented how this technique was implemented in the node-type's write-function. This was done to prevent file structures from unnecessarily rippling through the completion queue into the write-function, only to be discarded there. Since MacArthur and Russell~\cite{macarthur2012performance} observed only small performance increases but recommended sending inline messages unsignaled, the following tests were intended to review the performance increase in the present work's environment.
\Autoref{tab:oneway_settings_unsignaled_inline} shows the settings that were used with the one-way benchmark during these tests and \autoref{fig:oneway_unsignaled_inline} shows the resulting latencies. The median latency $\tilde{t}_{lat}^{sig.}$ of the messages that were sent inline with signaling approximately corresponds to the number from \autoref{fig:oneway_inline}. Thus, since \autoref{fig:oneway_unsignaled_inline} shows that the median latency of unsignaled messages is:
\begin{equation}
\tilde{t}_{lat}^{uns.} \approx 0.87\cdot \tilde{t}_{lat}^{sig.},
\end{equation}
it can be concluded that turning signaling off yields a noteworthy performance increase. By signaling only shortly before the send queue overflows, a decrease in latency of almost \SI{13}{\percent} can be seen.
\input{tables/oneway_settings_unsignaled_inline}
Because previous works~\cite{macarthur2012performance, liu2014performance} favored \textit{\gls{rdma} write} over \textit{send} operations, the same tests as in \autoref{tab:oneway_settings_unsignaled_inline} were repeated with \textit{\gls{rdma} write} as operation mode. Similar to the results in \autoref{fig:oneway_rdma}, the latency for messages that were sent with \gls{rdma} was worse than for those that were sent with the \textit{send} operation. However, the relative performance increase caused by disabling the signaling was, at a bit more than \SI{12}{\percent}, almost identical to the increase in \autoref{fig:oneway_unsignaled_inline}.
The settings and the results of these tests can be seen in \autorefap{a:oneway_unsignaled_rdma}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_unsignaled_inline_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_unsignaled_inline_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_unsignaled_inline} to analyze the difference in latency between messages that did and did not cause a \acrfull{cqe}. The \textit{send} operation mode was used in this test.}\label{fig:oneway_unsignaled_inline}
\end{figure}
Based on the results from the previous subsections, $\tilde{t}_{lat} = \SI{786}{\nano\second}$ seems to be the lowest achievable median latency for 32-byte messages. This confirms the implementation decisions of the VILLASnode node-type that was presented in \autoref{sec:villas_read} and~\ref{sec:villas_write}: in the communication between \textit{InfiniBand} node-types, the \textit{send} operation mode is used, messages below a configurable threshold are sent inline, and a \gls{cqe} for inline messages is only generated when a counter reaches a configurable threshold.
Although sub-microsecond latencies could easily be achieved in the used environment, there was still a considerable deviation from the latencies MacArthur and Russell~\cite{macarthur2012performance} observed, which had minima around \SI{300}{\nano\second}. A possible explanation for this could be the number of buffers used. The objective of this benchmark was to find the best fit for a VILLASnode node-type. Because a node-type needs a relatively large pool of buffers to be able to process many small samples at high frequencies, this benchmark also assumed a large pool of buffers. MacArthur and Russell, however, observed that latencies in their environment started to increase when more than 16 buffers were used.
\subsection{Variation of message size\label{sec:variation_of_message_size}}
All aforementioned tests assumed an idealized situation with 32-byte messages. Usually, the packets in a real-time co-simulation framework will be a few powers of two larger. \Autoref{tab:oneway_settings_message_size} shows the settings that were used with the one-way benchmark to explore the influence of message size on the latency.
The tests are grouped in three categories: \autoref{fig:oneway_message_size_a} exclusively shows the \gls{rc}, \autoref{fig:oneway_message_size_b} the \gls{uc}, and \autoref{fig:oneway_message_size_c} the \gls{ud} service type. Furthermore, an upward pointing triangle and a dark shade indicate the \textit{send} operation, and a downward pointing triangle and a light shade an \textit{\gls{rdma} write} operation. Black shades were used for messages that were sent normally and blue shades for messages that were sent inline.
Whenever possible, tests were performed with messages ranging from \SI{8}{\byte} to \SI{32}{\kibi\byte}. However, inline work requests and the \gls{ud} service type do not support messages that big; the adjusted ranges are listed in \autoref{tab:oneway_settings_message_size}.
\input{tables/oneway_settings_message_size}
\paragraph{Constant latency (\SI{8}{\byte}--\SI{256}{\byte})} As can be seen in \autoref{fig:oneway_message_size}, all $\tilde{t}_{lat}$ of messages that were smaller than \SI{256}{\byte} were virtually the same. The only difference is that, as expected from \autoref{eq:inline_decrease}, messages that were sent inline have a median latency that is approximately \SI{28}{\percent} lower than messages that were sent normally. All these $\tilde{t}_{lat}$ were around the values that could be seen for \SI{32}{\byte} messages in \autoref{fig:oneway_busy},~\ref{fig:oneway_inline},~and~\ref{fig:oneway_rdma}. This is similar to MacArthur and Russell's results~\cite{macarthur2012performance}. In their publication, they found that messages smaller than \SI{1024}{\byte} have a somewhat constant latency. In the present work's findings, this is only true for messages up to approximately \SI{256}{\byte}.
For all these sizes, the variance of the latencies is minimal. The error bars in \autoref{fig:oneway_message_size} indicate the boundaries of the upper and lower \SI{10}{\percent} of the values.
\paragraph{Increasing latency (\SI{256}{\byte}--\SI{32}{\kibi\byte})} When the message size exceeded \SI{256}{\byte}, $\tilde{t}_{lat}$ gradually increased and the variance grew for messages that were sent normally. At \SI{256}{\byte}, $\tilde{t}_{lat}$ for messages that were sent inline even exceeded the median latency of messages that were sent normally. Because not only the message size but also the burst size changed for the blue lines, the inline tests were repeated with a fixed burst size of 2730 messages per burst (\autorefap{a:oneway_message_size_inline}). Since the steep slope between \SI{128}{\byte} and \SI{256}{\byte} is still present for fixed burst sizes, it can be concluded that---although the \gls{hca} allows it---sending data inline is not always favorable.
\begin{figure}[ht!]
\begin{subfigure}{0.351\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_median/plot_0.pdf}
\caption{\gls{rc}}\label{fig:oneway_message_size_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_median/plot_1.pdf}
\caption{\gls{uc}}\label{fig:oneway_message_size_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_median/plot_2.pdf}
\caption{\gls{ud}}\label{fig:oneway_message_size_c}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\centering
\vspace{0.15cm}
\includegraphics{plots/oneway_message_size_median/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_message_size}. These were used to analyze the influence of message size on the latency. While a triangle indicates $\tilde{t}_{lat}$ for a certain message size, the error bars indicate the upper and lower \SI{10}{\percent} of $t_{lat}$ for that message size.}\label{fig:oneway_message_size}
\end{figure}
The increasing latency of inline messages around \SI{256}{\byte} is in line with the findings of MacArthur and Russell~\cite{macarthur2012performance}. In their work, they claim that this latency step was caused by their adapter's cache line size, which happened to be \SI{256}{\byte}. The \gls{hca} that was used in the present work, however, had a cache line size of a mere \SI{32}{\byte}. Thus, according to their findings, messages that were equal to or bigger than \SI{32}{\byte} should have had latencies which were substantially higher than those of messages smaller than \SI{32}{\byte}. However, this was not the case, as can be seen in \autoref{fig:oneway_message_size}. This leads to the conclusion that the increase is not solely caused by the cache line size.
\paragraph{Decreased variability (\SI{4096}{\byte})}The second aspect that catches the eye is located at \SI{4096}{\byte}, which also happened to be the set \gls{mtu} in these tests. For all service types---even for \gls{ud}, which only supports messages up to the \gls{mtu}---the variability of the latency decreased for messages bigger than or equal to \SI{4096}{\byte}. Thus, although the median latency continued to go up, the predictability of the latency also rose.
\paragraph{Further peculiarities} There was no meaningful difference between channel semantics and memory semantics with immediate data. Although the \textit{send} operation was always slightly better in terms of median latency, the operation that best suits the requirements of the application should be used.
To make sure that the increased median latency was not caused by congestion control (\autoref{sec:congestioncontrol}) all tests from \autoref{tab:oneway_settings_message_size} were repeated with an intermediate pause of \SI{5500}{\nano\second} between calls of \texttt{ibv\_post\_send()}. As \autorefap{a:oneway_message_size_wait} shows, this did not influence the median latency.
\section{OFED's round-trip host channel adapter benchmark\label{sec:perftest}}
This section analyzes some assumptions that were made in previous sections. In the first subsection, the results of the round-trip benchmark \texttt{ib\_lat\_send} will be compared to the results from \autoref{sec:variation_of_message_size}. Then, in the second and third subsection, the influence of the \gls{mtu} and the \gls{qp} type on latency will be examined.
\subsection{Correspondence between round-trip and one-way benchmark}
\Autoref{tab:correlation_benchmarks} shows the results for the first tests that were performed with \texttt{ib\_send\_lat}. The median latencies in \autoref{tab:correlation_benchmarks} approximately correspond to the latencies for the same test in \autoref{fig:oneway_message_size}. It stands out that the same leap in latency between \SI{128}{\byte} and \SI{256}{\byte} that could be seen in \autoref{fig:oneway_message_size} also occurred in these results. To rule out that this leap was solely caused by the fact that messages were not sent inline anymore at \SI{256}{\byte}, the test was also performed with the inline threshold set to a higher value. In this second test, the leap between \SI{128}{\byte} and \SI{256}{\byte} turned out to be even higher.
\input{tables/correlation_benchmarks}
\paragraph{Difference in maximum latencies} A substantial difference between the results of the round-trip benchmark and the custom one-way benchmark was the maxima. This was, in all likelihood, caused by the sample size used. The description in the \gls{ofed} Performance Tests' Git repository\footnote{\url{https://github.com/linux-rdma/perftest}} states that ``setting a very high number of iteration may have negative impact on the measured performance which are not related to the devices under test. If [\ldots] strictly necessary, it is recommended to use the -N flag (No Peak).'' Therefore, the round-trip benchmark was kept at its default of 1000 messages per test. Since the custom one-way benchmark was meant to mimic the behavior of InfiniBand hardware in VILLASnode---which would also burst large amounts of small messages at high frequencies---this hint was ignored in the custom benchmark. Every marker in \autoref{fig:oneway_message_size} includes between \SI{27300}{} and \SI{80000}{} time deltas.
Anticipating the analysis of the \textit{InfiniBand} node-type, the one-way benchmark gave a more realistic view of how the InfiniBand adapters would behave in VILLASnode. For example, all plots in \autorefap{a:timer_comparison} show low median latencies, but also latency peaks that are much higher than the median values.
Furthermore, the median latencies the round-trip benchmark yielded were marginally lower than those of the custom one-way benchmark. This difference was probably caused by the aforementioned effect as well.
\subsection{Variation of the MTU}
Crupnicoff, Das, and Zahavi~\cite{crupnicoff2005deploying} report that the selected \gls{mtu} does not affect the latency. Since the \gls{mtu} can affect the latency in other technologies---such as Ethernet---this claim was examined. With \texttt{ib\_send\_lat}, it is fairly easy to change the \gls{mtu}. All results of this test are displayed in \autoref{tab:mtu_performance}. Since the \gls{uc} service type is not officially supported by the \gls{rdma} \gls{cm} \gls{qp}, only results for the \gls{rc} and \gls{ud} service type are shown.
The table shows that no extraordinary peaks occurred. The only latency that stands out is marked in red. However, since the difference is not substantial, and since this is the only occurrence of such a peak, it can be assumed that the \gls{mtu} indeed does not affect latency.
\input{tables/mtu_performance}
\subsection{RDMA CM queue pairs compared to regular queue pairs}
In all implementations presented in the present work, it was assumed that the performance of a regular \gls{qp} and a \gls{qp} that is managed by the \gls{rdma} \gls{cm} is almost identical. This assumption was evaluated as well.
\Autoref{tab:qp_performance} shows that the median latency for smaller messages was slightly smaller for regular \glspl{qp}. For larger messages, this difference in latency diminished. This negligible difference, however, does not outweigh the convenience that comes with the \gls{rdma} communication manager. To gain a latency decrease of less than \SI{7}{\percent} (\autoref{tab:qp_performance}'s worst case), a lot of complexity would have to be added to the source code in order to efficiently manage the \glspl{qp}.
\input{tables/qp_performance}
\section{VILLASnode node-type benchmark\label{sec:evaluation_villasnode}}
Again, all runs of the benchmark in this section were performed in the optimized environment as introduced in \autoref{fig:configuration_system} on the host system from \autoref{tab:benchmark_testsystem}.\footnotemark{}
\footnotetext{A small change to the environment had to be made: all tests that are presented in the following were performed with a customized version of the \textit{latency-performance} \texttt{tuned} profile. The reason for this is discussed in the paragraph ``Optimized environment'' below.}
\paragraph{Timer of the signal node} To find the timer that was best suited for the needs of the analyses that are discussed in this section, separate tests were performed and their results are presented below. Since the ability to generate samples at high rates was a requirement for most of the analyses in the remainder of this section, a fixed, high rate of \SI{100}{\kilo\hertz} was set for the tests to analyze the timers. Four tests were prepared: two with a VILLASnode instance with a timer object that relies on a file descriptor for notifications (\texttt{timerfd}) and two with a timer that relies on the \gls{tsc}. For the former, as can be seen in \autoref{tab:timer_comparison}, more steps were missed at high rates. In the optimized environment, the file descriptor based implementation missed about \SI{0.68}{\percent} of the signals, whereas the \gls{tsc} based implementation only missed \SI{0.50}{\percent} of the steps. Since the implementation with the least missed steps is preferred---after all, when steps are missed, the actual rate that is sent to the node-type under test is lower than the set rate---the \gls{tsc} was chosen as timer for the following tests.
\Autorefap{a:timer_comparison} shows the histograms for these four tests, including the missed steps and an indicator for whether samples were not transmitted by the nodes that were tested. When comparing the median latencies of the four cases, it becomes apparent that the \texttt{timerfd} timer affected the measured $\tilde{t}_{lat}$ more than the \gls{tsc}. Since this means that the benchmark's results with the \gls{tsc} better reflect the actual performance of the node-type under test, this is another advantage of the \gls{tsc}. Furthermore, in case of the unoptimized environment, the latency's variability with the \texttt{timerfd} timer was considerably worse than in the three other cases.
In later tests, it was also discovered that the \gls{tsc} did not perform well with relatively low rates ($\leq\SI{2500}{\hertz}$). As it turned out, for the minimum rate of \SI{100}{\hertz}, approximately \SI{8}{\percent} of the steps were missed. However, using the \texttt{timerfd} timer for these low rates would noticeably skew the results, and a deviation of \SI{8}{\hertz} is unlikely to influence the latencies of the analyzed nodes. Therefore, the \gls{tsc} was also used for these low rates.
\input{tables/timer_comparison.tex}
\paragraph{Optimized environment} The tests that were done to analyze the behavior of the timers also revealed information about the effect of the optimized and unoptimized environment on latencies. As it turned out, using the \textit{latency-performance} \texttt{tuned} profile was detrimental to the latency and the overall performance. This effect occurred regardless of the environment used. For the cases in \autoref{fig:timer_comparison}, median latencies increased by about \SI{700}{\nano\second}, variability and maxima rose, and the \texttt{timerfd} timer missed up to \SI{15}{\percent} of the steps. Further research showed that the \texttt{force\_latency} flag (line 6, \autoref{lst:tuned_latency_performance}) caused this problem. Therefore, in all tests that are presented in the following, a customized version of the \textit{latency-performance} \texttt{tuned} profile without this flag was used.
\Autoref{fig:timer_comparison} also reveals that running VILLASnode in the optimized environment was beneficial for latency. However, the difference between both environments was not large. It is likely that the reason for this is that the test system from \autoref{tab:benchmark_testsystem} was fully dedicated to the tests that were run on it. In a real-life scenario, the system would be busy with other processes, and the difference in latency for processes in the shielded cpuset and in the normal pool of \glspl{cpu} would presumably be larger.
\paragraph{Configuration of the InfiniBand nodes} It was found that the number of buffers hardly influenced the performance of the \textit{InfiniBand} node-type. Even MacArthur and Russell's ``ideal'' number of buffers---although impracticable for the purposes of this real-time framework---was investigated~\cite{macarthur2012performance}. Apart from the fact that such a small number of buffers made it impossible to send samples larger than a few bytes at high frequencies, barely any difference in latency could be seen compared to cases with (far) more buffers.
A substantial difference, however, could be seen when the size of the receive queue and the number of mandatory work requests in the receive queue were varied. The lowest latency arose when the size and the number of \glspl{wr} were chosen to be just big enough to support the highest combination of generation rate and message size. For example, in case of \autoref{fig:timer_comparison_d}, latency extrema around \SI{262}{\micro\second} could be seen with this ideal setup. For arbitrarily large numbers (e.g., a queue depth of 8192 and 8064 mandatory \glspl{wr} in the queue), these extrema peaked at more than \SI{3000}{\micro\second}. This effect was caused by the way the \textit{InfiniBand} node-type's read-function is implemented and probably occurred shortly after the initialization of the receiving \textit{InfiniBand} node. As presented in \autoref{fig:read_implementation}, the read-function first fills the receive queue before it starts polling the queue and processing the data. When the threshold is large, it takes a certain amount of time before data can be processed. However, it is important to keep in mind that a larger receive queue yields a higher stability because overflows will be less likely.
For the send queue, the opposite is true: in order to signal as little as possible, the send queue can be as large as the \gls{hca} allows it to be. The signaling threshold, which describes the maximum number of unsignaled \glspl{wr} before a signaled \gls{wr} must be sent, is determined according to \autoref{eq:signaling} in \autoref{sec:related_work}. If one sample is sent per call of the write-function, which is true for all following tests,
\begin{equation}
S = \frac{D_{SQ}}{2}
\end{equation}
follows from \autoref{eq:signaling}. Before running any of the following tests, it was verified that this threshold indeed yielded the lowest latency. It turned out that any higher or lower threshold yielded, although marginally, worse latencies.
The settings for the sending and the receiving \textit{InfiniBand} node can be found in \autoref{a:infiniband_config}. These settings were used in all tests that are presented in this section.
\subsection{Comparison between InfiniBand service types}
This subsection presents the tests that were performed to examine how the different InfiniBand service types perform within VILLASnode. It solely focuses on the reliable connection and on unreliable datagrams since these two service types are officially supported by the \gls{rdma} \gls{cm}, and thus require no modification of the \gls{rdma} \gls{cm} library.
\paragraph{Varying the sample generation rate} In the first set of tests, the rate with which samples were generated was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}. All tests were performed until \SI{250000}{} samples were transmitted. Each sample that was sent contained 8 random 64-bit floating-point numbers. For the reliable connection, this added up to
\begin{equation}
8\cdot\SI{8}{\byte} + \SI{24}{\byte} = \SI{88}{\byte}
\end{equation}
per message, taking the 24-byte metadata into account. For unreliable datagrams, this number was
\begin{equation}
\SI{88}{\byte} + \SI{40}{\byte} = \SI{128}{\byte}
\end{equation}
because the 40-byte \gls{grh} of the sending node was attached to every message. Since the messages were relatively small, they were all sent inline.
\Autoref{fig:varying_rate} shows the results the VILLASnode node-type benchmark yielded with the aforementioned settings. Both service types showed an almost identical behavior, regardless of which rate was set: for both types, $\tilde{t}_{lat}$ decreased when the rate was increased. This is in line with prior observations in \autoref{sec:busy_polling}, where latency increased when pauses between the transmission of messages were increased.
Characteristic of InfiniBand is the (almost) non-existent latency difference between messages on reliable connections and unreliable datagrams. Because, as discussed in \autoref{sec:via}, reliability is handled in the \gls{hca} rather than in the operating system, it causes less overhead.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_IB/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_IB/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yields for the \textit{InfiniBand} node-type with a fixed message size of \SI{88}{\byte} for \gls{rc} and \SI{128}{\byte} for \gls{ud}. The sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and for every rate, \SI{250000}{} samples were sent.}\label{fig:varying_rate}
\end{figure}
All tests for the \textit{InfiniBand} node-type were only performed for signal generation rates up to \SI{100}{\kilo\hertz}. At higher frequencies, the \textit{signal} node started to miss more and more steps. According to the latencies from \autoref{sec:evaluation_ca}, the sample rate needs to be far higher than \SI{100}{\kilo\hertz} before the InfiniBand hardware becomes the bottleneck. Assuming a message resides about \SI{1000}{\nano\second} in the InfiniBand stack and network, rates up to:
\begin{equation}
\frac{1}{\SI{1000}{\nano\second}} = \SI{1}{\mega\hertz}
\end{equation}
are theoretically possible with the numbers measured in the previous section.\footnotemark{} However, two problems arise:
\footnotetext{This is only based on the measured time that a message congests the InfiniBand stack and network; it is assumed that \glspl{wr} can be submitted to the \glspl{qp} with this rate.}
\begin{itemize}
\setlength\itemsep{-0.1em}
\item The refresh rate of the buffers in the receive queue is not indefinitely high. As described in \autoref{sec:villas_implementation}, for its completion queue to be cleared and its receive queue to be refilled, an \textit{InfiniBand} node depends on the rate with which the read-function is invoked. When the \gls{qp} is chosen to be big enough, a node should be able to absorb short peaks in the message rate (e.g., \SI{1}{\mega\hertz}) flawlessly. However, if the rate stays high for an extended period of time, the buffers will overflow in the current setup.
The theoretically achievable rate is discussed further in \autoref{sec:zero_reference_comparison}.
\item \Autoref{sec:optimizations_datapath} described optimizations that were applied to the \textit{file} node-type. Even though these optimizations considerably increased the maximum signal generation rate, rates well above \SI{100}{\kilo\hertz} were still not achievable. Consequently, to increase this upper limit, the \textit{file} node-type should be optimized further, so that the share it takes in the total datapath decreases.
\end{itemize}
\paragraph{Varying the sample size} In the second set of tests, the generation rate was fixed to \SI{25}{\kilo\hertz}. The message size was varied between 1 and 64 values per sample. This resulted in messages between \SI{32}{\byte} and \SI{536}{\byte} for \gls{rc} and \SI{74}{\byte} and \SI{576}{\byte} for \gls{ud}. Based on the results from \autoref{tab:oneway_settings_message_size} and \autoref{fig:oneway_message_size}, messages smaller than or equal to \SI{188}{\byte} were sent inline.\footnotemark
\footnotetext{Inline sizes that are powers of two are not supported by the Mellanox \gls{hca} used in the present work. The \gls{hca} automatically converts it to the closest value that is larger than the set value. In this case, \SI{188}{\byte} is the closest value larger than \SI{128}{\byte}.}
The first observation to be made is the increasing median latency when messages become bigger than approximately \SI{128}{\byte}. This is in line with the findings from \autoref{sec:variation_of_message_size}. Secondly, the variability of the reliable connection was consistently lower than the variability of unreliable datagrams. This was not only true for high rates, but also for lower rates. Finally, it can be observed that the \gls{rc} service type had a lower median latency than \gls{ud}. This is remarkable; a possible reason is that the receiving node's \gls{ah} must be added to every work request when the \gls{ud} service type is used. Furthermore, the \gls{grh} is added to every message that is sent with the \gls{ud} service type.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_sample_size_IB/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_sample_size_IB/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yields for the \textit{InfiniBand} node-type at a fixed sample generation rate of \SI{25}{\kilo\hertz} and a message size that was varied between \SI{32}{\byte} and \SI{536}{\byte} for \gls{rc} and \SI{74}{\byte} and \SI{576}{\byte} for \gls{ud}. For every message size, \SI{250000}{} samples were sent.}\label{fig:varying_sample_size}
\end{figure}
\paragraph{Varying both the sample size and generation rate} \Autoref{fig:rate_size_3d_RC} aims to give a complete view of the influence of the various possible generation rate and message size combinations by combining the previously presented tests. Since the reliable connection shows---although only slightly---the lowest median latencies, this figure only depicts the measurements for \gls{rc}. In this test, the generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and the number of values in a sample between 1 and 64. All tests were performed until \SI{250000}{} samples were transmitted.
\Autoref{fig:rate_size_3d_RC} shows that, in accordance with \autoref{fig:varying_sample_size}, the median latency increased with the message size. Additionally, as can be seen along the rate-axis, a higher message generation rate corresponded to a lower median latency. This could also be seen in \autoref{fig:varying_rate}.
When the \textit{signal} node missed more than \SI{10}{\percent} of the steps for a particular sample rate/sample size combination, this is indicated with a red colored percentage in \autoref{fig:rate_size_3d_RC}. From these numbers, it becomes evident that the \textit{file} node-type was not able to process large amounts of data. From tests that missed a substantial number of samples, a threshold $T$ can be approximated as:
\begin{equation}
T = \left(1 - \frac{P_{missed}}{\SI{100}{\percent}}\right) \cdot S_{sample} \cdot f_{signal} \qquad\qquad \mathrm{[B]\cdot[Hz]=[B/s]},
\end{equation}
where $P_{missed}$ is the percentage of missed samples, $S_{sample}$ is the sample size, and $f_{signal}$ the set signal generation rate. In case of the VILLASnode node-type benchmark, this value was approximately \SI[per-mode=symbol]{20}{\mebi\byte\per\second}. This is, nevertheless, only a rough estimation; the signal generation rate probably has a higher impact on the threshold than the sample size.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_IB/median_3d_graph_UD.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.2cm}
\includegraphics{plots/nodetype_3d_IB/3d_RC_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the message size and generation rate on the median latency between two \textit{InfiniBand} nodes that communicate over an \acrfull{rc}.}\label{fig:rate_size_3d_RC}
\end{figure}
The fact that up to \SI{8}{\percent} of the steps were missed at low rates with the \gls{tsc} was already mentioned at the beginning of this section. Since these rates are non-critical for the node-types that were analyzed, it is improbable that a difference of \SI{8}{\hertz} in case of a set rate of \SI{100}{\hertz}, and \SI{75}{\hertz} in case of \SI{2500}{\hertz}, will noticeably affect the median latency. Using an alternative timer, however, would have considerably skewed the latencies in that range.
\Autorefap{a:rate_size_3d_UC_UD} shows the same graphs for \gls{uc} and \gls{ud}, respectively. Both modes show a very similar behavior to the \gls{rc} service type. As observed before, \gls{ud} shows slightly higher median latencies than \gls{rc}. \gls{uc}, on the other hand, shows slightly lower median latencies. This backs the suspicion that was raised earlier on why \gls{ud} was slightly slower than \gls{rc}. Regarding latency, \gls{uc} avoids the major disadvantages of both other types: unlike \gls{rc}, it does not need to guarantee delivery of a message, and unlike \gls{ud}, it requires neither an \gls{ah} with every \gls{wr} nor a \SI{40}{\byte} \gls{grh} in every message.
Thus, the smallest median latencies among the service types that are officially supported by the \gls{rdma} \gls{cm} were observed for the reliable connection. When varying both the message size and generation rate, the minimum latency of about \SI{1.7}{\micro\second} was observed for high rates and small message sizes. The maximum latency of approximately \SI{4.9}{\micro\second} was observed for low rates and large message sizes.
\subsection{Comparison to the zero-latency reference\label{sec:zero_reference_comparison}}
The first comparison to be done is between the \textit{InfiniBand} node-type and the \textit{shmem} node-type. The latter uses the \acrshort{posix} shared memory \gls{api} to enable communication between nodes over shared memory regions~\cite{kerrisk2010linux}. Because the latency between two \textit{shmem} nodes will approximately be the time it takes to access memory, its $\tilde{t}_{lat}$ can be approximated by the time $\tilde{t}_{villas}$. $\tilde{t}_{villas}$ is the amount of time that is spent by the super-node, apart from the nodes that are being tested. It thus corresponds to the time that is spent in all blocks of \autoref{fig:villas_benchmark}, minus the time that is spent in the nodes that are being tested.
In the tests that were performed, the sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}, every sample contained 8 64-bit floating-point numbers, and for every rate, \SI{250000}{} samples were sent. The results of these tests can be seen in \autoref{fig:shmem_infiniband_comparison}. Compared to previous graphs, this graph additionally indicates the number of steps the \textit{signal} node missed for every generation rate.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_IB_shmem/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_IB_shmem/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yields for the \textit{shmem} and \textit{InfiniBand} node-type with 8 64-bit floating-point numbers per sample. The sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and for every rate, \SI{250000}{} samples were sent.}\label{fig:shmem_infiniband_comparison}
\end{figure}
\paragraph{Difference in latency} The difference between the latencies of these node-types can be seen as the additional latency that communication over InfiniBand adds. The time penalty that the implementations of the read- and write-functions add can be approximated as:
\begin{equation}
t_{\operatorname{r/w-function}}^{IB} \approx \tilde{t}_{lat}^{IB}-\tilde{t}_{lat}^{shmem}-\tilde{t}_{lat}^{HCA},
\end{equation}
with $\tilde{t}_{lat}^{IB}$ the median latency that is measured when transmitting data between two InfiniBand VILLASnode nodes, $\tilde{t}_{lat}^{shmem}$ the median latency of communication between two \textit{shmem} nodes, and $\tilde{t}_{lat}^{HCA}$ the latency that was seen for inline communication in \autoref{sec:evaluation_ca}.
With $\tilde{t}_{lat}^{IB} \approx \SI{2}{\micro\second}$, $\tilde{t}_{lat}^{shmem} \approx \SI{0.3}{\micro\second}$, and $\tilde{t}_{lat}^{HCA} \approx \SI{0.8}{\micro\second}$, this number adds up to approximately \SI{0.9}{\micro\second}. Since, as could be seen in \autoref{sec:busy_polling}, up to \SI{0.3}{\micro\second} latency was added when the send rate decreased, the values for the highest frequency from \autoref{fig:shmem_infiniband_comparison} were used. In that way, the added time should mainly be caused by the implementations from \autoref{sec:villas_read} and \autoref{sec:villas_write}.
\paragraph{Missed steps} The graph shows that, in most cases, the \textit{signal} node only missed slightly more steps when testing the \textit{InfiniBand} node than when testing the \textit{shmem} node. This indicates that the \textit{InfiniBand} node-type did not exert much back pressure and that its write-function returned fast enough and therefore did not influence the signal generation at these rates. Since median latencies around \SI{2500}{\nano\second} were achieved, transmission rates up to
\begin{equation}
\frac{1}{\SI{2500}{\nano\second}}\approx\SI{400}{\kilo\hertz}
\end{equation}
should be possible. This number is probably more pessimistic than the reality since it does not take into account that the latency is not entirely caused by the sending node.
Similar behavior could be seen for other sample sizes and sample generation rates. \Autorefap{a:shmem_3d} shows the results the benchmark yielded when the sample generation rate and the message size were varied for the \textit{shmem} node-type. Regarding missed steps, this graph shows similarities to \autoref{fig:rate_size_3d_RC} in this chapter and \autoref{fig:rate_size_3d_UC} and \autoref{fig:rate_size_3d_UD} in \autorefap{a:rate_size_3d_UC_UD}. Since the common denominator of these tests is the \textit{file} node-type, these results again indicate that the component that caused the most complications in the VILLASnode node-type benchmark's datapath was the \textit{file} node-type.
Thus, since the \textit{file} node-type is currently the bottleneck in the benchmark from \autoref{sec:villas_benchmark}, this node-type should be optimized in order to bring down the number of steps the benchmark misses.
\paragraph{Decline in latency} Analogous to previous observations, the median latency of the \textit{InfiniBand} node-type increased for lower frequencies. Remarkably, the median latency of the \textit{shmem} node-type also increased---although only slightly---for lower frequencies. Even though this increase is not unambiguously visible in \autoref{fig:shmem_infiniband_comparison}, it is more evident in \autoref{fig:shmem_3d} in \autoref{a:results_benchmarks}.
In a previous subsection the suspicion was raised that techniques such as \gls{aspm} caused this effect. But, since the same effect also occurred with node-types that are independent from the \gls{pcie} bus, the cause of this problem cannot solely lie within \gls{io} optimization techniques. Hence, the (scheduler of the) \gls{os} is probably also partially responsible for the increasing latency at lower rates.
\subsection{Comparison to other node-types}
The objective of the present work that was raised in \autoref{sec:hard_real_time_communication_between_servers} was to implement hard real-time communication between different host systems that run VILLASnode. There, it was shown that none of the server-server node-types that were available at the time of writing were able to realize this (\autoref{tab:villasnode_nodes}).
This subsection examines whether the addition of the \textit{InfiniBand} node-type to the pool of available VILLASnode node-types has an added value. It does so by comparing the results of two commonly used node-types for server-server communication---\textit{zeromq} and \textit{nanomsg}---with the \textit{InfiniBand} node-type and the \textit{shmem} node-type.
In the tests that were performed, the sample size was fixed to 8 values. The rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz} and every test was conducted until \SI{250000}{} messages were transmitted.
\paragraph{Loopback and physical link} First, the tests were performed in loopback mode, in which the source and target node of the \textit{zeromq} and \textit{nanomsg} node-type were both bound to \texttt{127.0.0.1}. However, to make a fair comparison to the \textit{InfiniBand} node-type tests, which were performed on an actual physical link, these tests had to be performed on a physical link as well.
To exclude that using different hardware with inferior or superior specifications would skew the results, the back-to-back connected InfiniBand \glspl{hca} were also used to perform the tests with the Ethernet based node-types. This was done using the \acrfull{ipoib} driver (\autoref{sec:rdmacm}), which enables processes to send data over the InfiniBand network using the TCP/IP stack (\autoref{fig:openfabrics_stack}).
In order to compel processes to actually use the physical link although both network devices were part of the same system, the Linux network namespace was used. With namespaces\footnote{\url{http://man7.org/linux/man-pages/man7/namespaces.7.html}}, it is possible to wrap system resources in an abstraction, so that they are only visible to processes in that namespace. In case of the network namespace, processes in such a namespace make use of a copy of the network stack. It can be seen as a separate subsystem, with its own routes, firewall rules, and network device(s). The network namespace was managed with \texttt{ip-netns}\footnote{\url{http://man7.org/linux/man-pages/man8/ip-netns.8.html}}.
\paragraph{Results} \Autoref{fig:nanomsg_zeromq_comparison} shows the results of these runs. For rates below \SI{25}{\kilo\hertz}, the latencies of the loopback tests were almost identical to the latencies of the tests on the physical link. Above \SI{25}{\kilo\hertz}, the latencies of the latter started to increase. The \textit{zeromq} node in particular showed a dramatic latency increase, and the performance of both node-types started to become unsuited for real-time simulations.
The percentage of missed steps for \SI{100}{\hertz} and \SI{2500}{\hertz} was exactly the same for the \textit{nanomsg} and \textit{zeromq} node-type as for the \textit{InfiniBand} and \textit{shmem} node-type. This again indicates that this effect was caused by the \gls{tsc}. It is, however, unlikely that the relatively high median latencies around these rates were caused by the \gls{tsc}. After all, in all previously presented tests in which the \gls{tsc} was used for these rates, such a large difference was not seen.
Although a considerable number of samples were never transmitted, especially at high rates, no samples were dropped after the first sequence number appeared in the output file. The percentages of missed steps of the \textit{nanomsg} and \textit{zeromq} node-type are displayed in \autorefap{a:missed_steps_nanomsg_zeromq}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_zeromq_nanomsg/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_zeromq_nanomsg/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yielded for the \textit{zeromq} and \textit{nanomsg} node-types. Both node-types were once tested in loopback mode and once over an actual physical link. Every sample contained 8 64-bit floating-point numbers and the sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}. For every rate, \SI{250000}{} samples were sent.}\label{fig:nanomsg_zeromq_comparison}
\end{figure}
\Autoref{fig:node_type_comparison} compares the results of the \textit{nanomsg} and \textit{zeromq} node-type on the physical link with the results of the \textit{InfiniBand} and \textit{shmem} node-type. It is apparent from this graph that the \textit{InfiniBand} node-type had a latency that was one order of magnitude smaller than the soft real-time node-types. Furthermore, the variability of the latency of the samples that were sent over InfiniBand was lower than the variability of the latency of the same samples over Ethernet. Finally, both the \textit{nanomsg} and \textit{zeromq} node unmistakably started to show performance losses when exceeding a sample generation rate of \SI{25}{\kilo\hertz}.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_varying_rate_zeromq_nanomsg_shmem_IB/median_graph.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_varying_rate_zeromq_nanomsg_shmem_IB/median_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results the benchmark yielded for the server-server node-types \textit{zeromq}, \textit{nanomsg}, and \textit{InfiniBand} and for the internal node-type \textit{shmem}. Every sample contained 8 64-bit floating-point numbers and the sample generation rate was varied between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}. For every rate, \SI{250000}{} samples were sent.}\label{fig:node_type_comparison}
\end{figure}

% chapters/future.tex
\chapter{Future Work\label{chap:future}}
\section{Real-time optimizations\label{sec:future_real_time}}
In his master's thesis~\cite{vogel2016development}, Vogel wrote that ``careful optimizations and tuning [of the Linux \gls{os}] are indispensable. The most important change is a \texttt{PREEMPT\_RT}-patched kernel.''
Real-time Linux was first presented by Barabanov and Yodaiken~\cite{barabanov1996real}, and the \texttt{PREEMPT\_RT} patch is currently maintained by Ingo Molnar and Thomas Gleixner. Its main purpose is not to increase the throughput of a Linux system or to decrease its latency, but rather to make it more predictable. It does so by:
\begin{itemize}
\setlength\itemsep{0.2em}
\item making parts of the kernel, which were originally not preemptible, preemptible;
\item adding priority inheritance to the kernel;
\item running interrupts as threads;
\item replacing timers, which leads to high-resolution, user-space-accessible timers.
\end{itemize}
The internals of the \gls{rt} patch are described by Rostedt and Hart~\cite{rostedt2007internals} and can also be found on the Real-Time Linux Wiki.\footnote{\url{https://rt.wiki.kernel.org}}
The Mellanox-modified \gls{ofed} stack that was used with the Mellanox \glspl{hca} (\autoref{tab:benchmark_testsystem}) did not support \texttt{PREEMPT\_RT}-patched Linux kernels. Therefore, none of the benchmarks that were evaluated in \autoref{chap:evaluation} could be run on a real-time operating system. Consequently, the predictability of the benchmark was not always ideal. Examples are:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \Autoref{fig:oneway_unsignaled_inline}: $\max t_{lat} = \SI{17.4}{\micro\second}$ and \SI{0.0125}{\percent} of $t_{lat} > \SI{10}{\micro\second}$ with a median latency of only \SI{786}{\nano\second};
\item \Autoref{fig:timer_comparison_d}: $\max t_{lat}=\SI{262.0}{\micro\second}$ and \SI{0.02}{\percent} of $t_{lat} > \SI{50}{\micro\second}$ with a median latency of only \SI{2.1}{\micro\second}.
\end{itemize}
Although there is a chance that the median latencies will become a little higher with a \texttt{PREEMPT\_RT}-patched kernel, these (sometimes excessive) latency spikes should diminish and the variability should decrease. Furthermore, the increasing latencies for lower transmission rates should diminish with an \gls{rt}-patched kernel.
It would certainly be interesting for future research to examine the behavior of InfiniBand hardware in real-time optimized operating systems. Although InfiniBand is already an attractive communication solution for real-time applications, this could make it even more attractive.
\section{Optimization \& profiling\label{sec:future_profiling}}
\paragraph{Benchmark optimizations}
During the course of the present work, it turned out that the bottleneck of the benchmark from \autoref{fig:villas_benchmark} is the \textit{file} node-type. Although several optimizations, e.g., suppressing as many system calls as possible, were applied to this node-type, it remained the bottleneck for high frequencies.
Reducing the effect the \textit{file} node-type has on the benchmark would yield less distorted, more realistic indications of the latencies that can be achieved. More important, however, is that this would facilitate a method to examine the limitations of low-latency node-types such as \textit{InfiniBand} and \textit{shmem}.
Furthermore, it would be beneficial to evaluate whether the \gls{tsc} can be optimized in a way that it works with low rates as well.
\paragraph{InfiniBand node-type optimizations}
Currently, the read- and write-functions of the \textit{InfiniBand} node-type add a latency penalty of roughly \SI{0.9}{\micro\second} to the transmission latency of a message. Since this is the lion's share of the total latency, it would be interesting to analyze how much time is spent in the various functions and where the hot spots are. Profiling tools like \textit{gprof}~\cite{susan1983gprof} can be used for this kind of analysis.
Moreover, all settings were optimized for maximum rates. However, the optimal settings for lower rates probably differ from the optimal settings for high rates.
\section{RDMA over Converged Ethernet support\label{sec:roce}}
In their publication~\cite{macarthur2012performance}, MacArthur and Russell observed that \gls{roce}, which allows \gls{rdma} over conventional Ethernet networks, was only slightly outperformed by InfiniBand for small messages. Although \gls{roce}'s performance would be marginally worse than InfiniBand's, and although it does not support \gls{qos} to the same extent as InfiniBand, it would be a great addition to VILLASnode for cases in which existing infrastructure must be used.
With the alterations that have been made to VILLASnode in order to support InfiniBand, support for \gls{roce} would not require too many changes to the existing source code.
\chapter{Implementation\label{chap:implementation}}
The first section of this chapter (\ref{sec:ca_benchmarks}) describes the implementation of the benchmark which was used to measure latencies between InfiniBand host channel adapters. Then, \autoref{sec:villas_implementation} describes how the \textit{InfiniBand} node-type for VILLASnode was implemented. Subsequently, \autoref{sec:villas_benchmark} describes the characteristics and implementation of the benchmark that was used to analyze VILLASnode node-types. Thereafter, \autoref{sec:uc_support} describes how \gls{uc} support was added to the \gls{rdma} \gls{cm} library. Finally, \autoref{sec:processing_data} briefly describes what tools and techniques were used to process and analyze the acquired data.
If not stated otherwise, all software that is discussed in this chapter is written in the C programming language~\cite{kernighan1978c}.
\section{Host channel adapter benchmark\label{sec:ca_benchmarks}}
The developed host channel adapter benchmark was inspired by the measurements performed by MacArthur and Russell~\cite{macarthur2012performance}, which were already presented in \autoref{sec:related_work}. Although the present work likewise analyzes the influence of variations in operation modes, settings, and message sizes on latencies, it does not focus on their influence on throughput.
The objective of this benchmark is to measure---as accurately as possible---how long data resides in the actual InfiniBand Architecture when it is sent from one host channel adapter to another host channel adapter. So, if latency is defined as
\begin{equation}
t_{lat} = t_{recv} - t_{subm},
\label{eq:latency}
\end{equation}
the time data actually spends in the \gls{iba} can be approximated by setting $t_{subm}$ to the moment at which the send \gls{wr} is submitted, and $t_{recv}$ to the moment at which the receive node becomes aware of the \gls{cqe} in the completion queue that is bound to the receive queue.
\Autoref{sec:timestamps} first introduces how and where in the source code the timestamps $t_{subm}$ and $t_{recv}$ are measured. Then, \autoref{sec:tests} describes what tests the benchmark is capable of running.
\subsection{Definition of measurement points\label{sec:timestamps}}
Many benchmarks in fact measure the round-trip latency and divide it by two in order to approximate the one-way time between two host channel adapters. This is necessary if the \glspl{hca} are not part of the same host system. The latency of messages between InfiniBand \glspl{hca} is usually under \SI{5}{\micro\second}; there are even reports of one-way times as small as \SI{300}{\nano\second}~\cite{macarthur2012performance}. Hence, if both \glspl{hca} are part of different systems, even small deviations between the endnodes' system clocks could cause significant skews in $t_{lat}$ and make the results useless. This problem is nonexistent if both timestamps $t_{subm}$ and $t_{recv}$ are generated by the same system clock.
A possible disadvantage of using the round-trip delay to approximate the one-way delay is the additional (software) overhead. Let's assume that a message is sent from node \textit{A} to node \textit{B}, and back to node \textit{A}. Then, an additional time penalty can be introduced by the software on node \textit{B} that is necessary to submit a work request in order to return the received message.
Furthermore, it is possible that latency benchmarks---e.g., \texttt{ib\_send\_lat} and \linebreak\texttt{ib\_write\_lat} in the \gls{ofed} Performance Tests\footnote{\url{https://github.com/linux-rdma/perftest}}---yield distorted, possibly idealized, results. Although these are well suited for hardware and software tuning, the results can deviate from the actual latencies that can be seen when implementing an application with the \gls{ofed} verbs.
The present work therefore implements a custom benchmark that assumes two \glspl{hca} in the same host system. It thereby prevents the skew that is caused by deviations between different endnodes' system clocks, as well as the additional software overhead of round-trip measurements. Furthermore, it ensures that the measured latencies correspond to the latencies that can be seen in actual applications.
\paragraph{Generation of timestamps} In this benchmark, \texttt{clock\_gettime()}~\cite{kerrisk2010linux} is used to generate timestamps. Its parameters are a variable of type \texttt{clockid\_t} and a reference to an instance of \texttt{struct timespec} (\autoref{lst:timespec}) to which the function will write the current time.
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=The composition of \texttt{struct timespec}.,
label=lst:timespec,
style=customc]{listings/timespec.c}
\vspace{-0.2cm}
\end{figure}
The former parameter, \texttt{clockid\_t}, is particularly interesting. Usually, it is set to \texttt{CLOCK\_REALTIME}, in which case \texttt{clock\_gettime()} returns the system's best guess of the current time. This clock can change during operation because it is adjusted by the \gls{ntp}. Therefore, its timestamps are not suitable for the calculation of time differences with nanosecond resolution. However, if \texttt{CLOCK\_MONOTONIC} is requested, \texttt{clock\_gettime()} returns a strictly monotonically increasing timestamp starting at an unspecified point in the past. Since monotonicity is guaranteed between timestamps for this \texttt{clockid\_t}, it is best suited to calculate $t_{lat}$ from \autoref{eq:latency}.
\paragraph{Location of timestamps in code} This benchmark takes timestamps on three different locations in the code:
\begin{itemize}
\setlength\itemsep{0.2em}
\item $t_{subm}$ is acquired right before an already prepared work request is submitted to the send queue with \texttt{ibv\_post\_send()}. The timestamp will be the message's payload. For that reason, it is important that the address to which the scatter/gather element points remains valid until the message is actually sent and that the timestamp is not overwritten in a subsequent iteration. The pseudocode for this case is displayed in \autoref{lst:send_time}.
\item $t_{recv}$ is measured on the receiving node. It is acquired right after \texttt{ibv\_poll\_cq()} on the completion queue that is bound to the receive queue returns with a positive value. The pseudocode for this case is displayed in \autoref{lst:cq_time}.
The function that is displayed in \autoref{lst:cq_time} lies in the datapath, and the moment at which the timestamp and the identifier of the message are saved for later evaluation is time-critical. For one, this is optimized by using \SI{2}{\mebi\byte} hugepages instead of conventional \SI{4}{\kibi\byte} pages. For example, when 8000 messages are received, 8000 8-byte timestamps (\SI{64}{\kibi\byte}) and 8000 4-byte identifiers (\SI{32}{\kibi\byte}) must be saved. These \SI{96}{\kibi\byte} fit into one single hugepage, whereas they would require 24 conventional pages and thus incur up to 24 potential page faults. For the sake of code readability, the timestamps and message identifiers are spread across two hugepages.
Furthermore, it is made sure that the pages are immediately touched after initialization with \texttt{mmap()} to prevent page faults from happening in the datapath. After allocating the memory, the pages are locked with \texttt{mlockall()}. More information on memory optimization can be found in \autoref{sec:mem_optimization}.
\item $t_{comp}$ is measured in the same fashion as $t_{recv}$, but on the sending node. \texttt{ibv\_poll\_cq()} polls the completion queue that is bound to the send queue. It gives an indication of the time that passes before the sending node gets a confirmation that the message has been sent. Similar to \autoref{eq:latency}, the latency before a confirmation of transmission is available can be defined as:
\begin{equation}
t_{lat}^{comp} = t_{comp} - t_{subm}.
\label{eq:latency_completion}
\end{equation}
This timespan is relevant because buffers in the main memory cannot be reused as long as there is no confirmation that the \gls{hca} copied the data from the host's main memory to its internal buffers. (This is not the case for data that is sent inline, see \autoref{sec:postingWRs}.)
\end{itemize}
\begin{figure}[ht!]
\vspace{0.2cm}
\lstinputlisting[caption=Pseudocode which records the moment a messages is submitted to the \acrfull{sq}.,
label=lst:send_time,
style=customc]{listings/send_time.c}
\vspace{-0.2cm}
\end{figure}
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Pseudocode which records the moment a \acrfull{cqe} becomes available in the \acrfull{cq}.,
label=lst:cq_time,
style=customc]{listings/cq_time.c}
\vspace{-0.2cm}
\end{figure}
There is one special case which has not been discussed yet. $t_{subm}$ is set to the time right before the \gls{wr} is submitted to the send queue. Since it will take a certain amount of time before the \gls{hca} will copy the data (i.e., the timestamp) from the host's main memory to its internal buffers, it is possible to continue to alter the value after the work request has been posted. This benchmark offers a function to measure $t_{send}$, which approximates the moment the \gls{hca} copies the data to its internal buffer. The delta
\begin{equation}
\Delta t_{inline} \approx \tilde{t}_{lat}^{send} - \tilde{t}_{lat},
\label{eq:delta_inline}
\end{equation}
approximates the amount of time which will be saved by sending the data inline. In \autoref{eq:delta_inline}, $\tilde{t}_{lat}^{send}$ is the median latency measured with \textit{send}-timestamps, and $\tilde{t}_{lat}$ is the median latency measured with \textit{submit}-timestamps. The pseudocode of \autoref{lst:send_time} must be replaced with the pseudocode of \autoref{lst:time_thread} to transmit $t_{send}$ instead of $t_{subm}$.
\begin{figure}[ht!]
\vspace{1cm}
\lstinputlisting[caption=Pseudocode which continues to update an instance of the \texttt{timespec} C structure in a separate thread\comma whilst a pointer to this instance has already been submitted to the \acrfull{sq}.,
label=lst:time_thread,
style=customc]{listings/time_thread.c}
\vspace{-0.2cm}
\end{figure}
\subsection{Supported tests\label{sec:tests}}
The list below provides an overview of the different settings that can be applied. Later, \autoref{sec:evaluation_ca} will present the results for different combinations of these settings.
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{The service type} (\autoref{tab:service_types}) can be varied between \gls{rc}, \gls{uc}, and \gls{ud}.
\item \textbf{The poll mode} (\autoref{fig:poll_event_comparison}) can be set to \textit{busy polling} or \textit{wait for event}. The poll mode can be set independently for $t_{recv}$ and $t_{comp}$.
\item \textbf{Inline mode} (\autoref{sec:postingWRs}) can be turned on for small messages.
\item \textbf{Unsignaled completion} can be enabled. When this switch is set, send \glspl{wr} will not generate \glspl{wqe} when the \gls{hca} has processed them.
\item \textbf{The operation} (\autoref{tab:transport_modes}) can be set to \textit{send with immediate} or \textit{\gls{rdma} write with immediate}. Both operations are only supported \textit{with immediate} in this benchmark since the \acrshort{imm} header is used to identify the order of the messages at the receive side.
\item \textbf{The burst size} represents the number of messages that will be sent during one test and is limited by the maximum size of a \gls{qp} in the \gls{hca}. The benchmark continuously sends messages until this value is reached. It can be varied between 1 and 8192.
\item \textbf{An intermediate pause} (in nanoseconds) can be set. The benchmark will sleep for this amount of time in between the \texttt{ibv\_post\_send()} calls.
\item \textbf{Either the send or submit time} can be measured. This switch determines whether $t_{subm}$ or $t_{send}$ is measured.
\item \textbf{The message size} $S_M$ can be set to
\begin{equation}
S_M = \SI[parse-numbers=false]{8\cdot2^i}{\byte},\ i \in [0,12],
\end{equation}
where \SI{8}{\byte} is the minimum size of a message with a timestamp. A maximum of \SI{32768}{\byte} (\SI{32}{\kibi\byte}) is chosen because messages in VILLASnode are unlikely to be bigger than \SI{32}{\kibi\byte}.
\end{itemize}
Although the possibility to submit linked lists of scatter/gather elements and work requests to the send queue will be used in the VILLASframework \textit{InfiniBand} node-type, its influence on latency will not be examined in this benchmark. Linking scatter/gather elements can come in handy if data from different locations in memory must be sent. Submitting linked work requests can be convenient if a whole batch of \glspl{wr} has to be posted and it is not necessary for a \gls{wr} to be posted immediately after its generation (e.g., creating a set of receive \glspl{wr} in a loop and posting the linked list right after the closing bracket of the loop). However, the lowest latency is achieved by passing only one memory location to the \gls{hca} and by sending a message immediately after the generation of the timestamp.
\section{VILLASframework InfiniBand node-type\label{sec:villas_implementation}}
\Autoref{chap:architecture} already introduced the architecture of node-types in VILLASframework and concepts to enable compatibility of \glspl{via}---and in particular the \gls{iba}---with VILLASframework. The key objective of the development of an \textit{InfiniBand} node-type was the implementation of all functions in \autoref{a:nodetype_functions} with as few alterations to the pre-existing architecture as possible. Other than the proposed changes from \autoref{sec:proposal}, the VILLASframework architecture was not modified with regard to the node-type interface and the memory management.
The implementation of the more apparent functions, e.g., \texttt{parse()}, \texttt{check()}, \texttt{reverse()}, \texttt{print()}, \texttt{destroy()}, and \texttt{stop()}, will not be discussed. This section mainly focuses on non-obvious functions, which are either InfiniBand specific (i.e., the start-function in \autoref{sec:villas_start}) or had to be optimized to make full use of the kernel bypass InfiniBand offers (i.e., the read- and write-functions in \autoref{sec:villas_read} and~\ref{sec:villas_write}, respectively). The complete source code of the \textit{InfiniBand} node-type can be found on VILLASnode's public Git repository.\footnote{\url{https://git.rwth-aachen.de/acs/public/villas/VILLASnode/}}
\subsection{Start-function\label{sec:villas_start}}
After a configuration file, which is set by a user, is interpreted by the parse-function and reviewed by the check-function, the super-node will invoke the start-function to initialize all necessary structures. It starts with the creation of a communication event channel with \texttt{rdma\_create\_event\_channel()} and the initialization of an \gls{rdma} communication identifier with \texttt{rdma\_create\_id()}. The latter is bound to both a local InfiniBand device that was defined in the configuration file and the event channel.
Before the node allocates the protection domain with \texttt{ibv\_alloc\_pd()}, the communication identifier tries to resolve the remote address with \texttt{rdma\_resolve\_addr()} (in case of an active node) or places itself into a listening state with \texttt{rdma\_listen()} (in case of a passive node). Whether the node becomes an active or passive node depends on the presence of a remote host address to connect to in the configuration file. Finally, the start-function creates a separate thread with \texttt{pthread\_create()}~\cite{kerrisk2010linux} to monitor all asynchronous events on the \texttt{rdma\_cm\_id}.
When everything is set up successfully, the start-function will return 0, to indicate success. The super-node then moves the node to the \textit{started} state (\autoref{fig:villasnode_states}).
\subsection{Communication management thread\label{sec:comm_management}}
The function that is executed by the thread that is spawned by the start-function is kept busy by a while loop until the node is moved to the \textit{started} state. This avoids races and ensures that the state transitions from \autoref{fig:villasnode_states} are obeyed.
The remainder of this function consists of a while loop that monitors the communication identifier in a blocking manner with \texttt{rdma\_get\_cm\_event()} (\autoref{sec:rdmacm}). Within this loop, the different events are handled by a switch statement. The loop, the switch statement, and a short description of what happens for every case are displayed in \autoref{lst:cm_switch}. Before expanding on the different operations of every case, a note on the blocking characteristics of \texttt{rdma\_get\_cm\_event()} has to be made. This function enables the \gls{os} to suspend further execution of the thread for an indefinite amount of time, which usually results in difficulties when trying to cancel (or kill) the thread. However, \texttt{read()}, which lies at the heart of \texttt{rdma\_get\_cm\_event()}, is a required cancellation point. A thread for which cancelability is enabled only acts upon cancellation requests when it reaches a cancellation point~\cite{kerrisk2010linux}. Furthermore, as defined in IEEE Std 1003.1\texttrademark-2017~\cite{posix2018}: ``[when] a cancellation request is made with the thread as a target while the thread is suspended at a cancellation point, the thread shall be awakened and the cancellation request shall be acted upon.'' Thus, even though the thread is suspended, it can be canceled with \texttt{pthread\_cancel()} if necessary.
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=The events that are monitored by the communication management thread. Although not explicitly stated in this listing\comma every case block ends with a \texttt{break}.,
label=lst:cm_switch,
style=customc]{listings/cm_switch.c}
\vspace{-0.2cm}
\end{figure}
\paragraph{Active node} As defined in the previous subsection, an active node is a node that tries to connect to another node. The first event that should appear after the start-function has been called is \texttt{RDMA\_CM\_EVENT\_ADDR\_RESOLVED}. This event denotes that the address has been resolved and that the \gls{qp} and two \glspl{cq}---one for the receive and one for the send queue---can be created. These instances are created using \texttt{rdma\_create\_qp()} and \texttt{ibv\_create\_cq()}, respectively. It is important for the functioning of the \textit{InfiniBand} node-type's write-function (\autoref{sec:villas_write}) that the \gls{qp}'s initialization attribute \texttt{sq\_sig\_all} is set to \zero.
After all necessary structures have been initialized, \texttt{rdma\_resolve\_route()} will be invoked. Then, when the route has successfully been resolved, the event channel will unblock again and return \texttt{RDMA\_CM\_EVENT\_ROUTE\_RESOLVED}. This means that everything is set up, and \texttt{rdma\_connect()} may be called to invoke a connection request. The state of the active node is then set to \textit{pending connect}.
When the remote node accepts the connection, \texttt{RDMA\_CM\_EVENT\_ESTABLISHED} occurs and the state of the node is set to \textit{connected}.
If the node operates with the \gls{ud} service type, the last-mentioned event structure contains the \acrfull{ah}, which includes the information to reach the remote node. This value is saved because in \gls{ud} mode it has to be defined in every work request (\autoref{sec:postingWRs}). Although the node is not really connected---after all, \gls{ud} is an unconnected service type---it is transitioned to the \textit{connected} state. In the context of VILLASnode, this state implies that data can be sent, either because the \glspl{qp} are connected or because the remote \gls{ah} is known.
\paragraph{Passive node} As mentioned before, a passive node listens on the communication identifier and waits until another node reaches out to it. If another node calls \texttt{rdma\_connect()} on it, the channel will unblock and return the event \linebreak\texttt{RDMA\_CM\_EVENT\_CONNECT\_REQUEST}. Thereon, the node will build its \gls{qp}, its \glspl{cq}, and accept the connection with \texttt{rdma\_accept()}. If the service type of the node is a connected service type (i.e., \gls{uc} or \gls{rc}), the node will move to the \textit{pending connect} state. If the service type is unconnected (i.e., \gls{ud}), it will move directly to the \textit{connected} state.
In case of a connected service type, the \texttt{RDMA\_CM\_EVENT\_ESTABLISHED} event occurs when the connection has successfully been established. The state is then set to \textit{connected}.
\paragraph{Error events} Error events that are caused because a remote node could not be reached are not necessarily fatal for the entire node. In this case, a fallback function is invoked which sets the node into listening mode instead of active mode. This behavior is configurable: if a user sets the appropriate flag in the configuration file, these errors can be made fatal.
\subsection{Read-function\label{sec:villas_read}}
This subsection focuses on the implementation of the read-function which was previously proposed in \autoref{sec:readwrite_interfaces}. Contrary to the functioning principle in \autoref{fig:villas_read}, which suggests that all samples that are passed to the read-function will definitely be submitted and must thus be held, there is a chance that some samples will not be submitted successfully. These samples must be released again.
\Autoref{fig:read_implementation} shows a decision graph for the algorithm that is implemented by the read-function. The example case, depicted by the red path, assumes that 5 empty samples are passed to the read-function, and that there are at least \textit{threshold} \glspl{wqe} in the \gls{rq}. This threshold, which is set in the configuration file, is necessary to ensure that a node can always receive samples because there are always at least \textit{threshold} pointers in the \gls{rq}. If this threshold has not yet been reached, all passed samples are submitted to the receive queue and \texttt{*release} is set to 0 (depicted by the black path). Then, the function returns with $ret=0$, without ever polling the completion queue.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics{images/read_implementation.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{-0.6cm}
\includegraphics{images/read_write_implementation_legend.pdf}
\vspace{-1.4cm}
\end{subfigure}
\caption{The decision graph for the read-function in the \textit{InfiniBand} node. Prior to invoking the read-function, \texttt{*release} is always set to \texttt{cnt} by the super-node.}\label{fig:read_implementation}
\end{figure}
If the threshold has been reached, the red path is followed. The completion queue is polled in a while loop until at least one, but not more than \texttt{cnt}, \glspl{cqe} are available. This blocks further execution of the read-function, which is the intended behavior. After all, when a certain number of \glspl{wqe} already resides in the \gls{rq}, it is undesirable to continue to submit new \glspl{wr}: at a certain moment, the queue would be full and it would no longer be possible to submit the new addresses that the node receives from the super-node. Submitting these addresses, however, is necessary to free places in \texttt{*smps[]}, which can only hold up to \texttt{cnt} values. So, if this blocking behavior were not in place, the super-node would keep passing new addresses until the receive queue overflowed and no addresses from \glspl{cqe} could be returned to the super-node anymore.
Because \texttt{ibv\_poll\_cq()} does not rely on any of the system calls that are listed in~\cite{posix2018}, the thread that contains this while loop would not notice if a cancellation request is sent. Therefore, \texttt{pthread\_testcancel()}~\cite{kerrisk2010linux} should regularly be called within this loop.
The addresses in \texttt{*smps[]} are not immediately swapped with the \textit{X} addresses that are returned with \texttt{ibv\_poll\_cq()}. First, after the poll-function has indicated that \textit{X} \glspl{cqe} with addresses are available, \textit{X} addresses from \texttt{*smps[]} are submitted to the \gls{rq}. This ensures that the \gls{rq} does not drain and it makes room for the addresses from the \glspl{cqe} in \texttt{*smps[]}. Finally, the polled addresses are swapped with the addresses that were posted to the receive queue, and the read-function returns with $ret = X$. Note that \texttt{*release} remains untouched: all values in \texttt{*smps[]} are either received or not used, and must thus be released.
\subsection{Write-function\label{sec:villas_write}}
The write-function, depicted in \autoref{fig:write_implementation}, is a bit more complex than the read-function. This time, the algorithm depicted in the decision graph includes four example cases.
Immediately after the write-function is invoked by the super-node, it tries to submit all \texttt{cnt} samples to the \gls{sq}. While going through \texttt{*smps[]}, the node dynamically checks whether the data can be sent inline (\autoref{sec:postingWRs}) and whether an \gls{ah} must be added. The node has to distinguish among four cases:
\begin{itemize}
\setlength\itemsep{0.2em}
\item the samples will be submitted normally and may thus not be released by the super-node until a \gls{cqe} with the address appears;
\item the samples will be submitted normally, but some samples will be immediately marked as \textit{bad} and must thus be released by the super-node;
\item the samples will be sent inline and, because the \gls{cpu} directly copies them to the \gls{hca}'s memory, must thus be released by the super-node;
\item an arbitrary combination of the cases mentioned above.
\end{itemize}
For samples that are sent normally, the \gls{wr}'s \texttt{send\_flags} (\autoref{lst:ibv_send_wr}) must be set to \texttt{IBV\_SEND\_SIGNALED}. These samples may only be released after the \gls{hca} has processed them, which does not necessarily happen during the same call of the write-function. The only way for the \gls{hca} to let the node know that it is done with a sample is through a completion queue entry. Since the \gls{qp} is created with \texttt{sq\_sig\_all=0}, the generation of \glspl{cqe} for samples must be requested explicitly.
When a sample is sent inline, \texttt{send\_flags} must only be set to \texttt{IBV\_SEND\_INLINE}. It is not desired to get a \gls{cqe} for an inline \gls{wr} since it can be---and thus will be---released immediately after being submitted to the \gls{sq}. After all, it is not possible to release a sample twice.
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics{images/write_implementation.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{-0.6cm}
\includegraphics{images/read_write_implementation_legend.pdf}
\vspace{-1.4cm}
\end{subfigure}
\caption{The decision graph for the write-function in the \textit{InfiniBand} node. Prior to invoking the write-function, \texttt{*release} is always set to \texttt{cnt} by the super-node.}
\label{fig:write_implementation}
\end{figure}
There is one exception to this, however. Although no notifications will be generated if the \textit{signaled} flag is not set, the send queue will start to fill up nonetheless. Therefore, when many consecutive \glspl{wr} are submitted with the \textit{inline} flag set, a \gls{wr} with the \textit{signaled} flag must occasionally be submitted. For this reason, the write-function contains a counter which, upon reaching a configurable threshold, changes an \texttt{IBV\_SEND\_INLINE} to an \texttt{IBV\_SEND\_SIGNALED}.
When all samples have been submitted to the \gls{sq}, the value \textit{ret}, which will be returned to the super-node when the write-function returns, is set to the total number of samples that were successfully posted to the send queue.
Now, because the node can only use \texttt{*release} to communicate how many samples to release, \texttt{*smps[]} must be reordered. All samples that must be released, i.e., samples that were not successfully submitted to the send queue or samples that were sent inline, must be placed at the top of the list.
In the next step, the write-function shall try to poll
\begin{equation}
C_{poll} = \texttt{cnt} - C_{release}
\end{equation}
\glspl{cqe}, which corresponds to the number of places in \texttt{*smps[]} that are still free. Here, \texttt{cnt} is the total number of samples in \texttt{*smps[]} and $C_{release}$ the number of samples that have already been marked to be released when the write-function returns. It is certain that all addresses that return from the \gls{cq} must be released, since samples that were sent inline will not generate a \gls{cqe}.
\subsection{Overview of the InfiniBand node-type\label{sec:overview}}
\Autoref{fig:villasnode_implementation} summarizes all components in the VILLASnode \textit{InfiniBand} node-type. Every component that is marked with an asterisk is listed in \autoref{tab:infiniband_node_components}. Here, the sections that describe the respective basics (\autoref{chap:basics}), architecture (\autoref{chap:architecture}), and implementation (\autoref{chap:implementation}) are summarized.
\input{tables/infiniband_node_components}
\begin{figure}[ht!]
\includegraphics{images/villasnode_implementation.pdf}
\caption{An overview of the VILLASnode \textit{InfiniBand} node-type and its components.}
\label{fig:villasnode_implementation}
\end{figure}
\newpage
\section{VILLASnode node-type benchmark\label{sec:villas_benchmark}}
The VILLASnode node-type benchmark is intended to compare different node-types with each other. The structure of the benchmark is depicted in \autoref{fig:villas_benchmark}. The node-type under test could be, for example, the \textit{InfiniBand} node-type. The benchmark is completely based on existing mechanisms within VILLASnode.
First, a \textit{signal} node generates samples which, as aforementioned, also include timestamps. These samples are then sent to a \textit{file} node, which in turn writes them to a \gls{csv} file, here called \textit{in}. Simultaneously, the samples are sent to a sending instance of the node-type that is being tested. Eventually, a receiving instance of that node-type adds a receive timestamp and sends the samples to a second \textit{file} node. This node writes the samples to a \gls{csv} file called \textit{out}.
\begin{figure}[ht!]
\includegraphics{images/villas_benchmark.pdf}
\vspace{-0.2cm}
\caption{The VILLASnode node-type benchmark is formed by connecting a \textit{signal} node, two \textit{file} nodes, and two instances of the node-type that shall be tested.}
\label{fig:villas_benchmark}
\end{figure}
Although the \textit{out} log file will contain both the generation timestamp and the receive timestamp, the \textit{in} log file is necessary to monitor and analyze lost samples. This benchmark is meant to analyze the latencies of the different node-types, but also to discover their limits. Because it is possible that the signal generation misses steps at high frequencies (more on that in the next subsection), a missing sample in the \textit{out} log file does not necessarily mean that something went wrong within the nodes that were tested. By comparing the \textit{in} and \textit{out} log files, the benchmark can determine which samples were missed by the \textit{signal} node, and which samples were missed by the node that was tested.
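This comparison step can be sketched as follows; the function name is hypothetical, and the only assumption is that the sequence numbers can be extracted from both log files:

```python
# Hypothetical sketch: classify missing samples by comparing the
# sequence numbers found in the "in" and "out" log files.
def classify_missing(in_seqs, out_seqs, expected):
    in_set, out_set = set(in_seqs), set(out_seqs)
    # never generated: the signal node itself missed these steps
    missed_by_signal = [s for s in expected if s not in in_set]
    # generated but never received: lost by the node under test
    lost_by_node = [s for s in expected if s in in_set and s not in out_set]
    return missed_by_signal, lost_by_node
```

For example, \texttt{classify\_missing([0, 1, 3], [0, 3], range(4))} attributes sequence number 2 to the \textit{signal} node and sequence number 1 to the node under test.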
\subsection{Signal generation rate\label{sec:signal_generation}}
In order for the benchmark to create an environment similar to the real use cases of VILLASnode, the \textit{signal} node must be time-aware and insert samples at a given rate. This injection rate of samples must be adjustable. Although this work only focused on rates between \SI{100}{\hertz} and \SI{100}{\kilo\hertz}, lower and higher rates are theoretically possible.
\Autoref{lst:signal_generation} displays a simplified version of the \textit{signal} node-type's read-function. When a super-node that holds a \textit{signal} node tries to acquire samples from it, it calls its read-function. This function blocks further execution until a function \texttt{task\_wait()} returns. Assuming that the super-node would usually call the read-function at an infinitely high frequency, the wait-function ensures that it now only returns after a fixed amount of time.
The wait-function returns an integer \texttt{steps}, which indicates the number of steps between the timestamps of the samples. Let us assume that
\begin{equation}
t_{\texttt{task\_wait()}}^{i+1} > t_{sample}^{i} + \SI[parse-numbers = false]{\frac{1}{f_{signal}}}{\second},
\label{eq:timing_violation}
\end{equation}
when attempting to generate the sample with the timestamp $t^{i+1}$. Here, $t_{\texttt{task\_wait()}}^{i+1}$ is the moment \texttt{task\_wait()} is called, $t_{sample}^i$ the moment the last sample was generated, $i$ the iteration of \texttt{signal\_generator\_read()}, and $f_{signal}$ the frequency the \textit{signal} node is set to. When the condition in \autoref{eq:timing_violation} holds, \texttt{task\_wait()} cannot wait until $t_{sample}^{i+1}$ since that time has already passed. Hence, the function must wait until
\begin{equation}
t_{sample}^{i+2} = t_{sample}^{i} + 2\cdot\SI[parse-numbers = false]{\frac{1}{f_{signal}}}{\second},
\end{equation}
in order to stay synchronized with the set frequency. Now, instead of 1 step, 2 timesteps have passed since the last call of the wait-function. In other words, 1 step is missed.
After the missed steps have been counted and the timestamp has been calculated, the actual samples are generated. These will be returned to the super-node through the \texttt{*smps[]} parameter of the read-function. This behavior is similar to that of the read-function of the \textit{InfiniBand} node-type.
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Simplified version of the read-function of the \textit{signal} node-type.,
label=lst:signal_generation,
style=customc]{listings/signal_generation.c}
\vspace{-0.2cm}
\end{figure}
This subsection expands on two different methods to implement \texttt{task\_wait()} and thus to control the rate at which samples are sent. Although the first method is the simpler and preferred one, it does not work for high frequencies such as the \SI{100}{\kilo\hertz} at which the \textit{InfiniBand} node can operate. For these frequencies, the second method is introduced.
\paragraph{Timer expiration notifications via a file descriptor}
Linux provides an \gls{api} for timers. The function \texttt{timerfd\_create()} creates a new timer object and returns a file descriptor that refers to that timer. Once the timer's period is set with \texttt{timerfd\_settime()}, the file descriptor can be read with \texttt{read()}~\cite{kerrisk2010linux}.
\Autoref{lst:timerfd_wait} shows the implementation of \texttt{task\_wait()} with a Linux timer object. When \texttt{read()} is called on the timer's file descriptor (line 6, \autoref{lst:timerfd_wait}), it will write the number of elapsed periods since the last modification of the timer or since the last read to \texttt{steps}. If no complete period has gone by when \texttt{read()} is called, the function will block until this is the case.
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Implementation of \texttt{task\_wait()} by waiting on timer expiration notifications via a file descriptor.,
label=lst:timerfd_wait,
style=customc]{listings/timerfd_wait.c}
\vspace{-0.2cm}
\end{figure}
Although Linux' \gls{api} for timer notifications via a file descriptor offers a convenient way of keeping track of elapsed time periods, it is not suited for high-frequency signals. On the one hand, \texttt{read()} causes a system call, which is relatively expensive since it causes a switch between user and kernel mode. On the other hand, the operating system is inclined to suspend the process when the read-function blocks. Since it takes a certain amount of time to wake up the process when a period has elapsed, this can cause a potential timing violation for the next sample according to \autoref{eq:timing_violation}.
\paragraph{Busy polling the x86 Time Stamp Counter}
All x86 \glspl{cpu} since the Pentium era contain a 64-bit register called \gls{tsc}. Since the Pentium 4 era, this counter increments at a constant rate which depends on the maximum core-clock to bus-clock ratio or the maximum resolved frequency at which the processor is booted~\cite{guide2018intelc3b}. The nominal frequency can be calculated using:
\begin{equation}
f_{nominal}^{TSC} = \mathtt{\frac{CPUID.15H.ECX[31:0]\cdot CPUID.15H.EBX[31:0]}{CPUID.15H.EAX[31:0]}}.
\end{equation}
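As a numerical sanity check of this formula, a small helper can evaluate it for hypothetical register values (here assuming a \SI{24}{\mega\hertz} crystal clock in \texttt{ECX} and a TSC/crystal ratio of $88/2$ in \texttt{EBX}/\texttt{EAX}; these values are illustrative, not measured):

```c
#include <stdint.h>

/* Nominal TSC frequency from the CPUID.15H leaf:
 * eax = ratio denominator, ebx = ratio numerator,
 * ecx = crystal clock frequency in Hz. */
uint64_t tsc_nominal_hz(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    return (uint64_t)ecx * ebx / eax;
}
```

With the assumed values, \texttt{tsc\_nominal\_hz(2, 88, 24000000)} yields \SI{1.056}{\giga\hertz}.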
In his white paper~\cite{paoloni2010benchmark}, Paoloni describes how the \gls{tsc} can be used to measure elapsed time during code execution. In his work, the \gls{rdtsc} and \gls{rdtscp} instructions that are described in \cite{guide2018intelb2b} are used to read the \gls{tsc}. \Autoref{lst:tsc} shows the inline assembler that was used in VILLASnode to acquire the timestamp.
The functioning of both instructions is largely the same. After the \texttt{rdtsc}/\texttt{rdtscp} instruction is invoked, the 32 \gls{msb} of the timestamp are placed in \texttt{rdx} and the 32 \gls{lsb} in \texttt{rax}. To get a valid 64-bit variable, \texttt{rdx} is shifted left by \SI{32}{\bit} and subsequently ORed with \texttt{rax}. The resulting value is set as the output variable \texttt{tsc}, which is also returned by both functions in \autoref{lst:tsc}. During this operation, the high-order \SI{32}{\bit} of \texttt{rax}, \texttt{rdx}, and \texttt{rcx} are cleared. When hard-coded registers are clobbered as a result of inline assembly code, this must be revealed up front to the compiler (line 12, \autoref{lst:tsc}).
\begin{listing}[ht!]
\refstepcounter{lstlisting}
\noindent\begin{minipage}[b]{.46\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/rdtsc.h}
\captionof{sublisting}{\gls{rdtsc}.}\label{lst:tsc_a}
\end{minipage}%
\hfill
\begin{minipage}[b]{.46\textwidth}
\lstinputlisting[nolol=true, style=customc]{listings/rdtscp.h}
\captionof{sublisting}{\gls{rdtscp}.}\label{lst:tsc_b}
\end{minipage}
\addtocounter{lstlisting}{-1}
\captionof{lstlisting}{The \gls{rdtsc} instruction with fencing and the \gls{rdtscp} instruction, written in inline assembler. Both functions must be placed inline and thus be preceded by \texttt{\_\_attribute\_\_((unused,always\_inline))}.}
\label{lst:tsc}
\end{listing}
The main difference between \gls{rdtsc} and \gls{rdtscp} is that, unlike the former, the latter waits until all previous instructions have been executed and all previous loads are globally visible. One consequence of this, among others, was described by Paoloni~\cite{paoloni2010benchmark}. He demonstrated that \gls{rdtsc} showed a standard deviation of 6.9 cycles, whereas \gls{rdtscp} only showed a standard deviation of 2 cycles.
Since not all x86 processors support \gls{rdtscp}, VILLASnode nonetheless includes \gls{rdtsc}. However, to improve its behavior, the \gls{lfence} instruction~\cite{guide2018intelb2a} is executed prior to the actual read instruction. This type of fence serializes all load-from-memory instructions prior to its call. Furthermore, no instructions that are placed after the load fence execute until the fence has completed.
\Autoref{lst:rdtscp_wait} shows the implementation of \texttt{task\_wait()} based on the \gls{tsc}. During the $(i+1)^{th}$ call of \texttt{task\_wait()}, the counter is busy polled until the desired timestamp $t_{sample}^{i+1}$ is reached. Then, it updates the next timestamp $t_{sample}^{i+2}$ and simultaneously calculates whether $t_{sample}^{i+1}$ is actually only one step after $t_{sample}^{i}$ or if some steps were missed. The period can be calculated according to:
\begin{equation}
T = \frac{f_{nominal}^{tsc}}{\mathtt{rate}}.
\end{equation}
\begin{figure}[ht!]
\vspace{0.4cm}
\lstinputlisting[caption=Implementation of \texttt{task\_wait()} by busy polling the x86 \acrfull{tsc}.,
label=lst:rdtscp_wait,
style=customc]{listings/rdtscp_wait.c}
\vspace{-0.2cm}
\end{figure}
The advantage of this implementation of \texttt{task\_wait()} is that given periods can be approximated very accurately ($\sigma=2$ clock cycles~\cite{paoloni2010benchmark}). Complications are now more likely to arise because \texttt{signal\_generator\_read()} is not called frequently enough, for example, when the datapath is too long.
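A condensed sketch of this busy-polling scheme is shown below. It assumes an x86-64 target; the structure, helper names, and the simplified step accounting are illustrative, not the actual VILLASnode code:

```c
#include <stdint.h>

/* Read the TSC with rdtscp (x86-64 only); rcx receives TSC_AUX
 * and is therefore declared as clobbered. */
static inline uint64_t rdtscp_u64(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi) : : "rcx");
    return ((uint64_t)hi << 32) | lo;
}

struct tsc_task {
    uint64_t period; /* T = f_tsc / rate, in TSC ticks */
    uint64_t next;   /* deadline of the next sample, in TSC ticks */
};

/* Busy poll until the next deadline; return the number of elapsed
 * steps (a value > 1 means steps were missed). */
uint64_t task_wait(struct tsc_task *t)
{
    uint64_t now;
    while ((now = rdtscp_u64()) < t->next)
        ; /* spin */
    uint64_t steps = (now - t->next) / t->period + 1;
    t->next += steps * t->period; /* stay in sync with the set rate */
    return steps;
}
```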
\subsection{Further optimizations of the benchmark's datapath\label{sec:optimizations_datapath}}
Before the \textit{signal} node from \autoref{fig:villas_benchmark} generates a sample, it checks whether steps were missed. Then, after it has generated a sample, the super-node has to write it to the \textit{file} node and to an instance of the node-type that is being tested. Only then can the \textit{signal} node generate the next sample. Both the time that is spent on this check and the time that is spent in the \textit{file} node are part of the datapath and affect the time it takes before \texttt{task\_wait()} is invoked again. Increasing $t_{\texttt{task\_wait()}}^{i+1}$ accordingly increases the chance of a timing violation according to \autoref{eq:timing_violation}. It is thus desirable to minimize the time that is spent on the check and in the \textit{file} node.
\paragraph{Suppressing information to the standard output} Originally, a \textit{file} node always kept track of the total number of missed steps and wrote a message to the standard output as soon as one or more steps were missed. Especially the latter is relatively expensive, since \texttt{printf()}~\cite{kerrisk2010linux} invokes a system call. For high rates, this can cause a snowball effect: the situation only occurs when the generation rate is already too high to meet the timing requirements, and the time spent in the datapath is then increased even further by system calls that write to the standard output. Since the missed steps can also be derived from the \textit{in} and \textit{out} log files, internal logging of missed steps was made configurable. When minimal latency is required, as in the case of the VILLASnode node-type benchmark, it can be disabled with a flag in the configuration file.
\paragraph{Buffering the file stream} Usually, each call to the \textit{stdio} library---which is used by the file node-type to read from and write to files---results in a system call. Although it is not possible to get rid of these system calls completely---after all, they are necessary to write to the \textit{in} and \textit{out} log files---they should be reduced to an absolute minimum in the datapath. To achieve this, the file node-type was modified so that the buffering of the file stream can be configured. Now, a user can define the size of a buffer in the configuration file. Buffering is controlled with \texttt{setvbuf()} \cite{kerrisk2010linux}, which enables an instance of the file node-type to read or write data in units equal to the size of that buffer.
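A sketch of how such configurable buffering could look with \texttt{setvbuf()} is given below; the function name and parameters are illustrative, not the actual file node-type code:

```c
#include <stdio.h>

/* Sketch: enable full buffering on a log-file stream. Passing NULL
 * lets stdio allocate the buffer itself; data then reaches the
 * kernel only when the buffer is full or on fflush()/fclose(),
 * which reduces the number of write() system calls. */
FILE *open_buffered(const char *path, size_t buffer_size)
{
    FILE *f = fopen(path, "w");
    if (f != NULL)
        setvbuf(f, NULL, _IOFBF, buffer_size);
    return f;
}
```

Note that \texttt{setvbuf()} must be called after opening the stream but before any other operation is performed on it.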
\section{Enabling UC support in the RDMA CM\label{sec:uc_support}}
The \gls{rdma} \gls{cm} does not officially support unreliable connections. However, by modifying small parts of the \texttt{librdmacm} library and by re-compiling it, it is possible to facilitate \gls{uc} anyway. This enables the present work to also analyze the unreliable connection with the custom and the VILLASnode node-type benchmark.
To enable support, the \texttt{rdma\_create\_id2()} function of the \texttt{librdmacm} has to be made non-static. As a result, this function can directly be accessed, whereas it is normally only accessible through the wrapper \texttt{rdma\_create\_id()}. Now, the \gls{qp} type can also be passed on to the \gls{rdma} \gls{cm} library, and by passing \texttt{RDMA\_PS\_IPOIB} as \texttt{port\_space} and \texttt{IBV\_QPT\_UC} as \texttt{qp\_type}, a managed \gls{uc} \gls{qp} will be created.
\section{Processing data\label{sec:processing_data}}
In order to analyze the generated comma-separated value dumps, several Python~3.7 scripts were developed in Jupyter Notebook.\footnote{\url{https://python.org}}\footnote{\url{https://jupyter.org}} Jupyter Notebook (formerly IPython Notebook) is part of Project Jupyter and allows a user to interactively explore Python scripts. On the one hand, it enables (stepwise) execution of Python code in a web browser, based on IPython~\cite{perez2007ipython}. On the other hand, rich text documentation, written in Markdown\footnote{\url{https://daringfireball.net/projects/markdown/}}, can directly be included in the document. The documentation, together with the source code, can be exported to several formats, e.g., to \texttt{.py}, \texttt{.tex}, \texttt{.html}, \texttt{.md}, and \texttt{.pdf}.
Jupyter Notebook's command line \gls{api} also makes it highly suitable for the automatic analysis of large datasets of timestamps. It is, for example, included in the \acrshort{cicd} pipeline of VILLASnode to automatically analyze the performance impact of certain changes in the source code and to compare node-types against each other. Furthermore, the scripts are included in the present work's build automation, which makes it possible to easily convert raw data from the benchmarks to convenient graphs.
Besides several standard libraries, NumPy\footnote{\url{http://numpy.org}}---which adds support for numerical calculations in Python---and matplotlib\footnote{\url{https://matplotlib.org}}---which adds a comprehensive toolset to create 2D plots---were used.
\subsection{Processing the host channel adapter benchmark's results\label{sec:processing_hca}}
\paragraph{Histograms} The first type of graph that is used in \autoref{chap:evaluation} and \autoref{a:results_benchmarks} is a histogram. The Python script that generates this graph first needs the path that contains the timestamps. This can be passed on through the command line or directly in the notebook. Then, the script loads the \acrshort{json} file that must be present in every data directory. It contains settings on how to process the data, but also information about the plots, e.g., dimensions of the figure and labels.
When all preparatory work is done, the Python script loads all timestamps as defined in \autoref{sec:timestamps}. To keep the minimum message size as low as possible, this benchmark only sends the 8-byte \texttt{long tv\_nsec} from \autoref{lst:timespec}. However, the complication with only sending this long integer is that it overflows from \SI{999999999}{\nano\second} to \SI{0}{\nano\second}. But, since transmissions cannot take longer than \SI{1}{\second}---assuming no severe errors occur---this overflow is resolved by adding \SI{1}{\second} to $t_{recv}$ and $t_{comp}$ if they are smaller than $t_{subm}$/$t_{send}$.
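The overflow correction described above amounts to the following, sketched with NumPy (the function name is assumed):

```python
import numpy as np

NSEC_PER_SEC = 1_000_000_000

def fix_overflow(t_start, t_end):
    """If an end timestamp is smaller than its start timestamp, the
    8-byte tv_nsec counter wrapped from 999999999 ns to 0 ns exactly
    once (transmissions take less than a second), so add one second."""
    t_start = np.asarray(t_start, dtype=np.int64)
    t_end = np.asarray(t_end, dtype=np.int64)
    return np.where(t_end < t_start, t_end + NSEC_PER_SEC, t_end)
```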
Subsequently, all data is displayed in a histogram. To be able to see differences in the distribution of latencies at a glance and thus to make the comparison of the results easier, all histograms range from \SI{0}{\nano\second} to \SI{10000}{\nano\second}. A small box in the top left or top right corner then provides information on the percentage of values above this limit and about the maximum value. A red, vertical line indicates the median value of the data set.
This script is able to compare data sets from the same run---for example, $t_{lat}$ and $t_{lat}^{comp}$---or data sets from different runs---for example, $t_{lat}$ from various runs with distinct settings. In the present work, the former and the latter first occur in \autoref{fig:oneway_event} and \autoref{fig:oneway_inline}, respectively.
\paragraph{Median plot with variability indication} Histograms are great for getting a more comprehensive view of the distribution of latencies and the effect specific changes have on this distribution. However, this type of plot is not suitable for displaying many different setups in one comprehensible graph. Therefore, a simple line chart is used to display the median values of several data sets. In order to add information about dispersion of latency, error bars are added to every marker. In the present work's line charts, these indicate an \SI{80}{\percent} interval around the median value. Thus, for every marker, \SI{10}{\percent} of the values are bigger than the upper limit of the error bar and \SI{10}{\percent} of the values are smaller than the lower limit.
In the present work, this type of graph first occurs in \autoref{fig:oneway_message_size}.
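The error-bar computation can be sketched with NumPy's percentile function (the helper name is assumed):

```python
import numpy as np

def median_with_band(latencies):
    """Median plus an 80% interval: 10% of the values lie above the
    upper limit and 10% below the lower limit of the error bar."""
    lo, med, hi = np.percentile(latencies, [10, 50, 90])
    return med, med - lo, hi - med  # center, lower/upper error length
```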
\subsection{Processing the VILLASnode node-type benchmark's results}
As discussed in \autoref{sec:villas_benchmark} and depicted in \autoref{fig:villas_benchmark}, the VILLASnode node-type benchmark results in two files with data: an \textit{in} and \textit{out} file. For every sample, the former includes a generation timestamp, a sequence number, and the actual values of the sample. Additionally, the latter includes a receive timestamp which is computed by the receiving instance of the node-type that is being benchmarked.
The VILLASnode node-type benchmark serves two purposes. On the one hand, there must be a graph that shows the performance of all node-types in one glance and makes comparison of node-types easy. For this purpose, the line graph from the previous subsection is well suited.
On the other hand, the benchmark should give a comprehensive insight into the latency distribution and the maxima of a certain node-type. For this purpose, the histogram from \autoref{sec:processing_hca} is better suited. However, as described in \autoref{sec:villas_benchmark}, these graphs should also provide information about the limitations of node-types. Not all node-types will be limited to the same maximum frequency. Therefore, the graph should provide additional information about the missing samples in the \textit{in} and \textit{out} file. By comparing these files, it can be determined if samples were not transmitted by the node-type that was tested.
\paragraph{3D surface plot} To be able to wrap all information up in one plot, a third type of graph is introduced: the 3D surface plot. With this type of graph, it is possible to vary both the message size and sample generation rate, whilst still displaying all data in a comprehensible manner. In addition to the median latencies of size/generation rate combinations, an indication of the percentage of missed steps is plotted. In that way, it is easy to identify which combinations were detrimental for the sample generation.
In the present work, this type of graph first occurs in \autoref{fig:rate_size_3d_RC}.
\chapter{Introduction}
\section{Motivation\label{sec:motivation}}
At present, there is an increasing shift in electric energy generation from centralized---often environmentally harmful---power plants to distributed renewable energy sources. In their paper on intelligence in future electric energy systems, Strasser et al.~\cite{strasser2015review} describe the new challenges which arise together with this shift to sustainable electric energy systems.
\subsection{New challenges in power system simulations}
Nowadays, \glspl{drts} are most frequently used to get accurate models of the output waveforms of electric energy systems. In \gls{rt} simulations, the equations of one time step in the simulation have to be solved within the corresponding time span in the actual physical world. As Faruque et al.~\cite{faruque2015real} describe, \gls{drts} can be divided into two classes: \textit{full digital} and \gls{phil} real-time simulations. While the former are completely modeled inside the simulator, the latter provide \gls{io} interfaces which allow the user to replace digital models with actual physical components.
Since power grids should be reflected in power system models as accurately as possible, more complex grids will naturally result in more complex simulations. Hence, the shift towards distributed electric energy generation poses new challenges regarding \gls{drts} complexity. One possible solution to counteract the arising computational bottlenecks is the distribution of simulation systems into smaller sub-systems~\cite{faruque2015real}.
As a solution to this problem, Stevic et al.~\cite{stevic2017multi} propose a framework which enables geographically distributed laboratories to integrate their off-the-shelf real-time digital simulators virtually, thereby also enabling \gls{rt} co-simulations. Later, Mirz et al.~\cite{mirz2018distributed} summarized other important benefits of such a system: hardware and software of various laboratories can be shared; easy knowledge exchange among research groups is facilitated and encouraged; there is no need to share confidential data since every laboratory can decide to run its own simulations and only share interface variables; laboratories without certain hardware can now, nonetheless, test algorithms on this hardware.
The following subsection presents the implementation of such a system, as presented by Vogel et al.~\cite{vogel2017open}: \textit{VILLASframework}.
\subsection{VILLASframework: distributed real-time co-simulations\label{sec:intro_villas}}
VILLASframework\footnote{\url{https://www.fein-aachen.org/projects/villas-framework/}} is an open-source set of tools to enable distributed real-time simulations, published under the \gls{gpl} v3.0. Within VILLASframework, \textit{VILLASnode} instances form gateways for simulation data. \Autoref{tab:villasnode_nodes} shows the interfaces---called \textit{node-types} in VILLASnode---that are currently supported. Node-types can roughly be divided into three categories: node-types that can solely communicate with node-types on the same server (\textit{internal communication}), node-types that can communicate with node-types on different servers (\textit{server-server communication}), and node-types that form an interface between a simulator and a server (\textit{simulator-server communication}). An instance of a node-type is called a \textit{node}.
\input{tables/villasnode_nodes}
\Autoref{fig:villasframework} shows VILLASframework with its main components: VILLASnode and \textit{VILLASweb}. The figure shows nodes in laboratories that form gateways between software (e.g., a file on a host system) or hardware (e.g., a simulator). A node can also be connected to other nodes; these can be located on the same host system, on a different host system in the same laboratory, or on a host system in a remote laboratory. Within VILLASframework, a distinction must be made between the \textit{soft real-time integration layer} and the \textit{hard real-time integration layer}.
\begin{figure}[ht]
\includegraphics{images/villasframework.pdf}
\vspace{-0.5cm}
\caption{VILLASweb and VILLASnode, the main components of VILLASframework.}\label{fig:villasframework}
\end{figure}
Although node-types that realize internal communication are able to achieve hard real-time, none of the node-types that connect different hosts with each other are able to do so. So far, all node-types rely on the \gls{tcp}---e.g., \textit{amqp} and \textit{mqtt}---or on the \gls{udp}---e.g., \textit{socket}. Both protocols are part of the Internet protocol suite's transport layer and these nodes thus rely on Ethernet as networking technology.
Within Ethernet, a large portion of the latency between submitting a request to send data and actually receiving the data is caused by software overhead, switches between user and kernel space, and interrupts. For example, Larsen and Huggahalli~\cite{larsen2009architectural} report that, on average, it takes \SI{3}{\micro\second} on their Linux system before control is actually handed to the \gls{nic} when a host tries to send a simple ping message. For the Intel\textregistered{} 82571 \SI{1}{\gigabitethernet} controller they used, these \SI{3}{\micro\second} are \SI{72}{\percent} of the time the message spends in the sending node. Similar proportions of software and hardware latency can be seen at the receiving host. After optimizations, Larsen and Huggahalli reduced the latency of an Intel\textregistered{} 82598 \SI{10}{\gigabitethernet} controller to just over \SI{10}{\micro\second}, in which software latency was still predominant.
Another issue of Ethernet is its variability~\cite{larsen2009architectural}: real-time applications require a high predictability and thus low variability of the latency of samples. Furthermore, \gls{qos} support is limited in Ethernet~\cite{reinemo2006overview}. Techniques to avoid and control congestion can become essential for networks with a high load, which can be caused, for example, by a high number of small samples due to real-time communication.
\subsection{Hard real-time communication between different hosts\label{sec:hard_real_time_communication_between_servers}}
\footnotetext{\url{https://villas.fein-aachen.org/doc/node-types.html}}
Thus, in order to achieve hard real-time between different hosts, a different technology than Ethernet must be used. An alternative technology that is particularly suitable for this purpose is \textit{InfiniBand}. This technology is specifically designed as a low-latency, high-throughput inter-server communication standard. Due to its design, every process assumes that it owns the network interface controller and the operating system does not need to multiplex it to processes. Consequently, processes do not need to invoke system calls---and thus trigger switches between user and kernel space---while transferring data. It is even possible to send data to a remote host without its software noticing that data is written into its memory. Furthermore, InfiniBand has extensive support for \gls{qos} and is a lossless architecture, which means that it---unlike Ethernet---does not rely on dropping packets to handle congestion of the network. Finally, the InfiniBand Architecture handles many more complex tasks, such as reliability, directly in hardware.
Because this technology seems so well suited for this purpose, the present work investigates the possibilities of implementing a VILLASnode node-type that relies upon InfiniBand as its communication technology.
\section{Related work\label{sec:related_work}}
The goal of the present work was to develop a communication channel among different host systems that is optimized regarding latency. Therefore, this section will examine previous performance studies on InfiniBand that present optimizations regarding latency.
In their work, MacArthur and Russell evaluate how certain programming decisions affect the performance of messages that are sent over an InfiniBand network~\cite{macarthur2012performance}. They examine several features that potentially affect the performance:
\begin{enumerate}
\setlength\itemsep{-0.1em}
\item The \textbi{operation code}, which determines if a message will be sent with either channel or memory semantics.
\item The \textbi{message size}.
\item The \textbi{completion detection}, which determines whether the completion queue gets actively polled or provides notifications to the waiting application. This setting also heavily affects \acrshort{cpu} utilization.
\item \textbi{Sending data inline}, with which the \acrshort{cpu} directly copies data to the network adapter instead of relying on the adapter's \acrshort{dma}.
\item \textbi{Processing data simultaneously}, by sending data from multiple buffers instead of one.
\item Using a \textbi{work request submission list}, with which instructions are submitted to the network adapter as a list instead of one at a time.
\item Turning \textbi{completion signaling} periodically on and off for certain operations.
\item The \textbi{wire transmission speed}.
\end{enumerate}
They conclude that an application should use the operation code that best suits its needs. A limiting factor here is often the need to notify the receiver about new data. When comparing the operation codes that support notifying the receive side, i.e., \textit{send} and \textit{\acrshort{rdma} write with immediate}, the performance difference is negligible.
For ``small'' messages ($\leq\SI{1024}{\kibi\byte}$), the message size did not influence the latency much under normal circumstances. For ``large'' messages ($\geq\SI{1024}{\kibi\byte}$), however, they observed that the latency increased with the message size.
When letting the completion queue provide notifications when new data arrived, they measured a \acrshort{cpu} utilization of \SI{20}{\percent} for messages smaller than \SI{512}{\byte} and \SI{0}{\percent} for messages larger than \SI{4}{\mebi\byte}. When the queue was actively polled, the \acrshort{cpu} utilization turned out to always be \SI{100}{\percent}. Although completion detection with notifications was more resource friendly, they found that, in the case of small messages, it resulted in latencies that were almost 4\times{} higher than when actively polling. For large messages, this difference diminished; the latencies of messages larger than \SI{16}{\kibi\byte} showed no difference at all.
They advise to send data inline whenever this feature is supported by the network adapter that is used and when the message size is smaller than the cache line size of the adapter. They discovered that sending data inline required a few additional \acrshort{cpu} cycles, but resulted in a latency decrease of up to \SI{25}{\percent}. They also called attention to the fact that sending messages larger than the cache line size of the network adapter inline had a detrimental effect on latency.
With regards to the number of buffers, they found the ideal number of buffers to be around 8 for small messages and 3 for large messages. Using more buffers did not increase the performance any more, and even resulted in slightly worse performance in some cases. By using 8 buffers and sending data inline, they detected one-way latencies as low as \SI{300}{\nano\second}. This is considerably less than the latencies for Ethernet, as reported by Larsen and Huggahalli~\cite{larsen2009architectural}.
Their recommendation regarding the submission of lists of instructions is to use them only when appropriate: whenever it is possible to submit instructions individually, this should be the preferred method. In that way, the adapter can queue the instructions and is thus kept busy.
Last but not least, they examined the influence of completion signaling. Usually, after a message has been (successfully) sent, the sender gets notified, for example, to release the buffer. MacArthur and Russell first inspected periodic signaling, where only every $\left(\frac{n_{buffers}}{2}\right)^{th}$ message triggered a notification. They found that this usually had little effect on latency. It only had a larger effect when a list with multiple instructions was submitted to the adapter. However, when messages were sent inline, they found that it could be beneficial to disable signaling.
MacArthur and Russell also compared their InfiniBand setup with a contemporary \gls{roce} setup. Although they concluded that InfiniBand outperformed \gls{roce} for large messages, they also concluded that the difference for small messages was negligible. However, as Reinemo et al.\ state in their publication~\cite{reinemo2006overview}, support for \gls{qos} is limited in Ethernet, whereas it is abundantly available in InfiniBand.
In a later work~\cite{liu2014performance}, Liu and Russell focused solely on throughput. Although they considered only messages larger than \SI{32}{\kibi\byte}, which are uncommon in VILLASnode, they drew a few conclusions that can generally be applied to communication over InfiniBand. They observed that:
\begin{itemize}
\setlength\itemsep{-0.1em}
\item in most cases, \acrshort{numa} affinity affects the performance of the network adapter;
\item the performance (with regard to throughput) is sensitive to message alignment;
\item the maximum number of unsignaled instructions before a signaled instruction should be sent is:
\begin{equation}
S=\begin{cases}
\min\left(\frac{B}{s},1\right), & \mathrm{if}~\SI{16}{\kibi\byte} < \mathrm{message~size} < \SI{128}{\kibi\byte}\\
\min\left(\frac{D_{SQ}}{2},D_{SQ}-B\right), & \mathrm{otherwise}
\end{cases},
\label{eq:signaling}
\end{equation}
with $B$ the number of outstanding messages and $D_{SQ}$ the depth of the send queue.
\end{itemize}
Furthermore, they preferred the \textit{\acrshort{rdma} write with immediate} over the \textit{send} operation.
\newpage
\section{Structure of the present work}
\paragraph{\ref{chap:basics}~\nameref{chap:basics}} aims to give the reader an understanding of the communication architecture that lies at the heart of the VILLASnode node-type that was implemented as part of the present work. The chapter starts with an introduction on the Virtual Interface Architecture and proceeds with a section that is dedicated to InfiniBand. Before finishing with a section on real-time optimizations, \autoref{chap:basics} elaborates upon the software libraries that are used to access InfiniBand hardware.
\paragraph{\ref{chap:architecture}~\nameref{chap:architecture}} expands on the internals of VILLASnode. After explaining the concept of VILLASnode, this chapter discusses the adaptations that had to be made to its architecture to (efficiently) support an InfiniBand node-type. These include changes to the function parameters of the interface between the global VILLASnode instance and an instance of a node-type, to the memory management of VILLASnode, and to the finite-state machine of instances of node-types.
\paragraph{\ref{chap:implementation}~\nameref{chap:implementation}} first discusses the non-trivial parts of the implementation of the benchmark that was used to profile the InfiniBand hardware, the \textit{InfiniBand} node-type, and the benchmark that was used to analyze VILLASnode node-types. Then, it discusses how an additional service type was enabled in the communication manager that was used and how the acquired data from the benchmarks was processed.
\paragraph{\ref{chap:evaluation}~\nameref{chap:evaluation}} evaluates the results that were found with the help of the benchmarks that were presented in the previous chapter.
\paragraph{\ref{chap:conclusion}~\nameref{chap:conclusion}} considers whether the assumptions from \autoref{sec:motivation} (\nameref{sec:motivation}) are legitimate and thus whether the \textit{InfiniBand} node-type is a valuable addition to the VILLASframework.
\paragraph{\ref{chap:future}~\nameref{chap:future}} presents possible optimizations that were not examined in the present work. It begins with a brief examination of the possibilities the \texttt{PREEMPT\_RT} patch could bring, continues with a section on optimizations and profiling of the VILLASnode source code, and ends with a section on \acrshort{roce}.
\newline
In addition to this brief introduction on the structure of the present work, every chapter begins with a paragraph that presents the structure of the sections within that chapter.

125
glossary/acronyms.tex Normal file

@ -0,0 +1,125 @@
\newacronym[shortplural=DRTS, longplural={digital real-time simulations}]{drts}{DRTS}{digital real-time simulation}
\newacronym{rt}{RT}{real-time}
\newacronym{tcp}{TCP}{Transmission Control Protocol}
\newacronym{udp}{UDP}{User Datagram Protocol}
\newacronym{cq}{CQ}{Completion Queue}
\newacronym{via}{VIA}{Virtual Interface Architecture}
\newacronym{tcpip}{TCP/IP}{Internet protocol suite}
\newacronym{os}{OS}{operating system}
\newacronym{nic}{NIC}{network interface controller}
\newacronym{vi}{VI}{Virtual Interface}
\newacronym{ib}{IB}{InfiniBand}
\newacronym{ibta}{IBTA}{InfiniBand\textservicemark~Trade Association}
\newacronym{iba}{IBA}{InfiniBand Architecture}
\newacronym{mtu}{MTU}{Maximum Transmission Unit}
\newacronym{ca}{CA}{Channel Adapter}
\newacronym{hca}{HCA}{Host Channel Adapter}
\newacronym{tca}{TCA}{Target Channel Adapter}
\newacronym{rc}{RC}{Reliable Connection}
\newacronym{ud}{UD}{Unreliable Datagram}
\newacronym{uc}{UC}{Unreliable Connection}
\newacronym{rd}{RD}{Reliable Datagram}
\newacronym{qp}{QP}{Queue Pair}
\newacronym{sq}{SQ}{Send Queue}
\newacronym{rq}{RQ}{Receive Queue}
\newacronym{wr}{WR}{Work Request}
\newacronym{wqe}{WQE}{Work Queue Element}
\newacronym[shortplural=CQEs, longplural={Completion Queue Entries}]{cqe}{CQE}{Completion Queue Entry}
\newacronym{dma}{DMA}{Direct Memory Access}
\newacronym{lid}{LID}{Local Identifier}
\newacronym{gid}{GID}{Global Identifier}
\newacronym{sm}{SM}{Subnet Manager}
\newacronym{sma}{SMA}{Subnet Management Agent}
\newacronym{sa}{SA}{Subnet Administration}
\newacronym{mad}{MAD}{Management Datagram}
\newacronym{smp}{SMP}{Subnet Management Packet}
\newacronym{gmp}{GMP}{General Management Packet}
\newacronym{lmc}{LMC}{LID Mask Control}
\newacronym{guid}{GUID}{Global Unique Identifier}
\newacronym{lrh}{LRH}{Local Routing Header}
\newacronym{grh}{GRH}{Global Routing Header}
\newacronym{bth}{BTH}{Base Transport Header}
\newacronym{eth}{ETH}{Extended Transport Header}
\newacronym{rdeth}{RDETH}{Reliable Datagram Extended Transport Header}
\newacronym{deth}{DETH}{Datagram Extended Transport Header}
\newacronym{reth}{RETH}{RDMA Extended Transport Header}
\newacronym{atomiceth}{AtomicETH}{Atomic Extended Transport Header}
\newacronym{aeth}{AETH}{ACK Extended Transport Header}
\newacronym{atomicacketh}{AtomicAckETH}{Atomic ACK Extended Transport Header}
\newacronym{imm}{ImmDt}{Immediate Data}
\newacronym{ieth}{IETH}{Invalidate Extended Transport Header}
\newacronym{lkey}{lkey}{local key}
\newacronym{rkey}{rkey}{remote key}
\newacronym{crc}{CRC}{Cyclic Redundancy Check}
\newacronym{icrc}{ICRC}{Invariant CRC}
\newacronym{vcrc}{VCRC}{Variant CRC}
\newacronym{rdma}{RDMA}{Remote Direct Memory Access}
\newacronym{eui64}{EUI-64}{64-bit Extended Unique Identifier}
\newacronym{eui48}{EUI-48}{48-bit Extended Unique Identifier}
\newacronym{vl}{VL}{Virtual Lane}
\newacronym{qos}{QoS}{Quality of Service}
\newacronym{sl}{SL}{Service Level}
\newacronym{dpwrr}{DPWRR}{dual priority weighted round robin}
\newacronym{wrr}{WRR}{weighted round robin}
\newacronym{llfc}{LLFC}{Link-Level Flow Control}
\newacronym{fcpacket}{FC packet}{Flow Control Packet}
\newacronym{abr}{ABR}{Adjusted Block Received}
\newacronym{cca}{CCA}{Congestion Control Architecture}
\newacronym{ccm}{CCM}{Congestion Control Manager}
\newacronym{cct}{CCT}{Congestion Control Table}
\newacronym{mr}{MR}{Memory Region}
\newacronym{mw}{MW}{Memory Window}
\newacronym{pd}{PD}{Protection Domain}
\newacronym{fctbs}{FCTBS}{Flow Control Total Blocks Sent}
\newacronym{fccl}{FCCL}{Flow Control Credit Limit}
\newacronym{fecn}{FECN}{Forward Explicit Congestion Notification}
\newacronym{becn}{BECN}{Backward Explicit Congestion Notification}
\newacronym{qpn}{QPN}{Queue Pair Number}
\newacronym{cm}{CM}{Communication Manager}
\newacronym{req}{REQ}{request for communication}
\newacronym{mra}{MRA}{message receipt acknowledgment}
\newacronym{rej}{REJ}{reject}
\newacronym{rep}{REP}{reply to REQ}
\newacronym{rtu}{RTU}{ready to use}
\newacronym{dreq}{DREQ}{request for communication release}
\newacronym{drep}{DREP}{response to DREQ}
\newacronym{sidrreq}{SIDR\undershort REQ}{Service ID Resolution Request}
\newacronym{sidrrep}{SIDR\undershort REP}{Service ID Resolution Response}
\newacronym{api}{API}{Application Programming Interface}
\newacronym{ofed}{OFED\texttrademark}{OpenFabrics Enterprise Distribution}
\newacronym{sge}{sge}{scatter/gather element}
\newacronym{cc}{CC}{Completion Channel}
\newacronym{ipoib}{IPoIB}{Internet Protocol over InfiniBand}
\newacronym{arp}{ARP}{Address Resolution Protocol}
\newacronym{fifo}{FIFO}{first-in, first-out}
\newacronym{cpu}{CPU}{central processing unit}
\newacronym{ah}{AH}{Address Handle}
\newacronym{csv}{CSV}{comma-separated values}
\newacronym{ram}{RAM}{random-access memory}
\newacronym{numa}{NUMA}{non-uniform memory access}
\newacronym{pid}{PID}{process identifier}
\newacronym{pcie}{PCI-e}{Peripheral Component Interconnect Express}
\newacronym{mmu}{MMU}{memory management unit}
\newacronym{msb}{MSB}{most significant bit}
\newacronym{lsb}{LSB}{least significant bit}
\newacronym{tlb}{TLB}{translation lookaside buffer}
\newacronym{io}{I/O}{input/output}
\newacronym{irq}{IRQ}{interrupt request}
\newacronym{phil}{PHIL}{(power) hardware-in-the-loop}
\newacronym{gpl}{GPL}{GNU General Public License}
\newacronym{tmr}{TMR}{timer}
\newacronym{cicd}{CI/CD}{continuous integration and continuous delivery}
\newacronym{json}{JSON}{JavaScript Object Notation}
\newacronym{tsc}{TSC}{Time-Stamp Counter}
\newacronym{rdtsc}{RDTSC}{Read Time-Stamp Counter}
\newacronym{rdtscp}{RDTSCP}{Read Time-Stamp Counter and Processor ID}
\newacronym{lfence}{LFENCE}{Load Fence}
\newacronym{aspm}{ASPM}{Active State Power Management}
\newacronym{pmqos}{PM QoS}{Power Management Quality of Service}
\newacronym{posix}{POSIX}{Portable Operating System Interface}
\newacronym{roce}{RoCE}{RDMA over Converged Ethernet}
\newacronym{srq}{SRQ}{Shared Receive Queue}
\newacronym{xrc}{XRC}{eXtended Reliable Connection}
\newacronym{sl2vl}{SL to VL}{Service Level to Virtual Lane}
\newacronym{iwarp}{iWARP}{Internet Wide-area RDMA Protocol}
\newacronym{ntp}{NTP}{Network Time Protocol}

BIN
images/GID_multicast.odg Normal file

Binary file not shown.

BIN
images/GID_unicast.odg Normal file

Binary file not shown.

BIN
images/GRH.odg Normal file

Binary file not shown.

BIN
images/LRH.odg Normal file

Binary file not shown.

BIN
images/MAD.odg Normal file

Binary file not shown.

12
images/Makefile Normal file

@ -0,0 +1,12 @@
SRCS = $(wildcard *.odg)
PDFS = $(patsubst %.odg,%.pdf,$(SRCS))

all: $(PDFS)

%.pdf: %.odg
	libreoffice --convert-to pdf $<

clean:
	rm -f *.pdf

.PHONY: clean


BIN
images/iba_arbiter.odg Normal file

Binary file not shown.

BIN
images/iba_model.odg Normal file

Binary file not shown.

BIN
images/memory_alignment.odg Normal file

Binary file not shown.

BIN
images/memory_iba.odg Normal file

Binary file not shown.

BIN
images/network_stack.odg Normal file

Binary file not shown.

BIN
images/numa_nodes.odg Normal file

Binary file not shown.


BIN
images/qp_communication.odg Normal file

Binary file not shown.

BIN
images/qp_states.odg Normal file

Binary file not shown.


BIN
images/sm_states.odg Normal file

Binary file not shown.

BIN
images/via_model.odg Normal file

Binary file not shown.

BIN
images/via_states.odg Normal file

Binary file not shown.

BIN
images/villas_benchmark.odg Normal file

Binary file not shown.

BIN
images/villas_read.odg Normal file

Binary file not shown.

BIN
images/villas_read_iba.odg Normal file

Binary file not shown.


BIN
images/villas_write.odg Normal file

Binary file not shown.

BIN
images/villas_write_iba.odg Normal file

Binary file not shown.


BIN
images/villasframework.odg Normal file

Binary file not shown.

BIN
images/villasnode.odg Normal file

Binary file not shown.


1
listings/activate_cpu.sh Normal file

@ -0,0 +1 @@
# echo 1 > /sys/devices/system/cpu/<cpuX>/online

48
listings/cm_switch.c Normal file

@ -0,0 +1,48 @@
struct rdma_cm_event *event;

while (rdma_get_cm_event(event_channel, &event) == 0) {
    switch (event->event) {
    case RDMA_CM_EVENT_ADDR_RESOLVED:
        // Create QP, receive CQ, and send CQ.
        // Call rdma_resolve_route()
        // State: STARTED
        break;
    case RDMA_CM_EVENT_ADDR_ERROR:
        // Try fallback and set mode of rdma_cm_id to listening
        // State: STARTED
        break;
    case RDMA_CM_EVENT_ROUTE_RESOLVED:
        // Call rdma_connect()
        // State: PENDING_CONNECT
        break;
    case RDMA_CM_EVENT_ROUTE_ERROR:
        // Try fallback and set mode of rdma_cm_id to listening
        // State: STARTED
        break;
    case RDMA_CM_EVENT_UNREACHABLE:
        // Try fallback and set mode of rdma_cm_id to listening
        // State: STARTED
        break;
    case RDMA_CM_EVENT_CONNECT_REQUEST:
        // Create QP, receive CQ, and send CQ.
        // Call rdma_accept()
        // State: PENDING_CONNECT
        break;
    case RDMA_CM_EVENT_CONNECT_ERROR:
        // Try fallback and set mode of rdma_cm_id to listening
        // State: STARTED
        break;
    case RDMA_CM_EVENT_REJECTED:
        // Try fallback and set mode of rdma_cm_id to listening
        // State: STARTED
        break;
    case RDMA_CM_EVENT_ESTABLISHED:
        // In case of UD, save address handle from event struct
        // State: CONNECTED
        break;
    case RDMA_CM_EVENT_DISCONNECTED:
        // Release all buffers and destroy everything
        // State: STARTED
        break;
    case RDMA_CM_EVENT_TIMEWAIT_EXIT:
        break;
    default:
        // Error message: unknown event
        break;
    }

    rdma_ack_cm_event(event);
}

17
listings/cq_time.c Normal file

@ -0,0 +1,17 @@
struct timespec tp;

while (1) {
    ibv_get_cq_event(); // Only necessary for event based polling

    while (ibv_poll_cq()) {
        clock_gettime(CLOCK_MONOTONIC, &tp);
        /**
         * Save tp and message identifier in an array and
         * return as soon as possible, so that as little
         * time as possible is lost before polling resumes.
         */
    }

    ibv_req_notify_cq(); // Only necessary for event based polling
}

3
listings/cset_create.sh Normal file

@ -0,0 +1,3 @@
# cset set -c 0-15 -s system --cpu_exclusive
# cset set -c 16,18,20,22 -s real-time-0 --cpu_exclusive --mem=0
# cset set -c 17,19,21,23 -s real-time-1 --cpu_exclusive --mem=1

2
listings/cset_exec.sh Normal file

@ -0,0 +1,2 @@
# cset proc --set=real-time-0 --exec ./<application> -- <args>
# cset proc --set=real-time-1 --exec ./<application> -- <args>

1
listings/cset_move.sh Normal file

@ -0,0 +1 @@
# cset proc --move -f root -t system --kthread --thread --force

@ -0,0 +1 @@
$ cat /proc/irq/<irqX>/smp_affinity

@ -0,0 +1,5 @@
struct ibv_comp_channel {
    struct ibv_context *context;
    int fd;
    int refcnt;
};

6
listings/ibv_recv_wr.h Normal file

@ -0,0 +1,6 @@
struct ibv_recv_wr {
    uint64_t wr_id;
    struct ibv_recv_wr *next;
    struct ibv_sge *sg_list;
    int num_sge;
};

26
listings/ibv_send_wr.h Normal file

@ -0,0 +1,26 @@
struct ibv_send_wr {
    uint64_t wr_id;
    struct ibv_send_wr *next;
    struct ibv_sge *sg_list;
    int num_sge;
    enum ibv_wr_opcode opcode;
    int send_flags;
    uint32_t imm_data;
    union {
        struct {
            uint64_t remote_addr;
            uint32_t rkey;
        } rdma;
        struct {
            uint64_t remote_addr;
            uint64_t compare_add;
            uint64_t swap;
            uint32_t rkey;
        } atomic;
        struct {
            struct ibv_ah *ah;
            uint32_t remote_qpn;
            uint32_t remote_qkey;
        } ud;
    } wr;
};

5
listings/ibv_sge.h Normal file

@ -0,0 +1,5 @@
struct ibv_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

41
listings/infiniband.conf Normal file

@ -0,0 +1,41 @@
source_node = {
    type = "infiniband",
    rdma_transport_mode = "${IB_MODE}",

    in = {
        address = "10.0.0.2:1337",
        max_wrs = 4,
        cq_size = 4,
        buffer_subtraction = 2
    },
    out = {
        address = "10.0.0.1:1337",
        resolution_timeout = 1000,
        send_inline = true,
        max_inline_data = 128,
        use_fallback = true,
        max_wrs = 4096,
        cq_size = 4096,
        periodic_signaling = 2048
    }
},
target_node = {
    type = "infiniband",
    rdma_transport_mode = "${IB_MODE}",

    in = {
        address = "10.0.0.1:1337",
        max_wrs = 512,
        cq_size = 512,
        buffer_subtraction = 64,

        signals = {
            count = ${NUM_VALUE},
            type = "float"
        }
    }
}

@ -0,0 +1,5 @@
struct a {
    char c;
    int i;
    short s;
};

@ -0,0 +1,5 @@
struct __attribute__((__packed__)) b {
    char c;
    int i;
    short s;
};

33
listings/node_config.conf Normal file

@ -0,0 +1,33 @@
nodes = {
    node_1 = {
        type = "file",
        // Global settings for node_1

        in = {
            // Settings for node input, e.g., file to read from
        }
    },
    node_2 = {
        type = "infiniband",
        // Global settings for node

        in = {
            // Settings for node input, e.g., address of local
            // InfiniBand HCA to use
        },
        out = {
            // Settings for node output, e.g., remote InfiniBand
            // node to write to
        }
    }
},
paths = (
    {
        in = "node_1",
        out = "node_2"
    }
)

16
listings/rdtsc.h Normal file

@ -0,0 +1,16 @@
static inline uint64_t rdtsc()
{
    uint64_t tsc;

    __asm__ __volatile__(
        "lfence;"
        "rdtsc;"
        "shl $32, %%rdx;"
        "or %%rdx,%%rax"
        : "=a" (tsc)
        :
        : "%rcx", "%rdx", "memory"
    );

    return tsc;
}

17
listings/rdtscp.h Normal file

@ -0,0 +1,17 @@
static inline uint64_t rdtscp()
{
    uint64_t tsc;

    __asm__ __volatile__(
        "rdtscp;"
        "shl $32, %%rdx;"
        "or %%rdx,%%rax"
        : "=a" (tsc)
        :
        : "%rcx", "%rdx", "memory"
    );

    return tsc;
}

14
listings/rdtscp_wait.c Normal file

@ -0,0 +1,14 @@
uint64_t task_wait(struct task *t)
{
    uint64_t steps, now;

    do {
        now = rdtscp();
    } while (now < t->next);

    for (steps = 0; t->next < now; steps++)
        t->next += t->period;

    return steps;
}

@ -0,0 +1,2 @@
int (*read)(struct node *n, struct sample *smps[], unsigned cnt);
int (*write)(struct node *n, struct sample *smps[], unsigned cnt);

@ -0,0 +1,5 @@
int (*read)(struct node *n, struct sample *smps[], unsigned cnt,
unsigned *release);
int (*write)(struct node *n, struct sample *smps[], unsigned cnt,
unsigned *release);

13
listings/send_time.c Normal file

@ -0,0 +1,13 @@
// `int messages' represents the number of messages to be sent
struct timespec tp[messages];

for (int i = 0; i < messages; i++) {
    /**
     * Prepare WR with an sge that points to tv_nsec of tp[i].
     * By using an array of timespecs, it is guaranteed that
     * the timestamp will not be overwritten.
     */

    clock_gettime(CLOCK_MONOTONIC, &tp[i]);
    ibv_post_send();
}

@ -0,0 +1 @@
# echo FFFF > /proc/irq/<irqX>/smp_affinity

@ -0,0 +1,24 @@
int signal_generator_read(struct node *n, struct sample *smps[],
                          unsigned cnt, unsigned *release)
{
    struct signal_generator *s = (struct signal_generator *) n->_vd;
    struct timespec ts;
    int steps;

    /* Block until 1/s->rate seconds elapsed */
    steps = task_wait(&s->task);
    if (steps > 1 && s->monitor_missed) {
        warn("Missed steps: %d", steps-1);
        s->missed_steps += steps-1;
    }

    ts = time_now();

    /**
     * Generate sample(s) with signal and timestamp ts.
     * Return this sample via the *smps[] parameter of
     * signal_generator_read()
     */
}

8
listings/states.h Normal file

@ -0,0 +1,8 @@
enum state {
    STATE_DESTROYED   = 0,
    STATE_INITIALIZED = 1,
    STATE_PARSED      = 2,
    STATE_CHECKED     = 3,
    STATE_STARTED     = 4,
    STATE_STOPPED     = 5
};

33
listings/struct_node.h Normal file

@ -0,0 +1,33 @@
struct node_direction {
    int enabled;
    int builtin;
    int vectorize;
    struct list hooks;
    json_t *cfg;
};

struct node
{
    char *name;
    char *_name;
    char *_name_long;
    int affinity;
    uint64_t sequence;
    struct stats *stats;
    struct node_direction in, out;
    struct list signals;
    enum state state;
    struct node_type *_vt;
    void *_vd;
    json_t *cfg;
};

@ -0,0 +1,41 @@
struct node_type {
    int vectorize;
    int flags;
    enum state state;
    struct list instance;
    size_t size;
    size_t pool_size;

    struct {
        // Global, per node-type
        int (*start)(struct super_node *sn);
        int (*stop)();
    } type;

    // Function pointers
    void * (*create)();
    int (*init)();
    int (*destroy)(struct node *n);
    int (*parse)(struct node *n, json_t *cfg);
    int (*check)(struct node *n);
    char * (*print)(struct node *n);
    int (*start)(struct node *n);
    int (*stop)(struct node *n);
    int (*read)(struct node *n, struct sample *smps[],
                unsigned cnt, unsigned *release);
    int (*write)(struct node *n, struct sample *smps[],
                 unsigned cnt, unsigned *release);
    int (*reverse)(struct node *n);
    int (*fd)(struct node *n);

    // Memory Type
    struct memory_type * (*memory_type)(struct node *n,
                                        struct memory_type *parent);
};

19
listings/struct_sample.h Normal file

@ -0,0 +1,19 @@
struct sample {
    uint64_t sequence;
    int length;
    int capacity;
    int flags;
    struct list *signals;
    atomic_int refcnt;
    ptrdiff_t pool_off;

    struct {
        struct timespec origin;
        struct timespec received;
    } ts;

    union signal_data data[];
};

26
listings/time_thread.c Normal file

@ -0,0 +1,26 @@
// Global variable
struct timespec tp;

void * t_function(void * ctx)
{
    while (1) {
        clock_gettime(CLOCK_MONOTONIC, &tp);
    }

    return NULL;
}

pthread_t t_thread;
pthread_create(&t_thread, NULL, t_function, NULL);

// `int messages' represents the number of messages to be sent
for (int i = 0; i < messages; i++) {
    /**
     * Prepare WR with sge that points to tp.tv_nsec. It will
     * continue to change since the thread continues to run in the
     * background.
     */

    // No need to invoke clock_gettime() here
    ibv_post_send(); // Post prepared WR
}

12
listings/timerfd_wait.c Normal file

@ -0,0 +1,12 @@
uint64_t task_wait(struct task *t)
{
    int ret;
    uint64_t steps;

    ret = read(t->fd, &steps, sizeof(steps));
    if (ret < 0)
        return 0;

    return steps;
}

4
listings/timespec.c Normal file

@ -0,0 +1,4 @@
struct timespec {
    time_t tv_sec;  /* seconds */
    long   tv_nsec; /* nanoseconds */
};

@ -0,0 +1,16 @@
[main]
summary=Optimize for deterministic performance at the cost of increased power consumption

[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100

[sysctl]
kernel.sched_min_granularity_ns=10000000
vm.dirty_ratio=10
vm.dirty_background_ratio=3
vm.swappiness=10
kernel.sched_migration_cost_ns=5000000

Some files were not shown because too many files have changed in this diff.