Initial commit of master's thesis

This is the version I submitted to RWTH Aachen University on November 9,
2018.
Dennis Potter 2018-11-12 12:56:59 +01:00
parent ffbcce77f9
commit af25b4b828
1136 changed files with 127398252 additions and 2 deletions

256
.gitignore vendored Normal file

@ -0,0 +1,256 @@
## VS Code files
.vscode/
## Autogenerated files
*/build/
## Python
__pycache__
plots/*.py
*.ipynb_checkpoints
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2
.*.lb
## Intermediate documents:
*.dvi
*.xdv
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex(busy)
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Build tool directories for auxiliary files
# latexrun
latex.out/
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# achemso
acs-*.bib
# amsthm
*.thm
# beamer
*.nav
*.pre
*.snm
*.vrb
# changes
*.soc
# cprotect
*.cpt
# elsarticle (documentclass of Elsevier journals)
*.spl
# endnotes
*.ent
# fixme
*.lox
# feynmf/feynmp
*.mf
*.mp
*.t[1-9]
*.t[1-9][0-9]
*.tfm
#(r)(e)ledmac/(r)(e)ledpar
*.end
*.?end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
*.glsdefs
# gnuplottex
*-gnuplottex-*
# gregoriotex
*.gaux
*.gtex
# htlatex
*.4ct
*.4tc
*.idv
*.lg
*.trc
*.xref
# hyperref
*.brf
# knitr
*-concordance.tex
# TODO Comment the next line if you want to keep your tikz graphics files
*.tikz
*-tikzDictionary
# listings
*.lol
# makeidx
*.idx
*.ilg
*.ind
*.ist
# minitoc
*.maf
*.mlf
*.mlt
*.mtc[0-9]*
*.slf[0-9]*
*.slt[0-9]*
*.stc[0-9]*
# minted
_minted*
*.pyg
# morewrites
*.mw
# nomencl
*.nlg
*.nlo
*.nls
# pax
*.pax
# pdfpcnotes
*.pdfpc
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# scrwfile
*.wrt
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# pdfcomment
*.upa
*.upb
# pythontex
*.pytxcode
pythontex-files-*/
# thmtools
*.loe
# TikZ & PGF
*.dpth
*.md5
*.auxlock
# todonotes
*.tdo
# easy-todo
*.lod
# xmpincl
*.xmpi
# xindy
*.xdy
# xypic precompiled matrices
*.xyc
# endfloat
*.ttt
*.fff
# Latexian
TSWLatexianTemp*
## Editors:
# WinEdt
*.bak
*.sav
# Texpad
.texpadtmp
# LyX
*.lyx~
# Kile
*.backup
# KBibTeX
*~[0-9]*
# auto folder when using emacs and auctex
./auto/*
*.el
# expex forward references with \gathertags
*-tags.tex
# standalone packages
*.sta
# PDF
*.pdf

40
Makefile Normal file

@ -0,0 +1,40 @@
MAINTEX := thesis.tex
LATEX := lualatex
FLAGS := -quiet -shell-escape
.PHONY : default help pdf verbose clean veryclean

default :
	cd scripts && $(MAKE)
	cd images && $(MAKE)
	cd plots && $(MAKE)
	latexmk -$(LATEX) $(FLAGS) $(MAINTEX)

help :
	@echo ""
	@echo "This Makefile creates the PDF of the thesis by using 'latexmk'"
	@echo " make : Generate PDF of the thesis"
	@echo " make pdf : Generate PDF of the thesis (forced mode)"
	@echo " make verbose : Show output from latex compiler"
	@echo " make clean : Delete temporary files"
	@echo " make veryclean : Delete temporary files including PDF"
	@echo ""

pdf :
	latexmk -g -$(LATEX) $(FLAGS) $(MAINTEX)

verbose :
	latexmk -g -$(LATEX) $(FLAGS) -verbose $(MAINTEX)

clean :
	latexmk -c
	cd scripts && $(MAKE) clean
	cd images && $(MAKE) clean
	cd plots && $(MAKE) clean
	rm -f appendices/*.aux chapters/*.aux
	rm -f *.lol *.fls thesis-blx.bib *.xml *.bbl *.nlo *.nls *.acn *.acr *.alg *.glo *.ist *.tdo

veryclean : clean
	latexmk -C


@ -1,2 +1,8 @@
# masters-thesis
## Setup
Add this to `~/.latexmkrc`:
```perl
add_cus_dep('acn', 'acr', 0, 'makeacn2acr');
sub makeacn2acr {
system("makeindex -s \"$_[0].ist\" -t \"$_[0].alg\" -o \"$_[0].acr\" \"$_[0].acn\"");
}
```
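
Before running `make`, it can be useful to confirm that the rule above is actually present. The following is only a minimal sketch: `has_acn_rule` is a hypothetical helper that is not part of this repository, and it assumes the rule was copied as written above (it looks for the literal text `add_cus_dep('acn', 'acr'`).

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the repository): succeeds if the given
# latexmkrc file registers the acn -> acr custom dependency shown above.
# Defaults to ~/.latexmkrc when no file is passed.
has_acn_rule() {
    local rc_file="${1:-$HOME/.latexmkrc}"
    # -F: match the text literally, -q: no output, only the exit status
    [ -f "$rc_file" ] && grep -Fq "add_cus_dep('acn', 'acr'" "$rc_file"
}
```

A possible use is `has_acn_rule && make`, which only starts the build once the acronym rule is in place.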

13
abstract.tex Normal file

@ -0,0 +1,13 @@
% English abstract
\begin{otherlanguage}{english}
\begin{abstract}
\input{abstract/english}
\end{abstract}
\end{otherlanguage}
% German abstract
\begin{otherlanguage}{ngerman}
\begin{abstract}
\input{abstract/german}
\end{abstract}
\end{otherlanguage}

5
abstract/english.tex Normal file

@ -0,0 +1,5 @@
The present work evaluates the feasibility and added value of InfiniBand-based communication in the co-simulation framework VILLASframework and its simulation data gateway VILLASnode. InfiniBand is characterized by high throughput and low latency, which makes it particularly suitable for the hard real-time requirements of VILLASnode. It allows applications on different host systems to communicate with each other without many of the latency bottlenecks that are present in other technologies such as Ethernet.
The present work shows that, with some optimizations, sub-microsecond latencies were achievable in a benchmark that mimics the characteristics of the co-simulation framework. It then presents how InfiniBand was integrated into the framework with only minor adjustments to the existing communication \acrshort{api}, and shows how the newly implemented interface performs compared to the existing ones.
The results showed that, regarding latency, the InfiniBand interface performed more than one order of magnitude better than VILLASnode's other interfaces that enable server-server communication. Furthermore, much higher transmission rates could be achieved, and the predictability of the latency improved substantially. Its latencies, which lie between \SI{1.7}{\micro\second} and \SI{4.9}{\micro\second}, were only 1.5--\SI{2.5}{\micro\second} higher than those of the zero-latency reference, in which VILLASnode uses the \textit{\acrshort{posix} shared memory} \acrshort{api} to communicate. However, since the shared memory interface is only supported when the different VILLASnode instances are located on the same computer, the InfiniBand interface turned out to have the lowest latency of the currently implemented server-server interfaces.

5
abstract/german.tex Normal file

@ -0,0 +1,5 @@
Die vorliegende Arbeit thematisiert die Realisierbarkeit und den Mehrwert einer auf InfiniBand basierenden Kommunikation in dem Co-Simulationsframework VILLASframework und insbesondere seiner Simulationsdatenschnittstelle VILLASnode. Charakteristisch für die Datenübertragungstechnik InfiniBand sind hohe Durchsatzraten und niedrige Latenzzeiten, welche es besonders geeignet machen für die harten Echtzeitanforderungen von VILLASnode. Die Technik ermöglicht es Anwendungssoftware auf verschiedenen Hostrechnern, miteinander zu kommunizieren, ohne dabei die Engpässe anderer Datenübertragungstechniken, wie zum Beispiel Ethernet, zu spüren.
Ein Mess- und Bewertungsverfahren, welches das Verhalten des \linebreak Co-Simulationsframeworks nachahmt und im Rahmen dieser Arbeit entwickelt wurde, zeigt, dass nach Optimierung Latenzen im Submikrosekundenbereich möglich waren. Nachdem die Arbeit sich damit auseinandergesetzt hat, wie InfiniBand mit minimalen Änderungen der Programmierschnittstelle in das Framework integriert wurde, stellt sie die implementierte Technik den existierenden Techniken gegenüber.
Wie sich herausstellt, sind die Latenzzeiten der InfiniBand-Übertragungstechnik in VILLASnode um mehr als eine Grö\ss enordnung niedriger als die Latenzzeiten der existierenden Techniken, die Kommunikation zwischen verschiedenen Hostrechnern ermöglichen. Au\ss erdem ermöglicht InfiniBand eine höhere Prognostizierbarkeit der Latenzen, und es können erheblich höhere Übertragungsraten bewältigt werden. Darüber hinaus sind die Latenzzeiten, die zwischen \SI{1.7}{\micro\second} und \SI{4.9}{\micro\second} liegen, lediglich 1.5--\SI{2.5}{\micro\second} grö\ss er als die der Null-Latenz-Referenz, die jedoch die \textit{\acrshort{posix} shared memory} Programmierschnittstelle zur Datenübertragung nutzt. Da diese Schnittstelle nur zur Kommunikation zwischen VILLASnode-Instanzen auf dem gleichen Rechner genutzt werden kann, kann gefolgert werden, dass die InfiniBand-Schnittstelle die niedrigste Latenz der gegenwärtigen Rechner-Rechner-Schnittstellen aufweist.

20
appendices.tex Normal file

@ -0,0 +1,20 @@
\bookmarksetupnext{level=part}
\appendices
\addtocontents{toc}{\protect\setcounter{tocdepth}{2}}
\makeatletter
\addtocontents{toc}{%
\begingroup
\let\protect\l@chapter\protect\l@section
\let\protect\l@section\protect\l@subsection
}
\makeatother
\input{appendices/verbs}
\input{appendices/tuned}
\input{appendices/nodetype_interface}
\input{appendices/villas_structs}
\input{appendices/infiniband_configuration}
\input{appendices/results_benchmarks}
\bookmarksetupnext{startatroot}
\addtocontents{toc}{\endgroup}
\endappendices
\backmatter


@ -0,0 +1,9 @@
\chapter{InfiniBand node configuration\label{a:infiniband_config}}
\begin{figure}[ht!]
\vspace{-0.0cm}
\lstinputlisting[caption=The configuration that was used to examine the InfiniBand node-type with the benchmark from \autoref{fig:villas_benchmark}. The bash variables were replaced by a script that controlled the benchmark.,
label=lst:infiniband_config,
style=customconfig]{listings/infiniband.conf}
\vspace{-1.4cm}
\end{figure}


@ -0,0 +1,3 @@
\chapter{VILLASnode node-type interface\label{a:nodetype_functions}}
\input{scripts/build/nodetype_functions}


@ -0,0 +1,204 @@
\chapter{Benchmark results\label{a:results_benchmarks}}
\section{Influence of CQEs on latency of RDMA write\label{a:oneway_unsignaled_rdma}}
\input{tables/oneway_settings_unsignaled_rdma}
\begin{figure}[ht!]
\vspace{1.5cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=14.75cm, keepaspectratio]{plots/oneway_unsignaled_rdma_hist/plot_0.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\vspace{0.05cm}
\includegraphics{plots/oneway_unsignaled_rdma_hist/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_unsignaled_rdma}. These were used to analyze the difference in latency between messages that did and did not cause a \acrfull{cqe}. The \textit{\gls{rdma} write} operation mode was used in this test.}\label{fig:oneway_unsignaled_rdma}
\end{figure}
\newpage
\section{Influence of constant burst size on latency\label{a:oneway_message_size_inline}}
\input{tables/oneway_settings_message_size_inline}
\begin{figure}[ht!]
\begin{subfigure}{0.351\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_inline_median/plot_0.pdf}
\caption{\gls{rc}}\label{fig:oneway_message_size_inline_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_inline_median/plot_1.pdf}
\caption{\gls{uc}}\label{fig:oneway_message_size_inline_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_inline_median/plot_2.pdf}
\caption{\gls{ud}}\label{fig:oneway_message_size_inline_c}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\centering
\vspace{0.15cm}
\includegraphics{plots/oneway_message_size_inline_median/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_message_size_inline}. While a triangle indicates $\tilde{t}_{lat}$ for a certain message size, the error bars indicate the upper and lower 10\% of $t_{lat}$ for that message size.}\label{fig:oneway_message_size_inline}
\end{figure}
\newpage
\section{Influence of intermediate pauses on latency\label{a:oneway_message_size_wait}}
\input{tables/oneway_settings_message_size_wait}
\begin{figure}[ht!]
\vspace{0.5cm}
\begin{subfigure}{0.351\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_wait_median/plot_0.pdf}
\caption{\gls{rc}}\label{fig:oneway_message_size_wait_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_wait_median/plot_1.pdf}
\caption{\gls{uc}}\label{fig:oneway_message_size_wait_b}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.312\textwidth}
\includegraphics[width=\linewidth, keepaspectratio]{plots/oneway_message_size_wait_median/plot_2.pdf}
\caption{\gls{ud}}\label{fig:oneway_message_size_wait_c}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{\textwidth}
\centering
\vspace{0.15cm}
\includegraphics{plots/oneway_message_size_wait_median/legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Results of the one-way benchmark with the settings from \autoref{tab:oneway_settings_message_size_wait}. While a triangle indicates $\tilde{t}_{lat}$ for a certain message size, the error bars indicate the upper and lower 10\% of $t_{lat}$ for that message size.}\label{fig:oneway_message_size_wait}
\vspace{-0.5cm}
\end{figure}
\newpage
\section{Comparison of timer functions\label{a:timer_comparison}}
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_a}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_wo_optimizations/infiniband_RC_0i_0j.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_b}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_wo_optimizations/infiniband_RC_1i_0j.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_c}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_w_optimizations/infiniband_RC_0i_0j.pdf}
\end{minipage}
\end{subfigure}
\vspace{-0.2cm}
\begin{subfigure}{\textwidth}
\begin{minipage}{0.45cm}
\vspace{-1.3cm}
\caption{}\label{fig:timer_comparison_d}
\end{minipage}
\hfill
\begin{minipage}{14.75cm}
\includegraphics[width=14.75cm, keepaspectratio]{plots/nodetype_timer_comparison_w_optimizations/infiniband_RC_1i_0j.pdf}
\end{minipage}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics{plots/nodetype_timer_comparison_w_optimizations/histogram_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{Comprehensive plots of the results from \autoref{tab:timer_comparison}. Subfigures (a) and (b) show the results in the unoptimized environment with \texttt{timerfd} and \gls{tsc}, respectively. Subfigures (c) and (d) show the results for the same settings, but in the optimized environment.}\label{fig:timer_comparison}
\vspace{-3.0cm}
\end{figure}
\newpage
\section{3D plots InfiniBand nodes (UC \& UD)\label{a:rate_size_3d_UC_UD}}
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_IB/median_3d_graph_UC.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\vspace{0.2cm}
\centering
\includegraphics{plots/nodetype_3d_IB/3d_UC_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the message size and generation rate on $\tilde{t}_{lat}$ between two InfiniBand nodes that communicate over an \acrfull{uc}.}\label{fig:rate_size_3d_UC}
\end{figure}
\begin{figure}[ht!]
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_IB/median_3d_graph_UD.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\vspace{0.2cm}
\centering
\includegraphics{plots/nodetype_3d_IB/3d_UD_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the message size and generation rate on $\tilde{t}_{lat}$ between two InfiniBand nodes that communicate over \acrfull{ud}.}\label{fig:rate_size_3d_UD}
\end{figure}
\newpage
\section{3D plot shmem node\label{a:shmem_3d}}
\begin{figure}[ht!]
\vspace{5.5cm}
\begin{subfigure}{\textwidth}
\includegraphics[width=15.2cm, keepaspectratio]{plots/nodetype_3d_shmem/median_3d_graph_XX.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\vspace{0.2cm}
\centering
\includegraphics{plots/nodetype_3d_shmem/3d_XX_legend.pdf}
\vspace{-0.15cm}
\end{subfigure}
\caption{The influence of the signal generation rate and the message size on the median latency between two \textit{shmem} nodes.}\label{fig:shmem_3d}
\end{figure}
\newpage
\section{Missed steps nanomsg and zeromq nodes\label{a:missed_steps_nanomsg_zeromq}}
\input{tables/missed_steps_nanomsg_zeromq}

10
appendices/tuned.tex Normal file

@ -0,0 +1,10 @@
\chapter{Tuned daemon profile\label{a:tuned_profile}}
This appendix shows the \textit{latency-performance} \texttt{tuned} profile that was used during the benchmarks that were run on the \glspl{hca} and VILLASnode.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The \texttt{tuned} default profile \textit{latency-performance}. Comments are omitted for the sake of brevity.,
label=lst:tuned_latency_performance,
style=customconfig]{listings/tuned_latency_performance.conf}
\vspace{-0.2cm}
\end{figure}

12
appendices/verbs.tex Normal file

@ -0,0 +1,12 @@
\chapter{OpenFabrics Verbs\label{a:openfabrics}}
Experimental functions are not included in this appendix. Furthermore, the \gls{rdma} verbs \gls{api} is omitted because it is not used in the present work. Comprehensive documentation of all verbs can be found in the \gls{rdma} Aware Networks Programming User Manual~\cite{mellanox2015RDMA}.
\section{IB verbs API}
This section presents the default InfiniBand verbs API.
\input{scripts/build/ib_verbs}
\newpage
\section{RDMA CM API}
This section presents the RDMA communication manager API, as introduced in \autoref{sec:rdmacm}.
\input{scripts/build/rdma_cm_verbs}


@ -0,0 +1,28 @@
\chapter{VILLASnode structs\label{a:villas_structs}}
This appendix presents a few structures that help in understanding the VILLASnode architecture from \autoref{chap:architecture}. A full overview of all header files can be found in the VILLASnode Git repository\footnote{\url{https://git.rwth-aachen.de/acs/public/villas/VILLASnode}}.
\section{\texttt{struct sample}\label{a:sec:structsample}}
\begin{figure}[ht!]
\lstinputlisting[caption=The C structure of a VILLASnode sample.,
label=lst:struct_sample,
style=customc]{listings/struct_sample.h}
\vspace{-0.2cm}
\end{figure}
\newpage
\section{\texttt{struct node}\label{a:sec:structnode}}
\begin{figure}[ht!]
\lstinputlisting[caption=The C structure of a VILLASnode node.,
label=lst:struct_node,
style=customc]{listings/struct_node.h}
\vspace{-0.2cm}
\end{figure}
\newpage
\section{\texttt{struct node\_type}\label{a:sec:structnodetype}}
\begin{figure}[ht!]
\lstinputlisting[caption=The C structure of a VILLASnode node-type.,
label=lst:struct_nodetype,
style=customc]{listings/struct_nodetype.h}
\vspace{-0.2cm}
\end{figure}

19
biblatex.cfg Normal file

@ -0,0 +1,19 @@
\NewBibliographyString{noauthor}
\NewBibliographyString{noeditor}
\NewBibliographyString{nodate}
\NewBibliographyString{notitle}
\NewBibliographyString{nolocation}
\NewBibliographyString{nopublisher}
\DefineBibliographyStrings{english}{%
noauthor = {s\adddot a\adddot},
noeditor = {s\adddot ed\adddot},
nodate = {s\adddot a\adddot},
notitle = {s\adddot t\adddot},
nolocation = {s\adddot l\adddot},
nopublisher = {s\adddot ed\adddot},
}
\newcommand*\nosomethings{noauthor,noeditor,nodate,notitle,nolocation,nopublisher}
\@for \xx:=\nosomethings \do {%
\expandafter\ifcsname\xx\endcsname\relax\else
\expandafter\expandafter\expandafter\expandafter\edef\csname\xx\endcsname{\noexpand\bibstring{\xx}}%
\fi}

482
bibliography.bib Normal file

@ -0,0 +1,482 @@
%
% $Description: ACS Thesis Bibliography$
%
% $Author: pickartz $
% $Date: 2015/04/23 $
% $Revision: 0.1 $
%
@manual{compaq1997microsoft,
author={\noauthor},
title={{Virtual Interface Architecture Specification}},
organization={Compaq, Intel, Microsoft},
note={Version 1.0},
month={12},
year={1997}
}
@article{dunning1998virtual,
title={{The Virtual Interface Architecture}},
author={Dunning, Dave and Regnier, Greg and McAlpine, Gary and Cameron, Don and Shubert, Bill and Berry, Frank and Merritt, Anne Marie and Gronke, Ed and Dodd, Chris},
journal={IEEE Micro},
volume={18},
number={2},
pages={66--76},
year={1998},
month={3},
publisher={IEEE},
ISSN={0272-1732},
doi={10.1109/40.671404}
}
@incollection{pfister2001introduction,
title={{An Introduction to the InfiniBand\texttrademark~Architecture}},
author={Pfister, Gregory F},
booktitle={High Performance Mass Storage and Parallel I/O},
chapter={42},
pages={617--632},
year={2001}
}
@book{tanenbaum2014modern,
title={{Modern Operating Systems}},
author={Tanenbaum, Andrew S and Bos, Herbert},
year={2014},
isbn={978-0-13-359162-0},
edition={4},
publisher={Pearson Education, Inc}
}
@book{kozierok2005tcp,
title={{The TCP/IP Guide: A Comprehensive, Illustrated Internet Protocols Reference}},
author={Kozierok, Charles M},
year={2005},
isbn={978-1593270476},
publisher={No Starch Press}
}
@manual{infinibandvol1,
author={\noauthor},
title={{InfiniBand\texttrademark~Architecture Specification, Volume 1}},
organization={InfiniBand Trade Association and others},
note={Release 1.2.1},
month={11},
year={2007}
}
@manual{infinibandvol2,
author={\noauthor},
title={{InfiniBand\texttrademark~Architecture Specification, Volume 2}},
organization={InfiniBand Trade Association and others},
note={Release 1.3.1},
month={11},
year = {2016}
}
@article{grun2010introduction,
title={{Introduction to InfiniBand for End Users}},
author={Grun, Paul},
organization={InfiniBand Trade Association},
year={2010}
}
@techreport{crupnicoff2005deploying,
title={{Deploying Quality of Service and Congestion Control in InfiniBand-based Data Center Networks}},
author={Crupnicoff, Diego and Das, Sujal and Zahavi, Eitan},
organization={{Mellanox Technologies}},
number={2379},
year={2005},
}
@manual{eui64,
author={\noauthor},
title={{Guidelines for Use of Extended Unique Identifier (EUI), Organizationally Unique Identifier (OUI), and Company ID (CID)}},
organization={Institute of Electrical and Electronics Engineers},
month={8},
year={2017}
}
%%%%%%%%%%%%%%%%%%%% INTRODUCTION %%%%%%%%%%%%%%%%%%%
@article{strasser2015review,
title={{A Review of Architectures and Concepts for Intelligence in Future Electric Energy Systems}},
author={Strasser, Thomas and Andr{\'e}n, Filip and Kathan, Johannes and Cecati, Carlo and Buccella, Concettina and Siano, Pierluigi and Leitao, Paulo and Zhabelova, Gulnara and Vyatkin, Valeriy and Vrba, Pavel and others},
journal={IEEE Transactions on Industrial Electronics},
volume={62},
number={4},
pages={2424--2438},
year={2015},
month={4},
publisher={IEEE},
ISSN={0278-0046},
doi={10.1109/TIE.2014.2361486}
}
@article{faruque2015real,
title={{Real-Time Simulation Technologies for Power Systems Design, Testing, and Analysis}},
author={Faruque, MD Omar and Strasser, Thomas and Lauss, Georg and Jalili-Marandi, Vahid and Forsyth, Paul and Dufour, Christian and Dinavahi, Venkata and Monti, Antonello and Kotsampopoulos, Panos and Martinez, Juan A and others},
journal={IEEE Power and Energy Technology Systems Journal},
volume={2},
number={2},
pages={63--73},
year={2015},
month={6},
publisher={IEEE},
ISSN={2332-7707},
doi={10.1109/JPETS.2015.2427370}
}
@article{larsen2009architectural,
title={{Architectural breakdown of end-to-end latency in a TCP/IP network}},
author={Larsen, Steen and Sarangam, Parthasarathy and Huggahalli, Ram and Kulkarni, Siddharth},
journal={{International Journal of Parallel Programming}},
year={2009},
month={12},
volume={37},
number={6},
pages={556--571},
issn={1573-7640},
doi={10.1007/s10766-009-0109-6},
publisher={Springer}
}
@article{reinemo2006overview,
author={S. Reinemo and T. Skeie and T. Sødring and O. Lysne and O. Trudbakken},
journal={IEEE Communications Magazine},
title={{An Overview of QoS Capabilities in InfiniBand, Advanced Switching Interconnect, and Ethernet}},
year={2006},
volume={44},
number={7},
pages={32--38},
doi={10.1109/MCOM.2006.1668378},
ISSN={0163-6804},
month={09}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%% OFED %%%%%%%%%%%%%%%%%%%%%%%%
@misc{allianceofed,
author={\noauthor},
year={2018},
title={{OFA Overview}},
organization={{OpenFabrics Alliance}},
url={https://www.openfabrics.org/ofa-overview/},
urldate = {2018-08-22},
}
@manual{mellanox2018linux,
author={\noauthor},
title={{Mellanox OFED for Linux User Manual}},
organization={{Mellanox Technologies}},
year={2018},
month={3},
number={2877},
note={Rev 4.3}
}
@manual{mellanox2015RDMA,
author={\noauthor},
title={{RDMA Aware Networks Programming User Manual}},
organization={{Mellanox Technologies}},
year={2015},
month={5},
edition={Rev 1.7}
}
@manual{ipoib,
title={{IP over InfiniBand (IPoIB) Architecture}},
author={{Kashyap, V}},
organization={Internet Engineering Task Force},
year={2015},
month={5},
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%% VILLAS NODE %%%%%%%%%%%%%%%%%%%%
@article{stevic2017multi,
title={{Multi-site European framework for real-time co-simulation of power systems}},
author={Stevic, Marija and Estebsari, Abouzar and Vogel, Steffen and Pons, Enrico and Bompard, Ettore and Masera, Marcelo and Monti, Antonello},
journal={IET Generation, Transmission \& Distribution},
volume={11},
number={17},
pages={4126--4135},
year={2017},
publisher={IET},
ISSN={1751-8687},
doi={10.1049/iet-gtd.2016.1576}
}
@inproceedings{vogel2017open,
title={{An Open Solution for Next-generation Real-time Power System Simulation}},
author={Vogel, Steffen and Mirz, Markus and Razik, Lukas and Monti, Antonello},
booktitle={{Energy Internet and Energy System Integration (EI2), 2017 IEEE Conference on}},
pages={1--6},
year={2017},
month={11},
publisher={IEEE},
doi={10.1109/EI2.2017.8245739}
}
@inproceedings{mirz2018distributed,
title={{Distributed Real-Time Co-Simulation as a Service}},
author={Mirz, Markus and Vogel, Steffen and Sch{\"a}fer, Bettina and Monti, Antonello},
booktitle={Industrial Electronics for Sustainable Energy Systems (IESES), 2018 IEEE International Conference on},
pages={534--539},
year={2018},
month={2},
doi={10.1109/IESES.2018.8349934},
publisher={IEEE},
address={Hamilton, New Zealand}
}
@mastersthesis{vogel2016development,
title={{Development of a modular and fully-digital PCIe-based interface to Real-Time Digital Simulator}},
author={Vogel, Steffen},
year=2016,
month=8,
school={RWTH Aachen University},
institution={Institute for Automation of Complex Power Systems}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%% PERFORMANCE STUDIES %%%%%%%%%%%%%%%%%%%%
@inproceedings{macarthur2012performance,
title={{A Performance Study to Guide RDMA Programming Decisions}},
author={MacArthur, Patrick and Russell, Robert D},
booktitle={{High Performance Computing and Communication \& 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on}},
pages={778--785},
year={2012},
publisher={IEEE},
doi={10.1109/HPCC.2012.110}
}
@inproceedings{liu2014performance,
author = {Liu, Qian and Russell, Robert D.},
title = {{A Performance Study of InfiniBand Fourteen Data Rate (FDR)}},
booktitle = {{Proceedings of the High Performance Computing Symposium}},
series = {HPC '14},
year = {2014},
location = {Tampa, Florida},
pages = {1--10},
articleno = {16},
numpages = {10},
acmid = {2663526},
publisher = {Society for Computer Simulation International},
address = {San Diego, CA, USA},
keywords = {InfiniBand, NUMA, RDMA, RDMA_WRITE_WITH_IMM, fourteen data rate},
}
%%%%%%%%%%%%%%%%%%%%%%%% LINUX BOOKS %%%%%%%%%%%%%%%%%%%%%%%%
@book{kerrisk2010linux,
title={{The Linux Programming Interface: a Linux and UNIX System Programming Handbook}},
author={Kerrisk, Michael},
year={2010},
isbn={978-1-59327-220-3},
publisher={No Starch Press}
}
@manual{posix2018,
author={\noauthor},
title={{IEEE Standard for Information Technology---Portable Operating System Interface (POSIX\textregistered)}},
organization={Institute of Electrical and Electronics Engineers},
note={Base Specifications, Issue 7},
isbn={978-1-5044-4542-9},
month={1},
year={2018},
doi={10.1109/IEEESTD.2018.8277153}
}
@book{kernighan1978c,
title={{The C Programming Language}},
author={Kernighan, Brian W and Ritchie, Dennis M},
note={1st ed.},
isbn={0-13-110163-3},
month={2},
year={1978}
}
@article{barabanov1996real,
title={{Real-Time Linux}},
author={Barabanov, Michael and Yodaiken, Victor},
journal={Linux Journal},
volume={23},
number={4.2},
pages={1},
year={1996}
}
@inproceedings{rostedt2007internals,
title={{Internals of the RT Patch}},
author={Rostedt, Steven and Hart, Darren V},
booktitle={Proceedings of the Linux Symposium},
volume={2},
pages={161--172},
month={6},
year={2007}
}
@book{love2010linux,
title={{Linux Kernel Development}},
author={Love, Robert},
month={6},
year={2010},
publisher={Pearson Education, Inc.},
isbn={978-0-672-32946-3}
}
@article{lameter2013numa,
author = {Lameter, Christoph},
title = {{NUMA (Non-Uniform Memory Access): An Overview}},
journal = {Queue},
issue_date = {July 2013},
volume = {11},
number = {7},
month = jul,
year = {2013},
issn = {1542-7730},
pages = {40--51},
articleno = {40},
numpages = {12},
doi = {10.1145/2508834.2513149},
publisher = {ACM},
address = {New York, NY, USA},
}
@misc{derr2004cpusets,
title={Cpusets},
author={Derr, Simon and Jackson, P and Lameter, C and Menage, P and Seto, H},
year={2004},
url={https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt},
urldate={2018-09-16}
}
@misc{menage2004cgroups,
title={Cgroups},
author={Menage, Paul and Jackson, Paul and Lameter, Christoph},
year={2008},
url={https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt},
urldate={2018-09-16}
}
@inproceedings{kroah2003udev,
title={udev--A Userspace Implementation of devfs},
author={Kroah-Hartman, Greg},
booktitle={Proceedings of the Linux Symposium},
pages={263--271},
month={7},
year={2003},
}
@misc{drepper2007every,
title={{What Every Programmer Should Know About Memory}},
author={Drepper, Ulrich},
organization={Red Hat, Inc.},
note={Version 1.0},
month={11},
year={2007}
}
@manual{guide2018intelc3a,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developer's Manual}},
organization={Intel},
note={Volume 3A: System Programming Guide, Part 1},
year={2018},
month={5}
}
@manual{guide2018intelc3b,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developer's Manual}},
organization={Intel},
note={Volume 3B: System Programming Guide, Part 2},
year={2018},
month={5}
}
@manual{guide2018intelb2a,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developer's Manual}},
organization={Intel},
note={Volume 2A: Instruction Set Reference, A-L},
year={2018},
month={5}
}
@manual{guide2018intelb2b,
author={\noauthor},
title={{Intel\textregistered~64 and IA-32 Architectures Software Developer's Manual}},
organization={Intel},
note={Volume 2B: Instruction Set Reference, M-U},
year={2018},
month={5}
}
@techreport{paoloni2010benchmark,
title={{How to Benchmark Code Execution Times on Intel\textregistered{} IA-32 and IA-64 Instruction Set Architectures}},
author={Paoloni, Gabriele},
organization={Intel},
year={2010},
month={9}
}
@article{gandhi2016range,
title={{Range Translations for Fast Virtual Memory}},
author={Gandhi, Jayneel and Karakostas, Vasileios and Ayar, Furkan and Cristal, Adri{\'a}n and Hill, Mark D and McKinley, Kathryn S and Nemirovsky, Mario and Swift, Michael M and Unsal, Osman S},
journal={IEEE Micro},
volume={36},
number={3},
pages={118--126},
doi={10.1109/MM.2016.10},
ISSN={0272-1732},
month={5},
year={2016}
}
@misc{bowden2009proc,
title={The /proc Filesystem},
author={Bowden, Terrehon and Bauer, Bodo and Nerin, Jorge and Feng, Shen and Seibold, Stefani},
year={2009},
month={6},
note={Version 1.3},
url={https://www.kernel.org/doc/Documentation/filesystems/proc.txt},
urldate={2018-09-19}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@article{perez2007ipython,
Author = {P\'erez, Fernando and Granger, Brian E.},
Title = {{IPython: a System for Interactive Scientific Computing}},
Journal = {Computing in Science and Engineering},
Volume = {9},
Number = {3},
Pages = {21--29},
month = may,
year = 2007,
url = "https://ipython.org",
ISSN = "1521-9615",
doi = {10.1109/MCSE.2007.53},
publisher = {IEEE Computer Society},
}
@manual{pcisig2010pciexpress,
author={\noauthor},
title={{PCI Express\textregistered{} Base Specification}},
organization={{PCI-SIG}},
year={2010},
month={11},
note={Revision 3.0}
}
@inproceedings{susan1983gprof,
title={{gprof: A Call Graph Execution Profiler}},
author={Graham, Susan L. and Kessler, Peter B. and McKusick, Marshall K.},
booktitle={Proceedings: USENIX Association [and] Software Tools Users Group Summer Conference},
pages={81--88},
year={1983},
address={Toronto, Ontario, Canada}
}

7
chapters.tex Normal file

@ -0,0 +1,7 @@
\include{chapters/introduction}
\include{chapters/basics}
\include{chapters/architecture}
\include{chapters/implementation}
\include{chapters/evaluation}
\include{chapters/conclusion}
\include{chapters/future}

226
chapters/architecture.tex Normal file

@ -0,0 +1,226 @@
\chapter{Architecture\label{chap:architecture}}
The first section of this chapter (\ref{sec:villasbasics}) explains the concept and internals of a VILLASnode instance. In the second section (\ref{sec:configuration}), a brief introduction to the configuration of node-type instances is given. Then, in \autoref{sec:readwrite_interfaces},~\ref{sec:memorymanagement}, and~\ref{sec:villas_fsm}, the adaptations that had to be made to the interface of node-types, the memory management of VILLASnode, and the finite-state machine of nodes are explained, respectively.
\section{Concept\label{sec:villasbasics}}
The functioning principles and general structure of VILLASframework, of which VILLASnode is a sub-project, were already presented in \autoref{sec:intro_villas}. This section solely focuses on the structure of VILLASnode.
\Autoref{tab:villasnode_nodes} presented the different \textit{node-types} that VILLASnode supported at the time of writing the present work. One VILLASnode instance---in the remainder of the present work often referred to as \textit{super-node}---may have several \textit{nodes} which act as source and/or sink of simulation data. A node is defined as an instance of a node-type. Accordingly, a super-node can serve as a gateway for simulation data. Node-types can roughly be divided into three categories:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textit{internal node-types}, which enable communication with node-types on the same host (e.g., writing data to a file descriptor through a \textit{file} node);
\item \textit{server-server node-types}, which enable communication with nodes on different hosts (e.g., communicating with a \textit{socket} node on a remote host);
\item \textit{simulator-server node-types}, which enable communication with simulators (e.g., acquiring data from an OPAL-RT simulator).
\end{itemize}
(In the remainder of this work, names of node-types and nodes are written in an italic font, for example, \textit{file} node, \textit{socket} node, or \textit{InfiniBand} node-type.)
Within a super-node, so-called \textit{paths} connect different nodes. A path starts at a node from which it acquires data. Immediately after data is obtained, it is optionally sent through a \textit{hook}, which can be seen as an extension to manipulate the data (e.g., to filter or transform it). Then, the data is written into a \gls{fifo} (also called: \textit{queue}), which holds it until it can be passed on. Subsequently, the data is sent through a \textit{register}, which can multiplex and mask it. Before the data is placed into the output queue and right before the sending node obtains it, it can be manipulated by more hooks. Finally, if the output node is ready, the data is moved from the output queue to the output node, which then sends it to a given destination node.
Data is transmitted in \textit{samples}, which store the simulation data for a given point in time, send and receive timestamps, and a sequence number. The sample structure is deliberately kept simple because it is the smallest common denominator of all supported simulators.
\begin{figure}[ht!]
\includegraphics{images/villasnode.pdf}
\vspace{-0.5cm}
\caption{The internal VILLASnode architecture~\cite{vogel2017open}. Depicted is one VILLASnode instance (\textit{super-node}) that includes three \textit{paths}, which connect five node-type instances (\textit{nodes}) with each other.}
\label{fig:villasnode}
\end{figure}
\Autoref{fig:villasnode} depicts the internal connections of an example super-node. This VILLASnode instance includes five node-type instances: \textit{opal} ($n_1$), \textit{file} ($n_2$), \textit{socket} ($n_3$), \textit{mqtt} ($n_4$), and a yet-to-be-implemented \textit{InfiniBand} ($n_5$) node. On receive, data from the \textit{opal} node $n_1$ is modified by hook $h_1$ before it is placed in queue $q_{i,1}$. Path 1 continues through register $r_1$, hook $h_2$, and hook $h_3$, before it enters the output queue $q_{o,1}$. Before the \textit{socket} node $n_3$ sends the data from the queue to another \textit{socket} node, it is modified one last time by hook $h_4$.
Path 2 connects a \textit{socket} node ($n_3$), an \textit{mqtt} node ($n_4$), and an \textit{InfiniBand} node ($n_5$) with an \textit{opal} node $n_1$. In this path, the register $r_2$ determines the forwarding conditions for $q_{i,2}$, $q_{i,3}$, and $q_{i,4}$; it could, for example, depending on the data available in the queues, mask them. Before the data is placed in the output queue $q_{o,2}$ and right before the \textit{opal} node sends the data, it is modified by hooks $h_5$ and $h_6$, respectively.
Path 3 connects a \textit{file} node $n_2$, which reads data from a local file, with an \textit{mqtt} node $n_4$ and \textit{InfiniBand} node $n_5$.
\section{Configuration of nodes\label{sec:configuration}}
\Autoref{lst:node_config} shows an example of a stripped-down VILLASnode configuration file. The first part of the configuration consists of a list of nodes to be initialized (comparable with $n_{1\ldots5}$ in \autoref{fig:villasnode}). In this example, an instance of a \textit{file} node-type (\texttt{node\_1}) and an instance of an \textit{InfiniBand} node-type (\texttt{node\_2}) would be instantiated. Besides the type, a user can specify a range of settings for every node. These can be divided into global settings for the complete instance, settings only for the input part of the node, and settings only for the output part. The supported settings for every node-type can be found on the VILLASframework documentation pages.\footnote{\url{https://villas.fein-aachen.org/doc/node-types.html}}
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Structure of the configuration file of a \textit{file} node and an \textit{InfiniBand} node with a path connecting them.,
label=lst:node_config,
style=customconfig]{listings/node_config.conf}
\vspace{-0.2cm}
\end{figure}
The \textit{paths} section describes how nodes are connected within the super-node (compare with path 1, path 2, and path 3 in \autoref{fig:villasnode}). In this case, there is a path between \texttt{node\_1} and \texttt{node\_2}. This means that data is read from a file, which would be specified in the in-section of \texttt{node\_1}, and then placed in a buffer in the super-node. Then, after it is sent through possible hooks---which are not defined in this configuration file---it is copied to the memory that is allocated as output buffer for the \textit{InfiniBand} node. The super-node then sends these samples to the write-function of that node, which in turn sends the samples to a remote node as specified in its out-section.
\section{Interface of node-types\label{sec:readwrite_interfaces}}
To ensure interoperability between different node-types and VILLASnode, the VILLASframework specification defines an interface to use between the super-node and node-types. It is realized as a fixed set of functions with a given set of parameters that every node-type can implement. These functions have to be registered with the framework by passing it the pointers of the respective functions. Examples of functions to be implemented are \texttt{start()} and \texttt{stop()}, as well as \texttt{read()} and \texttt{write()}. Since their parameters had to be changed to efficiently support an \textit{InfiniBand} node-type, this section will expand upon the latter.
Not every function is mandatory; some functions will simply be ignored if they are not implemented. A complete list of all functions a node-type should implement, together with a brief description, is presented in \autoref{a:nodetype_functions}.
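The registration mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual VILLASnode code: the structure \texttt{node\_type\_sketch}, the helper functions, and the exact parameter lists are hypothetical and merely mirror the four functions named above. A function that a node-type does not implement is left as a null pointer and is skipped by the caller.
\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>

struct node;   /* opaque in this sketch */
struct sample; /* opaque in this sketch */

/* Hypothetical interface table: one function pointer per interface function. */
struct node_type_sketch {
    const char *name;
    int (*start)(struct node *n);
    int (*stop)(struct node *n);
    int (*read)(struct node *n, struct sample *smps[], unsigned cnt);
    int (*write)(struct node *n, struct sample *smps[], unsigned cnt);
};

/* Example node-type that implements start() only. */
static int example_start(struct node *n)
{
    (void) n;
    return 0;
}

static const struct node_type_sketch example_type = {
    .name = "example",
    .start = example_start,
    /* stop, read, and write are not implemented and therefore NULL */
};

/* Caller side: invoke an optional function only if it was registered. */
static int call_start(const struct node_type_sketch *t, struct node *n)
{
    return t->start != NULL ? t->start(n) : 0;
}

static int call_stop(const struct node_type_sketch *t, struct node *n)
{
    return t->stop != NULL ? t->stop(n) : 0;
}

/* Returns 0 if both calls behave as described. */
int interface_demo(void)
{
    int started = call_start(&example_type, NULL);
    int stopped = call_stop(&example_type, NULL);

    assert(started == 0); /* implemented: example_start() ran */
    assert(stopped == 0); /* not implemented: silently ignored */
    return started | stopped;
}
\end{lstlisting}
Checking for a null pointer at the call site keeps the table itself trivial to fill in: a node-type lists only the functions it actually provides.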
\subsection{Original implementation of the read- and write-function}
\Autoref{lst:read_write_original} shows the variables which were originally used in the \texttt{node\_type} C structure (\autorefap{a:sec:structnodetype}) to save the function pointers to the read- and write-function. Since this listing shows the initial parameters, it helps to understand the working principles of both functions and their weaknesses.
For both functions, \texttt{*n} points to a C structure that holds information about the node-type instance. It contains, among other things, information about the state, the number of generated or received samples, the configuration of the node, and a field for node-type-specific virtual data. The node structure is displayed in \autorefap{a:sec:structnode}; the present work will not expand further upon this struct.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Original parameters of \texttt{read()} and \texttt{write()},
label=lst:read_write_original,
style=customc]{listings/read_write_original.h}
\vspace{-0.2cm}
\end{figure}
\paragraph{Read-function} The working principle of the read-function is displayed in \autoref{fig:villas_read}. The \textit{\undershort read()} box represents the function to which the \texttt{(*read)} pointer (line 1 in \autoref{lst:read_write_original}) of a given node-type points and is often simply referred to as \textit{read-function} in the remainder of the present work. The box thus depicts a part of the interface between the super-node and the node.
In order to retrieve data from a node, the super-node starts by allocating $\mathtt{cnt} \geq 1$ empty samples. A sample contains fields for, i.a., an origin timestamp, a receive timestamp, a sequence number, a reference counter, and a field to save the actual signal. The signal can contain unsigned 64-bit integers, 64-bit floating-point numbers, booleans, or complex numbers. \Autorefap{a:sec:structsample} presents the \texttt{sample} C structure. Since this structure contains some host specific information, it contains more data than will actually be sent.
After samples have been allocated, their reference counter (\textit{refcnt}) is increased by one. Samples in VILLASnode cannot be destroyed unless the reference counter is 1 when the release-function is called. When $refcnt>1$, other instances within VILLASnode still rely on the sample; calling the release-function on such a sample will cause the reference counter to be decremented by 1. In the remainder of the present work, \textit{releasing a sample} and \textit{decreasing the reference counter of a sample by one} are used interchangeably.
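The release semantics can be illustrated with a small, self-contained sketch. All names (\texttt{sample\_alloc()}, \texttt{sample\_get()}, \texttt{sample\_put()}) are hypothetical and the sample is reduced to its reference counter; as a simplification, allocation and the first increment of the counter are folded into one step.
\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, stripped-down sample: only the reference counter is modeled. */
struct sample {
    int refcnt;
};

/* Allocate a sample; the allocating instance holds the first reference. */
static struct sample *sample_alloc(void)
{
    struct sample *s = calloc(1, sizeof(*s));

    if (s != NULL)
        s->refcnt = 1;
    return s;
}

/* Take an additional reference, e.g., for a second user of the sample. */
static void sample_get(struct sample *s)
{
    s->refcnt++;
}

/* Release one reference and return the remaining count. The sample is
 * destroyed only if the counter was 1 when this function was called. */
static int sample_put(struct sample *s)
{
    if (s->refcnt > 1)
        return --s->refcnt;
    free(s);
    return 0;
}

/* Walks through the life cycle described above; returns 0 on success. */
int refcnt_demo(void)
{
    struct sample *s = sample_alloc();
    int left;

    if (s == NULL)
        return -1;
    sample_get(s);        /* refcnt == 2: a second user relies on s */
    left = sample_put(s); /* first release only decrements */
    assert(left == 1);
    left = sample_put(s); /* second release destroys the sample */
    assert(left == 0);
    return left;
}
\end{lstlisting}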
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_read.pdf}
\vspace{-0.8cm}
\caption{Invoking the read-function.}\label{fig:villas_read_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_read.pdf}
\vspace{-0.8cm}
\caption{Return of the read-function.}\label{fig:villas_read_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_read_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the read-function in VILLASnode. This function is part of the interface between a super-node and a node.}\label{fig:villas_read}
\end{figure}
After memory to hold the samples has been allocated, a pointer to the first sample (\texttt{*smps[]}) and the total number of allocated samples (\texttt{cnt}) are passed to the node by calling the read-function (\autoref{fig:villas_read_a}). The node then tries to receive a maximum of \texttt{cnt} values to subsequently copy them to the allocated memory.
The return of the read-function is depicted in \autoref{fig:villas_read_b}. After the receive module, which is blackboxed here, has filled up $ret \leq \mathtt{cnt}$ samples, it lets the read-function return with \textit{ret}. The super-node then processes \textit{ret} samples (e.g., sending them through several hooks, before sending them to another node). Finally, all \texttt{cnt}---thus not only \textit{ret}---samples are released. So, after a read cycle, the reference counter of all samples is decreased by 1, and in that way the samples are usually destroyed.
\paragraph{Write-function} The write-function works in a similar fashion to the read-function and has identical parameters (line 2 in \autoref{lst:read_write_original}). The working principle of this function is depicted in \autoref{fig:villas_write}. When a super-node's path needs to write data to a node, it calls the write-function (\autoref{fig:villas_write_a}) and passes the total number of samples and the pointer to the first sample as arguments.
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_write.pdf}
\vspace{-0.8cm}
\caption{Invoking the write-function.}\label{fig:villas_write_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_write.pdf}
\vspace{-0.8cm}
\caption{Return of the write-function.}\label{fig:villas_write_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_write_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the write-function in VILLASnode. This function is part of the interface between a super-node and a node.} \label{fig:villas_write}
\end{figure}
When the write-function is called, the node starts processing the samples by copying \texttt{cnt} samples to its send module and instructing it to send the data. The send module does not return until all samples are copied to the send module, and in the case of many node-types, not until the data is successfully sent. When the send module is done, depicted in \autoref{fig:villas_write_b}, it lets the write-function return with the number of samples that have been successfully sent. Ideally, the returned value \textit{ret} is equal to the number of passed samples \texttt{cnt}. If this is not the case, the super-node will detect this and act upon a possible error. In all cases, the reference counter of all \texttt{cnt} samples is decremented by~1.
\subsection{Requirements for the read- and write-function of an InfiniBand node\label{sec:requirements}}
As discussed in the previous section, the reference counters of all samples that have been sent into the read- or write-functions are decreased after the functions return. For nodes with either a receive module that has a local buffer or with a send module which does not return until it has made a copy of the data or actually sent the data, this approach works exactly as intended. But, as soon as the modules are implemented by an architecture which is based on the \gls{via}---in this particular case the \gls{iba}---the implementation causes problems. To adhere to the zero-copy principle of the \gls{via}, data should not be copied from the super-node's buffer to a local buffer or the other way around. Rather, a pointer to, and the length of, a memory location should be passed to the network adapter, which then independently copies the data from the host's memory to its local buffers or the other way around.
In the following, the ideal situation for a read and write operation for the InfiniBand Architecture is presented. Although this approach is specifically for the \gls{iba}, it can relatively easily be translated to other \glspl{via}. After the desired approach has been discussed, the next subsection will discuss the shortcomings of the parameters in \autoref{lst:read_write_original} that obstruct the implementation of this approach.
\paragraph{Read-function}
\Autoref{fig:villas_read_iba} depicts a super-node that reads from a node-type instance whose communication is based on the \gls{iba}. The receive module in this figure relies on the receive queue of an InfiniBand \gls{qp}. As explained in \autoref{sec:qp}, a queue pair cannot receive data unless its \gls{rq} holds receive \glspl{wqe}. Hence, work requests that point to buffers of the super-node have to be submitted.
\begin{figure}[ht!]
\vspace{-0.4cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_read_iba.pdf}
\vspace{-0.8cm}
\caption{Invoking the read-function.}\label{fig:villas_read_iba_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_read_iba.pdf}
\vspace{-0.8cm}
\caption{Return of the read-function.}\label{fig:villas_read_iba_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_read_iba_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the read-function in an \textit{InfiniBand} node. The \acrshort{rq} is part of a complete \acrshort{qp}, but the \acrshort{sq} is omitted for the sake of simplicity.} \label{fig:villas_read_iba}
\end{figure}
An important requirement for this node-type was that it should be compatible with the original node-type interface; or at least that the changes would be minimal. Hence, in order to acquire pointers to samples from the super-node, the \texttt{*smps[]} parameter from the read-function is used. Like the super-node in \autoref{fig:villas_read_a}, the super-node in \autoref{fig:villas_read_iba_a} starts by allocating $\mathtt{cnt} \geq 1$ empty samples, increasing their reference counters, and passing their pointers to the node's read-function. The node, in turn, takes the addresses of the samples, wraps them up in scatter/gather elements, places them in work requests, and submits them to the \gls{rq}. Now, when the \gls{hca} receives a message, it will write the data directly into the allocated memory within the super-node. In this way, an additional copy between the node and the super-node is avoided.
Since the receive module of an \textit{InfiniBand} node does not copy data to the passed samples, the return of the function in \autoref{fig:villas_read_iba_b} works fundamentally differently from the return of the function in \autoref{fig:villas_read_b}. If there are no \glspl{cqe} in the completion queue, thus if the HCA did not receive any data, the return value \textit{ret} of the node shall be 0. In that way, the super-node knows that the set of previously allocated \texttt{smps[]} does not hold any data. The reference counters of none of the buffers shall be decreased since they are all submitted to the \gls{rq} and the \gls{hca} will thus write data to them.
If \glspl{cqe} are available, pointers to samples which are submitted to the \gls{rq} (light gray in \autoref{fig:villas_read_iba}) are replaced by the pointers to the buffers that are filled by the HCA (dark gray in \autoref{fig:villas_read_iba}). The return value \textit{ret} shall be the number of pointers that have been replaced since these buffers now contain valid data that was sent to this node. The reference counters of these buffers must be decreased after they have been processed by the super-node.
Consequently, in order for the \textit{InfiniBand} node to be able to receive data, the super-node has to invoke the read-function at least once without acquiring any data. To store the pointers to the buffers in the \glspl{cqe}, the \gls{wr} C structure member \texttt{wr\_id} can be used (see \autoref{sec:postingWRs}).
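The pointer replacement described above can be modeled without any InfiniBand hardware. The following sketch is an assumption-laden illustration, not the node-type's implementation: \texttt{fake\_qp}, \texttt{fake\_hca\_receive()}, and \texttt{ib\_read\_sketch()} are hypothetical names, and two small arrays stand in for the \gls{rq} and the completion queue. The address stored in the second array plays the role of the buffer pointer that would be kept in \texttt{wr\_id}.
\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>

#define DEPTH 8

struct sample { int data; };

/* Toy stand-in for a receive queue (RQ) and a completion queue (CQ). */
struct fake_qp {
    struct sample *rq[DEPTH]; unsigned rq_len; /* posted, still empty */
    struct sample *cq[DEPTH]; unsigned cq_len; /* filled by the "HCA" */
};

/* Remove and return the oldest entry of a small pointer queue. */
static struct sample *pop_front(struct sample *q[], unsigned *len)
{
    struct sample *s = q[0];

    for (unsigned i = 1; i < *len; i++)
        q[i - 1] = q[i];
    (*len)--;
    return s;
}

/* Pretend that the HCA received a message: fill the oldest posted buffer
 * and generate a completion that carries the buffer address. */
static void fake_hca_receive(struct fake_qp *qp, int value)
{
    if (qp->rq_len == 0 || qp->cq_len == DEPTH)
        return;
    struct sample *s = pop_front(qp->rq, &qp->rq_len);
    s->data = value;
    qp->cq[qp->cq_len++] = s;
}

/* Read-function sketch. Every passed buffer is posted to the RQ; for each
 * available completion, the passed pointer is replaced by the filled buffer.
 * Precondition: the RQ has room for cnt more buffers. */
static unsigned ib_read_sketch(struct fake_qp *qp, struct sample *smps[],
                               unsigned cnt)
{
    unsigned ret = 0;

    assert(qp->rq_len + cnt <= DEPTH);
    for (unsigned i = 0; i < cnt; i++) {
        qp->rq[qp->rq_len++] = smps[i];
        if (qp->cq_len > 0) {
            smps[i] = pop_front(qp->cq, &qp->cq_len);
            ret++;
        }
    }
    return ret; /* smps[0..ret-1] now point to buffers with valid data */
}

/* Two read cycles: the first only posts buffers, the second yields data. */
int ib_read_demo(void)
{
    struct fake_qp qp = { .rq_len = 0, .cq_len = 0 };
    struct sample a = { 0 }, b = { 0 }, c = { 0 }, d = { 0 };
    struct sample *smps[2] = { &a, &b };
    unsigned ret;

    ret = ib_read_sketch(&qp, smps, 2); /* nothing received yet */
    assert(ret == 0 && qp.rq_len == 2);

    fake_hca_receive(&qp, 42);          /* the "HCA" fills buffer a */

    smps[0] = &c;
    smps[1] = &d;
    ret = ib_read_sketch(&qp, smps, 2);
    assert(ret == 1 && smps[0] == &a && smps[0]->data == 42);
    assert(qp.rq_len == 3);             /* b, c, and d remain posted */
    return (int) ret;
}
\end{lstlisting}
In the second cycle only \texttt{smps[0]} carries valid data; \texttt{smps[1]} still points to a buffer that is owned by the queue and must therefore not be released.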
\paragraph{Write-function} The write-function, depicted in \autoref{fig:villas_write_iba}, has to adhere to similar conventions as the read-function in order to realize zero-copy. Again, the addresses of the samples are passed to the node as arguments of the write-function, to be subsequently submitted to the \gls{sq}. The \gls{hca} will then process the submitted work requests and take care of the necessary memory operations.
\begin{figure}[ht!]
\vspace{-0.5cm}
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=1]{images/villas_write_iba.pdf}
\vspace{-0.8cm}
\caption{Invoking the write-function.}\label{fig:villas_write_iba_a}
\end{subfigure}
\hspace*{\fill} % separation between the subfigures
\begin{subfigure}{0.49\textwidth}
\includegraphics[width=\linewidth, page=2]{images/villas_write_iba.pdf}
\vspace{-0.8cm}
\caption{Return of the write-function.}\label{fig:villas_write_iba_b}
\end{subfigure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\linewidth]{images/villas_write_iba_legend.pdf}
\end{subfigure}
\vspace{-1.5cm}
\caption{A depiction of the working principle of the write-function in an \textit{InfiniBand} node. The \acrshort{sq} is part of a complete \acrshort{qp}, but the \acrshort{rq} is omitted for the sake of simplicity.} \label{fig:villas_write_iba}
\end{figure}
When the pointers are successfully submitted to the \gls{sq}, the function shall return the total number of submitted pointers \textit{ret}. If the completion queue is empty, none of these pointers may be released because the HCA has yet to access the memory locations. If the completion queue contains entries, that means that previously submitted send \glspl{wr} are finished; these pointers can be released. So, in order to release them, the initial pointers to the data to be sent (light gray in \autoref{fig:villas_write_iba}) are replaced by pointers to buffers which were submitted to the \gls{sq} in a previous call of the write-function. The super-node has to be notified that it must only decrease the reference counter of pointers that were yielded by the \glspl{cqe}.
\subsection{Proposal for a new read- and write-function\label{sec:proposal}}
Evidently, the major shortcoming of the functions from \autoref{lst:read_write_original} is the lack of an interface to pass the number of samples to be released to the super-node. There is no way the super-node can predict how many samples may be released; this becomes even more difficult if it is taken into account that some samples may be sent inline---thus can be released immediately after submitting the \gls{wr}---and that some work requests may not be successfully submitted to the \gls{sq}.
Therefore, new parameters for the read- and write-function are proposed in \autoref{lst:read_write_proposal}. The additional parameter in each function lets a node decide how many items of \texttt{smps[]} should actually be released. The several distinctions which have to be considered are further elaborated upon in \autoref{sec:villas_implementation}.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=Proposal for an additional parameter in \texttt{read()} and \texttt{write()}.,
label=lst:read_write_proposal,
style=customc]{listings/read_write_proposal.h}
\vspace{-0.2cm}
\end{figure}
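How a super-node could use such an additional parameter is sketched below. The out-parameter name \texttt{release}, the \texttt{write\_fn} type, and both example write-functions are assumptions made for illustration; the sketch only shows that the caller releases as many samples as the node reports instead of always releasing all \texttt{cnt} samples.
\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stddef.h>

struct node; /* opaque in this sketch */
struct sample { int refcnt; };

/* Assumed shape of the extended write-function. */
typedef int (*write_fn)(struct node *n, struct sample *smps[],
                        unsigned cnt, unsigned *release);

/* Node that copies the data before returning: all samples may be released. */
static int copying_write(struct node *n, struct sample *smps[],
                         unsigned cnt, unsigned *release)
{
    (void) n; (void) smps;
    *release = cnt;
    return (int) cnt;
}

/* Zero-copy node with an empty completion queue: the hardware still has to
 * access the buffers, so none of them may be released yet. */
static int zero_copy_write(struct node *n, struct sample *smps[],
                           unsigned cnt, unsigned *release)
{
    (void) n; (void) smps;
    *release = 0;
    return (int) cnt;
}

/* Super-node side: release only as many samples as the node reports. */
static int node_write_sketch(write_fn write, struct node *n,
                             struct sample *smps[], unsigned cnt)
{
    unsigned release = cnt; /* default equals the original behavior */
    int ret = write(n, smps, cnt, &release);

    for (unsigned i = 0; i < release && i < cnt; i++)
        smps[i]->refcnt--;
    return ret;
}

/* Returns 0 if the reference counters end up as expected. */
int release_demo(void)
{
    struct sample a = { 1 }, b = { 1 };
    struct sample *smps[2] = { &a, &b };
    int ret;

    ret = node_write_sketch(zero_copy_write, NULL, smps, 2);
    assert(ret == 2 && a.refcnt == 1 && b.refcnt == 1); /* kept alive */

    ret = node_write_sketch(copying_write, NULL, smps, 2);
    assert(ret == 2 && a.refcnt == 0 && b.refcnt == 0); /* released */
    return a.refcnt + b.refcnt;
}
\end{lstlisting}
Initializing \texttt{release} to \texttt{cnt} before the call means that a node-type which never touches the parameter keeps the original semantics.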
\section{Memory management\label{sec:memorymanagement}}
Originally, memory that was allocated within the framework could be allocated with a fixed set of settings called \textit{memory-types}. The VILLASnode internal \texttt{alloc()} could be called, for example, with \texttt{memory\_hugepage}, which pins memory and maps it to hugepages (see \autoref{sec:mem_optimization}), or with \texttt{memory\_heap}, which allocates aligned memory on the heap. These embedded memory-types are not sufficient for the \textit{InfiniBand} node-type. \Autoref{sec:requirements} already showed that the \gls{hca} will directly access the memory that is allocated by the super-node. Thus, as follows from \autoref{sec:memory}, the buffer must be registered with a memory region and the \glspl{wr} that are submitted to either queue of the \gls{qp} must contain the local key.
Since embedding a memory-type for every node-type in the VILLASnode source code would go against the principle of modularity, this is not an option. Consequently, the most obvious solution is to allow every node-type to register its own memory-type if necessary. In that way, every node-type can exactly define what the \texttt{alloc()} and \texttt{free()} functions implement. For \texttt{alloc()}, a node-type can, for example, define how memory should be allocated, whether the pages should be aligned, how big the pages should be, and if the memory should be registered with a memory region. It is also possible for a node-type to implement certain functions which interact with the memory that is allocated by the memory-type; this can, for example, be used within the \textit{InfiniBand} node to acquire the local key of a sample that is passed as an argument of the read- or write-function.
With this method, every node-type may define a \texttt{memory\_type} C structure, which it must register in the same fashion as it registers the interface functions with the super-node (line 39, \autoref{lst:struct_nodetype}). By enabling node-types to register their own memory-type, the super-node knows what type of memory to use for input and/or output buffers that are connected to nodes of this type ($q_{i,x}$ and $q_{o,x}$ in \autoref{fig:villasnode}).
If no memory-type is specified, the super-node will assume \texttt{memory\_hugepage}.
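The selection of a memory-type, including the fallback, can be sketched as follows. The structures and function names are hypothetical, and both allocators are backed by \texttt{malloc()} purely as a stand-in: a real hugepage memory-type would map hugepages, and a real \textit{InfiniBand} memory-type would additionally register the buffer with a memory region.
\begin{lstlisting}[style=customc]
#include <assert.h>
#include <stdlib.h>

/* Hypothetical memory-type: a named pair of allocation functions. */
struct memory_type_sketch {
    const char *name;
    void *(*alloc)(size_t len);
    int (*dealloc)(void *ptr); /* corresponds to free() in the text */
};

/* Stand-in allocator shared by both memory-types of this sketch. */
static void *stub_alloc(size_t len)
{
    return malloc(len);
}

static int stub_dealloc(void *ptr)
{
    free(ptr);
    return 0;
}

static const struct memory_type_sketch memory_hugepage_sketch = {
    "hugepage", stub_alloc, stub_dealloc
};

static const struct memory_type_sketch memory_ib_sketch = {
    "infiniband", stub_alloc, stub_dealloc
};

/* Hypothetical node-type excerpt: memory_type is NULL if none is registered. */
struct node_type_mem_sketch {
    const char *name;
    const struct memory_type_sketch *memory_type;
};

/* Super-node side: pick the memory-type for the buffers of a node. */
static const struct memory_type_sketch *
buffer_memory_type(const struct node_type_mem_sketch *t)
{
    return t->memory_type != NULL ? t->memory_type : &memory_hugepage_sketch;
}

/* Returns 0 if selection, allocation, and deallocation succeed. */
int memory_type_demo(void)
{
    struct node_type_mem_sketch file = { "file", NULL };
    struct node_type_mem_sketch ib = { "infiniband", &memory_ib_sketch };
    const struct memory_type_sketch *m = buffer_memory_type(&ib);
    void *buf;

    assert(buffer_memory_type(&file) == &memory_hugepage_sketch);
    assert(m == &memory_ib_sketch);

    buf = m->alloc(64);
    if (buf == NULL)
        return -1;
    return m->dealloc(buf);
}
\end{lstlisting}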
\section{VILLASnode finite-state machine\label{sec:villas_fsm}}
Initially, a node could reside in one of the six states displayed in \autoref{lst:states}. The super-node transitions the node through the states depending on the results of functions from \autoref{a:nodetype_functions}. For example, when the super-node calls a node's start-function, the transition \textit{checked}$\,\to\,$\textit{started} is performed if the function returns successfully.
\begin{figure}[ht!]
\vspace{0.5cm}
\lstinputlisting[caption=The six states a node could originally reside in.,
label=lst:states,
style=customc]{listings/states.h}
\vspace{-0.2cm}
\end{figure}
These states were sufficient for the node-types which existed up to now (\autoref{tab:villasnode_nodes}); when a node resided in \textit{started}, this meant it was ready to send and receive data. This is not the case for node-types that are based on (descendants of) the Virtual Interface Architecture. Here, a node can be initiated---for which the \textit{started} state can be used---but not connected and thus not able to send data to another node. Accordingly, the introduction of a new state \textit{connected} would be appropriate. Furthermore, architectures that are based on the \gls{via} rely on descriptors (called work requests in the \gls{iba}) in a send and receive queue. Hence, in order to be able to receive data directly after the connection has been established, descriptors have to be present in the \gls{rq} at this moment. For this reason, in (descendants of) the \gls{via}, it is possible to prepare elements in the receive queue prior to the actual connection.
These considerations yield the finite-state machine in \autoref{fig:villasnode_states}. The states which are indicated with dashed borders, \textit{pending connect} and \textit{connected}, may be set by the node after the super-node transitioned the instance to the \textit{started} state. The use of both states is not mandatory. If a node is in one of these two states, the super-node interprets it as if the node were in the \textit{started} state. However, they can be used within the node itself to distinguish between a node being started, being in a pending connect state, or actually being connected. This state machine shows similarities with the \gls{via}'s finite-state machine in \autoref{fig:via_diagram}. It can therefore be used for future node-types that are based on the \gls{via}---other than the \textit{InfiniBand} node-type that is presented in the present work---as well.
Although it is necessary to execute the transition \textit{checked}$\,\to\,$\textit{started}, it is possible to transition to \textit{stopped} and \textit{destroyed} from any of the three states in the dashed square.
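The way the super-node treats the two additional states can be sketched in a few lines. The enumeration below is hypothetical: it contains only the states that are named in this section, not the complete set of the original listing, and the function names are chosen for illustration.
\begin{lstlisting}[style=customc]
#include <assert.h>

/* Hypothetical excerpt of the node states named in this section. */
enum node_state_sketch {
    STATE_CHECKED,
    STATE_STARTED,
    STATE_PENDING_CONNECT, /* optional, set by the node itself */
    STATE_CONNECTED,       /* optional, set by the node itself */
    STATE_STOPPED,
    STATE_DESTROYED
};

/* Super-node view: both optional states are interpreted as started. */
static enum node_state_sketch super_node_view(enum node_state_sketch s)
{
    switch (s) {
    case STATE_PENDING_CONNECT:
    case STATE_CONNECTED:
        return STATE_STARTED;
    default:
        return s;
    }
}

/* A node may be stopped from any of the three started-like states. */
static int can_stop(enum node_state_sketch s)
{
    return super_node_view(s) == STATE_STARTED;
}

/* Returns 0 if the checks below hold. */
int state_demo(void)
{
    assert(super_node_view(STATE_CONNECTED) == STATE_STARTED);
    assert(can_stop(STATE_STARTED));
    assert(can_stop(STATE_PENDING_CONNECT));
    assert(!can_stop(STATE_CHECKED)); /* checked -> started comes first */
    return 0;
}
\end{lstlisting}
Because the mapping is confined to one function, node-types that never set the optional states are unaffected by their introduction.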
\begin{figure}[ht]
\vspace{-0.65cm}
\hspace{0.4cm}
\includegraphics{images/villasnode_states.pdf}
\vspace{-0.45cm}
\caption{The VILLASnode state diagram with the two newly introduced states \textit{pending connect} and \textit{connected}.}
\label{fig:villasnode_states}
\end{figure}

829
chapters/basics.tex Normal file
View File

@ -0,0 +1,829 @@
\chapter{Basics\label{chap:basics}}
The first section of this chapter (\ref{sec:via}) introduces the Virtual Interface Architecture, of which the InfiniBand Architecture is a descendant. After this brief introduction to InfiniBand's origins, \autoref{sec:infiniband} is completely devoted to the InfiniBand Architecture itself. Subsequently, \autoref{sec:iblibs} introduces the software libraries that are used to operate the InfiniBand hardware in the present work's benchmarks and in the implementation of the VILLASnode \textit{InfiniBand} node-type. Finally, \autoref{sec:optimizations} goes on to discuss real-time optimizations in Linux, which is the operating system VILLASnode is most frequently operated on.
\section{The Virtual Interface Architecture\label{sec:via}}
InfiniBand is rooted in the \gls{via}~\cite{pfister2001introduction}, which was originally introduced by Compaq, Intel, and Microsoft~\cite{compaq1997microsoft}. Although InfiniBand does not completely adhere to the original \gls{via} specifications, it is important to understand its basics. In that way, some design decisions in the InfiniBand Architecture will be more comprehensible. This section will therefore elaborate on the characteristics of the \gls{via}\@.
The lion's share of the Internet protocol suite, also known as \acrshort{tcpip}, is implemented by the \gls{os}~\cite{kozierok2005tcp}. Even though the concept of the \acrshort{tcpip} stack allows the interface between a \gls{nic} and an \gls{os} to be relatively simple, a drawback is that the \gls{nic} is not directly accessible for consumer processes, but only over this stack. Since the \acrshort{tcpip} stack resides in the operating system's kernel, communication operations result in \textit{trap} machine instructions (or on more recent x86 architectures: \textit{sysenter} instructions), which cause the \gls{cpu} to switch from user to kernel mode~\cite{kerrisk2010linux}. This back-and-forth between both modes is relatively expensive and thus adds a certain amount of latency to the communication operation that caused the switch. Furthermore, since the \acrshort{tcpip} stack also includes reliability protocols and the (de)multiplexing of the \gls{nic} to processes, the operating system has to take care of these rather expensive tasks as well~\cite{kozierok2005tcp}. \Autoref{sec:motivation} already described Larsen and Huggahalli's~\cite{larsen2009architectural} research on the proportions of the latency in the Internet protocol suite. This overhead resulted in the need---and thus the development---of a new architecture which would provide each process with a directly accessible interface to the \gls{nic}\@: the Virtual Interface Architecture was born.
In their publication, Dunning et al.~\cite{dunning1998virtual} describe that the most important characteristics of the \gls{via} are:
\begin{itemize}
\setlength\itemsep{-0.2em}
\item data transfers are realized through zero-copy;
\item system calls are avoided whenever possible;
\item the \gls{nic} is not multiplexed between processes by a driver;
\item the number of instructions needed to initiate data transport is minimized;
\item no interrupts are required when initiating or completing data transport;
\item there is a simple set of instructions for sending and receiving data;
\item it can both be mimicked in software and synthesized to hardware.
\end{itemize}
Accordingly, several tasks which are handled in software in the Internet protocol suite---e.g., multiplexing the \gls{nic} to processes, data transfer scheduling, and preferably reliability of communication---must be handled by the \gls{nic} in the \gls{via}\@.
\subsection{Basic components}
A model of the \gls{via} is depicted in \autoref{fig:via_model}. At the top of the stack are the processes and applications that want to communicate over the network controller. Together with \gls{os} communication protocols and a special set of instructions which are called the \textit{\acrshort{vi} User Agent}, they form the \textit{\acrshort{vi} Consumer}. The VI consumer is colored light gray in \autoref{fig:via_model} and resides completely in the operating system's user space. The user agent provides the upper layer applications and communication protocols with an interface to the \textit{\acrshort{vi} Provider} and a direct interface to the \glspl{vi}.
\begin{figure}[ht]
\hspace{0.4cm}
\includegraphics{images/via_model.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{via} model.}\label{fig:via_model}
\end{figure}
The VI provider, colored dark gray in \autoref{fig:via_model}, is responsible for the instantiation of the virtual interfaces and completion queues, and consists of the \textit{kernel agent} and the \gls{nic}\@. In the \gls{via}, the \gls{nic} implements and manages the virtual interfaces and completion queues---which will both be further elaborated upon in \autoref{sec:data_transfer}---and is responsible for performing data transfers. The kernel agent is part of the operating system and is responsible for resource management, e.g., creation and destruction of \glspl{vi}, management of memory used by the \gls{nic}, and interrupt management. Although communication between consumer and kernel agent requires switches between user and kernel mode, this does not influence the latency of data transfers because no data is actually transferred via this interface.
\subsection{Data transfer\label{sec:data_transfer}}
One of the most distinctive elements of the \gls{via}, compared to the Internet protocol suite, is the \acrfull{vi}. Because of this direct interface to the \gls{nic}, each process assumes that it owns the interface and there is no need for system calls when performing data transfers. Each virtual interface consists of a send and a receive work queue which can hold \textit{descriptors}. These contain all information necessary to transfer data, for example, destination addresses, transfer mode to be used, and the location of data to be transferred in the main memory. Hence, both send and receive data transfers are initiated by writing a descriptor memory structure to a \gls{vi}, and subsequently notifying the VI provider about the submitted structure. This notification happens with the help of a \textit{doorbell}, which is directly implemented in the \gls{nic}\@. As soon as the \gls{nic}'s doorbell has been rung, it starts to asynchronously process the descriptors.
When a transfer has been completed, successfully or with an error, the descriptors are marked by the \gls{nic}\@. Usually, it is the consumer's responsibility to remove completed descriptors from the work queues. Alternatively, on creation, a \gls{vi} can be bound to a \gls{cq}. Then, notifications on completed transfers are directed to this queue. A \gls{cq} has to be bound to at least one work queue. Conversely, completion notifications of several work queues can be directed to one single completion queue. Hence, in an environment with $N$ virtual interfaces, each with two work queues, there can be
\begin{equation}
0 \leq M \leq 2\cdot N
\end{equation}
completion queues.
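The interplay of descriptors, doorbell, and completion queue can be illustrated with a minimal Python sketch. It is a toy model and not part of the \gls{via} specification: the names \texttt{Descriptor}, \texttt{WorkQueue}, \texttt{ring\_doorbell}, and \texttt{reap} are hypothetical, and the simulated \gls{nic} processes descriptors synchronously, whereas a real \gls{nic} works asynchronously. The example binds all $2 \cdot N$ work queues of $N = 2$ virtual interfaces to one shared completion queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    """Hypothetical descriptor: destination and data of one transfer."""
    destination: str
    payload: bytes
    status: str = "pending"  # the simulated NIC marks it "done"


class WorkQueue:
    """Toy send or receive work queue of a virtual interface."""

    def __init__(self, name, completion_queue=None):
        self.name = name
        self.completion_queue = completion_queue  # optional binding to a CQ
        self.descriptors = deque()

    def post(self, descriptor):
        """Write a descriptor to the work queue."""
        self.descriptors.append(descriptor)

    def ring_doorbell(self):
        """Notify the simulated NIC, which marks all pending descriptors."""
        for descriptor in self.descriptors:
            if descriptor.status == "pending":
                descriptor.status = "done"
                if self.completion_queue is not None:
                    # Direct the completion notification to the bound CQ.
                    self.completion_queue.append(self.name)

    def reap(self):
        """Consumer side: remove and return the completed descriptors."""
        done = [d for d in self.descriptors if d.status != "pending"]
        self.descriptors = deque(
            d for d in self.descriptors if d.status == "pending")
        return done


# N = 2 virtual interfaces, i.e., 2 * N = 4 work queues sharing M = 1 CQ.
N = 2
shared_cq = deque()
work_queues = [WorkQueue(f"vi{i}.{kind}", shared_cq)
               for i in range(N) for kind in ("send", "recv")]
for wq in work_queues:
    wq.post(Descriptor("node-b", b"hello"))
    wq.ring_doorbell()
```

After the loop, the shared completion queue holds one notification per work queue, so the consumer only has to poll a single queue instead of four.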
The Virtual Interface Architecture supports two asynchronously operating data transfer models: the \textit{send and receive messaging} model and the \gls{rdma} model. The characteristics of both models are described below.
\paragraph{Send and receive messaging model (channel semantics)} This model is the concept behind various other popular data transfer architectures. First, a receiving node explicitly specifies where data which will be received shall be saved in its local memory. In the \gls{via}, this is done by submitting a descriptor to the receive work queue. Subsequently, a sending node specifies the address of the data to be sent to that receiving node in its own memory. This location is then submitted to its send work queue, analogous to the procedure for the receive work queue.
\paragraph{Remote Direct Memory Access model (memory semantics)} This approach is less well known. When using the \gls{rdma} model, one node, the active node, specifies both the local and the remote memory region. There are two possible operations in this model: \textit{\gls{rdma} write} and \textit{\gls{rdma} read}. In the former, the active node specifies a local memory region which contains data to be sent and a remote memory region to which the data shall be written. In the latter, the active node specifies a remote memory region which contains data it wants to acquire and a local memory region to which the data shall be written. To initiate an \gls{rdma} transfer, the active node has to specify the local and remote memory addresses and the operation mode in a descriptor and submit it to the send work queue. The operating system and software on the passive node are not aware of either \gls{rdma} operation. Hence, there is no need to submit descriptors to the receive work queue at the passive side.
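The difference between the two models can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each node's memory is modeled as a \texttt{bytearray}, addresses are plain offsets, all offsets are assumed to lie within the memory, and the names \texttt{Node}, \texttt{send}, \texttt{rdma\_write}, and \texttt{rdma\_read} are hypothetical. With channel semantics, the receiver must have posted a receive descriptor; with memory semantics, only the active node is involved.

```python
from collections import deque


class Node:
    """Toy node: local memory plus a receive work queue."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.receive_queue = deque()  # receive descriptors: (offset, length)

    def post_receive(self, offset, length):
        """Tell the NIC where the next incoming message shall be stored."""
        self.receive_queue.append((offset, length))


def send(sender, src, length, receiver):
    """Channel semantics: the receiver decides where the data lands."""
    if not receiver.receive_queue:
        raise RuntimeError("no receive descriptor posted")
    dst, capacity = receiver.receive_queue.popleft()
    if length > capacity:
        raise RuntimeError("receive buffer too small")
    receiver.memory[dst:dst + length] = sender.memory[src:src + length]


def rdma_write(active, local, passive, remote, length):
    """Memory semantics: the active node names both memory regions."""
    passive.memory[remote:remote + length] = active.memory[local:local + length]


def rdma_read(active, local, passive, remote, length):
    """Memory semantics: fetch remote data into local memory."""
    active.memory[local:local + length] = passive.memory[remote:remote + length]


a, b = Node(16), Node(16)
a.memory[0:4] = b"ping"

b.post_receive(offset=8, length=4)        # receiver prepares first
send(a, src=0, length=4, receiver=b)      # then the sender transmits

rdma_write(a, local=0, passive=b, remote=0, length=4)  # b posts nothing
rdma_read(a, local=12, passive=b, remote=8, length=4)  # b posts nothing
```

Note that the passive node \texttt{b} never calls \texttt{post\_receive} for the two \gls{rdma} operations, which mirrors the statement that its receive work queue is not involved.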
\subsection{The virtual interface finite-state machine}
The original \gls{via} proposal defines four states in which a virtual interface can reside: \textit{idle}, \textit{pending connect}, \textit{connected}, and \textit{error}. Transitions between states are handled by the VI provider and are invoked by the VI consumer or events on the network. The four states and all possible state transitions are depicted in the finite-state machine in \autoref{fig:via_diagram}. A short clarification on every state is given in the list below:
\begin{itemize}
\setlength\itemsep{0.2em}
\item \textbf{\textit{Idle}}: A \gls{vi} resides in this state after its creation and before it gets destroyed. Receive descriptors may be submitted but will not be processed. Send descriptors will immediately complete with an error.
\item \textbf{\textit{Pending connect}}: An active \gls{vi} can move to this state by invoking a connection request to a passive \gls{vi}\@. A passive \gls{vi} will transition to this state when it attempts to accept a connection. In both cases, it stays in this state until the connection is completely established. If the connection request times out, the connection is rejected, or if one of the \glspl{vi} disconnects, the \gls{vi} will return to the \textit{idle} state. If a hardware or transport error occurs, a transition to the \textit{error} state will be made. Descriptors which are submitted to either work queue in this state are treated in the same fashion as they are in the \textit{idle} state.
\item \textbf{\textit{Connected}}: A \gls{vi} resides in this state if a connection request it has submitted has been accepted or after it has successfully accepted a connection request. The \gls{vi} will transition to the \textit{idle} state if it itself or the remote \gls{vi} disconnects. It will transition to the \textit{error} state on hardware errors, transport errors, or, depending on the reliability level of the connection, other connection-related errors. All descriptors which have been submitted in previous states and did not result in an immediate error, as well as all descriptors which are submitted in this state, are processed.
\item \textbf{\textit{Error}}: If the \gls{vi} transitions to this state, all descriptors present in both work queues are marked as erroneous. The VI consumer must handle the error, transition the \gls{vi} to the \textit{idle} state, and restart the connection if desired.
\end{itemize}
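The transitions named in the list above can be written down as a small transition table. The following Python sketch encodes only the transitions that are described in the text; the event names (e.g., \texttt{connect\_request} or \texttt{reset}) are hypothetical labels chosen for this illustration and do not stem from the \gls{via} specification.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    PENDING_CONNECT = "pending connect"
    CONNECTED = "connected"
    ERROR = "error"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (State.IDLE, "connect_request"): State.PENDING_CONNECT,  # active VI
    (State.IDLE, "connect_accept"): State.PENDING_CONNECT,   # passive VI
    (State.PENDING_CONNECT, "established"): State.CONNECTED,
    (State.PENDING_CONNECT, "timeout"): State.IDLE,
    (State.PENDING_CONNECT, "rejected"): State.IDLE,
    (State.PENDING_CONNECT, "disconnect"): State.IDLE,
    (State.PENDING_CONNECT, "error"): State.ERROR,
    (State.CONNECTED, "disconnect"): State.IDLE,
    (State.CONNECTED, "error"): State.ERROR,
    (State.ERROR, "reset"): State.IDLE,  # consumer has handled the error
}


def step(state, event):
    """Return the next state or raise if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value!r}"
        ) from None


# An active VI connects, hits a transport error, and is reset by the consumer.
state = State.IDLE
for event in ("connect_request", "established", "error", "reset"):
    state = step(state, event)
```

Keeping the transitions in a table makes it easy to check that, for instance, a \gls{vi} cannot jump from \textit{idle} to \textit{connected} without passing through \textit{pending connect}.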
\begin{figure}[ht]
\hspace{0.5cm}
\includegraphics{images/via_states.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{via} state diagram.}\label{fig:via_diagram}
\end{figure}
\section{The InfiniBand Architecture\label{sec:infiniband}}
After a brief introduction to the Virtual Interface Architecture in \autoref{sec:via}, this section will further elaborate upon \gls{ib}. Because the \gls{via} is an abstract model, the purpose of the previous section was not to provide the reader with its exact specification, but rather to give them a general idea of the \gls{via} design decisions. Since the exact implementation of various parts of the Virtual Interface Architecture is left open, the \gls{iba} does not completely correspond to the \gls{via}\@. Therefore, a more comprehensive breakdown of the \gls{iba} will be given in this section.
The \gls{ibta} was founded by more than 180 companies in August 1999 to create a new industry standard for inter-server communication. After 14 months of work, this resulted in a collection of manuals of which the first volume describes the architecture~\cite{infinibandvol1} and the second the physical implementation of InfiniBand~\cite{infinibandvol2}. In addition, Pfister~\cite{pfister2001introduction} wrote an excellent summary of the \gls{iba}.
\subsection{Basics of the InfiniBand Architecture\label{sec:iba}}
\paragraph{Network stack}
Like most modern network technologies, the \gls{iba} can be described as a network stack, which is depicted in \autoref{fig:iba_network_stack}. The stack consists of a physical, link, network, and transport layer.
\begin{figure}[ht!]
\includegraphics{images/network_stack.pdf}
\caption{The network stack of the \acrfull{iba}.}\label{fig:iba_network_stack}
\end{figure}
The \gls{iba} implementations of the different layers are displayed in the right column of \autoref{fig:iba_network_stack}. Although the present work attempts to separate the different layers into different subsections, some features cannot be explained without referring to features in other layers. Hence, the subsections do not directly correspond with the different layers.
First, this subsection gives some basic definitions for InfiniBand. It also includes some information about segmentation \& reassembly of messages (although that is part of the transport layer). The main component of the transport layer, the queue pair, is presented in \autoref{sec:qp}. That subsection also points out some similarities and differences between the \gls{via} and the \gls{iba}\@. Then, after the basics of the \gls{iba} subnet, the subnet manager, and managers in general are described in \autoref{sec:networking}, inner subnet routing and subnet routing will be elaborated upon in \autoref{sec:addressing}. Subsequently, \autoref{sec:vlandsl} clarifies InfiniBand's virtual lanes and service levels. \Autoref{sec:congestioncontrol} and~\ref{sec:memory} go further into flow control and memory management in the \gls{iba}, respectively. Finally, \autoref{sec:communication_management} explains how communication is established, managed, and destroyed.
An overview of the implementation of the physical link will not be given in the present work. The technical details on this can be found in the second volume of the InfiniBand\texttrademark~Architecture Specification~\cite{infinibandvol2}. The implementation of consumer operations will be elaborated upon later, in \autoref{sec:iblibs}.
\paragraph{Message segmentation} Communication on InfiniBand networks is divided into messages between \SI{0}{\byte} and $\SI[parse-numbers=false]{2^{31}}{\byte}$ (\SI{2}{\gibi\byte}) for all service types, except for unreliable datagram. Depending on the \gls{mtu}, the latter supports messages between \SI{0}{\byte} and \SI{4096}{\byte}.
Messages that are bigger than the \gls{mtu}, which describes the maximum size of a packet, are segmented into smaller packets by \gls{ib} hardware. Depending on the hardware that is used, the \gls{mtu} can be 256, 512, 1024, 2048, or \SI{4096}{\byte}. Since the segmentation and reassembly of messages are handled by hardware, the \gls{mtu} should not affect performance~\cite{crupnicoff2005deploying}. \Autoref{fig:message_segmentation} depicts the principle of breaking a message down into packets. An exact breakdown of the composition of packets will be described in \autoref{sec:addressing}.
\begin{figure}[ht!]
\includegraphics{images/message_segmentation.pdf}
\vspace{-0.5cm}
\caption{The segmentation of a message into packets.}\label{fig:message_segmentation}
\end{figure}
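The segmentation principle can be made concrete with a short Python sketch. It only models the payload: packet headers are ignored, the helper names \texttt{segment} and \texttt{reassemble} are hypothetical, and the treatment of a zero-length message as a single empty packet is an assumption of this sketch. In a real network, this work is done by the channel adapter hardware, not by software.

```python
VALID_MTUS = (256, 512, 1024, 2048, 4096)  # MTU sizes in bytes


def segment(message, mtu):
    """Split a message into packet payloads of at most `mtu` bytes."""
    if mtu not in VALID_MTUS:
        raise ValueError(f"unsupported MTU: {mtu}")
    if not message:
        return [b""]  # assumption: a zero-length message is one empty packet
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


def reassemble(packets):
    """Concatenate in-order packet payloads into the original message."""
    return b"".join(packets)


message = bytes(5000)
packets = segment(message, mtu=2048)
sizes = [len(p) for p in packets]  # two full packets and one remainder
```

For a message of \SI{5000}{\byte} and an \gls{mtu} of \SI{2048}{\byte}, this yields two full packets and a last packet that carries the remaining \SI{904}{\byte}.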
\paragraph{Endnodes and channel adapters} Ultimately, all communication on an InfiniBand network happens between \textit{endnodes} (also referred to as nodes in the present work). Such an endnode could be a host computer, but also, for example, a storage system.
A \gls{ca} forms the interface between the software and hardware of an endnode and the physical link which connects the endnode to a network. A channel adapter can either be a \gls{hca} or a \gls{tca}. The former is most commonly used, and distinguishes itself from the latter by implementing so-called \textit{verbs}. Verbs form the interface between processes on a host computer and the InfiniBand fabric; they are the implementation of the user agent from \autoref{fig:via_model}.
\paragraph{Service types} InfiniBand supports several types of communication services which are introduced in \autoref{tab:service_types}. Every channel adapter must implement \gls{ud}, which is conceptually comparable to \gls{udp}\@. \glspl{hca} must implement \glspl{rc}; this is optional for \glspl{tca}. The reliable connection is similar to \gls{tcp}\@. Neither of the channel adapter types is required to implement \glspl{uc} and \gls{rd}.
\Autoref{tab:service_types} describes the service types at a very abstract level. More information on the implementation, for example, on the different headers which are used in \gls{iba} data packets, will be given later on. Furthermore, \autoref{tab:service_types} already contains references to the abbreviation \acrshort{qp}, which stands for queue pair and is InfiniBand's equivalent to a virtual interface (\autoref{sec:via}). This will be elaborated upon in the next subsection.
\input{tables/service_types}
\subsection{Queue pairs \& completion queues\label{sec:qp}}
As mentioned before, the InfiniBand Architecture is inspired by the Virtual Interface Architecture. \Autoref{fig:iba_model}, which is derived from \autoref{fig:via_model}, depicts an abstract model of the InfiniBand Architecture. In order to simplify this picture, the consumer and kernel agent are omitted. In the following, the functioning principle of this model will be explained.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/iba_model.pdf}
\vspace{-0.5cm}
\caption{The \acrfull{iba} model.}\label{fig:iba_model}
\end{figure}
Virtual interfaces are called \glspl{qp} in the \gls{iba} and also consist of \glspl{sq} and \glspl{rq}. They are the highest level of abstraction and enable processes to directly communicate with the \gls{hca}\@. After everything has been initialized, a process will perform most operations on queue pairs while communicating over an InfiniBand network.
Similarly to a descriptor in the \gls{via}, a \gls{wr} has to be submitted to the send or receive queue in order to send or receive messages. Submitting a \gls{wr} results in a \gls{wqe} in the respective queue. Among other things, a \gls{wqe} holds the address of a location in the host's main memory. In case of a send \gls{wqe}, this memory location contains the data to be sent to a remote host. In case of a receive \gls{wqe}, the contained memory address points to the location in the main memory to which received data shall be written. Not every \gls{qp} can access all memory locations; this protection is handled by specific memory management mechanisms. These also control which locations may be accessed by the remote hosts and by the \gls{hca}\@. More information on memory management can be found in \autoref{sec:memory}.
A work queue element in the send queue also contains the network address of the remote endnode and the transfer model, e.g., the send messaging model or an \gls{rdma} model. Besides initiating data transmissions, a work request can also be used to bind a memory window to a memory region. This is further elaborated upon in \autoref{sec:memory}. A more comprehensive overview of the composition of \glspl{wr} in general will be provided in \autoref{sec:iblibs}.
\begin{figure}[ht!]
\hspace{0.4cm}
\includegraphics{images/qp_communication.pdf}
\vspace{-0.5cm}
\caption{Three \acrfullpl{sq} on a sending node communicate with three \acrfullpl{rq} on a receiving node. Both nodes have both a send and a receive queue, but the unused queues have been omitted for the sake of clarity.}\label{fig:qp_communication}
\end{figure}
\paragraph{Example} \Autoref{fig:qp_communication} shows an example with three queue pairs in one node (the \textit{sending node}) that communicate with three queue pairs of another node (the \textit{receiving node}). Note that a queue pair is always initialized with a send and a receive queue; for the sake of clarity, the unused queues have been omitted in this depiction. Hence, the image shows no receive queues for the sending node and no send queues for the receiving node.
First, before any message can be transmitted between the two nodes, the receiving node has to prepare receive \glspl{wqe} by submitting receive work requests to the receive queues. Every receive \gls{wr} includes a pointer to a local memory region, which provides the \gls{hca} with a memory location to save received messages to. In the picture, the consumer is submitting a \gls{wr} to the red receive queue.
Secondly, send work requests may be submitted, which will then be processed by the channel adapter. Although the processing order of the queues depends on the priority of the services (\autoref{sec:vlandsl}), on congestion control (\autoref{sec:congestioncontrol}), and on the manufacturer's implementation of the \gls{hca}, \glspl{wqe} in a single queue will always obey the \gls{fifo} principle. In this image, the consumer is submitting a send work request to the red send queue, and the \gls{hca} is processing a \gls{wqe} from the blue send queue.
After the \gls{hca} has processed a \gls{wqe}, it places a \gls{cqe} in the completion queue. This entry contains, among other things, information about the \gls{wqe} which was processed and about the status of the operation. The status could indicate a successful transmission, but also an error, e.g., if insufficient receive work queue elements were available in the receive queue. A \gls{cqe} is posted when a \gls{wqe} is completely processed, so the exact moment at which it is posted depends on the service type that is used. For example, if the service type is unreliable, the \gls{wqe} will be completed as soon as the channel adapter has processed it and sent the data. However, if a reliable service type is used, the \gls{wqe} will not complete until the message has been successfully received by the remote host.
Obviously, after the message has been sent over the physical link, the receiving node's \gls{hca} will receive that same message. Then, it will acquire the destination \gls{qp} from the packets' base transport headers---more on that in \autoref{sec:addressing}---and grab the first available element from that \gls{qp}'s receive queue. In the case of this example, the channel adapter is consuming a \gls{wqe} from the blue receive queue. After retrieving a work queue element, the \gls{hca} will read the memory address from the \gls{wqe} and write the message to that memory location. When it is done doing so, it will post a completion queue entry to the completion queue. If the consumer of the sending node included immediate data in the message, that will be available in the \gls{cqe} at the receive side.
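The flow described in this example, from work request to completion queue entry, can be summarized in a toy Python model. It is a sketch under several assumptions: both channel adapters are simulated by one function, completion is immediate, and the names \texttt{QueuePair}, \texttt{CQE}, \texttt{process\_next\_send}, and the status strings are hypothetical and not taken from the InfiniBand verbs. A real \gls{hca} would, depending on the service type, behave differently when no receive \gls{wqe} is available.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class CQE:
    """Toy completion queue entry."""
    wr_id: int
    status: str                  # hypothetical labels, see below
    byte_len: int = 0
    imm_data: Optional[int] = None


class QueuePair:
    """Toy queue pair with its own completion queue."""

    def __init__(self):
        self.send_queue = deque()        # FIFO of send WQEs
        self.receive_queue = deque()     # FIFO of receive WQEs
        self.completion_queue = deque()

    def post_send(self, wr_id, data, imm_data=None):
        self.send_queue.append((wr_id, data, imm_data))

    def post_receive(self, wr_id, buffer):
        self.receive_queue.append((wr_id, buffer))


def process_next_send(sender, receiver):
    """Simulate both HCAs handling the oldest send WQE of `sender`."""
    wr_id, data, imm_data = sender.send_queue.popleft()
    if not receiver.receive_queue:
        sender.completion_queue.append(CQE(wr_id, "no receive WQE"))
        return
    recv_id, buffer = receiver.receive_queue.popleft()
    if len(data) > len(buffer):
        raise ValueError("receive buffer too small for this sketch")
    buffer[:len(data)] = data  # write the message to the posted location
    receiver.completion_queue.append(
        CQE(recv_id, "success", len(data), imm_data))
    sender.completion_queue.append(CQE(wr_id, "success", len(data)))


sending, receiving = QueuePair(), QueuePair()
buffer = bytearray(8)
receiving.post_receive(wr_id=10, buffer=buffer)        # receiver goes first
sending.post_send(wr_id=1, data=b"abc", imm_data=7)
sending.post_send(wr_id=2, data=b"def")                # no receive WQE left
process_next_send(sending, receiving)
process_next_send(sending, receiving)
```

The first send completes successfully on both sides and delivers the immediate data in the receiver's \gls{cqe}; the second send finds no receive \gls{wqe} and therefore completes with an error status on the sending side.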
\paragraph{Processing WQEs} After a process has submitted a work request to one of the queues, the channel adapter starts processing the resulting \gls{wqe}. As can be seen in \autoref{fig:iba_model}, an internal \gls{dma} engine will access the memory location which is included in the work queue element, and will copy the data from the host's main memory to a local buffer of the \gls{hca}. Every port of an \gls{hca} has several of these buffers which are called \glspl{vl}. Subsequently, separately for every port, an arbiter decides from which virtual lane packets will be sent onto the physical link. How packets are distributed among the virtual lanes and how the arbiter decides from which virtual lane to send is explained in \autoref{sec:vlandsl}.
\paragraph{Queue pair state machine} Like the virtual interfaces in \autoref{sec:via}, queue pairs can reside in several states as depicted in \autoref{fig:qp_states}. All black lines are normal transitions and have to be explicitly initiated by a consumer with a \textit{modify queue pair verb}. Red lines are transitions to error states, which usually happen automatically. Because this diagram is more extensive than the state machine of the \gls{via} (\autoref{fig:via_diagram}), the descriptions of the state transitions are omitted in this figure. All states, their characteristics, and the way to enter the state are summarized in the list below. Every list item has a sublist which provides information on how work requests, received messages, and messages to be sent are handled.