Commit 03da0a5e authored by Sebastian Wagner, committed by Raphael Defosseux

CI: fixing compilation warning in phy simulators

Signed-off-by: Raphael Defosseux <raphael.defosseux@eurecom.fr>
parent b3bc216d


......@@ -622,6 +622,10 @@ inline static uint32_t vcd_get_write_index(void)
return write_index;
}
#if defined(ENABLE_ITTI)
int signal_mask(void);
#endif
void *vcd_dumper_thread_rt(void *args)
{
vcd_queue_user_data_t *data;
......
......@@ -118,6 +118,33 @@
\end{tabular}
}
\paragraph{Version 1.0:}
\begin{itemize}
\item Initial version
\end{itemize}
\paragraph{Version 2.0:}
\begin{itemize}
\item Enhancements in message passing:
\begin{itemize}
\item LUTs replaced by smaller BG-specific parameters
\item Inefficient load/store replaced by circular memcpy
\end{itemize}
\item Bug fixes:
\begin{itemize}
\item Fixed bug in function \texttt{llr2CnProcBuf}
\item Corrected input LLR dynamic range in BLER simulations
\end{itemize}
\item Results:
\begin{itemize}
\item Size of LUTs reduced significantly (60MB to 200KB)
\item Execution time improved significantly (factor 3.5)
\item Improved BLER performance (all simulation results have been updated)
\end{itemize}
\end{itemize}
\newpage
\tableofcontents
\newpage
......@@ -293,7 +320,7 @@ The implementation on a general purpose processor (GPP) has to take advantage of
\draw (bnProcPc) edge[connector] (bnProc);
\draw (bnProc) edge[connector] (bn2cnProcBuf);
\draw (bn2cnProcBuf) edge[connector] (cnProc);
\draw (bn2cnProcBuf) edge[connector] node[above left] {\texttt{cnProcPc}} (cnProc);
\draw (cnProc) edge[connector] (cn2bnProcBuf);
\draw (cn2bnProcBuf) edge[connector] (bnProcPc);
......@@ -316,7 +343,7 @@ The implementation on a general purpose processor (GPP) has to take advantage of
\caption{LDPC Decoder processing flow.}
\end{figure}
The functions involved are described in more detail in table \ref{tab:sum_func}.
The functions involved are described in more detail in Table \ref{tab:sum_func}.
\begin{table}[ht]
\centering
......@@ -327,6 +354,7 @@ The functions involved are described in more detail in table \ref{tab:sum_func}.
\texttt{llr2llrProcBuf} & Copies input LLRs to LLR processing buffer \\
\texttt{llr2CnProcBuf} & Copies input LLRs to CN processing buffer \\
\texttt{cnProc} & Performs CN signal processing \\
\texttt{cnProcPc} & Performs parity check \\
\texttt{cn2bnProcBuf} & Copies the CN results to the BN processing buffer \\
\texttt{bnProcPc} & Performs BN processing for parity check and/or hard-decision \\
\texttt{bnProc} & Utilizes the results of \texttt{bnProcPc} to compute LLRs for CN processing \\
......@@ -514,24 +542,7 @@ The sum of the LLRs is carried out in 16 bit for accuracy and is then saturated
\subsection{Mapping to the Processing Buffers}
\label{sec:mapp-cn-proc}
For efficient processing with the AVX instructions, the data is required to be aligned in a certain manner. That is the reason why processing buffers have been introduced. The drawback is that the results of the processing need to copied every time to the processing buffer of the next task. However, the speed up in computation with AVX more than makes up for the time wasted in copying data. The copying is implemented using look-up tables (LUTs) which are described in table \ref{tab:sum_lut}.
\begin{table}[ht]
\centering
\begin{tabular}{ll}
\toprule
\textbf{LUT} & \textbf{Description} \\
\midrule
\texttt{lut\_llr2llrProcBuf\_BGX\_ZX\_RX} & Indices for function \texttt{llr2llrProcBuf} \\
\texttt{lut\_llr2CnProcBuf\_BGX\_ZX\_RX} & Indices for function \texttt{llr2CnProcBuf} \\
\texttt{lut\_cn2bnProcBuf\_BGX\_ZX\_RX} & Indices for functions \texttt{cn2bnProcBuf} and \texttt{bn2cnProcBuf} \\
\bottomrule
\end{tabular}
\caption{Summary of the LUTs.}
\label{tab:sum_lut}
\end{table}
These LUTs are depending on the BG, the lifting size and the code rate. Assuming 5 rates for BG2 and 7 rates for BG1, the total number of LUTs is 617.
For efficient processing with the AVX instructions, the data must be aligned in a certain manner, which is why processing buffers have been introduced. The drawback is that the results of the processing need to be copied every time to the processing buffer of the next task. However, the speed-up in computation with AVX more than makes up for the time spent copying data. The copying is implemented as a circular memcpy because every edge in the BG is a circular shift of a $Z\times Z$ identity matrix. Hence, a circular memcpy consists of two regular memcpys, each copying a part of the $Z$ values depending on the circular shift in the BG definition. The circular shifts are stored in \texttt{nrLDPC\_lut.h} in arrays \texttt{circShift\_BGX\_ZX\_CNGX}. In the specification, only 8 sets of circular shifts are defined. However, the applied circular shift depends on $Z$, i.e. it is the specified value modulo $Z$. To avoid inefficient modulo operations in loops, we store the circular shift values for every $Z$. Moreover, for convenience the arrays are already arranged depending on the CN group (CNG).
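The circular memcpy described above can be sketched as follows. This is a minimal illustration and not the code used in the decoder: the function name \texttt{circMemcpy} is hypothetical, and the convention that output element $i$ takes input element $(i+s) \bmod Z$ is an assumption made for this example. The shift $s$ is assumed to be already reduced modulo $Z$, as is the case for the stored circular shift values.
\begin{lstlisting}[language=C,frame=single,caption={Sketch of a circular memcpy (illustrative only)}]
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the decoder's function):
   copies Z values from src to dst with circular shift s,
   0 <= s < Z, so that dst[i] = src[(i + s) % Z]. */
static void circMemcpy(int8_t *dst, const int8_t *src,
                       uint16_t Z, uint16_t s)
{
    /* First memcpy: the Z - s values starting at offset s */
    memcpy(dst, src + s, (size_t)(Z - s));
    /* Second memcpy: the s values that wrap around */
    memcpy(dst + (Z - s), src, (size_t)s);
}
\end{lstlisting}
Since the shift is stored per lifting size, no modulo operation is needed inside the copy itself.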
\newpage
\section{Performance Results}
......@@ -542,9 +553,9 @@ In this section, the performance in terms of BLER and decoding latency of the cu
\subsection{BLER Performance}
\label{sec:bler-performance}
In all simulations, we assume AWGN, QPSK modulation and 8-bit input LLRs. The results are averaged over at least $10\,000$ channel realizations.
In all simulations, we assume AWGN, QPSK modulation and 8-bit input LLRs, i.e. values in the range $-127$ to $+127$. The DLSCH coding procedure in 38.212 is used to encode/decode the TB and an error is declared if the TB CRC check fails. Results are averaged over at least $10\,000$ channel realizations.
The first set of simulations in Figure \ref{fig:bler-bg2-15} compares the current LDPC decoder implementation to the reference implementation developed by Kien. This reference implementation is called \textit{LDPC Ref} and uses the min-sum algorithm with 2 layers and 16 bit for processing. Out current optimized decoder implementation is referred to as \textit{LDPC Opt}. Moreover, reference results provided by Huawei are also shown.
The first set of simulations in Figure \ref{fig:bler-bg2-15} compares the current LDPC decoder implementation to the reference implementation developed by Kien. This reference implementation is called \textit{LDPC Ref} and uses the min-sum algorithm with 2 layers and 16 bit for processing. Our current optimized decoder implementation is referred to as \textit{LDPC OAI}. Moreover, reference results provided by Huawei are also shown.
\begin{figure}[ht]
\centering
......@@ -563,46 +574,57 @@ The first set of simulations in Figure \ref{fig:bler-bg2-15} compares the curren
% HUAWEI merged BG2 2017-06-15
\addplot[black, solid] plot coordinates { (-3.91839,0.01) (-3.5567,0.0001) };
% 5 iterations
% LDPC Ref
\addplot[red, solid, mark=o] plot coordinates {(-1.250000,0.781300) (-1.000000,0.421000) (-0.750000,0.140400) (-0.500000,0.028900) (-0.250000,0.003300) (0.000000,0.000300) (0.250000,0.000000) (0.500000,0.000000)};
% LDPC OAI
\addplot[blue, solid, mark=square] plot coordinates {(-1.000000,0.693730) (-0.750000,0.370190) (-0.500000,0.137260) (-0.250000,0.038850) (0.000000,0.009740) (0.250000,0.002510) (0.500000,0.000730) (0.750000,0.000180) };
% Matlab layered min-sum with scaling factor 1
\addplot[green, solid, mark=triangle] plot coordinates {(-1.750000,0.709000) (-1.500000,0.360600) (-1.250000,0.105500) (-1.000000,0.015700) (-0.750000,0.001300) (-0.500000,0.000100) (-0.250000,0.000000) (0.000000,0.000000) };
% Matlab layered min-sum with scaling factor 0.8
%\addplot[green, solid, mark=triangle] plot coordinates {(-2.750000,0.982300) (-2.500000,0.882200) (-2.250000,0.573100) (-2.000000,0.214100) (-1.750000,0.041300) (-1.500000,0.003800) (-1.250000,0.000000) (-1.000000,0.000000) };
% 10 iterations
% Kien's 2-layer 16bit code
\addplot[red, solid, mark=o] plot coordinates { (-2.750000,0.915500) (-2.500000,0.576000) (-2.250000,0.165000) (-2.000000,0.017100) (-1.750000,0.000600) (-1.500000,0.000000) (-1.250000,0.000000) (-1.000000,0.000000)};
% LDPC opt with 16bit BN processing
\addplot[blue, solid, mark=square] plot coordinates { (-2.750000,0.998600) (-2.500000,0.953600) (-2.250000,0.718800) (-2.000000,0.299300) (-1.750000,0.053700) (-1.500000,0.005100) (-1.250000,0.000500) (-1.000000,0.000100)};
% Matlab
\addplot[green, solid, mark=triangle] plot coordinates {(-2.750000,0.318200) (-2.500000,0.135900) (-2.250000,0.102000) (-2.000000,0.092300) (-1.750000,0.079200) (-1.500000,0.063100) (-1.250000,0.041400) (-1.000000,0.029600) (-0.750000,0.017500) (-0.500000,0.011800) (-0.250000,0.006100) (0.000000,0.004100) (0.250000,0.002800) (0.500000,0.000800) (0.750000,0.000300) (1.000000,0.000400) };
% LDPC OAI
\addplot[blue, solid, mark=square] plot coordinates { (-2.750000,0.997200) (-2.500000,0.955000) (-2.250000,0.710900) (-2.000000,0.270400) (-1.750000,0.042400) (-1.500000,0.002200) (-1.250000,0.000000) (-1.000000,0.000000)};
% Matlab layered min-sum with scaling factor 1
\addplot[green, solid, mark=triangle] plot coordinates {(-2.750000,0.942900) (-2.500000,0.723200) (-2.250000,0.362300) (-2.000000,0.098400) (-1.750000,0.014500) (-1.500000,0.001100) (-1.250000,0.000000) (-1.000000,0.000000) };
% Matlab layered min-sum with scaling factor 0.8
%\addplot[green, solid, mark=triangle] plot coordinates {(-3.750000,0.994300) (-3.500000,0.927200) (-3.250000,0.651100) (-3.000000,0.252000) (-2.750000,0.042500) (-2.500000,0.002700) (-2.250000,0.000000) (-2.000000,0.000000) (-1.750000,0.000000) (-1.500000,0.000000) };
% \addplot[blue, solid, mark=triangle] plot coordinates {(-2.750000,0.997600) (-2.500000,0.961600) (-2.250000,0.815200) (-2.000000,0.628800) (-1.750000,0.586200) (-1.500000,0.572800) (-1.250000,0.507600) (-1.000000,0.376700) (-0.750000,0.262000) (-0.500000,0.157100) (-0.250000,0.087100) (0.000000,0.045400) (0.250000,0.021000) (0.500000,0.010000) (0.750000,0.004400) (1.000000,0.002600) (1.250000,0.000700) (1.500000,0.000500) (1.750000,0.000000) (2.000000,0.000000) (2.250000,0.000000) (2.500000,0.000000) (2.750000,0.000000) (3.000000,0.000000)};
% 20 iterations
% Kien's 2-layer 16bit code
\addplot[red, solid, mark=o] plot coordinates { (-2.750000,0.330300) (-2.500000,0.067800) (-2.250000,0.006000) (-2.000000,0.000100) (-1.750000,0.000000) (-1.500000,0.000000) (-1.250000,0.000000) (-1.000000,0.000000)};
% LDPC OAI
\addplot[blue, solid, mark=square] plot coordinates {(-2.750000,0.337900) (-2.500000,0.058300) (-2.250000,0.004000) (-2.000000,0.000200) (-1.750000,0.000000) (-1.500000,0.000000) };
% Matlab layered min-sum with scaling factor 1
%\addplot[green, solid, mark=triangle] plot coordinates {(-2.750000,0.843200) (-2.500000,0.524600) (-2.250000,0.198100) (-2.000000,0.037300) (-1.750000,0.003200) (-1.500000,0.000000) };
% Matlab layered min-sum with scaling factor 0.8
%\addplot[green, solid, mark=triangle] plot coordinates {(-3.750000,0.872300) (-3.500000,0.544600) (-3.250000,0.186400) (-3.000000,0.027500) (-2.750000,0.001900) (-2.500000,0.000000) };
\addplot[blue, solid, mark=square] plot coordinates {(-2.750000,0.341300) (-2.500000,0.065100) (-2.250000,0.004100) (-2.000000,0.000200) (-1.750000,0.000100) (-1.500000,0.000000) (-1.250000,0.000000) (-1.000000,0.000000)};
% 5 iterations
\addplot[red, solid, mark=o] plot coordinates {(-1.250000,0.781300) (-1.000000,0.421000) (-0.750000,0.140400) (-0.500000,0.028900) (-0.250000,0.003300) (0.000000,0.000300) (0.250000,0.000000) (0.500000,0.000000)};
\addplot[blue, solid, mark=square] plot coordinates {(-0.250000,0.705000) (0.000000,0.406200) (0.250000,0.181300) (0.500000,0.061600) (0.750000,0.015900) (1.000000,0.004900) (1.250000,0.000900) (1.500000,0.000200)};
\addplot[green, solid, mark=triangle] plot coordinates {(-1.000000,0.778900) (-0.500000,0.226400) (0.000000,0.027400) (0.500000,0.002600) (1.000000,0.000300) };
% 30 iterations
% \addplot[blue, dashed, mark=square] plot coordinates {(-3.000000,0.623800) (-2.750000,0.224100) (-2.500000,0.031600) (-2.250000,0.001100) (-2.000000,0.000000)};
\draw (axis cs:-3.3,0.1) node[fill=white,draw=black] (pint0) {20 iter};
\draw (axis cs:-2.3,0.01) node[draw,black,thick,ellipse,minimum height=0.3cm] (ell0) {}; \draw[black,thick] (pint0) -- (ell0);
% Parity check 50 iterations
%\addplot[blue, solid, mark=square] plot coordinates {(-2.750000,0.214600) (-2.500000,0.029200) (-2.250000,0.001500) (-2.000000,0.000100) (-1.750000,0.000000) (-1.500000,0.000000) };
\draw (axis cs:0,0.0001) node[fill=white,draw=black] (pint1) {10 iter};
\draw (axis cs:-1.6,0.002) node[draw,black,thick,ellipse,minimum width=0.8cm] (ell1) {}; \draw[black,thick] (pint1) -- (ell1);
\draw (axis cs:1.3,0.2) node[fill=white,draw=black] (pint2) {5 iter};
\draw (axis cs:0.3,0.01) node[draw,black,thick,ellipse,minimum width=2cm] (ell2) {}; \draw[black,thick] (pint2) -- (ell2);
\draw (axis cs:-3.3,0.1) node[fill=white,draw=black] (pint0) {20 iter};
\draw (axis cs:-2.3,0.01) node[draw,black,thick,ellipse,minimum height=0.3cm] (ell0) {}; \draw[black,thick] (pint0) -- (ell0);
\draw (axis cs:-1.2,0.0001) node[fill=white,draw=black] (pint1) {10 iter};
\draw (axis cs:-1.6,0.002) node[draw,black,thick,ellipse,minimum width=0.8cm] (ell1) {}; \draw[black,thick] (pint1) -- (ell1);
\draw (axis cs:1.3,0.2) node[fill=white,draw=black] (pint2) {5 iter};
\draw (axis cs:-0.4,0.01) node[draw,black,thick,ellipse,minimum width=2cm] (ell2) {}; \draw[black,thick] (pint2) -- (ell2);
\legend{ {Huawei 2017-06-15}\\
{LDPC Ref}\\
{LDPC Opt}\\
{MATLAB 5Glib}\\};
{LDPC OAI}\\
{MATLAB NMS SF=1}\\};
\end{semilogyaxis}
\end{tikzpicture}
......@@ -610,12 +632,20 @@ The first set of simulations in Figure \ref{fig:bler-bg2-15} compares the curren
\label{fig:bler-bg2-15}
\end{figure}
From Figure \ref{fig:bler-bg2-15} it can be observed that the reference decoder outperforms the current implementation significantly for low to medium number of iterations. The reason is the implementation of 2 layers in the reference decoder, which results in faster convergence for punctured codes and hence requires less iterations to achieve a given BLER target. Note that there is a large performance loss of nearly 6 dB at BLER $10^{-2}$ between the Huawei reference and the current optimized decoder implementation with 5 iterations.
From Figure \ref{fig:bler-bg2-15} it can be observed that the reference decoder outperforms the current implementation significantly for a low to medium number of iterations. The reason is the implementation of 2 layers in the reference decoder, which results in faster convergence for punctured codes and hence requires fewer iterations to achieve a given BLER target. Note that there is a large performance loss of about 4 dB at BLER $10^{-2}$ between the Huawei reference and the current optimized decoder implementation with 5 iterations.
Moreover, there is a gap of about 1.5 dB between the results provided by Huawei and the current decoder with 20 iterations. The reason is the min-sum approximation algorithm used in both the reference decoder and the current implementation. The gap can be closed by using a tighter approximation like the min-sum with normalization or the lambda-min approach. Moreover, the gap closes for higher code rates which can be observed from Figure \ref{fig:bler-bg2-r23}. The gap is only about 0.6 dB for 50 iterations.
Concerning the LDPC decoder provided by MATLAB, the performance appears to be rather inconsistent. For 5 iterations, the MATLAB decoder outperforms the optimized decoder most likely due to a tighter approximation used in the check node processing. However, it is inferior to the reference algorithm which suggests that the MATLAB decoder is not optimized for punctured LDPC codes, i.e. no layered processing. For 50 iterations the MATLAB LDPC decoder shows a strange behavior, the slope of the BLER curve is not as expected. This suggests that there might be some internal decoder problems with the NR base graph 2.
The MATLAB results denoted \texttt{MATLAB NMS} are obtained with the function \texttt{nrLDPCDecode} provided by the MATLAB 5G Toolbox R2019b. The following options are provided to the function: \texttt{'Termination','max','Algorithm','Normalized min-sum','ScalingFactor',1}. Furthermore, the 8-bit input LLRs are adapted to fit the dynamic range of \texttt{nrLDPCDecode}, as shown in Listing \ref{ldpc_matlab}.
\begin{lstlisting}[frame=single,caption={Input adaptation for MATLAB LDPC Decoder},label=ldpc_matlab]
maxLLR = max(abs(softbits));
rxLLRs = round((softbits/maxLLR)*127);
% adjust range to fit tanh use in decoder code
softbits = rxLLRs/3.4;
\end{lstlisting}
A scaling factor (SF) of 1 has been chosen to compare the results more easily with the \textit{LDPC OAI} since the resulting check node processing is the same. However, the MATLAB normalized min-sum algorithm uses layered processing and floating point operations. Thus, for the same number of iterations, the performance is significantly better than \textit{LDPC OAI}, especially for a small number of iterations.
\begin{figure}[ht]
\centering
......@@ -628,7 +658,7 @@ Concerning the LDPC decoder provided by MATLAB, the performance appears to be ra
\pgfplotsset{every axis/.append style={mark options=solid, mark size=2.5pt}}
\begin{semilogyaxis}[title={}, xlabel={$\SNR$ [dB]}, ylabel={BLER},
grid={both}, xmin=3, xmax=5.5, xtick={3,3.5,...,5.5}, ymin=0,
grid={both}, xmin=3, xmax=6.5, xtick={3,3.5,...,6.5}, ymin=0,
ymax=1,ytickten={-5,-4,-3,-2,-1,0},legend columns=1]
% Kien's 2-layer 16bit code
......@@ -638,16 +668,19 @@ Concerning the LDPC decoder provided by MATLAB, the performance appears to be ra
\addplot[black, solid] plot coordinates { (3.28392,0.01) (3.73319,0.0001) };
% LDPC opt with 16bit BN processing
\addplot[blue, solid, mark=square] plot coordinates {(4.000000,0.487500) (4.250000,0.163400) (4.500000,0.029800) (4.750000,0.002700) (5.000000,0.000100)};
%\addplot[blue, solid, mark=square] plot coordinates {(4.000000,0.487500) (4.250000,0.163400) (4.500000,0.029800) (4.750000,0.002700) (5.000000,0.000100)};
\addplot[blue, solid, mark=square] plot coordinates {(5.000000,0.439600) (5.250000,0.185800) (5.500000,0.062100) (5.750000,0.015000) (6.000000,0.003900)};
%\addplot[blue, dashed, mark=triangle] plot coordinates {(4.000000,0.487500) (4.250000,0.163700) (4.500000,0.030000) (4.750000,0.002900) (5.000000,0.000100)};
\addplot[blue, dashed, mark=square] plot coordinates {(3.000000,0.911600) (3.250000,0.614100) (3.500000,0.230100) (3.750000,0.036900) (4.000000,0.001100) (4.250000,0.000000) (4.500000,0.000000)};
%\addplot[blue, dashed, mark=square] plot coordinates {(3.000000,0.911600) (3.250000,0.614100) (3.500000,0.230100) (3.750000,0.036900) (4.000000,0.001100) (4.250000,0.000000) (4.500000,0.000000)};
\addplot[blue, dashed, mark=square] plot coordinates {(3.000000,0.900400) (3.250000,0.600000) (3.500000,0.216400) (3.750000,0.036000) (4.000000,0.002600) (4.250000,0.000000) };
\legend{ {Huawei 2017-06-15}\\
{LDPC Opt 5 iter}\\
{LDPC Opt 50 iter}\\};
{LDPC OAI 5 iter}\\
{LDPC OAI 50 iter}\\};
\end{semilogyaxis}
\end{tikzpicture}
......@@ -655,7 +688,7 @@ Concerning the LDPC decoder provided by MATLAB, the performance appears to be ra
\label{fig:bler-bg2-r23}
\end{figure}
Figure \ref{fig:bler-bg1-r89} shows the performance of BG1 with largest block size of $B=8448$ and highest code rate $R=8/9$.
In Figure \ref{fig:bler-bg2-15-2} we compare the performance of different algorithms using at most 50 iterations with early stopping if the parity check passes. The MATLAB layered belief propagation (LBP) is used with unquantized input LLRs and performs best since no approximation is made in the processing. NMS and offset min-sum (OMS) use a scaling factor and an offset, respectively, that have been empirically found to perform best in this simulation setting. Their performance is very close to that of LBP, and OMS is slightly better than NMS. The performance of \textit{LDPC OAI} is more than 1 dB worse, mainly because of the looser approximation. Moreover, the NMS algorithm with SF=1 performs worst, probably because the SF is not optimized for the input LLRs. From the results in Figure \ref{fig:bler-bg2-15-2} we can conclude that the performance of the \textit{LDPC OAI} can be significantly improved by adopting an offset min-sum approximation, improving the performance to within 0.3 dB of the Huawei reference curve.
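The min-sum variants compared above differ only in how the magnitude of the CN output is corrected. As a sketch in generic notation (the symbols below are assumptions for this illustration and not necessarily those used elsewhere in this document), let $q_{i'j}$ be the message from BN $i'$ to CN $j$, $\mathcal{B}_j$ the set of BNs connected to CN $j$, and $r_{ji}$ the message returned to BN $i$:
\begin{align*}
s_{ji} &= \prod_{i' \in \mathcal{B}_j \setminus \{i\}} \mathrm{sign}(q_{i'j}), \qquad m_{ji} = \min_{i' \in \mathcal{B}_j \setminus \{i\}} |q_{i'j}| \\
\text{min-sum:} \quad r_{ji} &= s_{ji}\, m_{ji} \\
\text{NMS:} \quad r_{ji} &= \alpha\, s_{ji}\, m_{ji}, \quad 0 < \alpha \le 1 \\
\text{OMS:} \quad r_{ji} &= s_{ji}\, \max(m_{ji} - \beta,\, 0), \quad \beta \ge 0
\end{align*}
With $\alpha = 1$ (SF=1) or $\beta = 0$, both variants reduce to the plain min-sum rule. A smaller $\alpha$ or a larger $\beta$ shrinks the magnitudes to compensate for the overestimation inherent in the min-sum approximation.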
\begin{figure}[ht]
\centering
......@@ -668,22 +701,73 @@ Figure \ref{fig:bler-bg1-r89} shows the performance of BG1 with largest block si
\pgfplotsset{every axis/.append style={mark options=solid, mark size=2.5pt}}
\begin{semilogyaxis}[title={}, xlabel={$\SNR$ [dB]}, ylabel={BLER},
grid={both}, xmin=6, xmax=11, xtick={6,6.5,...,11}, ymin=0,
grid={both}, xmin=-4, xmax=-1, xtick={-4,-3.5,...,-1}, ymin=0,
ymax=1,ytickten={-5,-4,-3,-2,-1,0},legend columns=1]
% HUAWEI merged BG2 2017-06-15
\addplot[black, solid] plot coordinates { (-3.91839,0.01) (-3.5567,0.0001) };
% Parity check 50 iterations
\addplot[blue, solid, mark=square] plot coordinates {(-2.750000,0.214600) (-2.500000,0.029200) (-2.250000,0.001500) (-2.000000,0.000100) (-1.750000,0.000000) (-1.500000,0.000000) };
% Matlab layered belief propagation
\addplot[red, solid, mark=diamond] plot coordinates {(-4.500000,0.854200) (-4.250000,0.495800) (-4.000000,0.147700) (-3.750000,0.016100) (-3.500000,0.000800) (-3.250000,0.000200) (-3.000000,0.000000) };
% Matlab layered min-sum with scaling factor 1
\addplot[green, dashed, mark=triangle] plot coordinates {(-2.750000,0.830100) (-2.500000,0.497700) (-2.250000,0.165800) (-2.000000,0.024000) (-1.750000,0.001900) (-1.500000,0.000000) };
% Matlab layered min-sum with scaling factor 0.8
%\addplot[green, solid, mark=triangle] plot coordinates {(-3.750000,0.734800) (-3.500000,0.353800) (-3.250000,0.084300) (-3.000000,0.008000) (-2.750000,0.000400) };
\addplot[green, solid, mark=triangle] plot coordinates {(-4.500000,0.964400) (-4.250000,0.748200) (-4.000000,0.333600) (-3.750000,0.057700) (-3.500000,0.004400) (-3.250000,0.000400) };
% Matlab layered offset min-sum with offset 0.025
\addplot[brown, solid, mark=asterisk] plot coordinates {(-4.250000,0.688800) (-4.000000,0.253800) (-3.750000,0.035600) (-3.500000,0.002000) (-3.250000,0.000000) (-3.000000,0.000000) };
\legend{ {Huawei 2017-06-15}\\
{LDPC OAI}\\
{MATLAB LBP}\\
{MATLAB NMS SF=1}\\
{MATLAB NMS SF=0.65}\\
{MATLAB OMS OS=0.025}\\};
\end{semilogyaxis}
\end{tikzpicture}
\caption{BLER vs. SNR, BG2, Rate=1/5, max iterations = 50, B=1280.}
\label{fig:bler-bg2-15-2}
\end{figure}
Figure \ref{fig:bler-bg1-r89} shows the performance of BG1 with largest block size of $B=8448$ and highest code rate $R=8/9$. From Figure \ref{fig:bler-bg1-r89} it can be observed that the performance gap is only about 0.3 dB if 50 iterations are used. However, for 5 iterations there is still a significant performance loss of about 2.3 dB at BLER $10^{-2}$.
\begin{figure}[ht]
\centering
\begin{tikzpicture}
\tikzstyle{every pin}=[fill=white,draw=black]
\pgfplotsset{every axis legend/.append style={
cells={anchor=west}, at={(1.05,1)}, anchor=north west}}
% \pgfplotsset{every axis plot/.append style={smooth}}
\pgfplotsset{every axis/.append style={line width=0.5pt}}
\pgfplotsset{every axis/.append style={mark options=solid, mark size=2.5pt}}
\begin{semilogyaxis}[title={}, xlabel={$\SNR$ [dB]}, ylabel={BLER},
grid={both}, xmin=6, xmax=9, xtick={6,6.5,...,9}, ymin=0,
ymax=1,ytickten={-5,-4,-3,-2,-1,0},legend columns=1]
% Huawei
\addplot[black, solid] plot coordinates { (6.118717,0.01) (6.291449,0.0001) };
% LDPC opt 5 iter
\addplot[blue, solid, mark=square] plot coordinates {(8.500000,0.350000) (8.750000,0.155100) (9.000000,0.062400) (9.250000,0.023000) (9.500000,0.008700) (9.750000,0.003500) (10.000000,0.000900) (10.250000,0.000300) };
%\addplot[blue, solid, mark=square] plot coordinates {(8.500000,0.350000) (8.750000,0.155100) (9.000000,0.062400) (9.250000,0.023000) (9.500000,0.008700) (9.750000,0.003500) (10.000000,0.000900) (10.250000,0.000300) };
\addplot[blue, solid, mark=square] plot coordinates {(7.500000,0.858900) (7.750000,0.449500) (8.000000,0.129700) (8.250000,0.025500) (8.500000,0.002300) (8.750000,0.000300) (9.000000,0.000000) };
% LDPC opt 50 iter
\addplot[blue, dashed, mark=square] plot coordinates {(6.000000,0.705333) (6.100000,0.353367) (6.200000,0.102100) (6.300000,0.015133) (6.400000,0.000967) (6.500000,0.000000)};
%\addplot[blue, dashed, mark=square] plot coordinates {(6.000000,0.705333) (6.100000,0.353367) (6.200000,0.102100) (6.300000,0.015133) (6.400000,0.000967) (6.500000,0.000000)};
\addplot[blue, dashed, mark=square] plot coordinates {(6.000000,0.970000) (6.100000,0.830800) (6.200000,0.527300) (6.300000,0.216900) (6.400000,0.045500) (6.500000,0.005600) (6.600000,0.000300) (6.700000,0.000000) (6.800000,0.000000) };
\legend{ {Huawei}\\
{LDPC Opt 5 iter}\\
{LDPC Opt 50 iter}\\};
{LDPC OAI 5 iter}\\
{LDPC OAI 50 iter}\\};
\end{semilogyaxis}
\end{tikzpicture}
......@@ -691,8 +775,6 @@ Figure \ref{fig:bler-bg1-r89} shows the performance of BG1 with largest block si
\label{fig:bler-bg1-r89}
\end{figure}
From \ref{fig:bler-bg1-r89} it can be observed that the performance gap is only about 0.2 dB if 50 iterations are used. However, for 5 iterations there is still a significant performance loss of about 3.4 dB at BLER $10^{-2}$.
\newpage
\subsection{Decoding Latency}
\label{sec:decoding-time}
......@@ -707,20 +789,30 @@ The results in Table \ref{tab:lat-bg2-r15} show the impact of the number of iter
\toprule
\textbf{Function} & \textbf{Time [$\mu s$] (5 it)} & \textbf{Time [$\mu s$] (10 it)} & \textbf{Time [$\mu s$] (20 it)}\\
\midrule
\texttt{llr2llrProcBuf} & 1.1 & 1.1 & 1.1 \\
\texttt{llr2CnProcBuf} & 12.4 & 12.0 & 12.0 \\
\texttt{cnProc} & 11.7 & 22.1 & 43.5 \\
\texttt{bnProcPc} & 6.6 & 12.1 & 23.8 \\
\texttt{bnProc} & 4.2 & 8.1 & 16.2 \\
\texttt{cn2bnProcBuf} & 61.3 & 118.3 & 234.9 \\
\texttt{bn2cnProcBuf} & 38.1 & 82.5 & 172.3 \\
\texttt{llrRes2llrOut} & 3.5 & 3.4 & 3.4 \\
\texttt{llr2bit} & 0.2 & 0.1 & 0.1 \\
% \texttt{llr2llrProcBuf} & 1.1 & 1.1 & 1.1 \\
% \texttt{llr2CnProcBuf} & 12.4 & 12.0 & 12.0 \\
% \texttt{cnProc} & 11.7 & 22.1 & 43.5 \\
% \texttt{bnProcPc} & 6.6 & 12.1 & 23.8 \\
% \texttt{bnProc} & 4.2 & 8.1 & 16.2 \\
% \texttt{cn2bnProcBuf} & 61.3 & 118.3 & 234.9 \\
% \texttt{bn2cnProcBuf} & 38.1 & 82.5 & 172.3 \\
% \texttt{llrRes2llrOut} & 3.5 & 3.4 & 3.4 \\
% \texttt{llr2bit} & 0.2 & 0.1 & 0.1 \\
\texttt{llr2llrProcBuf} & 0.5 & 0.5 & 0.5 \\
\texttt{llr2CnProcBuf} & 5.0 & 4.8 & 4.9 \\
\texttt{cnProc} & 12.4 & 23.0 & 42.7 \\
\texttt{bnProcPc} & 8.4 & 14.8 & 27.0 \\
\texttt{bnProc} & 5.5 & 10.1 & 19.0 \\
\texttt{cn2bnProcBuf} & 14.9 & 24.4 & 44.0 \\
\texttt{bn2cnProcBuf} & 10.5 & 17.8 & 31.8 \\
\texttt{llrRes2llrOut} & 0.3 & 0.3 & 0.3 \\
\texttt{llr2bit} & 0.2 & 0.2 & 0.2 \\
\midrule
\textbf{Total} & \textbf{139.4} & \textbf{260.3} & \textbf{508.4} \\
% \textbf{Total} & \textbf{139.4} & \textbf{260.3} & \textbf{508.4} \\
\textbf{Total} & \textbf{58.5} & \textbf{97.1} & \textbf{172.6} \\
\bottomrule
\end{tabular}
\caption{BG2, Z=128, R=1/5, B=1280, LDPC Opt}
\caption{BG2, Z=128, R=1/5, B=1280, LDPC OAI}
\label{tab:lat-bg2-r15}
\end{table}
......@@ -732,20 +824,30 @@ Table \ref{tab:lat-bg2-i5} shows the impact of the code rate on the latency for
\toprule
\textbf{Function} & \textbf{Time [$\mu s$] (R=1/5)} & \textbf{Time [$\mu s$] (R=1/3)} & \textbf{Time [$\mu s$] (R=2/3)}\\
\midrule
\texttt{llr2llrProcBuf} & 3.2 & 2.9 & 2.6 \\
\texttt{llr2CnProcBuf} & 36.5 & 25.4 & 14.8 \\
\texttt{cnProc} & 33.6 & 25.2 & 13.3 \\
\texttt{bnProcPc} & 17.6 & 10.2 & 4.5 \\
\texttt{bnProc} & 8.5 & 5.4 & 2.5 \\
\texttt{cn2bnProcBuf} & 175.3 & 110.6 & 50.7 \\
\texttt{bn2cnProcBuf} & 106.6 & 71.2 & 36.1 \\
\texttt{llrRes2llrOut} & 10.2 & 6.3 & 3.3 \\
\texttt{llr2bit} & 0.4 & 0.2 & 0.1 \\
% \texttt{llr2llrProcBuf} & 3.2 & 2.9 & 2.6 \\
% \texttt{llr2CnProcBuf} & 36.5 & 25.4 & 14.8 \\
% \texttt{cnProc} & 33.6 & 25.2 & 13.3 \\
% \texttt{bnProcPc} & 17.6 & 10.2 & 4.5 \\
% \texttt{bnProc} & 8.5 & 5.4 & 2.5 \\
% \texttt{cn2bnProcBuf} & 175.3 & 110.6 & 50.7 \\
% \texttt{bn2cnProcBuf} & 106.6 & 71.2 & 36.1 \\
% \texttt{llrRes2llrOut} & 10.2 & 6.3 & 3.3 \\
% \texttt{llr2bit} & 0.4 & 0.2 & 0.1 \\
\texttt{llr2llrProcBuf} & 1.5 & 0.9 & 0.5 \\
\texttt{llr2CnProcBuf} & 6.0 & 4.1 & 2.2 \\
\texttt{cnProc} & 32.2 & 23.7 & 14.4 \\
\texttt{bnProcPc} & 21.2 & 12.1 & 5.5 \\
\texttt{bnProc} & 9.8 & 5.9 & 2.9 \\
\texttt{cn2bnProcBuf} & 23.3 & 13.9 & 6.8 \\
\texttt{bn2cnProcBuf} & 14.8 & 9.7 & 5.0 \\
\texttt{llrRes2llrOut} & 0.6 & 0.4 & 0.3 \\
\texttt{llr2bit} & 0.7 & 0.4 & 0.2 \\
\midrule
\textbf{Total} & \textbf{392.4} & \textbf{258.0} & \textbf{128.2} \\
% \textbf{Total} & \textbf{392.4} & \textbf{258.0} & \textbf{128.2} \\
\textbf{Total} & \textbf{111.0} & \textbf{71.8} & \textbf{38.5} \\
\bottomrule
\end{tabular}
\caption{BG2, Z=384, B=3840, LDPC Opt, 5 iterations}
\caption{BG2, Z=384, B=3840, LDPC OAI, 5 iterations}
\label{tab:lat-bg2-i5}
\end{table}
......@@ -757,33 +859,43 @@ Table \ref{tab:lat-bg1-i5} shows the results for BG1, larges block size and diff
\toprule
\textbf{Function} & \textbf{Time [$\mu s$] (R=1/3)} & \textbf{Time [$\mu s$] (R=2/3)} & \textbf{Time [$\mu s$] (R=8/9)}\\
\midrule
\texttt{llr2llrProcBuf} & 5.5 & 4.9 & 4.6 \\
\texttt{llr2CnProcBuf} & 60.6 & 34.1 & 24.4 \\
\texttt{cnProc} & 102.0 & 74.1 & 56.0 \\
\texttt{bnProcPc} & 26.0 & 11.0 & 6.4 \\
\texttt{bnProc} & 15.7 & 7.4 & 4.5 \\
\texttt{cn2bnProcBuf} & 291.0 & 140.8 & 83.1 \\
\texttt{bn2cnProcBuf} & 193.6 & 100.5 & 63.0 \\
\texttt{llrRes2llrOut} & 13.3 & 6.9 & 5.2 \\
\texttt{llr2bit} & 0.4 & 0.2 & 0.2 \\
% \texttt{llr2llrProcBuf} & 5.5 & 4.9 & 4.6 \\
% \texttt{llr2CnProcBuf} & 60.6 & 34.1 & 24.4 \\
% \texttt{cnProc} & 102.0 & 74.1 & 56.0 \\
% \texttt{bnProcPc} & 26.0 & 11.0 & 6.4 \\
% \texttt{bnProc} & 15.7 & 7.4 & 4.5 \\
% \texttt{cn2bnProcBuf} & 291.0 & 140.8 & 83.1 \\
% \texttt{bn2cnProcBuf} & 193.6 & 100.5 & 63.0 \\
% \texttt{llrRes2llrOut} & 13.3 & 6.9 & 5.2 \\
% \texttt{llr2bit} & 0.4 & 0.2 & 0.2 \\
\texttt{llr2llrProcBuf} & 2.1 & 1.2 & 0.9 \\
\texttt{llr2CnProcBuf} & 10.6 & 5.4 & 2.9 \\
\texttt{cnProc} & 89.8 & 66.3 & 50.0 \\
\texttt{bnProcPc} & 28.1 & 12.4 & 7.1 \\
\texttt{bnProc} & 17.1 & 8.1 & 4.8 \\
\texttt{cn2bnProcBuf} & 38.7 & 17.1 & 9.3 \\
\texttt{bn2cnProcBuf} & 25.6 & 12.7 & 7.2 \\
\texttt{llrRes2llrOut} & 0.8 & 0.4 & 0.3 \\
\texttt{llr2bit} & 0.9 & 0.4 & 0.3 \\
\midrule
\textbf{Total} & \textbf{708.9} & \textbf{380.6} & \textbf{248.1}\\
% \textbf{Total} & \textbf{708.9} & \textbf{380.6} & \textbf{248.1}\\
\textbf{Total} & \textbf{214.6} & \textbf{124.6} & \textbf{83.6}\\
\bottomrule
\end{tabular}
\caption{BG1, Z=384, B=8448, LDPC Opt, 5 iterations}
\caption{BG1, Z=384, B=8448, LDPC OAI, 5 iterations}
\label{tab:lat-bg1-i5}
\end{table}
From the above results it can be observed that the data transfer between CNs and BNs takes up a significant amount of the run time. However, the performance gain due to AVX instructions in both CN and BN processing is significantly larger than the penalty incurred by the data transfers.
\section{Parity Check and early stopping Criteria}
It is often unnecessary to carry out the maximum number of iterations. After each iteration a parity check \eqref{eq:29} can be computed and if a valid code word is found the decoder can stop. This functionality has been implemented and the additional overhead is reasonable. The PC is carried out in the CN processing buffer and the calculation complexity itself is negligible. However, for the processing it is necessary to move the BN results to the CN buffer which takes time, the overall overhead is at most $10\%$ compared to an algorithm without early stopping criteria with the same number of iterations. The PC has to be activated via the define \texttt{NR\_LDPC\_ENABLE\_PARITY\_CHECK}.
\section{Parity Check and Early Stopping Criteria}
It is often unnecessary to carry out the maximum number of iterations. After each iteration a parity check (PC) \eqref{eq:29} can be computed and, if a valid code word is found, the decoder can stop. This functionality has been implemented and the additional overhead is reasonable. The PC is carried out in the CN processing buffer and the computational complexity itself is negligible. However, the PC requires moving the BN results to the CN buffer, which takes time. The overall overhead is at most $10\%$ compared to an algorithm without early stopping criteria and the same number of iterations. The PC has to be activated via the define \texttt{NR\_LDPC\_ENABLE\_PARITY\_CHECK}.
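The early stopping rule can be illustrated with a minimal C sketch. It is not the decoder's actual code: the dense row-major matrix \texttt{H}, the function names and the \texttt{trace} array holding the hard decisions after each iteration are assumptions made for illustration only. The real decoder performs the check on the CN processing buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every parity equation of the dense, row-major matrix H
 * (rows x cols, entries 0/1) is satisfied by the hard-decision bits,
 * and 0 otherwise. Illustrative helper, not part of the OAI decoder. */
int parity_check_ok(const uint8_t *H, size_t rows, size_t cols,
                    const uint8_t *bits)
{
    for (size_t r = 0; r < rows; r++) {
        uint8_t syndrome = 0;
        for (size_t c = 0; c < cols; c++)
            syndrome ^= (uint8_t)(H[r * cols + c] & bits[c]);
        if (syndrome)
            return 0; /* this parity equation is violated */
    }
    return 1;
}

/* trace holds the hard decisions after each iteration, one block of
 * cols bits per iteration (hypothetical layout). Returns the number of
 * iterations a decoder with early stopping would execute. */
uint32_t iterations_with_early_stop(const uint8_t *H, size_t rows,
                                    size_t cols, const uint8_t *trace,
                                    uint32_t max_iter)
{
    for (uint32_t i = 0; i < max_iter; i++) {
        if (parity_check_ok(H, rows, cols, trace + (size_t)i * cols))
            return i + 1; /* valid code word found, stop early */
    }
    return max_iter; /* no valid code word within the iteration budget */
}
```

In this sketch the cost of the check is a few XOR operations per parity equation, which matches the observation that the overhead comes mainly from the data movement and not from the check itself.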
\section{Conclusion}
\label{sec:conclusion}
The results in the previous sections show that the current optimized LDPC implementation fulfills the requirements in terms of decoding latency for a low to medium number of iterations at the expense of a significant loss in BLER performance. To improve BLER performance, it is recommended to implement a layered algorithm and a min-sum algorithm with normalization. Further improvements upon the current implementation are detailed in the next section.
The results in the previous sections show that the current optimized LDPC implementation fulfills the requirements in terms of decoding latency for a low to medium number of iterations at the expense of a loss in BLER performance. To improve BLER performance, it is recommended to implement a layered algorithm and a min-sum algorithm with normalization. Further improvements upon the current implementation are detailed in the next section.
\newpage
\section{Future Work}
......@@ -822,12 +934,16 @@ The following improvements will reduce the decoding latency:
\begin{itemize}
\item Adapt to AVX512
\item Optimization of CN processing
\item Implement 2/3-layers for faster convergence
\end{itemize}
\paragraph{AVX512:}
The computations in the CN and BN processing can be further accelerated by using AVX512 instructions. This improvement will speed up the CN and BN processing by approximately a factor of 2.
\paragraph{Optimization of CN Processing:}
It can be investigated if CN processing can be improved by computing two minima regardless of the number of BNs. Subsequently, the (absolute) value fed back to each BN is one of those minima.
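The two-minima idea can be sketched in scalar C as follows. This is a minimal illustration assuming a plain min-sum check node update without normalization. The function name and the buffer layout are hypothetical and do not reflect the AVX2 implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Min-sum check node update using only the two smallest magnitudes.
 * in:  messages from the n connected BNs (n >= 2 assumed)
 * out: extrinsic messages returned to the BNs
 * Hypothetical scalar sketch; a zero input is treated as positive. */
void cn_update_two_min(const int8_t *in, int8_t *out, size_t n)
{
    int min1 = 128, min2 = 128; /* smallest and second smallest |in[i]| */
    size_t idx_min1 = 0;        /* position of the smallest magnitude */
    int sign_prod = 1;          /* product of the signs of all inputs */

    for (size_t i = 0; i < n; i++) {
        int mag = abs((int)in[i]);
        if (in[i] < 0)
            sign_prod = -sign_prod;
        if (mag < min1) {
            min2 = min1;
            min1 = mag;
            idx_min1 = i;
        } else if (mag < min2) {
            min2 = mag;
        }
    }

    for (size_t i = 0; i < n; i++) {
        /* The BN holding the minimum receives the second minimum,
         * every other BN receives the minimum. */
        int mag = (i == idx_min1) ? min2 : min1;
        if (mag > 127)
            mag = 127; /* saturate to the int8_t range */
        /* Dividing out the BN's own sign gives the extrinsic sign. */
        int sign = (in[i] < 0) ? -sign_prod : sign_prod;
        out[i] = (int8_t)(sign * mag);
    }
}
```

The first loop has the same structure for any number of connected BNs, which is what would make this variant attractive if it turns out to be faster than the current processing.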
\paragraph{Layered processing:}
The LDPC code in NR always punctures the first 2 columns of the base graph. Hence, the decoder inserts LLRs with value 0 in their place and needs to retrieve those bits during the decoding process. Instead of computing all the parity equations and then passing the results to the BN processing, it is beneficial to first compute the parity equations where at most one punctured BN is connected to the CN. If two punctured BNs are connected then, according to \eqref{eq:40}, the result will again be 0. Thus, in a first sub-iteration those parity equations are computed and the results are sent to BN processing, which calculates the results using only those rows in the PCM. In the second sub-iteration the remaining check equations are used.
The convergence of this layered approach is much faster since the punctured bits can be retrieved more quickly while the decoding complexity remains the same. Therefore, for a fixed number of iterations the layered algorithm will have a significantly better performance.
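The row selection for the two sub-iterations can be sketched in C. This is only an illustration under assumptions: the base graph is given as a dense row-major 0/1 matrix with at least two columns, and the function name and the \texttt{layer} output array are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assigns each base graph row to a sub-iteration (layer).
 * bg:    dense, row-major base graph (rows x cols, entries 0/1),
 *        where columns 0 and 1 are the punctured columns
 * layer: output, 0 for rows connected to at most one punctured column
 *        (first sub-iteration), 1 for the remaining rows
 * Returns the number of rows in the first sub-iteration.
 * Hypothetical helper; assumes cols >= 2. */
size_t split_rows_into_layers(const uint8_t *bg, size_t rows, size_t cols,
                              uint8_t *layer)
{
    size_t n_first = 0;
    for (size_t r = 0; r < rows; r++) {
        unsigned punctured = (bg[r * cols + 0] != 0u)
                           + (bg[r * cols + 1] != 0u);
        layer[r] = (punctured <= 1u) ? 0u : 1u;
        if (layer[r] == 0u)
            n_first++;
    }
    return n_first;
}
```

Since the base graph is fixed, such a classification could be computed once offline and stored alongside the other BG-specific parameters.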
......
......@@ -771,7 +771,7 @@ WARN_LOGFILE =
# spaces. See also FILE_PATTERNS and EXTENSION_MAPPING
# Note: If this tag is empty the current directory is searched.
INPUT = ../nrLDPC_defs.h ../nrLDPC_types.h ../nrLDPC_init.h ../nrLDPC_cnProc.h ../nrLDPC_bnProc.h ../nrLDPC_mPass.h ../nrLDPC_decoder.h ../nrLDPC_decoder.c ../nrLDPC_lut/nrLDPC_lut.h
INPUT = ../nrLDPC_defs.h ../nrLDPC_types.h ../nrLDPC_init.h ../nrLDPC_cnProc.h ../nrLDPC_bnProc.h ../nrLDPC_mPass.h ../nrLDPC_decoder.h ../nrLDPC_decoder.c ../nrLDPC_lut.h
# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
......
......@@ -22,7 +22,7 @@
/*!\file nrLDPC_cnProc.h
* \brief Defines the functions for check node processing
* \author Sebastian Wagner (TCL Communications) Email: <mailto:sebastian.wagner@tcl.com>
* \date 27-03-2018
* \date 30-09-2019
* \version 1.0
* \note
* \warning
......@@ -34,6 +34,7 @@
/**
\brief Performs CN processing for BG2 on the CN processing buffer and stores the results in the CN processing results buffer.
\param p_lut Pointer to decoder LUTs
\param p_procBuf Pointer to processing buffers
\param Z Lifting size
*/
static inline void nrLDPC_cnProc_BG2(t_nrLDPC_lut* p_lut, t_nrLDPC_procBuf* p_procBuf, uint16_t Z)
......
......@@ -22,8 +22,8 @@
/*!\file nrLDPC_decoder.c
* \brief Defines the LDPC decoder
* \author Sebastian Wagner (TCL Communications) Email: <mailto:sebastian.wagner@tcl.com>
* \date 27-03-2018
* \version 1.0
* \date 30-09-2019
* \version 2.0
* \note
* \warning
*/
......@@ -96,6 +96,8 @@ static inline uint32_t nrLDPC_decoder_core(int8_t* p_llr, int8_t* p_out, t_nrLDP
{
// Use LLR processing buffer as temporary output buffer
p_llrOut = p_procBuf->llrProcBuf;
// Clear llrProcBuf
memset(p_llrOut,0, NR_LDPC_MAX_NUM_LLR*sizeof(int8_t));
}
......@@ -116,7 +118,14 @@ static inline uint32_t nrLDPC_decoder_core(int8_t* p_llr, int8_t* p_out, t_nrLDP
#ifdef NR_LDPC_PROFILER_DETAIL
start_meas(&p_profiler->llr2CnProcBuf);
#endif
nrLDPC_llr2CnProcBuf(p_lut, p_llr, p_procBuf, numLLR, Z, BG);
if (BG == 1)
{