PUSCH LLR 128/256 SIMDe routines for ARM/x86 (!2229) · Merge requests · oai / openairinterface5G

Quency Lin requested to merge pusch-llr-SIMDe into develop Jul 13, 2023

The major change in this MR is in nr_ulsch_llr_computation.c from the branch pusch-parallelization. The changes include:

For MIMO as well as SISO, use the SIMDe width suited to each architecture (128-bit on ARM, 256-bit on x86) for better execution time. Compile time on ARM is also reduced.
- 128-bit SIMDe is faster on ARM. Test parameters (see !2081 (merged)), run on armix at commit 09f61af7 ("Refactor: 16QAM LLR 128 SIMDe assign"):
```
sudo ./nr_ulsim -n200 -b14 -I5 -i 0,1 -g B,l -t70 -u 1 -m2 -R106 -r106 -U 1,1,1,2 -W2 -y2 -z2 -s1.7 -S1.7 -P -C12
```
  | meas / SIMDe | 128 | 256 |
  |---|---|---|
  | Total PHY proc rx | 1655.80 | 1865.50 |
  | RX PUSCH time | 1042.89 | 1269.57 |
  | ULSCH llr computation | 13.81 | 33.37 |
  
  Note that ULSCH llr computation time is the average for 1 symbol. We have 12 symbols in this case.
- Compile time on ARM (armix):
```
time ./build_oai -c -C --ninja -P
```
  - 2023.w39: more than 20 min; ninja was usually killed by the OS
  - this MR: ~2 min
Turn the somewhat lengthy function-like macros into static inline functions. Because these macros are used frequently, expanding them during preprocessing lengthens compilation. Inline functions are also easier to maintain, with clearer variable scope and proper syntax checking. The performance (LLR computation time in nr_ulsim) remains the same.
Refactor the duplicated code of nr_ulsch_qam16_qam16().

Refactor the duplicated code of nr_ulsch_qam64_qam64().
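The first two items can be illustrated with a minimal scalar sketch. It is not the MR's code and uses no SIMDe calls: `LLR_VEC_BITS`, `LLR_LANES`, `SUB_ABS_MACRO`, `sub_abs_sat` and `sub_abs_block` are hypothetical names, and the architecture test via `__aarch64__` / `__arm__` is an assumption about how the width could be selected at compile time. It shows a per-architecture lane count and a small helper written once as a function-like macro and once as a `static inline` function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128-bit vectors on ARM targets, 256-bit elsewhere (e.g. x86). */
#if defined(__aarch64__) || defined(__arm__)
#define LLR_VEC_BITS 128
#else
#define LLR_VEC_BITS 256
#endif
/* Number of int16_t lanes in one vector of the chosen width. */
#define LLR_LANES ((size_t)(LLR_VEC_BITS / 16))

/* Macro form: `y` is pasted textually and evaluated twice; nothing is type-checked. */
#define SUB_ABS_MACRO(mag, y) ((mag) - ((y) < 0 ? -(y) : (y)))

/* Inline form: typed arguments, single evaluation, saturates to the int16_t range. */
static inline int16_t sub_abs_sat(int16_t mag, int16_t y)
{
  int32_t abs_y = y < 0 ? -(int32_t)y : (int32_t)y;
  int32_t d = (int32_t)mag - abs_y;
  if (d < INT16_MIN)
    d = INT16_MIN;
  return (int16_t)d; /* d <= mag, so no upper clamp is needed */
}

/* Process n samples in chunks of LLR_LANES. A SIMD version would handle
 * each chunk with vector instructions instead of the inner loop. */
static inline void sub_abs_block(const int16_t *mag, const int16_t *y, int16_t *out, size_t n)
{
  assert(n % LLR_LANES == 0);
  for (size_t i = 0; i < n; i += LLR_LANES)
    for (size_t k = 0; k < LLR_LANES; k++)
      out[i + k] = sub_abs_sat(mag[i + k], y[i + k]);
}
```

Calling `sub_abs_block(mag, y, out, 16)` works for either width, since 16 is a multiple of both 8 and 16 lanes. The inline helper keeps its temporaries in function scope, whereas the macro form leaks every expression into the caller.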

MMSE / ML 64QAM LLR benchmark

This MR does not enable ML LLR for 64QAM. To enable it, apply this change:

```diff
diff --git a/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c b/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c
index 18aea849f7..5e9e796337 100644
--- a/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c
+++ b/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c
@@ -1346,7 +1346,8 @@ static void inner_rx (PHY_VARS_gNB *gNB,
   }

   if (nb_layer == 2) {
-    if (rel15_ul->qam_mod_order < 6) {
+    if (rel15_ul->qam_mod_order < 8) {
+      // QPSK, 16QAM, 64QAM ML
       nr_ulsch_compute_ML_llr(pusch_vars,
                               symbol,
                               (c16_t*)&pusch_vars->rxdataF_comp[0][symbol * buffer_length],
@@ -1361,6 +1362,7 @@ static void inner_rx (PHY_VARS_gNB *gNB,
                               rel15_ul->qam_mod_order);
     }
     else {
+      // 256QAM MMSE
       nr_ulsch_mmse_2layers(frame_parms,
                             (int32_t **)pusch_vars->rxdataF_comp,
                             (int **)rxF_ch_maga,
@@ -1377,7 +1379,8 @@ static void inner_rx (PHY_VARS_gNB *gNB,
                             buffer_length);
     }
   }
-  if (nb_layer != 2 || rel15_ul->qam_mod_order >= 6)
+  // For single layer or 2 layer 256QAM MMSE
+  if (nb_layer != 2 || rel15_ul->qam_mod_order >= 8)
     for (int aatx = 0; aatx < nb_layer; aatx++)
       nr_ulsch_compute_llr((int32_t*)&pusch_vars->rxdataF_comp[aatx * nb_rx_ant][symbol * buffer_length],
                           (int32_t*)rxF_ch_maga[aatx],
```
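The branch logic this patch changes can be summarized in a small standalone sketch. It only mirrors the conditions visible in the diff; `llr_path_t`, `select_llr_path` and the `ml_limit` parameter are hypothetical names introduced for illustration and do not exist in the code base.

```c
#include <assert.h>

/* Hypothetical labels for the three paths visible in inner_rx(). */
typedef enum {
  PATH_PER_LAYER_LLR, /* nb_layer != 2: nr_ulsch_compute_llr() per layer */
  PATH_ML_2LAYER,     /* 2 layers, order below the limit: nr_ulsch_compute_ML_llr() */
  PATH_MMSE_2LAYER    /* 2 layers otherwise: nr_ulsch_mmse_2layers(), then per-layer LLR */
} llr_path_t;

/* ml_limit is the threshold compared against qam_mod_order:
 * 6 before the patch, 8 after it. */
static inline llr_path_t select_llr_path(int nb_layer, int qam_mod_order, int ml_limit)
{
  assert(nb_layer >= 1);
  if (nb_layer != 2)
    return PATH_PER_LAYER_LLR;
  return qam_mod_order < ml_limit ? PATH_ML_2LAYER : PATH_MMSE_2LAYER;
}
```

With `ml_limit` set to 6, two-layer 64QAM (`qam_mod_order` 6) takes the MMSE path; with 8 it takes the ML path, and only 256QAM (`qam_mod_order` 8) remains on MMSE.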

Testing with commits:

- develop 2023.w39: 0be397b2 (Oct 3, 2023), "Merge branch 'integration_2023_w39' into 'develop'"
- pusch-llr-SIMDe: f451f62c (Oct 6, 2023), "Fix cannot pass sanitizer, Align to 16 for 2 ..."

Command line parameters:

```
sudo ./nr_ulsim -n500 -g C,l -m17 -s17 -S25 -W2 -y2 -z2 -P -t70 -C6
```

`-C6` creates 6 threads in the thread pool; the thread count varies with the test case.
Checking out the commit as-is gives MMSE 64QAM; for ML 64QAM, apply the change above.
This MR reduces "Symbol Processing" time by 17% by speeding up the 64QAM ML LLR computation.
With 64QAM at MCS 17, ML reaches an effective throughput of at least 70% at roughly 2 dB lower SNR than MMSE.

In 2023.w39 with ML, the effective throughput drops as the number of threads increases. This appears to be fixed in this MR, for an unknown reason.

| | 2023.w39 | 2023.w39 | 2023.w39 | 2023.w39 | 2023.w39 | 2023.w39 | MR2229 | MR2229 | MR2229 |
|---|---|---|---|---|---|---|---|---|---|
| MMSE / ML | MMSE | MMSE | MMSE | ML | ML | ML | ML | ML | ML |
| `-C` | 1 | 6 | 12 | 1 | 6 | 12 | 1 | 6 | 12 |
| gNB RX | 1482.93 | 490.42 | 460.50 | 2104.24 | 637.67 | 572.44 | 1877.58 | 580.64 | 521.81 |
| RX PUSCH time | 647.44 | 217.23 | 180.55 | 1244.24 | 351.96 | 272.24 | 1024.92 | 297.38 | 228.71 |
| Symbol Processing | 579.98 | 149.05 | 109.45 | 1176.61 | 283.51 | 203.85 | 956.94 | 229.38 | 158.58 |
| BER | 0.082669 | 0.080998 | 0.079311 | 0.094239 | 0.082764 | 0.077498 | 0.094947 | 0.10220 | 0.104123 |
| n_error xxx/500 | 298 | 270 | 278 | 268 | 273 | 266 | 272 | 288 | 284 |
| eff throughput | 70.02 | 72.10 | 72.20 | 72.12 | 71.12 | 71.52 | 72.02 | 70.37 | 70.73 |
| SNR | 20 | 20.2 | 20.2 | 18.0 | 19.8 | 19.9 | 17.9 | 17.9 | 18.0 |

@chan @link

Edited Oct 16, 2023 by Quency Lin
