PUSCH LLR 128/256 SIMDe routines for ARM/x86
The major change in this MR is in nr_ulsch_llr_computation.c,
from the branch pusch-parallelization. The changes include:
-
For MIMO as well as SISO, use the SIMDe vector width suited to each architecture (128-bit on ARM, 256-bit on x86) to reduce execution time. Compile time on ARM is also reduced.
-
128-bit SIMDe is faster on ARM. Test parameters (refer to !2081 (merged)) on armix, at commit 09f61af7 (Refactor: 16QAM LLR 128 SIMDe assign):
sudo ./nr_ulsim -n200 -b14 -I5 -i 0,1 -g B,l -t70 -u 1 -m2 -R106 -r106 -U 1,1,1,2 -W2 -y2 -z2 -s1.7 -S1.7 -P -C12
| meas / SIMDe | 128 | 256 |
|---|---|---|
| Total PHY proc rx | 1655.80 | 1865.50 |
| RX PUSCH time | 1042.89 | 1269.57 |
| ULSCH llr computation | 13.81 | 33.37 |

Note that the ULSCH llr computation time is the average for 1 symbol. We have 12 symbols in this case.
-
Compile time on ARM (armix):
time ./build_oai -c -C --ninja -P
- 2023.w39: 20min+, usually ninja got killed by OS
- this MR: ~2min
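The per-architecture width choice described in the first item could be expressed roughly as below. This is a minimal sketch in plain C, not the code in nr_ulsch_llr_computation.c: the macro names, the helper functions and the `ch_mag - |rx|` kernel are all hypothetical, and the inner loop only stands in for what a single SIMDe operation would do on one vector.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128-bit vectors on ARM, 256-bit elsewhere (x86), as stated in the MR text. */
#if defined(__aarch64__) || defined(__arm__)
#define LLR_VEC_BITS 128
#else
#define LLR_VEC_BITS 256
#endif

/* Number of int16_t lanes that fit in one vector of the chosen width. */
#define LLR_LANES (LLR_VEC_BITS / 16)

/* Saturating |x| for int16_t, so that INT16_MIN maps to INT16_MAX. */
static inline int16_t abs_sat16(int16_t x)
{
  if (x == INT16_MIN)
    return INT16_MAX;
  return (int16_t)(x < 0 ? -x : x);
}

/* Saturating a - b for int16_t. */
static inline int16_t sub_sat16(int16_t a, int16_t b)
{
  int32_t d = (int32_t)a - (int32_t)b;
  if (d > INT16_MAX)
    return INT16_MAX;
  if (d < INT16_MIN)
    return INT16_MIN;
  return (int16_t)d;
}

/* Hypothetical kernel: out[i] = ch_mag[i] - |rx[i]|, processed LLR_LANES samples at a time.
   The inner loop stands in for one SIMD abs/subtract pair on a full vector. */
static void llr_mag_minus_abs(const int16_t *rx, const int16_t *ch_mag, int16_t *out, size_t n)
{
  size_t i = 0;
  for (; i + LLR_LANES <= n; i += LLR_LANES)
    for (size_t lane = 0; lane < LLR_LANES; lane++)
      out[i + lane] = sub_sat16(ch_mag[i + lane], abs_sat16(rx[i + lane]));
  for (; i < n; i++) /* scalar tail for samples that do not fill a vector */
    out[i] = sub_sat16(ch_mag[i], abs_sat16(rx[i]));
}
```

Keeping the width behind one compile-time constant means the loop body is written once, and only the number of lanes per iteration differs between targets.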
-
Turn the somewhat lengthy function-like macros into static inline functions. Because these macros are invoked frequently, expanding them during preprocessing increases compile time. Inline functions are also easier to maintain, thanks to clearer variable scope and proper syntax checking. The performance (llr computation time in nr_ulsim) remains the same.
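To illustrate the macro-to-inline conversion in general terms, here is a small sketch. The saturating-add helper and both names are hypothetical and are not taken from nr_ulsch_llr_computation.c; the point is only the difference in scoping and type checking between the two forms.

```c
#include <stdint.h>

/* Macro form: the arguments are pasted textually, so each one is expanded
   several times and the compiler sees no parameter types to check. */
#define SAT_ADD16_MACRO(a, b)                                  \
  (((int32_t)(a) + (int32_t)(b)) > INT16_MAX   ? INT16_MAX     \
   : ((int32_t)(a) + (int32_t)(b)) < INT16_MIN ? INT16_MIN     \
                                               : ((int32_t)(a) + (int32_t)(b)))

/* static inline form: arguments are evaluated once, `sum` is a properly
   scoped local, and parameter types are checked at every call site. */
static inline int16_t sat_add16(int16_t a, int16_t b)
{
  int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX)
    return INT16_MAX;
  if (sum < INT16_MIN)
    return INT16_MIN;
  return (int16_t)sum;
}
```

Both forms return the same values for plain arguments. The inline version additionally avoids re-evaluating an argument that has side effects, which the macro would do.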
-
Refactor the duplicated code of nr_ulsch_qam16_qam16().
-
Refactor the duplicated code of nr_ulsch_qam64_qam64().
This MR does not enable ML LLR for 64QAM. To enable it, apply this change:
diff --git a/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c b/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c
index 18aea849f7..5e9e796337 100644
--- a/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c
+++ b/openair1/PHY/NR_TRANSPORT/nr_ulsch_demodulation.c
@@ -1346,7 +1346,8 @@ static void inner_rx (PHY_VARS_gNB *gNB,
     }
     if (nb_layer == 2) {
-      if (rel15_ul->qam_mod_order < 6) {
+      if (rel15_ul->qam_mod_order < 8) {
+        // QPSK, 16QAM, 64QAM ML
         nr_ulsch_compute_ML_llr(pusch_vars,
                                 symbol,
                                 (c16_t*)&pusch_vars->rxdataF_comp[0][symbol * buffer_length],
@@ -1361,6 +1362,7 @@ static void inner_rx (PHY_VARS_gNB *gNB,
                                 rel15_ul->qam_mod_order);
       }
       else {
+        // 256QAM MMSE
         nr_ulsch_mmse_2layers(frame_parms,
                               (int32_t **)pusch_vars->rxdataF_comp,
                               (int **)rxF_ch_maga,
@@ -1377,7 +1379,8 @@ static void inner_rx (PHY_VARS_gNB *gNB,
                               buffer_length);
       }
     }
-    if (nb_layer != 2 || rel15_ul->qam_mod_order >= 6)
+    // For single layer or 2 layer 256QAM MMSE
+    if (nb_layer != 2 || rel15_ul->qam_mod_order >= 8)
       for (int aatx = 0; aatx < nb_layer; aatx++)
         nr_ulsch_compute_llr((int32_t*)&pusch_vars->rxdataF_comp[aatx * nb_rx_ant][symbol * buffer_length],
                              (int32_t*)rxF_ch_maga[aatx],
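The selection logic touched by the diff above can be summarised with a small standalone sketch. The enum and function names are hypothetical; only the two conditions are taken from the diff, with the threshold passed in as a parameter (6 for this MR as merged, 8 with the patch applied). Modulation orders are bits per symbol: 2 for QPSK, 4 for 16QAM, 6 for 64QAM, 8 for 256QAM.

```c
#include <stdbool.h>

/* Hypothetical labels for the two 2-layer detector paths named in the diff. */
typedef enum { DETECTOR_ML, DETECTOR_MMSE } detector_t;

/* Mirrors `qam_mod_order < limit` inside the `nb_layer == 2` branch:
   below the limit the ML LLR routine is used, otherwise the MMSE path. */
static detector_t pick_two_layer_detector(int qam_mod_order, int ml_order_limit)
{
  return qam_mod_order < ml_order_limit ? DETECTOR_ML : DETECTOR_MMSE;
}

/* Mirrors `nb_layer != 2 || qam_mod_order >= limit`: whether the per-layer
   nr_ulsch_compute_llr() loop runs after the detector step. */
static bool needs_per_layer_llr(int nb_layer, int qam_mod_order, int ml_order_limit)
{
  return nb_layer != 2 || qam_mod_order >= ml_order_limit;
}
```

With a limit of 6, two-layer 64QAM takes the MMSE path followed by the per-layer LLR loop. With a limit of 8, it takes the ML path and skips that loop.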
Testing with commits:
| Branch | Commit | Date | Commit message |
|---|---|---|---|
| develop 2023.w39 | 0be397b2 | Oct/03/23 | Merge branch 'integration_2023_w39' into 'develop' |
| pusch-llr-SIMDe | f451f62c | Oct/06/23 | Fix cannot pass sanitizer, Align to 16 for 2 ... |
Command line parameters:
sudo ./nr_ulsim -n500 -g C,l -m17 -s17 -S25 -W2 -y2 -z2 -P -t70 -C6
-
-C6 means 6 threads are created in the thread pool; the thread count varies by test case.
-
As checked out, this commit uses MMSE for 64QAM. For ML 64QAM, apply the diff above.
-
This MR improves "Symbol Processing" by 17% by improving 64QAM ML LLR computation.
-
In 64QAM MCS=17, for effective throughput >= 70%, ML is better than MMSE by roughly 2dB in SNR.
-
In 2023.w39 with ML, the effective throughput drops as the number of threads increases. This appears to be fixed in this MR, for reasons that are not yet understood.
| | 2023.w39 | 2023.w39 | 2023.w39 | 2023.w39 | 2023.w39 | 2023.w39 | MR2229 | MR2229 | MR2229 |
|---|---|---|---|---|---|---|---|---|---|
| MMSE / ML | MMSE | MMSE | MMSE | ML | ML | ML | ML | ML | ML |
| -C | 1 | 6 | 12 | 1 | 6 | 12 | 1 | 6 | 12 |
| gNB RX | 1482.93 | 490.42 | 460.50 | 2104.24 | 637.67 | 572.44 | 1877.58 | 580.64 | 521.81 |
| RX PUSCH time | 647.44 | 217.23 | 180.55 | 1244.24 | 351.96 | 272.24 | 1024.92 | 297.38 | 228.71 |
| Symbol Processing | 579.98 | 149.05 | 109.45 | 1176.61 | 283.51 | 203.85 | 956.94 | 229.38 | 158.58 |
| BER | 0.082669 | 0.080998 | 0.079311 | 0.094239 | 0.082764 | 0.077498 | 0.094947 | 0.10220 | 0.104123 |
| n_error xxx/500 | 298 | 270 | 278 | 268 | 273 | 266 | 272 | 288 | 284 |
| eff throughput | 70.02 | 72.10 | 72.20 | 72.12 | 71.12 | 71.52 | 72.02 | 70.37 | 70.73 |
| SNR | 20 | 20.2 | 20.2 | 18.0 | 19.8 | 19.9 | 17.9 | 17.9 | 18.0 |