Segfault during initialization of USRP
# Issue Description

This is a segmentation fault reported from CI loops (https://jenkins-oai.eurecom.fr/job/RAN-SA-2x2-Module-CN5G/817/). I am in possession of the coredump and docker image that was running the software.

The gNB segfaults inside USRP lib during initialization:

```gdb
(gdb) bt
#0  0x00007f0e7f6236fd in uhd::rfnoc::chdr::mgmt_payload::deserialize(unsigned long const*, unsigned long, std::function<unsigned long (unsigned long)> const&) () from /usr/local/lib64/libuhd.so.4.4.0
#1  0x00007f0e7f690b34 in uhd::rfnoc::mgmt::mgmt_portal_impl::_send_recv_mgmt_transaction(uhd::rfnoc::chdr_ctrl_xport&, uhd::rfnoc::chdr::mgmt_payload const&, double) [clone .constprop.0] () from /usr/local/lib64/libuhd.so.4.4.0
#2  0x00007f0e7f695c1d in uhd::rfnoc::mgmt::mgmt_portal_impl::_get_ostrm_status(uhd::rfnoc::chdr_ctrl_xport&, uhd::rfnoc::detail::topo_node_t const&) () from /usr/local/lib64/libuhd.so.4.4.0
#3  0x00007f0e7f69874b in uhd::rfnoc::mgmt::mgmt_portal_impl::config_local_rx_stream_commit(uhd::rfnoc::chdr_ctrl_xport&, unsigned short const&, double, bool) () from /usr/local/lib64/libuhd.so.4.4.0
#4  0x00007f0e7f62c99a in uhd::rfnoc::chdr_rx_data_xport::configure_sep(std::shared_ptr<uhd::transport::io_service>, std::shared_ptr<uhd::transport::recv_link_if>, std::shared_ptr<uhd::transport::send_link_if>, uhd::rfnoc::chdr::chdr_packet_factory const&, uhd::rfnoc::mgmt::mgmt_portal&, std::pair<unsigned short, unsigned short> const&, uhd::rfnoc::sw_buff_t, uhd::rfnoc::sw_buff_t, uhd::rfnoc::stream_buff_params_t const&, uhd::rfnoc::stream_buff_params_t const&, uhd::rfnoc::stream_buff_params_t const&, bool, std::function<void ()>) () from /usr/local/lib64/libuhd.so.4.4.0
#5  0x00007f0e7fb06a5d in uhd::mpmd::mpmd_mboard_impl::mpmd_mb_iface::make_rx_data_transport(uhd::rfnoc::mgmt::mgmt_portal&, std::pair<std::pair<unsigned short, unsigned short>, std::pair<unsigned short, unsigned short> > const&, std::pair<unsigned short, unsigned short> const&, uhd::rfnoc::sw_buff_t, uhd::rfnoc::sw_buff_t, uhd::device_addr_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib64/libuhd.so.4.4.0
#6  0x00007f0e7f649116 in link_stream_manager_impl::create_device_to_host_data_stream(std::pair<unsigned short, unsigned short>, uhd::rfnoc::sw_buff_t, uhd::rfnoc::sw_buff_t, uhd::device_addr_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib64/libuhd.so.4.4.0
#7  0x00007f0e7f64c748 in graph_stream_manager_impl::create_device_to_host_data_stream(std::pair<unsigned short, unsigned short>, uhd::rfnoc::sw_buff_t, uhd::rfnoc::sw_buff_t, unsigned long, uhd::device_addr_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib64/libuhd.so.4.4.0
#8  0x00007f0e7f6841a5 in rfnoc_graph_impl::connect(uhd::rfnoc::block_id_t const&, unsigned long, std::shared_ptr<uhd::rx_streamer>, unsigned long, unsigned long) () from /usr/local/lib64/libuhd.so.4.4.0
#9  0x00007f0e7f7a862a in multi_usrp_rfnoc::get_rx_stream(uhd::stream_args_t const&) () from /usr/local/lib64/libuhd.so.4.4.0
#10 0x00007f0e99460816 in device_init (device=0x1a956ba0, openair0_cfg=0x1a9577b8) at /oai-ran/radio/USRP/usrp_lib.cpp:1478
#11 0x00000000006fdb1a in load_lib (device=device@entry=0x1a956ba0, openair0_cfg=openair0_cfg@entry=0x1a9577b8, cfg=cfg@entry=0x0, flag=flag@entry=0 '\000') at /oai-ran/radio/COMMON/common_lib.c:139
#12 0x00000000006fdcec in openair0_device_load (device=device@entry=0x1a956ba0, openair0_cfg=openair0_cfg@entry=0x1a9577b8) at /oai-ran/radio/COMMON/common_lib.c:147
#13 0x00000000006f56a0 in ru_thread (param=0x1a956710) at /oai-ran/executables/nr-ru.c:1209
#14 0x00007f0e9c129c02 in start_thread () from /lib64/libc.so.6
#15 0x00007f0e9c1aded4 in clone () from /lib64/libc.so.6
```

Frame 1:
https://github.com/EttusResearch/uhd/blob/a5ed1872be6d0fc36de9a7e0b508933da1f119bc/host/lib/rfnoc/mgmt_portal.cpp#L1132

Frame 0: https://github.com/EttusResearch/uhd/blob/a5ed1872be6d0fc36de9a7e0b508933da1f119bc/host/lib/rfnoc/chdr_types.cpp#L518

# Investigation

I'm certain that the segfault is due to reading `buff` in this line:

```c++
std::list<uint64_t> src_list(buff, buff + (buff_size * (_padding_size + 1)));
```

```assembly
   0x7f0e7f6236e1 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+81>   lea    (%rsi,%rdx,8),%r13
   0x7f0e7f6236e5 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+85>   cmp    %rsi,%r13
   0x7f0e7f6236e8 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+88>   je     0x7f0e7f62371c <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+140>
   0x7f0e7f6236ea <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+90>   nopw   0x0(%rax,%rax,1)
   0x7f0e7f6236f0 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+96>   mov    $0x18,%edi
   0x7f0e7f6236f5 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+101>  call   0x7f0e7f326810 <_Znwm@plt>
   0x7f0e7f6236fa <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+106>  mov    %rax,%rdi
 > 0x7f0e7f6236fd <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+109>  mov    0x0(%rbp),%rax
   0x7f0e7f623701 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+113>  mov    %r12,%rsi
   0x7f0e7f623704 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+116>  add    $0x8,%rbp
   0x7f0e7f623708 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+120>  mov    %rax,0x10(%rdi)
   0x7f0e7f62370c <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+124>  call   0x7f0e7f326cd0 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
   0x7f0e7f623711 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+129>  addq   $0x1,0x70(%rsp)
   0x7f0e7f623717 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+135>  cmp    %rbp,%r13
   0x7f0e7f62371a <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+138>  jne    0x7f0e7f6236f0 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+96>
```

Comments:

This is the `buff` + offset dereference (the cause of the segfault):

```assembly
mov 0x0(%rbp),%rax
```

This is the end of the loop generated by the `std::list` constructor:

```assembly
cmp %rbp,%r13
jne 0x7f0e7f6236f0
```

`r13` holds the one-past-the-end address (the address right after the last element), calculated here:

```assembly
lea (%rsi,%rdx,8),%r13
```

This corresponds to

```c++
buff + (buff_size * (_padding_size + 1))
```

`rdx` is already equal to `buff_size * (_padding_size + 1)`.

The segfault happens because the loop end condition is invalid:

```gdb
(gdb) p $r13
$17 = 139700656659168
(gdb) p (uint64_t)$rbp
$18 = 139700658503680
```

`r13` is lower than `rbp`, so the calculated loop end is before the current iterator. Since the loop only stops when `rbp` equals `r13`, it keeps reading until it hits unmapped memory.

I've tried to access `this` (`mgmt_payload`) and I think it is saved in `rbx`:

```assembly
   0x7f0e7f6236b3 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+35>  mov    0x8(%rdi),%rdx   // This is loading padding_size for the loop end condition calculation
   0x7f0e7f6236b7 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+39>  lea    0x60(%rsp),%r12
   0x7f0e7f6236bc <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+44>  mov    %rdi,%rbx        // this pointer in rbx?
```

```gdb
p *(size_t*)($rbx+8)
$10 = 0
```

```
(gdb) x/40x $rbx
0x7f0e90bfcad0: 0x0003 0x0100 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
```

This is the `mgmt_payload` type (https://github.com/EttusResearch/uhd/blob/a5ed1872be6d0fc36de9a7e0b508933da1f119bc/host/include/uhd/rfnoc/chdr_types.hpp#L953).

Looking for `buff_size` (to complete the loop end condition calculation), from frame 1:

```assembly
   0x7f0e7f690b29 <_ZN3uhd5rfnoc4mgmt16mgmt_portal_impl27_send_recv_mgmt_transactionERNS0_15chdr_ctrl_xportERKNS0_4chdr12mgmt_payloadEd.constprop.0+729>  mov    %rbp,%rdx
   0x7f0e7f690b2c <_ZN3uhd5rfnoc4mgmt16mgmt_portal_impl27_send_recv_mgmt_transactionERNS0_15chdr_ctrl_xportERKNS0_4chdr12mgmt_payloadEd.constprop.0+732>  mov    %r13,%rdi
   0x7f0e7f690b2f <_ZN3uhd5rfnoc4mgmt16mgmt_portal_impl27_send_recv_mgmt_transactionERNS0_15chdr_ctrl_xportERKNS0_4chdr12mgmt_payloadEd.constprop.0+735>  call   0x7f0e7f3299e0 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE@plt>
```

`rdx` is set here; in frame 0 it is read:

```assembly
   0x7f0e7f6236a6 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+22>  cmp    $0x1,%rdx
```

Therefore `rdx` is `buff_size` in frame 0. This assembly corresponds to this line in the code:

```c++
UHD_ASSERT_THROW(buff_size > 1);
```

Also `rbp` is `buff_size` in frame 1.
It is pushed on the stack here:

```assembly
   0x7f0e7f623690 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE>     push   %r15
   0x7f0e7f623692 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+2>   push   %r14
   0x7f0e7f623694 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+4>   push   %r13
   0x7f0e7f623696 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+6>   push   %r12
   0x7f0e7f623698 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+8>   push   %rbp
   0x7f0e7f623699 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+9>   push   %rbx
   0x7f0e7f62369a <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+10>  sub    $0x208,%rsp
```

`push` stores the register on the stack. The stack grows downwards (i.e. bigger stack -> lower address).

```
(gdb) p $rsp
$18 = (void *) 0x7f0e90bfc6b0
(gdb) p ($rsp+0x208)
$19 = (void *) 0x7f0e90bfc8b8
```

`rsp + 0x208` should be the value of `rsp` right after the pushes, i.e. the address of the saved `rbx` (the last register pushed):

```gdb
0x7f0e90bfc8b8: 0x67a0 0x7409 0x7f0e 0x0000 0xffff 0xffff 0xffff 0x1fff
0x7f0e90bfc8c8: 0x6b80 0x740e 0x7f0e 0x0000 0xcad0 0x90bf 0x7f0e 0x0000
0x7f0e90bfc8d8: 0xcae0 0x90bf 0x7f0e 0x0000 0xc940 0x90bf 0x7f0e 0x0000
0x7f0e90bfc8e8: 0x0b34 0x7f69 0x7f0e 0x0000 0x5050 0x740e 0x7f0e 0x0000
0x7f0e90bfc8f8: 0x0003 0x0000 0x0000 0x0000 0x0018 0x0000 0x0000 0x0000
0x7f0e90bfc908: 0xc970 0x90bf 0x7f0e 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e90bfc918: 0xe4f0 0x7402 0x7f0e 0x0000 0x6b80 0x740e 0x7f0e 0x0000
```

`(0xffff 0xffff 0xffff 0x1fff)` at address `0x7f0e90bfc8c0` should be the saved `rbp` (pushed just before `rbx`), which holds `buff_size` in frame 1 - **uncertain**. Read as a little-endian 64-bit value, that is `0x1fffffffffffffff`.

This must be the list `size` field:

```assembly
   0x7f0e7f623711 <_ZN3uhd5rfnoc4chdr12mgmt_payload11deserializeEPKmmRKSt8functionIFmmEE+129>  addq   $0x1,0x70(%rsp)
```

```
(gdb) p *(uint64_t*)($rsp+(0x70))
$28 = 230563
```

So 230563 elements were added before the crash.
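The saved-register slot and the two loop registers can be cross-checked with plain integer arithmetic. The sketch below is only an illustration: it assumes the halfwords in the dump are little-endian, that the slot at `0x7f0e90bfc8c0` really holds `buff_size` (marked uncertain above), that `_padding_size` is 0 as read from `$rbx+8`, and that each list element corresponds to one 8-byte read. The helper names are hypothetical and not part of UHD.

```cpp
#include <cstdint>

// Combine four little-endian 16-bit units, in the order gdb printed them,
// into one 64-bit value.
std::uint64_t from_halfwords(std::uint16_t w0, std::uint16_t w1,
                             std::uint16_t w2, std::uint16_t w3)
{
    return static_cast<std::uint64_t>(w0)
         | (static_cast<std::uint64_t>(w1) << 16)
         | (static_cast<std::uint64_t>(w2) << 32)
         | (static_cast<std::uint64_t>(w3) << 48);
}

// Mirrors `lea (%rsi,%rdx,8),%r13`: base + count * 8, wrapping modulo 2^64.
std::uint64_t loop_end(std::uint64_t base, std::uint64_t count)
{
    return base + count * 8u;
}

// Values copied from the gdb session.
const std::uint64_t kR13 = 139700656659168ULL;  // computed loop end
const std::uint64_t kRbp = 139700658503680ULL;  // faulting read address
const std::uint64_t kListSize = 230563;         // elements copied before the crash

// Address of the first element, assuming one 8-byte read per list element.
std::uint64_t first_read_address()
{
    return kRbp - kListSize * 8u;
}
```

Under these assumptions the slot decodes to `0x1fffffffffffffff`, multiplying it by 8 wraps to an offset of -8, and `loop_end(first_read_address(), 0x1fffffffffffffff)` equals the observed `r13`. A `buff_size` of -1 cast to `uint64_t` wraps to the same -8 offset, so the arithmetic alone cannot tell the two values apart.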
Given that count, `buff` might be `rbp - 230563 * 8`:

```
(gdb) p $rbp - *(uint64_t*)($rsp+(0x70)) * 8
$30 = (void *) 0x7f0e9803dae8
0x7f0e9803dae8: 0x0801 0x0024 0x0000 0x0000 0x0800 0x0028 0xffff 0x00ff
0x7f0e9803daf8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db08: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db18: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db28: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db38: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db48: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db58: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db68: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
0x7f0e9803db78: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
```

Looks good. It does look like a buffer pointer, but the `buff_size` must be wrong. It could be -1 (cast to uint64_t), or another huge value left by an unsigned underflow - that would explain the faulty loop end condition.

This is as much time as I am willing to put into this.

Looking at the documentation, I found one remark about calling `multi_usrp_rfnoc::get_rx_stream(uhd::stream_args_t const&)` (in the call stack) [source: https://files.ettus.com/manual/classuhd_1_1device.html#a0a9e36f353dcce36b4dd8d394c8813e3]:

```
Make a new receive streamer from the streamer arguments.

Note: For RFNoC devices, there can always be only one streamer per channel. When calling get_rx_stream() a second time, the first streamer connected to this channel must be destroyed beforehand. Multiple streamers for different channels are allowed. For non-RFNoC devices, you can only have one RX streamer at a time. Be careful to destroy the old one if you want to create a new one.
```

How can we ensure the device is in a correct state - i.e. that the "old" rx streamer is destroyed correctly?

# Conclusions

1. Consult the authors of the driver regarding the rx streamer behavior.
2. Suggest that UHD add extra asserts on `buff_size` in the code. These seem to be non-time-critical functions, as commented in the code.
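
The second suggestion could take the form of a bounds check before the list is built. The sketch below is an assumption about what such a check might look like, not UHD code: `checked_word_count` and `max_words` (the capacity of the receive buffer in 64-bit words, which the caller would need to know) are hypothetical names.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper, not part of UHD: returns buff_size * (padding_size + 1)
// only if the result fits into a buffer of max_words 64-bit words.
std::uint64_t checked_word_count(std::uint64_t buff_size,
                                 std::uint64_t padding_size,
                                 std::uint64_t max_words)
{
    // Same lower bound as the existing UHD_ASSERT_THROW(buff_size > 1).
    if (buff_size <= 1) {
        throw std::invalid_argument("buff_size must be greater than 1");
    }
    // Reject absurd padding first; this also keeps padding_size + 1 from wrapping.
    if (padding_size >= max_words) {
        throw std::length_error("padding_size exceeds the buffer capacity");
    }
    const std::uint64_t stride = padding_size + 1;
    // Compare by division so the check itself cannot overflow.
    if (buff_size > max_words / stride) {
        throw std::length_error("buff_size exceeds the buffer capacity");
    }
    return buff_size * stride;
}
```

With a check of this kind, a `buff_size` of `0x1fffffffffffffff` would be rejected with an exception instead of producing an end pointer that lies before `buff`.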