• Paul E. McKenney's avatar
    rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration · 971394f3
    Paul E. McKenney authored
    In Steven Rostedt's words:
    
    > I've been debugging the last couple of days why my tests have been
    > locking up. One of my tracing tests, runs all available tracers. The
    > lockup always happened with the mmiotrace, which is used to trace
    > interactions between priority drivers and the kernel. But to do this
    > easily, when the tracer gets registered, it disables all but the boot
    > CPUs. The lockup always happened after it got done disabling the CPUs.
    >
    > Then I decided to try this:
    >
    > while :; do
    > 	for i in 1 2 3; do
    > 		echo 0 > /sys/devices/system/cpu/cpu$i/online
    > 	done
    > 	for i in 1 2 3; do
    > 		echo 1 > /sys/devices/system/cpu/cpu$i/online
    > 	done
    > done
    >
    > Well, sure enough, that locked up too, with the same users. Doing a
    > sysrq-w (showing all blocked tasks):
    >
    > [ 2991.344562]   task                        PC stack   pid father
    > [ 2991.344562] rcu_preempt     D ffff88007986fdf8     0    10      2 0x00000000
    > [ 2991.344562]  ffff88007986fc98 0000000000000002 ffff88007986fc48 0000000000000908
    > [ 2991.344562]  ffff88007986c280 ffff88007986ffd8 ffff88007986ffd8 00000000001d3c80
    > [ 2991.344562]  ffff880079248a40 ffff88007986c280 0000000000000000 00000000fffd4295
    > [ 2991.344562] Call Trace:
    > [ 2991.344562]  [<ffffffff815437ba>] schedule+0x64/0x66
    > [ 2991.344562]  [<ffffffff81541750>] schedule_timeout+0xbc/0xf9
    > [ 2991.344562]  [<ffffffff8154bec0>] ? ftrace_call+0x5/0x2f
    > [ 2991.344562]  [<ffffffff81049513>] ? cascade+0xa8/0xa8
    > [ 2991.344562]  [<ffffffff815417ab>] schedule_timeout_uninterruptible+0x1e/0x20
    > [ 2991.344562]  [<ffffffff810c980c>] rcu_gp_kthread+0x502/0x94b
    > [ 2991.344562]  [<ffffffff81062791>] ? __init_waitqueue_head+0x50/0x50
    > [ 2991.344562]  [<ffffffff810c930a>] ? rcu_gp_fqs+0x64/0x64
    > [ 2991.344562]  [<ffffffff81061cdb>] kthread+0xb1/0xb9
    > [ 2991.344562]  [<ffffffff81091e31>] ? lock_release_holdtime.part.23+0x4e/0x55
    > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562]  [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0
    > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] kworker/0:1     D ffffffff81a30680     0    47      2 0x00000000
    > [ 2991.344562] Workqueue: events cpuset_hotplug_workfn
    > [ 2991.344562]  ffff880078dbbb58 0000000000000002 0000000000000006 00000000000000d8
    > [ 2991.344562]  ffff880078db8100 ffff880078dbbfd8 ffff880078dbbfd8 00000000001d3c80
    > [ 2991.344562]  ffff8800779ca5c0 ffff880078db8100 ffffffff81541fcf 0000000000000000
    > [ 2991.344562] Call Trace:
    > [ 2991.344562]  [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562]  [<ffffffff815437ba>] schedule+0x64/0x66
    > [ 2991.344562]  [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24
    > [ 2991.344562]  [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562]  [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50
    > [ 2991.344562]  [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50
    > [ 2991.344562]  [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40
    > [ 2991.344562]  [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50
    > [ 2991.344562]  [<ffffffff810af7e6>] rebuild_sched_domains_locked+0x6e/0x3a8
    > [ 2991.344562]  [<ffffffff810b0ec6>] rebuild_sched_domains+0x1c/0x2a
    > [ 2991.344562]  [<ffffffff810b109b>] cpuset_hotplug_workfn+0x1c7/0x1d3
    > [ 2991.344562]  [<ffffffff810b0ed9>] ? cpuset_hotplug_workfn+0x5/0x1d3
    > [ 2991.344562]  [<ffffffff81058e07>] process_one_work+0x2d4/0x4d1
    > [ 2991.344562]  [<ffffffff81058d3a>] ? process_one_work+0x207/0x4d1
    > [ 2991.344562]  [<ffffffff8105964c>] worker_thread+0x2e7/0x3b5
    > [ 2991.344562]  [<ffffffff81059365>] ? rescuer_thread+0x332/0x332
    > [ 2991.344562]  [<ffffffff81061cdb>] kthread+0xb1/0xb9
    > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562]  [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0
    > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] bash            D ffffffff81a4aa80     0  2618   2612 0x10000000
    > [ 2991.344562]  ffff8800379abb58 0000000000000002 0000000000000006 0000000000000c2c
    > [ 2991.344562]  ffff880077fea140 ffff8800379abfd8 ffff8800379abfd8 00000000001d3c80
    > [ 2991.344562]  ffff8800779ca5c0 ffff880077fea140 ffffffff81541fcf 0000000000000000
    > [ 2991.344562] Call Trace:
    > [ 2991.344562]  [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562]  [<ffffffff815437ba>] schedule+0x64/0x66
    > [ 2991.344562]  [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24
    > [ 2991.344562]  [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562]  [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562]  [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562]  [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40
    > [ 2991.344562]  [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562]  [<ffffffff81091c99>] ? __lock_is_held+0x32/0x53
    > [ 2991.344562]  [<ffffffff81548912>] notifier_call_chain+0x6b/0x98
    > [ 2991.344562]  [<ffffffff810671fd>] __raw_notifier_call_chain+0xe/0x10
    > [ 2991.344562]  [<ffffffff8103cf64>] __cpu_notify+0x20/0x32
    > [ 2991.344562]  [<ffffffff8103cf8d>] cpu_notify_nofail+0x17/0x36
    > [ 2991.344562]  [<ffffffff815225de>] _cpu_down+0x154/0x259
    > [ 2991.344562]  [<ffffffff81522710>] cpu_down+0x2d/0x3a
    > [ 2991.344562]  [<ffffffff81526351>] store_online+0x4e/0xe7
    > [ 2991.344562]  [<ffffffff8134d764>] dev_attr_store+0x20/0x22
    > [ 2991.344562]  [<ffffffff811b3c5f>] sysfs_write_file+0x108/0x144
    > [ 2991.344562]  [<ffffffff8114c5ef>] vfs_write+0xfd/0x158
    > [ 2991.344562]  [<ffffffff8114c928>] SyS_write+0x5c/0x83
    > [ 2991.344562]  [<ffffffff8154c494>] tracesys+0xdd/0xe2
    >
    > As well as held locks:
    >
    > [ 3034.728033] Showing all locks held in the system:
    > [ 3034.728033] 1 lock held by rcu_preempt/10:
    > [ 3034.728033]  #0:  (rcu_preempt_state.onoff_mutex){+.+...}, at: [<ffffffff810c9471>] rcu_gp_kthread+0x167/0x94b
    > [ 3034.728033] 4 locks held by kworker/0:1/47:
    > [ 3034.728033]  #0:  (events){.+.+.+}, at: [<ffffffff81058d3a>] process_one_work+0x207/0x4d1
    > [ 3034.728033]  #1:  (cpuset_hotplug_work){+.+.+.}, at: [<ffffffff81058d3a>] process_one_work+0x207/0x4d1
    > [ 3034.728033]  #2:  (cpuset_mutex){+.+.+.}, at: [<ffffffff810b0ec1>] rebuild_sched_domains+0x17/0x2a
    > [ 3034.728033]  #3:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50
    > [ 3034.728033] 1 lock held by mingetty/2563:
    > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2565:
    > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2569:
    > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2572:
    > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2575:
    > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 7 locks held by bash/2618:
    > [ 3034.728033]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8114bc3f>] file_start_write+0x2a/0x2c
    > [ 3034.728033]  #1:  (&buffer->mutex#2){+.+.+.}, at: [<ffffffff811b3b93>] sysfs_write_file+0x3c/0x144
    > [ 3034.728033]  #2:  (s_active#54){.+.+.+}, at: [<ffffffff811b3c3e>] sysfs_write_file+0xe7/0x144
    > [ 3034.728033]  #3:  (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff810217c2>] cpu_hotplug_driver_lock+0x17/0x19
    > [ 3034.728033]  #4:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8103d196>] cpu_maps_update_begin+0x17/0x19
    > [ 3034.728033]  #5:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103cfd8>] cpu_hotplug_begin+0x2c/0x6d
    > [ 3034.728033]  #6:  (rcu_preempt_state.onoff_mutex){+.+...}, at: [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e
    > [ 3034.728033] 1 lock held by bash/2980:
    > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
    >
    > Things looked a little weird. Also, this is a deadlock that lockdep did
    > not catch. But what we have here does not look like a circular lock
    > issue:
    >
    > Bash is blocked in rcu_cpu_notify():
    >
    > 1961		/* Exclude any attempts to start a new grace period. */
    > 1962		mutex_lock(&rsp->onoff_mutex);
    >
    >
    > kworker is blocked in get_online_cpus(), which makes sense as we are
    > currently taking down a CPU.
    >
    > But rcu_preempt is not blocked on anything. It is simply sleeping in
    > rcu_gp_kthread (really rcu_gp_init) here:
    >
    > 1453	#ifdef CONFIG_PROVE_RCU_DELAY
    > 1454			if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 &&
    > 1455			    system_state == SYSTEM_RUNNING)
    > 1456				schedule_timeout_uninterruptible(2);
    > 1457	#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
    >
    > And it does this while holding the onoff_mutex that bash is waiting for.
    >
    > Doing a function trace, it showed me where it happened:
    >
    > [  125.940066] rcu_pree-10      3.... 28384115273: schedule_timeout_uninterruptible <-rcu_gp_kthread
    > [...]
    > [  125.940066] rcu_pree-10      3d..3 28384202439: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=D ==> next_comm=watchdog/3 next_pid=38 next_prio=120
    >
    > The watchdog ran, and then:
    >
    > [  125.940066] watchdog-38      3d..3 28384692863: sched_switch: prev_comm=watchdog/3 prev_pid=38 prev_prio=120 prev_state=P ==> next_comm=modprobe next_pid=2848 next_prio=118
    >
    > Not sure what modprobe was doing, but shortly after that:
    >
    > [  125.940066] modprobe-2848    3d..3 28385041749: sched_switch: prev_comm=modprobe prev_pid=2848 prev_prio=118 prev_state=R+ ==> next_comm=migration/3 next_pid=40 next_prio=0
    >
    > Where the migration thread took down the CPU:
    >
    > [  125.940066] migratio-40      3d..3 28389148276: sched_switch: prev_comm=migration/3 prev_pid=40 prev_prio=0 prev_state=P ==> next_comm=swapper/3 next_pid=0 next_prio=120
    >
    > which finally did:
    >
    > [  125.940066]   <idle>-0       3...1 28389282142: arch_cpu_idle_dead <-cpu_startup_entry
    > [  125.940066]   <idle>-0       3...1 28389282548: native_play_dead <-arch_cpu_idle_dead
    > [  125.940066]   <idle>-0       3...1 28389282924: play_dead_common <-native_play_dead
    > [  125.940066]   <idle>-0       3...1 28389283468: idle_task_exit <-play_dead_common
    > [  125.940066]   <idle>-0       3...1 28389284644: amd_e400_remove_cpu <-play_dead_common
    >
    >
    > CPU 3 is now offline, the rcu_preempt thread that ran on CPU 3 is still
    > doing a schedule_timeout_uninterruptible() and it registered it's
    > timeout to the timer base for CPU 3. You would think that it would get
    > migrated right? The issue here is that the timer migration happens at
    > the CPU notifier for CPU_DEAD. The problem is that the rcu notifier for
    > CPU_DOWN is blocked waiting for the onoff_mutex to be released, which is
    > held by the thread that just put itself into a uninterruptible sleep,
    > that wont wake up until the CPU_DEAD notifier of the timer
    > infrastructure is called, which wont happen until the rcu notifier
    > finishes. Here's our deadlock!
    
    This commit breaks this deadlock cycle by substituting a shorter udelay()
    for the previous schedule_timeout_uninterruptible(), while at the same
    time increasing the probability of the delay.  This maintains the intensity
    of the testing.
    Reported-by: 's avatarSteven Rostedt <rostedt@goodmis.org>
    Signed-off-by: 's avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
    Tested-by: 's avatarSteven Rostedt <rostedt@goodmis.org>
    971394f3
rcutree.c 103 KB