1. 26 Mar, 2015 1 commit
    • workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · d8bee0e3
      Tejun Heo authored
      commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.
      
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      
      try_to_grab_pending() can always grab PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
      this case, __cancel_work_timer() currently invokes flush_work().  The
      assumption is that the completion of the work item is what the other
      canceling task would be waiting for too and thus waiting for the same
      condition and retrying should allow forward progress without excessive
      busy looping.
      
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
      
      task A			task B			worker
      
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			  try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			  flush_work()
      			    false as work is no longer executing
      			}
      
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
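
      Concretely, the resulting __cancel_work_timer() retry loop looks roughly
      like the sketch below.  The cwt_wait/cwt_wakefn names and the surrounding
      helpers are illustrative of the approach described above (an exclusive,
      work-item-matching wait for CANCELING to clear), not a verbatim copy of
      the patch:

        struct cwt_wait {
                wait_queue_t            wait;
                struct work_struct      *work;
        };

        static int cwt_wakefn(wait_queue_t *wait, unsigned mode, int sync, void *key)
        {
                struct cwt_wait *cwait = container_of(wait, struct cwt_wait, wait);

                /* only wake the waiter canceling this particular work item */
                if (cwait->work != key)
                        return 0;
                return autoremove_wake_function(wait, mode, sync, key);
        }

        static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
        {
                static DECLARE_WAIT_QUEUE_HEAD(cancel_waitq);
                unsigned long flags;
                int ret;

                do {
                        ret = try_to_grab_pending(work, is_dwork, &flags);
                        if (unlikely(ret == -ENOENT)) {
                                /*
                                 * Someone else is canceling @work.  Instead of
                                 * busy-looping on flush_work(), sleep until the
                                 * other task clears CANCELING and wakes us.
                                 */
                                struct cwt_wait cwait;

                                init_wait(&cwait.wait);
                                cwait.wait.func = cwt_wakefn;
                                cwait.work = work;

                                prepare_to_wait_exclusive(&cancel_waitq, &cwait.wait,
                                                          TASK_UNINTERRUPTIBLE);
                                if (work_is_canceling(work))
                                        schedule();
                                finish_wait(&cancel_waitq, &cwait.wait);
                        }
                } while (unlikely(ret < 0));

                /* as before: mark canceling, flush, then wake the next waiter */
                mark_work_canceling(work);
                local_irq_restore(flags);

                flush_work(work);
                clear_work_data(work);

                smp_mb();       /* pairs with prepare_to_wait() above */
                if (waitqueue_active(&cancel_waitq))
                        __wake_up(&cancel_waitq, TASK_NORMAL, 1, work);

                return ret;
        }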
      
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      
      v3: bit_waitqueue() can't be used for work items defined in vmalloc
          area.  Switched to a custom wake function which matches the target
          work item, and to exclusive wait and wakeup.
      
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: Rabin Vincent <rabin.vincent@axis.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8bee0e3
  2. 06 Feb, 2015 1 commit
    • workqueue: fix subtle pool management issue which can stall whole worker_pool · 85be16ba
      Tejun Heo authored
      commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.
      
      A worker_pool's forward progress is guaranteed by the fact that the
      last idle worker assumes the manager role to create more workers and
      summon the rescuers if creating workers doesn't succeed in a timely
      manner before proceeding to execute work items.
      
      This manager role is implemented in manage_workers(), which indicates
      whether the worker may proceed to work item execution with its return
      value.  This is necessary because multiple workers may contend for the
      manager role, and, if there already is a manager, others should
      proceed to work item execution.
      
      Unfortunately, the function also indicates that the worker may proceed
      to work item execution if need_to_create_worker() is false at the head
      of the function.  need_to_create_worker() tests the following
      conditions.
      
      	pending work items && !nr_running && !nr_idle
      
      The first and third conditions are protected by pool->lock and thus
      won't change while holding pool->lock; however, nr_running can change
      asynchronously as other workers block and resume.  While it's likely
      to be zero, as someone woke this worker up in the first place, some
      other workers could have become runnable in between, making it non-zero.
      
      If this happens, manage_workers() could return false even with zero
      nr_idle, making the worker, the last idle one, proceed to execute work
      items.  If then all workers of the pool end up blocking on a resource
      which can only be released by a work item which is pending on that
      pool, the whole pool can deadlock as there's no one to create more
      workers or summon the rescuers.
      
      This patch fixes the problem by removing the early exit condition from
      maybe_create_worker() and making manage_workers() return false iff
      there's already another manager, which ensures that the last worker
      doesn't start executing work items.
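
      With the fix, the manager arbitration becomes the only thing
      manage_workers() reports.  A condensed sketch, assuming the
      pool->manager_arb mutex and maybe_create_worker() helper used by the
      workqueue code of this era (not the literal diff):

        static bool manage_workers(struct worker *worker)
        {
                struct worker_pool *pool = worker->pool;

                /*
                 * Whoever grabs manager_arb becomes the manager.  Failing the
                 * trylock while holding pool->lock reliably means someone else
                 * is already managing, so this worker may go on to execute
                 * work items.  There is no longer an early
                 * need_to_create_worker() exit that could let the last idle
                 * worker slip through.
                 */
                if (!mutex_trylock(&pool->manager_arb))
                        return false;

                maybe_create_worker(pool);

                mutex_unlock(&pool->manager_arb);
                return true;
        }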
      
      We could leave the early exit condition alone and just ignore the
      return value, but the only reason it was put there is that
      manage_workers() used to perform both creation and destruction of
      workers and thus could be invoked while the pool was trying to reduce
      the number of workers.  Now that manage_workers() is called only when
      more workers are needed, the only cases in which the early exit
      condition triggers are rare races, rendering it pointless.
      
      Tested with simulated workload and modified workqueue code which
      trigger the pool deadlock reliably without this patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      85be16ba
  3. 17 Jul, 2014 2 commits
    • workqueue: zero cpumask of wq_numa_possible_cpumask on init · 3e24998c
      Yasuaki Ishimatsu authored
      commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.
      
      When hot-adding and onlining a CPU, a kernel panic occurs, showing the
      following call trace.
      
        BUG: unable to handle kernel paging request at 0000000000001d08
        IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
        PGD 0
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
         [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
         [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
         [<ffffffff811926f1>] new_slab+0x91/0x300
         [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
         [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
         [<ffffffff810a3c78>] ? load_balance+0x218/0x890
         [<ffffffff8101a679>] ? sched_clock+0x9/0x10
         [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
         [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
         [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
         [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
         [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
         [<ffffffff8105d0ec>] do_fork+0xbc/0x360
         [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
         [<ffffffff81086652>] kthreadd+0x2c2/0x300
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
         [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
      
      In my investigation, I found that the root cause is
      wq_numa_possible_cpumask.  All entries of wq_numa_possible_cpumask are
      allocated by alloc_cpumask_var_node() and are used without being
      initialized, so they contain garbage values.
      
      When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
      It calls alloc_unbound_pwq(), which in turn calls get_unbound_pool().
      In get_unbound_pool(), worker_pool->node is set as follows:
      
              /* if cpumask is contained inside a NUMA node, we belong to that node */
              if (wq_numa_enabled) {
                      for_each_node(node) {
                              if (cpumask_subset(pool->attrs->cpumask,
                                                 wq_numa_possible_cpumask[node])) {
                                      pool->node = node;
                                      break;
                              }
                      }
              }
      
      But wq_numa_possible_cpumask[node] does not hold the correct cpumask, so
      the wrong node is selected.  As a result, a kernel panic occurs.
      
      This patch allocates all entries of wq_numa_possible_cpumask with
      zalloc_cpumask_var_node() so that they are zero-initialized, and the
      panic disappears.
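
      The fix itself is a one-word change in wq_numa_init(); a sketch with
      approximate variable names:

        for_each_node(node)
                BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                node_online(node) ? node : NUMA_NO_NODE));
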
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: bce90380 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e24998c
    • workqueue: fix dev_set_uevent_suppress() imbalance · 7c36f88c
      Maxime Bizon authored
      commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.
      
      Uevents are suppressed during attributes registration, but never
      restored, so kobject_uevent() does nothing.
      Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 226223ab
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c36f88c
  4. 07 Jun, 2014 4 commits
  5. 07 Mar, 2014 1 commit
    • workqueue: ensure @task is valid across kthread_stop() · 4403be9e
      Lai Jiangshan authored
      commit 5bdfff96c69a4d5ab9c49e60abf9e070ecd2acbb upstream.
      
      When a kworker should die, the kworker is notified through the
      WORKER_DIE flag instead of kthread_should_stop().  This, IIRC, is
      primarily to keep the test synchronized inside the worker_pool lock.
      WORKER_DIE is first set while holding pool->lock, the lock is dropped
      and kthread_stop() is called.
      
      Unfortunately, this means that there's a slight chance that the target
      kworker may see WORKER_DIE, exit, and have its task freed before or
      during kthread_stop().
      
      Fix it by pinning the target task before setting WORKER_DIE and
      putting it after kthread_stop() is done.
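
      The resulting ordering in the worker-destruction path is roughly the
      following (a simplified sketch of the described fix, not the exact
      diff):

        struct task_struct *task = worker->task;

        /* pin the task so the exiting kworker can't free it under us */
        get_task_struct(task);

        spin_lock_irq(&pool->lock);
        worker->flags |= WORKER_DIE;            /* kworker sees this and exits */
        spin_unlock_irq(&pool->lock);

        kthread_stop(task);
        put_task_struct(task);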
      
      tj: Improved patch description and comment.  Moved pinning above
          WORKER_DIE to better signify what it's protecting.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4403be9e
  6. 04 Dec, 2013 1 commit
    • workqueue: fix ordered workqueues in NUMA setups · ced4ac92
      Tejun Heo authored
      commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream.
      
      An ordered workqueue implements execution ordering by using a single
      pool_workqueue with max_active == 1.  On a given pool_workqueue, work
      items are processed in FIFO order and limiting max_active to 1 ensures
      that the queued work items are processed one by one.
      
      Unfortunately, 4c16bd32 ("workqueue: implement NUMA affinity for
      unbound workqueues") accidentally broke this guarantee by applying
      NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
      workqueue would end up with separate pool_workqueues for different
      nodes.  Each pool_workqueue still limits max_active to 1 but multiple
      work items may be executed concurrently and out of order depending on
      which node they are queued to.
      
      Fix it by using dedicated ordered_wq_attrs[] when creating ordered
      workqueues.  The new attrs match the unbound ones except that no_numa
      is always set thus forcing all NUMA nodes to share the default
      pool_workqueue.
      
      While at it, add a sanity check in the workqueue creation path which
      verifies that an ordered workqueue has only the default
      pool_workqueue.
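
      In the workqueue creation path this amounts to roughly the following
      sketch, using the ordered_wq_attrs[] array this patch introduces
      (highpri and unbound_std_wq_attrs[] are the pre-existing helpers; not
      the verbatim diff):

        /* __WQ_ORDERED workqueues get no_numa attrs so only dfl_pwq is used */
        if (wq->flags & __WQ_ORDERED) {
                ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
                /* sanity check: an ordered wq must end up with only the default pwq */
                WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
                              wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
                     "ordering guarantee broken for workqueue %s\n", wq->name);
        } else {
                ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
        }
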
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Libin <huawei.libin@huawei.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ced4ac92
  7. 08 Sep, 2013 1 commit
    • workqueue: cond_resched() after processing each work item · 6ff96f73
      Tejun Heo authored
      commit b22ce2785d97423846206cceec4efee0c4afd980 upstream.
      
      If !PREEMPT, a kworker running work items back to back can hog CPU.
      This becomes dangerous when a self-requeueing work item which is
      waiting for something to happen races against stop_machine.  Such a
      work item would requeue itself indefinitely, hogging the kworker and
      the CPU it's running on, while stop_machine would wait for
      that CPU to enter stop_machine while preventing anything else from
      happening on all other CPUs.  The two would deadlock.
      
      Jamie Liu reports that this deadlock scenario exists around
      scsi_requeue_run_queue() and libata port multiplier support, where one
      port may exclude command processing from other ports.  With the right
      timing, scsi_requeue_run_queue() can end up requeueing itself trying
      to execute an IO which is asked to be retried while another device has
      an exclusive access, which in turn can't make forward progress due to
      stop_machine.
      
      Fix it by invoking cond_resched() after executing each work item.
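
      The change itself is a single call in the worker loop; conceptually,
      a sketch of the relevant spot in process_one_work():

        worker->current_func(work);     /* run the work item's callback */

        /*
         * On !PREEMPT kernels nothing else forces a reschedule between
         * back-to-back work items, so yield explicitly to let stop_machine
         * and other tasks make forward progress.
         */
        cond_resched();
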
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jamie Liu <jamieliu@google.com>
      References: http://thread.gmane.org/gmane.linux.kernel/1552567
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ff96f73
  8. 12 Aug, 2013 1 commit
    • workqueue: copy workqueue_attrs with all fields · 73b8bd6d
      Shaohua Li authored
      commit 2865a8fb44cc32420407362cbda80c10fa09c6b2 upstream.
      
       $echo '0' > /sys/bus/workqueue/devices/xxx/numa
       $cat /sys/bus/workqueue/devices/xxx/numa
      
      I got 1.  It should be 0; the reason is that copy_workqueue_attrs(),
      called from apply_workqueue_attrs(), doesn't copy the no_numa field.
      
      Fix it by making copy_workqueue_attrs() copy ->no_numa too.  This
      would also make get_unbound_pool() set a pool's ->no_numa attribute
      according to the workqueue attributes used when the pool was created.
      While harmless, as ->no_numa isn't a pool attribute, this is a bit
      confusing.  Clear it explicitly.
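
      The copy-side fix is essentially one line; a sketch of
      copy_workqueue_attrs() with the field added:

        static void copy_workqueue_attrs(struct workqueue_attrs *to,
                                         const struct workqueue_attrs *from)
        {
                to->nice = from->nice;
                cpumask_copy(to->cpumask, from->cpumask);
                /*
                 * Also copy ->no_numa so a value written via sysfs survives the
                 * round trip; get_unbound_pool() clears it on the pool side
                 * since it isn't a pool attribute.
                 */
                to->no_numa = from->no_numa;
        }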
      
      tj: Updated description and comments a bit.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      73b8bd6d
  9. 15 May, 2013 1 commit
    • workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init() · 1be0c25d
      Tejun Heo authored
      wq_numa_init() builds per-node cpumasks which are later used to make
      unbound workqueues NUMA-aware.  The cpumasks are allocated using
      alloc_cpumask_var_node() for all possible nodes.  Unfortunately, on
      machines with off-line nodes, this leads to NUMA-aware allocations on
      existing but offline nodes, which in turn triggers a BUG in the memory
      allocation code.
      
      Fix it by using NUMA_NO_NODE for cpumask allocations for offline
      nodes.
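
      Concretely, the per-node cpumask allocation in wq_numa_init() ends up
      looking like this (sketch; variable names approximate):

        for_each_node(node)
                BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                node_online(node) ? node : NUMA_NO_NODE));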
      
        kernel BUG at include/linux/gfp.h:323!
        invalid opcode: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
        Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
        task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
        RIP: 0010:[<ffffffff8117495d>]  [<ffffffff8117495d>] new_slab+0x2ad/0x340
        RSP: 0000:ffff880234603bf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
        RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
        RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
        R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
        FS:  0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
         ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
         ffff880237816140 0000000000000000 ffff88023740e1c0
        Call Trace:
         [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2
         [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200
         [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90
         [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be
         [<ffffffff81a0bec8>] init_workqueues+0x64/0x341
         [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0
         [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec
         [<ffffffff815d50de>] kernel_init+0xe/0xf0
         [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0
        Code: 45  84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>
      1be0c25d
  10. 14 May, 2013 2 commits
  11. 10 May, 2013 1 commit
    • workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number · d3251859
      Tejun Heo authored
      df2d5ae4 ("workqueue: map an unbound workqueues to multiple per-node
      pool_workqueues") made unbound workqueues to map to multiple per-node
      pool_workqueues and accordingly updated workqueue_contested() so that,
      for unbound workqueues, it maps the specified @cpu to the NUMA node
      number to obtain the matching pool_workqueue to query the congested
      state.
      
      Before this change, workqueue_congested() ignored @cpu for unbound
      workqueues as there was only one pool_workqueue and some users
      (fscache) called it with WORK_CPU_UNBOUND.  After the commit, this
      causes the following oops as WORK_CPU_UNBOUND gets translated to
      garbage by cpu_to_node().
      
        BUG: unable to handle kernel paging request at ffff8803598d98b8
        IP: [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa
        PGD 2421067 PUD 0
        Oops: 0000 [#1] SMP
        CPU: 1 PID: 2689 Comm: cat Tainted: GF            3.9.0-fsdevel+ #4
        task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000
        RIP: 0010:[<ffffffff81043b7e>]  [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa
        RSP: 0018:ffff880025807ad8  EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003
        RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0
        RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898
        R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97
        R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0
        FS:  00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce
         ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0
         ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8
        Call Trace:
         [<ffffffff810442ce>] workqueue_congested+0xb7/0x12c
         [<ffffffffa00810b0>] fscache_enqueue_object+0xb2/0xe8 [fscache]
         [<ffffffffa007facd>] __fscache_acquire_cookie+0x3b9/0x56c [fscache]
         [<ffffffffa00ad8fe>] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs]
         [<ffffffffa009e112>] do_open+0x9/0xd [nfs]
         [<ffffffff810e804a>] do_dentry_open+0x175/0x24b
         [<ffffffff810e8298>] finish_open+0x41/0x51
      
      Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND.
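
      A sketch of the fixed lookup, condensed from workqueue_congested()
      (field and helper names as in the workqueue code of this era):

        bool workqueue_congested(int cpu, struct workqueue_struct *wq)
        {
                struct pool_workqueue *pwq;
                bool ret;

                rcu_read_lock_sched();

                /* callers such as fscache pass WORK_CPU_UNBOUND; never feed it to cpu_to_node() */
                if (cpu == WORK_CPU_UNBOUND)
                        cpu = smp_processor_id();

                if (!(wq->flags & WQ_UNBOUND))
                        pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
                else
                        pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

                ret = !list_empty(&pwq->delayed_works);
                rcu_read_unlock_sched();

                return ret;
        }
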
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: David Howells <dhowells@redhat.com>
      Tested-and-Acked-by: David Howells <dhowells@redhat.com>
      d3251859
  12. 01 May, 2013 1 commit
    • workqueue: include workqueue info when printing debug dump of a worker task · 3d1cb205
      Tejun Heo authored
      One of the problems that arise when converting a dedicated custom
      threadpool to workqueue is that the shared worker pool used by
      workqueue anonymizes each worker, making it more difficult to identify
      what the worker was doing on which target from the output of sysrq-t
      or a debug dump from oops, BUG() and friends.
      
      This patch implements set_worker_desc() which can be called from any
      workqueue work function to set its description.  When the worker task is
      dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
      and so on - the description will be printed out together with the
      workqueue name and the worker function pointer.
      
      The printing side is implemented by print_worker_info() which is called
      from functions in task dump paths - sched_show_task() and
      dump_stack_print_info().  print_worker_info() can be safely called on
      any task in any state as long as the task struct itself is accessible.
      It uses probe_*() functions to access worker fields.  It may print
      garbage if something went very wrong, but it wouldn't cause (another)
      oops.
      
      The description is currently limited to 24 bytes including the
      terminating \0.  worker->desc_valid and worker->desc[] are added and
      the 64-byte marker, which was already incorrect before adding the new
      fields, is moved to the correct position.
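
      Usage from a work function is a one-liner; for instance, the writeback
      conversion used for the example below does roughly this (sketch; the
      surrounding writeback details are illustrative):

        void bdi_writeback_workfn(struct work_struct *work)
        {
                struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                                        struct bdi_writeback, dwork);

                /* shows up as "(flush-8:0)" next to the workqueue name in task dumps */
                set_worker_desc("flush-%s", dev_name(wb->bdi->dev));

                /* ... actual writeback work ... */
        }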
      
      Here's an example dump with writeback updated to set the bdi name as
      worker desc.
      
       Hardware name: Bochs
       Modules linked in:
       Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
       Workqueue: writeback bdi_writeback_workfn (flush-8:0)
        ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
        ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
        ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
       Call Trace:
        [<ffffffff81c61845>] dump_stack+0x19/0x1b
        [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
        [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
       ...
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d1cb205
  13. 09 Apr, 2013 1 commit
  14. 04 Apr, 2013 1 commit
    • workqueue: avoid false negative WARN_ON() in destroy_workqueue() · 5c529597
      Lai Jiangshan authored
      destroy_workqueue() performs several sanity checks before proceeding
      with destruction of a workqueue.  One of the checks verifies that
      refcnt of each pwq (pool_workqueue) is over 1 as at that point there
      should be no in-flight work items and the only holder of pwq refs is
      the workqueue itself.
      
      This worked fine as a workqueue used to hold only one reference to its
      pwqs; however, since 4c16bd32 ("workqueue: implement NUMA affinity
      for unbound workqueues"), a workqueue may hold multiple references to
      its default pwq triggering this sanity check spuriously.
      
      Fix it by not triggering the pwq->refcnt assertion on default pwqs.
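
      The relaxed check looks roughly like this (sketch of the sanity-check
      loop in destroy_workqueue(), run under wq->mutex; field names as in
      this era's workqueue.c):

        for_each_pwq(pwq, wq) {
                /*
                 * With NUMA pwqs, wq->dfl_pwq may legitimately be referenced
                 * from several numa_pwq_tbl[] slots, so only non-default pwqs
                 * are required to have refcnt == 1 at this point.
                 */
                if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
                    WARN_ON(pwq->nr_active) ||
                    WARN_ON(!list_empty(&pwq->delayed_works))) {
                        mutex_unlock(&wq->mutex);
                        return;
                }
        }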
      
      An example spurious WARN trigger follows.
      
       WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
       Hardware name: 4286C12
       Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
       Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
       Call Trace:
        [<c04314a7>] warn_slowpath_common+0x7c/0x93
        [<c04314e0>] warn_slowpath_null+0x22/0x24
        [<c044796a>] destroy_workqueue+0x6a/0x13e
        [<c056dc01>] ext4_put_super+0x43/0x2c4
        [<c04fb7b8>] generic_shutdown_super+0x4b/0xb9
        [<c04fb848>] kill_block_super+0x22/0x60
        [<c04fb960>] deactivate_locked_super+0x2f/0x56
        [<c04fc41b>] deactivate_super+0x2e/0x31
        [<c050f1e6>] mntput_no_expire+0x103/0x108
        [<c050fdce>] sys_umount+0x2a2/0x2c4
        [<c050fe0e>] sys_oldumount+0x1e/0x20
        [<c085ba4d>] sysenter_do_call+0x12/0x38
      
      tj: Rewrote description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      5c529597
  15. 01 Apr, 2013 17 commits
    • workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity · d55262c4
      Tejun Heo authored
      
      Unbound workqueues are now NUMA aware.  Let's add some control knobs
      and update sysfs interface accordingly.
      
      * Add kernel param workqueue.disable_numa which disables NUMA affinity
        globally.
      
      * Replace sysfs file "pool_id" with "pool_ids" which contain
        node:pool_id pairs.  This change is userland-visible but "pool_id"
        hasn't seen a release yet, so this is okay.
      
      * Add a new sysf files "numa" which can toggle NUMA affinity on
        individual workqueues.  This is implemented as attrs->no_numa whichn
        is special in that it isn't part of a pool's attributes.  It only
        affects how apply_workqueue_attrs() picks which pools to use.
      
      After "pool_ids" change, first_pwq() doesn't have any user left.
      Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      d55262c4
    • workqueue: implement NUMA affinity for unbound workqueues · 4c16bd32
      Tejun Heo authored
      Currently, an unbound workqueue has a single current, or first, pwq
      (pool_workqueue) to which all new work items are queued.  This often
      isn't optimal on NUMA machines as workers may jump around across node
      boundaries and work items get assigned to workers without any regard
      to NUMA affinity.
      
      This patch implements NUMA affinity for unbound workqueues.  Instead
      of mapping all entries of numa_pwq_tbl[] to the same pwq,
      apply_workqueue_attrs() now creates a separate pwq covering the
      intersecting CPUs for each NUMA node which has online CPUs in
      @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
      are mapped to pwqs covering whole @attrs->cpumask.
      
      As CPUs come up and go down, the pool association is changed
      accordingly.  Changing pool association may involve allocating new
      pools which may fail.  To avoid failing CPU_DOWN, each workqueue
      always keeps a default pwq which covers whole attrs->cpumask which is
      used as fallback if pool creation fails during a CPU hotplug
      operation.
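
      The per-node decision described above boils down to a helper along
      these lines, which apply_workqueue_attrs() and the hotplug path use to
      compute each node's cpumask (a simplified sketch, not the verbatim
      helper):

        /* returns true iff @node should get its own pwq rather than the default one */
        static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
                                         int cpu_going_down, cpumask_t *cpumask)
        {
                if (!wq_numa_enabled || attrs->no_numa)
                        goto use_dfl;

                /* does @node have any online CPUs @attrs wants? */
                cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
                if (cpu_going_down >= 0)
                        cpumask_clear_cpu(cpu_going_down, cpumask);
                if (cpumask_empty(cpumask))
                        goto use_dfl;

                /* yes, return the possible CPUs in @node that @attrs wants */
                cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
                return !cpumask_equal(cpumask, attrs->cpumask);

        use_dfl:
                cpumask_copy(cpumask, attrs->cpumask);
                return false;
        }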
      
      This ensures that all work items issued on a NUMA node are executed on
      the same node as long as the workqueue allows execution on the CPUs of
      the node.
      
      As this maps a workqueue to multiple pwqs and max_active is per-pwq,
      this changes the behavior of max_active.  The limit is now per NUMA
      node instead of global.  While this is an actual change, max_active is
      already per-cpu for per-cpu workqueues and primarily used as safety
      mechanism rather than for active concurrency control.  Concurrency is
      usually limited from workqueue users by the number of concurrently
      active work items and this change shouldn't matter much.
      
      v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
          by Lai.
      
      v3: The previous version incorrectly made a workqueue spanning
          multiple nodes spread work items over all online CPUs when some of
          its nodes don't have any desired cpus.  Reimplemented so that NUMA
          affinity is properly updated as CPUs go up and down.  This problem
          was spotted by Lai Jiangshan.
      
      v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
          however, wq may be freed at any time after dfl_pwq is put making
          the clearing use-after-free.  Clear wq->dfl_pwq before putting it.
      
      v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
          @pwq_tbl after success.  Fixed.
      
          Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
          application of new attrs is excluded via CPU hotplug.  Removed.
      
          Documentation on CPU affinity guarantee on CPU_DOWN added.
      
          All changes are suggested by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      4c16bd32
    • workqueue: introduce put_pwq_unlocked() · dce90d47
      Tejun Heo authored
      Factor out lock pool, put_pwq(), unlock sequence into
      put_pwq_unlocked().  The two existing places are converted and there
      will be more with NUMA affinity support.
      
      This is to prepare for NUMA affinity support for unbound workqueues
      and doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      dce90d47
    • workqueue: introduce numa_pwq_tbl_install() · 1befcf30
      Tejun Heo authored
      Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
      from apply_workqueue_attrs() into numa_pwq_tbl_install().  link_pwq()
      is made safe to call multiple times.  numa_pwq_tbl_install() links the
      pwq, installs it into numa_pwq_tbl[] at the specified node and returns
      the old entry.
      
      @last_pwq is removed from link_pwq() as the return value of the new
      function can be used instead.
      
      This is to prepare for NUMA affinity support for unbound workqueues.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      1befcf30
    • workqueue: use NUMA-aware allocation for pool_workqueues · e50aba9a
      Tejun Heo authored
      Use kmem_cache_alloc_node() with @pool->node instead of
      kmem_cache_zalloc() when allocating a pool_workqueue so that it's
      allocated on the same node as the associated worker_pool.  As there's
      no kmem_cache_zalloc_node(), move zeroing to init_pwq().
      
      This was suggested by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      e50aba9a
    • workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() · f147f29e
      Tejun Heo authored
      Break init_and_link_pwq() into init_pwq() and link_pwq() and move
      unbound-workqueue specific handling into apply_workqueue_attrs().
      Also, factor out unbound pool and pool_workqueue allocation into
      alloc_unbound_pwq().
      
      This reorganization is to prepare for NUMA affinity and doesn't
      introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      f147f29e
    • workqueue: map an unbound workqueue to multiple per-node pool_workqueues · df2d5ae4
      Tejun Heo authored
      Currently, an unbound workqueue has only one "current" pool_workqueue
      associated with it.  It may have multiple pool_workqueues but only the
      first pool_workqueue services new work items.  For NUMA affinity, we
      want to change this so that there are multiple current pool_workqueues
      serving different NUMA nodes.
      
      Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
      points to the pool_workqueue to use for each possible node.  This
      replaces first_pwq() in __queue_work() and workqueue_congested().
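
      In the queueing path the table lookup replaces first_pwq() roughly as
      follows (sketch from __queue_work(); the per-cpu branch is pre-existing
      behavior):

        /* pwq to use for @work unless it is already executing elsewhere */
        if (!(wq->flags & WQ_UNBOUND))
                pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
        else
                pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));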
      
      numa_pwq_tbl[] is currently initialized to point to the same
      pool_workqueue as first_pwq() so this patch doesn't make any behavior
      changes.
      
      v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
          may be called only with wq->mutex held.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      df2d5ae4
    • workqueue: move hot fields of workqueue_struct to the end · 2728fd2f
      Tejun Heo authored
      Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
      them to the cacheline.  These two fields are used in the work item
      issue path and thus hot.  The scheduled NUMA affinity support will add
      a dispatch table at the end of workqueue_struct and relocating these
      two fields will allow us to hit only a single cacheline on hot paths.
      
      Note that wq->pwqs isn't moved although it currently is being used in
      the work item issue path for unbound workqueues.  The dispatch table
      mentioned above will replace its use in the issue path, so it will
      become cold once NUMA support is implemented.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      2728fd2f
    • workqueue: make workqueue->name[] fixed len · ecf6881f
      Tejun Heo authored
      Currently workqueue->name[] is of flexible length.  We want to use the
      flexible field for something more useful and there isn't much benefit
      in allowing arbitrary name lengths anyway.  Make it fixed length,
      capped at 24 bytes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      ecf6881f
    • workqueue: add workqueue->unbound_attrs · 6029a918
      Tejun Heo authored
      Currently, when exposing attrs of an unbound workqueue via sysfs, the
      workqueue_attrs of first_pwq() is used as that should equal the
      current state of the workqueue.
      
      The planned NUMA affinity support will make unbound workqueues make
      use of multiple pool_workqueues for different NUMA nodes and the above
      assumption will no longer hold.  Introduce workqueue->unbound_attrs
      which records the current attrs in effect and use it for sysfs instead
      of first_pwq()->attrs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      6029a918
    • workqueue: determine NUMA node of workers according to the allowed cpumask · f3f90ad4
      Tejun Heo authored
      When worker tasks are created using kthread_create_on_node(),
      currently only per-cpu ones have the matching NUMA node specified.
      All unbound workers are always created with NUMA_NO_NODE.
      
      Now that an unbound worker pool may have an arbitrary cpumask
      associated with it, this isn't optimal.  Add pool->node which is
      determined by the pool's cpumask.  If the pool's cpumask is contained
      inside a NUMA node proper, the pool is associated with that node, and
      all workers of the pool are created on that node.
      
      This currently only makes a difference for unbound worker pools whose
      cpumask is contained inside a single NUMA node, but it will serve as
      the foundation for making all unbound pools NUMA-affine.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      f3f90ad4
    • workqueue: drop 'H' from kworker names of unbound worker pools · e3c916a4
      Tejun Heo authored
      Currently, all workqueue workers which have a negative nice value have
      'H' postfixed to their names.  This is necessary for per-cpu workers
      as they use the CPU number instead of pool->id to identify the pool
      and the 'H' postfix is the only thing distinguishing normal and
      highpri workers.
      
      As workers for unbound pools use pool->id, the 'H' postfix is purely
      informational.  TASK_COMM_LEN is 16 and after the static part and
      delimiters, there are only five characters left for the pool and
      worker IDs.  We're expecting to have more unbound pools with the
      scheduled NUMA awareness support.  Let's drop the non-essential 'H'
      postfix from unbound kworker name.
      
      While at it, restructure kthread_create*() invocation to help future
      NUMA related changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      e3c916a4
    • workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] · bce90380
      Tejun Heo authored
      Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
      and wq_numa_possible_cpumask[] in preparation.  The former is the
      highest NUMA node ID + 1 and the latter holds masks of possible CPUs for
      each NUMA node.
      
      This patch only introduces these.  Future patches will make use of
      them.
      
      v2: NUMA initialization moved into wq_numa_init().  Also, the possible
          cpumask array is not created if there aren't multiple nodes on the
          system.  wq_numa_enabled bool added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      bce90380
    • workqueue: move pwq_pool_locking outside of get/put_unbound_pool() · a892cacc
      Tejun Heo authored
      The scheduled NUMA affinity support for unbound workqueues would need
      to walk the workqueues list and perform pool-related operations on
      each workqueue.
      
      Move wq_pool_mutex locking out of get/put_unbound_pool() to their
      callers so that pool operations can be performed while walking the
      workqueues list, which is also protected by wq_pool_mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      a892cacc
    • workqueue: fix memory leak in apply_workqueue_attrs() · 4862125b
      Tejun Heo authored
      apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs
      in its success path.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      4862125b
    • workqueue: fix unbound workqueue attrs hashing / comparison · 13e2e556
      Tejun Heo authored
      29c91e99 ("workqueue: implement attribute-based unbound worker_pool
      management") implemented attrs based worker_pool matching.  It tried
      to avoid false negatives when comparing cpumasks with a custom hash
      function; unfortunately, the hash and comparison functions fail to
      ignore CPUs which are not possible.  It incorrectly assumed that
      bitmap_copy() skips leftover bits in the last word of bitmap and
      cpumask_equal() ignores impossible CPUs.
      
      This patch updates attrs->cpumask handling such that impossible CPUs
      are properly ignored.
      
      * Hash and copy functions no longer do anything special.  They expect
        their callers to clear impossible CPUs.
      
      * alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
        instead of setting all bits and explicit cpumask_setall() for
        unbound_std_wq_attrs[] in init_workqueues() is dropped.
      
      * apply_workqueue_attrs() is now responsible for ignoring impossible
        CPUs.  It makes a copy of @attrs and clears impossible CPUs before
        doing anything else.
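
      In practice the masking in apply_workqueue_attrs() looks like this
      (sketch; new_attrs is the local copy the function works on):

        /* make a copy of @attrs and clear CPUs which aren't possible */
        copy_workqueue_attrs(new_attrs, attrs);
        cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
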
      Signed-off-by: Tejun Heo <tj@kernel.org>
      13e2e556
    • workqueue: fix race condition in unbound workqueue free path · bc0caf09
      Tejun Heo authored
      8864b4e5 ("workqueue: implement get/put_pwq()") implemented pwq
      (pool_workqueue) refcnting which frees workqueue when the last pwq
      goes away.  It determined whether it was the last pwq by testing whether
      wq->pwqs is empty.  Unfortunately, the test was done outside wq->mutex
      and multiple pwq release could race and try to free wq multiple times
      leading to oops.
      
      Test wq->pwqs emptiness while holding wq->mutex.
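
      In pwq_unbound_release_workfn() this amounts to deciding "is this the
      last pwq?" while still holding wq->mutex (sketch):

        mutex_lock(&wq->mutex);
        list_del_rcu(&pwq->pwqs_node);
        is_last = list_empty(&wq->pwqs);        /* decided under wq->mutex */
        mutex_unlock(&wq->mutex);

        /* ... put the pool and free @pwq via sched-RCU ... */

        if (is_last)
                kfree(wq);      /* @wq is already dead; we were its last pwq */
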
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bc0caf09
  16. 25 Mar, 2013 4 commits
    • workqueue: remove pwq_lock which is no longer used · b5927605
      Lai Jiangshan authored
      To simplify locking, the previous patches expanded wq->mutex to
      protect all fields of each workqueue instance including the pwqs list
      leaving pwq_lock without any user.  Remove the unused pwq_lock.
      
      tj: Rebased on top of the current dev branch.  Updated description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b5927605
    • workqueue: protect wq->saved_max_active with wq->mutex · a357fc03
      Lai Jiangshan authored
      We're expanding wq->mutex to cover all fields specific to each
      workqueue with the end goal of replacing pwq_lock which will make
      locking simpler and easier to understand.
      
      This patch makes wq->saved_max_active protected by wq->mutex instead
      of pwq_lock.  As pwq_lock locking around pwq_adjust_max_active() is no
      longer necessary, this patch also replaces the pwq_lock locking of
      for_each_pwq() around pwq_adjust_max_active() with wq->mutex.
      
      tj: Rebased on top of the current dev branch.  Updated description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a357fc03
    • workqueue: protect wq->pwqs and iteration with wq->mutex · b09f4fd3
      Lai Jiangshan authored
      We're expanding wq->mutex to cover all fields specific to each
      workqueue with the end goal of replacing pwq_lock which will make
      locking simpler and easier to understand.
      
      init_and_link_pwq() and pwq_unbound_release_workfn() already grab
      wq->mutex when adding or removing a pwq from wq->pwqs list.  This
      patch makes it official that the list is wq->mutex protected for
      writes and updates readers accordingly.  Explicit IRQ toggles for
      sched-RCU read-locking in flush_workqueue_prep_pwqs() and
      drain_workqueues() are removed as the surrounding wq->mutex can
      provide sufficient synchronization.
      
      Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex()
      and checks for wq->mutex too.
      
      pwq_lock locking and assertion are not removed by this patch and a
      couple of for_each_pwq() iterations are still protected by it.
      They'll be removed by future patches.
      
      tj: Rebased on top of the current dev branch.  Updated description.
          Folded in assert_rcu_or_wq_mutex() renaming from a later patch
          along with associated comment updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b09f4fd3
    • workqueue: protect wq->nr_drainers and ->flags with wq->mutex · 87fc741e
      Lai Jiangshan authored
      We're expanding wq->mutex to cover all fields specific to each
      workqueue with the end goal of replacing pwq_lock which will make
      locking simpler and easier to understand.
      
      wq->nr_drainers and ->flags are specific to each workqueue.  Protect
      ->nr_drainers and ->flags with wq->mutex instead of pool_mutex.
      
      tj: Rebased on top of the current dev branch.  Updated description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      87fc741e