    workqueue: implement NUMA affinity for unbound workqueues · 4c16bd32
    Tejun Heo authored
    
    
    Currently, an unbound workqueue has a single current, or first, pwq
    (pool_workqueue) to which all new work items are queued.  This often
    isn't optimal on NUMA machines as workers may jump around across node
    boundaries and work items get assigned to workers without any regard
    to NUMA affinity.
    
    This patch implements NUMA affinity for unbound workqueues.  Instead
    of mapping all entries of numa_pwq_tbl[] to the same pwq,
    apply_workqueue_attrs() now creates a separate pwq covering the
    intersecting CPUs for each NUMA node which has online CPUs in
    @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
    are mapped to pwqs covering the whole @attrs->cpumask.
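
    The per-node mask selection described above can be modelled in
    userspace as follows.  This is only a sketch under stated
    assumptions, not the kernel code: CPUs are bits in a 64-bit word,
    and cpumask_model_t and calc_node_mask() are hypothetical names
    introduced for illustration.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Userspace model: one bit per CPU.  Not a kernel type. */
    typedef uint64_t cpumask_model_t;

    /*
     * Decide which CPUs the pwq serving one NUMA node should cover.
     *
     * attrs_mask    - CPUs the workqueue is allowed to run on
     * node_online   - online CPUs belonging to the node
     * node_possible - possible CPUs belonging to the node
     *
     * Returns true and stores a node-specific mask in *out when the node
     * needs its own pwq.  Returns false and stores attrs_mask in *out
     * when the node should share the pwq covering the whole attrs mask.
     */
    static bool calc_node_mask(cpumask_model_t attrs_mask,
                               cpumask_model_t node_online,
                               cpumask_model_t node_possible,
                               cpumask_model_t *out)
    {
        /* No online CPU of this node is wanted: use the whole mask. */
        if ((attrs_mask & node_online) == 0) {
            *out = attrs_mask;
            return false;
        }

        /* Cover the wanted CPUs that can ever appear on this node. */
        *out = attrs_mask & node_possible;

        /* If that equals the whole mask, a separate pwq adds nothing. */
        return *out != attrs_mask;
    }
    ```

    With attrs_mask 0x0f and a node owning CPUs 0-1, the model yields
    0x03 for that node; a node owning only CPUs 4-5 falls back to 0x0f.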
    
    As CPUs come up and go down, the pool association is changed
    accordingly.  Changing pool association may involve allocating new
    pools which may fail.  To avoid failing CPU_DOWN, each workqueue
    always keeps a default pwq which covers the whole @attrs->cpumask and
    is used as a fallback if pool creation fails during a CPU hotplug
    operation.
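
    A minimal userspace sketch of the fallback rule, assuming a failed
    pool or pwq allocation is reported as a NULL pointer.  The types
    pwq_model and wq_model, the table size and install_node_pwq() are
    hypothetical; only the dfl_pwq and numa_pwq_tbl names come from the
    patch.

    ```c
    #include <stddef.h>

    #define MODEL_NR_NODES 4   /* arbitrary size for the model */

    struct pwq_model {
        int id;
    };

    struct wq_model {
        struct pwq_model *dfl_pwq;                       /* covers the whole mask */
        struct pwq_model *numa_pwq_tbl[MODEL_NR_NODES];  /* per-node entries */
    };

    /*
     * Install the pwq for @node during a hotplug event.  @new_pwq is the
     * freshly created node-specific pwq, or NULL if its creation failed.
     * On failure the node is pointed at the default pwq so the hotplug
     * operation itself never has to fail.
     */
    static struct pwq_model *install_node_pwq(struct wq_model *wq, int node,
                                              struct pwq_model *new_pwq)
    {
        struct pwq_model *pwq = new_pwq ? new_pwq : wq->dfl_pwq;

        wq->numa_pwq_tbl[node] = pwq;
        return pwq;
    }
    ```

    The cost of the fallback is only lost locality: work items for that
    node may run anywhere in the workqueue's cpumask until a later
    update succeeds.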
    
    This ensures that all work items issued on a NUMA node are executed on
    the same node as long as the workqueue allows execution on the CPUs of
    the node.
    
    As this maps a workqueue to multiple pwqs and max_active is per-pwq,
    this changes the behavior of max_active.  The limit is now per NUMA
    node instead of global.  While this is an actual change, max_active is
    already per-cpu for per-cpu workqueues and primarily used as a safety
    mechanism rather than for active concurrency control.  Concurrency is
    usually limited from workqueue users by the number of concurrently
    active work items and this change shouldn't matter much.
    
    v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
        by Lai.
    
    v3: The previous version incorrectly made a workqueue spanning
        multiple nodes spread work items over all online CPUs when some of
        its nodes don't have any desired cpus.  Reimplemented so that NUMA
        affinity is properly updated as CPUs go up and down.  This problem
        was spotted by Lai Jiangshan.
    
    v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
        however, wq may be freed at any time after dfl_pwq is put, making
        the clearing a use-after-free.  Clear wq->dfl_pwq before putting it.
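
        The ordering can be illustrated with a simplified userspace
        model.  It assumes, as a simplification, that the final put
        frees the owning workqueue synchronously; the names
        pwq_model, wq_model, put_pwq_model() and destroy_wq_model()
        are hypothetical.

        ```c
        #include <stddef.h>
        #include <stdlib.h>

        struct wq_model;

        struct pwq_model {
            int refcnt;
            struct wq_model *wq;   /* owner, freed with the last reference */
        };

        struct wq_model {
            struct pwq_model *dfl_pwq;
        };

        /* Drop one reference; the final put frees the pwq and its wq. */
        static void put_pwq_model(struct pwq_model *pwq)
        {
            if (--pwq->refcnt == 0) {
                free(pwq->wq);
                free(pwq);
            }
        }

        static void destroy_wq_model(struct wq_model *wq)
        {
            struct pwq_model *pwq = wq->dfl_pwq;

            wq->dfl_pwq = NULL;   /* clear first, while wq is still valid */
            put_pwq_model(pwq);   /* wq must not be touched after this */
        }
        ```

        Reversing the two statements in destroy_wq_model() would write
        to wq->dfl_pwq after the put may already have freed wq.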
    
    v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
        @pwq_tbl after success.  Fixed.
    
        Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
        application of new attrs is excluded via CPU hotplug.  Removed.
    
        Documentation on CPU affinity guarantee on CPU_DOWN added.
    
        All changes are suggested by Lai Jiangshan.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>