    cgroup: use a dedicated workqueue for cgroup destruction · a6647e9e
    Tejun Heo authored
    commit e5fca243abae1445afbfceebda5f08462ef869d3 upstream.
    
    Since be445626 ("cgroup: remove synchronize_rcu() from
    cgroup_diput()"), the cgroup destruction path makes use of a workqueue.
    css freeing is performed from a work item from that point on, and a
    later commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into
    two steps"), moves css offlining to a workqueue too.
    
    As cgroup destruction isn't depended upon for memory reclaim, the
    destruction work items were put on system_wq; unfortunately, some
    controllers may block in the destruction path for a considerable
    duration while holding cgroup_mutex.  As a large part of the
    destruction path is synchronized through cgroup_mutex, when combined
    with a high rate of cgroup removals, this has the potential to fill up
    system_wq's max_active of 256.
    
    Also, it turns out that memcg's css destruction path ends up queueing
    and waiting for work items on system_wq through work_on_cpu().  If
    such an operation happens while system_wq is fully occupied by cgroup
    destruction work items, work_on_cpu() can't make forward progress
    because system_wq is full, and the other destruction work items on
    system_wq can't make forward progress because the work item waiting
    for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
    
    This can be fixed by queueing destruction work items on a separate
    workqueue.  This patch creates a dedicated workqueue,
    cgroup_destroy_wq, for this purpose.  As these work items shouldn't
    have inter-dependencies and are mostly serialized by cgroup_mutex
    anyway, giving a high concurrency level doesn't buy anything; the
    workqueue's @max_active is therefore set to 1 so that destruction work
    items are executed one by one on each CPU.
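
The deadlock and the fix described above can be illustrated with a toy userspace model. This is a sketch, not kernel code: every `toy_*` name is hypothetical, the queue is reduced to a slot counter, and per-CPU behaviour is ignored. It assumes that each running destruction item keeps its slot while blocked on cgroup_mutex, and that the mutex holder queues one nested item (standing in for work_on_cpu()) and waits for it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a workqueue: at most max_active items occupy a slot. */
struct toy_wq {
    int max_active; /* concurrency limit (256 for system_wq per the text) */
    int active;     /* items currently holding a slot */
    int pending;    /* items queued but not yet started */
};

/* Queue one item; returns true if it gets a slot and starts right away. */
static bool toy_queue_work(struct toy_wq *wq)
{
    assert(wq->max_active > 0);
    if (wq->active < wq->max_active) {
        wq->active++;
        return true;
    }
    wq->pending++;
    return false;
}

/*
 * Queue `removals` destruction items on destroy_wq. One of the running
 * items holds the mutex; the others block on it and keep their slots.
 * The holder then queues a nested item on nested_wq and waits for it.
 * Returns true if the nested item can start, i.e. the holder can finish
 * and release the mutex. False models the deadlock.
 */
static bool toy_holder_can_progress(struct toy_wq *destroy_wq,
                                    struct toy_wq *nested_wq,
                                    int removals)
{
    for (int i = 0; i < removals; i++)
        toy_queue_work(destroy_wq);

    return toy_queue_work(nested_wq);
}
```

In this model, passing the same queue with max_active 256 as both arguments together with 256 removals leaves no slot for the nested item, so the holder cannot progress. Passing a separate destruction queue with max_active 1 leaves the other queue untouched, and the nested item starts immediately.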
    
    Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
    cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
    separate core_initcall().  In the future, we probably want to reorder
    so that workqueue init happens before cgroup_init().
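
The ordering constraint above can be sketched as a small userspace model. It is not kernel code: all `boot_*` names are hypothetical, and a too-early allocation is assumed to simply fail, whereas the text only says it "can't" be done. The three numbered steps stand in for cgroup_init(), init_workqueues(), and the separate core_initcall().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy boot-order model; fields track what has been set up so far. */
struct boot_state {
    bool wq_subsystem_ready;   /* true after the init_workqueues() stand-in */
    bool destroy_wq_allocated; /* true once the dedicated queue exists */
};

/* Stand-in for allocating a workqueue: needs the subsystem to be up. */
static bool boot_try_alloc_wq(struct boot_state *s)
{
    if (!s->wq_subsystem_ready)
        return false;
    s->destroy_wq_allocated = true;
    return true;
}

/*
 * Run the modelled boot sequence. With alloc_early set, the allocation
 * is attempted from the cgroup_init() stand-in; otherwise it is deferred
 * to the later initcall stand-in. Returns true if the dedicated queue
 * exists at the end of boot.
 */
static bool boot_run(bool alloc_early)
{
    struct boot_state s = { false, false };

    /* 1. cgroup_init() stand-in: workqueues are not initialised yet. */
    if (alloc_early)
        boot_try_alloc_wq(&s);

    /* 2. init_workqueues() stand-in. */
    s.wq_subsystem_ready = true;

    /* 3. core_initcall() stand-in: the deferred allocation. */
    if (!alloc_early)
        boot_try_alloc_wq(&s);

    return s.destroy_wq_allocated;
}
```

Under these assumptions, only the deferred variant ends boot with the queue allocated, which is why the allocation sits in its own initcall.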
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Hugh Dickins <hughd@google.com>
    Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
    Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
    Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
    Cc: stable@vger.kernel.org # v3.9+
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>