Commit bd9c6f7f authored by Rohit Gupta's avatar Rohit Gupta

add the rejects and new files created after patch

parent 28f6e08d
The module hwlat_detector is a special purpose kernel module that is used to
detect large system latencies induced by the behavior of certain underlying
hardware or firmware, independent of Linux itself. The code was developed
originally to detect SMIs (System Management Interrupts) on x86 systems,
however there is nothing x86 specific about this patchset. It was
originally written for use by the "RT" patch since the Real Time
kernel is highly latency sensitive.
SMIs are usually not serviced by the Linux kernel, which typically does not
even know that they are occuring. SMIs are instead are set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes though, SMIs are used for
other tasks and those tasks can spend an inordinate amount of time in the
handler (sometimes measured in milliseconds). Obviously this is a problem if
you are trying to keep event service latencies down in the microsecond range.
The hardware latency detector works by hogging all of the cpus for configurable
amounts of time (by calling stop_machine()), polling the CPU Time Stamp Counter
for some period, then looking for gaps in the TSC data. Any gap indicates a
time when the polling was interrupted and since the machine is stopped and
interrupts turned off the only thing that could do that would be an SMI.
Note that the SMI detector should *NEVER* be used in a production environment.
It is intended to be run manually to determine if the hardware platform has a
problem with long system firmware service routines.
Loading the module hwlat_detector passing the parameter "enabled=1" (or by
setting the "enable" entry in "hwlat_detector" debugfs toggled on) is the only
step required to start the hwlat_detector. It is possible to redefine the
threshold in microseconds (us) above which latency spikes will be taken
into account (parameter "threshold=").
# modprobe hwlat_detector enabled=1 threshold=100
After the module is loaded, it creates a directory named "hwlat_detector" under
the debugfs mountpoint, "/debug/hwlat_detector" for this text. It is necessary
to have debugfs mounted, which might be on /sys/debug on your system.
The /debug/hwlat_detector interface contains the following files:
count - number of latency spikes observed since last reset
enable - a global enable/disable toggle (0/1), resets count
max - maximum hardware latency actually observed (usecs)
sample - a pipe from which to read current raw sample data
in the format <timestamp> <latency observed usecs>
(can be opened O_NONBLOCK for a single sample)
threshold - minimum latency value to be considered (usecs)
width - time period to sample with CPUs held (usecs)
must be less than the total window size (enforced)
window - total period of sampling, width being inside (usecs)
By default we will set width to 500,000 and window to 1,000,000, meaning that
we will sample every 1,000,000 usecs (1s) for 500,000 usecs (0.5s). If we
observe any latencies that exceed the threshold (initially 100 usecs),
then we write to a global sample ring buffer of 8K samples, which is
consumed by reading from the "sample" (pipe) debugfs file interface.
Using the Linux Kernel Latency Histograms
This document gives a short explanation how to enable, configure and use
latency histograms. Latency histograms are primarily relevant in the
context of real-time enabled kernels (CONFIG_PREEMPT/CONFIG_PREEMPT_RT)
and are used in the quality management of the Linux real-time
* Purpose of latency histograms
A latency histogram continuously accumulates the frequencies of latency
data. There are two types of histograms
- potential sources of latencies
- effective latencies
* Potential sources of latencies
Potential sources of latencies are code segments where interrupts,
preemption or both are disabled (aka critical sections). To create
histograms of potential sources of latency, the kernel stores the time
stamp at the start of a critical section, determines the time elapsed
when the end of the section is reached, and increments the frequency
counter of that latency value - irrespective of whether any concurrently
running process is affected by latency or not.
- Configuration items (in the Kernel hacking/Tracers submenu)
* Effective latencies
Effective latencies are actually occuring during wakeup of a process. To
determine effective latencies, the kernel stores the time stamp when a
process is scheduled to be woken up, and determines the duration of the
wakeup time shortly before control is passed over to this process. Note
that the apparent latency in user space may be somewhat longer, since the
process may be interrupted after control is passed over to it but before
the execution in user space takes place. Simply measuring the interval
between enqueuing and wakeup may also not appropriate in cases when a
process is scheduled as a result of a timer expiration. The timer may have
missed its deadline, e.g. due to disabled interrupts, but this latency
would not be registered. Therefore, the offsets of missed timers are
recorded in a separate histogram. If both wakeup latency and missed timer
offsets are configured and enabled, a third histogram may be enabled that
records the overall latency as a sum of the timer latency, if any, and the
wakeup latency. This histogram is called "timerandwakeup".
- Configuration items (in the Kernel hacking/Tracers submenu)
* Usage
The interface to the administration of the latency histograms is located
in the debugfs file system. To mount it, either enter
mount -t sysfs nodev /sys
mount -t debugfs nodev /sys/kernel/debug
from shell command line level, or add
nodev /sys sysfs defaults 0 0
nodev /sys/kernel/debug debugfs defaults 0 0
to the file /etc/fstab. All latency histogram related files are then
available in the directory /sys/kernel/debug/tracing/latency_hist. A
particular histogram type is enabled by writing non-zero to the related
variable in the /sys/kernel/debug/tracing/latency_hist/enable directory.
Select "preemptirqsoff" for the histograms of potential sources of
latencies and "wakeup" for histograms of effective latencies etc. The
histogram data - one per CPU - are available in the files
The histograms are reset by writing non-zero to the file "reset" in a
particular latency directory. To reset all latency data, use
if test -d $HISTDIR
for i in `find . | grep /reset$`
echo 1 >$i
* Data format
Latency data are stored with a resolution of one microsecond. The
maximum latency is 10,240 microseconds. The data are only valid, if the
overflow register is empty. Every output line contains the latency in
microseconds in the first row and the number of samples in the second
row. To display only lines with a positive latency count, use, for
grep -v " 0$" /sys/kernel/debug/tracing/latency_hist/preemptoff/CPU0
#Minimum latency: 0 microseconds.
#Average latency: 0 microseconds.
#Maximum latency: 25 microseconds.
#Total samples: 3104770694
#There are 0 samples greater or equal than 10240 microseconds
#usecs samples
0 2984486876
1 49843506
2 58219047
3 5348126
4 2187960
5 3388262
6 959289
7 208294
8 40420
9 4485
10 14918
11 18340
12 25052
13 19455
14 5602
15 969
16 47
17 18
18 14
19 1
20 3
21 2
22 5
23 2
25 1
* Wakeup latency of a selected process
To only collect wakeup latency data of a particular process, write the
PID of the requested process to
PIDs are not considered, if this variable is set to 0.
* Details of the process with the highest wakeup latency so far
Selected data of the process that suffered from the highest wakeup
latency that occurred in a particular CPU are available in the file
In addition, other relevant system data at the time when the
latency occurred are given.
The format of the data is (all in one line):
<PID> <Priority> <Latency> (<Timeroffset>) <Command> \
<- <PID> <Priority> <Command> <Timestamp>
The value of <Timeroffset> is only relevant in the combined timer
and wakeup latency recording. In the wakeup recording, it is
always 0, in the missed_timer_offsets recording, it is the same
as <Latency>.
When retrospectively searching for the origin of a latency and
tracing was not enabled, it may be helpful to know the name and
some basic data of the task that (finally) was switching to the
late real-tlme task. In addition to the victim's data, also the
data of the possible culprit are therefore displayed after the
"<-" symbol.
Finally, the timestamp of the time when the latency occurred
in <seconds>.<microseconds> after the most recent system boot
is provided.
These data are also reset when the wakeup histogram is reset.
--- arch/arm/mach-exynos/platsmp.c
+++ arch/arm/mach-exynos/platsmp.c
@@ -126,7 +126,7 @@
if (timeout == 0) {
printk(KERN_ERR "cpu1 power enable failed");
- spin_unlock(&boot_lock);
+ raw_spin_unlock(&boot_lock);
return -ETIMEDOUT;
--- arch/arm/mm/fault.c
+++ arch/arm/mm/fault.c
@@ -277,7 +277,7 @@
* If we're in an interrupt or have no user
* context, we must not take the fault..
- if (in_atomic() || !mm)
+ if (!mm || pagefault_disabled())
goto no_context;
if (user_mode(regs))
#ifndef _ASM_IRQ_WORK_H
#define _ASM_IRQ_WORK_H
#include <asm/processor.h>
static inline bool arch_irq_work_has_interrupt(void)
return cpu_has_apic;
#endif /* _ASM_IRQ_WORK_H */
--- drivers/gpu/drm/drm_irq.c
+++ drivers/gpu/drm/drm_irq.c
@@ -628,11 +628,6 @@
* code gets preempted or delayed for some reason.
for (i = 0; i < DRM_TIMESTAMP_MAXRETRIES; i++) {
- /* Disable preemption to make it very likely to
- * succeed in the first iteration even on PREEMPT_RT kernel.
- */
- preempt_disable();
/* Get system timestamp before query. */
stime = ktime_get();
@@ -644,8 +639,6 @@
if (!drm_timestamp_monotonic)
mono_time_offset = ktime_get_monotonic_offset();
- preempt_enable();
/* Return as no-op if scanout query unsupported or failed. */
if (!(vbl_status & DRM_SCANOUTPOS_VALID)) {
DRM_DEBUG("crtc %d : scanoutpos query failed [%d].\n",
--- drivers/gpu/drm/i915/i915_gem.c
+++ drivers/gpu/drm/i915/i915_gem.c
@@ -4449,7 +4467,7 @@
if (!mutex_is_locked(mutex))
return false;
-#if defined(CONFIG_SMP) && !defined(CONFIG_DEBUG_MUTEXES) && !defined(CONFIG_PREEMPT_RT_BASE)
+#if defined(CONFIG_SMP) && !defined(CONFIG_DEBUG_MUTEXES)
return mutex->owner == task;
/* Since UP may be pre-empted, we cannot assume that we own the lock */
--- drivers/misc/Makefile
+++ drivers/misc/Makefile
@@ -53,3 +53,4 @@
obj-$(CONFIG_VMWARE_VMCI) += vmw_vmci/
obj-$(CONFIG_LATTICE_ECP3_CONFIG) += lattice-ecp3-config.o
obj-$(CONFIG_SRAM) += sram.o
+obj-$(CONFIG_HWLAT_DETECTOR) += hwlat_detector.o
* hwlat_detector.c - A simple Hardware Latency detector.
* Use this module to detect large system latencies induced by the behavior of
* certain underlying system hardware or firmware, independent of Linux itself.
* The code was developed originally to detect the presence of SMIs on Intel
* and AMD systems, although there is no dependency upon x86 herein.
* The classical example usage of this module is in detecting the presence of
* SMIs or System Management Interrupts on Intel and AMD systems. An SMI is a
* somewhat special form of hardware interrupt spawned from earlier CPU debug
* modes in which the (BIOS/EFI/etc.) firmware arranges for the South Bridge
* LPC (or other device) to generate a special interrupt under certain
* circumstances, for example, upon expiration of a special SMI timer device,
* due to certain external thermal readings, on certain I/O address accesses,
* and other situations. An SMI hits a special CPU pin, triggers a special
* SMI mode (complete with special memory map), and the OS is unaware.
* Although certain hardware-inducing latencies are necessary (for example,
* a modern system often requires an SMI handler for correct thermal control
* and remote management) they can wreak havoc upon any OS-level performance
* guarantees toward low-latency, especially when the OS is not even made
* aware of the presence of these interrupts. For this reason, we need a
* somewhat brute force mechanism to detect these interrupts. In this case,
* we do it by hogging all of the CPU(s) for configurable timer intervals,
* sampling the built-in CPU timer, looking for discontiguous readings.
* WARNING: This implementation necessarily introduces latencies. Therefore,
* you should NEVER use this module in a production environment
* requiring any kind of low-latency performance guarantee(s).
* Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <>
* Includes useful feedback from Clark Williams <>
* This file is licensed under the terms of the GNU General Public
* License version 2. This program is licensed "as is" without any
* warranty of any kind, whether express or implied.
#include <linux/module.h>
#include <linux/init.h>
#include <linux/ring_buffer.h>
#include <linux/time.h>
#include <linux/hrtimer.h>
#include <linux/kthread.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/uaccess.h>
#include <linux/version.h>
#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/trace_clock.h>
#define BUF_SIZE_DEFAULT 262144UL /* 8K*(sizeof(entry)) */
#define BUF_FLAGS (RB_FL_OVERWRITE) /* no block on full */
#define U64STR_SIZE 22 /* 20 digits max */
#define VERSION "1.0.0"
#define BANNER "hwlat_detector: "
#define DRVNAME "hwlat_detector"
#define DEFAULT_SAMPLE_WINDOW 1000000 /* 1s */
#define DEFAULT_SAMPLE_WIDTH 500000 /* 0.5s */
#define DEFAULT_LAT_THRESHOLD 10 /* 10us */
/* Module metadata */
MODULE_AUTHOR("Jon Masters <>");
MODULE_DESCRIPTION("A simple hardware latency detector");
/* Module parameters */
static int debug;
static int enabled;
static int threshold;
module_param(debug, int, 0); /* enable debug */
module_param(enabled, int, 0); /* enable detector */
module_param(threshold, int, 0); /* latency threshold */
/* Buffering and sampling */
static struct ring_buffer *ring_buffer; /* sample buffer */
static DEFINE_MUTEX(ring_buffer_mutex); /* lock changes */
static unsigned long buf_size = BUF_SIZE_DEFAULT;
static struct task_struct *kthread; /* sampling thread */
/* DebugFS filesystem entries */
static struct dentry *debug_dir; /* debugfs directory */
static struct dentry *debug_max; /* maximum TSC delta */
static struct dentry *debug_count; /* total detect count */
static struct dentry *debug_sample_width; /* sample width us */
static struct dentry *debug_sample_window; /* sample window us */
static struct dentry *debug_sample; /* raw samples us */
static struct dentry *debug_threshold; /* threshold us */
static struct dentry *debug_enable; /* enable/disable */
/* Individual samples and global state */
struct sample; /* latency sample */
struct data; /* Global state */
/* Sampling functions */
static int __buffer_add_sample(struct sample *sample);
static struct sample *buffer_get_sample(struct sample *sample);
/* Threading and state */
static int kthread_fn(void *unused);
static int start_kthread(void);
static int stop_kthread(void);
static void __reset_stats(void);
static int init_stats(void);
/* Debugfs interface */
static ssize_t simple_data_read(struct file *filp, char __user *ubuf,
size_t cnt, loff_t *ppos, const u64 *entry);
static ssize_t simple_data_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos, u64 *entry);
static int debug_sample_fopen(struct inode *inode, struct file *filp);
static ssize_t debug_sample_fread(struct file *filp, char __user *ubuf,
size_t cnt, loff_t *ppos);
static int debug_sample_release(struct inode *inode, struct file *filp);
static int debug_enable_fopen(struct inode *inode, struct file *filp);
static ssize_t debug_enable_fread(struct file *filp, char __user *ubuf,
size_t cnt, loff_t *ppos);
static ssize_t debug_enable_fwrite(struct file *file,
const char __user *user_buffer,
size_t user_size, loff_t *offset);
/* Initialization functions */
static int init_debugfs(void);
static void free_debugfs(void);
static int detector_init(void);
static void detector_exit(void);
/* Individual latency samples are stored here when detected and packed into
* the ring_buffer circular buffer, where they are overwritten when
* more than buf_size/sizeof(sample) samples are received. */
struct sample {
u64 seqnum; /* unique sequence */
u64 duration; /* ktime delta */
u64 outer_duration; /* ktime delta (outer loop) */
struct timespec timestamp; /* wall time */
unsigned long lost;
/* keep the global state somewhere. */
static struct data {
struct mutex lock; /* protect changes */
u64 count; /* total since reset */
u64 max_sample; /* max hardware latency */
u64 threshold; /* sample threshold level */
u64 sample_window; /* total sampling window (on+off) */
u64 sample_width; /* active sampling portion of window */
atomic_t sample_open; /* whether the sample file is open */
wait_queue_head_t wq; /* waitqeue for new sample values */
} data;
* __buffer_add_sample - add a new latency sample recording to the ring buffer
* @sample: The new latency sample value
* This receives a new latency sample and records it in a global ring buffer.
* No additional locking is used in this case.
static int __buffer_add_sample(struct sample *sample)
return ring_buffer_write(ring_buffer,
sizeof(struct sample), sample);
* buffer_get_sample - remove a hardware latency sample from the ring buffer
* @sample: Pre-allocated storage for the sample
* This retrieves a hardware latency sample from the global circular buffer
static struct sample *buffer_get_sample(struct sample *sample)
struct ring_buffer_event *e = NULL;
struct sample *s = NULL;
unsigned int cpu = 0;
if (!sample)
return NULL;
for_each_online_cpu(cpu) {
e = ring_buffer_consume(ring_buffer, cpu, NULL, &sample->lost);
if (e)
if (e) {
s = ring_buffer_event_data(e);
memcpy(sample, s, sizeof(struct sample));
} else
sample = NULL;
return sample;
#define time_type ktime_t
#define time_get() ktime_get()
#define time_to_us(x) ktime_to_us(x)
#define time_sub(a, b) ktime_sub(a, b)
#define init_time(a, b) (a).tv64 = b
#define time_u64(a) (a).tv64
#define time_type u64
#define time_get() trace_clock_local()
#define time_to_us(x) div_u64(x, 1000)
#define time_sub(a, b) ((a) - (b))
#define init_time(a, b) a = b
#define time_u64(a) a
* get_sample - sample the CPU TSC and look for likely hardware latencies
* Used to repeatedly capture the CPU TSC (or similar), looking for potential
* hardware-induced latency. Called with interrupts disabled and with data.lock held.
static int get_sample(void)
time_type start, t1, t2, last_t2;
s64 diff, total = 0;
u64 sample = 0;
u64 outer_sample = 0;
int ret = -1;
init_time(last_t2, 0);
start = time_get(); /* start timestamp */
do {
t1 = time_get(); /* we'll look for a discontinuity */
t2 = time_get();
if (time_u64(last_t2)) {
/* Check the delta from the outer loop (t2 to next t1) */
diff = time_to_us(time_sub(t1, last_t2));
/* This shouldn't happen */
if (diff < 0) {
printk(KERN_ERR BANNER "time running backwards\n");
goto out;
if (diff > outer_sample)
outer_sample = diff;
last_t2 = t2;
total = time_to_us(time_sub(t2, start)); /* sample width */
/* This checks the inner loop (t1 to t2) */
diff = time_to_us(time_sub(t2, t1)); /* current diff */
/* This shouldn't happen */
if (diff < 0) {
printk(KERN_ERR BANNER "time running backwards\n");
goto out;
if (diff > sample)
sample = diff; /* only want highest value */
} while (total <= data.sample_width);
ret = 0;
/* If we exceed the threshold value, we have found a hardware latency */
if (sample > data.threshold || outer_sample > data.threshold) {
struct sample s;
ret = 1;
s.seqnum = data.count;
s.duration = sample;
s.outer_duration = outer_sample;
s.timestamp = CURRENT_TIME;
/* Keep a running maximum ever recorded hardware latency */
if (sample > data.max_sample)
data.max_sample = sample;
return ret;
* kthread_fn - The CPU time sampling/hardware latency detection kernel thread
* @unused: A required part of the kthread API.
* Used to periodically sample the CPU TSC via a call to get_sample. We
* disable interrupts, which does (intentionally) introduce latency since we
* need to ensure nothing else might be running (and thus pre-empting).
* Obviously this should never be used in production environments.
* Currently this runs on which ever CPU it was scheduled on, but most
* real-worald hardware latency situations occur across several CPUs,
* but we might later generalize this if we find there are any actualy
* systems with alternate SMI delivery or other hardware latencies.
static int kthread_fn(void *unused)
int ret;
u64 interval;
while (!kthread_should_stop()) {
ret = get_sample();
if (ret > 0)
wake_up(&data.wq); /* wake up reader(s) */
interval = data.sample_window - data.sample_width;
do_div(interval, USEC_PER_MSEC); /* modifies interval value */
if (msleep_interruptible(interval))
return 0;
* start_kthread - Kick off the hardware latency sampling/detector kthread
* This starts a kernel thread that will sit and sample the CPU timestamp
* counter (TSC or similar) and look for potential hardware latencies.
static int start_kthread(void)
kthread = kthread_run(kthread_fn, NULL,
if (IS_ERR(kthread)) {
printk(KERN_ERR BANNER "could not start sampling thread\n");
enabled = 0;
return -ENOMEM;
return 0;
* stop_kthread - Inform the hardware latency samping/detector kthread to stop
* This kicks the running hardware latency sampling/detector kernel thread and
* tells it to stop sampling now. Use this on unload and at system shutdown.
static int stop_kthread(void)
int ret;
ret = kthread_stop(kthread);
return ret;
* __reset_stats - Reset statistics for the hardware latency detector
* We use data to store various statistics and global state. We call this
* function in order to reset those when "enable" is toggled on or off, and
* also at initialization. Should be called with data.lock held.
static void __reset_stats(void)
data.count = 0;
data.max_sample = 0;
ring_buffer_reset(ring_buffer); /* flush out old sample entries */