SystemTap is an easy and powerful, yet kludgy, framework for instrumenting (Linux) kernel internals. It lets you define probes triggered at function entry or exit, and even dereference function arguments (i.e. you can dig all the way down through structure members). These probes are scripted in a concise language and can be loaded at runtime (you can run a new script tracing the kernel without rebooting).
It has proven quite useful for my current needs (detecting power hogs on a running Linux desktop). Here's an example. I wanted to pinpoint all applications spinning up block devices for no reason. This includes not only applications reading and writing files, but also those causing inode metadata changes (like atime), as long as the change actually spins the disks (reading cached metadata is fine).
Linux offers an ugly procfs interface for this purpose:
echo 1 > /proc/sys/vm/block_dump
This logs every application causing block device accesses in... the kernel ring buffer. So you end up with a polluted dmesg and klogd/syslogd logging like mad (causing new disk activity, and so on). Knowing nothing about the kernel's internals, I just grepped for block_dump to find every instrumented function, and emulated it with the following SystemTap script:
#! /usr/bin/env stap
# Display block I/O consumers (doing reads, writes and dirtied inodes),
# exactly as "echo 1 > /proc/sys/vm/block_dump"
# but on stdout rather than polluting kernel ring buffer (dmesg).
probe kernel.function("submit_bio") {
    # Bit 0 of $rw distinguishes writes from reads.
    op = ($rw & 1) ? "write" : "read"
    printf("%s(%d) %s on device %s\n", execname(), pid(), op,
           kernel_string($bio->bi_bdev->bd_disk->disk_name))
}
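probe kernel.function("__mark_inode_dirty") {
    s_id = kernel_string($inode->i_sb->s_id)
    # Report only if some requested flag is not set yet, and ignore inode 0
    # on the "bdev" superblock. The kernel's block_dump check tests
    # strcmp(s_id, "bdev"), which is true when s_id differs from "bdev".
    if (($inode->i_state & $flags) != $flags && ($inode->i_ino || s_id != "bdev")) {
        printf("%s(%d) dirtied inode %d on device %s\n",
               execname(), pid(), $inode->i_ino, s_id)
    }
}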
Simple, isn't it? So I started cooking up a top(1)-like utility to trace the same things. Problem: I don't know how to clear the screen other than with this ugly system("clear"). Any thoughts?
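One candidate, sketched here in shell rather than inside a probe: terminals that honour ANSI escape sequences clear the screen when they receive ESC[2J (erase display) followed by ESC[H (cursor home), so no external clear(1) process is needed. The function names clear_screen and redraw are made up for illustration, and the report line is only a placeholder. Whether printing the same bytes from a probe handler behaves well is an assumption that still needs checking.

```shell
#!/bin/sh
# Clear the terminal without spawning clear(1): write the ANSI sequences
# "erase display" (ESC [ 2 J) and "cursor home" (ESC [ H) to stdout.
# clear_screen is an illustrative name, not an existing command.
clear_screen() {
    printf '\033[2J\033[H'
}

# One refresh in the spirit of top(1): clear, then print a fresh report.
# The report is a timestamp placeholder standing in for real I/O counters.
redraw() {
    clear_screen
    printf 'block I/O report at %s\n' "$(date)"
}
```

Calling redraw in a loop with a sleep between iterations gives a crude refreshing display. This assumes an ANSI-capable terminal; on anything else the escape bytes show up as garbage.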
PS: would it be acceptable to convert the block_dump interface into something more like /proc/timer_stats?