Weblog entry #18 for mcortese

Running out of entropy
Posted by mcortese on Thu 30 Jun 2011 at 12:10
Tags: none.
After installing a 2.6.38 kernel, I've seen the machine unexpectedly pausing when I'm not in front of it (typical scenario: while aptitude installs or upgrades things). It then recovers immediately as soon as I move the mouse.

Tracking /proc/sys/kernel/random/entropy_avail reveals it is constantly below 1000, quickly going down to ~100 in just a few seconds of inactivity.
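
A minimal sketch of how that tracking could be logged with timestamps, so the drops can later be lined up against whatever else was running. The function name `log_entropy` and the `ENTROPY_FILE` override are made up for illustration; only the /proc path comes from the observation above.

```shell
#!/bin/sh
# Hypothetical helper: print "<epoch seconds> <entropy_avail>" per sample.
# ENTROPY_FILE can be overridden; the default is the file named above.
ENTROPY_FILE=${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}

log_entropy() {
    samples=$1          # number of samples to take
    interval=${2:-1}    # seconds between samples (default 1)
    n=0
    while [ "$n" -lt "$samples" ]; do
        printf '%s %s\n' "$(date +%s)" "$(cat "$ENTROPY_FILE")"
        n=$((n + 1))
        # no need to sleep after the last sample
        if [ "$n" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

Something like `log_entropy 600 1 >> entropy.log` would then record ten minutes of readings. Note that the logger itself spawns a couple of processes per sample, so it is not a perfectly neutral observer.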

I'm curious to track down who's eating up my entropy. Does anybody know how?

Then I wonder what changed in the latest kernel that's causing this behaviour.

Does anybody have the same issue?

 

Comments on this Entry

Posted by Anonymous (83.215.xx.xx) on Thu 30 Jun 2011 at 20:18
How about lsof, fuser?

sudo apt-get install lsof psmisc
lsof /dev/random
lsof /dev/urandom

(fuser can be used in a similar way but may need some post processing
to resolve PID to process name)
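
The post-processing for fuser could look roughly like this. It is only a sketch: `pid_names` and `holders_of` are hypothetical helper names, and it assumes a Linux /proc that provides /proc/<pid>/comm.

```shell
#!/bin/sh
# Hypothetical helper: print "<pid> <command name>" for each PID given.
pid_names() {
    for pid in "$@"; do
        # /proc/<pid>/comm holds the short command name of the process
        if [ -r "/proc/$pid/comm" ]; then
            printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm")"
        fi
    done
}

# fuser prints the PIDs on stdout and everything else on stderr, so the
# unquoted command substitution below splits into one argument per PID.
holders_of() {
    pid_names $(fuser "$1" 2>/dev/null)
}
```

Usage would be `holders_of /dev/random` and `holders_of /dev/urandom`, run as root so that processes of other users are visible. Like lsof, this only catches processes that have the file open at that very moment.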


Posted by mcortese (20.142.xx.xx) on Fri 1 Jul 2011 at 16:32
I had tried that path: no process seems to keep those device files open. Is there a way to capture each attempt at reading from /dev/[u]random?


Posted by Anonymous (83.215.xx.xx) on Sat 2 Jul 2011 at 12:49
If your kernel has audit support available, you might be able to come up with some userspace solution based on libaudit-dev (see readahead-fedora for an example of how to trace open files).
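
A possibly simpler route than programming against libaudit would be the stock audit command-line tools. The sketch below assumes the auditd userspace tools (auditctl, ausearch) are installed and run as root; the function names and the key `entropy-read` are arbitrary choices. As far as I know, a watch of this kind records when the file is opened for reading, not every individual read.

```shell
#!/bin/sh
# Assumption: auditctl and ausearch are available and this runs as root.
# The key is an arbitrary label used to find the records again.
AUDIT_KEY=entropy-read

start_random_watch() {
    # -w: watch this path, -p r: log read access, -k: tag records with a key
    auditctl -w /dev/random  -p r -k "$AUDIT_KEY"
    auditctl -w /dev/urandom -p r -k "$AUDIT_KEY"
}

show_random_readers() {
    # -i translates numeric IDs into names; look at the exe= and comm= fields
    ausearch -k "$AUDIT_KEY" -i
}

stop_random_watch() {
    # -W removes a watch that was added with -w
    auditctl -W /dev/random  -p r -k "$AUDIT_KEY"
    auditctl -W /dev/urandom -p r -k "$AUDIT_KEY"
}
```

The idea would be to call `start_random_watch`, leave the machine idle until the entropy has dropped, then inspect `show_random_readers` and clean up with `stop_random_watch`.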

Another option would be to hack some debug code (printk) into sys_open in the kernel (might be dangerous, not recommended)

Both are rather involved (C programming / kernel programming skills required).

If you've got a known good and known bad kernel version, you could try to use "git bisect" to track down the change in the kernel source that introduced the behavior. (requires kernel compiling and git skills)
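
For reference, the start of such a bisect session could look like the sketch below. The function name is made up, and the two arguments are whatever release tags correspond to the known good and known bad kernels; it has to be run inside a clone of the kernel git tree.

```shell
#!/bin/sh
# Hypothetical wrapper around the usual three commands that open a bisect.
# $1 = known good tag, $2 = known bad tag (for example v2.6.38)
start_kernel_bisect() {
    git bisect start
    git bisect bad  "$2"    # version that shows the pauses
    git bisect good "$1"    # version that does not
}

# After that, git checks out a commit in between. Build it, boot it, test,
# and report the verdict so that git can pick the next candidate:
#   git bisect good     # no pauses with this build
#   git bisect bad      # pauses are back
#   git bisect reset    # when done, return to the original checkout
```

Every build should use the same .config, otherwise a configuration difference may be mistaken for a code change.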

other ideas:

some systems support hardware random number generation, could it be
that support for that was compiled into the old kernel but not in the new one?

it could be worth comparing the kernel .config files (and dmesg output) from both versions

# example for one version (assuming self compiled kernel in /usr/src/)
grep -irn HW_RANDOM /usr/src/linux-2.6.38/.config

# .config for non self compiled kernels
grep -irn HW_RANDOM /boot/config-2.6.38

if it's compiled in ("Y") it should work (unless the hardware has changed
and the new machine has no RNG that HW_RANDOM supports)

if it's "M" (module) check whether the corresponding modules are loaded (lsmod)

if it's "# ... is not set" - then, well compile it in ;)
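
The three cases above can be checked side by side for two config files with a small helper. This is only a sketch: `config_state` and `compare_option` are hypothetical names, and the paths passed in are whatever applies on the machine.

```shell
#!/bin/sh
# Hypothetical helper: report how one option is set in a kernel config file.
# Prints the value (y or m), or "not set" if there is no CONFIG_<opt>= line.
config_state() {
    file=$1
    opt=$2      # option name without the CONFIG_ prefix, e.g. HW_RANDOM
    val=$(sed -n "s/^CONFIG_${opt}=//p" "$file")
    if [ -n "$val" ]; then
        printf '%s\n' "$val"
    else
        echo "not set"
    fi
}

# Print the state of one option in an old and a new config on a single line.
compare_option() {
    printf '%s: old=%s new=%s\n' "$3" \
        "$(config_state "$1" "$3")" "$(config_state "$2" "$3")"
}
```

For example, `compare_option OLD_CONFIG /boot/config-2.6.38 HW_RANDOM`, with OLD_CONFIG standing for the config of the previous kernel. For an option that comes out as "m", `lsmod | grep -i rng` is one way to see whether any RNG module is actually loaded.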


It might also be possible that, for whatever reason, some former entropy source (keyboard, mouse, hard disk, network latency) is no longer used by this
kernel version.

(no idea how to track that down without reading lots of source code, though.)

When really desperate, you can always try this: ;)


http://blog.ossbox.com/2010/10/strengthening-devrandom-on-linux.html


Posted by mcortese (20.142.xx.xx) on Mon 4 Jul 2011 at 11:32
Thanks for the reply.

In fact, I switched from a self-compiled 2.6.36 to Debian-supplied 2.6.38, so a lot of config options changed. HW_RANDOM is explicitly one of the options that did change, although in the opposite direction from what you would have imagined: it was not set in my self-compiled kernel, it is now configured as "M". However, I have no hardware RNG, and the module is not loaded, so I guess this has no influence at all.

I tried deleting /dev/random and replacing it with a link to /dev/urandom that's supposed to provide the same kind of data without ever blocking, but nothing changed. I infer that the entropy-hungry process has other means for getting random numbers than accessing the device file, probably directly through get_random_bytes().

I still haven't found who's using up all this entropy. Of course ssh-ing drains the pool much quicker, which is expected as ssh is supposed to use random data for its crypto stuff. Next thing to try out is to boot in single user mode, and monitor the entropy pool while starting the various services one by one.

In any case, the actual issue is not who uses the random numbers, nor how to refill the pool, but rather why the whole system comes to a stop when the entropy is too low. I would expect a single process freezing while trying to read /dev/random, not the whole system becoming unresponsive. I even tried with a background script that refills the pool every 10 seconds, taking data from pre-built chunks of 2046 random bits:

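#!/bin/bash
# Every 10 seconds, write the next pre-built chunk (pool_1, pool_2, ...)
# into /dev/random.
i=1
while [ -f "pool_$i" ]; do   # stop once the chunks run out
  dd if="pool_$i" of=/dev/random bs=1 count=512
  i=$((i + 1))
  sleep 10
done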
When the system stops, this background process stops as well, so it does not work (I imagine the methods referred to by your link above would behave in the same way).

Any idea?


Posted by Anonymous (83.215.xx.xx) on Mon 4 Jul 2011 at 22:45
The single user mode approach sounds good, maybe it leads you to the suspect
user space process.

If you suspect ssh - how much ssh traffic is there?

(maybe add some iptables LOG rules and check)

If there are lots of suspicious ssh connections, you might consider installing
something like denyhosts, fail2ban, ... (no easy decision on a public server, though)

Does disconnecting the network change the situation?
(maybe some kind of DOS attack?)

If you can rule out a user space issue (that means the problem persists in single user mode, with network disconnected) and you really suspect some buggy kernel driver abusing get_random_bytes, I'd suggest giving the printk approach mentioned above a try.

That means grep for all occurrences of it in the kernel that are relevant for your setup and add some printk's there.

or play the same game as in userspace: find modules / compiled-in parts
that make use of get_random_bytes and prevent the module from getting loaded
(resp. the driver from being compiled in) until the problem disappears.

Be careful and apply common sense when excluding drivers or your system may cease to boot at all ;)

How about a self compiled 2.6.38 with "make oldconfig" based on the old kernel?

Could it be a rootkit?
(no idea how to check or rule that out - it could always be something that isn't
yet detected by rkhunter and friends)



Posted by mcortese (20.142.xx.xx) on Thu 7 Jul 2011 at 18:37
A quick update.

First of all, my approach of writing to /dev/random was pretty silly: the data sent to the device file do get added to the random pool, but no entropy bits are credited for that. The only way for userland to increase the entropy count is accessing the device file through ioctl() (the RNDADDENTROPY request, which requires root privileges).

Then, the introduction of ASLR in recent years has boosted the kernel's need for random bits: this means that simply spawning processes is a way to drain the entropy pool.
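
A rough way to observe this is to read the counter, spawn a batch of short-lived processes, and read it again. The sketch below uses a made-up function name and an overridable `ENTROPY_FILE`; the reading is only meaningful on kernels of this era, since the counter behaves differently on much newer ones.

```shell
#!/bin/sh
ENTROPY_FILE=${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}

# Hypothetical helper: spawn N short-lived processes and print
# "<entropy before> <entropy after>".
entropy_cost_of_spawning() {
    count=$1
    before=$(cat "$ENTROPY_FILE")
    i=0
    while [ "$i" -lt "$count" ]; do
        /bin/sh -c :        # external program, so this is a real fork + exec
        i=$((i + 1))
    done
    after=$(cat "$ENTROPY_FILE")
    printf '%s %s\n' "$before" "$after"
}
```

Running `entropy_cost_of_spawning 100` on an otherwise idle machine, without touching mouse or keyboard, should give an idea of how much a burst of process creation costs. The two `cat` calls are processes too, so the measurement slightly disturbs what it measures.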

Finally, a deeper scrutiny of the kernel sources revealed that, apart from the standard input devices (mouse & keyboard), no other driver used on my system is designed to contribute entropy. For example, the network card, "via-rhine", does not set the flag IRQF_SAMPLE_RANDOM when registering the interrupt handler (contrary to other network cards that actually do). The same holds for the ATA subsystem. That came as a real surprise to me: isn't the disk activity supposed to be a source of entropy?
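
For anyone who wants to repeat the check on their own drivers, a sketch along these lines should do. The helper name is hypothetical, and it simply looks for the flag name in the given source files, which is a textual test rather than proof that the flag reaches request_irq().

```shell
#!/bin/sh
# Hypothetical helper: for each source file given, say whether it mentions
# IRQF_SAMPLE_RANDOM, the flag a driver sets when registering its interrupt
# handler to have its interrupts feed the entropy pool.
samples_random() {
    for src in "$@"; do
        if grep -q 'IRQF_SAMPLE_RANDOM' "$src"; then
            echo "$src: yes"
        else
            echo "$src: no"
        fi
    done
}
```

From the top of the kernel source tree this would be called with the source files of the drivers in use (for via-rhine I would expect drivers/net/via-rhine.c in this kernel series, but check the path in your tree). `grep -rl IRQF_SAMPLE_RANDOM drivers/net` lists the network drivers that do set the flag.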
