Truth about drop_cache & sync and dealing with OOM Killer

I use the drop_caches file in the /proc pseudo-filesystem (/proc/sys/vm/drop_caches) to force the OS to release the memory held by clean page cache, inodes and dentries. I used it a few times without ever asking why there was a memory shortage to begin with. After realizing that dropping caches is not a real solution, I learned to investigate the actual reason for the shortage (i.e., who ate my memory?).

The other day, while googling a memory issue, I ended up on this page. It describes how to force the kernel to discard some memory-resident objects by writing 1, 2 or 3 into the drop_caches file in the /proc pseudo-filesystem. Then I read the part where the author claims that forcing a dentry, inode and pagecache drop could lead to some serious consequences. Here is an excerpt:

There are huge consequences to dumping these caches and running sync. If you are writing data at the time you run these commands, you’ll actually be dumping the data out of the filesystem cache before it reaches the disk, which could lead to very bad things.

I wonder what consequences the author is talking about. The notion that data being written at the time the command runs will be dropped is not correct. Dirty objects cannot be dropped by writing 1, 2 or 3 into drop_caches; the kernel only discards clean ones. The only way to force dirty objects out of memory is the sync command, and sync writes them to disk rather than discarding them. Running these commands may have consequences, but not the one described above: processes that were using the dropped objects will do more I/O, and more CPU cycles will be spent rebuilding those objects. But sync will not drop data. Here is an excerpt from the sync command documentation:

sync writes any data buffered in memory out to disk.  This can include (but is not limited to) modified superblocks, modified inodes, and delayed reads and writes.  This must be implemented by the kernel; The `sync’ program does nothing but exercise the `sync’ system
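To make the distinction concrete, here is a minimal Python sketch of the sequence discussed above, assuming a Linux host and (for the real /proc path) root privileges. The helper names flush_then_drop and meminfo_kb are mine, not anything from the kernel or the quoted article. The idea is simply: sync writes dirty data to disk, and only then does writing to drop_caches discard what is now clean.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def meminfo_kb(text, field):
    """Return the numeric value of one field (e.g. "Dirty") from /proc/meminfo text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            return int(rest.split()[0])
    raise KeyError(field)


def flush_then_drop(mode=3, path=DROP_CACHES):
    """Write dirty data to disk, then ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    The path parameter exists only so the function can be tried on a
    scratch file; writing to the real file needs root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # same effect as the sync command: flush, do not discard
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Comparing meminfo_kb(open("/proc/meminfo").read(), "Dirty") before and after a call is one way to watch the dirty figure fall because of the sync step rather than the drop.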

It is important to use the drop_caches tweak only in a proper context. Using it just to free some memory does not make sense; it might mask the actual reason for the OOM kill or whatever memory issue you are dealing with. The best way to zero in on the root cause is to gather snapshots of the process list, slabinfo, meminfo and buddyinfo (memory fragmentation) before, during and after the issue. If you are dealing with an after-the-fact situation like an OOM kill, try to reproduce it, and use a background process to capture the aforementioned stats every 5 or 10 seconds.
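A background collector along those lines could look like the following Python sketch. It is only an illustration: the function names take_snapshot and collect, the output layout, and the particular ps columns are my own choices, and the ps options assume the procps version of ps found on most Linux distributions. Note that /proc/slabinfo is usually readable only by root, so the sketch records the error instead of failing.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path

PROC_FILES = ("/proc/meminfo", "/proc/slabinfo", "/proc/buddyinfo")
PS_CMD = ("ps", "-eo", "pid,ppid,rss,vsz,comm", "--sort=-rss")


def take_snapshot(out_dir, proc_files=PROC_FILES, ps_cmd=PS_CMD):
    """Copy the memory stats and the process list into a timestamped directory."""
    snap = Path(out_dir) / datetime.now().strftime("%Y%m%d-%H%M%S")
    snap.mkdir(parents=True, exist_ok=True)
    for src in proc_files:
        try:
            text = Path(src).read_text()
        except OSError as exc:  # e.g. slabinfo without root
            text = f"unreadable: {exc}\n"
        (snap / Path(src).name).write_text(text)
    ps = subprocess.run(ps_cmd, capture_output=True, text=True, check=False)
    (snap / "ps").write_text(ps.stdout)
    return snap


def collect(out_dir, interval=5.0, count=None):
    """Take a snapshot every `interval` seconds; run forever if count is None."""
    taken = 0
    while count is None or taken < count:
        take_snapshot(out_dir)
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

Started before the reproduction attempt, for example with collect("/var/tmp/memstats", interval=5), it leaves one directory per sample that can be compared across the before, during and after phases.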

It is very important to know what hardware architecture you are dealing with, for instance 32-bit vs. 64-bit x86. The way Linux manages memory zones on 32-bit architectures is very different from 64-bit architectures. Knowing the kernel version is also extremely important; you have better controls in 2.6.x than in 2.4.x or earlier versions. Here is another excerpt from the article that prompted me to write this:

When I checked the sysstat data with sar, I found that the server was only using about 2-3GB out of 4GB of physical memory at the time when OOM killer was started. At this point, I was utterly perplexed.

It almost seems like this guy was dealing with a 32-bit machine and the OS killed his processes due to low-memory exhaustion, which would explain why there was free memory around at the time of the OOM kill. It may also be the case that his processes were requesting a set of physically contiguous memory pages. How he came to the conclusion that the OS was caching too much just puzzles me!
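Both suspicions can be checked from a buddyinfo snapshot, which lists, per zone, the number of free blocks of each order (order k means 2^k contiguous pages). The Python sketch below is a rough illustration with hypothetical helper names (parse_buddyinfo, free_pages_at_or_above); it assumes the usual "Node N, zone NAME count count ..." line layout.

```python
def parse_buddyinfo(text):
    """Map (node, zone) to the list of free block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: ['Node', '0,', 'zone', 'Normal', '12', '7', ...]
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_pages_at_or_above(counts, order):
    """Free pages that sit in contiguous blocks of at least 2**order pages."""
    return sum(n << k for k, n in enumerate(counts) if k >= order)
```

A zone whose free_pages_at_or_above(counts, 0) is large while the same call with a higher order returns zero has free memory but nothing contiguous to hand out. On a 32-bit box, a nearly empty Normal zone next to a healthy HighMem zone would point to low-memory exhaustion instead.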

8 Responses to “Truth about drop_cache & sync and dealing with OOM Killer”

  1. Major Hayden says:

    The machine in question was actually a 64-bit RHEL 5 machine, and I saw the memory usage by running slabtop. After some testing, we did find that if we dropped caches, the data being written by the application was incomplete. Since the application was writing simple text files, it was easy to see that data was missing.

  2. admin says:

    Wow, that beats the heck out of me! Combining your article and the comment above, I gather you noticed that a significant portion of your memory had gone to slab caches, but there was still 1G or more of free memory when the OOM killer killed your processes. On a 64-bit box running a 64-bit OS, I do not see how you could have that much free memory (as much as 1G) and still have the OOM killer go after your processes. Are you sure your memory utilization snapshot is a point in time rather than an average over a period?

    The part where you saw data loss due to dropping caches does not seem right either. I have used this several times and never had any such issue. drop_caches can only drop clean caches, not dirty ones, and sync will definitely write dirty caches to disk. I still wonder how your application data was lost. What file system does the application write to? Anyhow, take a moment to update me if you find anything more!

    Cheers!

