The answer is that the kernel can be made to behave that way by tweaking a runtime parameter, but it is not necessarily a good idea. Before getting into that, however, it's worth noting that recent 2.6 kernels have a memory management problem which can cause serious problems after an application which reads through entire filesystems (updatedb, say, or a backup) has run. The problem is the slab cache's tendency to request allocations of multiple, contiguous pages; these allocations, when done at the behest of filesystem code, can bring the system to a halt. A patch has been merged which fixes this particular problem for 2.6.6.
The bigger issue remains, however: should the kernel swap out user applications in order to cache more file contents? There are plenty of arguments in favor of this behavior. Quite a few large applications set up big areas of memory which they rarely, if ever, use. If application memory is occasionally forced to disk, the unused parts will remain there, and that much physical memory will be freed for more useful contents. Without swapping application memory to disk and seeing what gets faulted back in, it is almost impossible to figure out which pages are not really needed. A large file cache is also a performance enhancer. The speedups that come from having frequently-accessed data in memory are harder to see than the slowdowns caused by having to fault in a large application, but they can lead to better system throughput overall.
Still, there are users who insist that, for example, a system backup should never force OpenOffice out to disk. They don't care how quickly a system maintenance application runs at 3:00 in the morning, but they care a lot about how the system responds when they are at the keyboard. This wish was expressed repeatedly until Andrew Morton exclaimed:
This helped quiet the debate as the parties involved looked more closely at this particular parameter. Or, perhaps, it was just fear of Andrew's singing. Either way, it has become clear that most people are unaware of what the "swappiness" parameter does; the fact that it has never been documented may have something to do with that.
So... swappiness, which is exported to /proc/sys/vm/swappiness, is a parameter which sets the kernel's balance between reclaiming pages from the page cache and swapping out process memory. The reclaim code works (in a very simplified way) by calculating a few numbers:
- The "distress" value is a measure of how much trouble the kernel is having freeing memory. The first time the kernel decides it needs to start reclaiming pages, distress will be zero; if more attempts are required, that value goes up, approaching a high value of 100.
- mapped_ratio is an approximate percentage of how much of the system's total memory is mapped (i.e. is part of a process's address space) within a given memory zone.
- vm_swappiness is the swappiness parameter, which is set to 60 by default.
With those numbers in hand, the kernel calculates its "swap tendency":
swap_tendency = mapped_ratio/2 + distress + vm_swappiness;
If swap_tendency is below 100, the kernel will only reclaim page cache pages. Once it goes above that value, however, pages which are part of some process's address space will also be considered for reclaim. So, if life is easy, swappiness is set to 60, and distress is zero, the system will not swap process memory until it reaches 80% of the total. Users who would like to never see application memory swapped out can set swappiness to zero; that setting will cause the kernel to ignore process memory until the distress value gets quite high.
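The arithmetic above can be checked with a small sketch. This is only a model of the simplified formula given in the article, not the kernel's code; the function names are made up for illustration, `//` stands in for C integer division, and treating a tendency of exactly 100 as the point where process memory becomes eligible is an assumption chosen to match the 80% example.

```python
def swap_tendency(mapped_ratio: int, distress: int, swappiness: int = 60) -> int:
    """Mirror the article's formula; // stands in for C integer division."""
    return mapped_ratio // 2 + distress + swappiness


def reclaims_mapped(mapped_ratio: int, distress: int, swappiness: int = 60) -> bool:
    # Assumption: process pages become eligible once the tendency reaches 100.
    return swap_tendency(mapped_ratio, distress, swappiness) >= 100


def mapped_ratio_threshold(distress: int, swappiness: int = 60):
    """Smallest mapped_ratio (0-100) at which process memory is considered.

    Returns None if no mapped_ratio in that range is high enough.
    """
    for ratio in range(101):
        if reclaims_mapped(ratio, distress, swappiness):
            return ratio
    return None


if __name__ == "__main__":
    # Print the threshold for a few swappiness/distress combinations.
    for swappiness in (0, 60, 100):
        for distress in (0, 50, 100):
            threshold = mapped_ratio_threshold(distress, swappiness)
            print(f"swappiness={swappiness:3d} distress={distress:3d} "
                  f"threshold={threshold}")
```

With the default of 60 and no distress, the threshold comes out at 80; with swappiness at zero and no distress, no value of mapped_ratio is enough, which is the "ignore process memory until distress gets quite high" behavior described above.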
The swappiness parameter should do what a lot of users want, but it does not solve the whole problem. Swappiness is a global parameter; it affects every process on the system in the same way. What a number of people would like to see, however, is a way to single out individual applications for special treatment. Possible approaches include using the process's "nice" value to control memory behavior; a low-priority process would not be able to push out significant amounts of a high-priority process's memory. Alternatively, the VM subsystem and the scheduler could become more tightly integrated. The scheduler already makes an effort to detect "interactive" processes; those processes could be given the benefit of a larger working set in memory. That sort of thing is 2.7 work, however; in the mean time, people who are unhappy with the kernel's swap behavior may want to try playing with the knobs which have been provided.
How do you set 'swappiness'?
Posted May 6, 2004 16:19 UTC (Thu) by southey (subscriber, #9466) [Link]
As an ordinary user (with root access), how do you actually set swappiness? Especially every reboot. Also, what performance problems would be expected? Would there just be intense disk swapping when required?
How do you set 'swappiness'?
Posted May 6, 2004 16:27 UTC (Thu) by corbet (editor, #1) [Link]
To set it to zero, type:
echo 0 > /proc/sys/vm/swappiness
All there is to it.
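For anyone who would rather script this than type it, here is a minimal sketch of reading and writing the same file from Python. The helper names are hypothetical, the 0-100 range check follows the range mentioned in this thread, and writing to the real /proc file needs root.

```python
from pathlib import Path

# The file named in the article; the helpers accept another path for testing.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path: Path = SWAPPINESS_PATH) -> int:
    """Return the current swappiness value as an integer."""
    return int(path.read_text().strip())


def write_swappiness(value: int, path: Path = SWAPPINESS_PATH) -> None:
    """Write a new value, the same way 'echo N > file' would (needs root)."""
    if not 0 <= value <= 100:
        raise ValueError(f"swappiness must be between 0 and 100, got {value}")
    path.write_text(f"{value}\n")


if __name__ == "__main__":
    # Only read here; changing the value is left to the caller.
    if SWAPPINESS_PATH.exists():
        print("current swappiness:", read_swappiness())
    else:
        print(SWAPPINESS_PATH, "not found on this system")
```

Like the echo command, this only changes the running kernel; it does nothing to make the setting survive a reboot.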
How do you set 'swappiness'?
Posted May 6, 2004 17:32 UTC (Thu) by southey (subscriber, #9466) [Link]
Many thanks, I'll give it a whirl. It is not always clear what to do for a less technical user. This is one of the best sections on Linux (web or print based) that allows at least me to understand what the kernel is and what it is doing in the past, present and future!
How do you set 'swappiness'?
Posted May 6, 2004 17:38 UTC (Thu) by thomas_d_stewart (subscriber, #4328) [Link]
And if you want it set at every reboot try:
echo "vm/swappiness=0" >> /etc/sysctl.conf
(That's how to do it in Debian and Fedora; it's part of the procps package)
HTH
--
Tom
How do you set 'swappiness'?
Posted May 13, 2004 22:55 UTC (Thu) by ArsonSmith (guest, #5695) [Link]
edit /etc/sysctl.conf
add:
vm.swappiness = <value>
replace <value> with a number 1 to 100
Many people say just to add an echo > /proc/sys/vm/swappiness
but this won't persist after a reboot. sysctl is a utility provided by most distributions to set this up after reboot. You can also see what all the configurable parameters are by running
sysctl -A
I am a fan of sysctl as it also keeps your runtime kernel configuration stuff in a central location /etc/sysctl.conf and not in various places /etc/init.d/kernel_custom_stuff or /etc/rc5.d/local or what ever other places people like to make up to put this kind of stuff.
How do you set 'swappiness'?
Posted May 13, 2004 22:56 UTC (Thu) by ArsonSmith (guest, #5695) [Link]
Sorry that should be value 0-100 not 1-100
2.6 swapping behavior
Posted May 6, 2004 17:05 UTC (Thu) by guinan (subscriber, #4644) [Link]
I've felt this - it is highly annoying when I come down to continue my work in the morning, and each of the 8 tabs in my Galeon window takes 10 seconds to page back in because updatedb and various other things that have no business leaving pages in the cache ran overnight.
I will try setting swappiness to 0, but why couldn't the kernel let processes provide a hint about page cache policy themselves? It would take a while for applications to catch up, but it keeps policy in userspace, on a per-application basis, instead of leaving it up to heuristics in the kernel. Examples,
updatedb,makewhatis,etc. - DISPOSABLE
galeon,evolution,etc. - INTERACTIVE
Use an ioctl(), /proc/self/ entry, whatever.
-Jamie
2.6 swapping behavior
Posted May 6, 2004 19:31 UTC (Thu) by abatters (subscriber, #6932) [Link]
Well, there is mlock() et al., but that would be like trying to swat a fly with a sledgehammer.
After closing a memory-hog program that was causing swapping, sometimes I just do "swapoff -a; swapon -a" to get the system responsive again.
mlock
Posted May 8, 2004 17:44 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
Actually, mlock is conceptually exactly what's required here. We're talking about a case where the following assumption inherent in real memory allocation policy fails: the pages for which fast access will be most appreciated are those that were most recently used.
Here, we have a user who is willing to let 32MB of memory sit idle overnight, even at the cost of slowing down other things, just so he can have immediate response every time he clicks his web browser. That's what mlock is about.
I do a similar (but rather opposite) thing with a ramdisk. I copy various files that are used in tasks that I want to be responsive into a ramdisk. Ramdisk is just file cache that is locked in memory. That way, no matter how much memory pressure there has been since the last time I used these files, they're always right there when I click for them.
2.6 swapping behavior
Posted May 13, 2004 16:50 UTC (Thu) by jonsmirl (guest, #7874) [Link]
What about simply adding "swapoff -a; swapon -a" to the end of updatedb and prelink cron scripts?
Speculative swap-in?
Posted May 6, 2004 17:45 UTC (Thu) by Ross (subscriber, #4065) [Link]
On an otherwise idle system with large amounts of free or cache-only
pages it might be useful if the kernel would speculatively load some of
the swapped pages back into memory. It should leave them in swap too in
case the memory is needed again (much faster to just throw it out than to
write it to disk). This would help restore the system to its previous
state after backups well before the person gets back to their computer.
2.6 swapping behavior
Posted May 6, 2004 17:50 UTC (Thu) by xorbe (guest, #3165) [Link]
"Without swapping application memory to disk and seeing what gets faulted back in, it is almost impossible to figure out which pages are not really needed."
Oh come on.
You mark the app's pages inaccessible. When the app touches it, the OS notes that the app really does have permissions, changes, and resumes the app. Pages that are never touched after a while can be dropped to swap.
2.6 swapping behavior
Posted May 6, 2004 18:21 UTC (Thu) by corbet (editor, #1) [Link]
That sounds vaguely like what the 2.4 VM did. It works, but you have to mess around with a lot of page table entries, keep track of which pages you have invalidated (in affected process's page tables), and know when to get around to cleaning them up.
To an extent, things are pretty much still done that way, actually; pages are pulled from page tables and put into the inactive list. Eventually they find their way to swap. If some process wants them in the mean time, they are soft-faulted back in.
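The general idea being argued over in this exchange (notice which pages get touched, and treat the ones that stay untouched as candidates for swap) can be illustrated with a toy model. This is purely a sketch: the class, its names, and the `idle_limit` cutoff are invented for illustration, and it does not reflect how either the 2.4 or 2.6 VM is actually implemented.

```python
class ToyPageTracker:
    """Toy model: pages untouched for several scans become swap candidates."""

    def __init__(self, pages, idle_limit=2):
        # Number of consecutive scans each page has gone without being touched.
        self.idle_scans = {page: 0 for page in pages}
        # Pages touched since the last scan (stands in for an "accessed" mark).
        self.touched = set()
        # Hypothetical cutoff: scans a page may sit idle before it is a candidate.
        self.idle_limit = idle_limit

    def touch(self, page):
        """Record that a process used this page."""
        self.touched.add(page)

    def scan(self):
        """One periodic pass: age idle pages, reset used ones, list candidates."""
        candidates = []
        for page in self.idle_scans:
            if page in self.touched:
                self.idle_scans[page] = 0
            else:
                self.idle_scans[page] += 1
                if self.idle_scans[page] >= self.idle_limit:
                    candidates.append(page)
        self.touched.clear()
        return candidates


if __name__ == "__main__":
    tracker = ToyPageTracker(["editor", "browser", "unused_buffer"])
    tracker.touch("editor")
    print(tracker.scan())
    tracker.touch("editor")
    tracker.touch("browser")
    print(tracker.scan())
```

Even in this toy form the bookkeeping cost is visible: every scan has to visit every page, which is the "mess around with a lot of page table entries" concern raised above.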
2.6 swapping behavior
Posted May 6, 2004 18:42 UTC (Thu) by Duncan (guest, #6647) [Link]
Some weeks ago, as I was reading about yet more machinations and hoops the
kernel has to go thru to efficiently handle swap, I realized that with a
gig of memory and seldom more than half of it used by apps, I really
didn't need swap at all, so I turned it off.
I've been running without swap since then, to no bad effect and perhaps a
slightly more responsive system, if I'm to believe the "feel" of things.
The biggest surprise, however, was when I did the recompile with swap
compiled out of the kernel. Normally, changing a single option in a tree
that's already compiled the kernel and hasn't been cleaned from doing so
means a rather speedy recompile, as only a couple of C files worth of code
actually has to be recompiled. Not so with the swap option! Just that
single config change meant recompiling almost the entire kernel, it would
seem. I didn't realize swapping code was THAT deeply interwoven into
nearly every aspect of kernel operation, but it obviously is. Thus,
without all that extra swapping code to deal with, the kernel SHOULD be
faster.
I DID have to modify my startup and shutdown scripts, as it seems unlike
everything ELSE, where Mandrake (my current distrib) checks to see if the
files are actually there before attempting to operate on them, their init
scripts simply ASSUME the swap related /proc files will be there, and I'd
just turned them off. However, that was easily done, and I've been swap
free for a month or possibly six weeks, now.
My system is for desktop use, mainly, altho I AM running an AMD64 so have
64-bit code taking up a bit more memory than 32-bit would. Still, running
the not exactly light KDE, and all my regular apps, I seldom use more than
half a gig for app memory, with the other half in cache. Thus, the system
works well without swap. I'd not recommend the solution for graphical
X-system desktop use, however, for anyone with less than half a gig of
memory, and preferably more, 768M, leaving plenty of caching room, even
with a decent complement of desktop apps running. That half-apps
half-cache rule of thumb seems pretty good, but with a lighter window
manager and 32-bit, 384M for each seems reasonable, yielding the 768M I
mentioned. Console mode only or lighter X use may well use only 256M of
app memory, which leaving the same for cache means 512M. With modern
graphical apps, that may be a bit small and keeping a swap around may be a
good idea, but that's where the 0 swappiness config this article
mentions, comes in. However, with the cost of memory, nowadays, unless
one is stuck on an old hardware platform with limited memory
upgradability, there's little reason not to have at least half to 3/4 gig
of memory in the system, and upgrading to a gig is likely to increase
performance more than that last notch in CPU speed would.
When I upgrade to 2 or 4 gig, 6 months or a year from now, I'll probably
turn tmpfs back on, and put /tmp and /var/tmp in physical memory rather
than disk, as well.
Duncan
2.6 swapping behavior
Posted May 6, 2004 20:47 UTC (Thu) by thyrsus (subscriber, #21004) [Link]
In olden days, the sticky bit on binary executables gave the kernel a hint that it should avoid swapping/paging out the memory for that executable. Might that still be appropriate today?
2.6 swapping behavior
Posted May 7, 2004 5:12 UTC (Fri) by maney (subscriber, #12630) [Link]
I'm afraid you've been misinformed. The sticky bit told the kernel not to purge the executable from swap when it wasn't running (until there was no non-sticky swap available to avoid an out of memory panic, of course). IIRC, this actually goes back to the days when it wasn't swapping as we know it, but the wholesale paging of an app's executable memory in one chunk. (recall that on the PDP-11, executable space was less than 64KB maximum, and the memory management didn't support page swapping anyway)
So the traditional use of the sticky bit is actually rather the opposite of what's wanted here! It's also less than clear that attaching the "swap me only under duress" property statically to the source file is the best choice even if it turns out to be practical to prioritize non-cache pages at that granularity. One obvious complication (that also wasn't present in the PDP-11 paging model) is shared libraries.
2.6 swapping behavior
Posted May 6, 2004 21:47 UTC (Thu) by iabervon (subscriber, #722) [Link]
I think the issue is really that stuff used for a minute five hours ago is preferred to stuff used for an hour six hours ago. Stuff that's of lasting significance is more likely to be needed again after a period of the system being idle, although it may be good to evict while the system is busy.
Ideally, things would get swapped out while updatedb ran, and then swapped back in when nothing had used the memory cached for updatedb. But it wouldn't just be program memory getting swapped back in; it would be clever to pull into cache files and directories that get used a lot, so that (for example), your Mozilla cache would be in memory again when you got up.
2.6 swapping behavior
Posted May 7, 2004 7:59 UTC (Fri) by njhurst (guest, #6022) [Link]
I don't understand why updatedb needs so much cache memory? Surely it only needs to keep a stack of inodes from root to the current point in the filesystem in memory. Once it has looked at a file that file's memory should be returned to the pool immediately. I don't know how to force the kernel to do this though.
(This is obviously updatedb specific information, but maybe it would be easier to fix updatedb than everything else?)
2.6 swapping behavior
Posted May 7, 2004 21:53 UTC (Fri) by addw (guest, #1771) [Link]
Trouble is that the kernel doesn't know that the updatedb is not going to look at those files ever again (well, 'till it runs again tomorrow). But the blocks from the file system are left in memory on the grounds that something recently used is likely to be used again in the near future.
Simple prediction doesn't always work.
2.6 swapping behavior
Posted May 13, 2004 0:19 UTC (Thu) by njhurst (guest, #6022) [Link]
I agree, my point is just that maybe some thought could be put into making updatedb more well behaved, rather than trying to get that behaviour directly out of the kernel?
I think it is allowable to have user space programs try to optimise their behaviour with the kernel :)
2.6 swapping behavior
Posted May 14, 2004 11:17 UTC (Fri) by forthy (guest, #1525) [Link]
IMHO the initial priority of a just-allocated or just-loaded buffer is too
high. That's why memory hogs (which claim a page once and never look at it
again) swap out everything else in Linux, and updatedb also does the same
thing. The pages of OpenOffice, which have been used and reused all day
long over and over again have lower priority for the kernel than a newly
allocated use-once page.
But as long as kernel developers stick their thumbs in their ears and sing
"lalala", this won't change.
Another note: updatedb *only* reads the names of a file system. Well, it
also checks if a name is a directory, and that forces it to read the
entire inode. The real problem here is that updatedb tries to solve
something the file system should do itself (especially if you think of it
like Hans Reiser does). Why is it impossible for the file system to keep
the file names in a database like updatedb, and answer queries like locate
directly? Hint: BeOS did something like that.
2.6 swapping behavior
Posted May 11, 2004 1:28 UTC (Tue) by mcelrath (guest, #8094) [Link]
updatedb (and many other applications) need to be using O_DIRECT or some other flag that indicates explicitly that files will be read exactly once, and putting the file in the buffer cache isn't necessary.
There is no way for the kernel to predict that some process named 'updatedb' will read every file exactly once, but another process named 'mozilla' likes to read the same file over and over. It's up to the application to specify that.
AFAIK O_DIRECT is not the appropriate flag for this, because read/write buffers must be page aligned to use it. An O_NOCACHE flag has been proposed before (especially by streaming video folks) but has not been added, though I did see an implementation once. I think an O_NOCACHE or O_READONCE is the solution to this...
2.6 swapping behavior
Posted May 13, 2004 21:00 UTC (Thu) by jzhao (guest, #2865) [Link]
Robert Love had a patch which does exactly this:
http://www.kernel.org/pub/linux/kernel/people/rml/O_STREAMING/README
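For readers who want to experiment with the "read once, don't keep it cached" idea from userspace, here is a rough sketch. It uses neither O_STREAMING nor the proposed O_NOCACHE flag; it substitutes a different mechanism, the advisory posix_fadvise() call, which Python exposes as os.posix_fadvise on platforms that have it. The function name `read_once` is made up, the hints are only advice that the kernel is free to ignore, and the sketch covers file data only, so it would not address updatedb's inode and directory traffic.

```python
import os


def _advise(fd: int, advice_name: str) -> None:
    """Apply a posix_fadvise hint to the whole file if the platform supports it."""
    if not hasattr(os, "posix_fadvise"):
        return  # Platform without posix_fadvise: silently skip the hint.
    try:
        # Offset 0 and length 0 mean "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, getattr(os, advice_name))
    except OSError:
        pass  # The call is advisory, so a failure here is not fatal.


def read_once(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file sequentially, then hint that its cached pages can go.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        _advise(fd, "POSIX_FADV_SEQUENTIAL")
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)  # A real tool would process the chunk here.
        _advise(fd, "POSIX_FADV_DONTNEED")
    finally:
        os.close(fd)
    return total


if __name__ == "__main__":
    print(read_once(__file__), "bytes read")
```

Whether this actually keeps a backup-style job from pushing other data out of memory depends on the kernel and filesystem in use; it is a starting point for measurement, not a fix.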