commit 5047887caf1806f31652210df27fb62a7c43f27d Merge: 996abf0... 973b7d8... Author: Linus Torvalds Date: Fri Jul 25 11:08:17 2008 -0700 Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits) powerpc: Wireup new syscalls Move update_mmu_cache() declaration from tlbflush.h to pgtable.h powerpc/pseries: Remove kmalloc call in handling writes to lparcfg powerpc/pseries: Update arch vector to indicate support for CMO ibmvfc: Add support for collaborative memory overcommit ibmvscsi: driver enablement for CMO ibmveth: enable driver for CMO ibmveth: Automatically enable larger rx buffer pools for larger mtu powerpc/pseries: Verify CMO memory entitlement updates with virtual I/O powerpc/pseries: vio bus support for CMO powerpc/pseries: iommu enablement for CMO powerpc/pseries: Add CMO paging statistics powerpc/pseries: Add collaborative memory manager powerpc/pseries: Utilities to set firmware page state powerpc/pseries: Enable CMO feature during platform setup powerpc/pseries: Split retrieval of processor entitlement data into a helper routine powerpc/pseries: Add memory entitlement capabilities to /proc/ppc64/lparcfg powerpc/pseries: Split processor entitlement retrieval and gathering to helper routines powerpc/pseries: Remove extraneous error reporting for hcall failures in lparcfg powerpc: Fix compile error with binutils 2.15 ... Fixed up conflict in arch/powerpc/platforms/52xx/Kconfig manually. commit 996abf053eec4d67136be8b911bbaaf989cfb99c Merge: 93082f0... d37e6bf... 
Author: Linus Torvalds Date: Fri Jul 25 11:02:17 2008 -0700 Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubi-2.6 * 'linux-next' of git://git.infradead.org/~dedekind/ubi-2.6: (22 commits) UBI: always start the background thread UBI: fix gcc warning UBI: remove pre-sqnum images support UBI: fix kernel-doc errors and warnings UBI: fix checkpatch.pl errors and warnings UBI: bugfix - do not torture PEB needlessly UBI: rework scrubbing messages UBI: implement multiple volumes rename UBI: fix and re-work debugging stuff UBI: amend commentaries UBI: fix error message UBI: improve mkvol request validation UBI: add ubi_sync() interface UBI: fix 64-bit calculations UBI: fix LEB locking UBI: fix memory leak on error path UBI: do not forget to free internal volumes UBI: fix memory leak UBI: avoid unnecessary division operations UBI: fix buffer padding ... commit 93082f0b15841b8926c38ef224d0e6f720000635 Author: Linus Torvalds Date: Fri Jul 25 10:56:36 2008 -0700 Fix ahci driver 'flags' type The new type checking of the flags arguments to irqsave and friends (commit 3f307891ce0e7b0438c432af1aacd656a092ff45) pointed out this thing with a big nice warning. Signed-off-by: Linus Torvalds commit f87bd330edf06fd49b3fbc368d90fb180375f2a2 Author: Dave Jiang Date: Fri Jul 25 01:49:14 2008 -0700 edac: mpc85xx fix pci ofdev 2nd pass Convert PCI err device from platform to open firmware of_dev to comply with powerpc schemes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Jiang Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit fcb19171d196172a4f57e056f7a60e6d1e2e8c85 Author: Dave Jiang Date: Fri Jul 25 01:49:14 2008 -0700 edac: mv64x60 add pci fixup Fixup of missing bit 0 on 64360 PCIx_ERR_MASK and errata FEr-#11 and FEr-#16 for the 64460. Bit 0 must remain 0. 
Signed-off-by: Dave Jiang Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 596d3941035d4d4b484c820f10f57fd4816c6615 Author: Dave Jiang Date: Fri Jul 25 01:49:13 2008 -0700 edac: mv64x60 fix get_property Update get_property() call to use of_get_property() in order to fix compile Signed-off-by: Dave Jiang Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 10d33e9c36827e5371479e55ef4089e000af2638 Author: Doug Thompson Date: Fri Jul 25 01:49:12 2008 -0700 edac: e752x fix too loud on nonmemory errors This module harvests more than just memory errors, it also harvests various bus and dma errors that the Chipset detects. Previously, it would report all such errors, which would cause output to be TOO loud. This patches therefore adds a parameter which is used to turn off NON-MEMORY error reports by default. Or the reporting can be enabled via the parameter Also did code style cleanup: less than 80 characters per line rule Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 124682c78563e10ba8b2ecd21b0f1098903b7808 Author: Arthur Jones Date: Fri Jul 25 01:49:12 2008 -0700 edac: core fix added newline to sysfs dimm labels The channel DIMM label does not seem to be used much in the edac code. However, where it is used (in the core code), it is assumed to not have a newline embedded. This leaves the sysfs file newline free which looks funny when cat'ing it. Here we just add the trailing newline to the sysfs chX_dimm_label output... [Doug Thompson note: the DIMM label is one of the primary uses of EDAC. User space daemon scripts, edac-utils@sourceforge, populate the DIMM label fields, via /sys/devices/system/edac attributes, with the silk screen labels of the motherboard in use. dmidecode access BIOS tables, but BIOS tables are well known to be incorrect and useless in these respects. 
edac-utils will strip off any newlines before its use of the output, when displaying DIMM slot silk screen labels. Signed-off-by: Arthur Jones Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f9fc82adca43d38a1b79128d80750bd361e15abe Author: Arthur Jones Date: Fri Jul 25 01:49:11 2008 -0700 edac: core fix static to dynamic kset Static kobjects and ksets are not supported in Linux kernel. Convert the mc_kset from static to dynamic. This patch depends on my previous patch to remove the module parameter attributes from mc... Signed-off-by: Arthur Jones Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 327dafb1c61c9da7b95ac6cc7634a2340cc9509c Author: Arthur Jones Date: Fri Jul 25 01:49:10 2008 -0700 edac: core fix redundant sysfs controls to parameters /sys/devices/system/edac/mc has a few files which are duplicated in /sys/module/edac_core/parameters. Now that all the functionality is duplicated between these two locations, we remove the former kobject attributes and update the documentation. Signed-off-by: Arthur Jones Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 096846e2b0ef39cb7c348f837f06984ef6ba8aa7 Author: Arthur Jones Date: Fri Jul 25 01:49:09 2008 -0700 edac: core fix workq timer When updating the edac_mc_poll_msec module parameter from the sysfs /sys/module/edac_core/parameters/edac_mc_poll_msec file, we don't update the workq timers. So that, if we move from a big poll time to a small one, the small one won't take effect until the big one has timed out. Here we provide a new module parameter set method to call out to the update routine. This brings the /sys/module/edac_core/parameters functionality up to that provided by the /sys/drivers/system/edac/mc sysfs module parameter files so that we can remove them or at least link to the /sys/module files... 
Signed-off-by: Arthur Jones Signed-off-by: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 14cc571bb1d072d3f4be2875ea520ab03e093471 Author: Arthur Jones Date: Fri Jul 25 01:49:08 2008 -0700 edac: core fix to use dynamic kobject Static kobjects are not supported in linux kernel. Convert the edac_pci_top_main_kobj from static to dynamic. This avoids the double free of the edac_pci_top_main_kobj.name that we see on module reload of the e752x edac driver (and probably others as well). In addition Greg KH has pointed out that this code may be cleaned up significantly. I will look at that as a follow-on patch, for now, I just want the minimum fix to get this double-free oops bug squashed... Many thanks to Greg KH for his patience in showing me what the Documentation/kobject.txt already said (oops)... Signed-off-by: Arthur Jones Signed-off-by: Doug Thompson Acked-by: Greg Kroah-Hartman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit b238e57723a6fb2c365fc35de5d7c48ccf9300cd Author: Arthur Jones Date: Fri Jul 25 01:49:08 2008 -0700 edac: i5100: cleanup Some code cleanliness issues found by Andrew Morton (thanks!) which should not affect functionality, but which should help make the code more maintainable. In particular, we now: * convert all #define's w/ a parameter to static inlines * use 1UL rather than 1ULL when calculating an unsigned long * use pci_disable_device The resulting code is tested and seems to work fine... Signed-off-by: Arthur Jones Cc: Doug Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 178d5a742291976d13bff55fa2b130879d4510de Author: Arthur Jones Date: Fri Jul 25 01:49:06 2008 -0700 edac: i5100 fix unmask ecc bits Explicitly unmask ECC errors we are interested in reporting. 
    Signed-off-by: Arthur Jones
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 43920a598f9358a12eb59eeddc4cd950f03aea8c
Author: Arthur Jones
Date:   Fri Jul 25 01:49:06 2008 -0700

    edac: i5100 fix enable ecc hardware

    It is possible that the BIOS did not enable ECC at boot time. We check
    for that case and fail to load if it is true.

    Signed-off-by: Arthur Jones
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit f7952ffcffa88c9a3fa92c26081f4ec9143c680f
Author: Arthur Jones
Date:   Fri Jul 25 01:49:05 2008 -0700

    edac: i5100 fix missing bits

    The error mask we use to trigger ECC notifications is missing many bits
    of interest. We add these bits here so that all possible ECC errors can
    be reported.

    Signed-off-by: Arthur Jones
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 8f421c595a9145959d8aab09172743132abdffdb
Author: Arthur Jones
Date:   Fri Jul 25 01:49:04 2008 -0700

    edac: i5100 new intel chipset driver

    Preliminary support for the Intel 5100 MCH. CE and UE errors are
    reported along with the current DIMM label information and other memory
    parameters.

    Reasons why this is preliminary:

    1) This chip has 2 independent memory controllers which, for best
       performance, use interleaved accesses to the DDR2 memory. This
       architecture does not map very well to the current edac data
       structures which depend on symmetric channel access to the
       interleaved data. Without core changes, the best I could do for now
       is to map both memory controllers to different csrows (first all
       ranks of controller 0, then all ranks of controller 1). Someone much
       more familiar with the edac core than I will probably need to come
       up with a more general data structure to handle the interleaving and
       de-interleaving of the two memory controllers.

    2) I have not yet tackled the de-interleaving of the rank/controller
       address space into the physical address space of the CPU. There is
       nothing fundamentally missing, it is just ending up being a lot of
       code, and I'd rather keep it separate for now, esp since it doesn't
       work yet...

    3) The code depends on a particular i5100 chip select to DIMM mainboard
       chip select mapping. This mapping seems obvious to me in order to
       support dual and single ranked memory, but it is not unique and DIMM
       labels could be wrong on other mainboards. There is no way to query
       this mapping that I know of.

    4) The code requires that the i5100 is in 32GB mode. Only 4 ranks per
       controller, 2 ranks per DIMM are supported. I do not have hardware
       (nor do I expect to have hardware anytime soon) for the 48GB
       (6 ranks per controller) mode.

    5) The serial presence detect code should be broken out into a "real"
       i2c driver so that decode-dimms.pl can work.

    Signed-off-by: Arthur Jones
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 48e90761b570ff57f58b726229d229729949c5bb
Author: Miklos Szeredi
Date:   Fri Jul 25 01:49:02 2008 -0700

    fuse: lockd support

    If a fuse filesystem doesn't define its own lock operations, then allow
    the lock manager to work with fuse.

    Adding lockd support for remote locking is also possible, but more
    rarely used, so leave it till later.

    Signed-off-by: Miklos Szeredi
    Cc: "J. Bruce Fields"
    Cc: Trond Myklebust
    Cc: Matthew Wilcox
    Cc: David Teigland
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 33670fa296860283f04a7975b8c790f101e43a6e
Author: Miklos Szeredi
Date:   Fri Jul 25 01:49:02 2008 -0700

    fuse: nfs export special lookups

    Implement the get_parent export operation by sending a LOOKUP request
    with ".." as the name.

    Implement looking up an inode by node ID after it has been evicted from
    the cache. This is done by sending a LOOKUP request with "." as the
    name (for all file types, not just directories).

    The filesystem can set the FUSE_EXPORT_SUPPORT flag in the INIT reply,
    to indicate that it supports these special lookups.
Thanks to John Muir for the original implementation of this feature. Signed-off-by: Miklos Szeredi Cc: "J. Bruce Fields" Cc: Trond Myklebust Cc: Matthew Wilcox Cc: David Teigland Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit c180eebe1390c2076ead6a9bc95a02efb994edb7 Author: Miklos Szeredi Date: Fri Jul 25 01:49:01 2008 -0700 fuse: add fuse_lookup_name() helper Add a new helper function which sends a LOOKUP request with the supplied name. This will be used by the next patch to send special LOOKUP requests with "." and ".." as the name. Signed-off-by: Miklos Szeredi Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit dbd561d236ff16f8143bc727d91758ddd190e8cb Author: Miklos Szeredi Date: Fri Jul 25 01:49:00 2008 -0700 fuse: add export operations Implement export_operations, to allow fuse filesystems to be exported to NFS. This feature has been in the out-of-tree fuse module, and is widely used and tested. It has not been originally merged into mainline, because doing the NFS export in userspace was thought to be a cleaner and more efficient way of doing it, than through the kernel. While that is true, it would also have involved a lot of duplicated effort at reimplementing NFS exporting (all the different versions of the protocol). This effort was unfortunately not undertaken by anyone, so we are left with doing it the easy but less efficient way. If this feature goes in, the out-of-tree fuse module can go away, which would have several advantages: - not having to maintain two versions - less confusion for users - no bugs due to kernel API changes Comment from hch: - Use the same fh_type values as XFS, since we use the same fh encoding. 
Signed-off-by: Miklos Szeredi Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 0de6256daafa3a97a269995e9b29f956bd419bbf Author: Miklos Szeredi Date: Fri Jul 25 01:48:59 2008 -0700 fuse: prepare lookup for nfs export Use d_splice_alias() instead of d_add() in fuse lookup code, to allow NFS exporting. Signed-off-by: Miklos Szeredi Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 764c76b371722e0cba5c24d91225f0f954b69d44 Author: Miklos Szeredi Date: Fri Jul 25 01:48:58 2008 -0700 locks: allow ->lock() to return FILE_LOCK_DEFERRED Allow filesystem's ->lock() method to call posix_lock_file() instead of posix_lock_file_wait(), and return FILE_LOCK_DEFERRED. This makes it possible to implement a such a ->lock() function, that works with the lock manager, which needs the call to be asynchronous. Now the vfs_lock_file() helper can be used, so this is a cleanup as well. Signed-off-by: Miklos Szeredi Cc: "J. Bruce Fields" Cc: Trond Myklebust Cc: Matthew Wilcox Cc: David Teigland Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit b648a6de00770cc325c22f43bdd4e935f6a2ee55 Author: Miklos Szeredi Date: Fri Jul 25 01:48:57 2008 -0700 locks: cleanup code duplication Extract common code into a function. Signed-off-by: Miklos Szeredi Cc: "J. Bruce Fields" Cc: Trond Myklebust Cc: Matthew Wilcox Cc: David Teigland Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit bde74e4bc64415b142e556a34d295a52a1b7da9d Author: Miklos Szeredi Date: Fri Jul 25 01:48:57 2008 -0700 locks: add special return value for asynchronous locks Use a special error value FILE_LOCK_DEFERRED to mean that a locking operation returned asynchronously. 
    This is returned by:

    - posix_lock_file() for sleeping locks, to mean that the lock has been
      queued on the block list, and will be woken up when it might become
      available and needs to be retried (either fl_lmops->fl_notify() is
      called or fl_wait is woken up).

    - f_op->lock(), to mean either the above, or that the filesystem will
      call back with fl_lmops->fl_grant() when the result of the locking
      operation is known. The filesystem can do this for sleeping as well
      as non-sleeping locks.

    This makes sure that return values of -EAGAIN and -EINPROGRESS from
    filesystems are not mistaken for asynchronous locking.

    This also makes error handling in fs/locks.c and lockd/svclock.c
    slightly cleaner.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Matthew Wilcox
    Cc: David Teigland
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit cc77b1521d06be07c9bb1a4a3e1f775dcaa15093
Author: Miklos Szeredi
Date:   Fri Jul 25 01:48:55 2008 -0700

    lockd: dont return EAGAIN for a permanent error

    Fix nlm_fopen() to return NLM_FAILED (or NLM_LCK_DENIED_NOLOCKS) instead
    of NLM_LCK_DENIED. The latter means the lock request failed because of
    a conflicting lock (i.e. a temporary error), which is wrong in this
    case.

    Also fix the client to return ENOLCK instead of EAGAIN if a blocking
    lock request returns with NLM_LCK_DENIED.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: "J.
Bruce Fields" Cc: Matthew Wilcox Cc: David Teigland Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit b81f3ea92ba1fa676775677679889dc2a7f03c8b Author: Vegard Nossum Date: Fri Jul 25 01:48:55 2008 -0700 taskstats: remove initialization of static per-cpu variable Cc: Shailabh Nagar Signed-off-by: Vegard Nossum Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 9b0975a20af1ff2f367e3b6b7c150eb114c6b500 Author: Keika Kobayashi Date: Fri Jul 25 01:48:54 2008 -0700 per-task-delay-accounting: update document and getdelays.c for memory reclaim Update document and make getdelays.c show delay accounting for memory reclaim. For making a distinction between "swapping in pages" and "memory reclaim" in getdelays.c, MEM is changed to SWAP. Signed-off-by: Keika Kobayashi Acked-by: Balbir Singh Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 016ae219b920c4e606088761d3d6070cdf8ba706 Author: Keika Kobayashi Date: Fri Jul 25 01:48:53 2008 -0700 per-task-delay-accounting: update taskstats for memory reclaim delay Add members for memory reclaim delay to taskstats, and accumulate them in __delayacct_add_tsk() . Signed-off-by: Keika Kobayashi Cc: Hiroshi Shimamoto Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 873b47717732c2f33a4b14de02571a4295a02f0c Author: Keika Kobayashi Date: Fri Jul 25 01:48:52 2008 -0700 per-task-delay-accounting: add memory reclaim delay Sometimes, application responses become bad under heavy memory load. Applications take a bit time to reclaim memory. The statistics, how long memory reclaim takes, will be useful to measure memory usage. This patch adds accounting memory reclaim to per-task-delay-accounting for accounting the time of do_try_to_free_pages(). - When System is under low memory load, memory reclaim may not occur. 
      $ free
                   total       used       free     shared    buffers     cached
      Mem:       8197800    1577300    6620500          0       4808    1516724
      -/+ buffers/cache:      55768    8142032
      Swap:     16386292          0   16386292

      $ vmstat 1
      procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
       r  b    swpd    free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
       0  0       0 5069748  10612 3014060    0    0     0     0    3   26  0  0 100  0
       0  0       0 5069748  10612 3014060    0    0     0     0    4   22  0  0 100  0
       0  0       0 5069748  10612 3014060    0    0     0     0    3   18  0  0 100  0

      Measure the time of tar command.

      $ ls -s test.dat
      1501472 test.dat

      $ time tar cvf test.tar test.dat
      real    0m13.388s
      user    0m0.116s
      sys     0m5.304s

      $ ./delayget -d -p
      CPU             count     real total  virtual total    delay total
                        428     5528345500     5477116080       62749891
      IO              count    delay total
                        338     8078977189
      SWAP            count    delay total
                          0              0
      RECLAIM         count    delay total
                          0              0

    - When the system is under heavy memory load, memory reclaim may occur.

      $ vmstat 1
      procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
       r  b    swpd    free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
       0  0 7159032   49724   1812    3012    0    0     0     0    3   24  0  0 100  0
       0  0 7159032   49724   1812    3012    0    0     0     0    4   24  0  0 100  0
       0  0 7159032   49848   1812    3012    0    0     0     0    3   22  0  0 100  0

      In this case, one process uses more than 8G of memory by executing
      malloc() and memset().

      $ time tar cvf test.tar test.dat
      real    1m38.563s        <- increased by 85 sec
      user    0m0.140s
      sys     0m7.060s

      $ ./delayget -d -p
      CPU             count     real total  virtual total    delay total
                       9021     7140446250     7315277975      923201824
      IO              count    delay total
                       8965    90466349669
      SWAP            count    delay total
                          3       21036367
      RECLAIM         count    delay total
                        740    61011951153

    In the latter case, the value of RECLAIM increases. So taskstats can
    show how much memory reclaim influences TAT.
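The figures from the two runs above can be cross-checked with a few lines of arithmetic. This is only a sketch and not part of the patch: the numbers are re-entered by hand from the output shown, the variable names are made up for illustration, and it assumes the delay totals are in nanoseconds, which is how the taskstats interface documents them.

```python
# Figures copied by hand from the two tar runs above.
# Assumption: taskstats delay totals are reported in nanoseconds.
NS_PER_SEC = 1_000_000_000


def to_seconds(minutes, secs):
    """Convert a `time`-style "XmY.ZZZs" reading to seconds."""
    return minutes * 60 + secs


real_low_load = to_seconds(0, 13.388)    # tar under low memory load
real_high_load = to_seconds(1, 38.563)   # tar under heavy memory load

reclaim_count = 740
reclaim_delay_ns = 61_011_951_153
io_delay_ns = 90_466_349_669

slowdown_s = real_high_load - real_low_load
reclaim_s = reclaim_delay_ns / NS_PER_SEC
io_s = io_delay_ns / NS_PER_SEC
avg_reclaim_ms = reclaim_delay_ns / reclaim_count / 1_000_000

print(f"slowdown:            {slowdown_s:.3f} s")
print(f"reclaim delay total: {reclaim_s:.3f} s")
print(f"avg delay / reclaim: {avg_reclaim_ms:.1f} ms")
print(f"IO + reclaim delay:  {io_s + reclaim_s:.3f} s")
```

Note that the IO and RECLAIM delay totals added together exceed the wall-clock time of the second run, so the categories evidently overlap; the reclaim total is best read as an indicator of how much reclaim contributed, not as an exact share of the slowdown.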
    Signed-off-by: Keika Kobayashi
    Acked-by: Balbir Singh
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 3e85ba034deec351f02cb55ff225bbd616463841
Author: David Howells
Date:   Fri Jul 25 01:48:50 2008 -0700

    tsacct: fix bacct_add_tsk()'s use of do_div()

    Fix bacct_add_tsk()'s use of do_div() on an s64 by making ac_etime a u64
    instead and dividing that.

    Possibly this should be guarded lest the interval calculation turn up
    negative, but the possible negativity of the result of the division is
    cast away, and it shouldn't end up negative anyway.

    This was introduced by patch f3cef7a99469afc159fec3a61b42dc7ca5b6824f.

    Signed-off-by: David Howells
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 297c5d92634c809cef23d73e7b2556f2528ff7e2
Author: Andrea Righi
Date:   Fri Jul 25 01:48:49 2008 -0700

    task IO accounting: provide distinct tgid/tid I/O statistics

    Report per-thread I/O statistics in /proc/pid/task/tid/io and aggregate
    parent I/O statistics in /proc/pid/io. This approach follows the same
    model used to account per-process and per-thread CPU times.

    As a practical application, this makes it possible, for example, to
    quickly find the top I/O consumer when a process spawns many child
    threads that perform the actual I/O work, because the aggregated I/O
    statistics can always be found in /proc/pid/io.

    [ Oleg Nesterov points out that we should check that the task is still
      alive before we iterate over the threads, but also says that we can
      do that fixup on top of this later.
      - Linus ]

    Acked-by: Balbir Singh
    Signed-off-by: Andrea Righi
    Cc: Matt Heaton
    Cc: Shailabh Nagar
    Acked-by-with-comments: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 0c18d7a5df82524e634637c3aec24d4cba096442
Author: Pavel Emelyanov
Date:   Fri Jul 25 01:48:49 2008 -0700

    bsdacct: fix and add comments around acct_process()

    Fix the comment describing what this function is, and add one more
    about the absence of locking around the pid namespaces loop.

    Signed-off-by: Pavel Emelyanov
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 7d1e13505be8c2bd2207894f4e0f069e1f9b51c9
Author: Pavel Emelyanov
Date:   Fri Jul 25 01:48:48 2008 -0700

    bsdacct: account dying tasks in all relevant namespaces

    This just makes acct_process walk the pid namespaces from current up to
    the top and account a task in each with the accounting turned on.

    ns->parent access is safe lockless, since current is still alive and
    holds its namespace, which in turn holds its parent.

    Signed-off-by: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit b5a7174875ea570cc675f2c503e800db8efdd6a7
Author: Pavel Emelyanov
Date:   Fri Jul 25 01:48:47 2008 -0700

    bsdacct: turn acct off for all pidns-s on umount time

    All the bsd_acct_structs with opened accounting are linked into a
    global list. So acct_auto_close(_mnt) walks this list and drops the
    accounting for each.

    Signed-off-by: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 0b6b030fc30d169bb406b34b4fc60d99dde4a9c6
Author: Pavel Emelyanov
Date:   Fri Jul 25 01:48:47 2008 -0700

    bsdacct: switch from global bsd_acct_struct instance to per-pidns one

    Allocate the structure on the first call to sys_acct(). After this,
    each namespace that ordered the accounting will live with this
    structure till its own death.
Two notes - routines, that close the accounting on fs umount time use the init_pid_ns's acct by now; - accounting routine accounts to dying task's namespace (also by now). Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 6248b1b342005a428b1247b4e89249da1528d88d Author: Pavel Emelyanov Date: Fri Jul 25 01:48:46 2008 -0700 bsdacct: make internal code work with passed bsd_acct_struct, not global This adds the appropriate pointer to all the internal (i.e. static) functions that work with global acct instance. API calls pass a global instance to them (while we still have such). Mostly this is a s/acct_globals./acct->/ over the file. Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit a75d97976517dcda69150fd81d6be86ae63324a1 Author: Pavel Emelyanov Date: Fri Jul 25 01:48:45 2008 -0700 bsdacct: turn the acct_lock from on-the-struct to global Don't use per-bsd-acct-struct lock, but work with a global one. This lock is taken for short periods, so it doesn't seem it'll become a bottleneck, but it will allow us to easily avoid many locking difficulties in the future. So this is a mostly s/acct_globals.lock/acct_lock/ over the file. Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit e59a04a7aa5ce2483470aee4f2eb79ba6b9afe8b Author: Pavel Emelyanov Date: Fri Jul 25 01:48:44 2008 -0700 bsdacct: make check timer accept a bsd_acct_struct argument We're going to have many bsd_acct_struct instances, not just one, so the timer (currently working with a global one) has to know which one to work with. Use a handy setup_timer macro for it (thanks to Oleg for one). Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. 
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 1c552858ac2b1732a99d234d46b98098baef41ff Author: Pavel Emelyanov Date: Fri Jul 25 01:48:44 2008 -0700 bsdacct: "truthify" a comment near acct_process The acct_process does not accept any arguments actually. Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 20fad13ac66ac001c19220d3d08b4de5b6cca6e1 Author: Pavel Emelyanov Date: Fri Jul 25 01:48:43 2008 -0700 pidns: add the struct bsd_acct_struct pointer on pid_namespace struct All the bsdacct-related info will be stored in the area, pointer by this one. It will be NULL automatically for all new namespaces. Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 84406c153a5bfa5d8b428a0933e9d39db6b59a75 Author: Pavel Emelyanov Date: Fri Jul 25 01:48:42 2008 -0700 pidns: use kzalloc when allocating new pid_namespace struct It makes many fields initialization implicit helping in auto-setting #ifdef-ed fields (bsd-acct related pointer will be such). Signed-off-by: Pavel Emelyanov Cc: Balbir Singh Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 081e4c8a75692c21f3a119a81ca3270081879d0e Author: Pavel Emelyanov Date: Fri Jul 25 01:48:42 2008 -0700 bsdacct: rename acct_gbls to bsd_acct_struct After I fixed access to task->tgid in kernel/acct.c, Oleg pointed out some bad side effects with this accounting vs pid namespaces interaction. I.e. when some task in pid namespace sets this accounting up, this blocks all the others from doing the same. Restricting this to init namespace only could help, but didn't look a graceful solution. So here is the approach to make this accounting work with pid namespaces properly. The idea is simple - when a task dies it accounts itself in each namespace it is visible from and which set the accounting up. 
    For example here are the commands run and the output of lastcomm from
    init and sub namespaces:

     init_ns# accton pacct
     sub_ns# accton pacct    (this is a different file - sub ns is run in
                              a chroot-ed environment)
     init_ns# cat /dev/null
     sub_ns# ls /dev/null
     init_ns# accton
     sub_ns# accton

     sub_ns# lastcomm -f pacct
     ls          0   [136,0]   0.00 secs Thu May 15 10:30
     accton      0   [136,0]   0.00 secs Thu May 15 10:30

     init_ns# lastcomm -f pacct
     accton   root   pts/0     0.00 secs Thu May 15 14:30   << got from sub
     cat      root   pts/1     0.00 secs Thu May 15 14:30
     ls       root   pts/0     0.00 secs Thu May 15 14:30   << got from sub
     accton   root   pts/1     0.00 secs Thu May 15 14:30

    That was the summary, the details are in patches.

    This patch:

    It will be visible in pid_namespace.h file, so fix its name to look
    better outside the acct.c file.

    Signed-off-by: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 49b5cf34727a6c1be1568ab28e89a2d9a6bf51e0
Author: Jonathan Lim
Date:   Fri Jul 25 01:48:40 2008 -0700

    accounting: account for user time when updating memory integrals

    Adapt acct_update_integrals() to include user time when calculating the
    time difference. The units of acct_rss_mem1 and acct_vm_mem1 are also
    changed from pages-jiffies to pages-usecs to avoid calling
    jiffies_to_usecs() in xacct_add_tsk() which might overflow.

    Signed-off-by: Jonathan Lim
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 7394f0f6c0baab650ea9194cb1be847df646fb57
Author: Adrian Bunk
Date:   Fri Jul 25 01:48:40 2008 -0700

    unexport uts_sem

    With the removal of the Solaris binary emulation the export of uts_sem
    became unused.

    Signed-off-by: Adrian Bunk
    Acked-by: David S.
    Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit a89cc1959d0ea5f36bf7421dc97b34f03809637d
Author: Harvey Harrison
Date:   Fri Jul 25 01:48:39 2008 -0700

    markers: fix sparse integer as NULL pointer warning

    kernel/trace/trace_sysprof.c:164:20: warning: Using plain integer as NULL pointer

    Signed-off-by: Harvey Harrison
    Cc: Mathieu Desnoyers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 28325df0d9339b7f3aba9c45174d4586223ef46b
Author: Mathieu Desnoyers
Date:   Fri Jul 25 01:48:38 2008 -0700

    markers: use rcu_barrier_sched() and call_rcu_sched()

    rcu_barrier_sched() and call_rcu_sched() were introduced in 2.6.26 for
    the Markers. Change the marker code to use them. It can be seen as a
    fix since the marker code was using an ugly, temporary, #ifdef hack to
    work around CONFIG_PREEMPT_RCU.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul McKenney
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit 24879a8e3e68f146d4d85528cc0b5dea712b77c5
Author: Matthias Kaehlcke
Date:   Fri Jul 25 01:48:38 2008 -0700

    aoe: convert emsgs_sema into a completion

    ATA over Ethernet: The semaphore emsgs_sema is used for signalling an
    event, so convert it into a completion.

    Signed-off-by: Matthias Kaehlcke
    Cc: "Ed L. Cashin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit dbda0de52618d13d1b927c7ba7bb839cfddc4e8c
Author: Pavel Emelyanov
Date:   Fri Jul 25 01:48:37 2008 -0700

    pidns: remove find_task_by_pid, unused for a long time

    It seems to me that it was a mistake marking this function as
    deprecated and scheduling it for removal, rather than resolutely
    removing it after the last caller's death.

    Anyway - better late than never.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

commit e49859e71e0318b564de1546bdc30fab738f9deb
Author: Pavel Emelyanov
Date:   Fri Jul 25 01:48:36 2008 -0700

    pidns: remove now unused find_pid function.
This one had only one user so far - kill_proc, which is removed, so drop this (invalid in namespaced world) call too. And of course - erase all references to it from comments. Signed-off-by: Pavel Emelyanov Cc: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 19b0cfcca41dd772065671ad0584e1cea0f3fd13 Author: Pavel Emelyanov Date: Fri Jul 25 01:48:35 2008 -0700 pidns: remove now unused kill_proc function This function operated on a pid_t to kill a task, which is no longer valid in a containerized system. It has finally lost all its users and we can safely remove it from the tree. Signed-off-by: Pavel Emelyanov Cc: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 33166b1ffca5e1945246bcaa77d72a22b0d3e531 Author: Richard Kennedy Date: Fri Jul 25 01:48:35 2008 -0700 shrink struct pid by removing padding on 64 bit builds When struct pid is built on a 64 bit platform gcc has to insert padding to maintain the correct alignment. By simply reordering its members the memory usage shrinks from 88 bytes to 80. I've successfully run with this patch on my desktop AMD64 machine. There are no significant kernel size changes to a default config.X86_64 on the latest git v2.6.26-rc1
   text    data     bss     dec    hex filename
5404828  976760  734280 7115868 6c945c vmlinux
5404811  976760  734280 7115851 6c944b vmlinux.pid-patch
Acked-by: "Eric W.
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 3ae4eed34be0177a8e003411a84e4ee212adbced Author: Adrian Bunk Date: Fri Jul 25 01:48:34 2008 -0700 proper pid{hash,map}_init() prototypes This patch adds proper prototypes for pid{hash,map}_init() in include/linux/pid_namespace.h Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 4ecb90090c84210a8bd2a9d7a5906e616735873c Author: Stephen Hemminger Date: Fri Jul 25 01:48:32 2008 -0700 sysctl: allow override of /proc/sys/net with CAP_NET_ADMIN Extend the permission check for networking sysctl's to allow modification when current process has CAP_NET_ADMIN capability and is not root. This version uses the until now unused permissions hook to override the mode value for /proc/sys/net if accessed by a user with capabilities. Found while working with Quagga. It is impossible to turn forwarding on/off through the command interface because Quagga uses the secure coding practice of dropping privileges during initialization and only raising them via capabilities when necessary. Since the daemon has reset real/effective uid after initialization, all attempts to access /proc/sys/net variables will fail. Signed-off-by: Stephen Hemminger Acked-by: "Eric W. Biederman" Cc: Chris Wright Cc: Alexey Dobriyan Cc: Andrew Morgan Cc: Pavel Emelyanov Cc: "David S. Miller" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 99541c23cd32bacf1a591ca537a7c0cb9053ad7e Author: Alexey Dobriyan Date: Fri Jul 25 01:48:31 2008 -0700 sysctl: check for bogus modes Catch, e.g., 644/0644 typo. Signed-off-by: Alexey Dobriyan Acked-by: "Eric W.
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 339caf2a224fc9af0f01686bf287dda32c6efca6 Author: David Sterba Date: Fri Jul 25 01:48:31 2008 -0700 proc: misplaced export of find_get_pid Move EXPORT_SYMBOL right after the func Signed-off-by: David Sterba Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 6eedf8d30d2b48e86fbcee1a32fb2fa5f42219ee Author: Alexey Dobriyan Date: Fri Jul 25 01:48:30 2008 -0700 proc: move Kconfig to fs/proc/Kconfig Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit a9bd4a3e070ba7494f154e1a11687a8a957d88dc Author: Alexey Dobriyan Date: Fri Jul 25 01:48:30 2008 -0700 proc: remove pathetic remount code MS_RMT_MASK will unmask changes in do_remount_sb() anyway. Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 881adb85358309ea9c6f707394002719982ec607 Author: Alexey Dobriyan Date: Fri Jul 25 01:48:29 2008 -0700 proc: always do ->release The current two-stage scheme of removing PDE emphasizes one bug in proc:
    open
    rmmod
    remove_proc_entry
    close
->release won't be called because ->proc_fops were cleared. In simple cases it's a small memory leak. For every ->open, ->release has to be done. A list of openers is introduced which is traversed at remove_proc_entry() if needed. Discussions with Al long ago (sigh). Signed-off-by: Alexey Dobriyan Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 6e644c3126149b65460610fe5a00d8a162092abe Author: Adrian Bunk Date: Fri Jul 25 01:48:28 2008 -0700 move proc_kmsg_operations to fs/proc/internal.h This patch moves the extern of struct proc_kmsg_operations to fs/proc/internal.h and adds an #include "internal.h" to fs/proc/kmsg.c so that the latter sees the former.
Signed-off-by: Adrian Bunk Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit cd9a6f1078ed07fe919667b73e829f3bac485573 Author: Adrian Bunk Date: Fri Jul 25 01:48:28 2008 -0700 unexport proc_clear_tty With the removal of the Solaris binary emulation the export of proc_clear_tty became unused. Signed-off-by: Adrian Bunk Acked-by: David S. Miller Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 25377479de7539fdc871a0f0ecaa39da42353bbc Author: Akinobu Mita Date: Fri Jul 25 01:48:27 2008 -0700 dell_rbu: use memory_read_from_buffer() Signed-off-by: Akinobu Mita Cc: Abhay Salunke Cc: Zhang Rui Cc: Matt Domsch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit d991696263a704be7f41ac186f1a0ed17963c260 Author: Thomas Gleixner Date: Fri Jul 25 01:48:26 2008 -0700 fs/partitions/efi: convert to pr_debug Convert the local Dprintk() compile time debug printk wrappers to the generic pr_debug() wrapper. Signed-off-by: Thomas Gleixner Cc: Matt Domsch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 04ebd4aee52b06a2c38127d9208546e5b96f3a19 Author: Abdel Benamrouche Date: Fri Jul 25 01:48:26 2008 -0700 block/ioctl.c and fs/partition/check.c: check value returned by add_partition() Now that add_partition() has been taught to propagate errors, let's check them.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Abdel Benamrouche Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit d805dda412346225a50af2d399d958a4bc676c38 Author: Abdel Benamrouche Date: Fri Jul 25 01:48:25 2008 -0700 fs/partition/check.c: fix return value warning fs/partitions/check.c:381: warning: ignoring return value of 'device_add', declared with attribute warn_unused_result [akpm@linux-foundation.org: multiple-return-statements-per-function are evil] Signed-off-by: Abdel Benamrouche Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit abe19b7b822a8fdbe3dbfd6e066d0698b4eefb06 Author: Akinobu Mita Date: Fri Jul 25 01:48:24 2008 -0700 dcdbas: use memory_read_from_buffer() Signed-off-by: Akinobu Mita Cc: Doug Warzecha Cc: Zhang Rui Cc: Matt Domsch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f37e66173e0cc09b4e5a89eb0294abbefc15f435 Author: Akinobu Mita Date: Fri Jul 25 01:48:23 2008 -0700 firmware: use memory_read_from_buffer() Signed-off-by: Akinobu Mita Cc: Greg Kroah-Hartman Cc: Markus Rechberger Cc: Kay Sievers Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit ec905a18656daa4d9300bad2bebc02d5dba7883d Author: Jiri Slaby Date: Fri Jul 25 01:48:23 2008 -0700 drivers/misc/phantom: note PCI Tell users that the driver is only for PCI devices to stop asking for support of firewire and parallel devices.
Signed-off-by: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit ace7dd96695769f9d76980c7e52139e73228221c Author: Jiri Slaby Date: Fri Jul 25 01:48:22 2008 -0700 Char: mxser, various cleanups - remove unused macro - some whitespace cleanup - useless debug prints removal Signed-off-by: Jiri Slaby Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 1df0092477b8b2df605812e298624f5c35bb4805 Author: Jiri Slaby Date: Fri Jul 25 01:48:22 2008 -0700 Char: mxser, remove predefined isa support Remove support for ISA addresses predefined at compile time. It is unused (filled by zeroes) and prolongs the code. Don't initialize global array and add `ioaddr' module param description. Signed-off-by: Jiri Slaby Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 83766bc63f7e49b0215811026e7802bd09a9c7e1 Author: Jiri Slaby Date: Fri Jul 25 01:48:21 2008 -0700 Char: mxser, prints cleanup - use dev_* for printing in pci probe function - move ISA prints directly into isa find function, do not postpone it. Remove macros bound to it then. - prepend some prints by "mxser: " to know what it belongs to Signed-off-by: Jiri Slaby Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 729f0edbecd0c59c82ee9bf92009acc7e984c425 Author: Jiri Slaby Date: Fri Jul 25 01:48:20 2008 -0700 Char: mxser, update documentation Update Documentation/moxa-smartio to the later document from the mxser package. Signed-off-by: Jiri Slaby Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 72800df9ba3199df02a95b3830c49fbf16ec4a6d Author: Jiri Slaby Date: Fri Jul 25 01:48:20 2008 -0700 Char: mxser, globals cleanup - remove unused mxvar_diagflag - move mxser_msr into the only user/function - GMStatus, hmm, fix race-prone access to it. We need only one instance for real, not MXSER_PORTS. Move it to MOXA_GETMSTATUS ioctl.
- mxser_mon_ext, almost the same, but alloc it on heap, since it has more than 2 kilos. - fix indexing, `i' is not the index value, `i * MXSER_PORTS_PER_BOARD + j' is Signed-off-by: Jiri Slaby Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 41aee9a121fd0c31ae22dfe57e8f9ee9d6d85c25 Author: Jiri Slaby Date: Fri Jul 25 01:48:19 2008 -0700 Char: mxser, ioctl cleanup - remove break ctl from ioctl handler, it's never reached, since tty_ops->break_ctl is defined (mxser break handling is done in software) - mark MOXA_GET_MAJOR as deprecated - fix TIOCGICOUNT (some retval non-checks of put_user). Use copy_to_user to whole structure instead. Signed-off-by: Jiri Slaby Acked-by: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 6ee8928d94841aa764aeaf645ad16daff811dc26 Author: Akinobu Mita Date: Fri Jul 25 01:48:18 2008 -0700 nwflash: use simple_read_from_buffer() Signed-off-by: Akinobu Mita Cc: Russell King Cc: Tim Schmielau Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 236b8756a2b6f90498d45b2c36d43e5372f2d4b8 Author: Alan Cox Date: Fri Jul 25 01:48:17 2008 -0700 dsp56k: BKL pushdown Push the BKL down into the driver ioctl methods Signed-off-by: Alan Cox Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit b8e35919653d76e7dceb8d3b8569c4ec1004d546 Author: Alan Cox Date: Fri Jul 25 01:48:17 2008 -0700 ds1302: push down the BKL into the driver ioctl code Signed-off-by: Alan Cox Cc: Jiri Kosina Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 6d535d3e6ad395345750c361bd2b7f1b9429455d Author: Alan Cox Date: Fri Jul 25 01:48:16 2008 -0700 ppdev: wrap ioctl handler in driver and push lock down Signed-off-by: Alan Cox Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit e05e9f7c4aeb82eaa23e46b29580ff514590c641 Author: Alan Cox Date: Fri Jul 25 01:48:16 2008 -0700 ixj: push BKL into driver and wrap ioctls Signed-off-by: Alan Cox Cc: 
Nishanth Aravamudan Cc: Domen Puncer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 11af7478addd34c42999b3b84095903ed9e67038 Author: Alan Cox Date: Fri Jul 25 01:48:15 2008 -0700 sx: push BKL down into the firmware ioctl handler Also fix the capability checking for firmware load. Signed-off-by: Alan Cox Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f6759fdcfd79ff1827fd5d4ddfe876164466d30d Author: Alan Cox Date: Fri Jul 25 01:48:14 2008 -0700 rio: push down the BKL into the firmware ioctl handler TTY side is already done. Signed-off-by: Alan Cox Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 909d145f0decbc4f17955e1fc4122a669a51fbc0 Author: Alan Cox Date: Fri Jul 25 01:48:14 2008 -0700 mwave: ioctl BKL pushdown Push the BKL down to the point it wraps the actual mwave method handlers Signed-off-by: Alan Cox Cc: Eric Sesterhenn Cc: Yani Ioannou Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 47be36a24defbd19aea1354c416ec99f291c7ab8 Author: Alan Cox Date: Fri Jul 25 01:48:13 2008 -0700 ip2: push BKL down for the firmware interface (The tty side is already done) Signed-off-by: Alan Cox Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 76528a42e2c5199a1208909318a9c9948d25d0b7 Author: Alan Cox Date: Fri Jul 25 01:48:12 2008 -0700 efirtc: push down the BKL Push it down as far as the EFI method calls. Someone who knows EFI can do the other bits. Also fix another wrong unknown ioctl return. Signed-off-by: Alan Cox Cc: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 372572e9b1dcc5e36091199be63766d13e5a8ae0 Author: Adrian Bunk Date: Fri Jul 25 01:48:11 2008 -0700 #if 0 hpet_unregister() This patch #if 0's the unused hpet_unregister(). 
Signed-off-by: Adrian Bunk Acked-by: Clemens Ladisch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 8d1e120f695e9bcf01585e052577dc1e099033f9 Author: Adrian Bunk Date: Fri Jul 25 01:48:11 2008 -0700 proper extern for mwave_s_mdd This patch adds a proper extern for mwave_s_mdd in drivers/char/mwave/mwavedd.h Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 79885b227740b9c7d3057f2de556f4098d37cc8f Author: Edgar E. Iglesias Date: Fri Jul 25 01:48:10 2008 -0700 elf: use ELF_CORE_EFLAGS for kcore ELF header flags ELF_CORE_EFLAGS is already used by the binfmt_elf coredumper to set correct arch specific ELF header flags on coredumps. Use it for kcore dumps as well. At the moment, this affects the CRIS and the H8300 arch. Signed-off-by: Edgar E. Iglesias Cc: Mikael Starvik Cc: Yoshinori Sato Cc: Ralf Baechle Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 7833351b5260b3a58b54a0c2e7065001d986d749 Author: Adrian Bunk Date: Fri Jul 25 01:48:09 2008 -0700 pty: remove unused UNIX98_PTY_COUNT options The h8300 and sparc options somehow survived when the code stopped using CONFIG_UNIX98_PTY_COUNT. Reviewed-by: Robert P. J. Day Signed-off-by: Adrian Bunk Cc: Yoshinori Sato Cc: "David S. Miller" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 9eefe520c814f6f62c5d36a2ddcd3fb99dfdb30e Author: Nadia Derbey Date: Fri Jul 25 01:48:08 2008 -0700 ipc: do not use a negative value to re-enable msgmni automatic recomputing This patch proposes an alternative to the "magical positive-versus-negative number trick" Andrew complained about last week in http://lkml.org/lkml/2008/6/24/418. This had been introduced with the patches that scale msgmni to the amount of lowmem. With these patches, msgmni has a registered notification routine that recomputes msgmni value upon memory add/remove or ipc namespace creation/ removal. When msgmni is changed from user space (i.e. 
value written to the proc file), that notification routine is unregistered, and the only way to register it again is to write a negative value into the proc file. This is the "magical positive-versus-negative number trick". To fix this, a new proc file is introduced: /proc/sys/kernel/auto_msgmni. This file acts as an ON/OFF switch for msgmni automatic recomputing. With this patch, the process is the following:
1) kernel boots in "automatic recomputing mode"
   /proc/sys/kernel/msgmni contains the value that has been computed (depends on lowmem)
   /proc/sys/kernel/auto_msgmni contains "1"
2) echo VALUE > /proc/sys/kernel/msgmni
   . sets msg_ctlmni to VALUE
   . de-activates automatic recomputing (i.e. if, say, some memory is added msgmni won't be recomputed anymore)
   . /proc/sys/kernel/auto_msgmni now contains "0"
3) echo "0" > /proc/sys/kernel/auto_msgmni
   . de-activates msgmni automatic recomputing (this has the same effect as 2) except that msg_ctlmni's value stays blocked at its current value)
4) echo "1" > /proc/sys/kernel/auto_msgmni
   . recomputes msgmni's value based on the current available memory size and number of ipc namespaces
   . re-activates automatic recomputing for msgmni.
Signed-off-by: Nadia Derbey Cc: Solofo Ramangalahy Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f1a43f93f0f3bab418800eaccb9e2e3b5427e173 Author: Akinobu Mita Date: Fri Jul 25 01:48:07 2008 -0700 ipc: use simple_read_from_buffer() Also this patch kills unnecessary trailing NULL character. Signed-off-by: Akinobu Mita Cc: Nadia Derbey Cc: Manfred Spraul Cc: Pierre Peiffer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 380af1b33b3ff92df5cda96329b58f5d1b6b5a53 Author: Manfred Spraul Date: Fri Jul 25 01:48:06 2008 -0700 ipc/sem.c: rewrite undo list locking The attached patch: - reverses the locking order of ulp->lock and sem_lock: Previously, it was first ulp->lock, then inside sem_lock. Now it's the other way around. - converts the undo structure to rcu.
Benefits: - With the old locking order, IPC_RMID could not kfree the undo structures. The stale entries remained in the linked lists and were released later. - The patch fixes a race in semtimedop(): if both IPC_RMID and a semget() that recreates exactly the same id happen between find_alloc_undo() and sem_lock, then semtimedop() would access already kfree'd memory. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul Reviewed-by: Nadia Derbey Cc: Pierre Peiffer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit a1193f8ec091cd8fd309cc2982abe4499f6f2b4d Author: Manfred Spraul Date: Fri Jul 25 01:48:06 2008 -0700 ipc/sem.c: convert sem_array.sem_pending to struct list_head sem_array.sem_pending is a doubly linked list; the attached patch converts it to struct list_head. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul Reviewed-by: Nadia Derbey Cc: Pierre Peiffer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 2c0c29d414087f3b021059673c20a7088f5f1fff Author: Manfred Spraul Date: Fri Jul 25 01:48:05 2008 -0700 ipc/sem.c: remove unused entries from struct sem_queue sem_queue.sma and sem_queue.id were never used; the attached patch removes them. Signed-off-by: Manfred Spraul Reviewed-by: Nadia Derbey Cc: Pierre Peiffer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 4daa28f6d8f5cda8ea0f55048e3c8811c384cbdd Author: Manfred Spraul Date: Fri Jul 25 01:48:04 2008 -0700 ipc/sem.c: convert undo structures to struct list_head The undo structures contain two linked lists; the attached patch replaces them with generic struct list_head lists.
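Note (illustration only, not part of the patch): the struct list_head conversions above rely on an intrusive list, where the link node is embedded in the payload and the payload is recovered from the node. The following is a minimal userspace sketch of that idea. The helpers are simplified re-implementations written for this note rather than the kernel's own code, and `struct undo_entry` with its field names is a hypothetical stand-in for the real undo structure.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded node. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty list is a head that points at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link node in just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
	assert(node != NULL && head != NULL);
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink node and leave it as a valid empty list. */
static void list_del_init(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	list_init(node);
}

/* Hypothetical undo record that sits on two lists at once. */
struct undo_entry {
	int semid;
	struct list_head list_proc;	/* all undo records of one process */
	struct list_head list_id;	/* all undo records of one semaphore set */
};

/* Count the nodes on a list, whichever member they are linked through. */
static int list_count(const struct list_head *head)
{
	int n = 0;
	const struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}
```

Because each record carries one node per list, the same generic helpers serve both lists, which is what makes open-coded next/prev pointers unnecessary.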
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul Cc: Nadia Derbey Cc: Pierre Peiffer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 00c2bf85d8febfcfddde63822043462b026134ff Author: Nadia Derbey Date: Fri Jul 25 01:48:03 2008 -0700 ipc: get rid of ipc_lock_down() Remove the ipc_lock_down() routines: they used to call idr_find() locklessly (given that the ipc ids lock was already held), so they are not needed anymore. Signed-off-by: Nadia Derbey Acked-by: "Paul E. McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 983bfb7db303cfde56ae5bbf4e0f2f46e38c9576 Author: Nadia Derbey Date: Fri Jul 25 01:48:03 2008 -0700 ipc: call idr_find() without locking in ipc_lock() Call idr_find() locklessly from ipc_lock(), since the idr tree is now RCU protected. Signed-off-by: Nadia Derbey Acked-by: "Paul E. McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit cf481c20c476ad2c0febdace9ce23f5a4db19582 Author: Nadia Derbey Date: Fri Jul 25 01:48:02 2008 -0700 idr: make idr_remove rcu-safe Introduce the free_layer() routine: it is the one that actually frees memory after a grace period has elapsed. Signed-off-by: Nadia Derbey Reviewed-by: "Paul E. McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f9c46d6ea5ce138a886c3a0f10a46130afab75f5 Author: Nadia Derbey Date: Fri Jul 25 01:48:01 2008 -0700 idr: make idr_find rcu-safe Make idr_find rcu-safe: it can now be called inside an rcu_read critical section. Signed-off-by: Nadia Derbey Reviewed-by: "Paul E. 
McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 3219b3b7456d5cf15ba7b1fe7b1bcf15ce8840e2 Author: Nadia Derbey Date: Fri Jul 25 01:48:00 2008 -0700 idr: make idr_get_new* rcu-safe Make the idr_get_new* routines rcu-safe. Signed-off-by: Nadia Derbey Reviewed-by: "Paul E. McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 944ca05c7b4972f2ebf37262e0f4933d178ad6db Author: Nadia Derbey Date: Fri Jul 25 01:47:59 2008 -0700 idr: error checking factorization Do some code factorization in the return code analysis. Signed-off-by: Nadia Derbey Cc: "Paul E. McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f098ad655f4dd8e3da98ffbeda9cedcc4459c01a Author: Nadia Derbey Date: Fri Jul 25 01:47:59 2008 -0700 idr: fix a printk call Fix the incomplete printk call. Signed-off-by: Nadia Derbey Reviewed-by: "Paul E. McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 4ae537892ab9858f71c78701f4651ad1ca531a1b Author: Nadia Derbey Date: Fri Jul 25 01:47:58 2008 -0700 idr: rename some of the idr APIs internal routines This is a trivial patch that renames: . alloc_layer to get_from_free_list since it idr_pre_get that actually allocates memory. . free_layer to move_to_free_list since memory is not actually freed there. This makes things more clear for the next patches. Signed-off-by: Nadia Derbey Reviewed-by: "Paul E. 
McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 2027d1abc25ff770cc3bc936abd33570ce85d85a Author: Nadia Derbey Date: Fri Jul 25 01:47:57 2008 -0700 idr: change the idr structure After scalability problems have been detected when using the sysV ipcs, I have proposed to use an RCU based implementation of the IDR api instead (see threads http://lkml.org/lkml/2008/4/11/212 and http://lkml.org/lkml/2008/4/29/295). This resulted in many people asking to convert the idr API and make it rcu safe (because most of the code was duplicated and thus unmaintainable and unreviewable). So here is a first attempt. The important change wrt the idr API itself is during idr removes: idr layers are freed after a grace period, instead of being moved to the free list. The important change wrt ipcs is that idr_find() can now be called locklessly inside a rcu read critical section. Here are the results I've got for the pmsg test sent by Manfred:
   2.6.25-rc3-mm1   2.6.25-rc3-mm1+   2.6.25-mm1   Patched 2.6.25-mm1
1         1168441           1064021       876000               947488
2         1094264            921059      1549592              1730685
3         2082520           1738165      1694370              2324880
4         2079929           1695521       404553              2400408
5         2898758            406566       391283              3246580
6         2921417            261275       263249              3752148
7         3308761            126056       191742              4243142
8         3329456            100129       141722              4275780
1st column: stock 2.6.25-rc3-mm1
2nd column: 2.6.25-rc3-mm1 + ipc patches (store ipcs into idrs)
3rd column: stock 2.6.25-mm1
4th column: 2.6.25-mm1 + this patch series.
This patch: Add an rcu_head to the idr_layer structure in order to free it after a grace period. Signed-off-by: Nadia Derbey Reviewed-by: "Paul E.
McKenney" Cc: Manfred Spraul Cc: Jim Houston Cc: Pierre Peiffer Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 95b68dec0d52c7b8fea3698b3938cf3ab936436b Author: Chandru Date: Fri Jul 25 01:47:55 2008 -0700 calgary iommu: use the first kernels TCE tables in kdump kdump kernel fails to boot with calgary iommu and aacraid driver on a x366 box. The ongoing dma's of aacraid from the first kernel continue to exist until the driver is loaded in the kdump kernel. Calgary is initialized prior to aacraid and creation of new tce tables causes wrong dma's to occur. Here we try to get the tce tables of the first kernel in kdump kernel and use them. While in the kdump kernel we do not allocate new tce tables but instead read the base address register contents of calgary iommu and use the tables that the registers point to. With these changes the kdump kernel and hence aacraid now boots normally. Signed-off-by: Chandru Siddalingappa Acked-by: Muli Ben-Yehuda Cc: Ingo Molnar Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 8448502cfc915f70e3f8923849ade27d472044cb Author: Oleg Nesterov Date: Fri Jul 25 01:47:54 2008 -0700 workqueues: do CPU_UP_CANCELED if CPU_UP_PREPARE fails The bug was pointed out by Akinobu Mita , and this patch is based on his original patch. workqueue_cpu_callback(CPU_UP_PREPARE) expects that if it returns NOTIFY_BAD, _cpu_up() will send CPU_UP_CANCELED then. However, this is not true since "cpu hotplug: cpu: deliver CPU_UP_CANCELED only to NOTIFY_OKed callbacks with CPU_UP_PREPARE" commit: a0d8cdb652d35af9319a9e0fb7134de2a276c636 The callback which has returned NOTIFY_BAD will not receive CPU_UP_CANCELED. Change the code to fulfil the CPU_UP_CANCELED logic if CPU_UP_PREPARE fails. 
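Note (illustration only, not part of the patch): the rule described above is that a callback which fails CPU_UP_PREPARE is not sent CPU_UP_CANCELED, so it has to undo its own partial setup before returning. The following is a toy userspace model of that rule under simplifying assumptions. The event and result names mirror the ones in the message, but `struct subsys`, `subsys_callback()` and `cpu_up_prepare()` are hypothetical and do not reproduce the kernel's notifier code.

```c
#include <assert.h>
#include <stddef.h>

enum cpu_event { CPU_UP_PREPARE, CPU_UP_CANCELED };
enum notify_result { NOTIFY_OK, NOTIFY_BAD };

/* Hypothetical per-subsystem state tracked by the model. */
struct subsys {
	int must_fail;		/* force CPU_UP_PREPARE to fail */
	int has_thread;		/* resource created during prepare */
	int cancels_seen;	/* how many CPU_UP_CANCELED events arrived */
};

static enum notify_result subsys_callback(struct subsys *s, enum cpu_event ev)
{
	switch (ev) {
	case CPU_UP_PREPARE:
		s->has_thread = 1;	/* partial setup */
		if (!s->must_fail)
			return NOTIFY_OK;
		/* No CPU_UP_CANCELED will follow for us: clean up right here. */
		s->has_thread = 0;
		return NOTIFY_BAD;
	case CPU_UP_CANCELED:
		s->cancels_seen++;
		s->has_thread = 0;
		return NOTIFY_OK;
	}
	return NOTIFY_OK;
}

/*
 * Run the prepare stage over the chain. On failure, only the callbacks
 * that already returned NOTIFY_OK are sent CPU_UP_CANCELED.
 * Returns 0 on success and -1 if the bring-up was aborted.
 */
static int cpu_up_prepare(struct subsys *chain, size_t n)
{
	size_t i, j;

	assert(chain != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (subsys_callback(&chain[i], CPU_UP_PREPARE) == NOTIFY_BAD) {
			for (j = 0; j < i; j++)
				subsys_callback(&chain[j], CPU_UP_CANCELED);
			return -1;
		}
	}
	return 0;
}
```

In this model, dropping the in-place cleanup from the failing branch would leave `has_thread` set on the failing entry, which corresponds to the kind of leftover state the patch is meant to avoid.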
Signed-off-by: Oleg Nesterov Reported-by: Akinobu Mita Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 8de6d308bab4f67fcf953562f9f08f9527cad72d Author: Oleg Nesterov Date: Fri Jul 25 01:47:53 2008 -0700 workqueues: schedule_on_each_cpu() can use schedule_work_on() schedule_on_each_cpu() can use schedule_work_on() to avoid the code duplication. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit ef1ca236b8d645349ed6569598ae3f6c1b9511c0 Author: Oleg Nesterov Date: Fri Jul 25 01:47:53 2008 -0700 workqueues: queue_work() can use queue_work_on() queue_work() can use queue_work_on() to avoid the code duplication. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit a67da70dc0955580665f5444f318b92e69a3c272 Author: Oleg Nesterov Date: Fri Jul 25 01:47:52 2008 -0700 workqueues: lockdep annotations for flush_work() Add lockdep annotations to flush_work() and update the comment. Signed-off-by: Oleg Nesterov Cc: Jarek Poplawski Acked-by: Johannes Berg Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 69b895fd13d73aebf62b75502eb6513d43057ba3 Author: Oleg Nesterov Date: Fri Jul 25 01:47:51 2008 -0700 S390 topology: don't use kthread() for arch_reinit_sched_domains() Now that it is safe to use get_online_cpus() we can revert [S390] cpu topology: Fix possible deadlock. commit: fd781fa25c9e9c6fd1599df060b05e7c4ad724e5 and call arch_reinit_sched_domains() directly from topology_work_fn(). 
Signed-off-by: Oleg Nesterov Cc: Gautham R Shenoy Tested-by: Heiko Carstens Cc: Max Krasnyansky Cc: Paul Jackson Cc: Paul Menage Cc: Peter Zijlstra Cc: Vegard Nossum Cc: Martin Schwidefsky Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 3da1c84c00c7e5fa8348336bd8c342f9128b0f14 Author: Oleg Nesterov Date: Fri Jul 25 01:47:50 2008 -0700 workqueues: make get_online_cpus() useable for work->func() workqueue_cpu_callback(CPU_DEAD) flushes cwq->thread under cpu_maps_update_begin(). This means that the multithreaded workqueues can't use get_online_cpus() due to the possible deadlock, very bad and very old problem. Introduce the new state, CPU_POST_DEAD, which is called after cpu_hotplug_done() but before cpu_maps_update_done(). Change workqueue_cpu_callback() to use CPU_POST_DEAD instead of CPU_DEAD. This means that create/destroy functions can't rely on get_online_cpus() any longer and should take cpu_add_remove_lock instead. [akpm@linux-foundation.org: fix CONFIG_SMP=n] Signed-off-by: Oleg Nesterov Acked-by: Gautham R Shenoy Cc: Heiko Carstens Cc: Max Krasnyansky Cc: Paul Jackson Cc: Paul Menage Cc: Peter Zijlstra Cc: Vegard Nossum Cc: Martin Schwidefsky Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 8616a89ab761239c963eea3a63be383f127cc7e8 Author: Oleg Nesterov Date: Fri Jul 25 01:47:49 2008 -0700 workqueues: schedule_on_each_cpu: use flush_work() Change schedule_on_each_cpu() to use flush_work() instead of flush_workqueue(), this way we don't wait for other work_struct's which can be queued meanwhile. 
Signed-off-by: Oleg Nesterov Cc: Jarek Poplawski Cc: Max Krasnyansky Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit db700897224b5ebdf852f2d38920ce428940d059 Author: Oleg Nesterov Date: Fri Jul 25 01:47:49 2008 -0700 workqueues: implement flush_work() Most users of flush_workqueue() can be changed to use cancel_work_sync(), but sometimes we really need to wait for the completion and cancelling is not an option. schedule_on_each_cpu() is a good example. Add the new helper, flush_work(work), which waits for the completion of the specific work_struct. More precisely, it "flushes" the result of the last queue_work() which is visible to the caller. For example, this code
    queue_work(wq, work);
    /* WINDOW */
    queue_work(wq, work);
    flush_work(work);
doesn't necessarily work "as expected". What can happen in the WINDOW above is
    - wq starts the execution of work->func()
    - the caller migrates to another CPU
now, after the 2nd queue_work() this work is active on the previous CPU, and at the same time it is queued on another. In this case flush_work(work) may return before the first work->func() completes. It is trivial to add another helper
    int flush_work_sync(struct work_struct *work)
    {
        return flush_work(work) || wait_on_work(work);
    }
which works "more correctly", but it has to iterate over all CPUs and thus it is much slower than flush_work(). Signed-off-by: Oleg Nesterov Acked-by: Max Krasnyansky Acked-by: Jarek Poplawski Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 1a4d9b0aa0d3c50314e57525a5e5ec2cfc48b4c8 Author: Oleg Nesterov Date: Fri Jul 25 01:47:47 2008 -0700 workqueues: insert_work: use "list_head *" instead of "int tail" insert_work() inserts the new work_struct before or after cwq->worklist, depending on the "int tail" parameter. Change it to accept "list_head *" instead; this shrinks .text a bit and allows us to insert the barrier after a specific work_struct.
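Note (illustration only, not part of the patch): passing a list position instead of a head/tail flag lets one insertion routine cover both "append to the queue" and "place right after a given item". The following is a minimal userspace sketch under stated assumptions. The list helpers are simplified re-implementations, `struct work_item` and `snapshot_ids()` are hypothetical, and `insert_work()` here only borrows the name used in the message without reproducing the kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link node in immediately before pos. */
static void list_add_tail(struct list_head *node, struct list_head *pos)
{
	node->prev = pos->prev;
	node->next = pos;
	pos->prev->next = node;
	pos->prev = node;
}

/* Hypothetical, simplified work item. */
struct work_item {
	int id;
	struct list_head entry;
};

/*
 * Insert work immediately before pos:
 *  - pos == the worklist head   -> work is appended at the tail
 *  - pos == other->entry.next   -> work lands right after 'other'
 */
static void insert_work(struct list_head *pos, struct work_item *work)
{
	assert(pos != NULL && work != NULL);
	list_add_tail(&work->entry, pos);
}

/* Copy the ids in queue order into out[]; returns how many were written. */
static int snapshot_ids(const struct list_head *worklist, int *out, int max)
{
	int n = 0;
	struct list_head *p;

	for (p = worklist->next; p != worklist && n < max; p = p->next)
		out[n++] = container_of(p, struct work_item, entry)->id;
	return n;
}
```

With an "int tail" flag only the two ends of the queue are reachable. With a position argument, a barrier item can be placed directly behind the work it has to wait for.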
Signed-off-by: Oleg Nesterov Cc: Jarek Poplawski Cc: Max Krasnyansky Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 565b9b14e7f48131bca58840aa404bbef058fa89 Author: Oleg Nesterov Date: Fri Jul 25 01:47:47 2008 -0700 coredump: format_corename: fix the "core_uses_pid" logic I don't understand why the multi-thread coredump implies the core_uses_pid behaviour, but we shouldn't use mm->mm_users for that. This counter can be incremented by get_task_mm(). Use the value returned by coredump_wait() instead. Also, remove the "const char *pattern" argument, format_corename() can use core_pattern directly. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov Cc: Roland McGrath Cc: Alan Cox Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit a94e2d408eaedbd85aae259621d46fafc10479a2 Author: Oleg Nesterov Date: Fri Jul 25 01:47:46 2008 -0700 coredump: kill mm->core_done Now that we have core_state->dumper list we can use it to wake up the sub-threads waiting for the coredump completion. This uglifies the code and .text grows by 47 bytes, but otoh mm_struct lessens by sizeof(struct completion). Also, with this change we can decouple exit_mm() from the coredumping code. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 182c515fd2a942623aed4e4e0e0b37fe96571b05 Author: Oleg Nesterov Date: Fri Jul 25 01:47:45 2008 -0700 coredump: elf_fdpic_core_dump: use core_state->dumper list Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/. This patch allows further cleanups in binfmt_elf_fdpic.c.
Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Cc: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 83914441f94c6f2cd468ca97365f6c34f418706e Author: Oleg Nesterov Date: Fri Jul 25 01:47:45 2008 -0700 coredump: elf_core_dump: use core_state->dumper list Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/. This patch allows further cleanups in binfmt_elf.c, in particular we can kill the parallel info->threads list. Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit b564daf806d492dd4f7afe9b6c83b8d35d137669 Author: Oleg Nesterov Date: Fri Jul 25 01:47:44 2008 -0700 coredump: construct the list of coredumping threads at startup time binfmt->core_dump() has to iterate over all the threads in the system in order to find the coredumping threads and construct the list using GFP_ATOMIC allocations. With this patch each thread allocates the list node on exit_mm()'s stack and adds itself to the list. This allows us to do further changes: - simplify ->core_dump() - change exit_mm() to clear ->mm first, then wait for ->core_done. This makes the coredumping process visible to oom_kill - kill mm->core_done Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 9d5b327bf198d2720666de958dcc2ae219d86952 Author: Oleg Nesterov Date: Fri Jul 25 01:47:43 2008 -0700 coredump: make mm->core_state visible to ->core_dump() Move the "struct core_state core_state" from coredump_wait() to do_coredump(), this makes mm->core_state visible to binfmt->core_dump().
Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit c5f1cc8c1828486a61ab3e575da6e2c62b34d399 Author: Oleg Nesterov Date: Fri Jul 25 01:47:42 2008 -0700 coredump: turn core_state->nr_threads into atomic_t Turn core_state->nr_threads into atomic_t and kill now unneeded down_write(&mm->mmap_sem) in exit_mm(). Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 8cd9c249128a59e8e833d454a784b0cbd338d468 Author: Oleg Nesterov Date: Fri Jul 25 01:47:42 2008 -0700 coredump: simplify core_state->nr_threads calculation Change zap_process() to return int instead of incrementing mm->core_state->nr_threads directly. Change zap_threads() to set mm->core_state only on success. This patch restores the original size of .text, and more importantly now ->nr_threads is used in two places only. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 999d9fc1670bc082928b93b11d1f2e0e417d973c Author: Oleg Nesterov Date: Fri Jul 25 01:47:41 2008 -0700 coredump: move mm->core_waiters into struct core_state Move mm->core_waiters into "struct core_state" allocated on stack. This shrinks mm_struct a little bit and allows further changes. This patch mostly does s/core_waiters/core_state. The only essential change is that coredump_wait() must clear mm->core_state before return. The coredump_wait()'s path is uglified and .text grows by 30 bytes, this is fixed by the next patch. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 32ecb1f26dd50eeaac4e3f4dea4541c97848e459 Author: Oleg Nesterov Date: Fri Jul 25 01:47:41 2008 -0700 coredump: turn mm->core_startup_done into the pointer to struct core_state mm->core_startup_done points to "struct completion startup_done" allocated on the coredump_wait()'s stack. 
Introduce the new structure, core_state, which holds this "struct completion". This way we can add more info visible to the threads participating in coredump without enlarging mm_struct. No changes in affected .o files. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 24d5288f06ed8b3a368eba967d587cdb012dfdf7 Author: Oleg Nesterov Date: Fri Jul 25 01:47:40 2008 -0700 coredump: elf_core_dump: skip kernel threads linux_binfmt->core_dump() runs before the process does exit_aio(), this means that we can hit the kernel thread which shares the same ->mm. Afaics, nothing really bad can happen, but perhaps it makes sense to fix this minor bug. It is sad we have to iterate over all threads in the system and use GFP_ATOMIC. Hopefully we can kill these ugly do_each_thread()s, but this needs some nontrivial changes in mm_struct and do_coredump. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 15b9f360c0316c06d37c09b02d85565edbaf9dd3 Author: Oleg Nesterov Date: Fri Jul 25 01:47:39 2008 -0700 coredump: zap_threads() must skip kernel threads The main loop in zap_threads() must skip kthreads which may use the same mm. Otherwise we "kill" this thread erroneously (for example, it can not fork or exec after that), and the coredumping task gets stuck in the TASK_UNINTERRUPTIBLE state forever because of the wrong ->core_waiters count. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 246bb0b1deb29726990620d8b5e55ca29f331362 Author: Oleg Nesterov Date: Fri Jul 25 01:47:38 2008 -0700 kill PF_BORROWED_MM in favour of PF_KTHREAD Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users. No functional changes yet. But this allows us to do further fixes/cleanups.
oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the kthreads, this is wrong because of use_mm(). The problem with PF_BORROWED_MM is that we need task_lock() to avoid races. With this patch we can check PF_KTHREAD directly, or use a simple lockless helper: /* The result must not be dereferenced !!! */ struct mm_struct *__get_task_mm(struct task_struct *tsk) { if (tsk->flags & PF_KTHREAD) return NULL; return tsk->mm; } Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel thread without PF_BORROWED_MM. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 7b34e4283c685f5cc6ba6d30e939906eee0d4bcf Author: Oleg Nesterov Date: Fri Jul 25 01:47:37 2008 -0700 introduce PF_KTHREAD flag Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set by INIT_TASK() and copied to the forked children (we could set it in kthreadd() along with PF_NOFREEZE instead). daemonize() was changed as well. In that case testing of PF_KTHREAD is racy, but daemonize() is hopeless anyway. This flag is cleared in do_execve(), before search_binary_handler(). Probably not the best place, we can do this in exec_mmap() or in start_thread(), or clear it along with PF_FORKNOEXEC. But I think this doesn't matter in practice, and if do_execve() fails the kthread should die soon. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 3d749b9e676b26584a47e75c235aa6f69d0697ae Author: Oleg Nesterov Date: Fri Jul 25 01:47:37 2008 -0700 ptrace: simplify ptrace_stop()->sigkill_pending() path 1. SIGKILL can't be blocked, remove this check from sigkill_pending(). 2. When ptrace_stop() sees sigkill_pending() == T, it can just return. Kill "int killed" and simplify the code. This also is more correct, the tracer shouldn't see us in TASK_TRACED if we are not going to stop. I strongly believe this code needs further changes.
We should do the "was this task killed" check unconditionally, currently it depends on arch_ptrace_stop_needed(). On the other hand, sigkill_pending() isn't very clever. If the task was killed by tkill(SIGKILL), the signal may already be dequeued if the caller is do_exit(). Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 364d3c13c17f45da6d638011078d4c4d3070d719 Author: Oleg Nesterov Date: Fri Jul 25 01:47:36 2008 -0700 ptrace: give more respect to SIGKILL ptrace_stop() has some complicated checks to prevent the scheduling in the TASK_TRACED state with the pending SIGKILL, but these checks are racy, and they depend on arch_ptrace_stop_needed(). This patch assumes that the traced task should die asap if it was killed by SIGKILL, in that case schedule()->signal_pending_state() has no reason to ignore the TASK_WAKEKILL part of TASK_TRACED, and we can kill this nasty special case. Note: do_exit()->ptrace_notify() is special, the killed task can already dequeue SIGKILL at this point. Another indication that fatal_signal_pending() is not exactly right.
Signed-off-by: Oleg Nesterov Cc: Ingo Molnar Cc: Matthew Wilcox Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f22ab814a24e654b1de24db0c5f8b57b5ab2026a Author: Adrian Bunk Date: Fri Jul 25 01:47:34 2008 -0700 include/asm/ptrace.h userspace headers cleanup This patch contains the following cleanups for the asm/ptrace.h userspace headers: - include/asm-generic/Kbuild.asm already lists ptrace.h, remove the superfluous listings in the Kbuild files of the following architectures: - cris - frv - powerpc - x86 - don't expose function prototypes and macros to userspace: - arm - blackfin - cris - mn10300 - parisc - remove #ifdef CONFIG_'s around #define's: - blackfin - m68knommu - sh: AFAIK __SH5__ should work in both kernel and userspace, no need to leak CONFIG_SUPERH64 to userspace - xtensa: cosmetic change to remove empty #ifndef __ASSEMBLY__ #else #endif from the userspace headers Not changed by this patch is the fact that the following architectures have a different struct pt_regs depending on CONFIG_ variables: - h8300 - m68knommu - mips This does not work in userspace. Signed-off-by: Adrian Bunk Cc: Roland McGrath Cc: Oleg Nesterov Acked-by: Greg Ungerer Acked-by: Paul Mundt Acked-by: Grant Grundler Acked-by: Jesper Nilsson Acked-by: Chris Zankel Acked-by: David Howells Acked-by: Paul Mackerras Acked-by: Russell King Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit bc64efd220dcd4449aef8dd2564d73127b583b09 Author: Gustavo Fernando Padovan Date: Fri Jul 25 01:47:33 2008 -0700 kernel/signal.c: change vars pid and tgid types to pid_t Change the type of pid and tgid variables from int to the POSIX type pid_t. Signed-off-by: Gustavo F.
Padovan Cc: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit d8878ba3f05ae5bbfad5a6e72e5121c0ea35f989 Author: Michael Kerrisk Date: Fri Jul 25 01:47:32 2008 -0700 signals: make siginfo_t si_utime + si_stime report times in USER_HZ, not HZ In the switch to configurable HZ in 2.6, the treatment of the si_utime and si_stime fields that are exposed to userland via the siginfo structure looks to have been botched. As things stand, these fields report times in units of HZ, so that userland gets information that varies depending on the HZ that the kernel was configured with. This patch changes the reported values to use USER_HZ units. Signed-off-by: Michael Kerrisk Acked-by: Oleg Nesterov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit e4901f92a8dbe843e76651a50f7a2a6dd3d53474 Author: Oleg Nesterov Date: Fri Jul 25 01:47:31 2008 -0700 coredump: zap_threads: comments && use while_each_thread() No changes in fs/exec.o The for_each_process() loop in zap_threads() is very subtle, it is not clear why we don't race with fork/exit/exec. Add the fat comment. Also, change the code to use while_each_thread(). Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 2b201a9eddf509e8e935b45e573648e36f4b623f Author: Oleg Nesterov Date: Fri Jul 25 01:47:31 2008 -0700 signals: do_signal_stop: kill the SIGNAL_UNKILLABLE check fae5fa44f1fd079ffbed8e0add929dd7bbd1347f changed do_signal_stop() to check SIGNAL_UNKILLABLE, this wasn't needed. If signal_group_exit() == F, the signal sent to SIGNAL_UNKILLABLE task must be already filtered out by the caller, get_signal_to_deliver(). And if signal_group_exit() == T we are not going to stop.
Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 92413d771e7123304fb4b9efd2a00cccc946e383 Author: Oleg Nesterov Date: Fri Jul 25 01:47:30 2008 -0700 signals: dequeue_signal: don't check SIGNAL_GROUP_EXIT when setting SIGNAL_STOP_DEQUEUED dequeue_signal() checks SIGNAL_GROUP_EXIT before setting SIGNAL_STOP_DEQUEUED. This was added by 788e05a67c343fa22f2ae1d3ca264e7f15c25eaf long ago to avoid the coredump/SIGSTOP race. Since then the related code was changed, and now this subtle check is both incomplete and unneeded at the same time. It is incomplete because nowadays exec() doesn't set SIGNAL_GROUP_EXIT, so in fact we should check signal_group_exit() to avoid a similar race. Fortunately, we don't need the check at all. The only function which relies on SIGNAL_STOP_DEQUEUED is do_signal_stop(), and it ignores this flag if signal_group_exit() == T, this covers the SIGNAL_GROUP_EXIT case. Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 3854a771821c970065e3203a0b40ddc4101538cc Author: Oleg Nesterov Date: Fri Jul 25 01:47:29 2008 -0700 __exit_signal: don't take rcu lock There is no reason for rcu_read_lock() in __exit_signal(). tsk->sighand can only be changed if tsk does exec, obviously this is not possible. Signed-off-by: Oleg Nesterov Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 100360f03077663b7bef3af44805b6cf700c3bee Author: Oleg Nesterov Date: Fri Jul 25 01:47:29 2008 -0700 signals: change collect_signal() to return void With the recent changes collect_signal() always returns true. Change it to return void and update the single caller.
Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit d4434207616980885205c605697868c0f07e4378 Author: Oleg Nesterov Date: Fri Jul 25 01:47:28 2008 -0700 signals: collect_signal: simplify the "still_pending" logic Factor out sigdelset() calls and remove the "still_pending" variable. Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 6715ca451cfff1c9ce4b33ad9918a1dacf43997c Author: Oleg Nesterov Date: Fri Jul 25 01:47:27 2008 -0700 signals: collect_signal: remove the unneeded sigismember() check collect_signal() checks sigismember(&list->signal, sig), this is not needed. This "sig" was just found by next_signal(), so it must be valid. We have a (completely broken) call to ->notifier in between, but it must not play with sigpending->signal bits or unlock ->siglock. Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 96347e7759e2e433c427defa0fa1adfc8cce6226 Author: Oleg Nesterov Date: Fri Jul 25 01:47:27 2008 -0700 posix timers: release_posix_timer: kill the bogus put_task_struct(->it_process); release_posix_timer() can't be called with ->it_process != NULL. Once sys_timer_create() sets ->it_process it must not call release_posix_timer(), otherwise we can race with another thread doing sys_timer_delete(), this timer is visible to idr_find() and unlocked. The same is true for two other callers (actually, for any possible caller), sys_timer_delete() and itimer_delete(). They must clear ->it_process before unlock_timer() + release_posix_timer(). 
Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Cc: john stultz Cc: Thomas Gleixner Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 4b7a1304267bff68260ae861784b27130e805be3 Author: Oleg Nesterov Date: Fri Jul 25 01:47:26 2008 -0700 posix timers: timer_delete: remove the bogus "->it_process != NULL" check sys_timer_delete() and itimer_delete() check "timer->it_process != NULL", this looks completely bogus. ->it_process == NULL means that this timer is already under destruction or it is not fully initialized, this must not happen. sys_timer_delete: the timer is locked, and lock_timer() can't succeed if ->it_process == NULL. itimer_delete: it is called by exit_itimers() when there are no other threads which can play with signal_struct->posix_timers. Signed-off-by: Oleg Nesterov Acked-by: Roland McGrath Cc: john stultz Cc: Thomas Gleixner Cc: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit da5ef6bb96158b0fc0d808704237a453af449124 Author: Lai Jiangshan Date: Fri Jul 25 01:47:25 2008 -0700 cpuset: two minor code-cleanups In cpuset_update_task_memory_state() local variable struct task_struct *tsk = current; And local variable tsk is used 14 times and statement task_cs(tsk) is used twice in this function. So using task_cs(tsk) instead of task_cs(current) is better for readability. And "(struct cgroup_scanner *)&scan" is not good for readability also. (and "container_of" is used in cpuset_do_move_task(), not "(cpuset_hotplug_scanner *)scan") Signed-off-by: Lai Jiangshan Acked-by: Paul Menage Cc: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 02412483777651a26b19a75e49c2a451a174ca9c Author: Lai Jiangshan Date: Fri Jul 25 01:47:24 2008 -0700 cpuset: code-cleanup for started_after cgroup(cgroup_scan_tasks) will initialize heap->gt for us. This patch removes started_after() and its helper-function. 
Signed-off-by: Lai Jiangshan Acked-by: Paul Menage Cc: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 489a5393a20dcbf91104052120eb2eff8791b61b Author: Lai Jiangshan Date: Fri Jul 25 01:47:23 2008 -0700 cpuset: don't pass empty cpumasks to partition_sched_domains() I created lots of empty cpusets (empty cpumasks) and turned off "sched_load_balance" in the top cpuset. I found that all these empty cpumasks are passed to partition_sched_domains() in rebuild_sched_domains(); this is very time-consuming for partition_sched_domains() and it is not needed. This patch also reduces the memory consumed and some work in rebuild_sched_domains(). Signed-off-by: Lai Jiangshan Acked-by: Paul Menage Cc: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit c372e817afc629fea9ff6321313325ed0b4a855b Author: Li Zefan Date: Fri Jul 25 01:47:23 2008 -0700 cpuset: avoid unnecessary sched domains rebuilding When changing 'sched_relax_domain_level', don't rebuild sched domains if 'cpus' is empty or 'sched_load_balance' is not set. Also make the comments of rebuild_sched_domains() more readable. Signed-off-by: Li Zefan Cc: Hidetoshi Seto Cc: Paul Jackson Cc: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit f9b4fb8dabf38fb456c97f01aace07cb6e7c1723 Author: Miao Xie Date: Fri Jul 25 01:47:22 2008 -0700 cpusets: update task's cpus_allowed and mems_allowed after CPU/NODE offline/online The bug is that a task may run on the cpu/node which is not in its cpuset.cpus/ cpuset.mems.
It can be reproduced by the following commands: ----------------------------------- # mkdir /dev/cpuset # mount -t cpuset xxx /dev/cpuset # mkdir /dev/cpuset/0 # echo 0-1 > /dev/cpuset/0/cpus # echo 0 > /dev/cpuset/0/mems # echo $$ > /dev/cpuset/0/tasks # echo 0 > /sys/devices/system/cpu/cpu1/online # echo 1 > /sys/devices/system/cpu/cpu1/online ----------------------------------- There is only CPU0 in cpuset.cpus, but the task in this cpuset runs on both CPU0 and CPU1. It is because the task's cpus_allowed didn't get updated after we did the CPU offline/online manipulation. Similar for mems_allowed. This patch fixes this bug except for the root cpuset, because there is an open question about the root cpuset: whether it is necessary to update all the tasks in the root cpuset after cpu/node offline/online. If we update them, some kernel threads which are bound to a specific cpu will be unbound. If we don't, there is a bug in the root cpuset, which is also caused by the offline/online manipulation. For example, take a dual-cpu machine. We create a sub cpuset in the root cpuset and assign 1 to its cpus, and then we attach some tasks to this sub cpuset. After this, we offline CPU1. Now the tasks in this new cpuset are moved into the root cpuset automatically because there is no cpu in the sub cpuset. Then we online CPU1, and we find that all the tasks which didn't originally belong to the root cpuset just run on CPU0. Maybe we need to add a flag in the task_struct to mark which task can't be unbound?
Signed-off-by: Miao Xie Acked-by: Paul Jackson Cc: Li Zefan Cc: Paul Jackson Cc: Paul Menage Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 0b2f630a28d53b5a2082a5275bc3334b10373508 Author: Miao Xie Date: Fri Jul 25 01:47:21 2008 -0700 cpusets: restructure the function update_cpumask() and update_nodemask() Extract two functions from update_cpumask() and update_nodemask().They will be used later for updating tasks' cpus_allowed and mems_allowed after CPU/NODE offline/online. [lizf@cn.fujitsu.com: build fix] Signed-off-by: Miao Xie Acked-by: Paul Jackson Cc: David Rientjes Cc: Li Zefan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 628f42355389cfb596ca3a5a5f64fb9054a2a06a Author: KAMEZAWA Hiroyuki Date: Fri Jul 25 01:47:20 2008 -0700 memcg: limit change shrink usage Shrinking memory usage at limit change. [akpm@linux-foundation.org: coding-style fixes] Acked-by: Balbir Singh Acked-by: Pavel Emelyanov Signed-off-by: KAMEZAWA Hiroyuki Cc: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 12b9804419cfb1c1bdac413f6c373af3b88d154b Author: KAMEZAWA Hiroyuki Date: Fri Jul 25 01:47:19 2008 -0700 res_counter: limit change support ebusy Add an interface to set limit. This is necessary to memory resource controller because it shrinks usage at set limit. Other controllers may not need this interface to shrink usage because shrinking is not necessary or impossible. Acked-by: Balbir Singh Acked-by: Pavel Emelyanov Signed-off-by: KAMEZAWA Hiroyuki Cc: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit cede86acd8bd5d2205dec28db8ac86410a3a19e8 Author: Li Zefan Date: Fri Jul 25 01:47:18 2008 -0700 memcg: clean up checking of the disabled flag Those checks are unnecessary, because when the subsystem is disabled it can't be mounted, so those functions won't get called. The check is needed in functions which will be called in other places except cgroup. 
[hugh@veritas.com: further checking of disabled flag] Signed-off-by: Li Zefan Acked-by: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit accf163e6ab729f1fc5fffaa0310e498270bf4e7 Author: KAMEZAWA Hiroyuki Date: Fri Jul 25 01:47:17 2008 -0700 memcg: remove a redundant check Because of the remove-refcnt patch, it is now a very rare case that mem_cgroup_charge_common() is called against a page which is already accounted. mem_cgroup_charge_common() is called when: 1. a page is added into file cache. 2. an anon page is _newly_ mapped. A racy case is that a newly-swapped-in anonymous page is referred to from multiple threads in do_swap_page() at the same time. (a page is not Locked when mem_cgroup_charge() is called from do_swap_page.) Another case is shmem. It charges its page before calling add_to_page_cache(). Then, mem_cgroup_cache_charge() is called twice. This case is handled in mem_cgroup_cache_charge(). But this check may be too hacky... Signed-off-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: "Eric W. Biederman" Cc: Pavel Emelyanov Cc: Li Zefan Cc: Hugh Dickins Cc: YAMAMOTO Takashi Cc: Paul Menage Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit b76734e5e34e1889ab9fc5f3756570b1129f0f50 Author: KAMEZAWA Hiroyuki Date: Fri Jul 25 01:47:16 2008 -0700 memcg: add hints for branch Showing branch direction for obvious conditions. Signed-off-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: "Eric W. Biederman" Cc: Pavel Emelyanov Cc: Li Zefan Cc: Hugh Dickins Cc: YAMAMOTO Takashi Cc: Paul Menage Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit c9b0ed51483cc2fc42bb801b6675c4231b0e4634 Author: KAMEZAWA Hiroyuki Date: Fri Jul 25 01:47:15 2008 -0700 memcg: helper function for reclaim from shmem. A new call, mem_cgroup_shrink_usage() is added for shmem handling and replacing non-standard usage of mem_cgroup_charge/uncharge.
Now, shmem calls mem_cgroup_charge() just to reclaim some pages from mem_cgroup. In general, shmem is used by some process group and not for global resource (like file caches). So, it's reasonable to reclaim pages from mem_cgroup where shmem is mainly used. [hugh@veritas.com: shmem_getpage release page sooner] [hugh@veritas.com: mem_cgroup_shrink_usage css_put] Signed-off-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: "Eric W. Biederman" Cc: Pavel Emelyanov Cc: Li Zefan Cc: YAMAMOTO Takashi Cc: Paul Menage Cc: David Rientjes Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds commit 69029cd550284e32de13d6dd2f77b723c8a0e444 Author: KAMEZAWA Hiroyuki Date: Fri Jul 25 01:47:14 2008 -0700 memcg: remove refcnt from page_cgroup memcg: performance improvements Patch Description 1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed) 2/5 ... swapcache handling patch 3/5 ... add helper function for shmem's memory reclaim patch 4/5 ... optimize by likely/unlikely patch 5/5 ... remove redundant check patch (shmem handling is fixed.) Unix bench result.
== 2.6.26-rc2-mm1 + memory resource controller

Execl Throughput                            2915.4 lps  (29.6 secs, 3 samples)
C Compiler Throughput                       1019.3 lpm  (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)                5796.0 lpm  (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                1097.7 lpm  (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)                565.3 lpm  (60.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks    544057.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks     346481.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks       319325.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks      148788.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks        99051.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks     854789.0 KBps (30.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places          126145.2 lpm  (30.0 secs, 3 samples)

                     INDEX VALUES
TEST                                      BASELINE      RESULT     INDEX

Execl Throughput                              43.0      2915.4     678.0
File Copy 1024 bufsize 2000 maxblocks       3960.0    346481.0     875.0
File Copy 256 bufsize 500 maxblocks         1655.0     99051.0     598.5
File Copy 4096 bufsize 8000 maxblocks       5800.0    854789.0    1473.8
Shell Scripts (8 concurrent)                   6.0      1097.7    1829.5
                                                               =========
FINAL SCORE                                                        991.3

== 2.6.26-rc2-mm1 + this set ==

Execl Throughput                            3012.9 lps  (29.9 secs, 3 samples)
C Compiler Throughput                        981.0 lpm  (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)                5872.0 lpm  (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                1120.3 lpm  (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)                578.0 lpm  (60.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks    550452.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks     347159.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks       314644.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks      151852.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks       101000.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks     847979.0 KBps (30.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places          128148.7 lpm  (30.0 secs, 3 samples)

                     INDEX VALUES
TEST                                      BASELINE      RESULT     INDEX

Execl Throughput                              43.0      3012.9     700.7
File Copy 1024 bufsize 2000 maxblocks       3960.0    347159.0     876.7
File Copy 256 bufsize 500 maxblocks         1655.0    101000.0     610.3
File Copy 4096 bufsize 8000 maxblocks       5800.0    847979.0    1462.0
Shell Scripts (8 concurrent)                   6.0      1120.3    1867.2
                                                               =========
FINAL SCORE                                                       1004.6

This patch:

Remove refcnt from page_cgroup(). After this,

 * A page is charged only when !page_mapped() && no page_cgroup is assigned.
     * Anon page is newly mapped.
     * File page is added to mapping->tree.

 * A page is uncharged only when
     * Anon page is fully unmapped.
     * File page is removed from LRU.

There is no change in behavior from user's view. This patch also removes unnecessary calls in rmap.c which were used only for refcnt management.

[akpm@linux-foundation.org: fix warning]
[hugh@veritas.com: fix shmem_unuse_inode charging]