aboutsummaryrefslogtreecommitdiffstats
path: root/ipc/sem.c
AgeCommit message (Collapse)AuthorFilesLines
2025-09-01copy_process: pass clone_flags as u64 across calltreeSimon Schuster1-1/+1
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-18ipc/sem: use flexible array in 'struct sem_undo'Christophe JAILLET1-4/+2
Turn 'semadj' in 'struct sem_undo' into a flexible array. The advantages are: - save the size of a pointer when the new undo structure is allocated - avoid some always ugly pointer arithmetic to get the address of semadj - avoid an indirection when the array is accessed While at it, use struct_size() to compute the size of the new undo structure. Link: https://lkml.kernel.org/r/1ba993d443ad7e16ac2b1902adab1f05ebdfa454.1688918791.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-05ipc/sem: Fix dangling sem_array access in semtimedop raceJann Horn1-1/+2
When __do_semtimedop() goes to sleep because it has to wait for a semaphore value becoming zero or becoming bigger than some threshold, it links the on-stack sem_queue to the sem_array, then goes to sleep without holding a reference on the sem_array. When __do_semtimedop() comes back out of sleep, one of two things must happen: a) We prove that the on-stack sem_queue has been disconnected from the (possibly freed) sem_array, making it safe to return from the stack frame that the sem_queue exists in. b) We stabilize our reference to the sem_array, lock the sem_array, and detach the sem_queue from the sem_array ourselves. sem_array has RCU lifetime, so for case (b), the reference can be stabilized inside an RCU read-side critical section by locklessly checking whether the sem_queue is still connected to the sem_array. However, the current code does the lockless check on sem_queue before starting an RCU read-side critical section, so the result of the lockless check immediately becomes useless. Fix it by doing rcu_read_lock() before the lockless check. Now RCU ensures that if we observe the object being on our queue, the object can't be freed until rcu_read_unlock(). This bug is only hittable on kernel builds with full preemption support (either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full). Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-05-09ipc: update semtimedop() to use hrtimerPrakash Sangappa1-12/+11
semtimedop() should be converted to use hrtimer like it has been done for most of the system calls with timeouts. This system call already takes a struct timespec as an argument and can therefore provide finer granularity timed wait. Link: https://lkml.kernel.org/r/1651187881-2858-1-git-send-email-prakash.sangappa@oracle.com Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09ipc/sem: remove redundant assignmentsMichal Orzel1-2/+0
Get rid of redundant assignments which end up in values not being read either because they are overwritten or the function ends. Reported by clang-tidy [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20220409101933.207157-1-michalorzel.eng@gmail.com Signed-off-by: Michal Orzel <michalorzel.eng@gmail.com> Reviewed-by: Tom Rix <trix@redhat.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-02-04ipc/sem: do not sleep with a spin lock heldMinghao Chi1-2/+2
We can't call kvfree() with a spin lock held, so defer it. Link: https://lkml.kernel.org/r/20211223031207.556189-1-chi.minghao@zte.com.cn Fixes: fc37a3b8b438 ("[PATCH] ipc sem: use kvmalloc for sem_undo allocation") Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Yang Guang <cgel.zte@gmail.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-14ipc: remove memcg accounting for sops objects in do_semtimedop()Vasily Averin1-1/+1
Linus proposes to revert an accounting for sops objects in do_semtimedop() because it's really just a temporary buffer for a single semtimedop() system call. This object can consume up to 2 pages, syscall is sleeping one, size and duration can be controlled by user, and this allocation can be repeated by many thread at the same time. However Shakeel Butt pointed that there are much more popular objects with the same life time and similar memory consumption, the accounting of which was decided to be rejected for performance reasons. Considering at least 2 pages for task_struct and 2 pages for the kernel stack, a back of the envelope calculation gives a footprint amplification of <1.5 so this temporal buffer can be safely ignored. The factor would IMO be interesting if it was >> 2 (from the PoV of excessive (ab)use, fine-grained accounting seems to be currently unfeasible due to performance impact). Link: https://lore.kernel.org/lkml/90e254df-0dfe-f080-011e-b7c53ee7fd20@virtuozzo.com/ Fixes: 18319498fdd4 ("memcg: enable accounting of ipc resources") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-09Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds1-33/+52
Pull ARM development updates from Russell King: - Rename "mod_init" and "mod_exit" so that initcall debug output is actually useful (Randy Dunlap) - Update maintainers entries for linux-arm-kernel to indicate it is moderated for non-subscribers (Randy Dunlap) - Move install rules to arch/arm/Makefile (Masahiro Yamada) - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij) - Don't warn about atags_to_fdt() stack size (David Heidelberg) - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann) - Get rid of set_fs() usage (Arnd Bergmann) - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven) * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK() ARM: 9116/1: unified: Remove check for gcc < 4 ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning ARM: 9113/1: uaccess: remove set_fs() implementation ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault ARM: 9111/1: oabi-compat: rework fcntl64() emulation ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation ARM: 9107/1: syscall: always store thread_info->abi_syscall ARM: 9109/1: oabi-compat: add epoll_pwait handler ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs() ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault ARM: 9105/1: atags_to_fdt: don't warn about stack size ARM: 9103/1: Drop ARCH_NR_GPIOS definition ARM: 9102/1: move theinstall rules to arch/arm/Makefile ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific
2021-09-03memcg: enable accounting of ipc resourcesVasily Averin1-4/+5
When user creates IPC objects it forces kernel to allocate memory for these long-living objects. It makes sense to account them to restrict the host's memory consumption from inside the memcg-limited container. This patch enables accounting for IPC shared memory segments, messages semaphores and semaphore's undo lists. Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-20ARM: 9114/1: oabi-compat: rework sys_semtimedop emulationArnd Bergmann1-32/+52
sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To remove this one, expose the internal code of the actual implementation that operates on a kernel pointer and call it directly after copying. There should be no measurable impact on the normal execution of this function, and it makes the overly long function a little shorter, which may help readability. While reworking the oabi version, make it behave a little more like the native one, using kvmalloc_array() and restructure the code flow in a similar way. The naming of __do_semtimedop() is not very good, I hope someone can come up with a better name. One regression was spotted by kernel test robot <rong.a.chen@intel.com> and fixed before the first mailing list submission. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2021-07-01ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lockManfred Spraul1-5/+9
The patch solves three weaknesses in ipc/sem.c: 1) The initial read of use_global_lock in sem_lock() is an intentional race. KCSAN detects these accesses and prints a warning. 2) The code assumes that plain C read/writes are not mangled by the CPU or the compiler. 3) The comment it sysvipc_sem_proc_show() was hard to understand: The rest of the comments in ipc/sem.c speaks about sem_perm.lock, and suddenly this function speaks about ipc_lock_object(). To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used in code that owns sma->sem_perm.lock. The comment is updated to solve 3) [manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock] Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01ipc sem: use kvmalloc for sem_undo allocationVasily Averin1-5/+6
Patch series "ipc: allocations cleanup", v2. Some ipc objects use the wrong allocation functions: small objects can use kmalloc(), and vice versa, potentially large objects can use kmalloc(). This patch (of 2): Size of sem_undo can exceed one page and with the maximum possible nsems = 32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to avoid user-triggered disruptive actions like OOM killer in case of high-order memory shortage. User triggerable high order allocations are quite a problem on heavily fragmented systems. They can be a DoS vector. Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiryVarad Gautam1-2/+4
do_mq_timedreceive calls wq_sleep with a stack local address. The sender (do_mq_timedsend) uses this address to later call pipelined_send. This leads to a very hard to trigger race where a do_mq_timedreceive call might return and leave do_mq_timedsend to rely on an invalid address, causing the following crash: RIP: 0010:wake_q_add_safe+0x13/0x60 Call Trace: __x64_sys_mq_timedsend+0x2a9/0x490 do_syscall_64+0x80/0x680 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5928e40343 The race occurs as: 1. do_mq_timedreceive calls wq_sleep with the address of `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it holds a valid `struct ext_wait_queue *` as long as the stack has not been overwritten. 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call __pipelined_op. 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY). Here is where the race window begins. (`this` is `ewq_addr`.) 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it will see `state == STATE_READY` and break. 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's stack. (Although the address may not get overwritten until another function happens to touch it, which means it can persist around for an indefinite time.) 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a `struct ext_wait_queue *`, and uses it to find a task_struct to pass to the wake_q_add_safe call. In the lucky case where nothing has overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct. In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a bogus address as the receiver's task_struct causing the crash. do_mq_timedsend::__pipelined_op() should not dereference `this` after setting STATE_READY, as the receiver counterpart is now free to return. Change __pipelined_op to call wake_q_add_safe on the receiver's task_struct returned by get_task_struct, instead of dereferencing `this` which sits on the receiver's stack. As Manfred pointed out, the race potentially also exists in ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix those in the same way. Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers") Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers") Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers") Signed-off-by: Varad Gautam <varad.gautam@suse.com> Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07ipc/sem.c: spelling fixBhaskar Chowdhury1-1/+1
s/purpuse/purpose/ Link: https://lkml.kernel.org/r/20210319221432.26631-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07ipc/sem.c: mundane typo fixesBhaskar Chowdhury1-5/+5
s/runtine/runtime/ s/AQUIRE/ACQUIRE/ s/seperately/separately/ s/wont/won\'t/ s/succesfull/successful/ Link: https://lkml.kernel.org/r/20210326022240.26375-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-2/+2
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-12ipc: uninline functionsAlexey Dobriyan1-2/+1
Two functions are only called via function pointers, don't bother inlining them. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Link: http://lkml.kernel.org/r/20200710200312.GA960353@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-21Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"Ioanna Alifieraki1-4/+2
This reverts commit a97955844807e327df11aa33869009d14d6b7de0. Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()") removes a lock that is needed. This leads to a process looping infinitely in exit_sem() and can also lead to a crash. There is a reproducer available in [1] and with the commit reverted the issue does not reproduce anymore. Using the reproducer found in [1] is fairly easy to reach a point where one of the child processes is looping infinitely in exit_sem between for(;;) and if (semid == -1) block, while it's trying to free its last sem_undo structure which has already been freed by freeary(). Each sem_undo struct is on two lists: one per semaphore set (list_id) and one per process (list_proc). The list_id list tracks undos by semaphore set, and the list_proc by process. Undo structures are removed either by freeary() or by exit_sem(). The freeary function is invoked when the user invokes a syscall to remove a semaphore set. During this operation freeary() traverses the list_id associated with the semaphore set and removes the undo structures from both the list_id and list_proc lists. For this case, exit_sem() is called at process exit. Each process contains a struct sem_undo_list (referred to as "ulp") which contains the head for the list_proc list. When the process exits, exit_sem() traverses this list to remove each sem_undo struct. As in freeary(), whenever a sem_undo struct is removed from list_proc, it is also removed from the list_id list. Removing elements from list_id is safe for both exit_sem() and freeary() due to sem_lock(). Removing elements from list_proc is not safe; freeary() locks &un->ulp->lock when it performs list_del_rcu(&un->list_proc) but exit_sem() does not (locking was removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"). This can result in the following situation while executing the reproducer [1] : Consider a child process in exit_sem() and the parent in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)). - The list_proc for the child contains the last two undo structs A and B (the rest have been removed either by exit_sem() or freeary()). - The semid for A is 1 and semid for B is 2. - exit_sem() removes A and at the same time freeary() removes B. - Since A and B have different semid sem_lock() will acquire different locks for each process and both can proceed. The bug is that they remove A and B from the same list_proc at the same time because only freeary() acquires the ulp lock. When exit_sem() removes A it makes ulp->list_proc.next to point at B and at the same time freeary() removes B setting B->semid=-1. At the next iteration of for(;;) loop exit_sem() will try to remove B. The only way to break from for(;;) is for (&un->list_proc == &ulp->list_proc) to be true which is not. Then exit_sem() will check if B->semid=-1 which is and will continue looping in for(;;) until the memory for B is reallocated and the value at B->semid is changed. At that point, exit_sem() will crash attempting to unlink B from the lists (this can be easily triggered by running the reproducer [1] a second time). To prove this scenario instrumentation was added to keep information about each sem_undo (un) struct that is removed per process and per semaphore set (sma). CPU0 CPU1 [caller holds sem_lock(sma for A)] ... freeary() exit_sem() ... ... ... sem_lock(sma for B) spin_lock(A->ulp->lock) ... list_del_rcu(un_A->list_proc) list_del_rcu(un_B->list_proc) Undo structures A and B have different semid and sem_lock() operations proceed. However they belong to the same list_proc list and they are removed at the same time. This results into ulp->list_proc.next pointing to the address of B which is already removed. After reverting commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()") the issue was no longer reproducible. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779 Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com Fixes: a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()") Signed-off-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Herton R. Krzesinski <herton@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: <malat@debian.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ipc/sem.c: document and update memory barriersManfred Spraul1-25/+41
Document and update the memory barriers in ipc/sem.c: - Add smp_store_release() to wake_up_sem_queue_prepare() and document why it is needed. - Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep(). as the pair for the barrier inside wake_up_sem_queue_prepare(). - Add comments to all barriers, and mention the rules in the block regarding locking. - Switch to using wake_q_add_safe(). Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Waiman Long <longman@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25ipc/sem.c: convert to use built-in RCU list checkingJoel Fernandes (Google)1-1/+2
CONFIG_PROVE_RCU_LIST requires list_for_each_entry_rcu() to pass a lockdep expression if using srcu or locking for protection. It can only check regular RCU protection, all other protection needs to be passed as lockdep expression. Link: http://lkml.kernel.org/r/20190830231817.76862-2-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Cc: Jonathan Derrick <jonathan.derrick@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_sizeGustavo A. R. Silva1-5/+1
Use kvzalloc() instead of kvmalloc() and memset(). Also, make use of the struct_size() helper instead of the open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190131214221.GA28930@embeddedor Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07ipc: annotate implicit fall throughMathieu Malaterre1-0/+1
There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warning (W=1). This commit remove the following warning: ipc/sem.c:1683:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Link: http://lkml.kernel.org/r/20190114203608.18218-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-07y2038: syscalls: rename y2038 compat syscallsArnd Bergmann1-1/+1
A lot of system calls that pass a time_t somewhere have an implementation using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have been reworked so that this implementation can now be used on 32-bit architectures as well. The missing step is to redefine them using the regular SYSCALL_DEFINEx() to get them out of the compat namespace and make it possible to build them on 32-bit architectures. Any system call that ends in 'time' gets a '32' suffix on its name for that version, while the others get a '_time32' suffix, to distinguish them from the normal version, which takes a 64-bit time argument in the future. In this step, only 64-bit architectures are changed, doing this rename first lets us avoid touching the 32-bit architectures twice. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-01-25ipc: rename old-style shmctl/semctl/msgctl syscallsArnd Bergmann1-7/+32
The behavior of these system calls is slightly different between architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION symbol. Most architectures that implement the split IPC syscalls don't set that symbol and only get the modern version, but alpha, arm, microblaze, mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag. For the architectures that so far only implement sys_ipc(), i.e. m68k, mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior when adding the split syscalls, so we need to distinguish between the two groups of architectures. The method I picked for this distinction is to have a separate system call entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl() does not. The system call tables of the five architectures are changed accordingly. As an additional benefit, we no longer need the configuration specific definition for ipc_parse_version(), it always does the same thing now, but simply won't get called on architectures with the modern interface. A small downside is that on architectures that do set ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points that are never called. They only add a few bytes of bloat, so it seems better to keep them compared to adding yet another Kconfig symbol. I considered adding new syscall numbers for the IPC_64 variants for consistency, but decided against that for now. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-27y2038: globally rename compat_time to old_time32Arnd Bergmann1-5/+5
Christoph Hellwig suggested a slightly different path for handling backwards compatibility with the 32-bit time_t based system calls: Rather than simply reusing the compat_sys_* entry points on 32-bit architectures unchanged, we get rid of those entry points and the compat_time types by renaming them to something that makes more sense on 32-bit architectures (which don't have a compat mode otherwise), and then share the entry points under the new name with the 64-bit architectures that use them for implementing the compatibility. The following types and interfaces are renamed here, and moved from linux/compat_time.h to linux/time32.h: old new --- --- compat_time_t old_time32_t struct compat_timeval struct old_timeval32 struct compat_timespec struct old_timespec32 struct compat_itimerspec struct old_itimerspec32 ns_to_compat_timeval() ns_to_old_timeval32() get_compat_itimerspec64() get_old_itimerspec32() put_compat_itimerspec64() put_old_itimerspec32() compat_get_timespec64() get_old_timespec32() compat_put_timespec64() put_old_timespec32() As we already have aliases in place, this patch addresses only the instances that are relevant to the system call interface in particular, not those that occur in device drivers and other modules. Those will get handled separately, while providing the 64-bit version of the respective interfaces. I'm not renaming the timex, rusage and itimerval structures, as we are still debating what the new interface will look like, and whether we will need a replacement at all. This also doesn't change the names of the syscall entry points, which can be done more easily when we actually switch over the 32-bit architectures to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix. Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-22ipc/util.c: further variable name cleanupsManfred Spraul1-3/+3
The varable names got a mess, thus standardize them again: id: user space id. Called semid, shmid, msgid if the type is known. Most functions use "id" already. idx: "index" for the idr lookup Right now, some functions use lid, ipc_addid() already uses idx as the variable name. seq: sequence number, to avoid quick collisions of the user space id key: user space key, used for the rhash tree Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22ipc: simplify ipc initializationDavidlohr Bueso1-6/+4
Now that we know that rhashtable_init() will not fail, we can get rid of a lot of the unnecessary cleanup paths when the call errored out. [manfred@colorfullife.com: variable name added to util.h to resolve checkpatch warning] Link: http://lkml.kernel.org/r/20180712185241.4017-11-manfred@colorfullife.com Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22ipc: rename ipcctl_pre_down_nolock()Manfred Spraul1-1/+1
Both the comment and the name of ipcctl_pre_down_nolock() are misleading: The function must be called while holdling the rw semaphore. Therefore the patch renames the function to ipcctl_obtain_check(): This name matches the other names used in util.c: - "obtain" function look up a pointer in the idr, without acquiring the object lock. - The caller is responsible for locking. - _check means that the sequence number is checked. Link: http://lkml.kernel.org/r/20180712185241.4017-5-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()Manfred Spraul1-1/+1
ipc_addid() is impossible to use: - for certain failures, the caller must not use ipc_rcu_putref(), because the reference counter is not yet initialized. - for other failures, the caller must use ipc_rcu_putref(), because parallel operations could be ongoing already. The patch cleans that up, by initializing the refcount early, and by modifying all callers. The issues is related to the finding of syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an issue with reading kern_ipc_perm.seq, here both read and write to already released memory could happen. Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22ipc: compute kern_ipc_perm.id under the ipc lockManfred Spraul1-5/+13
ipc_addid() initializes kern_ipc_perm.id after having called ipc_idr_alloc(). Thus a parallel semctl() or msgctl() that uses e.g. MSG_STAT may use this unitialized value as the return code. The patch moves all accesses to kern_ipc_perm.id under the spin_lock(). The issues is related to the finding of syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an issue with kern_ipc_perm.seq Link: http://lkml.kernel.org/r/20180712185241.4017-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
The BTF conflicts were simple overlapping changes. The virtio_net conflict was an overlap of a fix of statistics counter, happening alongisde a move over to a bonafide statistics structure rather than counting value on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-26ipc/sem.c: prevent queue.status tearing in semopDavidlohr Bueso1-1/+1
In order for load/store tearing prevention to work, _all_ accesses to the variable in question need to be done around READ and WRITE_ONCE() macros. Ensure everyone does so for q->status variable for semtimedop(). Link: http://lkml.kernel.org/r/20180717052654.676-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-22rhashtable: split rhashtable.hNeilBrown1-0/+1
Due to the use of rhashtables in net namespaces, rhashtable.h is included in lots of the kernel, so a small changes can required a large recompilation. This makes development painful. This patch splits out rhashtable-types.h which just includes the major type declarations, and does not include (non-trivial) inline code. rhashtable.h is no longer included by anything in the include/ directory. Common include files only include rhashtable-types.h so a large recompilation is only triggered when that changes. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-15sysvipc/sem: mitigate semnum index against spectre v1Davidlohr Bueso1-4/+14
Both smatch and coverity are reporting potential issues with spectre variant 1 with the 'semnum' index within the sma->sems array, ie: ipc/sem.c:388 sem_lock() warn: potential spectre issue 'sma->sems' ipc/sem.c:641 perform_atomic_semop_slow() warn: potential spectre issue 'sma->sems' ipc/sem.c:721 perform_atomic_semop() warn: potential spectre issue 'sma->sems' Avoid any possible speculation by using array_index_nospec() thus ensuring the semnum value is bounded to [0, sma->sem_nsems). With the exception of sem_lock() all of these are slowpaths. Link: http://lkml.kernel.org/r/20180423171131.njs4rfm2yzyeg6do@linux-n805 Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-12treewide: kvmalloc() -> kvmalloc_array()Kees Cook1-1/+1
The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This patch replaces cases of: kvmalloc(a * b, gfp) with: kvmalloc_array(a * b, gfp) as well as handling cases of: kvmalloc(a * b * c, gfp) with: kvmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kvmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kvmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kvmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kvmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kvmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kvmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kvmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kvmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kvmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kvmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kvmalloc( - sizeof(char) * COUNT + COUNT , ...) | kvmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kvmalloc + kvmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kvmalloc + kvmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kvmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kvmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kvmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kvmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kvmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kvmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kvmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kvmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kvmalloc(C1 * C2 * C3, ...) | kvmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kvmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kvmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kvmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kvmalloc(sizeof(THING) * C2, ...) | kvmalloc(sizeof(TYPE) * C2, ...) | kvmalloc(C1 * C2 * C3, ...) | kvmalloc(C1 * C2, ...) | - kvmalloc + kvmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kvmalloc + kvmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kvmalloc + kvmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kvmalloc + kvmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kvmalloc + kvmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-04-20y2038: ipc: Enable COMPAT_32BIT_TIMEArnd Bergmann1-1/+2
Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop) take a timespec argument. After we move 32-bit architectures over to useing 64-bit time_t based syscalls, we need seperate entry points for the old 32-bit based interfaces. This changes the #ifdef guards for the existing 32-bit compat syscalls to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be enabled on all existing 32-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Use __kernel_timespecArnd Bergmann1-2/+2
This is a preparatation for changing over __kernel_timespec to 64-bit times, which involves assigning new system call numbers for mq_timedsend(), mq_timedreceive() and semtimedop() for compatibility with future y2038 proof user space. The existing ABIs will remain available through compat code. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Report long times to user spaceArnd Bergmann1-3/+11
The shmid64_ds/semid64_ds/msqid64_ds data structures have been extended to contain extra fields for storing the upper bits of the time stamps, this patch does the other half of the job and and fills the new fields on 32-bit architectures as well as 32-bit tasks running on a 64-bit kernel in compat mode. There should be no change for native 64-bit tasks. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Use ktime_get_real_seconds consistentlyArnd Bergmann1-3/+3
In some places, we still used get_seconds() instead of ktime_get_real_seconds(), and I'm changing the remaining ones now to all use ktime_get_real_seconds() so we use the full available range for timestamps instead of overflowing the 'unsigned long' return value in year 2106 on 32-bit kernels. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-11ipc/sem: introduce semctl(SEM_STAT_ANY)Davidlohr Bueso1-5/+12
There is a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the sma metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new SEM_STAT_ANY command such that the sem ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-03Merge branch 'userns-linus' of ↵Linus Torvalds1-28/+53
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There was a lot of work this cycle fixing bugs that were discovered after the merge window and getting everything ready where we can reasonably support fully unprivileged fuse. The bug fixes you already have and much of the unprivileged fuse work is coming in via other trees. Still left for fully unprivileged fuse is figuring out how to cleanly handle .set_acl and .get_acl in the legacy case, and properly handling of evm xattrs on unprivileged mounts. Included in the tree is a cleanup from Alexely that replaced a linked list with a statically allocated fix sized array for the pid caches, which simplifies and speeds things up. Then there is are some cleanups and fixes for the ipc namespace. The motivation was that in reviewing other code it was discovered that access ipc objects from different pid namespaces recorded pids in such a way that when asked the wrong pids were returned. In the worst case there has been a measured 30% performance impact for sysvipc semaphores. Other test cases showed no measurable performance impact. Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc performance both gave the nod that this is good enough. Casey Schaufler and James Morris have given their approval to the LSM side of the changes. I simplified the types and the code dealing with sysvipc to pass just kern_ipc_perm for all three types of ipc. Which reduced the header dependencies throughout the kernel and simplified the lsm code. Which let me work on the pid fixes without having to worry about trivial changes causing complete kernel recompiles" * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc/shm: Fix pid freeing. ipc/shm: fix up for struct file no longer being available in shm.h ipc/smack: Tidy up from the change in type of the ipc security hooks ipc: Directly call the security hook in ipc_ops.associate ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces. ipc/util: Helpers for making the sysvipc operations pid namespace aware ipc: Move IPCMNI from include/ipc.h into ipc/util.h msg: Move struct msg_queue into ipc/msg.c shm: Move struct shmid_kernel into ipc/shm.c sem: Move struct sem and struct sem_array into ipc/sem.c msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks pidns: simpler allocation of pid_* caches
2018-04-02ipc: add semctl syscall/compat_syscall wrappersDominik Brodowski1-2/+12
Provide ksys_semctl() and compat_ksys_semctl() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_semctl() and compat_sys_semctl(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add semget syscall wrapperDominik Brodowski1-1/+6
Provide ksys_semget() wrapper to avoid in-kernel calls to this syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_semget(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add semtimedop syscall/compat_syscall wrappersDominik Brodowski1-5/+18
Provide ksys_semtimedop() and compat_ksys_semtimedop() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_semtimedop() and compat_sys_semtimedop(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-03-27ipc: Directly call the security hook in ipc_ops.associateEric W. Biederman1-9/+1
After the last round of cleanups the shm, sem, and msg associate operations just became trivial wrappers around the appropriate security method. Simplify things further by just calling the security method directly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-27ipc/sem: Fix semctl(..., GETPID, ...) between pid namespacesEric W. Biederman1-10/+12
Today the last process to update a semaphore is remembered and reported in the pid namespace of that process. If there are processes in any other pid namespace querying that process id with GETPID the result will be unusable nonsense as it does not make any sense in your own pid namespace. Due to ipc_update_pid I don't think you will be able to get System V ipc semaphores into a troublesome cache line ping-pong. Using struct pids from separate process are not a problem because they do not share a cache line. Using struct pid from different threads of the same process are unlikely to be a problem as the reference count update can be avoided. Further linux futexes are a much better tool for the job of mutual exclusion between processes than System V semaphores. So I expect programs that are performance limited by their interprocess mutual exclusion primitive will be using futexes. So while it is possible that enhancing the storage of the last rocess of a System V semaphore from an integer to a struct pid will cause a performance regression because of the effect of frequently updating the pid reference count. I don't expect that to happen in practice. This change updates semctl(..., GETPID, ...) to return the process id of the last process to update a semphore inthe pid namespace of the calling process. Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22sem: Move struct sem and struct sem_array into ipc/sem.cEric W. Biederman1-0/+34
All of the users are now in ipc/sem.c so make the definitions local to that file to make code maintenance easier. AKA to prevent rebuilding the entire kernel when one of these files is changed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22sem/security: Pass kern_ipc_perm not sem_array into the sem security hooksEric W. Biederman1-11/+8
All of the implementations of security hooks that take sem_array only access sem_perm the struct kern_ipc_perm member. This means the dependencies of the sem security hooks can be simplified by passing the kern_ipc_perm member of sem_array. Making this change will allow struct sem and struct sem_array to become private to ipc/sem.c. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-02-06ipc: fix ipc data structures inconsistencyPhilippe Mikoyan1-0/+10
As described in the title, this patch fixes <ipc>id_ds inconsistency when <ipc>ctl_stat executes concurrently with some ds-changing function, e.g. shmat, msgsnd or whatever. For instance, if shmctl(IPC_STAT) is running concurrently with shmat, following data structure can be returned: {... shm_lpid = 0, shm_nattch = 1, ...} Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems Signed-off-by: Philippe Mikoyan <philippe.mikoyan@skat.systems> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+1
Merge more updates from Andrew Morton: - a bit more MM - procfs updates - dynamic-debug fixes - lib/ updates - checkpatch - epoll - nilfs2 - signals - rapidio - PID management cleanup and optimization - kcov updates - sysvipc updates - quite a few misc things all over the place * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) EXPERT Kconfig menu: fix broken EXPERT menu include/asm-generic/topology.h: remove unused parent_node() macro arch/tile/include/asm/topology.h: remove unused parent_node() macro arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro arch/sh/include/asm/topology.h: remove unused parent_node() macro arch/ia64/include/asm/topology.h: remove unused parent_node() macro drivers/pcmcia/sa1111_badge4.c: avoid unused function warning mm: add infrastructure for get_user_pages_fast() benchmarking sysvipc: make get_maxid O(1) again sysvipc: properly name ipc_addid() limit parameter sysvipc: duplicate lock comments wrt ipc_addid() sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE initramfs: use time64_t timestamps drivers/watchdog: make use of devm_register_reboot_notifier() kernel/reboot.c: add devm_register_reboot_notifier() kcov: update documentation Makefile: support flag -fsanitizer-coverage=trace-cmp kcov: support comparison operands collection kcov: remove pointless current != NULL check kernel/panic.c: add TAINT_AUX ...
2017-11-17sysvipc: duplicate lock comments wrt ipc_addid()Davidlohr Bueso1-0/+1
The comment in msgqueues when using ipc_addid() is quite useful imo. Duplicate it for shm and semaphores. Link: http://lkml.kernel.org/r/20170831172049.14576-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17Merge branch 'work.misc' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff, really no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: grab the lock instead of blocking in __fd_install during resizing vfs: stop clearing close on exec when closing a fd include/linux/fs.h: fix comment about struct address_space fs: make fiemap work from compat_ioctl coda: fix 'kernel memory exposure attempt' in fsync pstore: remove unneeded unlikely() vfs: remove unneeded unlikely() stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case make vfs_ustat() static do_handle_open() should be static elf_fdpic: fix unused variable warning fold destroy_super() into __put_super() new helper: destroy_unused_super() fix address space warnings in ipc/ acct.h: get rid of detritus
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-11fix address space warnings in ipc/Linus Torvalds1-2/+2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-14Merge branch 'work.ipc' of ↵Linus Torvalds1-116/+228
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ipc compat cleanup and 64-bit time_t from Al Viro: "IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa Dinamani" * 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: utimes: Make utimes y2038 safe ipc: shm: Make shmid_kernel timestamps y2038 safe ipc: sem: Make sem_array timestamps y2038 safe ipc: msg: Make msg_queue timestamps y2038 safe ipc: mqueue: Replace timespec with timespec64 ipc: Make sys_semtimedop() y2038 safe get rid of SYSVIPC_COMPAT on ia64 semtimedop(): move compat to native shmat(2): move compat to native msgrcv(2), msgsnd(2): move compat to native ipc(2): move compat to native ipc: make use of compat ipc_perm helpers semctl(): move compat to native semctl(): separate all layout-dependent copyin/copyout msgctl(): move compat to native msgctl(): split the actual work from copyin/copyout ipc: move compat shmctl to native shmctl: split the work from copyin/copyout
2017-09-08ipc: optimize semget/shmget/msgget for lots of keysGuillaume Knispel1-4/+7
ipc_findkey() used to scan all objects to look for the wanted key. This is slow when using a high number of keys. This change adds an rhashtable of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n). This change gives a 865% improvement of benchmark reaim.jobs_per_min on a 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1] Other (more micro) benchmark results, by the author: On an i5 laptop, the following loop executed right after a reboot took, without and with this change: for (int i = 0, k=0x424242; i < KEYS; ++i) semget(k++, 1, IPC_CREAT | 0600); total total max single max single KEYS without with call without call with 1 3.5 4.9 µs 3.5 4.9 10 7.6 8.6 µs 3.7 4.7 32 16.2 15.9 µs 4.3 5.3 100 72.9 41.8 µs 3.7 4.7 1000 5,630.0 502.0 µs * * 10000 1,340,000.0 7,240.0 µs * * 31900 17,600,000.0 22,200.0 µs * * *: unreliable measure: high variance The duration for a lookup-only usage was obtained by the same loop once the keys are present: total total max single max single KEYS without with call without call with 1 2.1 2.5 µs 2.1 2.5 10 4.5 4.8 µs 2.2 2.3 32 13.0 10.8 µs 2.3 2.8 100 82.9 25.1 µs * 2.3 1000 5,780.0 217.0 µs * * 10000 1,470,000.0 2,520.0 µs * * 31900 17,400,000.0 7,810.0 µs * * Finally, executing each semget() in a new process gave, when still summing only the durations of these syscalls: creation: total total KEYS without with 1 3.7 5.0 µs 10 32.9 36.7 µs 32 125.0 109.0 µs 100 523.0 353.0 µs 1000 20,300.0 3,280.0 µs 10000 2,470,000.0 46,700.0 µs 31900 27,800,000.0 219,000.0 µs lookup-only: total total KEYS without with 1 2.5 2.7 µs 10 25.4 24.4 µs 32 106.0 72.6 µs 100 591.0 352.0 µs 1000 22,400.0 2,250.0 µs 10000 2,510,000.0 25,700.0 µs 31900 28,200,000.0 115,000.0 µs [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com> Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Serge Hallyn <serge@hallyn.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com> Cc: Marc Pardo <marc.pardo@supersonicimagine.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc/sem: play nicer with large nsops allocationsDavidlohr Bueso1-2/+2
Replacing semop()'s kmalloc for kvmalloc was originally proposed by Manfred on the premise that it can be called for large (than order-1) sizes. For example, while Oracle recommends setting SEMOPM to a _minimum_ of 100, some distros[1] encourage the setting to be a factor of the amount of db tasks (PROCESSES), which can get fishy for large systems (easily going beyond 1000). [1] An Example of Semaphore Settings https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-An_Example_of_Semaphore_Settings.html So let's just convert this to kvmalloc, just like the rest of the allocations we do in ipc. While the fallback vmalloc obviously involves more overhead, this by far the uncommon path, and it's better for the user than just erroring out with kmalloc. Link: http://lkml.kernel.org/r/20170803184136.13855-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc/sem: drop sem_checkid helperDavidlohr Bueso1-2/+0
... 'tis not used. Link: http://lkml.kernel.org/r/20170803184136.13855-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc: convert sem_undo_list.refcnt from atomic_t to refcount_tElena Reshetova1-4/+4
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/1499417992-3238-3-git-send-email-elena.reshetova@intel.com Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: <arozansk@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-03ipc: sem: Make sem_array timestamps y2038 safeDeepa Dinamani1-9/+9
time_t is not y2038 safe. Replace all uses of time_t by y2038 safe time64_t. Similarly, replace the calls to get_seconds() with y2038 safe ktime_get_real_seconds(). Note that this preserves fast access on 64 bit systems, but 32 bit systems need sequence counters. The syscall interface themselves are not changed as part of the patch. They will be part of a different series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-03ipc: Make sys_semtimedop() y2038 safeDeepa Dinamani1-6/+6
struct timespec is not y2038 safe on 32 bit machines. Replace timespec with y2038 safe struct timespec64. Note that the patch only changes the internals without modifying the syscall interface. This will be part of a separate series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-21Merge branch 'for-mingo' of ↵Ingo Molnar1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Removal of spin_unlock_wait() - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes - CPU-hotplug fixes - Miscellaneous non-RCU fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17ipc: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney1-1/+2
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in exit_sem() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because exit_sem() is rarely invoked in production. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Manfred Spraul <manfred@colorfullife.com>
2017-08-02ipc: add missing container_of()s for randstructKees Cook1-1/+2
When building with the randstruct gcc plugin, the layout of the IPC structs will be randomized, which requires any sub-structure accesses to use container_of(). The proc display handlers were missing the needed container_of()s since the iterator is passing in the top-level struct kern_ipc_perm. This would lead to crashes when running the "lsipc" program after the system had IPC registered (e.g. after starting up Gnome): general protection fault: 0000 [#1] PREEMPT SMP ... RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0 ... Call Trace: sysvipc_shm_proc_show+0x5e/0x150 sysvipc_proc_show+0x1a/0x30 seq_read+0x2e9/0x3f0 ... Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-15semtimedop(): move compat to nativeAl Viro1-11/+33
... and finally kill the sodding compat_convert_timespec() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15semctl(): move compat to nativeAl Viro1-0/+94
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15semctl(): separate all layout-dependent copyin/copyoutAl Viro1-98/+94
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-12ipc/sem: drop __sem_free()Kees Cook1-7/+2
The remaining users of __sem_free() can simply call kvfree() instead for better readability. [manfred@colorfullife.com: Rediff to keep rcu protection for security_sem_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-20-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: move atomic_set() to where it is neededKees Cook1-1/+0
Only after ipc_addid() has succeeded will refcounting be used, so move initialization into ipc_addid() and remove from open-coded *_alloc() routines. Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem.c: avoid ipc_rcu_putref for failed ipc_addid()Manfred Spraul1-5/+4
Loosely based on a patch from Kees Cook <keescook@chromium.org>: - id and retval can be merged - if ipc_addid() fails, then use call_rcu() directly. The difference is that call_rcu is used for failed ipc_addid() calls, to continue to guaranteed an rcu delay for security_sem_free(). Link: http://lkml.kernel.org/r/20170525185107.12869-14-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem: avoid ipc_rcu_alloc()Kees Cook1-5/+20
Instead of using ipc_rcu_alloc() which only performs the refcount bump, open code it to perform better sem-specific checks. This also allows for sem_array structure layout to be randomized in the future. [manfred@colorfullife.com: Rediff, because the memset was temporarily inside ipc_rcu_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-10-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem: do not use ipc_rcu_free()Kees Cook1-2/+7
Avoid using ipc_rcu_free, since it just re-finds the original structure pointer. For the pre-list-init failure path, there is no RCU needed, since it was just allocated. It can be directly freed. Link: http://lkml.kernel.org/r/20170525185107.12869-6-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: drop non-RCU allocationKees Cook1-3/+5
The only users of ipc_alloc() were ipc_rcu_alloc() and the on-heap sem_io fall-back memory. Better to just open-code these to make things easier to read. [manfred@colorfullife.com: Rediff due to inclusion of memset() into ipc_rcu_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-5-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: merge ipc_rcu and kern_ipc_permManfred Spraul1-17/+17
ipc has two management structures that exist for every id: - struct kern_ipc_perm, it contains e.g. the permissions. - struct ipc_rcu, it contains the rcu head for rcu handling and the refcount. The patch merges both structures. As a bonus, we may save one cacheline, because both structures are cacheline aligned. In addition, it reduces the number of casts, instead most codepaths can use container_of. To simplify code, the ipc_rcu_alloc initializes the allocation to 0. [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()] Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem.c: remove sem_base, embed struct semManfred Spraul1-54/+34
sma->sem_base is initialized with sma->sem_base = (struct sem *) &sma[1]; The current code has four problems: - There is an unnecessary pointer dereference - sem_base is not needed. - Alignment for struct sem only works by chance. - The current code causes false positive for static code analysis. - This is a cast between different non-void types, which the future randstruct GCC plugin warns on. And, as bonus, the code size gets smaller: Before: 0 .text 00003770 After: 0 .text 0000374e [manfred@colorfullife.com: s/[0]/[]/, per hch] Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: <1vier1@web.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/wake_q.h> We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/wake_q.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-27ipc/sem: add hysteresisManfred Spraul1-25/+61
sysv sem has two lock modes: One with per-semaphore locks, one lock mode with a single global lock for the whole array. When switching from the per-semaphore locks to the global lock, all per-semaphore locks must be scanned for ongoing operations. The patch adds a hysteresis for switching from the global lock to the per semaphore locks. This reduces how often the per-semaphore locks must be scanned. Compared to the initial patch, this is a simplified solution: Setting USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior. In theory, a workload with exactly 10 simple sops and then one complex op now scales a bit worse, but this is pure theory: If there is concurrency, the it won't be exactly 10:1:10:1:10:1:... If there is no concurrency, then there is no need for scalability. Link: http://lkml.kernel.org/r/1476851896-3590-3-git-send-email-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: <1vier1@web.de> Cc: kernel test robot <xiaolong.ye@intel.com> Cc: <felixh@informatik.uni-bremen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27ipc/sem.c: avoid using spin_unlock_wait()Manfred Spraul1-22/+3
a) The ACQUIRE in spin_lock() applies to the read, not to the store, at least for powerpc. This forces to add a smp_mb() into the fast path. b) The memory barrier provided by spin_unlock_wait() is right now arch dependent. Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait(). Advantage: faster single op semop calls(), observed +8.9% on x86. (the other solution would be arch dependencies in ipc/sem). Disadvantage: slower complex op semop calls, if (and only if) there are no sleeping operations. The next patch adds hysteresis, this further reduces the probability that the slow path is used. Link: http://lkml.kernel.org/r/1476851896-3590-2-git-send-email-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: <1vier1@web.de> Cc: kernel test robot <xiaolong.ye@intel.com> Cc: <felixh@informatik.uni-bremen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10ipc/sem.c: fix incorrect sem_lock pairingManfred Spraul1-1/+1
Based on the syzcaller test case from dvyukov: https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt The slow (i.e.: failure to acquire) syscall exit from semtimedop() incorrectly assumed that the the same lock is acquired as it was at the initial syscall entry. This is wrong: - thread A: single semop semop(), sleeps - thread B: multi semop semop(), sleeps - thread A: woken up by signal/timeout With this sequence, the initial sem_lock() call locks the per-semaphore spinlock, and it is unlocked with sem_unlock(). The call at the syscall return locks the global spinlock. Because locknum is not updated, the following sem_unlock() call unlocks the per-semaphore spinlock, which is actually not locked. The fix is trivial: Use the return value from sem_lock. Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop") Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Johanna Abrahamsson <johanna@mjao.org> Tested-by: Johanna Abrahamsson <johanna@mjao.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: avoid idr tree lookup for interrupted semopDavidlohr Bueso1-32/+5
We can avoid the idr tree lookup (albeit possibly avoiding idr_find_fast()) when being awoken in EINTR, as the semid will not change in this context while blocked. Use the sma pointer directly and take the sem_lock, then re-check for RMID races. We continue to re-check the queue.status with the lock held such that we can detect situations where we where are dealing with a spurious wakeup but another task that holds the sem_lock updated the queue.status while we were spinning for it. Once we take the lock it obviously won't change again. Being the only caller, get rid of sem_obtain_lock() altogether. Link: http://lkml.kernel.org/r/1478708774-28826-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: simplify wait-wake loopDavidlohr Bueso1-56/+52
Instead of using the reverse goto, we can simplify the flow and make it more language natural by just doing do-while instead. One would hope this is the standard way (or obviously just with a while bucle) that we do wait/wakeup handling in the kernel. The exact same logic is kept, just more indented. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1478708774-28826-2-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: use proper list api for pending_list wakeupsDavidlohr Bueso1-25/+13
... saves some LoC and looks cleaner than re-implementing the calls. Link: http://lkml.kernel.org/r/1474225896-10066-6-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: explicitly inline check_restartDavidlohr Bueso1-1/+1
The compiler already does this, but make it explicit. This helper is really small and also used in update_queue's main loop, which is O(N^2) scanning. Inline and avoid the function overhead. Link: http://lkml.kernel.org/r/1474225896-10066-5-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: optimize perform_atomic_semop()Davidlohr Bueso1-10/+102
This is the main workhorse that deals with semop user calls such that the waitforzero or semval update operations, on the set, can complete on not as the sma currently stands. Currently, the set is iterated twice (setting semval, then backwards for the sempid value). Slowpaths, and particularly SEM_UNDO calls, must undo any altered sem when it is detected that the caller must block or has errored-out. With larger sets, there can occur situations where this involves a lot of cycles and can obviously be a suboptimal use of cached resources in shared memory. Ie, discarding CPU caches that are also calling semop and have the sembuf cached (and can complete), while the current lock holder doing the semop will block, error, or does a waitforzero operation. This patch proposes still iterating the set twice, but the first scan is read-only, and we perform the actual updates afterward, once we know that the call will succeed. In order to not suffer from the overhead of dealing with sops that act on the same sem_num, such (rare) cases use perform_atomic_semop_slow(), which is exactly what we have now. Duplicates are detected before grabbing sem_lock, and uses simple a 32/64-bit hash array variable to based on the sem_num we are working on. In addition add some comments to when we expect to the caller to block. [akpm@linux-foundation.org: coding-style fixes] [colin.king@canonical.com: ensure we left shift a ULL rather than a 32 bit integer] Link: http://lkml.kernel.org/r/20161028181129.7311-1-colin.king@canonical.com Link: http://lkml.kernel.org/r/20160921194603.GB21438@linux-80c1.suse Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: rework task wakeupsDavidlohr Bueso1-180/+86
Our sysv sems have been using the notion of lockless wakeups for a while, ever since commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the spinlock section"), in order to reduce the sem_lock hold times. This in-house pending queue can be replaced by wake_q (just like all the rest of ipc now), in that it provides the following advantages: o Simplifies and gets rid of unnecessary code. o We get rid of the IN_WAKEUP complexities. Given that wake_q_add() grabs reference to the task, if awoken due to an unrelated event, between the wake_q_add() and wake_up_q() window, we cannot race with sys_exit and the imminent call to wake_up_process(). o By not spinning IN_WAKEUP, we no longer need to disable preemption. In consequence, the wakeup paths (after schedule(), that is) must acknowledge an external signal/event, as well spurious wakeup occurring during the pending wakeup window. Obviously no changes in semantics that could be visible to the user. The fastpath is _only_ for when we know for sure that we were awoken due to a the waker's successful semop call (queue.status is not -EINTR). On a 48-core Haswell, running the ipcscale 'waitforzero' test, the following is seen with increasing thread counts: v4.8-rc5 v4.8-rc5 semopv2 Hmean sembench-sem-2 574733.00 ( 0.00%) 578322.00 ( 0.62%) Hmean sembench-sem-8 811708.00 ( 0.00%) 824689.00 ( 1.59%) Hmean sembench-sem-12 842448.00 ( 0.00%) 845409.00 ( 0.35%) Hmean sembench-sem-21 933003.00 ( 0.00%) 977748.00 ( 4.80%) Hmean sembench-sem-48 935910.00 ( 0.00%) 1004759.00 ( 7.36%) Hmean sembench-sem-79 937186.00 ( 0.00%) 983976.00 ( 4.99%) Hmean sembench-sem-234 974256.00 ( 0.00%) 1060294.00 ( 8.83%) Hmean sembench-sem-265 975468.00 ( 0.00%) 1016243.00 ( 4.18%) Hmean sembench-sem-296 991280.00 ( 0.00%) 1042659.00 ( 5.18%) Hmean sembench-sem-327 975415.00 ( 0.00%) 1029977.00 ( 5.59%) Hmean sembench-sem-358 1014286.00 ( 0.00%) 1049624.00 ( 3.48%) Hmean sembench-sem-389 972939.00 ( 0.00%) 1043127.00 ( 7.21%) Hmean sembench-sem-420 981909.00 ( 0.00%) 1056747.00 ( 7.62%) Hmean sembench-sem-451 990139.00 ( 0.00%) 1051609.00 ( 6.21%) Hmean sembench-sem-482 965735.00 ( 0.00%) 1040313.00 ( 7.72%) [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: merge fix for WAKE_Q to DEFINE_WAKE_Q rename] Link: http://lkml.kernel.org/r/20161122210410.5eca9fc2@canb.auug.org.au Link: http://lkml.kernel.org/r/1474225896-10066-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: do not call wake_sem_queue_do() prematurely ... as this call should ↵Davidlohr Bueso1-7/+12
obviously be paired with its _prepare() counterpart. At least whenever possible, as there is no harm in calling it bogusly as we do now in a few places. Immediate error semop(2) paths that are far from ever having the task block can be simplified and avoid a few unnecessary loads on their way out of the call as it is not deeply nested. Link: http://lkml.kernel.org/r/1474225896-10066-2-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/sem.c: add cond_resched in exit_smeNikolay Borisov1-0/+2
In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in exit_sem. Apparently it's possible for the loop to take quite a long time and it doesn't have a scheduling point in it. Since the codes is executing under an rcu read section this may also cause rcu stalls, which in turn block synchronize_rcu operations, which more or less de-stabilises the whole system. Fix this by introducing a cond_resched() at the beginning of the loop. So this patch fixes the following: NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119] CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6 Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000 RIP: 0010:[<ffffffff81614bc7>] [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30 RSP: 0018:ffff881c95553e40 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4 RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88 R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0 R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0 FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0 Call Trace: ? exit_sem+0x7c/0x280 do_exit+0x338/0xb40 do_group_exit+0x43/0xd0 SyS_exit_group+0x14/0x20 entry_SYSCALL_64_fastpath+0x16/0x6e Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com Signed-off-by: Nikolay Borisov <kernel@kyup.com> Cc: Herton R. Krzesinski <herton@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/sem.c: fix complex_count vs. simple op raceManfred Spraul1-55/+83
Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operations is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. The patch is safe for backporting, i.e. it makes no assumptions about memory barriers in spin_unlock_wait(). Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. - the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com Reported-by: <felixh@informatik.uni-bremen.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: <1vier1@web.de> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02sysv, ipc: fix security-layer leakingFabian Frederick1-6/+6
Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref() to receive rcu freeing function but used generic ipc_rcu_free() instead of msg_rcu_free() which does security cleaning. Running LTP msgsnd06 with kmemleak gives the following: cat /sys/kernel/debug/kmemleak unreferenced object 0xffff88003c0a11f8 (size 8): comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s) hex dump (first 8 bytes): 1b 00 00 00 01 00 00 00 ........ backtrace: kmemleak_alloc+0x23/0x40 kmem_cache_alloc_trace+0xe1/0x180 selinux_msg_queue_alloc_security+0x3f/0xd0 security_msg_queue_alloc+0x2e/0x40 newque+0x4e/0x150 ipcget+0x159/0x1b0 SyS_msgget+0x39/0x40 entry_SYSCALL_64_fastpath+0x13/0x8f Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to only use ipc_rcu_free in case of security allocation failure in newary() Fixes: 53dad6d3a8e ("ipc: fix race with LSMs") Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-14locking/spinlock: Update spin_unlock_wait() usersPeter Zijlstra1-1/+0
With the modified semantics of spin_unlock_wait() a number of explicit barriers can be removed. Also update the comment for the do_exit() usecase, as that was somewhat stale/obscure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14locking/barriers: Introduce smp_acquire__after_ctrl_dep()Peter Zijlstra1-12/+2
Introduce smp_acquire__after_ctrl_dep(), this construct is not uncommon, but the lack of this barrier is. Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC semaphore code and the qspinlock code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-22ipc/sem: make semctl setting sempid consistentDavidlohr Bueso1-2/+11
As indicated by bug#112271, Linux sets the sempid value upon semctl, and not only for semop calls. However, within semctl we only do this for SETVAL, leaving SETALL without updating the field, and therefore rather inconsistent behavior when compared to other Unices. There is really no documentation regarding this and therefore users should not make assumptions. With this patch, along with updating semctl.2 manpages, this scenario should become less ambiguous As such, set sempid on SETALL cmd. Also update some in-code documentation, specifying where the sempid is set. Passes ltp and custom testcase where a child (fork) does SETALL to the set. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Philip Semanchuk <linux_kernel.20.ick@spamgourmet.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Herton R. Krzesinski <herton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22tree wide: use kvfree() than conditional kfree()/vfree()Tetsuo Handa1-1/+1
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: David Rientjes <rientjes@google.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Boris Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14ipc/sem.c: update/correct memory barriersManfred Spraul1-4/+14
sem_lock() did not properly pair memory barriers: !spin_is_locked() and spin_unlock_wait() are both only control barriers. The code needs an acquire barrier, otherwise the cpu might perform read operations before the lock test. As no primitive exists inside <include/spinlock.h> and since it seems noone wants another primitive, the code creates a local primitive within ipc/sem.c. With regards to -stable: The change of sem_wait_array() is a bugfix, the change to sem_lock() is a nop (just a preprocessor redefinition to improve the readability). The bugfix is necessary for all kernels that use sem_wait_array() (i.e.: starting from 3.10). Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()Herton R. Krzesinski1-2/+4
After we acquire the sma->sem_perm lock in exit_sem(), we are protected against a racing IPC_RMID operation. Also at that point, we are the last user of sem_undo_list. Therefore it isn't required that we acquire or use ulp->lock. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14ipc,sem: fix use after free on IPC_RMID after a task using same semaphore ↵Herton R. Krzesinski1-6/+17
set exits The current semaphore code allows a potential use after free: in exit_sem we may free the task's sem_undo_list while there is still another task looping through the same semaphore set and cleaning the sem_undo list at freeary function (the task called IPC_RMID for the same semaphore set). For example, with a test program [1] running which keeps forking a lot of processes (which then do a semop call with SEM_UNDO flag), and with the parent right after removing the semaphore set with IPC_RMID, and a kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and CONFIG_DEBUG_SPINLOCK, you can easily see something like the following in the kernel log: Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64 000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk 010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........ Prev obj: start=ffff88003b45c180, len=64 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ 010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7.... Next obj: start=ffff88003b45c200, len=64 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ 010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<.... BUG: spinlock wrong CPU on CPU#2, test/18028 general protection fault: 0000 [#1] SMP Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib] CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 RIP: spin_dump+0x53/0xc0 Call Trace: spin_bug+0x30/0x40 do_raw_spin_unlock+0x71/0xa0 _raw_spin_unlock+0xe/0x10 freeary+0x82/0x2a0 ? _raw_spin_lock+0xe/0x10 semctl_down.clone.0+0xce/0x160 ? __do_page_fault+0x19a/0x430 ? __audit_syscall_entry+0xa8/0x100 SyS_semctl+0x236/0x2c0 ? syscall_trace_leave+0xde/0x130 entry_SYSCALL_64_fastpath+0x12/0x71 Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89 RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0 RSP <ffff88003750fd68> ---[ end trace 783ebb76612867a0 ]--- NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib] CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 RIP: native_read_tsc+0x0/0x20 Call Trace: ? delay_tsc+0x40/0x70 __delay+0xf/0x20 do_raw_spin_lock+0x96/0x140 _raw_spin_lock+0xe/0x10 sem_lock_and_putref+0x11/0x70 SYSC_semtimedop+0x7bf/0x960 ? handle_mm_fault+0xbf6/0x1880 ? dequeue_task_fair+0x79/0x4a0 ? __do_page_fault+0x19a/0x430 ? kfree_debugcheck+0x16/0x40 ? __do_page_fault+0x19a/0x430 ? __audit_syscall_entry+0xa8/0x100 ? do_audit_syscall_entry+0x66/0x70 ? syscall_trace_enter_phase1+0x139/0x160 SyS_semtimedop+0xe/0x10 SyS_semop+0x10/0x20 entry_SYSCALL_64_fastpath+0x12/0x71 Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9 Kernel panic - not syncing: softlockup: hung tasks I wasn't able to trigger any badness on a recent kernel without the proper config debugs enabled, however I have softlockup reports on some kernel versions, in the semaphore code, which are similar as above (the scenario is seen on some servers running IBM DB2 which uses semaphore syscalls). The patch here fixes the race against freeary, by acquiring or waiting on the sem_undo_list lock as necessary (exit_sem can race with freeary, while freeary sets un->semid to -1 and removes the same sem_undo from list_proc or when it removes the last sem_undo). After the patch I'm unable to reproduce the problem using the test case [1]. [1] Test case used below: #include <stdio.h> #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> #include <sys/wait.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <errno.h> #define NSEM 1 #define NSET 5 int sid[NSET]; void thread() { struct sembuf op; int s; uid_t pid = getuid(); s = rand() % NSET; op.sem_num = pid % NSEM; op.sem_op = 1; op.sem_flg = SEM_UNDO; semop(sid[s], &op, 1); exit(EXIT_SUCCESS); } void create_set() { int i, j; pid_t p; union { int val; struct semid_ds *buf; unsigned short int *array; struct seminfo *__buf; } un; /* Create and initialize semaphore set */ for (i = 0; i < NSET; i++) { sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT); if (sid[i] < 0) { perror("semget"); exit(EXIT_FAILURE); } } un.val = 0; for (i = 0; i < NSET; i++) { for (j = 0; j < NSEM; j++) { if (semctl(sid[i], j, SETVAL, un) < 0) perror("semctl"); } } /* Launch threads that operate on semaphore set */ for (i = 0; i < NSEM * NSET * NSET; i++) { p = fork(); if (p < 0) perror("fork"); if (p == 0) thread(); } /* Free semaphore set */ for (i = 0; i < NSET; i++) { if (semctl(sid[i], NSEM, IPC_RMID)) perror("IPC_RMID"); } /* Wait for forked processes to exit */ while (wait(NULL)) { if (errno == ECHILD) break; }; } int main(int argc, char **argv) { pid_t p; srand(time(NULL)); while (1) { p = fork(); if (p < 0) { perror("fork"); exit(EXIT_FAILURE); } if (p == 0) { create_set(); goto end; } /* Wait for forked processes to exit */ while (wait(NULL)) { if (errno == ECHILD) break; }; } end: return 0; } [akpm@linux-foundation.org: use normal comment layout] Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc: rename ipc_obtain_objectDavidlohr Bueso1-2/+2
... to ipc_obtain_object_idr, which is more meaningful and makes the code slightly easier to follow. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15ipc: remove use of seq_printf return valueJoe Perches1-12/+14
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17ipc,sem: use current->state helpersDavidlohr Bueso1-1/+1
Call __set_current_state() instead of assigning the new state directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping track of who changed the state. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()Manfred Spraul1-3/+10
When I fixed bugs in the sem_lock() logic, I was more conservative than necessary. Therefore it is safe to replace the smp_mb() with smp_rmb(). And: With smp_rmb(), semop() syscalls are up to 10% faster. The race we must protect against is: sem->lock is free sma->complex_count = 0 sma->sem_perm.lock held by thread B thread A: A: spin_lock(&sem->lock) B: sma->complex_count++; (now 1) B: spin_unlock(&sma->sem_perm.lock); A: spin_is_locked(&sma->sem_perm.lock); A: XXXXX memory barrier A: if (sma->complex_count == 0) Thread A must read the increased complex_count value, i.e. the read must not be reordered with the read of sem_perm.lock done by spin_is_locked(). Since it's about ordering of reads, smp_rmb() is sufficient. [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03ipc/sem.c: fully initialize sem_array before making it visibleManfred Spraul1-7/+8
ipc_addid() makes a new ipc identifier visible to everyone. New objects start as locked, so that the caller can complete the initialization after the call. Within struct sem_array, at least sma->sem_base and sma->sem_nsems are accessed without any locks, therefore this approach doesn't work. Thus: Move the ipc_addid() to the end of the initialization. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Rik van Riel <riel@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT)Manfred Spraul1-0/+11
The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT) always (since 0.99.10) reported a thread as sleeping on all semaphores that are listed in the semop() call. The documented behavior (both in the Linux man page and in the Single Unix Specification) is that a task should be reported on exactly one semaphore: The semaphore that caused the thread to got to sleep. This patch adds a pr_info_once() that is triggered if a thread hits the relevant case. The code triggers slightly too often, otherwise it would be necessary to replicate the old code. As there are no known users of GETNCNT or GETZCNT, this is done to prevent unnecessary bloat. The task that triggered is reported with name (tsk->comm) and pid. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliantManfred Spraul1-21/+13
SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task waits on exactly one semaphore: The semaphore from the first operation in the sop array that cannot proceed. The Linux implementation never followed the standard, it tried to count all semaphores that might be the reason why a task sleeps. This patch fixes that. Note: a) The implementation assumes that GETNCNT and GETZCNT are rare operations, therefore the code counts them only on demand. (If they wouldn't be rare, then the non-compliance would have been found earlier) b) compared to the initial version of the patch, the BUG_ONs were removed and it was clarified that the new behavior conforms to SUS. Back-compatibility concerns: Manfred: : - there is no application in Fedora that uses GETNCNT or GETZCNT. : : - application that use only single-sop semop() are also safe, the : difference only affects complex apps. : : - portable application are also safe, the new behavior is standard : compliant. : : But that's it. The old behavior existed in Linux from 0.99.something : until now. Michael: : * These operations seem to be very little used. Grepping the public : source that is contained Fedora 20 source DVD, there appear to be no : uses. Of course, this says nothing about uses in private / : non-mainstream FOSS code, but it seems likely that the same pattern : is followed there. : : * The existing behavior is hard enough to understand that I suspect : that no one understood it well enough to rely on it anyway : (especially as that behavior contradicted both man page and POSIX). : : So, there's a chance of breakage, but I estimate that it's minute. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: store which operation blocks in perform_atomic_semop()Manfred Spraul1-0/+3
Preparation for the next patch: In the slow-path of perform_atomic_semop(), store a pointer to the operation that caused the operation to block. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: change perform_atomic_semop parametersManfred Spraul1-19/+19
Right now, perform_atomic_semop gets the content of sem_queue as individual fields. Changes that, instead pass a pointer to sem_queue. This is a preparation for the next patch: it uses sem_queue to store the reason why a task must sleep. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: remove code duplicationManfred Spraul1-56/+51
count_semzcnt and count_semncnt are more of less identical. The patch creates a single function that either counts the number of tasks waiting for zero or waiting due to a decrease operation. Compared to the initial version, the BUG_ONs were removed. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: bugfix for semctl(,,GETZCNT)Manfred Spraul1-0/+10
GETZCNT is supposed to return the number of threads that wait until a semaphore value becomes 0. The current implementation overlooks complex operations that contain both wait-for-zero operation and operations that alter at least one semaphore. The patch fixes that. It's intentionally copy&paste, this will be cleaned up in the next patch. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: clear whitespacePaul McQuade1-5/+5
trailing whitespace Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: use Linux headersPaul McQuade1-1/+1
Use #include <linux/uaccess.h> instead of <asm/uaccess.h> Use #include <linux/types.h> instead of <asm/types.h> Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc: constify ipc_opsMathias Krause1-5/+5
There is no need to recreate the very same ipc_ops structure on every kernel entry for msgget/semget/shmget. Just declare it static and be done with it. While at it, constify it as we don't modify the structure at runtime. Found in the PaX patch, written by the PaX Team. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc: remove braces for single statementsDavidlohr Bueso1-5/+3
Deal with checkpatch messages: WARNING: braces {} are not necessary for single statement blocks Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc: standardize code commentsDavidlohr Bueso1-13/+11
IPC commenting style is all over the place, *specially* in util.c. This patch orders things a bit. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc: whitespace cleanupManfred Spraul1-48/+48
The ipc code does not adhere the typical linux coding style. This patch fixes lots of simple whitespace errors. - mostly autogenerated by scripts/checkpatch.pl -f --fix \ --types=pointer_location,spacing,space_before_tab - one manual fixup (keep structure members tab-aligned) - removal of additional space_before_tab that were not found by --fix Tested with some of my msg and sem test apps. Andrew: Could you include it in -mm and move it towards Linus' tree? Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Suggested-by: Li Bin <huawei.libin@huawei.com> Cc: Joe Perches <joe@perches.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc: change kern_ipc_perm.deleted type to boolRafael Aquini1-1/+1
struct kern_ipc_perm.deleted is meant to be used as a boolean toggle, and the changes introduced by this patch are just to make the case explicit. Signed-off-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Greg Thelen <gthelen@google.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc: introduce ipc_valid_object() helper to sort out IPC_RMID racesRafael Aquini1-8/+16
After the locking semantics for the SysV IPC API got improved, a couple of IPC_RMID race windows were opened because we ended up dropping the 'kern_ipc_perm.deleted' check performed way down in ipc_lock(). The spotted races got sorted out by re-introducing the old test within the racy critical sections. This patch introduces ipc_valid_object() to consolidate the way we cope with IPC_RMID races by using the same abstraction across the API implementation. Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Greg Thelen <gthelen@google.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc/sem.c: avoid overflow of semop undo (semadj) valuePetr Mladek1-11/+13
When trying to understand semop code, I found a small mistake in the check for semadj (undo) value overflow. The new undo value is not stored immediately and next potential checks are done against the old value. The failing scenario is not much practical. One semop call has to do more operations on the same semaphore. Also semval and semadj must have different values, so there has to be some operations without SEM_UNDO flag. For example: struct sembuf depositor_op[1]; struct sembuf collector_op[2]; depositor_op[0].sem_num = 0; depositor_op[0].sem_op = 20000; depositor_op[0].sem_flg = 0; collector_op[0].sem_num = 0; collector_op[0].sem_op = -10000; collector_op[0].sem_flg = SEM_UNDO; collector_op[1].sem_num = 0; collector_op[1].sem_op = -10000; collector_op[1].sem_flg = SEM_UNDO; if (semop(semid, depositor_op, 1) == -1) { perror("Failed to do 1st deposit"); return 1; } if (semop(semid, collector_op, 2) == -1) { perror("Failed to do 1st collect"); return 1; } if (semop(semid, depositor_op, 1) == -1) { perror("Failed to do 2nd deposit"); return 1; } if (semop(semid, collector_op, 2) == -1) { perror("Failed to do 2nd collect"); return 1; } return 0; It passes without error now but the semadj value has overflown in the 2nd collector operation. [akpm@linux-foundation.org: restore lessened scope of local `undo'] [davidlohr@hp.com: correct header comment for perform_atomic_semop] Signed-off-by: Petr Mladek <pmladek@suse.cz> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16ipc/sem.c: synchronize semop and semctl with IPC_RMIDManfred Spraul1-13/+29
After acquiring the semlock spinlock, operations must test that the array is still valid. - semctl() and exit_sem() would walk stale linked lists (ugly, but should be ok: all lists are empty) - semtimedop() would sleep forever - and if woken up due to a signal - access memory after free. The patch also: - standardizes the tests for .deleted, so that all tests in one function leave the function with the same approach. - unconditionally tests for .deleted immediately after every call to sem_lock - even it it means that for semctl(GETALL), .deleted will be tested twice. Both changes make the review simpler: After every sem_lock, there must be a test of .deleted, followed by a goto to the cleanup code (if the function uses "goto cleanup"). The only exception is semctl_down(): If sem_ids().rwsem is locked, then the presence in ids->ipcs_idr is equivalent to !.deleted, thus no additional test is required. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30ipc/sem.c: update sem_otime for all operationsManfred Spraul1-13/+29
In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the spinlock section"), the update of semaphore's sem_otime(last semop time) was moved to one central position (do_smart_update). But since do_smart_update() is only called for operations that modify the array, this means that wait-for-zero semops do not update sem_otime anymore. The fix is simple: Non-alter operations must update sem_otime. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Jia He <jiakernel@gmail.com> Tested-by: Jia He <jiakernel@gmail.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30ipc/sem.c: synchronize the proc interfaceManfred Spraul1-0/+8
The proc interface is not aware of sem_lock(), it instead calls ipc_lock_object() directly. This means that simple semop() operations can run in parallel with the proc interface. Right now, this is uncritical, because the implementation doesn't do anything that requires a proper synchronization. But it is dangerous and therefore should be fixed. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30ipc/sem.c: optimize sem_lock()Manfred Spraul1-0/+8
Operations that need access to the whole array must guarantee that there are no simple operations ongoing. Right now this is achieved by spin_unlock_wait(sem->lock) on all semaphores. If complex_count is nonzero, then this spin_unlock_wait() is not necessary, because it was already performed in the past by the thread that increased complex_count and even though sem_perm.lock was dropped inbetween, no simple operation could have started, because simple operations cannot start when complex_count is non-zero. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Mike Galbraith <bitbucket@online.de> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30ipc/sem.c: fix race in sem_lock()Manfred Spraul1-44/+78
The exclusion of complex operations in sem_lock() is insufficient: after acquiring the per-semaphore lock, a simple op must first check that sem_perm.lock is not locked and only after that test check complex_count. The current code does it the other way around - and that creates a race. Details are below. The patch is a complete rewrite of sem_lock(), based in part on the code from Mike Galbraith. It removes all gotos and all loops and thus the risk of livelocks. I have tested the patch (together with the next one) on my i3 laptop and it didn't cause any problems. The bug is probably also present in 3.10 and 3.11, but for these kernels it might be simpler just to move the test of sma->complex_count after the spin_is_locked() test. Details of the bug: Assume: - sma->complex_count = 0. - Thread 1: semtimedop(complex op that must sleep) - Thread 2: semtimedop(simple op). Pseudo-Trace: Thread 1: sem_lock(): acquire sem_perm.lock Thread 1: sem_lock(): check for ongoing simple ops Nothing ongoing, thread 2 is still before sem_lock(). Thread 1: try_atomic_semop() <<< preempted. Thread 2: sem_lock(): static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, int nsops) { int locknum; again: if (nsops == 1 && !sma->complex_count) { struct sem *sem = sma->sem_base + sops->sem_num; /* Lock just the semaphore we are interested in. */ spin_lock(&sem->lock); /* * If sma->complex_count was set while we were spinning, * we may need to look at things we did not lock here. */ if (unlikely(sma->complex_count)) { spin_unlock(&sem->lock); goto lock_array; } <<<<<<<<< <<< complex_count is still 0. <<< <<< Here it is preempted <<<<<<<<< Thread 1: try_atomic_semop() returns, notices that it must sleep. Thread 1: increases sma->complex_count. Thread 1: drops sem_perm.lock Thread 2: /* * Another process is holding the global lock on the * sem_array; we cannot enter our critical section, * but have to wait for the global lock to be released. */ if (unlikely(spin_is_locked(&sma->sem_perm.lock))) { spin_unlock(&sem->lock); spin_unlock_wait(&sma->sem_perm.lock); goto again; } <<< sem_perm.lock already dropped, thus no "goto again;" locknum = sops->sem_num; Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Mike Galbraith <bitbucket@online.de> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24ipc: fix race with LSMsDavidlohr Bueso1-16/+18
Currently, IPC mechanisms do security and auditing related checks under RCU. However, since security modules can free the security structure, for example, through selinux_[sem,msg_queue,shm]_free_security(), we can race if the structure is freed before other tasks are done with it, creating a use-after-free condition. Manfred illustrates this nicely, for instance with shared mem and selinux: -> do_shmat calls rcu_read_lock() -> do_shmat calls shm_object_check(). Checks that the object is still valid - but doesn't acquire any locks. Then it returns. -> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat) -> selinux_shm_shmat calls ipc_has_perm() -> ipc_has_perm accesses ipc_perms->security shm_close() -> shm_close acquires rw_mutex & shm_lock -> shm_close calls shm_destroy -> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security) -> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm) -> ipc_free_security calls kfree(ipc_perms->security) This patch delays the freeing of the security structures after all RCU readers are done. Furthermore it aligns the security life cycle with that of the rest of IPC - freeing them based on the reference counter. For situations where we need not free security, the current behavior is kept. Linus states: "... the old behavior was suspect for another reason too: having the security blob go away from under a user sounds like it could cause various other problems anyway, so I think the old code was at least _prone_ to bugs even if it didn't have catastrophic behavior." I have tested this patch with IPC testcases from LTP on both my quad-core laptop and on a 64 core NUMA server. In both cases selinux is enabled, and tests pass for both voluntary and forced preemption models. While the mentioned races are theoretical (at least no one as reported them), I wanted to make sure that this new logic doesn't break anything we weren't aware of. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11ipc: rename ids->rw_mutexDavidlohr Bueso1-12/+12
Since in some situations the lock can be shared for readers, we shouldn't be calling it a mutex, rename it to rwsem. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: rename try_atomic_semop() to perform_atomic_semop(), docu updateManfred Spraul1-11/+21
Cleanup: Some minor points that I noticed while writing the previous patches 1) The name try_atomic_semop() is misleading: The function performs the operation (if it is possible). 2) Some documentation updates. No real code change, a rename and documentation changes. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: replace shared sem_otime with per-semaphore valueManfred Spraul1-6/+31
sem_otime contains the time of the last semaphore operation that completed successfully. Every operation updates this value, thus access from multiple cpus can cause thrashing. Therefore the patch replaces the variable with a per-semaphore variable. The per-array sem_otime is only calculated when required. No performance improvement on a single-socket i3 - only important for larger systems. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: always use only one queue for alter operationsManfred Spraul1-40/+88
There are two places that can contain alter operations: - the global queue: sma->pending_alter - the per-semaphore queues: sma->sem_base[].pending_alter. Since one of the queues must be processed first, this causes an odd priorization of the wakeups: complex operations have priority over simple ops. The patch restores the behavior of linux <=3.0.9: The longest waiting operation has the highest priority. This is done by using only one queue: - if there are complex ops, then sma->pending_alter is used. - otherwise, the per-semaphore queues are used. As a side effect, do_smart_update_queue() becomes much simpler: no more goto logic. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem: separate wait-for-zero and alter tasks into seperate queuesManfred Spraul1-60/+151
Introduce separate queues for operations that do not modify the semaphore values. Advantages: - Simpler logic in check_restart(). - Faster update_queue(): Right now, all wait-for-zero operations are always tested, even if the semaphore value is not 0. - wait-for-zero gets again priority, as in linux <=3.0.9 Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc/sem.c: cacheline align the semaphore structuresManfred Spraul1-1/+1
As now each semaphore has its own spinlock and parallel operations are possible, give each semaphore its own cacheline. On a i3 laptop, this gives up to 28% better performance: #semscale 10 | grep "interleave 2" - before: Cpus 1, interleave 2 delay 0: 36109234 in 10 secs Cpus 2, interleave 2 delay 0: 55276317 in 10 secs Cpus 3, interleave 2 delay 0: 62411025 in 10 secs Cpus 4, interleave 2 delay 0: 81963928 in 10 secs -after: Cpus 1, interleave 2 delay 0: 35527306 in 10 secs Cpus 2, interleave 2 delay 0: 70922909 in 10 secs <<< + 28% Cpus 3, interleave 2 delay 0: 80518538 in 10 secs Cpus 4, interleave 2 delay 0: 89115148 in 10 secs <<< + 8.7% i3, with 2 cores and with hyperthreading enabled. Interleave 2 in order use first the full cores. HT partially hides the delay from cacheline trashing, thus the improvement is "only" 8.7% if 4 threads are running. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: remove unused functionsDavidlohr Bueso1-1/+0
We can now drop the msg_lock and msg_lock_check functions along with a bogus comment introduced previously in semctl_down. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: move locking out of ipcctl_pre_down_nolockDavidlohr Bueso1-11/+16
This function currently acquires both the rw_mutex and the rcu lock on successful lookups, leaving the callers to explicitly unlock them, creating another two level locking situation. Make the callers (including those that still use ipcctl_pre_down()) explicitly lock and unlock the rwsem and rcu lock. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09ipc: close open coded spin lock callsDavidlohr Bueso1-7/+7
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-26ipc/sem.c: Fix missing wakeups in do_smart_update_queue()Manfred Spraul1-5/+22
do_smart_update_queue() is called when an operation (semop, semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check which of the sleeping tasks can proceed. do_smart_update_queue() missed a few wakeups: - if a sleeping complex op was completed, then all per-semaphore queues must be scanned - not only those that were modified by *sops - if a sleeping simple op proceeded, then the global queue must be scanned again And: - the test for "|sops == NULL) before scanning the global queue is not required: If the global queue is empty, then it doesn't need to be scanned - regardless of the reason for calling do_smart_update_queue() The patch is not optimized, i.e. even completing a wait-for-zero operation causes a rescan. This is done to keep the patch as simple as possible. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-09ipc,sem: fix semctl(..., GETNCNT)Rik van Riel1-0/+7
The semctl GETNCNT returns the number of semops waiting for the specified semaphore to become nonzero. After commit 9f1bc2c9022c ("ipc,sem: have only one list in struct sem_queue"), the semops waiting on just one semaphore are waiting on that semaphore's list. In order to return the correct count, we have to walk that list too, in addition to the sem_array's list for complex operations. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-09ipc,sem: fix semctl(..., GETZCNT)Rik van Riel1-0/+7
The semctl GETZCNT returns the number of semops waiting for the specified semaphore to become zero. After commit 9f1bc2c9022c ("ipc,sem: have only one list in struct sem_queue"), the semops waiting on just one semaphore are waiting on that semaphore's list. In order to return the correct count, we have to walk that list too, in addition to the sem_array's list for complex operations. This bug broke dbench; it works again with this patch applied. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Kent Overstreet <koverstreet@google.com> Tested-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: simplify rcu_read_lock() in semctl_nolock()Linus Torvalds1-2/+1
This trivially combines two rcu_read_lock() calls in both sides of a if-statement into one single one in front of the if-statement. Split out as an independent cleanup from the previous commit. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: simplify semtimedop/semctl_main() common error path handlingLinus Torvalds1-27/+14
With various straight RCU lock/unlock movements, one common exit path pattern had become rcu_read_unlock(); goto out_wakeup; and in fact there were no cases where we wanted to exit to out_wakeup _without_ releasing the RCU read lock. So replace that pattern with "goto out_rcu_wakeup", and remove the old out_wakeup. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: move sem_obtain_lock() rcu locking into the only callerLinus Torvalds1-9/+7
sem_obtain_lock() was another of those functions that returned with the RCU lock held for reading in the success case. Move the RCU locking to the caller (semtimedop()), making it more obvious. We already did RCU locking elsewhere in that function. Side note: why does semtimedop() re-do the semphore lookup after the sleep, rather than just getting a reference to the semaphore it already looked up originally? Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: fix double sem unlock in semctl error pathLinus Torvalds1-1/+1
Fix another ipc locking buglet introduced by the scalability patches: when semctl_down() was changed to delay the semaphore locking, one error path for security_sem_semctl() went through the semaphore unlock logic even though the semaphore had never been locked. Introduced by commit 16df3674efe3 ("ipc,sem: do not hold ipc lock more than necessary") Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: move the rcu_read_lock() from sem_lock_and_putref() into callersLinus Torvalds1-2/+3
This is another ipc semaphore locking cleanup, trying to make the locking more straightforward. We move the rcu read locking into the callers of sem_lock_and_putref(), which in general means that we now mostly do the rcu_read_lock() and rcu_read_unlock() in the same function. Mostly. We still have the ipc_addid/newary/freeary mess, and things like ipcctl_pre_down_nolock(). Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: sem_putref() does not need the semaphore lock any moreLinus Torvalds1-3/+1
ipc_rcu_putref() uses atomics for the refcount, and the games to lock and unlock the semaphore just to try to keep the reference counting working are no longer useful. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: move rcu_read_unlock() out of sem_unlock() and into callersLinus Torvalds1-2/+17
The IPC locking is a mess, and sem_unlock() unlocks not only the semaphore spinlock, it also drops the rcu read lock. Unlike sem_lock(), which just gets the spin-lock, and expects the caller to get the rcu read lock. This all makes things very hard to follow, and it's very confusing when you take the rcu read lock in one function, and then release it in another. And it has caused actual bugs: the sem_obtain_lock() function ended up dropping the RCU read lock twice in one error path, because it first did the sem_unlock(), and then did a rcu_read_unlock() to match the rcu_read_lock() it had done. This is just a totally mindless "remove rcu_read_unlock() from sem_unlock() and add it immediately after each caller" (except for the aforementioned bug where we did too many rcu_read_unlock(), and in find_alloc_undo() where we just got the rcu_read_lock() to correct for the fact that sem_unlock would immediately drop it again). We can (and should) clean things up further, but this fixes the bug with the minimal amount of subtlety. Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-02ipc: fix GETALL/IPC_RM race for sysv semaphoresAl Viro1-21/+8
We can step on WARN_ON_ONCE() in sem_getref() if a semaphore is removed just as we are about to call sem_getref() from semctl_main(); results are not pretty. We should fail with -EIDRM, same as if IPC_RM happened while we'd been doing allocation there. This also expands sem_getref() at its only callsite (and fixed there), while sem_getref_and_unlock() is simply killed off - it has no callers at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: fine grained locking for semtimedopRik van Riel1-100/+171
Introduce finer grained locking for semtimedop, to handle the common case of a program wanting to manipulate one semaphore from an array with multiple semaphores. If the call is a semop manipulating just one semaphore in an array with multiple semaphores, only take the lock for that semaphore itself. If the call needs to manipulate multiple semaphores, or another caller is in a transaction that manipulates multiple semaphores, the sem_array lock is taken, as well as all the locks for the individual semaphores. On a 24 CPU system, performance numbers with the semop-multi test with N threads and N semaphores, look like this: vanilla Davidlohr's Davidlohr's + Davidlohr's + threads patches rwlock patches v3 patches 10 610652 726325 1783589 2142206 20 341570 365699 1520453 1977878 30 288102 307037 1498167 2037995 40 290714 305955 1612665 2256484 50 288620 312890 1733453 2650292 60 289987 306043 1649360 2388008 70 291298 306347 1723167 2717486 80 290948 305662 1729545 2763582 90 290996 306680 1736021 2757524 100 292243 306700 1773700 3059159 [davidlohr.bueso@hp.com: do not call sem_lock when bogus sma] [davidlohr.bueso@hp.com: make refcounter atomic] Signed-off-by: Rik van Riel <riel@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Jason Low <jason.low2@hp.com> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Emmanuel Benisty <benisty.e@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: have only one list in struct sem_queueRik van Riel1-31/+34
Having only one list in struct sem_queue, and only queueing simple semaphore operations on the list for the semaphore involved, allows us to introduce finer grained locking for semtimedop. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: open code and rename sem_lockRik van Riel1-6/+23
Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock() later that only locks the sem_array and does nothing else. Open code the locking from ipc_lock() in sem_obtain_lock() so we can introduce finer grained locking for the sem_array in the next patch. [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()] Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: do not hold ipc lock more than necessaryDavidlohr Bueso1-48/+113
Instead of holding the ipc lock for permissions and security checks, among others, only acquire it when necessary. Some numbers.... 1) With Rik's semop-multi.c microbenchmark we can see the following results: Baseline (3.9-rc1): cpus 4, threads: 256, semaphores: 128, test duration: 30 secs total operations: 151452270, ops/sec 5048409 + 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock + 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop + 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags + 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit + 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string + 1.86% a.out [kernel.kallsyms] [k] ipc_lock With this patchset: cpus 4, threads: 256, semaphores: 128, test duration: 30 secs total operations: 273156400, ops/sec 9105213 + 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock + 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop + 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21 + 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags + 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit + 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check 2) While on an Oracle swingbench DSS (data mining) workload the improvements are not as exciting as with Rik's benchmark, we can see some positive numbers. For an 8 socket machine the following are the percentages of %sys time incurred in the ipc lock: Baseline (3.9-rc1): 100 swingbench users: 8,74% 400 swingbench users: 21,86% 800 swingbench users: 84,35% With this patchset: 100 swingbench users: 8,11% 400 swingbench users: 19,93% 800 swingbench users: 77,69% [riel@redhat.com: fix two locking bugs] [sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Chegu Vinod <chegu_vinod@hp.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jason Low <jason.low2@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-05get rid of union semop in sys_semctl(2) argumentsAl Viro1-45/+76
just have the bugger take unsigned long and deal with SETVAL case (when we use an int member in the union) explicitly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03make HAVE_SYSCALL_WRAPPERS unconditionalAl Viro1-2/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-06userns: Convert ipc to use kuid and kgid where appropriateEric W. Biederman1-5/+8
- Store the ipc owner and creator with a kuid - Store the ipc group and the crators group with a kgid. - Add error handling to ipc_update_perms, allowing it to fail if the uids and gids can not be converted to kuids or kgids. - Modify the proc files to display the ipc creator and owner in the user namespace of the opener of the proc file. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-11-02ipc/sem.c: remove private structures from public header fileManfred Spraul1-0/+46
include/linux/sem.h contains several structures that are only used within ipc/sem.c. The patch moves them into ipc/sem.c - there is no need to expose the structures to the whole kernel. No functional changes, only whitespace cleanups and 80-char per line fixes. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02ipc/sem.c: handle spurious wakeupsManfred Spraul1-0/+9
semtimedop() does not handle spurious wakeups, it returns -EINTR to user space. Most other schedule() users would just loop and not return to user space. The patch adds such a loop to semtimedop() Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02ipc/sem.c: fix return code race with semop vs. semop +semctl(IPC_RMID)Manfred Spraul1-1/+0
sys_semtimedop() may return -EIDRM although the semaphore operation completed successfully: thread 1: thread 2: semtimedop(), sleeps semop(): * acquires sem_lock() semtimedop() woken up due to timeout sem_lock() loops * notices that thread 2 could be completed. * performs the operations that thread 2 is sleeping on. * marks the semaphore operation as IN_WAKEUP * drops sem_lock(), does wakeup, sets return code to 0 * thread delayed due to interrupt, whatever * returns to user space * thread still delayed semctl(IPC_RMID) * acquires sem_lock() * ipc_rmid(), ipcp->deleted=1 * drops sem_lock() * thread finally continues - but seem_lock() now fails due to ipcp->deleted == 1 * returns -EIDRM instead of 0 The fix is trivial: Always use the return code in queue.status. In real world, the race probably doesn't matter: If the semaphore array is destroyed, the app is probably not interested if the last operation succeeded or was already cancelled. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMIDManfred Spraul1-2/+11
If a semaphore array is removed and in parallel a sleeping task is woken up (signal or timeout, does not matter), then the woken up task does not wait until wake_up_sem_queue_do() is completed. This will cause crashes, because wake_up_sem_queue_do() will read from a stale pointer. The fix is simple: Regardless of anything, always call get_queue_result(). This function waits until wake_up_sem_queue_do() has finished it's task. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=27142 Reported-by: Yuriy Yevtukhov <yuriy@ucoz.com> Reported-by: Harald Laabs <kernel@dasr.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: <stable@kernel.org> [2.6.35+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-20ipc,rcu: Convert call_rcu(free_un) to kfree_rcu()Lai Jiangshan1-8/+2
The rcu callback free_un() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_un). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-03-31Fix common misspellingsLucas De Marchi1-1/+1
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-23userns: user namespaces: convert several capable() callsSerge E. Hallyn1-4/+6
CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(), because the resource comes from current's own ipc namespace. setuid/setgid are to uids in own namespace, so again checks can be against current_user_ns(). Changelog: Jan 11: Use task_ns_capable() in place of sched_capable(). Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Clarify (hopefully) some logic in futex and sched.c Feb 15: use ns_capable for ipc, not nsown_capable Feb 23: let copy_ipcs handle setting ipc_ns->user_ns Feb 23: pass ns down rather than taking it from current [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-01sys_semctl: fix kernel stack leakageDan Rosenberg1-0/+2
The semctl syscall has several code paths that lead to the leakage of uninitialized kernel stack memory (namely the IPC_INFO, SEM_INFO, IPC_STAT, and SEM_STAT commands) during the use of the older, obsolete version of the semid_ds struct. The copy_semid_to_user() function declares a semid_ds struct on the stack and copies it back to the user without initializing or zeroing the "sem_base", "sem_pending", "sem_pending_last", and "undo" pointers, allowing the leakage of 16 bytes of kernel stack memory. The code is still reachable on 32-bit systems - when calling semctl() newer glibc's automatically OR the IPC command with the IPC_64 flag, but invoking the syscall directly allows users to use the older versions of the struct. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-20ipc/sem.c: bugfix for semop() not reporting successful operationManfred Spraul1-7/+39
The last change to improve the scalability moved the actual wake-up out of the section that is protected by spin_lock(sma->sem_perm.lock). This means that IN_WAKEUP can be in queue.status even when the spinlock is acquired by the current task. Thus the same loop that is performed when queue.status is read without the spinlock acquired must be performed when the spinlock is acquired. Thanks to kamezawa.hiroyu@jp.fujitsu.com for noticing lack of the memory barrier. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16255 [akpm@linux-foundation.org: clean up kerneldoc, checkpatch warning and whitespace] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Luca Tettamanti <kronos.it@gmail.com> Tested-by: Luca Tettamanti <kronos.it@gmail.com> Reported-by: Christoph Lameter <cl@linux-foundation.org> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27ipc/sem.c: use ERR_CASTJulia Lawall1-1/+1
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27ipc/sem.c: update description of the implementationManfred Spraul1-50/+53
ipc/sem.c begins with a 15 year old description about bugs in the initial implementation in Linux-1.0. The patch replaces that with a top level description of the current code. A TODO could be derived from this text: The opengroup man page for semop() does not mandate FIFO. Thus there is no need for a semaphore array list of pending operations. If - this list is removed - the per-semaphore array spinlock is removed (possible if there is no list to protect) - sem_otime is moved into the semaphores and calculated on demand during semctl() then the array would be read-mostly - which would significantly improve scaling for applications that use semaphore arrays with lots of entries. The price would be expensive semctl() calls: for(i=0;i<sma->sem_nsems;i++) spin_lock(sma->sem_lock); <do stuff> for(i=0;i<sma->sem_nsems;i++) spin_unlock(sma->sem_lock); I'm not sure if the complexity is worth the effort, thus here is the documentation of the current behavior first. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27ipc/sem.c: move wake_up_process out of the spinlock sectionManfred Spraul1-32/+91
The wake-up part of semtimedop() consists out of two steps: - the right tasks must be identified. - they must be woken up. Right now, both steps run while the array spinlock is held. This patch reorders the code and moves the actual wake_up_process() behind the point where the spinlock is dropped. The code also moves setting sem->sem_otime to one place: It does not make sense to set the last modify time multiple times. [akpm@linux-foundation.org: repair kerneldoc] [akpm@linux-foundation.org: fix uninitialised retval] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27ipc/sem.c: optimize update_queue() for bulk wakeup callsManfred Spraul1-13/+97
The following series of patches tries to fix the spinlock contention reported by Chris Mason - his benchmark exposes problems of the current code: - In the worst case, the algorithm used by update_queue() is O(N^2). Bulk wake-up calls can enter this worst case. The patch series fix that. Note that the benchmark app doesn't expose the problem, it just should be fixed: Real world apps might do the wake-ups in another order than perfect FIFO. - The part of the code that runs within the semaphore array spinlock is significantly larger than necessary. The patch series fixes that. This change is responsible for the main improvement. - The cacheline with the spinlock is also used for a variable that is read in the hot path (sem_base) and for a variable that is unnecessarily written to multiple times (sem_otime). The last step of the series cacheline-aligns the spinlock. This patch: The SysV semaphore code allows to perform multiple operations on all semaphores in the array as atomic operations. After a modification, update_queue() checks which of the waiting tasks can complete. The algorithm that is used to identify the tasks is O(N^2) in the worst case. For some cases, it is simple to avoid the O(N^2). The patch adds a detection logic for some cases, especially for the case of an array where all sleeping tasks are single sembuf operations and a multi-sembuf operation is used to wake up multiple tasks. A big database application uses that approach. The patch fixes wakeup due to semctl(,,SETALL,) - the initial version of the patch breaks that. [akpm@linux-foundation.org: make do_smart_update() static] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc: remove unreachable code in sem.cAmerigo Wang1-2/+1
This line is unreachable, remove it. [akpm@linux-foundation.org: remove unneeded initialisation of `err'] Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: optimize single sops when semval is zeroManfred Spraul1-0/+11
If multiple simple decrements on the same semaphore are pending, then the current code scans all decrement operations, even if the semaphore value is already 0. The patch optimizes that: if the semaphore value is 0, then there is no need to scan the q->alter entries. Note that this is a common case: It happens if 100 decrements by one are pending and now an increment by one increases the semaphore value from 0 to 1. Without this patch, all 100 entries are scanned. With the patch, only one entry is scanned, then woken up. Then the new rule triggers and the scanning is aborted, without looking at the remaining 99 tasks. With this patch, single sop increment/decrement by 1 are now O(1). (same as with Nick's patch) Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: optimize single semop operationsManfred Spraul1-11/+40
sysv sem has the concept of semaphore arrays that consist out of multiple semaphores. Atomic operations that affect multiple semaphores are supported. The patch optimizes single semaphore operation calls that affect only one semaphore: It's not necessary to scan all pending operations, it is sufficient to scan the per-semaphore list. The idea is from Nick Piggin version of an ipc sem improvement, the implementation is different: The code tries to keep as much common code as possible. As the result, the patch is simpler, but optimizes fewer cases. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: add a per-semaphore pending listManfred Spraul1-5/+32
Based on Nick's findings: sysv sem has the concept of semaphore arrays that consist out of multiple semaphores. Atomic operations that affect multiple semaphores are supported. The patch is the first step for optimizing simple, single semaphore operations: In addition to the global list of all pending operations, a 2nd, per-semaphore list with the simple operations is added. Note: this patch does not make sense by itself, the new list is used nowhere. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: optimize if semops failManfred Spraul1-2/+2
Reduce the amount of scanning of the list of pending semaphore operations: If try_atomic_semop failed, then no changes were applied. Thus no need to restart. Additionally, this patch correct an incorrect comment: It's possible to wait for arbitrary semaphore values (do a dec by <x>, wait-for-zero, inc by <x> in one atomic operation) Both changes are from Nick Piggin, the patch is the result of a different split of the individual changes. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: sem preempt improveNick Piggin1-15/+23
The strange sysv semaphore wakeup scheme has a kind of busy-wait lock involved, which could deadlock if preemption is enabled during the "lock". It is an implementation detail (due to a spinlock being held) that this is actually the case. However if "spinlocks" are made preemptible, or if the sem lock is changed to a sleeping lock for example, then the wakeup would become buggy. So this might be a bugfix for -rt kernels. Imagine waker being preempted by wakee and never clearing IN_WAKEUP -- if wakee has higher RT priority then there is a priority inversion deadlock. Even if there is not a priority inversion to cause a deadlock, then there is still time wasted spinning. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: sem use list operationsNick Piggin1-44/+31
Replace the handcoded list operations in update_queue() with the standard list_for_each_entry macros. list_for_each_entry_safe() must be used, because list entries can disappear immediately uppon the wakeup event. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc/sem.c: sem optimise undo list searchNick Piggin1-6/+20
Around a month ago, there was some discussion about an improvement of the sysv sem algorithm: Most (at least: some important) users only use simple semaphore operations, therefore it's worthwile to optimize this use case. This patch: Move last looked up sem_undo struct to the head of the task's undo list. Attempt to move common entries to the front of the list so search time is reduced. This reduces lookup_undo on oprofile of problematic SAP workload by 30% (see patch 4 for a description of SAP workload). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16ipc ns: fix memory leak (idr)Serge E. Hallyn1-0/+1
We have apparently had a memory leak since 7ca7e564e049d8b350ec9d958ff25eaa24226352 "ipc: store ipcs into IDRs" in 2007. The idr of which 3 exist for each ipc namespace is never freed. This patch simply frees them when the ipcns is freed. I don't believe any idr_remove() are done from rcu (and could therefore be delayed until after this idr_destroy()), so the patch should be safe. Some quick testing showed no harm, and the memory leak fixed. Caught by kmemleak. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-15rculist: use list_entry_rcu in places where it's appropriateJiri Pirko1-2/+2
Use previously introduced list_entry_rcu instead of an open-coded list_entry + rcu_dereference combination. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: dipankar@in.ibm.com LKML-Reference: <20090414181715.GA3634@psychotron.englab.brq.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14[CVE-2009-0029] System call wrappers part 25Heiko Carstens1-4/+5
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14[CVE-2009-0029] System call wrapper special casesHeiko Carstens1-1/+8
System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-06ipc: do not goto to the next lineDenis V. Lunev1-1/+0
Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: WANG Cong <wangcong@zeuux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05mm: update my addressAlan Cox1-1/+1
Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16ipc/sem.c: make free_un() staticAdrian Bunk1-1/+1
Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25ipc/sem.c: rewrite undo list lockingManfred Spraul1-54/+93
The attached patch: - reverses the locking order of ulp->lock and sem_lock: Previously, it was first ulp->lock, then inside sem_lock. Now it's the other way around. - converts the undo structure to rcu. Benefits: - With the old locking order, IPC_RMID could not kfree the undo structures. The stale entries remained in the linked lists and were released later. - The patch fixes a a race in semtimedop(): if both IPC_RMID and a semget() that recreates exactly the same id happen between find_alloc_undo() and sem_lock, then semtimedop() would access already kfree'd memory. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25ipc/sem.c: convert sem_array.sem_pending to struct list_headManfred Spraul1-57/+35
sem_array.sem_pending is a double linked list, the attached patch converts it to struct list_head. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25ipc/sem.c: remove unused entries from struct sem_queueManfred Spraul1-2/+0
sem_queue.sma and sem_queue.id were never used, the attached patch removes them. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25ipc/sem.c: convert undo structures to struct list_headManfred Spraul1-74/+89
The undo structures contain two linked lists, the attached patch replaces them with generic struct list_head lists. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29ipc: sysvsem: implement sys_unshare(CLONE_SYSVSEM)Manfred Spraul1-0/+1
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo lists. Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM) The original reason to not support it was the potential (inevitable?) confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the inverse meaning of clone(CLONE_SYSVSEM). Our two most reasonable options then appear to be (1) fully support CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM, but always do it anyway on unshare(CLONE_SYSVSEM). This patch does (1). Changelog: Apr 16: SEH: switch to Manfred's alternative patch which removes the unshare_semundo() function which always refused CLONE_SYSVSEM. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC: consolidate all xxxctl_down() functionsPierre Peiffer1-38/+4
semctl_down(), msgctl_down() and shmctl_down() are used to handle the same set of commands for each kind of IPC. They all start to do the same job (they retrieve the ipc and do some permission checks) before handling the commands on their own. This patch proposes to consolidate this by moving these same pieces of code into one common function called ipcctl_pre_down(). It simplifies a little these xxxctl_down() functions and increases a little the maintainability. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC: introduce ipc_update_perm()Pierre Peiffer1-4/+1
The IPC_SET command performs the same permission setting for all IPCs. This patch introduces a common ipc_update_perm() function to update these permissions and makes use of it for all IPCs. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC: get rid of the use *_setbuf structure.Pierre Peiffer1-26/+14
All IPCs make use of an intermetiate *_setbuf structure to handle the IPC_SET command. This is not really needed and, moreover, it complicates a little bit the code. This patch gets rid of the use of it and uses directly the semid64_ds/ msgid64_ds/shmid64_ds structure. In addition of removing one struture declaration, it also simplifies and improves a little bit the common 64-bits path. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC/semaphores: remove one unused parameter from semctl_down()Pierre Peiffer1-3/+3
semctl_down() takes one unused parameter: semnum. This patch proposes to get rid of it. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC/semaphores: move the rwmutex handling inside semctl_downPierre Peiffer1-11/+13
semctl_down is called with the rwmutex (the one which protects the list of ipcs) taken in write mode. This patch moves this rwmutex taken in write-mode inside semctl_down. This has the advantages of reducing a little bit the window during which this rwmutex is taken, clarifying sys_semctl, and finally of having a coherent behaviour with [shm|msg]ctl_down Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC/semaphores: code factorisationPierre Peiffer1-30/+31
Trivial patch which adds some small locking functions and makes use of them to factorize some part of the code and to make it cleaner. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29IPC: use ipc_buildid() directly from ipc_addid()Pierre Peiffer1-2/+0
By continuing to consolidate a little the IPC code, each id can be built directly in ipc_addid() instead of having it built from each callers of ipc_addid() And I also remove shm_addid() in order to have, as much as possible, the same code for shm/sem/msg. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08IPC: consolidate sem_exit_ns(), msg_exit_ns() and shm_exit_ns()Pierre Peiffer1-22/+5
sem_exit_ns(), msg_exit_ns() and shm_exit_ns() are all called when an ipc_namespace is released to free all ipcs of each type. But in fact, they do the same thing: they loop around all ipcs to free them individually by calling a specific routine. This patch proposes to consolidate this by introducing a common function, free_ipcs(), that do the job. The specific routine to call on each individual ipcs is passed as parameter. For this, these ipc-specific 'free' routines are reworked to take a generic 'struct ipc_perm' as parameter. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08IPC: make struct ipc_ids static in ipc_namespacePierre Peiffer1-22/+4
Each ipc_namespace contains a table of 3 pointers to struct ipc_ids (3 for msg, sem and shm, structure used to store all ipcs) These 'struct ipc_ids' are dynamically allocated for each icp_namespace as the ipc_namespace itself (for the init namespace, they are initialized with pointers to static variables instead) It is so for historical reason: in fact, before the use of idr to store the ipcs, the ipcs were stored in tables of variable length, depending of the maximum number of ipc allowed. Now, these 'struct ipc_ids' have a fixed size. As they are allocated in any cases for each new ipc_namespace, there is no gain of memory in having them allocated separately of the struct ipc_namespace. This patch proposes to make this table static in the struct ipc_namespace. Thus, we can allocate all in once and get rid of all the code needed to allocate and free these ipc_ids separately. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08IPC/semaphores: consolidate SEM_STAT and IPC_STAT commandsPierre Peiffer1-22/+16
These commands (SEM_STAT and IPC_STAT) are rather doing the same things (only the meaning of the id given as input and the return value differ). However, for the semaphores, they are handled in two different places (two different functions). This patch consolidates this for clarification by handling these both commands in the same place in semctl_nolock(). It also removes one unused parameter for this function. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08namespaces: move the IPC namespace under IPC_NS optionPavel Emelyanov1-0/+3
Currently the IPC namespace management code is spread over the ipc/*.c files. I moved this code into ipc/namespace.c file which is compiled out when needed. The linux/ipc_namespace.h file is used to store the prototypes of the functions in namespace.c and the stubs for NAMESPACES=n case. This is done so, because the stub for copy_ipc_namespace requires the knowledge of the CLONE_NEWIPC flag, which is in sched.h. But the linux/ipc.h file itself in included into many many .c files via the sys.h->sem.h sequence so adding the sched.h into it will make all these .c depend on sched.h which is not that good. On the other hand the knowledge about the namespaces stuff is required in 4 .c files only. Besides, this patch compiles out some auxiliary functions from ipc/sem.c, msg.c and shm.c files. It turned out that moving these functions into namespaces.c is not that easy because they use many other calls and macros from the original file. Moving them would make this patch complicated. On the other hand all these functions can be consolidated, so I will send a separate patch doing this a bit later. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06IPC: fix error check in all new xxx_lock() and xxx_exit_ns() functionsPierre Peiffer1-3/+14
In the new implementation of the [sem|shm|msg]_lock[_check]() routines, we use the return value of ipc_lock() in container_of() without any check. But ipc_lock may return a errcode. The use of this errcode in container_of() may alter this errcode, and we don't want this. And in xxx_exit_ns, the pointer return by idr_find is of type 'struct kern_ipc_per'... Today, the code will work as is because the member used in these container_of() is the first member of its container (offset == 0), the errcode isn't changed then. But in the general case, we can't count on this assumption and this may lead later to a real bug if we don't correct this. Again, the proposed solution is simple and correct. But, as pointed by Nadia, with this solution, the same check will be done several times (in all sub-callers...), what is not very funny/optimal... Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19IPC: fix error case when idr-cache is empty in ipcget()Pierre Peiffer1-2/+2
With the use of idr to store the ipc, the case where the idr cache is empty, when idr_get_new is called (this may happen even if we call idr_pre_get() before), is not well handled: it lets semget()/shmget()/msgget() return ENOSPC when this cache is empty, what 1. does not reflect the facts and 2. does not conform to the man(s). This patch fixes this by retrying the whole process of allocation in this case. Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19IPC: cleanup some code and wrong comments about semundo list managmentPierre Peiffer1-40/+6
Some comments about sem_undo_list seem wrong. About the comment above unlock_semundo: "... If task2 now exits before task1 releases the lock (by calling unlock_semundo()), then task1 will never call spin_unlock(). ..." This is just wrong, I see no reason for which task1 will not call spin_unlock... The rest of this comment is also wrong... Unless I miss something (of course). Finally, (un)lock_semundo functions are useless, so remove them for simplification. (this avoids an useless if statement) Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19ipc: remove unneeded parametersNadia Derbey1-6/+4
Remvoe the unneeded parameters from ipc_checkid() and ipc_buildid() interfaces. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19fix idr_find() lockingNadia Derbey1-14/+30
This is a patch that fixes the way idr_find() used to be called in ipc_lock(): in all the paths that don't imply an update of the ipcs idr, it was called without the idr tree being locked. The changes are: . in ipc_ids, the mutex has been changed into a reader/writer semaphore. . ipc_lock() now takes the mutex as a reader during the idr_find(). . a new routine ipc_lock_down() has been defined: it doesn't take the mutex, assuming that it is being held by the caller. This is the routine that is now called in all the update paths. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Jarek Poplawski <jarkao2@o2.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19ipc: fix wrong commentsNadia Derbey1-0/+14
This patch fixes the wrong / obsolete comments in the ipc code. Also adds a missing lock around ipc_get_maxid() in shm_get_stat(). Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Storing ipcs into IDRsNadia Derbey1-6/+17
This patch converts casts of struct kern_ipc_perm to . struct msg_queue . struct sem_array . struct shmid_kernel into the equivalent container_of() macro. It improves code maintenance because the code need not change if kern_ipc_perm is no longer at the beginning of the containing struct. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>