path: root/lib/rhashtable.c
Age    Commit message    Author    Files    Lines
2025-09-13mm/slub: allow to set node and align in k[v]reallocVitaly Wool1-2/+2
Reimplement k[v]realloc_node() to be able to set node and alignment should a user need to do so. In order to do that while retaining maximal backward compatibility, add k[v]realloc_node_align() functions and redefine the rest of the API using these new ones. While doing that, we also keep the number of _noprof variants to a minimum, which implies some changes to the existing users of older _noprof functions, in practice just bcachefs. With that change we also provide the ability for the Rust part of the kernel to set node and alignment in its K[v]xxx [re]allocations. Link: https://lkml.kernel.org/r/20250806124147.1724658-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
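A minimal call-site sketch, assuming the kvrealloc_node_align(ptr, size, align, flags, nid) argument order implied by the changelog; the surrounding variable names are illustrative:

    /* Grow buf on NUMA node nid with cacheline alignment. The argument
     * order here is an assumption based on the changelog, not verified
     * against the final tree. */
    buf = kvrealloc_node_align(buf, new_size, SMP_CACHE_BYTES,
                               GFP_KERNEL, nid);
    if (!buf)
        return -ENOMEM;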
2025-01-26Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds1-1/+1
Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein performs some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issue in nilfs2 when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ...
2025-01-19rhashtable: Fix rhashtable_try_insert testHerbert Xu1-5/+7
The test on whether rhashtable_insert_one did an insertion relies on the value returned by rhashtable_lookup_one. Unfortunately that value is overwritten after rhashtable_insert_one returns. Fix this by moving the test before data gets overwritten. Simplify the test as only data == NULL matters. Finally move atomic_inc back within the lock as otherwise it may be reordered with the atomic_dec on the removal side, potentially leading to an underflow. Reported-by: Michael Kelley <mhklinux@outlook.com> Fixes: e1d3422c95f0 ("rhashtable: Fix potential deadlock by moving schedule_work outside lock") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Breno Leitao <leitao@debian.org> Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-01-12lib/rhashtable: fix the typo for preemptiblePratyush Mittal1-1/+1
Fix the spelling of the mis-spelled word Link: https://lkml.kernel.org/r/20241123102929.11660-1-pratyushmittal@gmail.com Signed-off-by: Pratyush Mittal <pratyushmittal@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-21rhashtable: Fix potential deadlock by moving schedule_work outside lockBreno Leitao1-4/+6
Move the hash table growth check and work scheduling outside the rht lock to prevent a possible circular locking dependency. The original implementation could trigger a lockdep warning due to a potential deadlock scenario involving nested locks between rhashtable bucket, rq lock, and dsq lock. By relocating the growth check and work scheduling after releasing the rht lock, we break this potential deadlock chain. This change expands the flexibility of rhashtable by removing restrictive locking that previously limited its use in scheduler and workqueue contexts. It is important to note that this calls rht_grow_above_75(), which reads from struct rhashtable without holding the lock; if this is a problem, we can move the check back inside the lock and schedule the workqueue after the lock is released. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Modified so that atomic_inc is also moved outside of the bucket lock along with the growth above 75% check. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
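Combined with the follow-up fix above (2025-01-19), the insert path ends up shaped roughly like this sketch (not the exact upstream diff):

    flags = rht_lock(tbl, bkt);
    /* ... link the new object into the bucket chain ... */
    atomic_inc(&ht->nelems);    /* moved back inside the lock by the later fix */
    rht_unlock(tbl, bkt, flags);

    /* growth check and work scheduling now run outside the bucket lock,
     * breaking the bucket-lock -> rq-lock -> dsq-lock dependency chain */
    if (rht_grow_above_75(ht, tbl))
        schedule_work(&ht->run_work);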
2024-09-01lib/rhashtable: cleanup fallback check in bucket_table_alloc()Davidlohr Bueso1-1/+1
Upon allocation failure, the current check with the nofail bits is unnecessary, and further stands in the way of discouraging direct use of __GFP_NOFAIL. Remove this and replace it with the proper way of determining whether a non-blocking allocation is being done for the nested table case. Link: https://lkml.kernel.org/r/20240806153927.184515-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25rhashtable: plumb through alloc tagKent Overstreet1-8/+14
This gives better memory allocation profiling results; rhashtable allocations will be accounted to the code that initialized the rhashtable. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-32-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09rhashtable: Allow rhashtable to be used from irq-safe contextsTejun Heo1-6/+10
rhashtable currently only does bh-safe synchronization, making it impossible to use from irq-safe contexts. Switch it to use irq-safe synchronization to remove the restriction. v2: Update the lock functions to return the ulong flags value and unlock functions to take the value directly instead of passing around the pointer. Suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
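The v2 shape described above, approximately (lockdep annotations omitted; a sketch of the pattern, not the verbatim upstream helpers):

    static inline unsigned long rht_lock(struct bucket_table *tbl,
                                         struct rhash_lock_head __rcu **bkt)
    {
        unsigned long flags;

        local_irq_save(flags);                  /* irq-safe instead of bh-safe */
        bit_spin_lock(0, (unsigned long *)bkt);
        return flags;                           /* returned by value, per Linus */
    }

    static inline void rht_unlock(struct bucket_table *tbl,
                                  struct rhash_lock_head __rcu **bkt,
                                  unsigned long flags)
    {
        bit_spin_unlock(0, (unsigned long *)bkt);
        local_irq_restore(flags);
    }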
2021-07-08lib: fix spelling mistakesZhen Lei1-1/+1
Fix some spelling mistakes in comments:

    permanentely ==> permanently
    wont ==> won't
    remaning ==> remaining
    succed ==> succeed
    shouldnt ==> shouldn't
    alpha-numeric ==> alphanumeric
    storeing ==> storing
    funtion ==> function
    documenation ==> documentation
    Determin ==> Determine
    intepreted ==> interpreted
    ammount ==> amount
    obious ==> obvious
    interupts ==> interrupts
    occured ==> occurred
    asssociated ==> associated
    taking into acount ==> taking into account
    squence ==> sequence
    stil ==> still
    contiguos ==> contiguous
    matchs ==> matches

Link: https://lkml.kernel.org/r/20210607072555.12416-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-28rhashtable: Restore RCU marking on rhash_lock_headHerbert Xu1-19/+16
This patch restores the RCU marking on bucket_table->buckets as it really does need RCU protection. Its removal had led to a fatal bug. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-06rhashtable: Drop raw RCU deref in nested_table_freeHerbert Xu1-4/+13
This patch replaces some unnecessary uses of rcu_dereference_raw in the rhashtable code with rcu_dereference_protected. The top-level nested table entry is only marked as RCU because it shares the same type as the tree entries underneath it. So it doesn't need any RCU protection. We also don't need RCU protection when we're freeing a nested RCU table because by this stage we've long passed a memory barrier when anyone could change the nested table. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner1-4/+1
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16rhashtable: Fix cmpxchg RCU warningsHerbert Xu1-2/+3
As cmpxchg is a non-RCU mechanism it will cause sparse warnings when we use it for RCU. This patch adds explicit casts to silence those warnings. This should probably be moved to RCU itself in future. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16rhashtable: Remove RCU marking from rhash_lock_headHerbert Xu1-14/+14
The opaque type rhash_lock_head should not be marked with __rcu because it can never be dereferenced. We should apply the RCU marking when we turn it into a pointer which can be dereferenced. This patch does exactly that. This fixes a number of sparse warnings as well as getting rid of some unnecessary RCU checking. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rhashtable: use BIT(0) for locking.NeilBrown1-1/+1
As reported by Guenter Roeck, the new bit-locking using BIT(1) doesn't work on the m68k architecture. m68k only requires 2-byte alignment for words and longwords, so there is only one unused bit in pointers to structs - we currently use two, one for the NULLS marker at the end of the linked list, and one for the bit-lock in the head of the list. The two uses don't need to conflict as we never need the head of the list to be a NULLS marker - the marker is only needed to check if an object has moved to a different table, and the bucket head cannot move. The NULLS marker is only needed in a ->next pointer. As we already have different types for the bucket head pointer (struct rhash_lock_head) and the ->next pointers (struct rhash_head), it is fairly easy to treat the lsb differently in each. So: Initialize bucket heads to NULL, and use the lsb for locking. When loading the pointer from the bucket head, if it is NULL (ignoring the lock bit), report as being the expected NULLS marker. When storing a value into a bucket head, if it is a NULLS marker, store NULL instead. And convert all places that used bit 1 for locking, to use bit 0. Fixes: 8f0db018006a ("rhashtable: use bit_spin_locks to protect hash bucket.") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
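A sketch of the two translations this describes (bkt points at the bucket head; not the exact upstream helpers):

    /* load: ignore the bit-0 lock, and read a NULL head as the expected
     * NULLS marker for this bucket */
    p = (unsigned long)READ_ONCE(*bkt) & ~BIT(0);
    head = p ? (struct rhash_head *)p : RHT_NULLS_MARKER(bkt);

    /* store: never write a NULLS marker into the bucket head, use NULL */
    if (rht_is_a_nulls(obj))
        obj = NULL;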
2019-04-12rhashtable: replace rht_ptr_locked() with rht_assign_locked()NeilBrown1-3/+3
The only times rht_ptr_locked() is used, it is to store a new value in a bucket-head. This is the only time it makes sense to use it too. So replace it by a function which does the whole task: Sets the lock bit and assigns to a bucket head. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rhashtable: move dereference inside rht_ptr()NeilBrown1-6/+6
Rather than dereferencing a pointer to a bucket and then passing the result to rht_ptr(), we now pass in the pointer and do the dereference in rht_ptr(). This requires that we pass in the tbl and hash as well to support RCU checks, and means that the various rht_for_each functions can expect a pointer that can be dereferenced without further care. There are two places where we dereference a bucket pointer where there is no testable protection - in each case we know that we must have exclusive access without having taken a lock. The previous code used rht_dereference() to pretend that holding the mutex provided protection, but holding the mutex never provides protection for accessing buckets. So instead introduce rht_ptr_exclusive() that can be used when there is known to be exclusive access without holding any locks. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rhashtable: fix some __rcu annotation errorsNeilBrown1-2/+2
With these annotations, the rhashtable now gets no warnings when compiled with "C=1" for sparse checking. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rhashtable: use struct_size() in kvzalloc()Gustavo A. R. Silva1-2/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example:

    struct foo {
        int stuff;
        struct boo entry[];
    };

    size = sizeof(struct foo) + count * sizeof(struct boo);
    instance = kvzalloc(size, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper:

    instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: add lockdep tracking to bucket bit-spin-locks.NeilBrown1-6/+9
Native bit_spin_locks are not tracked by lockdep. The bit_spin_locks used for rhashtable buckets are local to the rhashtable implementation, so there is little opportunity for the sort of misuse that lockdep might detect. However locks are held while a hash function or compare function is called, and if one of these took a lock, a misbehaviour is possible. As it is quite easy to add lockdep support this unlikely possibility seems to be enough justification. So create a lockdep class for bucket bit_spin_lock and attach through a lockdep_map in each bucket_table. Without the 'nested' annotation in rhashtable_rehash_one(), lockdep correctly reports a possible problem as this lock is taken while another bucket lock (in another table) is held. This confirms that the added support works. With the correct nested annotation in place, lockdep reports no problems. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: use bit_spin_locks to protect hash bucket.NeilBrown1-71/+70
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the bucket pointer to lock the hash chain for that bucket. The benefits of a bit spin_lock are: - no need to allocate a separate array of locks. - no need to have a configuration option to guide the choice of the size of this array - locking cost is often a single test-and-set in a cache line that will have to be loaded anyway. When inserting at, or removing from, the head of the chain, the unlock is free - writing the new address in the bucket head implicitly clears the lock bit. For __rhashtable_insert_fast() we ensure this always happens when adding a new key. - even when locking costs 2 updates (lock and unlock), they are in a cacheline that needs to be read anyway. The cost of using a bit spin_lock is a little bit of code complexity, which I think is quite manageable. Bit spin_locks are sometimes inappropriate because they are not fair - if multiple CPUs repeatedly contend on the same lock, one CPU can easily be starved. This is not a credible situation with rhashtable. Multiple CPUs may want to repeatedly add or remove objects, but they will typically do so at different buckets, so they will attempt to acquire different locks. As we have more bit-locks than we previously had spinlocks (by at least a factor of two) we can expect slightly less contention to go with the slightly better cache behavior and reduced memory consumption. To enhance type checking, a new struct is introduced to represent the pointer plus lock-bit that is stored in the bucket-table. This is "struct rhash_lock_head" and is empty. A pointer to this needs to be cast to either an unsigned long, or a "struct rhash_head *" to be useful. Variables of this type are most often called "bkt". Previously "pprev" would sometimes point to a bucket, and sometimes a ->next pointer in an rhash_head. As these are now different types, pprev is NULL when it would have pointed to the bucket. In that case, 'bkt' is used, together with correct locking protocol. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: allow rht_bucket_var to return NULL.NeilBrown1-9/+20
Rather than returning a pointer to a static nulls, rht_bucket_var() now returns NULL if the bucket doesn't exist. This will make the next patch, which stores a bitlock in the bucket pointer, somewhat cleaner. This change involves introducing __rht_bucket_nested() which is like rht_bucket_nested(), but doesn't provide the static nulls, and changing rht_bucket_nested() to call this and possibly provide a static nulls - as is still needed for the non-var case. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: use cmpxchg() in nested_table_alloc()NeilBrown1-3/+5
nested_table_alloc() relies on the fact that there is at most one spinlock allocated for every slot in the top level nested table, so it is not possible for two threads to try to allocate the same table at the same time. This assumption is a little fragile (it is not explicit) and is unnecessary as cmpxchg() can be used instead. A future patch will replace the spinlocks by per-bucket bitlocks, and then we won't be able to protect the slot pointer with a spinlock. So replace rcu_assign_pointer() with cmpxchg() - which has equivalent barrier properties. If the cmpxchg fails, free the table that was just allocated. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
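The allocation idiom this introduces, roughly (slot is a hypothetical stand-in for the nested-table slot pointer; the upstream code operates on union nested_table):

    new = kzalloc(PAGE_SIZE, GFP_ATOMIC);
    if (new && cmpxchg(slot, NULL, new) != NULL) {
        /* lost the race: free our copy and use the winner's table */
        kfree(new);
        new = *slot;
    }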
2019-03-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+6
2019-03-21rhashtable: rename rht_for_each*continue as *from.NeilBrown1-1/+1
The pattern set by list.h is that for_each..continue() iterators start at the next entry after the given one, while for_each..from() iterators start at the given entry. The rht_for_each*continue() iterators are documented as though they start at the 'next' entry, but actually start at the given entry, and they are used expecting that behaviour. So fix the documentation and change the names to *from for consistency with list.h Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21rhashtable: don't hold lock on first table throughout insertion.NeilBrown1-36/+16
rhashtable_try_insert() currently holds a lock on the bucket in the first table, while also locking buckets in subsequent tables. This is unnecessary and looks like a hold-over from some earlier version of the implementation. As insert and remove always lock a bucket in each table in turn, and as insert only inserts in the final table, there cannot be any races that are not covered by simply locking a bucket in each table in turn. When an insert call reaches that last table it can be sure that there is no matching entry in any other table as it has searched them all, and insertion never happens anywhere but in the last table. The fact that code tests for the existence of future_tbl while holding a lock on the relevant bucket ensures that two threads inserting the same key will make compatible decisions about which is the "last" table. This simplifies the code and allows the ->rehash field to be discarded. We still need a way to ensure that a dead bucket_table is never re-linked by rhashtable_walk_stop(). This can be achieved by calling call_rcu() inside the locked region, and checking with rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the bucket table is empty and dead. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
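The walk-stop check described in the last paragraph, sketched (bucket_table_free_rcu() is the table free callback in lib/rhashtable.c; the surrounding logic is approximated):

    /* in rhashtable_walk_stop(): only re-link the walker if the table
     * has not already been handed to call_rcu() for freeing */
    if (rcu_head_after_call_rcu(&tbl->rcu, bucket_table_free_rcu))
        iter->walker.tbl = NULL;
    else
        list_add(&iter->walker.list, &tbl->walkers);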
2019-03-21rhashtable: Still do rehash when we get EEXISTHerbert Xu1-2/+6
As it stands if a shrink is delayed because of an outstanding rehash, we will go into a rescheduling loop without ever doing the rehash. This patch fixes this by still carrying out the rehash and then rescheduling so that we can shrink after the completion of the rehash should it still be necessary. The return value of EEXIST captures this case and other cases (e.g., another thread expanded/rehashed the table at the same time) where we should still proceed with the rehash. Fixes: da20420f83ea ("rhashtable: Add nested tables") Reported-by: Josh Elsasser <jelsasser@appneta.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Josh Elsasser <jelsasser@appneta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-22rhashtable: Remove obsolete rhashtable_walk_init functionHerbert Xu1-1/+1
The rhashtable_walk_init function has been obsolete for more than two years. This patch finally converts its last users over to rhashtable_walk_enter and removes it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-03rhashtable: detect when object movement between tables might have invalidated a lookupNeilBrown1-3/+5
Some users of rhashtables might need to move an object from one table to another - this appears to be the reason for the incomplete usage of NULLS markers. To support these, we store a unique NULLS_MARKER at the end of each chain, and when a search fails to find a match, we check if the NULLS marker found was the expected one. If not, the search may not have examined all objects in the target bucket, so it is repeated. The unique NULLS_MARKER is derived from the address of the head of the chain. As this cannot be derived at load-time the static rhnull in rht_bucket_nested() needs to be initialised at run time. Any caller of a lookup function must still be prepared for the possibility that the object returned is in a different table - it might have been there for some time. Note that this does NOT provide support for other uses of NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing the key of an object and re-inserting it in the same table. These could only be done safely if new objects were inserted at the *start* of a hash chain, and that is not currently the case. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
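The retry this adds to the lookup path, sketched (first_entry(), next_entry() and key_matches() are hypothetical stand-ins for the real chain-walk and compare logic):

    head = rht_bucket(tbl, hash);
    do {
        for (he = first_entry(head); !rht_is_a_nulls(he); he = next_entry(he))
            if (key_matches(he, key))
                return he;
        /* the chain ended on an unexpected NULLS marker: the object may
         * have moved to a different chain mid-walk, so rescan the bucket */
    } while (he != RHT_NULLS_MARKER(head));
    return NULL;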
2018-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+0
Pull networking fixes from David Miller: 1) ICE, E1000, IGB, IXGBE, and I40E bug fixes from the Intel folks. 2) Better fix for AB-BA deadlock in packet scheduler code, from Cong Wang. 3) bpf sockmap fixes (zero sized key handling, etc.) from Daniel Borkmann. 4) Send zero IPID in TCP resets and SYN-RECV state ACKs, to prevent attackers using it as a side-channel. From Eric Dumazet. 5) Memory leak in mediatek bluetooth driver, from Gustavo A. R. Silva. 6) Hook up rt->dst.input of ipv6 anycast routes properly, from Hangbin Liu. 7) hns and hns3 bug fixes from Huazhong Tan. 8) Fix RIF leak in mlxsw driver, from Ido Schimmel. 9) iova range check fix in vhost, from Jason Wang. 10) Fix hang in do_tcp_sendpages() with tls, from John Fastabend. 11) More r8152 chips need to disable RX aggregation, from Kai-Heng Feng. 12) Memory exposure in TCA_U32_SEL handling, from Kees Cook. 13) TCP BBR congestion control fixes from Kevin Yang. 14) hv_netvsc, ignore non-PCI devices, from Stephen Hemminger. 15) qed driver fixes from Tomer Tayar. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits) net: sched: Fix memory exposure from short TCA_U32_SEL qed: fix spelling mistake "comparsion" -> "comparison" vhost: correctly check the iova range when waking virtqueue qlge: Fix netdev features configuration. net: macb: do not disable MDIO bus at open/close time Revert "net: stmmac: fix build failure due to missing COMMON_CLK dependency" net: macb: Fix regression breaking non-MDIO fixed-link PHYs mlxsw: spectrum_switchdev: Do not leak RIFs when removing bridge i40e: fix condition of WARN_ONCE for stat strings i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled ixgbe: fix driver behaviour after issuing VFLR ixgbe: Prevent unsupported configurations with XDP ixgbe: Replace GFP_ATOMIC with GFP_KERNEL igb: Replace mdelay() with msleep() in igb_integrated_phy_loopback() igb: Replace GFP_ATOMIC with GFP_KERNEL in igb_sw_init() igb: Use an advanced ctx descriptor for launchtime e1000: ensure to free old tx/rx rings in set_ringparam() e1000: check on netif_running() before calling e1000_up() ixgb: use dma_zalloc_coherent instead of allocator/memset ice: Trivial formatting fixes ...
2018-08-22lib/rhashtable: guarantee initial hashtable allocationDavidlohr Bueso1-3/+11
rhashtable_init() may fail due to -ENOMEM, thus making the entire api unusable. This patch removes this scenario, however unlikely. In order to guarantee memory allocation, this patch always ends up doing GFP_KERNEL|__GFP_NOFAIL for both the tbl as well as alloc_bucket_spinlocks(). Upon the first table allocation failure, we shrink the size to the smallest value that makes sense and retry with __GFP_NOFAIL semantics. With the defaults, this means that from 64 buckets, we retry with only 4. Any later issues regarding performance due to collisions or larger table resizing (when more memory becomes available) is the least of our problems. Link: http://lkml.kernel.org/r/20180712185241.4017-9-manfred@colorfullife.com Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
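The fallback described above looks roughly like this (close to, but not verbatim, the upstream diff):

    tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
    if (unlikely(tbl == NULL)) {
        /* shrink to the smallest sensible size and retry with NOFAIL */
        size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE);
        tbl = bucket_table_alloc(ht, size, GFP_KERNEL | __GFP_NOFAIL);
    }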
2018-08-22lib/rhashtable: simplify bucket_table_alloc()Davidlohr Bueso1-5/+2
As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags") we can simplify the caller and trust kvzalloc() to just do the right thing. For the case of the GFP_ATOMIC context, we can drop the __GFP_NORETRY flag for obvious reasons, and for the __GFP_NOWARN case, however, it is changed such that the caller passes the flag instead of making bucket_table_alloc() handle it. This slightly changes the gfp flags passed on to nested_table_alloc() as it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this a positive consequence as for the same reasons we want nowarn semantics in bucket_table_alloc(). [manfred@colorfullife.com: commit id extended to 12 digits, line wraps updated] Link: http://lkml.kernel.org/r/20180712185241.4017-8-manfred@colorfullife.com Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-20rhashtable: remove duplicated include from rhashtable.cYue Haibing1-1/+0
Remove duplicated include. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller1-8/+19
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18lib/rhashtable: consider param->min_size when setting initial table sizeDavidlohr Bueso1-6/+11
rhashtable_init() currently does not take into account the user-passed min_size parameter unless param->nelem_hint is set as well. As such, the default size (number of buckets) will always be HASH_DEFAULT_SIZE even if the smallest allowed size is larger than that. Remediate this by unconditionally calling into rounded_hashtable_size() and handling things accordingly. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09rhashtable: add restart routine in rhashtable_free_and_destroy()Taehee Yoo1-1/+7
rhashtable_free_and_destroy() cancels the re-hash deferred work and then walks and destroys elements. At this moment, some elements can still be in future_tbl, and those elements are not destroyed. Test case: nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy all elements of sets before destroying sets and chains. But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl, so the splat below occurred. Test script:

    %cat test.nft
    table ip aa {
            map map1 {
                    type ipv4_addr : verdict;
                    elements = { 0 : jump a0, 1 : jump a0, 2 : jump a0,
                                 3 : jump a0, 4 : jump a0, 5 : jump a0,
                                 6 : jump a0, 7 : jump a0, 8 : jump a0,
                                 9 : jump a0, }
            }
            chain a0 {
            }
    }
    flush ruleset
    table ip aa {
            map map1 {
                    type ipv4_addr : verdict;
                    elements = { 0 : jump a0, 1 : jump a0, 2 : jump a0,
                                 3 : jump a0, 4 : jump a0, 5 : jump a0,
                                 6 : jump a0, 7 : jump a0, 8 : jump a0,
                                 9 : jump a0, }
            }
            chain a0 {
            }
    }
    flush ruleset
    %while :; do nft -f test.nft; done

Splat looks like:

    [ 200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
    [ 200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
    [ 200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
    [ 200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
    [ 200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
    [ 200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
    [ 200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
    [ 200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
    [ 200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
    [ 200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
    [ 200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
    [ 200.906354] FS: 00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
    [ 200.915533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
    [ 200.930353] Call Trace:
    [ 200.932351] ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
    [ 200.939525] ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
    [ 200.947525] ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
    [ 200.952383] ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
    [ 200.959532] ? nla_parse+0xab/0x230
    [ 200.963529] ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
    [ 200.968384] ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
    [ 200.975525] ? debug_show_all_locks+0x290/0x290
    [ 200.980363] ? debug_show_all_locks+0x290/0x290
    [ 200.986356] ? sched_clock_cpu+0x132/0x170
    [ 200.990352] ? find_held_lock+0x39/0x1b0
    [ 200.994355] ? sched_clock_local+0x10d/0x130
    [ 200.999531] ? memset+0x1f/0x40

V2: - free all tables requested by Herbert Xu Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-03lib: rhashtable: Correct self-assignment in rhashtable.cRishabh Bhatnagar1-1/+1
In file lib/rhashtable.c, line 777, the skip variable is assigned to itself. The following error was observed: lib/rhashtable.c:777:41: warning: explicitly assigning value of variable of type 'int' to itself [-Wself-assign] error, forbidden warning: rhashtable.c:777 This error was found when compiling with Clang 6.0. Change it to iter->skip. Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22rhashtable: clean up dereference of ->future_tbl.NeilBrown1-5/+4
Using rht_dereference_bucket() to dereference ->future_tbl looks like a type error, and could be confusing. Using rht_dereference_rcu() to test a pointer for NULL adds an unnecessary barrier - rcu_access_pointer() is preferred for NULL tests when no lock is held. This uses 3 different ways to access ->future_tbl. - if we know the mutex is held, use rht_dereference() - if we don't hold the mutex, and are only testing for NULL, use rcu_access_pointer() - otherwise (using RCU protection for true dereference), use rht_dereference_rcu(). Note that this includes a simplification of the call to rhashtable_last_table() - we don't do an extra dereference before the call any more. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22rhashtable: use cmpxchg() to protect ->future_tbl.NeilBrown1-11/+4
Rather than borrowing one of the bucket locks to protect ->future_tbl updates, use cmpxchg(). This gives more freedom to change how bucket locking is implemented. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
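The publication step this enables, as a sketch:

    /* attach new_tbl as old_tbl's successor unless another thread
     * already installed one; cmpxchg supplies the ordering that
     * rcu_assign_pointer() previously provided */
    if (cmpxchg(&old_tbl->future_tbl, NULL, new_tbl) != NULL)
        return -EEXIST;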
2018-06-22rhashtable: simplify nested_table_alloc() and rht_bucket_nested_insert()NeilBrown1-12/+6
Now that we don't use the hash value or shift in nested_table_alloc() there is room for simplification. We only need to pass an "is this a leaf" flag to nested_table_alloc(), and don't need to track as much information in rht_bucket_nested_insert(). Note there is another minor cleanup in nested_table_alloc() here. The number of elements in a page of "union nested_tables" is most naturally PAGE_SIZE / sizeof(ntbl[0]). The previous code had PAGE_SIZE / sizeof(ntbl[0].bucket), which happens to be the correct value only because the bucket uses all the space in the union. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22rhashtable: simplify INIT_RHT_NULLS_HEAD()NeilBrown1-9/+6
The 'ht' and 'hash' arguments to INIT_RHT_NULLS_HEAD() are no longer used - so drop them. This allows us to also remove the nhash argument from nested_table_alloc(). Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22rhashtable: remove nulls_base and related code.NeilBrown1-8/+0
This "feature" is unused, undocumented, and untested and so doesn't really belong. A patch is under development to properly implement support for detecting when a search gets diverted down a different chain, which the common purpose of nulls markers. This patch actually fixes a bug too. The table resizing allows a table to grow to 2^31 buckets, but the hash is truncated to 27 bits - any growth beyond 2^27 is wasteful an ineffective. This patch results in NULLS_MARKER(0) being used for all chains, and leaves the use of rht_is_a_null() to test for it. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-22rhashtable: split rhashtable.hNeilBrown1-0/+1
Due to the use of rhashtables in net namespaces, rhashtable.h is included in lots of the kernel, so a small change can require a large recompilation. This makes development painful. This patch splits out rhashtable-types.h which just includes the major type declarations, and does not include (non-trivial) inline code. rhashtable.h is no longer included by anything in the include/ directory. Common include files only include rhashtable-types.h so a large recompilation is only triggered when that changes. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24rhashtable: improve rhashtable_walk stability when stop/start used.NeilBrown1-3/+41
When a walk of an rhashtable is interrupted with rhashtable_walk_stop() and then rhashtable_walk_start(), the location to restart from is based on a 'skip' count in the current hash chain, and this can be incorrect if insertions or deletions have happened. This does not happen when the walk is not stopped and started as iter->p is a placeholder which is safe to use while holding the RCU read lock. In rhashtable_walk_start() we can revalidate that 'p' is still in the same hash chain. If it isn't then the current method is still used. With this patch, if a rhashtable walker ensures that the current object remains in the table over a stop/start period (possibly by elevating the reference count if that is sufficient), it can be sure that a walk will not miss objects that were in the hashtable for the whole time of the walk. rhashtable_walk_start() may not find the object even though it is still in the hashtable if a rehash has moved it to a new table. In this case it will (eventually) get -EAGAIN and will need to proceed through the whole table again to be sure to see everything at least once. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24rhashtable: reset iter when rhashtable_walk_start sees new tableNeilBrown1-0/+2
The documentation claims that when rhashtable_walk_start_check() detects a resize event, it will rewind back to the beginning of the table. This is not true. We need to set ->slot and ->skip to be zero for it to be true. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter()NeilBrown1-2/+3
Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though they do take a spinlock without irq protection. So revise the comments to accurately state the contexts in which these functions can be called. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31rhashtable: add schedule pointsEric Dumazet1-0/+2
Rehashing and destroying large hash table takes a lot of time, and happens in process context. It is safe to add cond_resched() in rhashtable_rehash_table() and rhashtable_free_and_destroy() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07rhashtable: Fix rhlist duplicates insertionPaul Blakey1-1/+3
When inserting duplicate objects (those with the same key), current rhlist implementation messes up the chain pointers by updating the bucket pointer to the newly inserted node instead of the previous node's next pointer. This causes missing elements on removal and traversal. Fix that by properly updating the pprev pointer to point to the correct rhash_head next pointer. Issue: 1241076 Change-Id: I86b2c140bcb4aeb10b70a72a267ff590bb2b17e7 Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface') Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11rhashtable: Call library function alloc_bucket_locksTom Herbert1-39/+8
To allocate the array of bucket locks for the hash table we now call library function alloc_bucket_spinlocks. This function is based on the old alloc_bucket_locks in rhashtable and should produce the same effect. Signed-off-by: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11rhashtable: Add rhashtable_walk_peekTom Herbert1-16/+87
This function is like rhashtable_walk_next except that it only returns the current element in the iter and does not advance the iter. This patch also creates __rhashtable_walk_find_next. It finds the next element in the table when the entry cached in iter is NULL or at the end of a slot. __rhashtable_walk_find_next is called from rhashtable_walk_next and rhashtable_walk_peek. end_of_table is an added field to the iter structure. This indicates that the end of table was reached (walker.tbl being NULL is not a sufficient condition for end of table). Signed-off-by: Tom Herbert <tom@quantonium.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
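A usage sketch of the peek/next pair (error handling, such as the possible -EAGAIN return, is omitted):

    rhashtable_walk_enter(ht, &iter);
    rhashtable_walk_start(&iter);

    obj = rhashtable_walk_peek(&iter);  /* current element, iter not advanced */
    obj = rhashtable_walk_next(&iter);  /* same element, iter now advanced */

    rhashtable_walk_stop(&iter);
    rhashtable_walk_exit(&iter);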
2017-12-11rhashtable: Change rhashtable_walk_start to return voidTom Herbert1-3/+7
Most callers of rhashtable_walk_start don't care about a resize event which is indicated by a return value of -EAGAIN. So calls to rhashtable_walk_start are wrapped with code to ignore -EAGAIN. Something like this is common:

    ret = rhashtable_walk_start(rhiter);
    if (ret && ret != -EAGAIN)
        goto out;

Since zero and -EAGAIN are the only possible return values from the function this check is pointless. The condition never evaluates to true. This patch changes rhashtable_walk_start to return void. This simplifies code for the callers that ignore -EAGAIN. For the few cases where the caller cares about the resize event, particularly where the table can be walked in multiple parts for netlink or seq file dump, the function rhashtable_walk_start_check has been added that returns -EAGAIN on a resize event. Signed-off-by: Tom Herbert <tom@quantonium.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
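The two resulting call patterns, sketched:

    /* common case: resize events don't matter, nothing to check */
    rhashtable_walk_start(&iter);

    /* multi-part dumps (netlink, seq_file) that must notice a resize: */
    if (rhashtable_walk_start_check(&iter) == -EAGAIN) {
        /* table was resized between stop/start: restart the dump */
    }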
2017-09-19rhashtable: Documentation tweakAndreas Gruenbacher1-4/+5
Clarify that rhashtable_walk_{stop,start} will not reset the iterator to the beginning of the hash table. Confusion between rhashtable_walk_enter and rhashtable_walk_start has already led to a bug. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/randomLinus Torvalds1-1/+1
Pull random updates from Ted Ts'o: "Add wait_for_random_bytes() and get_random_*_wait() functions so that callers can more safely get random bytes if they can block until the CRNG is initialized. Also print a warning if get_random_*() is called before the CRNG is initialized. By default, only one single-line warning will be printed per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a warning will be printed for each function which tries to get random bytes before the CRNG is initialized. This can get spammy for certain architecture types, so it is not enabled by default" * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: reorder READ_ONCE() in get_random_uXX random: suppress spammy warnings about unseeded randomness random: warn when kernel uses unseeded randomness net/route: use get_random_int for random counter net/neighbor: use get_random_u32 for 32-bit hash random rhashtable: use get_random_u32 for hash_rnd ceph: ensure RNG is seeded before using iscsi: ensure RNG is seeded before use cifs: use get_random_u32 for 32-bit lock random random: add get_random_{bytes,u32,u64,int,long,once}_wait family random: add wait_for_random_bytes() API
2017-07-10lib/rhashtable.c: use kvzalloc() in bucket_table_alloc() when possibleMichal Hocko1-4/+3
bucket_table_alloc() can be currently called with GFP_KERNEL or GFP_ATOMIC. For the former we basically have an open coded kvzalloc() while the latter only uses kzalloc(). Let's simplify the code a bit by dropping the open coded path and replacing it with kvzalloc(). Link: http://lkml.kernel.org/r/20170531155145.17111-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19rhashtable: use get_random_u32 for hash_rndJason A. Donenfeld1-1/+1
This is much faster and just as secure. It also has the added benefit of probably returning better randomness at early-boot on systems with architectural RNGs. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-08lib/rhashtable.c: simplify a strange allocation patternMichal Hocko1-10/+3
alloc_bucket_locks allocation pattern is quite unusual. We are preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that vmalloc will respect the memory policy of the current process and so the backing memory will get distributed over multiple nodes if the requester is configured properly. At least that is the intention; in reality, rhashtable is shrunk and expanded from a kernel worker, so no mempolicy can be assumed. Let's just simplify the code and use the kvmalloc helper, which is a transparent way to use kmalloc with vmalloc fallback, if the caller is allowed to block and use the flag otherwise. Link: http://lkml.kernel.org/r/20170306103032.2540-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Tom Herbert <tom@herbertland.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-01rhashtable: compact struct rhashtable_paramsFlorian Westphal1-1/+1
By using smaller datatypes this (rather large) struct shrinks considerably (80 -> 48 bytes on x86_64). As this is embedded in other structs, this also reduces the size of several others, e.g. cls_fl_head or nft_hash. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28rhashtable: Do not lower max_elems when max_size is zeroHerbert Xu1-5/+6
The commit 6d684e54690c ("rhashtable: Cap total number of entries to 2^31") breaks rhashtable users that do not set max_size. This is because when max_size is zero max_elems is also incorrectly set to zero instead of 2^31. This patch fixes it by only lowering max_elems when max_size is not zero. Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27rhashtable: Cap total number of entries to 2^31Herbert Xu1-0/+5
When max_size is not set or if it set to a sufficiently large value, the nelems counter can overflow. This would cause havoc with the automatic shrinking as it would then attempt to fit a huge number of entries into a tiny hash table. This patch fixes this by adding max_elems to struct rhashtable to cap the number of elements. This is set to 2^31 as nelems is not a precise count. This is sufficiently smaller than UINT_MAX that it should be safe. When max_size is set max_elems will be lowered to at most twice max_size as is the status quo. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26rhashtable: remove insecure_max_entries paramFlorian Westphal1-6/+0
no users in the tree, insecure_max_entries is always set to ht->p.max_size * 2 in rhashtable_init(). Replace the only spot that uses it with a ht->p.max_size check. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18rhashtable: remove insecure_elasticityFlorian Westphal1-16/+1
commit 83e7e4ce9e93c3 ("mac80211: Use rhltable instead of rhashtable") removed the last user that made use of 'insecure_elasticity' parameter, i.e. the default of 16 is used everywhere. Replace it with a constant. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h>Ingo Molnar1-0/+1
We don't actually need the full rculist.h header in sched.h anymore, we will be able to include the smaller rcupdate.h header instead. But first update code that relied on the implicit header inclusion. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-26rhashtable: Fix RCU dereference annotation in rht_bucket_nestedHerbert Xu1-2/+3
The current annotation is wrong as it says that we're only called under spinlock. In fact it should be marked as under either spinlock or RCU read lock. Fixes: da20420f83ea ("rhashtable: Add nested tables") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26rhashtable: Fix use before NULL check in bucket_table_freeHerbert Xu1-3/+1
Dan Carpenter reported a use before NULL check bug in the function bucket_table_free. In fact we don't need the NULL check at all as no caller can provide a NULL argument. So this patch fixes this by simply removing it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17rhashtable: Add nested tablesHerbert Xu1-50/+220
This patch adds code that handles GFP_ATOMIC kmalloc failure on insertion. As we cannot use vmalloc, we solve it by making our hash table nested. That is, we allocate single pages at each level and reach our desired table size by nesting them. When a nested table is created, only a single page is allocated at the top-level. Lower levels are allocated on demand during insertion. Therefore for each insertion to succeed, only two (non-consecutive) pages are needed. After a nested table is created, a rehash will be scheduled in order to switch to a vmalloced table as soon as possible. Also, the rehash code will never rehash into a nested table. If we detect a nested table during a rehash, the rehash will be aborted and a new rehash will be scheduled. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20rhashtable: Add rhlist interfaceHerbert Xu1-52/+206
The insecure_elasticity setting is an ugly wart brought out by users who need to insert duplicate objects (that is, distinct objects with identical keys) into the same table. In fact, those users have a much bigger problem. Once those duplicate objects are inserted, they don't have an interface to find them (unless you count the walker interface which walks over the entire table). Some users have resorted to doing a manual walk over the hash table which is of course broken because they don't handle the potential existence of multiple hash tables. The result is that they will break sporadically when they encounter a hash table resize/rehash. This patch provides a way out for those users, at the expense of an extra pointer per object. Essentially each object is now a list of objects carrying the same key. The hash table will only see the lists so nothing changes as far as rhashtable is concerned. To use this new interface, you need to insert a struct rhlist_head into your objects instead of struct rhash_head. While the hash table is unchanged, for type-safety you'll need to use struct rhltable instead of struct rhashtable. All the existing interfaces have been duplicated for rhlist, including the hash table walker. One missing feature is nulls marking because AFAIK the only potential user of it does not need duplicate objects. Should anyone need this it shouldn't be too hard to add. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
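A sketch of the rhlist usage this enables (struct mapping and its fields are illustrative, not from the patch):

    struct mapping {
        u32 key;
        struct rhlist_head list;        /* rhlist_head instead of rhash_head */
    };

    static const struct rhashtable_params params = {
        .key_len     = sizeof(u32),
        .key_offset  = offsetof(struct mapping, key),
        .head_offset = offsetof(struct mapping, list),
    };

    struct rhltable ht;                 /* rhltable instead of rhashtable */
    struct rhlist_head *list, *pos;
    struct mapping *m;

    rhltable_init(&ht, &params);
    rhltable_insert(&ht, &m->list, params);    /* duplicate keys are allowed */

    rcu_read_lock();
    list = rhltable_lookup(&ht, &key, params);
    rhl_for_each_entry_rcu(m, pos, list, list) /* visits every object with key */
        pr_info("match: %u\n", m->key);
    rcu_read_unlock();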
2016-09-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-3/+7
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Most relevant updates are the removal of per-conntrack timers to use a workqueue/garbage collection approach instead from Florian Westphal, the hash and numgen expression for nf_tables from Laura Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag, removal of ip_conntrack sysctl and many other incremental updates on our Netfilter codebase. More specifically, they are: 1) Retrieve only 4 bytes to fetch ports in case of non-linear skb transport area in dccp, sctp, tcp, udp and udplite protocol conntrackers, from Gao Feng. 2) Missing whitespace on error message in physdev match, from Hangbin Liu. 3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang. 4) Add nf_ct_expires() helper function and use it, from Florian Westphal. 5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also from Florian. 6) Rename nf_tables set implementation to nft_set_{name}.c 7) Introduce the hash expression to allow arbitrary hashing of selector concatenations, from Laura Garcia Liebana. 8) Remove ip_conntrack sysctl backward compatibility code, this code has been around for long time already, and we have two interfaces to do this already: nf_conntrack sysctl and ctnetlink. 9) Use nf_conntrack_get_ht() helper function whenever possible, instead of opencoding fetch of hashtable pointer and size, patch from Liping Zhang. 10) Add quota expression for nf_tables. 11) Add number generator expression for nf_tables, this supports incremental and random generators that can be combined with maps, very useful for load balancing purpose, again from Laura Garcia Liebana. 12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King. 13) Introduce a nft_chain_parse_hook() helper function to parse chain hook configuration, this is used by a follow up patch to perform better chain update validation. 14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the nft_set_hash implementation to honor the NLM_F_EXCL flag. 15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(), patch from Florian Westphal. 16) Don't use the DYING bit to know if the conntrack event has been already delivered, instead a state variable to track event re-delivery states, also from Florian. 17) Remove the per-conntrack timer, use the workqueue approach that was discussed during the NFWS, from Florian Westphal. 18) Use the netlink conntrack table dump path to kill stale entries, again from Florian. 19) Add a garbage collector to get rid of stale conntracks, from Florian. 20) Reschedule garbage collector if eviction rate is high. 21) Get rid of the __nf_ct_kill_acct() helper. 22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger. 23) Make nf_log_set() interface assertive on unsupported families. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+4
All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26rhashtable: fix a memory leak in alloc_bucket_locks()Eric Dumazet1-3/+4
If vmalloc() was successful, do not attempt a kmalloc_array(). Fixes: 4cf0b354d92e ("rhashtable: avoid large lock-array allocations") Reported-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26rhashtable: add rhashtable_lookup_get_insert_key()Pablo Neira Ayuso1-3/+7
This patch modifies __rhashtable_insert_fast() so it returns the existing object that clashes with the one that you want to insert. In case the object is successfully inserted, NULL is returned. Otherwise, you get an error via ERR_PTR(). This patch adapts the existing callers of __rhashtable_insert_fast() so they handle this new logic, and it adds a new rhashtable_lookup_get_insert_key() interface to fetch this existing object. nf_tables needs this change to improve handling of EEXIST cases via honoring the NLM_F_EXCL flag and by checking if the data part of the mapping matches what we have. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
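Roughly how a caller consumes the new return-value convention (ht, elem and my_params are placeholders, not from the patch):

    void *old;

    old = rhashtable_lookup_get_insert_key(ht, &elem->key, &elem->node,
                                           my_params);
    if (IS_ERR(old))
            return PTR_ERR(old);    /* insertion failed, e.g. -ENOMEM */
    if (old)
            return -EEXIST;         /* clashing object returned: the
                                       NLM_F_EXCL case; nf_tables can
                                       compare old's data part here */
    /* old == NULL: elem was inserted */
    return 0;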
2016-08-19rhashtable: Remove GFP flag from rhashtable_walk_initHerbert Xu1-28/+18
The commit 8f6fd83c6c5ec66a4a70c728535ddcdfef4f3697 ("rhashtable: accept GFP flags in rhashtable_walk_init") added a GFP flag argument to rhashtable_walk_init because some users wish to use the walker in an unsleepable context. In fact we don't need to allocate memory in rhashtable_walk_init at all. The walker is always paired with an iterator so we could just stash ourselves there. This patch does that by introducing a new enter function to replace the existing init function. This way we don't have to churn all the existing users again. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
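Sketch of a walk with the new enter/exit pairing (my_obj is a placeholder); since the walker state is stashed in the caller-supplied iterator, rhashtable_walk_enter() needs no allocation and cannot fail:

    struct rhashtable_iter iter;
    struct my_obj *obj;

    rhashtable_walk_enter(ht, &iter);       /* no allocation, no GFP flag */
    if (rhashtable_walk_start(&iter) == -EAGAIN) {
            /* a resize happened before the walk; duplicates possible */
    }

    while ((obj = rhashtable_walk_next(&iter)) != NULL) {
            if (IS_ERR(obj)) {
                    if (PTR_ERR(obj) == -EAGAIN)
                            continue;       /* resize seen mid-walk */
                    break;
            }
            /* use obj; no sleeping until rhashtable_walk_stop() */
    }

    rhashtable_walk_stop(&iter);            /* leaves RCU read-side */
    rhashtable_walk_exit(&iter);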
2016-08-15rhashtable: fix shift by 64 when shrinkingVegard Nossum1-2/+4
I got this: ================================================================================ UBSAN: Undefined behaviour in ./include/linux/log2.h:63:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 1 PID: 721 Comm: kworker/1:1 Not tainted 4.8.0-rc1+ #87 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Workqueue: events rht_deferred_worker 0000000000000000 ffff88011661f8d8 ffffffff82344f50 0000000041b58ab3 ffffffff84f98000 ffffffff82344ea4 ffff88011661f900 ffff88011661f8b0 0000000000000001 ffff88011661f6b8 dffffc0000000000 ffffffff867f7640 Call Trace: [<ffffffff82344f50>] dump_stack+0xac/0xfc [<ffffffff82344ea4>] ? _atomic_dec_and_lock+0xc4/0xc4 [<ffffffff8242f5b8>] ubsan_epilogue+0xd/0x8a [<ffffffff82430c41>] __ubsan_handle_shift_out_of_bounds+0x255/0x29a [<ffffffff824309ec>] ? __ubsan_handle_out_of_bounds+0x180/0x180 [<ffffffff84003436>] ? nl80211_req_set_reg+0x256/0x2f0 [<ffffffff812112ba>] ? print_context_stack+0x8a/0x160 [<ffffffff81200031>] ? amd_pmu_reset+0x341/0x380 [<ffffffff823af808>] rht_deferred_worker+0x1618/0x1790 [<ffffffff823af808>] ? rht_deferred_worker+0x1618/0x1790 [<ffffffff823ae1f0>] ? rhashtable_jhash2+0x370/0x370 [<ffffffff8134c12d>] ? process_one_work+0x6fd/0x1970 [<ffffffff8134c1cf>] process_one_work+0x79f/0x1970 [<ffffffff8134c12d>] ? process_one_work+0x6fd/0x1970 [<ffffffff8134ba30>] ? try_to_grab_pending+0x4c0/0x4c0 [<ffffffff8134d564>] ? worker_thread+0x1c4/0x1340 [<ffffffff8134d8ff>] worker_thread+0x55f/0x1340 [<ffffffff845e904f>] ? __schedule+0x4df/0x1d40 [<ffffffff8134d3a0>] ? process_one_work+0x1970/0x1970 [<ffffffff8134d3a0>] ? process_one_work+0x1970/0x1970 [<ffffffff813642f7>] kthread+0x237/0x390 [<ffffffff813640c0>] ? __kthread_parkme+0x280/0x280 [<ffffffff845f8c93>] ? _raw_spin_unlock_irq+0x33/0x50 [<ffffffff845f95df>] ret_from_fork+0x1f/0x40 [<ffffffff813640c0>] ? __kthread_parkme+0x280/0x280 ================================================================================ roundup_pow_of_two() is undefined when called with an argument of 0, so let's avoid the call and just fall back to ht->p.min_size (which should never be smaller than HASH_MIN_SIZE). Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
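The shape of the fix on the shrink path, as a sketch of the logic rather than the literal diff:

    unsigned int nelems = atomic_read(&ht->nelems);
    unsigned int size = 0;

    if (nelems)                     /* roundup_pow_of_two(0) is undefined */
            size = roundup_pow_of_two(nelems * 3 / 2);
    if (size < ht->p.min_size)
            size = ht->p.min_size;  /* never below HASH_MIN_SIZE */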
2016-08-14rhashtable: avoid large lock-array allocationsFlorian Westphal1-2/+5
Sander reports the following splat after the netfilter nat bysrc table got converted to rhashtable: swapper/0: page allocation failure: order:3, mode:0x2084020(GFP_ATOMIC|__GFP_COMP) CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc1 [..] [<ffffffff811633ed>] warn_alloc_failed+0xdd/0x140 [<ffffffff811638b1>] __alloc_pages_nodemask+0x3e1/0xcf0 [<ffffffff811a72ed>] alloc_pages_current+0x8d/0x110 [<ffffffff8117cb7f>] kmalloc_order+0x1f/0x70 [<ffffffff811aec19>] __kmalloc+0x129/0x140 [<ffffffff8146d561>] bucket_table_alloc+0xc1/0x1d0 [<ffffffff8146da1d>] rhashtable_insert_rehash+0x5d/0xe0 [<ffffffff819fcfff>] nf_nat_setup_info+0x2ef/0x400 The failure happens when allocating the spinlock array. Even with GFP_KERNEL it's unlikely for such a large allocation to succeed. Thomas Graf pointed me at inet_ehash_locks_alloc(), so in addition to adding NOWARN for atomic allocations this also makes the bucket-array sizing more conservative. In commit 095dc8e0c3686 ("tcp: fix/cleanup inet_ehash_locks_alloc()"), Eric Dumazet says: "Budget 2 cache lines per cpu worth of 'spinlocks'". IOW, consider the size needed by a single spinlock when determining the number of locks per cpu. So with 64 bytes per cacheline and 4 bytes per spinlock this gives 32 locks per cpu. Resulting size of the lock-array (sizeof(spinlock) == 4): cpus: 1 2 4 8 16 32 64 old: 1k 1k 4k 8k 16k 16k 16k new: 128 256 512 1k 2k 4k 8k An 8k allocation should have a decent chance of success even with GFP_ATOMIC, and should not fail with GFP_KERNEL. With a 72-byte spinlock (LOCKDEP): cpus : 1 2 old: 9k 18k new: ~2k ~4k Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Suggested-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
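The per-CPU budget, restated as illustrative arithmetic (the variable names are hypothetical, not the patch's):

    /* two cache lines worth of spinlocks per CPU:
     * 2 * 64 / 4 == 32 locks per CPU for 4-byte spinlocks */
    unsigned int per_cpu = 2 * L1_CACHE_BYTES / sizeof(spinlock_t);
    unsigned int nr_locks;

    nr_locks = min_t(unsigned int, nbuckets,
                     per_cpu * num_possible_cpus());
    nr_locks = roundup_pow_of_two(nr_locks);  /* keep a mask-friendly size */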
2016-04-05rhashtable: accept GFP flags in rhashtable_walk_initBob Copeland1-2/+4
In certain cases, the 802.11 mesh pathtable code wants to iterate over all of the entries in the forwarding table from the receive path, which is inside an RCU read-side critical section. Enable walks inside atomic sections by allowing GFP_ATOMIC allocations for the walker state. Change all existing callsites to pass in GFP_KERNEL. Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Bob Copeland <me@bobcopeland.com> [also adjust gfs2/glock.c and rhashtable tests] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+2
2015-12-18rhashtable: Kill harmless RCU warning in rhashtable_walk_initHerbert Xu1-1/+2
The commit c6ff5268293ef98e48a99597e765ffc417e39fa5 ("rhashtable: Fix walker list corruption") causes a suspicious RCU usage warning because we no longer hold ht->mutex when we dereference ht->tbl. However, this is a false positive because we now hold ht->lock which also guarantees that ht->tbl won't disappear from under us. This patch kills the warning by using rcu_dereference_protected. Reported-by: kernel test robot <ying.huang@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-27/+40
Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16rhashtable: Fix walker list corruptionHerbert Xu1-9/+7
The commit ba7c95ea3870fe7b847466d39a049ab6f156aa2c ("rhashtable: Fix sleeping inside RCU critical section in walk_stop") introduced a new spinlock for the walker list. However, it did not convert all existing users of the list over to the new spin lock. Some continued to use the old mutex for this purpose. This obviously led to corruption of the list. The fix is to use the spin lock everywhere we touch the list. This also allows us to take rcu_read_lock before we take the lock in rhashtable_walk_start. With the old mutex this would've deadlocked but it's safe with the new spin lock. Fixes: ba7c95ea3870 ("rhashtable: Fix sleeping inside RCU...") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16rhashtable: Enforce minimum size on initial hash tableHerbert Xu1-3/+3
William Hua <william.hua@canonical.com> wrote: > > I wasn't aware there was an enforced minimum size. I simply set the > nelem_hint in the rhashtable_params struct to 1, expecting it to grow as > needed. This caused a segfault afterwards when trying to insert an > element. OK we're doing the size computation before we enforce the limit on min_size. ---8<--- We need to do the initial hash table size computation after we have obtained the correct min_size/max_size parameters. Otherwise we may end up with a hash table whose size is outside the allowed envelope. Fixes: a998f712f77e ("rhashtable: Round up/down min/max_size to...") Reported-by: William Hua <william.hua@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-08rhashtable: Remove unnecessary wmb for future_tblHerbert Xu1-3/+0
The patch 9497df88ab5567daa001829051c5f87161a81ff0 ("rhashtable: Fix reader/rehash race") added a pair of barriers. In fact the wmb is superfluous because every subsequent write to the old or new hash table uses rcu_assign_pointer, which itself carries a full barrier prior to the assignment. Therefore we may remove the explicit wmb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-05Revert "rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation"David S. Miller1-3/+2
This reverts commit d3716f18a7d841565c930efde30737a3557eee69. vmalloc cannot be used in BH disabled contexts, even with GFP_ATOMIC. And we certainly want to support rhashtable users inserting entries with software interrupts disabled. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-04rhashtable: Use __vmalloc with GFP_ATOMIC for table allocationHerbert Xu1-2/+3
When an rhashtable user pounds rhashtable hard with back-to-back insertions we may end up growing the table in GFP_ATOMIC context. Unfortunately when the table reaches a certain size this often fails because we don't have enough physically contiguous pages to hold the new table. Eric Dumazet suggested (and in fact wrote this patch) using __vmalloc instead which can be used in GFP_ATOMIC context. Reported-by: Phil Sutter <phil@nwl.cc> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-04rhashtable: Prevent spurious EBUSY errors on insertionHerbert Xu1-15/+30
Thomas and Phil observed that under stress rhashtable insertion sometimes failed with EBUSY, even though this error should only ever be seen when we're under attack and our hash chain length has grown to an unacceptable level, even after a rehash. It turns out that the logic for detecting whether there is an existing rehash is faulty. In particular, when two threads both try to grow the same table at the same time, one of them may see the newly grown table and thus erroneously conclude that it had been rehashed. This is what leads to the EBUSY error. This patch fixes this by remembering the current last table we used during insertion so that rhashtable_insert_rehash can detect when another thread has also done a resize/rehash. When this is detected we will give up our resize/rehash and simply retry the insertion with the new table. Reported-by: Thomas Graf <tgraf@suug.ch> Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22lib: fix data race in rhashtable_rehash_oneDmitriy Vyukov1-4/+1
rhashtable_rehash_one() uses complex logic to update entry->next field, after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion: entry->next = 1 | ((base + off) << 1) This can be compiled along the lines of: entry->next = base + off entry->next <<= 1 entry->next |= 1 Which will break concurrent readers. NULLS value recomputation is not needed here, so just remove the complex logic. The data race was found with KernelThreadSanitizer (KTSAN). Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
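The rule the fix restores, sketched with the names from the message above: a pointer that concurrent RCU readers may follow has to be published with one atomic pointer-sized store, never assembled in place:

    /* broken: the compiler may emit these as separate stores that a
     * reader can observe half-done */
    entry->next = (struct rhash_head *)(base + off);
    entry->next = (struct rhash_head *)
                  (((unsigned long)entry->next << 1) | 1);

    /* safe: a single publication of an already-valid pointer */
    rcu_assign_pointer(entry->next, head);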
2015-07-08rhashtable: fix for resize events during table walkPhil Sutter1-2/+2
If rhashtable_walk_next detects a resize operation in progress, it jumps to the new table and continues walking that one. But it fails to drop the reference to its current item, leading it to continue traversing the new table's bucket into which the current item is sorted, and after reaching that bucket's end it continues traversing the new table's second bucket instead of the first one, thereby potentially missing items. This fixes the rhashtable runtime test for me. Bug probably introduced by Herbert Xu's patch eddee5ba ("rhashtable: Fix walker behaviour during rehash") although not explicitly tested. Fixes: eddee5ba ("rhashtable: Fix walker behaviour during rehash") Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
2015-06-07rhashtable: add missing import <linux/export.h>Hauke Mehrtens1-0/+1
rhashtable uses EXPORT_SYMBOL_GPL() without including linux/export.h directly; it is only pulled in indirectly through other includes. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+11
Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-16rhashtable: Add cap on number of elements in hash tableHerbert Xu1-0/+11
We currently have no limit on the number of elements in a hash table. This is a problem because some users (tipc) set a ceiling on the maximum table size and when that is reached the hash table may degenerate. Others may encounter OOM when growing and if we allow insertions when that happens the hash table performance may also suffer. This patch adds a new parameter insecure_max_entries which becomes the cap on the table. If unset it defaults to max_size * 2. If it is also zero it means that there is no cap on the number of elements in the table. However, the table will grow whenever the utilisation hits 100% and if that growth fails, you will get ENOMEM on insertion. As allowing oversubscription is potentially dangerous, the name contains the word insecure. Note that the cap is not a hard limit. This is done for performance reasons as enforcing a hard limit will result in use of atomic ops that are heavier than the ones we currently use. The reasoning is that we're only guarding against a gross oversubscription of the table, rather than a small breach of the limit. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
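How a user might opt in (the field values and struct my_obj are illustrative; insecure_max_entries is the one parameter this patch adds):

    static const struct rhashtable_params my_params = {
            .key_len              = sizeof(u32),
            .key_offset           = offsetof(struct my_obj, key),
            .head_offset          = offsetof(struct my_obj, node),
            .max_size             = 1 << 16,
            /* soft cap on elements; left at 0 it defaults to max_size * 2 */
            .insecure_max_entries = 1 << 17,
    };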
2015-05-05rhashtable: Simplify iterator codeThomas Graf1-6/+2
Remove useless obj variable and goto logic. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22rhashtable: Do not schedule more than one rehash if we can't grow furtherThomas Graf1-2/+2
The current code only stops inserting rehashes into the chain when no resizes are currently scheduled. As long as resizes are scheduled and while inserting above the utilization watermark, more and more rehashes will be scheduled. This led to a perfect DoS storm with thousands of rehashes scheduled which led to thousands of spinlocks being taken sequentially. Instead, only allow either a series of resizes or a single rehash. Drop any further rehashes and return -EBUSY. Fixes: ccd57b1bd324 ("rhashtable: Add immediate rehash during insertion") Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22rhashtable: Schedule async resize when sync realloc failsThomas Graf1-1/+6
When rhashtable_insert_rehash() fails with ENOMEM, this indicates that we can't allocate the necessary memory in the current context but the limits as set by the user would still allow the table to grow. Thus attempt an async resize in the background where we can allocate using GFP_KERNEL which is more likely to succeed. The insertion itself will still fail, to indicate the pressure. This fixes a bug where the table would never continue growing once the utilization is above 100%. Fixes: ccd57b1bd324 ("rhashtable: Add immediate rehash during insertion") Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25rhashtable: provide len to obj_hashfnPatrick McHardy1-1/+1
nftables sets will be converted to use so called setextensions, moving the key to a non-fixed position. To hash it, the obj_hashfn must be used, however it so far doesn't receive the length parameter. Pass the key length to obj_hashfn() and convert existing users. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-24rhashtable: Add rhashtable_free_and_destroy()Thomas Graf1-10/+39
rhashtable_destroy() variant which stops rehashes, iterates over the table and calls a callback to release resources. Avoids the need for nft_hash to embed rhashtable internals and allows getting rid of the being_destroyed flag. It also saves a 2nd mutex lock upon destruction. Also fixes an RCU lockdep splat on nft set destruction due to calling rht_for_each_entry_safe() without holding bucket locks. Open code this loop as we need to know that no mutations may occur in parallel. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
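A minimal sketch of the calling convention (free_elem and teardown are hypothetical helpers):

    #include <linux/rhashtable.h>
    #include <linux/slab.h>

    static void free_elem(void *ptr, void *arg)
    {
            kfree(ptr);     /* ptr is the element; arg an opaque cookie */
    }

    static void teardown(struct rhashtable *ht)
    {
            /* stops rehashing, walks every chain calling free_elem(),
             * then releases the bucket tables */
            rhashtable_free_and_destroy(ht, free_elem, NULL);
    }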
2015-03-24rhashtable: Disable automatic shrinking by defaultThomas Graf1-1/+1
Introduce a new bool automatic_shrinking to require the user to explicitly opt-in to automatic shrinking of tables. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24rhashtable: Use 'unsigned int' consistentlyThomas Graf1-8/+10
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24rhashtable: Add comment on choice of elasticity valueHerbert Xu1-0/+12
This patch adds a comment on the choice of the value 16 as the maximum chain length before we force a rehash. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Fix sleeping inside RCU critical section in walk_stopHerbert Xu1-2/+5
The commit 963ecbd41a1026d99ec7537c050867428c397b89 ("rhashtable: Fix use-after-free in rhashtable_walk_stop") fixed a real bug but created another one because we may end up sleeping inside an RCU critical section. This patch fixes it properly by replacing the mutex with a spin lock that specifically protects the walker lists. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Add immediate rehash during insertionHerbert Xu1-1/+59
This patch reintroduces immediate rehash during insertion. If we find during insertion that the table is full or the chain length exceeds a set limit (currently 16 but may be disabled with insecure_elasticity) then we will force an immediate rehash. The rehash will contain an expansion if the table utilisation exceeds 75%. If this rehash fails then the insertion will fail. Otherwise the insertion will be reattempted in the new hash table. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Allow GFP_ATOMIC bucket table allocationHerbert Xu1-11/+15
This patch adds the ability to allocate bucket table with GFP_ATOMIC instead of GFP_KERNEL. This is needed when we perform an immediate rehash during insertion. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Add multiple rehash supportHerbert Xu1-15/+72
This patch adds the missing bits to allow multiple rehashes. The read-side as well as remove already handle this correctly. So it's only the rehasher and insertion that need modification to handle this. Note that this patch doesn't actually enable it so for now rehashing is still only performed by the worker thread. This patch also disables the explicit expand/shrink interface because the table is meant to expand and shrink automatically, and continuing to export these interfaces unnecessarily complicates the life of the rehasher since the rehash process is now composed of two parts. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Shrink to fitHerbert Xu1-3/+10
This patch changes rhashtable_shrink to shrink to the smallest size possible rather than halving the table. This is needed because with multiple rehashing we will defer shrinking until all other rehashing is done, meaning that when we do shrink we may be able to shrink a lot. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Allow hashfn to be unsetHerbert Xu1-1/+16
Since every current rhashtable user uses jhash as their hash function, the fact that jhash is an inline function causes each user to generate a copy of its code. This patch provides a solution to this problem by allowing hashfn to be unset, in which case rhashtable will automatically set it to jhash. Furthermore, if the key length is a multiple of 4, we will switch over to jhash2. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23rhashtable: Add barrier to ensure we see new tables in walkerHerbert Xu1-0/+3
The walker is a lockless reader so it too needs an smp_rmb before reading the future_tbl field in order to see any new tables that may contain elements that we should have walked over. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20rhashtable: Rip out obsolete out-of-line interfaceHerbert Xu1-284/+0
Now that all rhashtable users have been converted over to the inline interface, this patch removes the unused out-of-line interface. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20rhashtable: Allow hash/comparison functions to be inlinedHerbert Xu1-113/+50
This patch deals with the complaint that we make indirect function calls on the fast paths unnecessarily in rhashtable. We resolve it by moving the fast paths into inline functions that take struct rhashtable_params (which obviously must be the same set of parameters supplied to rhashtable_init) as an argument. The only remaining indirect call is to obj_hashfn (or key_hashfn if obj_hashfn is unset) on the rehash as well as the insert-during-rehash slow path. This patch also extends the support of variable-length keys to include those where the key is fixed but scattered in the object. For example, in netlink we want to key off the namespace and the portid but they're not next to each other. This patch does this by directly using the object hash function as the indicator of whether the key is accessible or not. It also adds a new function obj_cmpfn to compare a key against an object. This means that the caller no longer needs to supply explicit compare functions. All this is done in a backwards compatible manner so no existing users are affected until they convert to the new interface. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
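The pattern this enables, sketched with an illustrative object type: a compile-time constant params struct passed by value lets the compiler constant-fold the offsets and inline the hash and compare functions on the fast paths:

    #include <linux/rhashtable.h>

    struct my_obj {
            u32 key;
            struct rhash_head node;
    };

    static const struct rhashtable_params my_params = {
            .key_len     = sizeof(u32),
            .key_offset  = offsetof(struct my_obj, key),
            .head_offset = offsetof(struct my_obj, node),
    };

    /* caller holds rcu_read_lock() */
    static struct my_obj *my_find(struct rhashtable *ht, const u32 *key)
    {
            return rhashtable_lookup_fast(ht, key, my_params);
    }

    static int my_add(struct rhashtable *ht, struct my_obj *obj)
    {
            return rhashtable_insert_fast(ht, &obj->node, my_params);
    }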
2015-03-20rhashtable: Make rhashtable_init params argument constHerbert Xu1-3/+4
This patch marks the rhashtable_init params argument const as there is no reason to modify it since we will always make a copy of it in the rhashtable. This patch also fixes a bug where we don't actually round up the value of min_size unless it is less than HASH_MIN_SIZE. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19rhashtable: Round up/down min/max_size to ensure we respect limitThomas Graf1-2/+8
Round up min_size, and round down max_size, to the nearest power of two to make sure we always respect the limit specified by the user. This is required because we compare the table size against the limit before we expand or shrink. Also fixes a minor bug where we modified min_size in the params provided instead of the copy stored in struct rhashtable. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18rhashtable: Remove max_shift and min_shiftHerbert Xu1-6/+1
Now that nobody uses max_shift and min_shift, we can safely remove them. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18rhashtable: Introduce max_size/min_sizeHerbert Xu1-4/+8
This patch adds the parameters max_size and min_size which are meant to replace max_shift and min_shift. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18rhashtable: Remove shift from bucket_tableHerbert Xu1-3/+2
Keeping both size and shift is silly. We only need one. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16rhashtable: Avoid calculating hash again to unlockThomas Graf1-6/+5
Caching the lock pointer avoids having to hash on the object again to unlock the bucket locks. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16rhashtable: Annotate RCU locking of walkersThomas Graf1-0/+2
Fixes the following sparse warnings: lib/rhashtable.c:767:5: warning: context imbalance in 'rhashtable_walk_start' - wrong count at exit lib/rhashtable.c:849:6: warning: context imbalance in 'rhashtable_walk_stop' - unexpected unlock Fixes: f2dba9c6ff0d ("rhashtable: Introduce rhashtable_walk_*") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Fix rhashtable_remove failuresHerbert Xu1-10/+7
The commit 9d901bc05153bbf33b5da2cd6266865e531f0545 ("rhashtable: Free bucket tables asynchronously after rehash") causes gratuitous failures in rhashtable_remove. The reason is that it inadvertently introduced multiple rehashing from the perspective of readers. IOW it is now possible to see more than two tables during a single RCU critical section. Fortunately the other reader rhashtable_lookup already deals with this correctly thanks to c4db8848af6af92f90462258603be844baeab44d ("rhashtable: Move future_tbl into struct bucket_table") so only rhashtable_remove is broken by this change. This patch fixes this by looping over every table from the first one to the last or until we find the element that we were trying to delete. Incidentally the simple test for detecting rehashing to prevent starting another shrinking no longer works. Since it isn't needed anyway (the work queue and the mutex serve as a natural barrier to unnecessary rehashes) I've simply killed the test. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Fix use-after-free in rhashtable_walk_stopHerbert Xu1-3/+4
The commit c4db8848af6af92f90462258603be844baeab44d ("rhashtable: Move future_tbl into struct bucket_table") introduced a use-after-free bug in rhashtable_walk_stop because it dereferences tbl after dropping the RCU read lock. This patch fixes it by moving the RCU read unlock down to the bottom of rhashtable_walk_stop. In fact this was how I had it originally but it got dropped while rearranging patches because this one depended on the async freeing of bucket_table. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Move future_tbl into struct bucket_tableHerbert Xu1-16/+11
This patch moves future_tbl to open up the possibility of having multiple rehashes on the same table. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Add rehash counter to bucket_tableHerbert Xu1-0/+1
This patch adds a rehash counter to bucket_table to indicate the last bucket that has been rehashed. This serves two purposes: 1. Any bucket that has been rehashed can never gain a new object. 2. If the rehash counter reaches the size of the table, the table will forever remain empty. This patch also downsizes bucket_table->size to an unsigned int since we do not support sizes greater than 32 bits yet. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Free bucket tables asynchronously after rehashHerbert Xu1-3/+6
There is in fact no need to wait for an RCU grace period in the rehash function, since all insertions are guaranteed to go into the new table through spin locks. This patch uses call_rcu to free the old/rehashed table at our leisure. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Move seed init into bucket_table_allocHerbert Xu1-10/+6
It seems that I have already made every rehash redo the random seed even though my commit message indicated otherwise :) Since we have already taken that step, this patch goes one step further and moves the seed initialisation into bucket_table_alloc. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Use SINGLE_DEPTH_NESTINGHerbert Xu1-7/+2
We only nest one level deep; there is no need to roll our own lock subclasses. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15rhashtable: Fix walker behaviour during rehashHerbert Xu1-23/+46
Previously whenever the walker encountered a resize it simply snapped back to the beginning and started again. However, this only works if the rehash started and completed while the walker was idle. If the walker attempts to restart while the rehash is still ongoing, we may miss objects that we shouldn't have. This patch fixes this by making the walker walk the old table followed by the new table just like all other readers. If a rehash is detected we will still signal our caller of the fact so they can prepare for duplicates but we will simply continue the walk onto the new table after the old one is finished either by us or by the rehasher. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: Fix read-side crash during rehashHerbert Xu1-1/+1
This patch fixes a typo in rhashtable_lookup_compare where we fail to recompute the hash when looking up the new table. This causes elements to be missed and potentially a crash during a resize. Reported-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: kill ht->shift atomic operationsDaniel Borkmann1-30/+25
Commit c0c09bfdc415 ("rhashtable: avoid unnecessary wakeup for worker queue") changed ht->shift to be atomic, which is actually unnecessary. Instead of leaving the current shift in the core rhashtable structure, it can be cached inside the individual bucket tables. There, it will only be initialized once during a new table allocation in the shrink/expansion slow path, and from then onward it stays immutable for the rest of the bucket table lifetime. That allows shift to be non-atomic. The patch also moves hash_rnd management into the table setup. The rhashtable structure now consumes 3 instead of 4 cachelines. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ying Xue <ying.xue@windriver.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: Fix reader/rehash raceHerbert Xu1-0/+6
There is a potential race condition between readers and the rehasher. In particular, the rehasher could have started a rehash while the reader finishes a scan of the old table but fails to see the new table pointer. This patch closes this window by adding smp_wmb/smp_rmb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: Remove obj_raw_hashfnHerbert Xu1-18/+7
Now that the only caller of obj_raw_hashfn is head_hashfn, we can simply kill it and fold it into the latter. This patch also moves the common shift from head_hashfn/key_hashfn into rht_bucket_index. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: Remove key length argument to key_hashfnHerbert Xu1-3/+4
key_hashfn has only one caller and it doesn't really need to supply the key length as an extra parameter. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: Use head_hashfn instead of obj_raw_hashfnHerbert Xu1-7/+5
Now that we don't have cross-table hashes, we no longer need to keep the entire hash value so all users of obj_raw_hashfn can use head_hashfn instead. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12rhashtable: Move masking back into key_hashfnHerbert Xu1-2/+3
This patch reverts commit c88455ce50ae4224d84960ce2baa53e61580df27 ("rhashtable: key_hashfn() must return full hash value") because the only user of it always masks the hash value. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11rhashtable: Add annotation to nested lockHerbert Xu1-2/+2
Commit aa34a6cb0478842452bac58edb50d3ef9e178c92 ("rhashtable: Add arbitrary rehash function") killed the annotation on the nested lock which leads to bitching from lockdep. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11rhashtable: Add arbitrary rehash functionHerbert Xu1-332/+174
This patch adds a rehash function that supports the use of any hash function for the new table. This is needed to support changing the random seed value during the lifetime of the hash table. However for now the random seed value is still constant and the rehash function is simply used to replace the existing expand/shrink functions. [ ASSERT_BUCKET_LOCK() and thus debug_dump_table() + debug_dump_buckets() are no longer used, so delete them entirely. -DaveM ] Signed-off-by: Herbert Xu <herbert.xu@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11rhashtable: Move hash_rnd into bucket_tableHerbert Xu1-9/+15
Currently hash_rnd is a parameter that users can set. However, no existing users set this parameter. It is also something that people are unlikely to want to set directly since it's just a random number. In preparation for allowing the reseeding/rehashing of rhashtable, this patch moves hash_rnd into bucket_table so that it's now an internal state rather than a parameter. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27rhashtable: use cond_resched()Eric Dumazet1-0/+4
If a hash table has 128 slots and 16384 elems, expanding to 256 slots takes more than one second. For larger sets, a soft lockup is detected. Holding the CPU for that long, even in a work queue, is a show stopper for non-preemptible kernels. Call cond_resched() at strategic points to allow the process scheduler to reschedule us. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
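The shape of the change, as a sketch; rehash_one_chain() stands in for whatever per-bucket work the resize loop performs:

    for (hash = 0; hash < old_tbl->size; hash++) {
            rehash_one_chain(ht, hash);     /* hypothetical per-bucket step */
            cond_resched();                 /* give the scheduler a chance
                                               between buckets */
    }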
2015-02-27rhashtable: remove indirection for grow/shrink decision functionsDaniel Borkmann1-39/+17
Currently, all real users of rhashtable default their grow and shrink decision functions to rht_grow_above_75() and rht_shrink_below_30(), so that there's currently no need to have this explicitly selectable. It can/should be generic and private inside rhashtable until a real use case pops up. Since we can make this private, we'll save us this additional indirection layer and can improve insertion/deletion time as well. Reference: http://patchwork.ozlabs.org/patch/443040/ Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27rhashtable: unconditionally grow when max_shift is not specifiedDaniel Borkmann1-1/+1
While commit c0c09bfdc415 ("rhashtable: avoid unnecessary wakeup for worker queue") rightfully moved part of the decision making of whether we should expand or shrink from the expand/shrink functions themselves into insert/delete functions in order to avoid unnecessary worker wake-ups, it however introduced a regression by doing so. Before that change, if no max_shift was specified (= 0) on rhashtable initialization, rhashtable_expand() would just grow unconditionally and let the available memory be the limiting factor. After that change, if no max_shift was specified, there would be _no_ expansion step at all. Given that netlink and tipc have a max_shift specified, it was not visible there, but Josh Hunt reported that nft, which starts out with a default element hint of 3 if not otherwise provided, would slow inserts down tremendously as it cannot grow larger to relax table occupancy. Given that the test case verifies shrinks/expands manually, we also must remove the pointers to the helper functions to explicitly avoid parallel resizing on insertions/deletions. test_bucket_stats() and test_rht_lookup() could also be wrapped around the rhashtable mutex to explicitly synchronize a walk from resizing, but I think that defeats the actual test case which intended to have explicit test steps, i.e. 1) inserts, 2) expands, 3) shrinks, 4) deletions, with object verification after each stage. Reported-by: Josh Hunt <johunt@akamai.com> Fixes: c0c09bfdc415 ("rhashtable: avoid unnecessary wakeup for worker queue") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ying Xue <ying.xue@windriver.com> Cc: Josh Hunt <johunt@akamai.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-23rhashtable: initialize all rhashtable walker membersSasha Levin1-0/+3
Commit f2dba9c6ff ("rhashtable: Introduce rhashtable_walk_*") forgot to initialize the members of struct rhashtable_walker after allocating it, which caused an undefined value for 'resize' which is used later on. Fixes: f2dba9c6ff ("rhashtable: Introduce rhashtable_walk_*") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20rhashtable: better high order allocation attemptsDaniel Borkmann1-3/+3
When trying to allocate future tables via bucket_table_alloc(), it seems overkill on large table shifts that we unconditionally probe with kzalloc() first, as it's likely to fail. Only probe with kzalloc() for more reasonable table sizes and use vzalloc() either as a fallback on failure or directly in case of large table sizes. Fixes: 7e1e77636e36 ("lib: Resizable, Scalable, Concurrent Hash Table") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20rhashtable: don't test for shrink on insert, expansion on deleteDaniel Borkmann1-9/+18
Restore pre 54c5b7d311c8 behaviour and only probe for expansions on inserts and shrinks on deletes. Currently, it can happen that on initial inserts into a sparse hash table, we may, for example, shrink it first simply because it's not fully populated yet, only to later realize that we need to grow again. This however is counter-intuitive; e.g. an initial default size of 64 elements is already small enough, and in case an element size hint is given to the hash table by a user, we should avoid unnecessary expansion steps, so a shrink is clearly unintended here. Fixes: 54c5b7d311c8 ("rhashtable: introduce rhashtable_wakeup_worker helper function") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ying Xue <ying.xue@windriver.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08rhashtable: using ERR_PTR requires linux/err.hStephen Rothwell1-0/+1
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: Fix remove logic to avoid cross references between bucketsThomas Graf1-11/+17
The remove logic properly searched the remaining chain for a matching entry with an identical hash but it did this while searching from both the old and new table. Instead, in order to not leave stale references behind we need to: 1. When growing and searching from the new table: Search the remaining chain for an entry with the same hash to avoid having the new table directly point to an entry with a different hash. 2. When shrinking and searching from the old table: Check if the element after the removed one would create a cross reference and avoid it if so. These bugs were present from the beginning in nft_hash. Also, both insert functions calculated the hash based on the mask of the new table. This worked while growing. While shrinking, the mask of the new table is smaller than the mask of the old table. This led to a bit not being taken into account when selecting the bucket lock and thus caused the wrong bucket to be locked eventually. Fixes: 7e1e77636e36 ("lib: Resizable, Scalable, Concurrent Hash Table") Fixes: 97defe1ecf86 ("rhashtable: Per bucket locks & deferred expansion/shrinking") Reported-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: Avoid bucket cross reference after removalThomas Graf1-9/+17
During a resize, when two buckets in the larger table map to a single bucket in the smaller table and the new table has already been (partially) linked to the old table, removal of an element may result in the bucket in the larger table pointing to entries which all hash to a different value than the bucket index, thus causing two buckets to point to the same sub chain after unzipping. This is not illegal *during* the resize phase, but is after it has completed. Keep the old table around until all of the unzipping is done to allow the removal code to only search for matching hashed entries during this special period. Reported-by: Ying Xue <ying.xue@windriver.com> Fixes: 97defe1ecf86 ("rhashtable: Per bucket locks & deferred expansion/shrinking") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: Add more lock verificationThomas Graf1-2/+8
Catch hash miscalculations which result in hard to track down race conditions. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: Dump bucket tables on locking violation under PROVE_LOCKINGThomas Graf1-24/+75
This simplifies debugging of locking violations if compiled with CONFIG_PROVE_LOCKING. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: Wait for RCU readers after final unzip workThomas Graf1-0/+2
We need to wait for all RCU readers to complete after the last bit of unzipping has been completed. Otherwise the old table is freed up prematurely. Fixes: 7e1e77636e36 ("lib: Resizable, Scalable, Concurrent Hash Table") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: Use a single bucket lock for sibling bucketsThomas Graf1-101/+69
rhashtable currently allows using one bucket lock per bucket. This requires multiple levels of complicated nested locking because when resizing, a single bucket of the smaller table will map to two buckets in the larger table. So far rhashtable has explicitly locked both buckets in the larger table. By excluding the highest bit of the hash from the bucket lock map and thus only allowing locks to buckets in a ratio of 1:2, the locking can be simplified a lot without losing the benefits of multiple locks. Larger tables which benefit from multiple locks will not have a single lock per bucket anyway. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06rhashtable: key_hashfn() must return full hash valueThomas Graf1-7/+1
The value computed by key_hashfn() is used by rhashtable_lookup_compare() to traverse both tables during a resize. key_hashfn() must therefore return the hash value without the buckets mask applied so it can be masked to the size of each individual table. Fixes: 97defe1ecf86 ("rhashtable: Per bucket locks & deferred expansion/shrinking") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04rhashtable: Introduce rhashtable_walk_*Herbert Xu1-0/+163
Some existing rhashtable users get too intimate with it by walking the buckets directly. This prevents us from easily changing the internals of rhashtable. This patch adds the helpers rhashtable_walk_init/exit/start/next/stop which will replace these custom walkers. They are meant to be usable for both procfs seq_file walks as well as walking by a netlink dump. The iterator structure should fit inside a netlink dump cb structure, with at least one element to spare. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04rhashtable: Fix potential crash on destroy in rhashtable_shrinkHerbert Xu1-0/+4
The current being_destroyed check in rhashtable_expand is not enough since if we start a shrinking process after freeing all elements in the table that's also going to crash. This patch adds a being_destroyed check to the deferred worker thread so that we bail out as soon as we take the lock. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30rhashtable: Make selftest modularGeert Uytterhoeven1-205/+0
Allow the selftest on the resizable hash table to be built modular, just like all other tests that do not depend on DEBUG_KERNEL. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-26rhashtable: rhashtable_remove() must unlink in both tbl and future_tblThomas Graf1-9/+15
As removals can occur during resizes, entries may be referred to from both tbl and future_tbl when the removal is requested. Therefore rhashtable_remove() must unlink the entry in both tables if this is the case. The existing code did search both tables but stopped when it hit the first match. Failing to unlink in both tables resulted in use after free. Fixes: 97defe1ecf86 ("rhashtable: Per bucket locks & deferred expansion/shrinking") Reported-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16rhashtable: Fix race in rhashtable_destroy() and use regular work_structYing Xue1-6/+6
When we put our declared work task in the global workqueue with schedule_delayed_work(), its delay parameter is always zero. Therefore, we should define a regular work in the rhashtable structure instead of a delayed work. We also add a condition to check whether the resizing functions are NULL before cancelling the work, avoiding cancelling an uninitialized work. Lastly, ht->mutex is taken in rhashtable_destroy() while we wait with cancel_delayed_work() for all previously submitted work items to run to completion. cancel_delayed_work() doesn't return until all work items are done, and when a work item runs, its function - rht_deferred_worker() - is called. However, as rht_deferred_worker() also needs to acquire the lock, a deadlock can occur since the lock is already held. Moving the cancel call out of the lock-covered scope avoids the deadlock. Fixes: 97defe1 ("rhashtable: Per bucket locks & deferred expansion/shrinking") Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-14rhashtable: Lower/upper bucket may map to same lock while shrinkingThomas Graf1-3/+12
Each per bucket lock covers a configurable number of buckets. While shrinking, two buckets in the old table contain entries for a single bucket in the new table. We need to lock down both while linking. Check if they are protected by different locks to avoid a recursive lock. Fixes: 97defe1e ("rhashtable: Per bucket locks & deferred expansion/shrinking") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13rhashtable: involve rhashtable_lookup_compare_insert routineYing Xue1-2/+40
Introduce a new function called rhashtable_lookup_compare_insert() which is very similar to rhashtable_lookup_insert(). But the former makes use of the user's given compare function to look for an object, and then inserts the new object into the hash table only if the lookup finds no match. As the entire process of search and insertion is under the protection of the per bucket lock, this can help users avoid taking an extra lock. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08rhashtable: initialize atomic nelems variableYing Xue1-0/+1
Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08rhashtable: avoid unnecessary wakeup for worker queueYing Xue1-11/+7
Move the condition statements that verify whether the hash table size exceeds its maximum threshold or has reached its minimum threshold from the resizing functions to the resizing decision functions, avoiding unnecessary wakeups of the worker queue thread. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08rhashtable: future table needs to be traversed when remove an objectYing Xue1-2/+2
When removing an object from the hash table, we currently only traverse the old bucket table to check whether the object exists. If the object is not found in it, we will try again. But in the second search loop, we still search for the object in the old table instead of the future table. As a result, the object may not be removed from the hash table, especially when resizing is currently in progress and the object has just been saved in the future table. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08rhashtable: involve rhashtable_lookup_insert routineYing Xue1-15/+82
Introduce a new function called rhashtable_lookup_insert() which makes lookup and insertion atomic under bucket lock protection, helping us avoid introducing an extra lock when we search for and insert an object into the hash table. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08rhashtable: introduce rhashtable_wakeup_worker helper functionYing Xue1-8/+15
Introduce the rhashtable_wakeup_worker() helper function to reduce duplicated code where we wake up the worker. Additionally, as long as both the "future_tbl" and "tbl" bucket table pointers point to the same bucket array, we should try to wake up the resizing worker thread; otherwise, it indicates that the resize work is not finished yet. However, currently we wake up the worker thread only when the two pointers point to different bucket arrays. Obviously this is wrong. So the issue is fixed as well in this patch. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08rhashtable: optimize rhashtable_lookup routineYing Xue1-23/+18
Define an internal compare function and a matching compare argument, and then make use of rhashtable_lookup_compare() to look up the key in the hash table, reducing the code duplicated between rhashtable_lookup() and rhashtable_lookup_compare(). Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Supports for nulls markerThomas Graf1-7/+30
In order to allow for wider usage of rhashtable, use a special nulls marker to terminate each chain. The reason for not using the existing nulls_list is that the prev pointer usage would not be valid as entries can be linked in two different buckets at the same time. The 4 nulls base bits can be set through the rhashtable_params structure like this: struct rhashtable_params params = { [...] .nulls_base = (1U << RHT_BASE_SHIFT), }; This reduces the hash length from 32 bits to 27 bits. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Per bucket locks & deferred expansion/shrinkingThomas Graf1-114/+344
Introduces an array of spinlocks to protect bucket mutations. The number of spinlocks per CPU is configurable and selected based on the hash of the bucket. This allows for parallel insertions and removals of entries which do not share a lock. The patch also defers expansion and shrinking to a worker queue which allows insertion and removal from atomic context. Insertions and deletions may occur in parallel to it and are only held up briefly while the particular bucket is linked or unzipped. Mutations of the bucket table pointer are protected by a new mutex, read access is RCU protected. In the event of an expansion or shrinking, the new bucket table allocated is exposed as a so called future table as soon as the resize process starts. Lookups, deletions, and insertions will briefly use both tables. The future table becomes the main table after an RCU grace period and initial linking of the old to the new table was performed. Optimization of the chains to make use of the new number of buckets follows only once the new table is in use. The side effect of this is that during that RCU grace period, a bucket traversal using any rht_for_each() variant on the main table will not see any insertions performed during the RCU grace period which would at that point land in the future table. The lookup will see them as it searches both tables if needed. Having multiple insertions and removals occur in parallel requires nelems to become an atomic counter. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03nft_hash: Remove rhashtable_remove_pprev()Thomas Graf1-27/+7
The removal function of nft_hash currently stores a reference to the previous element during lookup which is used to optimize removal later on. This was possible because a lock is held across the calls to rhashtable_lookup() and rhashtable_remove(). With the introduction of deferred table resizing in parallel to lookups and insertions, the nftables lock will no longer synchronize all table mutations and the stored pprev may become invalid. Removing this optimization makes removal slightly more expensive on average but allows taking the resize cost out of the insert and remove paths. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
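With the cached pprev gone, removal in nft_hash reduces to the plain rhashtable API; a sketch, assuming the nft_set callback signatures of this era:

    static void nft_hash_remove(const struct nft_set *set,
                                const struct nft_set_elem *elem)
    {
            struct rhashtable *priv = nft_set_priv(set);

            /* rhashtable re-walks the bucket itself, which stays correct
             * while a deferred resize runs in parallel. */
            rhashtable_remove(priv, elem->cookie);
            synchronize_rcu();
            kfree(elem->cookie);
    }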
2015-01-03rhashtable: Factor out bucket_tail() functionThomas Graf1-9/+14
Subsequent patches will require access to the bucket tail. Access to the tail is relatively cheap as the automatic resizing of the table should keep the number of entries per bucket to no more than 0.75 on average. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
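A sketch of the factored-out helper, walking the next pointers until the end of the chain:

    static struct rhash_head __rcu **bucket_tail(struct bucket_table *tbl, u32 n)
    {
            struct rhash_head __rcu **pprev;

            /* Cheap on average: resizing keeps chains around 0.75 entries. */
            for (pprev = &tbl->buckets[n];
                 rht_dereference_bucket(*pprev, tbl, n);
                 pprev = &rht_dereference_bucket(*pprev, tbl, n)->next)
                    ;

            return pprev;
    }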
2015-01-03rhashtable: Convert bucket iterators to take table and indexThomas Graf1-11/+19
This patch is in preparation to introduce per bucket spinlocks. It extends all iterator macros to take the bucket table and bucket index. It also introduces a new rht_dereference_bucket() to handle protected accesses to buckets. It introduces a barrier() to the RCU iterators to prevent the compiler from caching the first element. The lockdep verifier is introduced as a stub which always succeeds and is properly implemented in the next patch when the locks are introduced. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
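A sketch of the new bucket dereference and the stub verifier, with names following the commit description:

    #ifdef CONFIG_PROVE_LOCKING
    int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
    {
            return 1; /* stub: always succeeds until per-bucket locks land */
    }
    #endif

    #define rht_dereference_bucket(p, tbl, hash) \
            rcu_dereference_protected(p, lockdep_rht_bucket_is_held(tbl, hash))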
2015-01-03rhashtable: Use rht_obj() instead of manual offset calculationThomas Graf1-2/+2
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Do hashing inside of rhashtable_lookup_compare()Thomas Graf1-61/+30
Hash the key inside of rhashtable_lookup_compare() like rhashtable_lookup() does. This allows the hashing functions to be simplified and kept private. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
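A sketch of the reworked entry point, assuming the caller holds rcu_read_lock(), a now-private key_hashfn() helper, and the iterator signature of this era:

    void *rhashtable_lookup_compare(const struct rhashtable *ht, const void *key,
                                    bool (*compare)(void *, void *), void *arg)
    {
            const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
            struct rhash_head *he;
            u32 hash;

            /* Hashing the key here keeps key_hashfn() out of the public API. */
            hash = key_hashfn(ht, key, ht->p.key_len);
            rht_for_each_rcu(he, tbl->buckets[hash], ht) {
                    if (!compare(rht_obj(ht, he), arg))
                            continue;
                    return rht_obj(ht, he);
            }

            return NULL;
    }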
2014-12-10net: replace remaining users of arch_fast_hash with jhashDaniel Borkmann1-4/+4
This patch effectively reverts commit 500f80872645 ("net: ovs: use CRC32 accelerated flow hash if available") and converts the other remaining arch_fast_hash() users, such as nfsd via commit 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") where it has been used as a hash function for bloom filtering. While we think that these users are actually not much of a concern, it has been requested to remove the arch_fast_hash() library bits that arose from [1] entirely, as per recent discussion [2]. The main argument is that using it as a hash may introduce bias due to its linearity (see avalanche criterion) and thus makes it less clear (though we tried to document that) when this security/performance trade-off is actually acceptable for a general purpose library function. Let's therefore avoid any further confusion on this matter and remove it to prevent any future accidental misuse of it. For the time being, this is going to make hashing of flow keys a bit more expensive in the ovs case, but future work could reevaluate a different hashing discipline. [1] https://patchwork.ozlabs.org/patch/299369/ [2] https://patchwork.ozlabs.org/patch/418756/ Cc: Neil Brown <neilb@suse.de> Cc: Francesco Fusco <fusco@ntop.org> Cc: Jesse Gross <jesse@nicira.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
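The mechanical part of such a conversion is small; a sketch with flow_hash() as a hypothetical caller:

    #include <linux/jhash.h>

    /* Before: arch_fast_hash(key, len, seed) - CRC32c-backed and linear,
     * hence weak on the avalanche criterion. jhash trades a little speed
     * for much better bit mixing. */
    static u32 flow_hash(const void *key, u32 len, u32 seed)
    {
            return jhash(key, len, seed);
    }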
2014-11-24rhashtable: Check for count mismatch while iterating in selftestThomas Graf1-7/+20
Verify that both the lock-protected and RCU-protected iterators see all test entries before and after expanding and shrinking have been performed. Also verify that the number of entries in the hashtable remains stable during expansion and shrinking. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-5/+5
Conflicts: drivers/net/ethernet/chelsio/cxgb4vf/sge.c drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c sge.c was overlapping two changes, one to use the new __dev_alloc_page() in net-next, and one to use s->fl_pg_order in net. ixgbe_phy.c was a set of overlapping whitespace changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13rhashtable: Drop gfp_flags arg in insert/remove functionsThomas Graf1-24/+17
Reallocation is only required for shrinking and expanding, both of which rely on a mutex for synchronization, and callers of rhashtable_init() are in non-atomic context. Therefore, there is no reason to continue passing allocation hints through the API. Instead, use GFP_KERNEL and add __GFP_NOWARN | __GFP_NORETRY to allow for a silent fallback to vzalloc() without the OOM killer jumping in, as pointed out by Eric Dumazet and Eric W. Biederman. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
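A sketch of the resulting allocation path, assuming a bucket_table_alloc() helper as in the file at the time:

    static struct bucket_table *bucket_table_alloc(size_t nbuckets)
    {
            struct bucket_table *tbl;
            size_t size;

            size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
            /* Fail fast and quietly, then fall back to vmalloc space
             * without waking the OOM killer. */
            tbl = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
            if (tbl == NULL)
                    tbl = vzalloc(size);
            if (tbl == NULL)
                    return NULL;

            tbl->size = nbuckets;
            return tbl;
    }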
2014-11-13rhashtable: Add parent argument to mutex_is_heldHerbert Xu1-2/+2
Currently mutex_is_held can only test locks that are global, since it takes no arguments. This prevents rhashtable from being used in places where locks are local, e.g., per-namespace locks. This patch adds a parent field to mutex_is_held and rhashtable_params so that local locks can be used (and tested). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
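A sketch of the changed callback and its wiring, with nl_sk_hash_lock standing in for a local lock; treat the field layout as illustrative:

    struct rhashtable_params {
            /* ... */
            int     (*mutex_is_held)(void *parent);
            void    *parent;        /* handed back to the callback */
    };

    static int lockdep_nl_sk_hash_is_held(void *parent)
    {
            /* A per-table lock can now be checked instead of a global one. */
            return lockdep_is_held(&nl_sk_hash_lock);
    }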
2014-11-13rhashtable: Move mutex_is_held under PROVE_LOCKINGHerbert Xu1-0/+8
The rhashtable function mutex_is_held is only used when PROVE_LOCKING is enabled. This patch makes the mutex_is_held field in rhashtable optional depending on PROVE_LOCKING. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
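A sketch of the effect on the params structure; callers only need to supply the callback when lock proving is enabled:

    struct rhashtable_params {
            /* ... */
    #ifdef CONFIG_PROVE_LOCKING
            int     (*mutex_is_held)(void);
    #endif
    };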
2014-11-13lib: rhashtable - Remove weird non-ASCII characters from commentsHerbert Xu1-5/+5
My editor spewed garbage that looked like memory corruption on my screen. It turns out that a number of occurrences of "fi" got turned into a ligature. This patch replaces these ligatures with the ASCII letters "fi". Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-4/+8
Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entries. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling of mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net. Adding support is almost trivial, so expect more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to the enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF can now load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, from Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. 
Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
2014-10-07Merge branch 'for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull "trivial tree" updates from Jiri Kosina: "Usual pile from trivial tree everyone is so eagerly waiting for" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Remove MN10300_PROC_MN2WS0038 mei: fix comments treewide: Fix typos in Kconfig kprobes: update jprobe_example.c for do_fork() change Documentation: change "&" to "and" in Documentation/applying-patches.txt Documentation: remove obsolete pcmcia-cs from Changes Documentation: update links in Changes Documentation: Docbook: Fix generated DocBook/kernel-api.xml score: Remove GENERIC_HAS_IOMAP gpio: fix 'CONFIG_GPIO_IRQCHIP' comments tty: doc: Fix grammar in serial/tty dma-debug: modify check_for_stack output treewide: fix errors in printk genirq: fix reference in devm_request_threaded_irq comment treewide: fix synchronize_rcu() in comments checkstack.pl: port to AArch64 doc: queue-sysfs: minor fixes init/do_mounts: better syntax description MIPS: fix comment spelling powerpc/simpleboot: fix comment ...
2014-10-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-4/+4
Conflicts: drivers/net/usb/r8152.c net/netfilter/nfnetlink.c Both r8152 and nfnetlink conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-4/+4
Pablo Neira Ayuso says: ==================== nf pull request for net This series contains netfilter fixes for net; they are: 1) Fix lockdep splat in nft_hash when releasing sets from the rcu_callback context. We don't hold the mutex there anymore. 2) Remove unnecessary spinlock_bh in the destroy path of the nf_tables rbtree set type from rcu_callback context. 3) Fix another lockdep splat in rhashtable. None of the callers hold a mutex when calling rhashtable_destroy. 4) Fix duplicated error reporting from nfnetlink when aborting and replaying a batch. 5) Fix a Kconfig issue reported by kbuild robot. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+0
Conflicts: arch/mips/net/bpf_jit.c drivers/net/can/flexcan.c Both the flexcan and MIPS bpf_jit conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-19lib: rhashtable: remove second linux/log2.h inclusionFabian Frederick1-1/+0
linux/log2.h was included twice. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-03lib/rhashtable: allow user to set the minimum shifts of shrinkingYing Xue1-4/+8
Although the rhashtable library allows the user to specify a fairly large size for the hash table they create, the table may be shrunk to a very small size - HASH_MIN_SIZE (4) - the first time an object is removed from it. Subsequently, even if the total number of objects stored in the table stays well below the user's initial setting for a long time, the hash table size is still dynamically adjusted by rhashtable_shrink() or rhashtable_expand() each time an object is inserted or removed. However, as synchronize_rcu() has to be called whenever the table is shrunk or expanded by these two functions, we should permit the user to set a minimum table size, by configuring the minimum number of shifts according to their specific requirements, thereby avoiding the expensive shrink and expand operations and their synchronize_rcu() calls. Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
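A usage sketch, assuming the new min_shift parameter; struct my_obj and the values are illustrative:

    struct my_obj {                         /* hypothetical hashed object */
            u32                     key;
            struct rhash_head       node;
    };

    struct rhashtable_params params = {
            .head_offset    = offsetof(struct my_obj, node),
            .key_offset     = offsetof(struct my_obj, key),
            .key_len        = sizeof(u32),
            .hashfn         = jhash,
            .min_shift      = 8,    /* never shrink below 2^8 = 256 buckets */
    };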
2014-09-03rhashtable: fix lockdep splat in rhashtable_destroy()Pablo Neira Ayuso1-4/+4
No need for rht_dereference() from rhashtable_destroy() since the existing callers don't hold the mutex when invoking this function from: 1) Netlink, this is called in case of memory allocation errors in the initialization path, no nl_sk_hash_lock is held. 2) Netfilter, this is called from the rcu callback, no nfnl_lock is held either. I think it's reasonable to assume that the caller has to make sure that no hash resizing may happen before releasing the bucket array. Therefore, the caller should be responsible for releasing this in a safe way, document this to make people aware of it. This resolves a rcu lockdep splat in nft_hash: =============================== [ INFO: suspicious RCU usage. ] 3.16.0+ #178 Not tainted ------------------------------- lib/rhashtable.c:596 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 1 lock held by ksoftirqd/2/18: #0: (rcu_callback){......}, at: [<ffffffff810918fd>] rcu_process_callbacks+0x27e/0x4c7 stack backtrace: CPU: 2 PID: 18 Comm: ksoftirqd/2 Not tainted 3.16.0+ #178 Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012 0000000000000001 ffff88011706bb68 ffffffff8143debc 0000000000000000 ffff880117062610 ffff88011706bb98 ffffffff81077515 ffff8800ca041a50 0000000000000004 ffff8800ca386480 ffff8800ca041a00 ffff88011706bbb8 Call Trace: [<ffffffff8143debc>] dump_stack+0x4e/0x68 [<ffffffff81077515>] lockdep_rcu_suspicious+0xfa/0x103 [<ffffffff81228b1b>] rhashtable_destroy+0x46/0x52 [<ffffffffa06f21a7>] nft_hash_destroy+0x73/0x82 [nft_hash] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Thomas Graf <tgraf@suug.ch>
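A sketch of the fixed function; the point is the plain pointer access in place of rht_dereference():

    void rhashtable_destroy(const struct rhashtable *ht)
    {
            /* The caller guarantees no resize runs concurrently, so a plain
             * access is safe and keeps the lockdep check out of the way. */
            bucket_table_free(ht->tbl);
    }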
2014-08-26lib: rhashtable: Spelling s/compuate/compute/Geert Uytterhoeven1-2/+2
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-08-14rhashtable: unexport and make rht_obj() staticThomas Graf1-7/+1
No need to export rht_obj(), all inner to outer object translations occur internally. It was intended to be used with rht_for_each() which now primarily serves as the iterator for rhashtable_remove_pprev() to effectively flush and free the full table. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14rhashtable: RCU annotations for next pointersThomas Graf1-1/+1
Properly annotate next pointers as access is RCU protected in the lookup path. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02lib: Resizable, Scalable, Concurrent Hash TableThomas Graf1-0/+797
Generic implementation of a resizable, scalable, concurrent hash table based on [0]. The implementation supports both fixed size keys, specified via an offset and length, and arbitrary keys via user-supplied hash and compare functions. Lookups are lockless and protected as RCU read side critical sections. Automatic growing/shrinking based on user configurable watermarks is available while allowing concurrent lookups to take place. Objects to be hashed must include a struct rhash_head. The reason for not using the existing struct hlist_head is that expansion and shrinking will have two buckets point to a single entry, which would lead to obscure reverse chaining behaviour. Code includes a boot selftest if CONFIG_TEST_RHASHTABLE is defined. [0] https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
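A minimal usage sketch under this first version of the API (insert and remove calls still took GFP flags at this point):

    struct test_obj {
            int                     value;  /* the key */
            struct rhash_head       node;
    };

    static struct rhashtable ht;

    static int example_init(void)
    {
            struct rhashtable_params params = {
                    .nelem_hint     = 8,
                    .head_offset    = offsetof(struct test_obj, node),
                    .key_offset     = offsetof(struct test_obj, value),
                    .key_len        = sizeof(int),
                    .hashfn         = jhash,
                    .max_shift      = 16,   /* cap the table at 2^16 buckets */
            };

            return rhashtable_init(&ht, &params);
    }

Objects are then inserted via their embedded node, e.g. rhashtable_insert(&ht, &obj->node, GFP_KERNEL), and looked up under rcu_read_lock().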