According to the Java String source code, the hash implementation is:
public int hashCode()
{
    if (cachedHashCode != 0)
        return cachedHashCode;
    // Compute the hash code using a local variable to be reentrant.
    int hashCode = 0;
    int limit = count + offset;
    for (int i = offset; i < limit; i++)
        hashCode = hashCode * 31 + value[i];
    return cachedHashCode = hashCode;
}
You can port this to Python (without caching):
class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = hashCode * 31 + ord(char)
        return hashCode
>>> j = JavaHashStr("abcd")
>>> hash(j)
2987074 # same as Java
>>> j = JavaHashStr("abcdef")
>>> hash(j)
2870581347 # Java: -1424385949
Note that Python ints do not overflow like Java's 32-bit int, so this is wrong whenever the hash leaves the signed 32-bit range. You have to simulate the overflow (Update: thanks to @PresidentJamesK.Polk for the improved version, SO thread on the topic):
class JavaHashStr(str):
    def __hash__(self):
        hashCode = 0
        for char in self:
            hashCode = (hashCode * 31 + ord(char)) & (2**32 - 1)  # unsigned
        if hashCode & 2**31:
            hashCode -= 2**32  # make it signed
        return hashCode
Now, even overflowing hashes behave the same:
>>> j = JavaHashStr("abc")
>>> hash(j)
96354
>>> j = JavaHashStr("abcdef")
>>> hash(j)
-1424385949 # Java hash for "abcdef"
This can still be off for characters from the higher Unicode planes, such as emojis, which Java stores as two UTF-16 code units. For the most common punctuation and Latin-based characters, it should work.
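One way to cover those characters is to iterate over UTF-16 code units, as Java's char does, instead of Python code points. A minimal sketch, assuming a standalone helper called java_string_hash (a hypothetical name, not part of the original answer):

```python
def java_string_hash(s: str) -> int:
    """Hypothetical helper: mimic Java's String.hashCode() on UTF-16 code units."""
    # Big-endian UTF-16 without a BOM gives two bytes per Java char.
    # "surrogatepass" lets lone surrogates through instead of raising.
    data = s.encode("utf-16-be", "surrogatepass")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]   # one 16-bit code unit
        h = (h * 31 + unit) & 0xFFFFFFFF      # wrap like a 32-bit int
    # Reinterpret the unsigned 32-bit value as signed.
    return h - 0x100000000 if h & 0x80000000 else h


print(java_string_hash("abcdef"))  # -1424385949, matching the value above
```

Using a plain function rather than __hash__ also avoids the post-processing that hash() applies to the result; CPython, for example, reports -2 when __hash__ returns -1.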
String.hashCode in Python? hash(obj) for any object in Python. Mutable types (dict, list, etc.) are, however, unhashable. Custom classes can define the behavior by implementing a __hash__ method. The default for custom classes is based on object identity (memory location); in CPython this typically gives hash(obj) == hash(id(obj)//16).
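These rules can be checked directly. The sketch below assumes a made-up Point class and an is_hashable helper, both introduced here only for illustration:

```python
class Point:
    """Hypothetical value type used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash the same fields that __eq__ compares.
        return hash((self.x, self.y))


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable("abc"))                       # True: immutable built-in
print(is_hashable([1, 2, 3]))                   # False: mutable built-in
print(hash(Point(1, 2)) == hash(Point(1, 2)))   # True: custom __hash__
```

Defining __eq__ together with __hash__ keeps equal objects in the same hash bucket, which is what sets and dict keys rely on.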