68

I need to Convert Strings stored in my Database to a Numeric value. Result can be Integer (preferred) or Bigint. This conversion is to be done at Database side in a PL/pgSQL function.

Can someone please point me to some algorithm or any API's that can be used to achieve this?

I have been searching for this on Google for hours now, could not find anything useful so far :(

2
  • Dou you mean to convert "42" to 42 (char to int)? Commented Mar 21, 2012 at 17:37
  • 1
    No. I mean strings with Characters and Symbols. What I am looking for a good hashing algorithm that has very low Collision and can generate a hash in Int or Bigint data type. Commented Mar 21, 2012 at 17:49

5 Answers 5

82

Just keep the first 32 bits or 64 bits of the MD5 hash. Of course, it voids the main property of md5 (=the probability of collision being infinitesimal) but you'll still get a wide dispersion of values which presumably is good enough for your problem.

SQL functions derived from the other answers:

For bigint:

create function h_bigint(text) returns bigint as $$
 select ('x'||substr(md5($1),1,16))::bit(64)::bigint;
$$ language sql;

For int:

create function h_int(text) returns int as $$
 select ('x'||substr(md5($1),1,8))::bit(32)::int;
$$ language sql;
Sign up to request clarification or add additional context in comments.

3 Comments

What does the ('x'|| conditional do?
@MichaelYoo: the || operator is for string concatenation (only MySQL and programming languages inspired by C interpret || as a logical OR, it is not the standard interpretation in SQL).
I see. I think it's just to "unhex" the hex representation in to binary, which is then cast to bits, is that right? (because Postgres doesn't have unhex?) The examples I see always put the 'x' outside of quotes like x'1b1c2f', how does prepending a string literal 'x' work in this case? i.e. would CONCAT() also work?
37

PostgreSQL has hash functions for many column types. You can use hashtext if you want integer hash values, or hashtextextended if you prefer bigint hash values.

Note that hashXXXextended functions take one extra parameter for seed and 0 means that no seed will be used.

In the commit message the author Robert Haas says:

Just in case somebody wants a 64-bit hash value that is compatible with the existing 32-bit hash values, make the low 32-bits of the 64-bit hash value match the 32-bit hash value when the seed is 0.

Example

postgres=# SELECT hashtextextended('test string of type text', 0);
   hashtextextended
----------------------
 -6578719834206879717
(1 row)

postgres=# SELECT hashtext('test string of type text');
  hashtext
-------------
 -1790427109
(1 row)

What about other data types?

You can check all available hash functions on your backend by checking the output for \df hash*. Below you can see the functions available in PG 14.0.

hanefi=# \df hash*
                                     List of functions
   Schema   |           Name           | Result data type |   Argument data types    | Type
------------+--------------------------+------------------+--------------------------+------
 pg_catalog | hash_aclitem             | integer          | aclitem                  | func
 pg_catalog | hash_aclitem_extended    | bigint           | aclitem, bigint          | func
 pg_catalog | hash_array               | integer          | anyarray                 | func
 pg_catalog | hash_array_extended      | bigint           | anyarray, bigint         | func
 pg_catalog | hash_multirange          | integer          | anymultirange            | func
 pg_catalog | hash_multirange_extended | bigint           | anymultirange, bigint    | func
 pg_catalog | hash_numeric             | integer          | numeric                  | func
 pg_catalog | hash_numeric_extended    | bigint           | numeric, bigint          | func
 pg_catalog | hash_range               | integer          | anyrange                 | func
 pg_catalog | hash_range_extended      | bigint           | anyrange, bigint         | func
 pg_catalog | hash_record              | integer          | record                   | func
 pg_catalog | hash_record_extended     | bigint           | record, bigint           | func
 pg_catalog | hashbpchar               | integer          | character                | func
 pg_catalog | hashbpcharextended       | bigint           | character, bigint        | func
 pg_catalog | hashchar                 | integer          | "char"                   | func
 pg_catalog | hashcharextended         | bigint           | "char", bigint           | func
 pg_catalog | hashenum                 | integer          | anyenum                  | func
 pg_catalog | hashenumextended         | bigint           | anyenum, bigint          | func
 pg_catalog | hashfloat4               | integer          | real                     | func
 pg_catalog | hashfloat4extended       | bigint           | real, bigint             | func
 pg_catalog | hashfloat8               | integer          | double precision         | func
 pg_catalog | hashfloat8extended       | bigint           | double precision, bigint | func
 pg_catalog | hashhandler              | index_am_handler | internal                 | func
 pg_catalog | hashinet                 | integer          | inet                     | func
 pg_catalog | hashinetextended         | bigint           | inet, bigint             | func
 pg_catalog | hashint2                 | integer          | smallint                 | func
 pg_catalog | hashint2extended         | bigint           | smallint, bigint         | func
 pg_catalog | hashint4                 | integer          | integer                  | func
 pg_catalog | hashint4extended         | bigint           | integer, bigint          | func
 pg_catalog | hashint8                 | integer          | bigint                   | func
 pg_catalog | hashint8extended         | bigint           | bigint, bigint           | func
 pg_catalog | hashmacaddr              | integer          | macaddr                  | func
 pg_catalog | hashmacaddr8             | integer          | macaddr8                 | func
 pg_catalog | hashmacaddr8extended     | bigint           | macaddr8, bigint         | func
 pg_catalog | hashmacaddrextended      | bigint           | macaddr, bigint          | func
 pg_catalog | hashname                 | integer          | name                     | func
 pg_catalog | hashnameextended         | bigint           | name, bigint             | func
 pg_catalog | hashoid                  | integer          | oid                      | func
 pg_catalog | hashoidextended          | bigint           | oid, bigint              | func
 pg_catalog | hashoidvector            | integer          | oidvector                | func
 pg_catalog | hashoidvectorextended    | bigint           | oidvector, bigint        | func
 pg_catalog | hashtext                 | integer          | text                     | func
 pg_catalog | hashtextextended         | bigint           | text, bigint             | func
 pg_catalog | hashtid                  | integer          | tid                      | func
 pg_catalog | hashtidextended          | bigint           | tid, bigint              | func
 pg_catalog | hashvarlena              | integer          | internal                 | func
 pg_catalog | hashvarlenaextended      | bigint           | internal, bigint         | func
(47 rows)

Caveats

If you want to have consistent hashes across different systems, make sure you have the same collation behaviour.

The built-in collatable data types are text, varchar, and char. If you have different collation options, you can see different hash values. For example the sort order of the strings ‘a-a’ and ‘a+a’ flipped in glibc 2.28 (Debian 10, RHEL 8) compared to earlier releases.

If you wish to use the hashes on the same machine, you need not worry, as long as you do not update glibc or use a different collation.

See more details at: https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/

2 Comments

If you want to combine several values into a single hash, you can chain them by making one hash the seed for the next: hashfloat8extended(1.0, hashfloat8(2.0))
according to this discussion postgresql.org/message-id/… hash**** family of functions are for internal use and might change in future versions producing different hashes for the same input. imo its risky to rely on them in a production system.
20

You can create a md5 hash value without problems:

select md5('hello, world');

This returns a string with a hex number.

Unfortunately there is no built-in function to convert hex to integer but as you are doing that in PL/pgSQL anyway, this might help:

https://stackoverflow.com/a/8316731/330315

1 Comment

Output of md5 will have substringed to avoid Integer Out of range error.
5

Must it be an integer? The pg_crypto module provides a number of standard hash functions (md5, sha1, etc). They all return bytea. I suppose you could throw away some bits and convert bytea to integer.

bigint is too small to store a cryptographic hash. The largest non-bytea binary type Pg supports is uuid. You could cast a digest to uuid like this:

select ('{'||encode( substring(digest('foobar','sha256') from 1 for 16), 'hex')||'}')::uuid;
                 uuid                 
--------------------------------------
 c3ab8ff1-3720-e8ad-9047-dd39466b3c89

Comments

2

This is an implementation of Java's String.hashCode():

CREATE OR REPLACE FUNCTION hashCode(_string text) RETURNS INTEGER AS $$
DECLARE
  val_ CHAR[];
  h_ INTEGER := 0;
  ascii_ INTEGER;
  c_ char;
BEGIN
  val_ = regexp_split_to_array(_string, '');

  FOR i in 1 .. array_length(val_, 1)
  LOOP
    c_ := (val_)[i];
    ascii_ := ascii(c_);
    h_ = 31 * h_ + ascii_;
    raise info '%: % = %', i, c_, h_;
  END LOOP;
RETURN h_;
END;
$$ LANGUAGE plpgsql;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.