
In our analytics application, we parse URLs and store them in the database.

We parse the URLs using the urlparse module and store each component in a separate table.

# Python 2; in Python 3 this module is urllib.parse
from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme  :', parsed.scheme
print 'netloc  :', parsed.netloc
print 'path    :', parsed.path
print 'query   :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port    :', parsed.port
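For reference, that prints the following under Python 2:

SplitResult(scheme='http', netloc='user:pass@NetLoc:80', path='/path;parameters/path2;parameters2', query='query=argument', fragment='fragment')
scheme  : http
netloc  : user:pass@NetLoc:80
path    : /path;parameters/path2;parameters2
query   : query=argument
fragment: fragment
username: user
password: pass
hostname: netloc (netloc in lower case)
port    : 80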

Before inserting, we check whether the value already exists in the table; if it does, we skip the insert.
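A minimal sketch of that check with psycopg2 (the url_path table and path column are assumptions; the real schema may differ):

import psycopg2

conn = psycopg2.connect('dbname=analytics')  # hypothetical connection string
cur = conn.cursor()

def insert_if_absent(path):
    # Look for an existing row first; skip the insert if one is found
    cur.execute('SELECT 1 FROM url_path WHERE path = %s', (path,))
    if cur.fetchone() is None:
        cur.execute('INSERT INTO url_path (path) VALUES (%s)', (path,))
        conn.commit()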

It works fine except for the PATH table. The path components of our URLs are large (2000-3000 bytes), and it takes a long time to index/compare and insert each row.

Is there a better way to store a 2000-3000 byte field that needs to be compared?

2 Answers


Personally I would store a hash of the path component and/or the whole URL. Then for searches I'd check the hash.
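A minimal sketch of this in Python, assuming a hypothetical url_path table with an indexed path_hash column next to the full path:

import zlib
import psycopg2

conn = psycopg2.connect('dbname=analytics')  # hypothetical connection string
cur = conn.cursor()

def insert_if_absent(path):
    # crc32 is cheap and non-cryptographic; mask to get a stable unsigned value
    path_hash = zlib.crc32(path) & 0xffffffff
    # Check the small indexed hash first, then the full string to rule out collisions
    cur.execute('SELECT 1 FROM url_path WHERE path_hash = %s AND path = %s',
                (path_hash, path))
    if cur.fetchone() is None:
        cur.execute('INSERT INTO url_path (path_hash, path) VALUES (%s, %s)',
                    (path_hash, path))
        conn.commit()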


4 Comments

Good idea, but the computational overhead of hashing before comparing has to be considered, along with the extra hash column, which offers little business value on its own. That said, this probably requires the least code change.
The CPU time for hashing will be tiny. Don't use a cryptographic hash (like md5); use a cheap, fast hash. Take a look at github.com/markokr/pghashlib.
@Craig: I have a question: can hashing strings ever lead to a collision, i.e. two strings ending up with the same hash?
@user1050619 Yes, of course. You're reducing a potentially unlimited-length string to a fixed number of bits. That's why you check the hash and, if the hash matches, then compare the string itself. This greatly reduces the number of full-length string comparisons you have to do. That's how hash maps work.
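A toy illustration of why collisions must exist: an 8-bit hash has only 256 possible values, so by the pigeonhole principle distinct strings must eventually share one.

def tiny_hash(s):
    # Sum of character codes folded into 256 buckets; collisions are inevitable
    return sum(ord(c) for c in s) % 256

print tiny_hash('abc'), tiny_hash('cba')  # both print 38: same characters, same sum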

You can use jsonb with GIN or GiST indexing, depending on your dataset:

http://www.postgresql.org/docs/9.4/static/datatype-json.html

Basically, I would store each parsed part (scheme, host, port, etc.) separately in one jsonb value; that way every component you care about is indexed and searchable, and your comparisons can be quite efficient too.
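A minimal sketch of that with psycopg2, assuming PostgreSQL 9.4+ (table and index names are illustrative):

import json
import psycopg2
from urlparse import urlsplit

parsed = urlsplit('http://user:pass@NetLoc:80/path?query=argument#fragment')

conn = psycopg2.connect('dbname=analytics')  # hypothetical connection string
cur = conn.cursor()

cur.execute('CREATE TABLE IF NOT EXISTS urls (id serial PRIMARY KEY, parts jsonb)')
# A jsonb_path_ops GIN index supports the @> containment operator efficiently
cur.execute('CREATE INDEX urls_parts_idx ON urls USING gin (parts jsonb_path_ops)')

parts = {'scheme': parsed.scheme, 'host': parsed.hostname, 'port': parsed.port,
         'path': parsed.path, 'query': parsed.query}
cur.execute('INSERT INTO urls (parts) VALUES (%s::jsonb)', (json.dumps(parts),))

# Containment query: find stored URLs whose parts include this exact path
cur.execute('SELECT id FROM urls WHERE parts @> %s::jsonb',
            (json.dumps({'path': parsed.path}),))
conn.commit()

One jsonb row per URL would replace the per-component tables, so the duplicate check becomes a single indexed containment query.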
