
In our analytics application, we parse URLs and store them in the database.

We parse the URLs using the urlparse module and store each component in a separate table.

# Python 2; in Python 3 this module is urllib.parse
from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme  :', parsed.scheme
print 'netloc  :', parsed.netloc
print 'path    :', parsed.path
print 'query   :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port    :', parsed.port
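For reference, that prints the following under Python 2:

SplitResult(scheme='http', netloc='user:pass@NetLoc:80', path='/path;parameters/path2;parameters2', query='query=argument', fragment='fragment')
scheme  : http
netloc  : user:pass@NetLoc:80
path    : /path;parameters/path2;parameters2
query   : query=argument
fragment: fragment
username: user
password: pass
hostname: netloc (netloc in lower case)
port    : 80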

Before inserting, we check whether the value already exists in the table; if it does, we skip the insert.
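A minimal sketch of that check with psycopg2 (the url_path table and path column are assumptions; the real schema may differ):

import psycopg2

conn = psycopg2.connect('dbname=analytics')  # hypothetical connection string
cur = conn.cursor()

def insert_if_absent(path):
    # Look for an existing row first; skip the insert if one is found
    cur.execute('SELECT 1 FROM url_path WHERE path = %s', (path,))
    if cur.fetchone() is None:
        cur.execute('INSERT INTO url_path (path) VALUES (%s)', (path,))
        conn.commit()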

It works fine except for the PATH table. The path components of our URLs are large (2000-3000 bytes), and it takes a long time to index/compare and insert each row.

Is there a better way to store a 2000-3000 byte field that needs to be compared?

2 Answers


Personally I would store a hash of the path component and/or the whole URL. Then for searches I'd check the hash.
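A minimal sketch of this in Python, assuming a hypothetical url_path table with an indexed path_hash column next to the full path:

import zlib
import psycopg2

conn = psycopg2.connect('dbname=analytics')  # hypothetical connection string
cur = conn.cursor()

def insert_if_absent(path):
    # crc32 is cheap and non-cryptographic; mask to get a stable unsigned value
    path_hash = zlib.crc32(path) & 0xffffffff
    # Check the small indexed hash first, then the full string to rule out collisions
    cur.execute('SELECT 1 FROM url_path WHERE path_hash = %s AND path = %s',
                (path_hash, path))
    if cur.fetchone() is None:
        cur.execute('INSERT INTO url_path (path_hash, path) VALUES (%s, %s)',
                    (path_hash, path))
        conn.commit()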


4 Comments

Good idea, but the computational overhead of hashing before comparing has to be considered, along with the extra hash column, which offers little business value on its own. That said, this probably requires the least code change.
The CPU time for hashing will be tiny. Don't use a cryptographic hash (like md5); use a cheap, fast hash. Take a look at github.com/markokr/pghashlib.
@Craig: I have a question: can hashing strings ever lead to a collision, i.e. two strings ending up with the same hash?
@user1050619 Yes, of course. You're reducing a potentially unlimited-length string to a fixed number of bits. That's why you check the hash and, if the hash matches, then compare the string itself. This greatly reduces the number of full-length string comparisons you have to do. That's how hash maps work.
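A toy illustration of why collisions must exist: an 8-bit hash has only 256 possible values, so by the pigeonhole principle distinct strings must eventually share one.

def tiny_hash(s):
    # Sum of character codes folded into 256 buckets; collisions are inevitable
    return sum(ord(c) for c in s) % 256

print tiny_hash('abc'), tiny_hash('cba')  # both print 38: same characters, same sum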

You can use jsonb with GIN or GiST indexing, depending on your dataset:

http://www.postgresql.org/docs/9.4/static/datatype-json.html

Basically, I would store each parsed part (scheme, host, port, etc.) separately in one jsonb value; that way every component you care about is indexed and searchable, and your comparisons can be quite efficient too.
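A minimal sketch of that with psycopg2, assuming PostgreSQL 9.4+ (table and index names are illustrative):

import json
import psycopg2
from urlparse import urlsplit

parsed = urlsplit('http://user:pass@NetLoc:80/path?query=argument#fragment')

conn = psycopg2.connect('dbname=analytics')  # hypothetical connection string
cur = conn.cursor()

cur.execute('CREATE TABLE IF NOT EXISTS urls (id serial PRIMARY KEY, parts jsonb)')
# A jsonb_path_ops GIN index supports the @> containment operator efficiently
cur.execute('CREATE INDEX urls_parts_idx ON urls USING gin (parts jsonb_path_ops)')

parts = {'scheme': parsed.scheme, 'host': parsed.hostname, 'port': parsed.port,
         'path': parsed.path, 'query': parsed.query}
cur.execute('INSERT INTO urls (parts) VALUES (%s::jsonb)', (json.dumps(parts),))

# Containment query: find stored URLs whose parts include this exact path
cur.execute('SELECT id FROM urls WHERE parts @> %s::jsonb',
            (json.dumps({'path': parsed.path}),))
conn.commit()

One jsonb row per URL would replace the per-component tables, so the duplicate check becomes a single indexed containment query.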
