In our analytics application, we parse URLs and store them in a database.
We parse the URLs with the urlparse module and store each component in a separate table.
# Python 2; urlsplit breaks the URL into its named components
from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port
Before inserting a value, we check whether it is already in the table; if it is, we skip the insert.
This works fine except for the PATH table. The path component of our URLs is large (2000-3000 bytes), so indexing, comparing, and inserting rows takes a lot of time.
Is there a better way to store a 2000-3000 byte field that needs to be compared?
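One technique that is often suggested for this kind of problem (an approach I am assuming here, not something described above) is to store a fixed-length hash of the path next to the full value and compare/index on the hash instead of the long text. A 20-byte SHA-1 digest is much cheaper to index than a 2000-3000 byte string. The schema, table name, and helper below are hypothetical, sketched with sqlite3 only so the example is self-contained:

```python
import hashlib
import sqlite3

# Hypothetical schema: the full path plus a fixed-length SHA-1 digest.
# The UNIQUE index on the 20-byte digest replaces an index on the
# 2000-3000 byte path column.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE path (digest BLOB UNIQUE, path TEXT)')

def insert_if_new(path):
    # Hash the long path down to 20 bytes.
    digest = hashlib.sha1(path.encode('utf-8')).digest()
    # The duplicate check now compares 20 bytes, not the whole path.
    row = conn.execute('SELECT 1 FROM path WHERE digest = ?',
                       (digest,)).fetchone()
    if row is None:
        conn.execute('INSERT INTO path (digest, path) VALUES (?, ?)',
                     (digest, path))

insert_if_new('/path;parameters/path2;parameters2')
insert_if_new('/path;parameters/path2;parameters2')  # duplicate, skipped
```

The full path is still stored for retrieval; only the lookup goes through the digest. If hash collisions are a concern, the check can re-compare the full path on a digest match, which happens rarely enough not to affect overall speed.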