2

What would be the best way to split this in Python into (address, city, state, zip)?

<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>

In some cases the zip code has only five digits:

 <div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>
1
  • 1
    It's easy to give an approach for this specific example, but you'll get a better answer if you explain what kinds of addresses you need to handle, and what assumptions you're making. For example: Can there be more than one address line? Do you have to handle international addresses? Is it possible for state/zip to be on different lines? Will <br /> be the only way to separate lines? Etc. etc. Commented Jul 29, 2010 at 5:17

3 Answers

3

Depending on how tight or lax you want to be on various aspects that can't be deduced from a single example, something like the following should work...:

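import re

# Sample input from the question.
thestring = '<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'

# Groups: street address, city, two-letter state, ZIP+4.
s = re.compile(r'^<div.*?>([^<]+)<br.*?>([^,]+), (\w\w) (\d{5}-\d{4})</div>$')
mo = s.match(thestring)
if mo is None:
    raise ValueError('No match for %r' % thestring)
address, city, state, zip = mo.groups()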

5 Comments

This works fine, but if the zip is only 5 digits, it throws an error.
Replace (\d{5}-\d{4}) at the end of the regexp with (\d{5}-\d{4}|\d{5}).
@bobsr, I do explain it depends on how strict or lax you want to be on conditions that are impossible to deduce from a single example -- if the only variation (as per your edit) is in the zipcode, @cji's solution fixes that; if there's more variation, you need more and more tweaks.
And BTW, I agree that the parsing of the tags themselves could best be done by an HTML parser, but if you do that you still need further processing for the city / state / zip separation.
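
To make the zip fix from the comments concrete, here is a minimal sketch that accepts both sample inputs from the question. The names `ADDRESS_RE` and `split_address` are hypothetical, and the zip group `(\d{5}(?:-\d{4})?)` is just a shorter spelling of the suggested `(\d{5}-\d{4}|\d{5})`. It still assumes one street line, a `<br>` separator, and a two-letter state, so other layouts would need further tweaks.

```python
import re

# Hypothetical pattern name. Groups: street address, city, state, zip.
# The zip group matches a 5-digit zip with an optional "-NNNN" suffix.
ADDRESS_RE = re.compile(
    r'^\s*<div[^>]*>([^<]+)<br\s*/?>([^,]+), (\w\w) (\d{5}(?:-\d{4})?)</div>\s*$'
)


def split_address(html):
    """Return (address, city, state, zip_code) or raise ValueError."""
    mo = ADDRESS_RE.match(html)
    if mo is None:
        raise ValueError('No match for %r' % html)
    return mo.groups()


if __name__ == '__main__':
    print(split_address('<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'))
    print(split_address('<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>'))
```

Both calls should return a 4-tuple; only the last element differs between the two inputs.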
0

Just a hint: there are much better ways to parse HTML than regular expressions, for example Beautiful Soup.

Here's why you shouldn't do that with regular expressions.

EDIT: Oh well, @teepark linked it first. :)


0

Combining Beautiful Soup and regular expressions should give you something like:

import re
from bs4 import BeautifulSoup  # Beautiful Soup 4 (the bs4 package)

thestring = r'<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'
re0 = re.compile(r'(?P<address>[^<]+)')
re1 = re.compile(r'(?P<city>[^,]+), (?P<state>\w\w) (?P<zip>\d{5}-\d{4})')
soup = BeautifulSoup(thestring, 'html.parser')
# soup.div.contents is [address text, <br/> tag, "city, state zip" text]
(address,) = re0.search(soup.div.contents[0]).groups()
city, state, zip = re1.search(soup.div.contents[2]).groups()

1 Comment

This works fine, but if the zip is only 5 digits (60634), it throws an error.
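
One way to handle the 5-digit case while still leaving the tags to an HTML parser is sketched below. It swaps in the standard library's `html.parser` in place of Beautiful Soup so that it runs without extra packages, and the names `TextChunks`, `CITY_STATE_ZIP`, and `parse_address` are hypothetical. It assumes the div contains exactly two text lines: the street address, then "city, state zip".

```python
import re
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment, in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# City, two-letter state, then a 5-digit zip with an optional "-NNNN" suffix.
CITY_STATE_ZIP = re.compile(r'^([^,]+), (\w\w) (\d{5}(?:-\d{4})?)$')


def parse_address(html):
    """Return (address, city, state, zip_code) or raise ValueError."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    if len(parser.chunks) != 2:
        raise ValueError('Expected two text lines in %r' % html)
    address, rest = parser.chunks
    mo = CITY_STATE_ZIP.match(rest)
    if mo is None:
        raise ValueError('No city/state/zip match for %r' % rest)
    city, state, zip_code = mo.groups()
    return address, city, state, zip_code


if __name__ == '__main__':
    print(parse_address('<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>'))
```

The parser only splits the markup into text lines; the regular expression is still what separates city, state, and zip.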
