Python string split

Question

What would be the best way to split this in Python into its parts (address, city, state, zip)?

<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>

In some cases the zip code has only five digits:

 <div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>

It's easy to give an approach for this specific example, but you'll get a better answer if you explain what kinds of addresses you need to handle, and what assumptions you're making. For example: Can there be more than one address line? Do you have to handle international addresses? Is it possible for state/zip to be on different lines? Will <br /> be the only way to separate lines?
– Owen S., Commented Jul 29, 2010 at 5:17

Alex Martelli · Accepted Answer · 2010-07-29 05:14:11Z

3

Depending on how tight or lax you want to be on various aspects that can't be deduced from a single example, something like the following should work:

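import re

# Sample input from the question; in practice this is the scraped <div> string.
thestring = '<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'

# Groups: street address, city, two-letter state, ZIP+4.
s = re.compile(r'^<div.*?>([^<]+)<br.*?>([^,]+), (\w\w) (\d{5}-\d{4})</div>$')
mo = s.match(thestring)
if mo is None:
    raise ValueError('No match for %r' % thestring)
address, city, state, zip = mo.groups()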



5 Comments

teepark Over a year ago

obligatory: stackoverflow.com/questions/1732348/…

bobsr Over a year ago

This works fine, but if the zip is only 5 digits, it throws an error.

cji Over a year ago

replace (\d{5}-\d{4}) at the end of regexp with (\d{5}-\d{4}|\d{5})

Alex Martelli Over a year ago

@bobsr, I do explain it depends on how strict or lax you want to be on conditions that are impossible to deduce from a single example -- if the only variation (as per your edit) is in the zipcode, @cji's solution fixes that; if there's more variation, you need more and more tweaks.
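For reference, here is a minimal sketch of the answer's pattern with @cji's alternation applied. The function name `split_address` is hypothetical, and the `.strip()` call is an added assumption so that stray whitespace around the fragment (as in the question's second sample) does not defeat the `^` anchor.

```python
import re

# Same pattern as in the answer, with the last group widened to accept
# either ZIP+4 (60634-3225) or a plain five-digit ZIP (60634).
ADDRESS_RE = re.compile(
    r'^<div.*?>([^<]+)<br.*?>([^,]+), (\w\w) (\d{5}-\d{4}|\d{5})</div>$'
)


def split_address(thestring):
    """Return (address, city, state, zip_code) or raise ValueError."""
    # Assumption: surrounding whitespace is not significant.
    mo = ADDRESS_RE.match(thestring.strip())
    if mo is None:
        raise ValueError('No match for %r' % thestring)
    return mo.groups()


print(split_address('<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'))
print(split_address(' <div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>'))
```

The longer alternative is listed first so that a ZIP+4 is tried before the five-digit form. Any other variation in the markup will still raise `ValueError`.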

Alex Martelli Over a year ago

And BTW, I agree that the parsing of the tags themselves could best be done by an HTML parser, but if you do that you still need further processing for the city / state / zip separation.
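A rough sketch of that two-step idea, using the standard library's `html.parser` (swapped in here so the example has no third-party dependency) and plain string splitting for the second line. The names `TextChunks` and `parse_address_div` are hypothetical, and the code assumes exactly two text lines in the form "City, ST ZIP"; it does not validate the zip format at all.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the non-empty text nodes of a fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def parse_address_div(fragment):
    """Return (address, city, state, zip_code) or raise ValueError."""
    parser = TextChunks()
    parser.feed(fragment)
    parser.close()
    # Assumption: one street line, then one "City, ST ZIP" line.
    if len(parser.chunks) != 2:
        raise ValueError('expected two text lines, got %r' % parser.chunks)
    address, locality = parser.chunks
    city, sep, rest = locality.partition(', ')
    if not sep:
        raise ValueError('no ", " separator in %r' % locality)
    # Unpacking raises ValueError unless there are exactly two fields.
    state, zip_code = rest.split()
    return address, city, state, zip_code


print(parse_address_div('<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>'))
```

Because the zip is taken as whatever follows the state, both the five-digit and ZIP+4 forms pass through unchanged; stricter checks would have to be added on top.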

Community · Accepted Answer · 2017-05-23 10:32:43Z

0

Just a hint: there are much better ways to parse HTML than regular expressions, for example Beautiful Soup.

Here's why you shouldn't do that with regular expressions.

EDIT: Oh well, @teepark linked it first. :)


answered Jul 29, 2010 at 5:22

thevilledev


SiggyF · Accepted Answer · 2010-07-29 05:34:35Z

0

Combining Beautiful Soup and regular expressions should give you something like:

import re
from bs4 import BeautifulSoup  # Beautiful Soup 4; the original used the old BeautifulSoup 3 module

thestring = r'<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'
re0 = re.compile(r'(?P<address>[^<]+)')
re1 = re.compile(r'(?P<city>[^,]+), (?P<state>\w\w) (?P<zip>\d{5}-\d{4})')
soup = BeautifulSoup(thestring, 'html.parser')
# contents[0] is the street line, contents[1] the <br /> tag, contents[2] the city line
(address,) = re0.search(soup.div.contents[0]).groups()
city, state, zip = re1.search(soup.div.contents[2]).groups()


1 Comment

bobsr Over a year ago

This works fine, but if the zip is only 5 digits (60634), it throws an error.
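One possible adjustment, shown here only for the city line: make the "-dddd" suffix optional inside the named `zip` group. This is a sketch rather than the answerer's code; the helper name `parse_locality` is hypothetical, and it assumes the text of the second line has already been extracted from the div.

```python
import re

# Like re1 in the answer, but the ZIP+4 suffix is optional.
LOCALITY_RE = re.compile(
    r'(?P<city>[^,]+), (?P<state>\w\w) (?P<zip>\d{5}(?:-\d{4})?)'
)


def parse_locality(line):
    """Return a dict with 'city', 'state' and 'zip', or raise ValueError."""
    mo = LOCALITY_RE.fullmatch(line.strip())
    if mo is None:
        raise ValueError('No match for %r' % line)
    return mo.groupdict()


print(parse_locality('Chicago, IL 60634'))
print(parse_locality('Chicago, IL 60634-3225'))
```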
