Regular Expression in Python

Question

I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain-name-related lines and then join them together.

The string that comes back from Enom looks somewhat like this:

SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1

I'd like to build a list from that which looks like this:

[domain1.com, domain2.org, domain3.co.uk, domain4.net]

To find the different domain name components I've tried the following (where "enom" is the string above), but it only returns the matched "SLD" and "TLD" prefixes, not the values.

re.findall("^.*(SLD|TLD).*$", enom, re.M)

zenpoy · Accepted Answer · 2013-05-26 13:35:54Z

6

Edit: Every time I see a question asking for a regular expression solution, I have this bizarre urge to try to solve it without regular expressions. Most of the time that is more efficient than using a regex, but I encourage the OP to test which of the solutions is the most efficient.

Here is the naive approach:

a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

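b = a.split("\n")
# Keep the value after "=" on every line except the TLDOverride markers.
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
# The values alternate SLD, TLD, so join them in consecutive pairs.
for x in range(0, len(c), 2):
    print(".".join(c[x:x+2]))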

>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
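
The suggestion to test which solution is faster can be tried with the standard timeit module. The following is only a sketch: the function names split_based and regex_based are made up for this example, the regex variant is just one possible pattern, and the timings will depend on the machine and on the size of the input.

```python
import re
import timeit

ENOM = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""


def split_based(text):
    # Hypothetical helper: the split approach from this answer.
    values = [line.split("=")[1] for line in text.split("\n")
              if line != "TLDOverride=1"]
    return [".".join(values[i:i + 2]) for i in range(0, len(values), 2)]


# One pattern that captures an SLD line and the TLD line right after it.
PAIR = re.compile(r"^SLD\d+=(.*)\nTLD\d+=(.*)$", re.M)


def regex_based(text):
    # Hypothetical helper: a regex alternative, for comparison only.
    return [sld + "." + tld for sld, tld in PAIR.findall(text)]


for func in (split_based, regex_based):
    seconds = timeit.timeit(lambda: func(ENOM), number=1000)
    print(func.__name__, seconds)
```

Both helpers should return the same list for the sample input, so the comparison measures speed only.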

edited May 26, 2013 at 13:35

answered May 26, 2013 at 12:23

zenpoy

1 Comment

SethMMorton Over a year ago

I agree! Python's string manipulation functions are so powerful that I feel banging one's head against the wall with regex is often unnecessary. I rarely use import re anymore.

mata · Accepted Answer · 2013-05-26 16:15:15Z

4

You have a capturing group in your expression. re.findall documentation says:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

That's why only the content of the capturing group is returned.

Try:

re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)

This would return a list of tuples:

[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]

Combining SLDs and TLDs is then up to you.
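
One possible way to do that combining, as a sketch only: walk the (key, value) tuples in order and pair each SLD value with the next TLD value. The helper name combine is hypothetical, and the approach assumes every SLD line is followed by its TLD line, as in the sample.

```python
import re

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

# Same pattern as in the answer: two groups, so findall returns tuples.
pairs = re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)


def combine(pairs):
    # Hypothetical helper: remember the last SLD and join it with the next TLD.
    domains = []
    sld = None
    for key, value in pairs:
        if key.startswith("SLD"):
            sld = value
        elif sld is not None:
            domains.append(sld + "." + value)
            sld = None
    return domains


print(combine(pairs))
```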

edited May 26, 2013 at 16:15

answered May 26, 2013 at 12:23

mata

Qiang Jin · Accepted Answer · 2013-05-26 12:25:28Z

3

This works for your example:

>>> sld_list = re.findall("^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall("^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> [sld + '.' + tld for sld, tld in zip(sld_list, tld_list)]
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']

answered May 26, 2013 at 12:25

Qiang Jin

kirelagin · Accepted Answer · 2013-05-26 13:26:26Z

3

I'm not sure why you are talking about regular expressions. I mean, why don't you just run a for loop?

A famous quote seems to be appropriate here:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

domains = []
components = []
for line in enom.split('\n'):
    k, v = line.split('=')
    if k == 'TLDOverride':
        continue  # not part of a domain name
    components.append(v)
    # A TLD line completes the current domain.
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []

P.S. I'm not sure what this TLDOverride is, so the code just ignores it.
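
Note that the loop raises ValueError on a blank line or on a value that itself contains "=". A slightly more defensive variant, offered only as a sketch (the function name parse_domains is made up here, and skipping lines without "=" is an assumption about what the caller wants):

```python
def parse_domains(enom):
    # Hypothetical helper: same loop, but tolerant of blank or odd lines.
    domains = []
    components = []
    for line in enom.splitlines():
        # partition splits on the first "=" only and never raises.
        key, sep, value = line.strip().partition("=")
        if not sep or key == "TLDOverride":
            continue
        components.append(value)
        if key.startswith("TLD"):
            domains.append(".".join(components))
            components = []
    return domains


sample = "SLD1=domain1\nTLD1=com\n\nTLDOverride=1\nSLD2=domain2\nTLD2=co.uk\n"
print(parse_domains(sample))
```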

edited May 26, 2013 at 13:26

answered May 26, 2013 at 12:25

kirelagin

1 Comment

Andrei Kaigorodov Over a year ago

I think this is the best solution, because it's an O(n) algorithm.

Jon Clements · Accepted Answer · 2013-05-26 12:30:58Z

2

Here's one way:

import re
# text holds the Enom response shown in the question
print(list(map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
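
The zip(*[iter(...)]*2) part may look cryptic. As an illustration only: the list holds the same iterator twice, so zip pulls two consecutive items for each tuple it builds. The variable names below are made up for the example.

```python
values = ["domain1", "com", "domain2", "org"]

it = iter(values)
# [it] * 2 is [it, it]: both slots refer to the same iterator, so each
# tuple that zip builds consumes two consecutive items.
pairs = list(zip(*[it] * 2))
print(pairs)
# [('domain1', 'com'), ('domain2', 'org')]

domains = [".".join(pair) for pair in pairs]
print(domains)
# ['domain1.com', 'domain2.org']
```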

answered May 26, 2013 at 12:30

Jon Clements

Andrei Kaigorodov · Accepted Answer · 2013-05-26 12:56:36Z

2

Just for fun, map -> filter -> map:

input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""

# list() keeps the intermediate result indexable on Python 3 as well.
splited = list(map(lambda x: x.split("="), input.split()))
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print(list(map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1]]), slds)))

>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']

answered May 26, 2013 at 12:56

Andrei Kaigorodov

georg · Accepted Answer · 2013-05-26 12:34:27Z

1

This appears to do what you want:

domains = re.findall(r'SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))

It assumes that the lines are sorted and that each SLD line is directly followed by its TLD line. If that may not be the case, try this slightly more verbose code without regexes:

d = dict(x.split('=') for x in enom.strip().splitlines())

domains = [
    d[key] + '.' + d.get('T' + key[1:], '') 
    for key in d if key.startswith('SLD')
]
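
A small illustration of the dictionary version, under the assumption that each SLD and its TLD share the same number (which the sample in the question does not do for SLD3 and TLD4). The input below is made up, with the lines deliberately out of order; sorting the keys by their number is an addition here, to make the output order predictable.

```python
shuffled = """TLD2=org
SLD1=domain1
TLDOverride=1
SLD2=domain2
TLD1=com"""

d = dict(x.split('=') for x in shuffled.strip().splitlines())

# Sort the SLD keys by their numeric suffix so the result order is stable.
sld_keys = sorted((k for k in d if k.startswith('SLD')), key=lambda k: int(k[3:]))
domains = [d[k] + '.' + d.get('T' + k[1:], '') for k in sld_keys]
print(domains)
# ['domain1.com', 'domain2.org']
```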

answered May 26, 2013 at 12:34

georg

Community · Accepted Answer · 2017-05-23 11:43:15Z

1

You need to use multiline regex for this. This is similar to this post.

data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

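import re

# \d+ allows multi-digit indices; \S+ matches dotted TLDs such as co.uk,
# where \w+ would stop at the dot.
domain_seq = re.compile(r"SLD\d+=(\S+)\nTLD\d+=(\S+)", re.M)
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print("%s.%s" % (domain, tld))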

edited May 23, 2017 at 11:43

CommunityBot

answered May 26, 2013 at 12:36

rh0dium

l4mpi · Accepted Answer · 2013-05-26 13:15:29Z

1

As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:

# assuming data contains your input string; drop the TLDOverride lines first
lines = [x for x in data.split("\n") if not x.startswith("TLDOverride")]
sld, tld = [[x.split("=")[1] for x in lines if x[:3] == t] for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]

edited May 26, 2013 at 13:15

answered May 26, 2013 at 13:10

l4mpi
