PHP Regex for extracting subdomains of arbitrary domains

Question

I want to extract the subdomain and domain part for domains with arbitrary top level extensions.

Thus:

sub1.domain1.com --> Extract subdomain=sub1, domain=domain1.com

sub2.domain2.co.in --> Extract subdomain=sub2, domain=domain2.co.in

sub3.domain3.co.uk --> Extract subdomain=sub3, domain=domain3.co.uk

sub4.domain4.us --> Extract subdomain=sub4, domain=domain4.us

mydomain.com --> Extract subdomain="", domain=mydomain.com

mydomain.co.in --> Extract subdomain="", domain=mydomain.co.in

I am a bit confused about how to handle TLDs like co.in, co.uk, etc. I could do this the brute-force way by checking whether the last 5 characters contain a dot (.), but I am wondering whether there is a regex way to do this.

NOTE 1: As TToni pointed out, there can be ambiguities. However, I will put some constraints:

1) The "domain name" part (without the extension) will be at least 4 characters long.

2) The TLD extension part (.com, .co.in, .us, etc.) will contain either a single dot or two dots; if it contains two, the penultimate part (the sub-TLD) will have at most 3 characters.

I have a feeling that these constraints will make the problem unambiguous and solvable using regex.

(Also, assume "www." has been stripped out already).

NOTE 2:

Example of above constraints

sub.dom.in --> domain="sub.dom.in"

sub.dom1.in --> domain="dom1.in", subdomain="sub"

This may sound buggy, but the reason is that I want this for my internal purposes: all my domains have at least 4 characters in them, and all extensions either have a single dot or have a penultimate part of at most 3 characters.

NOTE 3: I have a feeling I might make mistakes by using regex for this. Hence I am thinking of doing it the string-search way.

regards,

JP

Not quite the same, but take a look at stackoverflow.com/questions/3853338/remove-domain-extension/…
– Gumbo, Commented Nov 29, 2010 at 14:14
I think you cannot fully solve this with a regex because you get ambiguities. Consider "b.c.eu" for example. Which one is the domain?
– TToni, Commented Nov 29, 2010 at 14:15
I agree with TToni. I will amend my question. For my purpose, assume that the domain name will be at least 4 characters. I will also add one more constraint after wording it formally.
– JP19, Commented Nov 29, 2010 at 14:18
So, the domain is "all non-dot characters immediately before the first dot which occurs at least three characters from the end, and all the characters which occur after them", and the subdomain is "everything that's not in the domain, without the final dot"?
– Curtis, Commented Nov 29, 2010 at 14:38
"solvable using regex" Just because you have a hammer doesn't mean your problem is a nail.
– The Archetypal Paul, Commented Nov 29, 2010 at 14:43

Bart Kiers · Accepted Answer · 2010-11-29 14:42:13Z

4

Not sure you need regexes. Split the domain name on '.', then apply some heuristics to the result depending on the rightmost part, e.g. if the last part is "com", then the domain is the last part plus the second-to-last part, and the subdomain is the rest.
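As a rough illustration of the split-and-inspect idea, here is a minimal sketch that applies the length constraints from the question rather than real TLD knowledge. The function name `split_host_by_length` is made up for this example, and the sketch is only valid under the asker's assumptions (domain names of at least 4 characters, a second-level suffix label of at most 3).

```php
<?php
// Sketch only: relies on the question's length constraints, not on real TLD data.
// split_host_by_length() is a hypothetical helper name.
function split_host_by_length(string $host): array
{
    $labels = explode('.', $host);
    $count = count($labels);

    // Assume a two-label suffix such as "co.in" when the label before the
    // last one has at most 3 characters and at least one label precedes it.
    $suffixLabels = ($count >= 3 && strlen($labels[$count - 2]) <= 3) ? 2 : 1;

    // The domain is the suffix plus the single label in front of it.
    $domainStart = max(0, $count - $suffixLabels - 1);

    return [
        'subdomain' => implode('.', array_slice($labels, 0, $domainStart)),
        'domain'    => implode('.', array_slice($labels, $domainStart)),
    ];
}

print_r(split_host_by_length('sub2.domain2.co.in'));
```

Under these assumptions "sub.dom.in" is treated entirely as a domain, which is the behaviour the question's NOTE 2 asks for; it would be wrong for hosts that do not obey the constraints.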

Or keep a list of "top-level" domains (in quotes because the meaning differs from the normal sense of top level) and iterate over the list, matching the right end of the domain name against each entry. On a match, remove the top-level part and return the rest as the subdomain. This could be put in a regex, but with a loss of clarity. The list would look something like

".edu", ".gov", ".mil", ".com", ".co.uk", ".gov.uk", ".nhs.uk", [...]

The regex would look something like

 \.(edu|gov|mil|com|co\.uk|gov\.uk|nhs\.uk|[...])$
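A minimal sketch of how such a pattern might be built from a list and used to split a host name. The suffix list below is a deliberately incomplete placeholder, and `split_by_suffix_list` is a hypothetical helper name; a real list would have to be much longer.

```php
<?php
// Hypothetical, incomplete suffix list, for illustration only.
$suffixes = ['edu', 'gov', 'mil', 'com', 'co.uk', 'gov.uk', 'nhs.uk'];

// Returns ['subdomain' => ..., 'domain' => ...], or null if no listed suffix matches.
function split_by_suffix_list(string $host, array $suffixes): ?array
{
    // Escape each suffix so that "." is matched literally.
    $quoted = array_map(function ($s) {
        return preg_quote($s, '/');
    }, $suffixes);

    // Optional lazy subdomain, then exactly one label, then a listed suffix.
    $pattern = '/^(?:(.+?)\.)?([^.]+\.(?:' . implode('|', $quoted) . '))$/i';

    if (!preg_match($pattern, $host, $m)) {
        return null;
    }
    return ['subdomain' => $m[1], 'domain' => $m[2]];
}

print_r(split_by_suffix_list('sub3.domain3.co.uk', $suffixes));
```

The subdomain group is lazy so that, when both a short and a long suffix could apply, the longer suffix is preferred. Hosts whose suffix is missing from the list (for example "co.in" here) return null instead of a guess.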

edited Nov 29, 2010 at 14:42

Bart Kiers


answered Nov 29, 2010 at 14:28

The Archetypal Paul


3 Comments

Bart Kiers Over a year ago

\.(edu)|(com)$ matches either .edu (not necessarily followed by the end-of-input) or com followed by the end-of-input (without the .!). You probably meant \.(edu|com|mil|etc)$. Also, putting [..] in a regex might be perceived as an odd (but legal) character class, whereas you meant it to be something else.

The Archetypal Paul Over a year ago

Thanks, typed too quickly. Fixed. And yes, the [...] is meant to mean "and so on".

Bart Kiers Over a year ago

Yeah, I figured that. I took the liberty to fix the un-escaped .'s in your example regex.
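The precedence problem discussed in these comments can be checked with a small sketch. The variable names and the test strings are chosen for this example only.

```php
<?php
// Alternation binds loosest, so this reads as: "\.(edu)" OR "(com)$".
$broken = '/\.(edu)|(com)$/';
// Grouping the alternatives keeps both the leading dot and the end anchor.
$fixed = '/\.(edu|com)$/';

$samples = ['example.com', 'my.education.org', 'telecom'];

foreach ($samples as $host) {
    printf(
        "%-18s broken=%d fixed=%d\n",
        $host,
        preg_match($broken, $host),
        preg_match($fixed, $host)
    );
}
```

The ungrouped pattern accepts "my.education.org" (it contains ".edu" in the middle) and "telecom" (it ends in "com" without a dot), while the grouped pattern rejects both.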

alpha-mouse · Accepted Answer · 2010-11-29 14:27:35Z

0

You can use this:

 (\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)

It doesn't look very beautiful, but the idea behind it is simple. It captures the subdomain in the first group and the domain in the second. It will also split things like "sub1.sub2.sub3.domain2.co.in" into "sub1.sub2.sub3" and "domain2.co.in".
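A minimal sketch of applying this pattern with preg_match() in PHP. The pattern is copied unchanged apart from the added `/` delimiters; the wrapper name `split_with_regex` is an assumption made for this example.

```php
<?php
// The answer's pattern, wrapped in "/" delimiters; nothing else is changed.
const HOST_PATTERN =
    '/(\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)/';

// Hypothetical wrapper: group 1 is the subdomain, group 2 the domain.
function split_with_regex(string $host): ?array
{
    if (!preg_match(HOST_PATTERN, $host, $m)) {
        return null;
    }
    return ['subdomain' => $m[1], 'domain' => $m[2]];
}

print_r(split_with_regex('sub1.sub2.sub3.domain2.co.in'));
```

Because the pattern only looks at label lengths, it behaves as described only for hosts that satisfy the constraints stated in the question.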

answered Nov 29, 2010 at 14:27

alpha-mouse


1 Comment

morja Over a year ago

The problem is that you cannot know what the actual domain is. In the sample domain2.co.in, "co" might also be the domain (e.g. co.com). So you need to use a list of all top-level domains.

zyanlu · Accepted Answer · 2012-03-27 14:36:19Z

I collected the "top-level" domain names. It might be ugly, but it works.

$fix = array('com', 'edu', 'gov', 'int', 'mil', 'net', 'org', 'biz', 'info', 'pro', 'name', 'museum', 'coop', 'aero', 'xxx',
    'idv', 'al', 'dz', 'af', 'ar', 'ae', 'aw', 'om', 'az', 'eg', 'et', 'ie', 'ee', 'ad', 'ao', 'ai', 'ag', 'at', 'au',
    'mo', 'bb', 'pg', 'bs', 'pk', 'py', 'ps', 'bh', 'pa', 'br', 'by', 'bm', 'bg', 'mp', 'bj', 'be', 'is', 'pr', 'ba', 'pl',
    'bo', 'bz', 'bw', 'bt', 'bf', 'bi', 'bv', 'kp', 'gq', 'dk', 'de', 'tl', 'tp', 'tg', 'dm', 'do', 'ru', 'ec', 'er', 'fr',
    'fo', 'pf', 'gf', 'tf', 'va', 'ph', 'fj', 'fi', 'cv', 'fk', 'gm', 'cg', 'cd', 'co', 'cr', 'gg', 'gd', 'gl', 'ge', 'cu',
    'gp', 'gu', 'gy', 'kz', 'ht', 'kr', 'nl', 'an', 'hm', 'hn', 'ki', 'dj', 'kg', 'gn', 'gw', 'ca', 'gh', 'ga', 'kh', 'cz',
    'zw', 'cm', 'qa', 'ky', 'km', 'ci', 'kw', 'cc', 'hr', 'ke', 'ck', 'lv', 'ls', 'la', 'lb', 'lt', 'lr', 'ly', 'li', 're',
    'lu', 'rw', 'ro', 'mg', 'im', 'mv', 'mt', 'mw', 'my', 'ml', 'mk', 'mh', 'mq', 'yt', 'mu', 'mr', 'us', 'um', 'as', 'vi',
    'mn', 'ms', 'bd', 'pe', 'fm', 'mm', 'md', 'ma', 'mc', 'mz', 'mx', 'nr', 'np', 'ni', 'ne', 'ng', 'nu', 'no', 'nf', 'na',
    'za', 'aq', 'gs', 'eu', 'pw', 'pn', 'pt', 'jp', 'se', 'ch', 'sv', 'ws', 'yu', 'sl', 'sn', 'cy', 'sc', 'sa', 'cx', 'st',
    'sh', 'kn', 'lc', 'sm', 'pm', 'vc', 'lk', 'sk', 'si', 'sj', 'sz', 'sd', 'sr', 'sb', 'so', 'tj', 'tw', 'th', 'tz', 'to',
    'tc', 'tt', 'tn', 'tv', 'tr', 'tm', 'tk', 'wf', 'vu', 'gt', 've', 'bn', 'ug', 'ua', 'uy', 'uz', 'es', 'eh', 'gr', 'hk',
    'sg', 'nc', 'nz', 'hu', 'sy', 'jm', 'am', 'ac', 'ye', 'iq', 'ir', 'il', 'it', 'in', 'id', 'uk', 'vg', 'io', 'jo', 'vn',
    'zm', 'je', 'td', 'gi', 'cl', 'cf', 'cn', 'ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', 'ar', 'as',
    'at', 'au', 'aw', 'az', 'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bv',
    'bw', 'by', 'bz', 'ca', 'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr', 'cu', 'cv', 'cx', 'cy',
    'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'ee', 'eg', 'eh', 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo',
    'fr', 'ga', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk',
    'hm', 'hn', 'hr', 'ht', 'hu', 'id', 'ie', 'il', 'im', 'in', 'io', 'iq', 'ir', 'is', 'it', 'je', 'jm', 'jo', 'jp', 'ke',
    'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv',
    'ly', 'ma', 'mc', 'md', 'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt', 'mu', 'mv', 'mw', 'mx',
    'my', 'mz', 'na', 'nc', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'pa', 'pe', 'pf', 'pg', 'ph',
    'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt', 'pw', 'py', 'qa', 're', 'ro', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', 'se', 'sg',
    'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'sv', 'sy', 'sz', 'tc', 'td', 'tf', 'tg', 'th', 'tj', 'tk',
    'tl', 'tm', 'tn', 'to', 'tp', 'tr', 'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz', 'va', 'vc', 've',
    'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye', 'yt', 'yu', 'yr', 'za', 'zm', 'zw');

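function get_domain($url){
    global $fix;
    // parse_url() only yields a host when the URL has a scheme;
    // fall back to the raw input so bare host names also work.
    $host = parse_url($url, PHP_URL_HOST) ?: $url;
    $list = explode('.', $host);
    $res = array();
    // Walk the labels from right to left, collecting the suffix labels found in $fix.
    for ($i = count($list) - 1; $i >= 0; $i--) {
        $res[] = $list[$i];
        // The first label that is not a known suffix is the domain name: stop there.
        if (!in_array($list[$i], $fix)) {
            break;
        }
    }
    return implode('.', array_reverse($res));
}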
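The function above returns only the domain, while the question also asks for the subdomain. A minimal sketch of a variant that returns both, using the same right-to-left walk; `split_host` and the very short `$suffixLabels` list are placeholders for this example, and the full `$fix` array would take the list's place in practice.

```php
<?php
// Hypothetical short list; array_flip() turns it into a set for isset() lookups.
$suffixLabels = array_flip(['com', 'co', 'in', 'uk', 'us']);

function split_host(string $host, array $suffixLabels): array
{
    $labels = explode('.', $host);
    $i = count($labels) - 1;

    // Move left past every label that is a known suffix label,
    // stopping at the first label that is not (or at the leftmost label).
    while ($i > 0 && isset($suffixLabels[$labels[$i]])) {
        $i--;
    }

    return [
        'subdomain' => implode('.', array_slice($labels, 0, $i)),
        'domain'    => implode('.', array_slice($labels, $i)),
    ];
}

print_r(split_host('sub2.domain2.co.in', $suffixLabels));
```

Like the original, this treats any label found in the list as part of the suffix, so a host such as "sub.co.com" is returned entirely as the domain.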

Oleksandr Fediashov · Accepted Answer · 2016-06-20 10:54:51Z

0

You can use a regex or any built-in function, but you will never get correct results for complex domain zones (.co.uk, .a.bg, .fuso.aichi.jp, etc.).

For correct extraction you need a library that uses the Public Suffix List. I recommend TLDExtract.

Here is some sample code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('mydomain.co.in');
$result->getSubdomain(); // will return null
$result->getHostname(); // will return 'mydomain'
$result->getSuffix(); // will return 'co.in'
$result->getFullHost(); // will return 'mydomain.co.in'
$result->getRegistrableDomain(); // will return 'mydomain.co.in'

answered Jun 20, 2016 at 10:54

Oleksandr Fediashov

