
I want to extract the subdomain and domain part for domains with arbitrary top level extensions.

Thus:

sub1.domain1.com --> Extract subdomain=sub1, domain=domain1.com

sub2.domain2.co.in --> Extract subdomain=sub2, domain=domain2.co.in

sub3.domain3.co.uk --> Extract subdomain=sub3, domain=domain3.co.uk

sub4.domain4.us --> Extract subdomain=sub4, domain=domain4.us

mydomain.com --> Extract subdomain="", domain=mydomain.com

mydomain.co.in --> Extract subdomain="", domain=mydomain.co.in

I am a bit confused about how to handle TLDs like co.in, co.uk, etc. I could do this the brute-force way by checking whether the last 5 characters contain a dot (.), but I am wondering if there is a regex way to do this.


NOTE 1: As TToni pointed out, there can be ambiguities. However, I will put some constraints:

1) The "domain name" part (without the extension) will be at least 4 characters.

2) The TLD extension part (.com, .co.in, .us, etc.) will either have a single dot or, if it has two dots, the penultimate part (the sub-TLD) will have at most 3 characters.

I have a feeling that these constraints will make the problem unambiguous and solvable using regex.

(Also, assume "www." has been stripped out already).


NOTE 2:

Example of above constraints

sub.dom.in --> domain="sub.dom.in"

sub.dom1.in --> domain="dom1.in", subdomain="sub"

This may sound buggy, but the reason is that I want this for internal purposes: all my domain names have at least 4 characters, and all extensions either have a single dot or have a penultimate part of at most 3 characters.


NOTE 3: I have a feeling I might make mistakes by using regex for this. Hence I am thinking of doing it with plain string operations instead.
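A minimal Python sketch of the string-based approach, under the constraints from NOTE 1 (the function name `split_host` and the `max_sub_tld_len` parameter are made up for illustration). It splits on dots and treats a penultimate label of at most 3 characters as part of a two-part extension:

```python
from typing import Tuple


def split_host(host: str, max_sub_tld_len: int = 3) -> Tuple[str, str]:
    """Return (subdomain, domain) under the constraints from NOTE 1."""
    labels = host.lower().strip(".").split(".")
    # With three or more labels, a short penultimate label (e.g. "co")
    # is assumed to belong to a two-part extension such as "co.in".
    if len(labels) >= 3 and len(labels[-2]) <= max_sub_tld_len:
        keep = 3  # name + two-part extension
    else:
        keep = 2  # name + single-part extension
    domain = ".".join(labels[-keep:])
    subdomain = ".".join(labels[:-keep])
    return subdomain, domain


for host in ("sub1.domain1.com", "sub2.domain2.co.in", "mydomain.co.in",
             "sub.dom.in", "sub.dom1.in"):
    print(host, "->", split_host(host))
```

This reproduces the NOTE 2 behaviour: "sub.dom.in" is kept whole as the domain, while "sub.dom1.in" is split into "sub" and "dom1.in". It is only as reliable as the constraints themselves.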

regards,

JP

  • Not quite the same, but take a look at stackoverflow.com/questions/3853338/remove-domain-extension/… Commented Nov 29, 2010 at 14:14
  • I think you cannot fully solve this with a regex because you get ambiguities. Consider "b.c.eu" for example. Which one is the domain? Commented Nov 29, 2010 at 14:15
  • I agree with TToni. I will amend my question. For my purpose, assume that the domain name will be at least 4 characters. I will also add one more constraint after wording it formally. Commented Nov 29, 2010 at 14:18
  • So, the domain is "all non-dot characters immediately before the first dot which occurs at least three characters from the end, and all the characters which occur after them", and the subdomain is "everything that's not in the domain, without the final dot"? Commented Nov 29, 2010 at 14:38
  • "solvable using regex" Just because you have a hammer doesn't mean your problem is a nail. Commented Nov 29, 2010 at 14:43

4 Answers


Not sure you need regexes. Split the domain name on '.', then apply some heuristics to the result depending on the rightmost part, e.g. if the last part is "com", then the domain is the last two parts and the subdomain is the rest.

Or keep a list of "top-level" domains (in quotes because it's a different meaning from the normal top level) and iterate over the list, matching the right end of the domain name against each entry. On a match, remove the top-level part and split what remains: its last label is the domain name and everything before that is the subdomain. This could be put in a regex, but with a loss of clarity. The list would look something like

".edu", ".gov", ".mil", ".com", ".co.uk", ".gov.uk", ".nhs.uk", [...]

The regex would look something like

 \.(edu|gov|mil|com|co\.uk|gov\.uk|nhs\.uk|[...])$
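As a rough illustration of the list-based approach, here is a short Python sketch. The function name `split_by_suffix` is hypothetical, and the `SUFFIXES` tuple is only a tiny sample drawn from the list above and the question's examples; a real list would be much longer.

```python
from typing import Optional, Tuple

# Illustrative sample only, not a complete list of extensions.
SUFFIXES = ("co.uk", "gov.uk", "nhs.uk", "co.in", "edu", "gov", "mil", "com", "us")


def split_by_suffix(host: str, suffixes=SUFFIXES) -> Optional[Tuple[str, str]]:
    """Return (subdomain, domain), or None if no listed suffix matches."""
    host = host.lower().rstrip(".")
    # Try longer suffixes first so a multi-part suffix is preferred
    # over a shorter one that also matches the right end.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if host.endswith("." + suffix):
            rest = host[: -(len(suffix) + 1)]          # strip ".suffix"
            subdomain, _, name = rest.rpartition(".")  # last label is the name
            return subdomain, name + "." + suffix
    return None


print(split_by_suffix("sub3.domain3.co.uk"))
print(split_by_suffix("mydomain.com"))
```

Hosts whose extension is not in the list return `None` here, which makes gaps in the list visible instead of silently guessing.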

3 Comments

\.(edu)|(com)$ matches either .edu (not necessarily followed by the end-of-input) or com followed by the end-of-input (without the .!). You probably meant \.(edu|com|mil|etc)$. Also, putting [..] in a regex might be perceived as an odd (but legal) character class, whereas you meant it to be something else.
Thanks, typed too quickly. Fixed. And yes, the [...] is meant to mean "and so on".
Yeah, I figured that. I took the liberty to fix the un-escaped .'s in your example regex.

You can use this: (\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)
It doesn't look very beautiful, but the idea behind it is simple. It will catch subdomain in the first group and domain in the second. Also it will split things like "sub1.sub2.sub3.domain2.co.in" into "sub1.sub2.sub3" and "domain2.co.in"
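For anyone who wants to try the pattern, here is a small Python sketch (the wrapper name `split_with_regex` is hypothetical). It assumes the whole string is a hostname and therefore uses `fullmatch`; without anchoring, the pattern can stop early on a prefix of the input. Note that `\w` does not match hyphens, so hyphenated labels are rejected by this pattern.

```python
import re
from typing import Optional, Tuple

# The pattern from the answer above, split over two lines for readability.
PATTERN = re.compile(
    r"(\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?"
    r"(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)"
)


def split_with_regex(host: str) -> Optional[Tuple[str, str]]:
    """Return (subdomain, domain), or None if the pattern does not match."""
    m = PATTERN.fullmatch(host)
    if m is None:
        return None
    # Group 1 is None when there is no subdomain part.
    return (m.group(1) or "", m.group(2))


for host in ("sub1.domain1.com", "mydomain.co.in",
             "sub1.sub2.sub3.domain2.co.in"):
    print(host, "->", split_with_regex(host))
```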

1 Comment

The problem is that you cannot know what the actual domain is. In the case of the sample: domain2.co.in "co" might also be the domain (e.g. co.com). So you need to use a list of all toplevel domains.

I collected the "top-level" domain names; it might be ugly, but it works.

$fix = array(
    'com', 'edu', 'gov', 'int', 'mil', 'net', 'org', 'biz', 'info', 'pro', 'name', 'museum', 'coop', 'aero', 'xxx', 'idv',
    'al', 'dz', 'af', 'ar', 'ae', 'aw', 'om', 'az', 'eg', 'et', 'ie', 'ee', 'ad', 'ao', 'ai', 'ag', 'at', 'au',
    'mo', 'bb', 'pg', 'bs', 'pk', 'py', 'ps', 'bh', 'pa', 'br', 'by', 'bm', 'bg', 'mp', 'bj', 'be', 'is', 'pr', 'ba', 'pl',
    'bo', 'bz', 'bw', 'bt', 'bf', 'bi', 'bv', 'kp', 'gq', 'dk', 'de', 'tl', 'tp', 'tg', 'dm', 'do', 'ru', 'ec', 'er', 'fr',
    'fo', 'pf', 'gf', 'tf', 'va', 'ph', 'fj', 'fi', 'cv', 'fk', 'gm', 'cg', 'cd', 'co', 'cr', 'gg', 'gd', 'gl', 'ge', 'cu',
    'gp', 'gu', 'gy', 'kz', 'ht', 'kr', 'nl', 'an', 'hm', 'hn', 'ki', 'dj', 'kg', 'gn', 'gw', 'ca', 'gh', 'ga', 'kh', 'cz',
    'zw', 'cm', 'qa', 'ky', 'km', 'ci', 'kw', 'cc', 'hr', 'ke', 'ck', 'lv', 'ls', 'la', 'lb', 'lt', 'lr', 'ly', 'li', 're',
    'lu', 'rw', 'ro', 'mg', 'im', 'mv', 'mt', 'mw', 'my', 'ml', 'mk', 'mh', 'mq', 'yt', 'mu', 'mr', 'us', 'um', 'as', 'vi',
    'mn', 'ms', 'bd', 'pe', 'fm', 'mm', 'md', 'ma', 'mc', 'mz', 'mx', 'nr', 'np', 'ni', 'ne', 'ng', 'nu', 'no', 'nf', 'na',
    'za', 'aq', 'gs', 'eu', 'pw', 'pn', 'pt', 'jp', 'se', 'ch', 'sv', 'ws', 'yu', 'sl', 'sn', 'cy', 'sc', 'sa', 'cx', 'st',
    'sh', 'kn', 'lc', 'sm', 'pm', 'vc', 'lk', 'sk', 'si', 'sj', 'sz', 'sd', 'sr', 'sb', 'so', 'tj', 'tw', 'th', 'tz', 'to',
    'tc', 'tt', 'tn', 'tv', 'tr', 'tm', 'tk', 'wf', 'vu', 'gt', 've', 'bn', 'ug', 'ua', 'uy', 'uz', 'es', 'eh', 'gr', 'hk',
    'sg', 'nc', 'nz', 'hu', 'sy', 'jm', 'am', 'ac', 'ye', 'iq', 'ir', 'il', 'it', 'in', 'id', 'uk', 'vg', 'io', 'jo', 'vn',
    'zm', 'je', 'td', 'gi', 'cl', 'cf', 'cn', 'ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', 'ar', 'as',
    'at', 'au', 'aw', 'az', 'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bv',
    'bw', 'by', 'bz', 'ca', 'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr', 'cu', 'cv', 'cx', 'cy',
    'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'ee', 'eg', 'eh', 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo',
    'fr', 'ga', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk',
    'hm', 'hn', 'hr', 'ht', 'hu', 'id', 'ie', 'il', 'im', 'in', 'io', 'iq', 'ir', 'is', 'it', 'je', 'jm', 'jo', 'jp', 'ke',
    'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv',
    'ly', 'ma', 'mc', 'md', 'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt', 'mu', 'mv', 'mw', 'mx',
    'my', 'mz', 'na', 'nc', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'pa', 'pe', 'pf', 'pg', 'ph',
    'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt', 'pw', 'py', 'qa', 're', 'ro', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', 'se', 'sg',
    'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'sv', 'sy', 'sz', 'tc', 'td', 'tf', 'tg', 'th', 'tj', 'tk',
    'tl', 'tm', 'tn', 'to', 'tp', 'tr', 'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz', 'va', 'vc', 've',
    'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye', 'yt', 'yu', 'yr', 'za', 'zm', 'zw');

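function get_domain($url){
    global $fix;
    // Expects a full URL with a scheme; a bare hostname is parsed as a path.
    $host = parse_url($url, PHP_URL_HOST);
    $list = explode('.', $host);
    $res = array();
    // Walk the labels from right to left, collecting known extensions.
    for ($i = count($list) - 1; $i >= 0; $i--) {
        $res[] = $list[$i];
        // The first label that is not a known extension is the domain name.
        if (!in_array($list[$i], $fix)) {
            break;
        }
    }
    return implode('.', array_reverse($res));
}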



You can use a regex or any built-in function, but you'll never get correct results for complex domain zones (.co.uk, .a.bg, .fuso.aichi.jp, etc.).

For correct extraction you need a library that uses the Public Suffix List. I recommend TLDExtract.

Here is a sample code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('mydomain.co.in');
$result->getSubdomain(); // will return null
$result->getHostname(); // will return 'mydomain'
$result->getSuffix(); // will return 'co.in'
$result->getFullHost(); // will return 'mydomain.co.in'
$result->getRegistrableDomain(); // will return 'mydomain.co.in'

