Home » Php » regex – Get domain name (not subdomain) in php

regex – Get domain name (not subdomain) in php

Posted by: admin April 23, 2020 Leave a comment


I have a URL which can be any of the following formats:


Essentially, I need to be able to match any normal URL. How can I extract example.com (or .net, whatever the tld happens to be. I need this to work with any TLD.) from all of these via a single regex?

How to&Answers:

Well you can use parse_url to get the host:

$info = parse_url($url);
$host = $info['host'];

Then, you can do some fancy stuff to get only the TLD and the Host

$host_names = explode(".", $host);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];

Not very elegant, but should work.

If you want an explanation, here it goes:

First we grab everything between the scheme (http://, etc), by using parse_url‘s capabilities to… well…. parse URL’s. 🙂

Then we take the host name, and separate it into an array based on where the periods fall, so test.world.hello.myname would become:

array("test", "world", "hello", "myname");

After that, we take the number of elements in the array (4).

Then, we subtract 2 from it to get the second to last string (the hostname, or example, in your example)

Then, we subtract 1 from it to get the last string (because array keys start at 0), also known as the TLD

Then we combine those two parts with a period, and you have your base host name.


My solution in https://gist.github.com/pocesar/5366899

and the tests are here http://codepad.viper-7.com/GAh1tP

It works with any TLD, and hideous subdomain patterns (up to 3 subdomains).

There’s a test included with many domain names.

Won’t paste the function here because of the weird indentation for code in StackOverflow (could have fenced code blocks like github)


It is not possible to get the domain name without using a TLD list to compare with as their exist many cases with completely the same structure and length:

  1. www.db.de (Subdomain) versus bbc.co.uk (Domain)
  2. big.uk.com (SLD) versus www.uk.com (TLD)

Mozilla’s public suffix list should be the best option as it is used by all major browsers:

Feel free to use my function:

function tld_list($cache_dir=null) {
    // we use "/tmp" if $cache_dir is not set
    $cache_dir = isset($cache_dir) ? $cache_dir : sys_get_temp_dir();
    $lock_dir = $cache_dir . '/public_suffix_list_lock/';
    $list_dir = $cache_dir . '/public_suffix_list/';
    // refresh list all 30 days
    if (file_exists($list_dir) && @filemtime($list_dir) + 2592000 > time()) {
        return $list_dir;
    // use exclusive lock to avoid race conditions
    if (!file_exists($lock_dir) && @mkdir($lock_dir)) {
        // read from source
        $list = @fopen('https://publicsuffix.org/list/public_suffix_list.dat', 'r');
        if ($list) {
            // the list is older than 30 days so delete everything first
            if (file_exists($list_dir)) {
                foreach (glob($list_dir . '*') as $filename) {
            // now set list directory with new timestamp
            // read line-by-line to avoid high memory usage
            while ($line = fgets($list)) {
                // skip comments and empty lines
                if ($line[0] == '/' || !$line) {
                // remove wildcard
                if ($line[0] . $line[1] == '*.') {
                    $line = substr($line, 2);
                // remove exclamation mark
                if ($line[0] == '!') {
                    $line = substr($line, 1);
                // reverse TLD and remove linebreak
                $line = implode('.', array_reverse(explode('.', (trim($line)))));
                // we split the TLD list to reduce memory usage
                touch($list_dir . $line);
    // repair locks (should never happen)
    if (file_exists($lock_dir) && mt_rand(0, 100) == 0 && @filemtime($lock_dir) + 86400 < time()) {
    return $list_dir;
function get_domain($url=null) {
    // obtain location of public suffix list
    $tld_dir = tld_list();
    // no url = our own host
    $url = isset($url) ? $url : $_SERVER['SERVER_NAME'];
    // add missing scheme      ftp://            http:// ftps://   https://
    $url = !isset($url[5]) || ($url[3] != ':' && $url[4] != ':' && $url[5] != ':') ? 'http://' . $url : $url;
    // remove "/path/file.html", "/:80", etc.
    $url = parse_url($url, PHP_URL_HOST);
    // replace absolute domain name by relative (http://www.dns-sd.org/TrailingDotsInDomainNames.html)
    $url = trim($url, '.');
    // check if TLD exists
    $url = explode('.', $url);
    $parts = array_reverse($url);
    foreach ($parts as $key => $part) {
        $tld = implode('.', $parts);
        if (file_exists($tld_dir . $tld)) {
            return !$key ? '' : implode('.', array_slice($url, $key - 1));
        // remove last part
    return '';

What it makes special:

  • it accepts every input like URLs, hostnames or domains with- or without scheme
  • the list is downloaded row-by-row to avoid high memory usage
  • it creates a new file per TLD in a cache folder so get_domain() only needs to check through file_exists() if it exists so it does not need to include a huge database on every request like TLDExtract does it.
  • the list will be automatically updated every 30 days


$urls = array(
    'http://www.example.com',// example.com
    'http://subdomain.example.com',// example.com
    'http://www.example.uk.com',// example.uk.com
    'http://www.example.co.uk',// example.co.uk
    'http://www.example.com.ac',// example.com.ac
    'http://example.com.ac',// example.com.ac
    'http://www.example.accident-prevention.aero',// example.accident-prevention.aero
    'http://www.example.sub.ar',// sub.ar
    'http://www.congresodelalengua3.ar',// congresodelalengua3.ar
    'http://congresodelalengua3.ar',// congresodelalengua3.ar
    'http://www.example.pvt.k12.ma.us',// example.pvt.k12.ma.us
    'http://www.example.lib.wy.us',// example.lib.wy.us
    'com',// empty
    '.com',// empty
    'http://big.uk.com',// big.uk.com
    'uk.com',// empty
    'www.uk.com',// www.uk.com
    '.uk.com',// empty
    'stackoverflow.com',// stackoverflow.com
    '.foobarfoo',// empty
    '',// empty
    false,// empty
    ' ',// empty
    1,// empty
    'a',// empty    

Recent version with explanations (German):


$onlyHostName = implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2));


I recommend using TLDExtract library for all operations with domain name.


I think the best way to handle this problem is:

$second_level_domains_regex = '/\.asn\.au$|\.com\.au$|\.net\.au$|\.id\.au$|\.org\.au$|\.edu\.au$|\.gov\.au$|\.csiro\.au$|\.act\.au$|\.nsw\.au$|\.nt\.au$|\.qld\.au$|\.sa\.au$|\.tas\.au$|\.vic\.au$|\.wa\.au$|\.co\.at$|\.or\.at$|\.priv\.at$|\.ac\.at$|\.avocat\.fr$|\.aeroport\.fr$|\.veterinaire\.fr$|\.co\.hu$|\.film\.hu$|\.lakas\.hu$|\.ingatlan\.hu$|\.sport\.hu$|\.hotel\.hu$|\.ac\.nz$|\.co\.nz$|\.geek\.nz$|\.gen\.nz$|\.kiwi\.nz$|\.maori\.nz$|\.net\.nz$|\.org\.nz$|\.school\.nz$|\.cri\.nz$|\.govt\.nz$|\.health\.nz$|\.iwi\.nz$|\.mil\.nz$|\.parliament\.nz$|\.ac\.za$|\.gov\.za$|\.law\.za$|\.mil\.za$|\.nom\.za$|\.school\.za$|\.net\.za$|\.co\.uk$|\.org\.uk$|\.me\.uk$|\.ltd\.uk$|\.plc\.uk$|\.net\.uk$|\.sch\.uk$|\.ac\.uk$|\.gov\.uk$|\.mod\.uk$|\.mil\.uk$|\.nhs\.uk$|\.police\.uk$/';
$domain = $_SERVER['HTTP_HOST'];
$domain = explode('.', $domain);
$domain = array_reverse($domain);
if (preg_match($second_level_domains_regex, $_SERVER['HTTP_HOST']) {
    $domain = "$domain[2].$domain[1].$domain[0]";
} else {
    $domain = "$domain[1].$domain[0]";


There are two ways to extract subdomain from a host:

  1. The first method that is more accurate is to use a database of tlds (like public_suffix_list.dat) and match domain with it. This is a little heavy in some cases. There are some PHP classes for using it like php-domain-parser and TLDExtract.

  2. The second way is not as accurate as the first one, but is very fast and it can give the correct answer in many case, I wrote this function for it:

    function get_domaininfo($url) {
        // regex can be replaced with parse_url
        preg_match("/^(https|http|ftp):\/\/(.*?)\//", "$url/" , $matches);
        $parts = explode(".", $matches[2]);
        $tld = array_pop($parts);
        $host = array_pop($parts);
        if ( strlen($tld) == 2 && strlen($host) <= 3 ) {
            $tld = "$host.$tld";
            $host = array_pop($parts);
        return array(
            'protocol' => $matches[1],
            'subdomain' => implode(".", $parts),
            'domain' => "$host.$tld",




        [protocol] => https
        [subdomain] => mysubdomain
        [domain] => domain.co.uk
        [host] => domain
        [tld] => co.uk


Here’s a function I wrote to grab the domain without subdomain(s), regardless of whether the domain is using a ccTLD or a new style long TLD, etc… There is no lookup or huge array of known TLDs, and there’s no regex. It can be a lot shorter using the ternary operator and nesting, but I expanded it for readability.

// Per Wikipedia: "All ASCII ccTLD identifiers are two letters long, 
// and all two-letter top-level domains are ccTLDs."

function topDomainFromURL($url) {
  $url_parts = parse_url($url);
  $domain_parts = explode('.', $url_parts['host']);
  if (strlen(end($domain_parts)) == 2 ) { 
    // ccTLD here, get last three parts
    $top_domain_parts = array_slice($domain_parts, -3);
  } else {
    $top_domain_parts = array_slice($domain_parts, -2);
  $top_domain = implode('.', $top_domain_parts);
  return $top_domain;


echo getDomainOnly("http://example.com/foo/bar");

function getDomainOnly($host){
    $host = strtolower(trim($host));
    $host = ltrim(str_replace("http://","",str_replace("https://","",$host)),"www.");
    $count = substr_count($host, '.');
    if($count === 2){
        if(strlen(explode('.', $host)[1]) > 3) $host = explode('.', $host, 2)[1];
    } else if($count > 2){
        $host = getDomainOnly(explode('.', $host, 2)[1]);
    $host = explode('/',$host);
    return $host[0];


I had problems with the solution provided by pocesar.
When I would use for instance subdomain.domain.nl it would not return domain.nl. Instead it would return subdomain.domain.nl
Another problem was that domain.com.br would return com.br

I am not sure but i fixed these issues with the following code (i hope it will help someone, if so I am a happy man):

function get_domain($domain, $debug = false){
    $original = $domain = strtolower($domain);
    if (filter_var($domain, FILTER_VALIDATE_IP)) {
        return $domain;
    $debug ? print('<strong style="color:green">&raquo;</strong> Parsing: '.$original) : false;
    $arr = array_slice(array_filter(explode('.', $domain, 4), function($value){
        return $value !== 'www';
    }), 0); //rebuild array indexes
    if (count($arr) > 2){
        $count = count($arr);
        $_sub = explode('.', $count === 4 ? $arr[3] : $arr[2]);
        $debug ? print(" (parts count: {$count})") : false;
        if (count($_sub) === 2){ // two level TLD
            $removed = array_shift($arr);
            if ($count === 4){ // got a subdomain acting as a domain
                $removed = array_shift($arr);
            $debug ? print("<br>\n" . '[*] Two level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
        }elseif (count($_sub) === 1){ // one level TLD
            $removed = array_shift($arr); //remove the subdomain
            if (strlen($arr[0]) === 2 && $count === 3){ // TLD domain must be 2 letters
                array_unshift($arr, $removed);
            }elseif(strlen($arr[0]) === 3 && $count === 3){
                array_unshift($arr, $removed);
                // non country TLD according to IANA
                $tlds = array(
                if (count($arr) > 2 && in_array($_sub[0], $tlds) !== false){ //special TLD don't have a country
            $debug ? print("<br>\n" .'[*] One level TLD: <strong>'.join('.', $_sub).'</strong> ') : false;
        }else{ // more than 3 levels, something is wrong
            for ($i = count($_sub); $i > 1; $i--){
                $removed = array_shift($arr);
            $debug ? print("<br>\n" . '[*] Three level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
    }elseif (count($arr) === 2){
        $arr0 = array_shift($arr);
        if (strpos(join('.', $arr), '.') === false && in_array($arr[0], array('localhost','test','invalid')) === false){ // not a reserved domain
            $debug ? print("<br>\n" .'Seems invalid domain: <strong>'.join('.', $arr).'</strong> re-adding: <strong>'.$arr0.'</strong> ') : false;
            // seems invalid domain, restore it
            array_unshift($arr, $arr0);
    $debug ? print("<br>\n".'<strong style="color:gray">&laquo;</strong> Done parsing: <span style="color:red">' . $original . '</span> as <span style="color:blue">'. join('.', $arr) ."</span><br>\n") : false;
    return join('.', $arr);


Here’s one that works for all domains, including those with second level domains like “co.uk”

function strip_subdomains($url){

    # credits to gavingmiller for maintaining this list
    $second_level_domains = file_get_contents("https://raw.githubusercontent.com/gavingmiller/second-level-domains/master/SLDs.csv");

    # presume sld first ...
    $possible_sld = implode('.', array_slice(explode('.', $url), -2));

    # and then verify it
    if (strpos($second_level_domains, $possible_sld)){
        return  implode('.', array_slice(explode('.', $url), -3));
    } else {
        return  implode('.', array_slice(explode('.', $url), -2));

Looks like there’s a duplicate question here: delete-subdomain-from-url-string-if-subdomain-is-found


Simply try this:

   preg_match('/(www.)?([^.]+\.[^.]+)$/', $yourHost, $matches);

   echo "domain name is: {$matches[0]}\n"; 

this working for majority of domains.


Very late, I see that you marked regex as a keyword and my function works like a charm, so far I haven’t found a url that fails:

function get_domain_regex($url){
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
    return false;

if you want one without regex I have this one, which I am sure I also took from this post

function get_domain($url){
  $parseUrl = parse_url($url);
  $host = $parseUrl['host'];
  $host_array = explode(".", $host);
  $domain = $host_array[count($host_array)-2] . "." . $host_array[count($host_array)-1];
  return $domain;

They both work amazing, BUT, this took me a while to realize if the url doesn’t start with http:// or https:// it will fail so make sure the url string starts with the protocol.


Simply try this:

  $host = $_SERVER['HTTP_HOST'];
  preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
  echo "domain name is: {$matches[0]}\n";