How to scrape all data from this table and store it in a PHP array

Posted by: admin, February 25, 2020

Questions:

How can I scrape all the data from this table and store it in a PHP array?

I have tried the code below, but it also pulls data from other tables on the page, which makes the result unusable.

<?php
error_reporting(0);
$htmlContent = file_get_contents("https://tools.tracemyip.org/search--ip/list:-v-:gTr=1&gNr=50");

    $DOM = new DOMDocument();
    $DOM->loadHTML($htmlContent);

    $Header = $DOM->getElementsByTagName('th');
    $Detail = $DOM->getElementsByTagName('span');

    //#Get header name of the table
    foreach($Header as $NodeHeader) 
    {
        $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
    }
    //print_r($aDataTableHeaderHTML); die();

    //#Get row data/detail table without header name as key
    $i = 0;
    $j = 0;
    foreach($Detail as $sNodeDetail) 
    {
        $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
        $i = $i + 1;
        $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
    }
    //print_r($aDataTableDetailHTML); die();

    //#Get row data/detail table with header name as key and outer array index as row number
    for($i = 0; $i < count($aDataTableDetailHTML); $i++)
    {
        for($j = 0; $j < count($aDataTableHeaderHTML); $j++)
        {
            $aTempData[$i][$aDataTableHeaderHTML[$j]] = $aDataTableDetailHTML[$i][$j];
        }
    }
    $aDataTableDetailHTML = $aTempData; unset($aTempData);
    print_r($aDataTableDetailHTML); die();
?>
Answers:

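This code works. It selects the table by its ID (tlzRDTIPv4), so only that table's rows are read, and it caches the page in table.html so repeated runs do not download it again. I'm not sure whether the ID of the table will ever change.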

<?php
libxml_use_internal_errors(true);

if (!file_exists('table.html')) {
    file_put_contents('table.html', file_get_contents('https://tools.tracemyip.org/search--ip/list:-v-:gTr=1&gNr=50')); 
}

$html = file_get_contents('table.html');

$doc = new DOMDocument();
$doc->loadHTML($html);

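$table = $doc->getElementById('tlzRDTIPv4');
if ($table === null) {
    // The ID may change on the live site; stop early instead of calling a method on null.
    exit('Table with ID "tlzRDTIPv4" not found.');
}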

$thead = $table->getElementsByTagName('th');
$rows = $table->getElementsByTagName('tr');

$data = [];
$cols = [];

foreach ($thead as $th) {
    $cols[] = $th->textContent;
}

foreach ($rows as $row) {
    $tds = $row->getElementsByTagName('td');
    if (count($tds) == 0) // Ignore "thead > tr"
        continue;
    $row_data = [];
    for ($i = 0; $i < count($tds); $i++) {
        $row_data[$cols[$i]] = $tds[$i]->textContent;
    }
    $data[] = $row_data;
}

print_r($data);

Output:

Array
(
    [0] => Array
        (
            [ID] => 1
            [IP Address] => 195.114.148.11
            [Organization / ISP] => Private Joint Stock Company datagroup
            [Country] => Ukraine
            [State] => Kyiv City
            [City] => - - -
            [Timezone] => Europe/Kiev
            [Browser] => Chrome 80.0.3987.99
            [Operating System] => Android, 9
            [Bot/spider] => No
        )

    [1] => Array
        (
            [ID] => 2
            [IP Address] => 102.184.99.182
            [Organization / ISP] => Vodafone Egypt
            [Country] => Egypt
            [State] => Cairo Governorate
            [City] => Cairo
            [Timezone] => Africa/Cairo
            [Browser] => Chrome 80.0.3987.99
            [Operating System] => Android, 9
            [Bot/spider] => No
        )
)
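
If the table ID does change, one possible fallback is to find the table by the text of one of its headers instead. The following is only a minimal sketch under stated assumptions: the inline HTML is a made-up stand-in for the real page (including the ID "changedId"), and it assumes the target table is the only one with an "IP Address" header cell.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = <<<HTML
<html><body>
<table id="other"><tr><th>Name</th></tr><tr><td>ignored</td></tr></table>
<table id="changedId">
  <thead><tr><th>ID</th><th>IP Address</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>195.114.148.11</td></tr>
    <tr><td>2</td><td>102.184.99.182</td></tr>
  </tbody>
</table>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Pick the first table that has a header cell reading "IP Address".
$tables = $xp->query('//table[.//th[normalize-space(.)="IP Address"]]');
$data = [];

if ($tables !== false && $tables->length > 0) {
    $table = $tables->item(0);

    // Header names, queried relative to the matched table only.
    $cols = [];
    foreach ($xp->query('.//th', $table) as $th) {
        $cols[] = trim($th->textContent);
    }

    // Data rows: only rows that contain td cells.
    foreach ($xp->query('.//tr[td]', $table) as $tr) {
        $cells = [];
        foreach ($xp->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        // array_combine() requires the same number of keys and values.
        if (count($cells) === count($cols)) {
            $data[] = array_combine($cols, $cells);
        }
    }
}

print_r($data);
```

Because every query after the first is run relative to the matched table node, cells from other tables on the page are never collected.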

Answer:

To do the scraping I'd be tempted to use an XPath query rather than getElementsByTagName and similar methods, as it provides much greater flexibility. Note that the code below assigns $url twice: first the live URL, then the path to a locally saved copy of the report, which overrides it. I used the local copy for testing because it is far faster to process repeatedly.

The first row of the resulting array contains the column headers, and the subsequent rows contain the cell content.

<?php
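    # live page
    $url='https://tools.tracemyip.org/search--ip/list:-v-:gTr=1&gNr=50';
    # local copy used for testing; this overrides the live URL above, so remove this line to fetch the live page
    $url='c:/temp/IP Address List.html';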

    $data=array();

    libxml_use_internal_errors( true );
    $dom=new DOMDocument;
    $dom->validateOnParse=false;
    $dom->recover=true;
    $dom->strictErrorChecking=false;
    $dom->loadHTMLFile( $url );
    libxml_clear_errors();

    # create the XPath object
    $xp=new DOMXPath( $dom );

    # get column headers
    $expr='//table[ @class="tbsClass1" ]/thead/tr/th[@class="header"]';
    $col=$xp->query( $expr );
    if( $col && $col->length > 0 ){
        $tmp=array();
        foreach( $col as $node ){
            $tmp[]=$node->textContent;
        }
        $data[]=$tmp;
    }


    #get row data
    $expr='//table[ @class="tbsClass1" ]/tbody/tr';
    $col=$xp->query( $expr );
    if( $col && $col->length > 0 ){
        #iterate over table rows
        foreach( $col as $node ){

            $tmp=array();

            #find table cells
            for( $i=0; $i < $node->childNodes->length; $i++ ){
                $obj=$node->childNodes[$i];
                if( $obj->nodeType==XML_ELEMENT_NODE )$tmp[]=$obj->textContent;
            }
            $data[]=$tmp;
        }
    }

    printf('<pre>%s</pre>',print_r($data,true));
?>

The above yields output such as:

Array
(
    [0] => Array
        (
            [0] => ID
            [1] => IP Address
            [2] => Organization / ISP
            [3] => Country
            [4] => State
            [5] => City
            [6] => Timezone
            [7] => Browser
            [8] => Operating System
            [9] => Bot/spider
        )

    [1] => Array
        (
            [0] => 1
            [1] => 173.249.60.111
            [2] => Contabo GmbH
            [3] => Germany
            [4] => Bavaria
            [5] => Nuremberg
            [6] => Europe/Berlin
            [7] => - - -
            [8] => - - -
            [9] => No
        )

In response to your comment, and to clarify my follow-up comment: when you come to process the data (I have no idea how you want to use it), it is not a problem that the rows in the output array do not repeat the column headers as keys. The first array entry contains all the headers, so while processing you can look up the matching header by index.

$headers=array_shift( $data );
foreach( $data as $record ){
    foreach( $record as $i => $value )printf('%s=%s<br />',$headers[$i], $value );
}

Which yields:

ID=1
IP Address=173.249.60.111
Organization / ISP=Contabo GmbH
Country=Germany
State=Bavaria
City=Nuremberg
Timezone=Europe/Berlin
Browser=- - -
Operating System=- - -
Bot/spider=No

ID=2
IP Address=185......etc
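
If rows keyed by header name are preferred after all, the same idea can be sketched with array_combine(). This is only an illustration: the sample array below is a shortened stand-in for the real output (three of the ten columns), and $records is a name introduced here for the example.

```php
<?php
// Sample data in the shape produced above: the first row holds the headers.
$data = [
    ['ID', 'IP Address', 'Country'],
    ['1', '173.249.60.111', 'Germany'],
];

// Remove the header row, then pair each remaining row with it.
$headers = array_shift($data);
$records = [];
foreach ($data as $row) {
    // array_combine() needs equal counts; skip rows that do not match.
    if (count($row) === count($headers)) {
        $records[] = array_combine($headers, $row);
    }
}

print_r($records);
```

Each entry of $records can then be read by column name, for example $records[0]['Country'].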