Home » excel » excel – Web Scraping by TagName

excel – Web Scraping by TagName

Posted by: admin May 14, 2020 Leave a comment

Questions:

I am trying to pull some data from web-site, but as I’m new learner in the web scraping hence confused in the Tag Name, Class Code and in ID.. I have only basic knowledge about it.
I want to copy below Data and if the data is not exist then the cell should be left blank and code needs to be move in next value.

Class="container size" - 5*5,5*10 kind of value
Class="description" - Standard in this case also need to copy like Drive-up Access
Class="offer1" & "offer2" - Call for Availability
Class="price"

I tried to frame a code but can’t judge the exact which Tag name needs to be choose, below is my code please help me to copy this data.

Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
    .Visible = True
    .Navigate2 "" & Sheets("Home").Range("C3").Text

    While .Busy Or .readyState < 4: DoEvents: Wend

    Sheets("Unit Data").Select


    Dim listings As Object, listing As Object, headers(), results()
    Dim r As Long, list As Object, item As Object
    headers = Array("size", "features")
    Set list = .document.getElementsByClassName("units-table main")
    '.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
    Dim rowCount As Long
    rowCount = .document.querySelectorAll(".units-table main li").Length


    ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
    For Each listing In list
        For Each item In listing.getElementsByTagName("li")
            r = r + 1
            On Error Resume Next
            results(r, 1) = item.getElementsByClassName("container size")(0).innerText
            results(r, 2) = item.getElementsByClassName("description")(0).innerText

            On Error GoTo 0


        Next
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    .Quit
End With
How to&Answers:

XHR:

All the info is available via XMLHTTP (XHR) request – much faster than opening a browser.

I first retrieve the row count with a css selector of .main li[class].
The "." is a class selector, the li is a type selector and the [class] is an attribute selector. The space, " ", in between is a descendant combinator. That specifies I want to retrieve all li tag/type elements, with a class attribute, having a parent whose classname is main.

This matches as follows:

As you can see, this gives me the row count; the number of parent li elements to retrieve info from for the result set.

This collection of li elements is returned as a nodeList by querySelectorAll. I cannot loop over this list applying getElementsByClassName / querySelector to individual nodes, as li elements expose no methods I can use.

Now, as I am not using a browser, I am forced to rely on the methods available to HTMLDocument object. Unlike with a browser, I do not have access to the limited pseudo class selectors that they support, when automated via VBA, which would allow me to use selector syntax such as :nth-of-type to access individual rows. This is an annoying limitation of web-scraping with VBA.

So, what can we do? Well, in this instance I can dump the innerHTML of each node into another HTMLDocument variable, html2, so that I can access the querySelector/querySelectorAll methods of that object. The HTML will then only be limited to the current li.

If we look at the HTML in question:

We can see that the li elements are general siblings. They sit next to each other at the same level. As I loop my nodeList listings, I am transferring the innerHTML from the current node into html2; my second HTMLDocument variable.

It’s worth noting I could probably have descended each listing using children for example:

listings.item(i).Children(2)......

I could then have split on newLines etc so access all info. I think my given method is faster and more robust though.

VBA:

Option Explicit  
Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String
    Const URL As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
        html.body.innerHTML = s

        Dim headers(), results(), listings As Object, amenities As String

        headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
        Set listings = html.querySelectorAll(".main li[class]")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1

        ReDim results(1 To rowCount, 1 To numColumns)
        Dim html2 As HTMLDocument
        Set html2 = New HTMLDocument
        For item = 0 To listings.Length - 1
            r = r + 1
            html2.body.innerHTML = listings.item(item).innerHTML
            'size,description, amenities,specials offer1 offer2, rate type, price

            results(r, 1) = Trim$(html2.querySelector(".size").innerText)
            results(r, 2) = Trim$(html2.querySelector(".description").innerText)
            Set icons = html2.querySelectorAll("i[title]")

            ReDim amenitiesInfo(0 To icons.Length - 1)

            For icon = 0 To icons.Length - 1
                amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
            Next

            amenities = Join$(amenitiesInfo, ", ")

            results(r, 3) = amenities
            results(r, 4) = html2.querySelector(".offer1").innerText
            results(r, 5) = html2.querySelector(".offer2").innerText
            results(r, 6) = html2.querySelector(".rate-label").innerText
            results(r, 7) = html2.querySelector(".price").innerText
        Next

        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Internet Explorer:

Assuming not re-directed from given url. Here I am using :nth-of-type pseudo class selector to target each row of the listings. Those rows are the li (list) elements holding the info for each box listing. I build up a css selector string that specifies the row and then the element within the row I am after. I pass that string to querySelector, or querySelectorAll which returns matched element/s.

Option Explicit

Public Sub UseIE()
    Dim ie As New InternetExplorerm, ws As Worksheet
    Const Url As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f142"

    Set ws = ThisWorkbook.Worksheets("Sheet1")

    With ie
        .Visible = True
        .Navigate2 Url

        While .Busy Or .readyState < 4: DoEvents: Wend

        Dim headers(), results(), listings As Object, listing As Object, amenities As String

        headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")

        Set listings = .document.querySelectorAll(".main li[class]")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1
        ReDim results(1 To rowCount, 1 To numColumns)
        For Each listing In listings
            r = r + 1
            'size,description, amenities,specials offer1 offer2, rate type, price
            With .document

                results(r, 1) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .size").innerText)
                results(r, 2) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .description").innerText)

                Set icons = .querySelectorAll("." & Join$(Split(listing.className, Chr$(32)), ".") & ":nth-of-type(" & r & ") i[title]")

                ReDim amenitiesInfo(0 To icons.Length - 1)

                For icon = 0 To icons.Length - 1
                    amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
                Next

                amenities = Join$(amenitiesInfo, ",")
                results(r, 3) = amenities
                results(r, 4) = .querySelector(".main li:nth-of-type(" & r & ") .offer1").innerText
                results(r, 5) = .querySelector(".main li:nth-of-type(" & r & ") .offer2").innerText
                results(r, 6) = .querySelector(".main li:nth-of-type(" & r & ") .rate-label").innerText
                results(r, 7) = .querySelector(".main li:nth-of-type(" & r & ") .price").innerText
            End With
        Next
        .Quit
        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub


References (VBE > Tools > References):

  1. Microsoft HTML Object Library
  2. Microsoft Internet Controls