I am trying to pull some data from web-site, but as I’m new learner in the web scraping hence confused in the Tag Name, Class Code and in ID.. I have only basic knowledge about it.
I want to copy below Data and if the data is not exist then the cell should be left blank and code needs to be move in next value.
Class="container size" - 5*5,5*10 kind of value
Class="description" - Standard in this case also need to copy like Drive-up Access
Class="offer1" & "offer2" - Call for Availability
Class="price"
I tried to frame a code but can’t judge the exact which Tag name needs to be choose, below is my code please help me to copy this data.
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 "" & Sheets("Home").Range("C3").Text
While .Busy Or .readyState < 4: DoEvents: Wend
Sheets("Unit Data").Select
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features")
Set list = .document.getElementsByClassName("units-table main")
'.unit_size medium, .features, .promo_offers, .board_rate_wrapper p, .board_rate
Dim rowCount As Long
rowCount = .document.querySelectorAll(".units-table main li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByTagName("li")
r = r + 1
On Error Resume Next
results(r, 1) = item.getElementsByClassName("container size")(0).innerText
results(r, 2) = item.getElementsByClassName("description")(0).innerText
On Error GoTo 0
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
XHR:
All the info is available via XMLHTTP (XHR) request – much faster than opening a browser.
I first retrieve the row count with a css selector of .main li[class]
.
The "."
is a class selector, the li
is a type selector and the [class]
is an attribute selector. The space, " "
, in between is a descendant combinator. That specifies I want to retrieve all li
tag/type elements, with a class attribute, having a parent whose classname is main
.
This matches as follows:
As you can see, this gives me the row count; the number of parent li
elements to retrieve info from for the result set.
This collection of li elements is returned as a nodeList by querySelectorAll
. I cannot loop over this list applying getElementsByClassName
/ querySelector
to individual nodes, as li
elements expose no methods I can use.
Now, as I am not using a browser, I am forced to rely on the methods available to HTMLDocument object. Unlike with a browser, I do not have access to the limited pseudo class selectors that they support, when automated via VBA, which would allow me to use selector syntax such as :nth-of-type
to access individual rows. This is an annoying limitation of web-scraping with VBA.
So, what can we do? Well, in this instance I can dump the innerHTML
of each node into another HTMLDocument
variable, html2
, so that I can access the querySelector/querySelectorAll
methods of that object. The HTML will then only be limited to the current li
.
If we look at the HTML in question:
We can see that the li
elements are general siblings. They sit next to each other at the same level. As I loop my nodeList listings
, I am transferring the innerHTML
from the current node into html2
; my second HTMLDocument
variable.
It’s worth noting I could probably have descended each listing using children
for example:
listings.item(i).Children(2)......
I could then have split on newLines etc so access all info. I think my given method is faster and more robust though.
VBA:
Option Explicit
Public Sub GetInfo()
Dim ws As Worksheet, html As HTMLDocument, s As String
Const URL As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f1426"
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
Dim headers(), results(), listings As Object, amenities As String
headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
Set listings = html.querySelectorAll(".main li[class]")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
For item = 0 To listings.Length - 1
r = r + 1
html2.body.innerHTML = listings.item(item).innerHTML
'size,description, amenities,specials offer1 offer2, rate type, price
results(r, 1) = Trim$(html2.querySelector(".size").innerText)
results(r, 2) = Trim$(html2.querySelector(".description").innerText)
Set icons = html2.querySelectorAll("i[title]")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1
amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
Next
amenities = Join$(amenitiesInfo, ", ")
results(r, 3) = amenities
results(r, 4) = html2.querySelector(".offer1").innerText
results(r, 5) = html2.querySelector(".offer2").innerText
results(r, 6) = html2.querySelector(".rate-label").innerText
results(r, 7) = html2.querySelector(".price").innerText
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Internet Explorer:
Assuming not re-directed from given url. Here I am using :nth-of-type pseudo class selector to target each row of the listings. Those rows are the li
(list) elements holding the info for each box listing. I build up a css selector string that specifies the row and then the element within the row I am after. I pass that string to querySelector
, or querySelectorAll
which returns matched element/s.
Option Explicit
Public Sub UseIE()
Dim ie As New InternetExplorerm, ws As Worksheet
Const Url As String = "https://www.neighborhoodselfstorage.net/self-storage-delmar-md-f142"
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
.Navigate2 Url
While .Busy Or .readyState < 4: DoEvents: Wend
Dim headers(), results(), listings As Object, listing As Object, amenities As String
headers = Array("Size", "Description", "Amenities", "Offer1", "Offer2", "RateType", "Price")
Set listings = .document.querySelectorAll(".main li[class]")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
For Each listing In listings
r = r + 1
'size,description, amenities,specials offer1 offer2, rate type, price
With .document
results(r, 1) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .size").innerText)
results(r, 2) = Trim$(.querySelector(".main li:nth-of-type(" & r & ") .description").innerText)
Set icons = .querySelectorAll("." & Join$(Split(listing.className, Chr$(32)), ".") & ":nth-of-type(" & r & ") i[title]")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1
amenitiesInfo(icon) = icons.item(icon).getAttribute("title")
Next
amenities = Join$(amenitiesInfo, ",")
results(r, 3) = amenities
results(r, 4) = .querySelector(".main li:nth-of-type(" & r & ") .offer1").innerText
results(r, 5) = .querySelector(".main li:nth-of-type(" & r & ") .offer2").innerText
results(r, 6) = .querySelector(".main li:nth-of-type(" & r & ") .rate-label").innerText
results(r, 7) = .querySelector(".main li:nth-of-type(" & r & ") .price").innerText
End With
Next
.Quit
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
References (VBE > Tools > References):
- Microsoft HTML Object Library
- Microsoft Internet Controls
Tags: api, excelexcel