
excel – VBA Retrieve data from an HTML including space

Posted by: admin April 23, 2020

Questions:

Here is the relevant HTML code.

<tr style="background-color: #f0f0f0">
<td> </td><td> a</td><td>a </td><td>  </td><td>&nbsp;</td>
</tr>

Here is the VBA code.

Sub GetHtmlSpace()

    ' Requires a reference to Microsoft HTML Object Library (MSHTML)
    Dim trObj As MSHTML.HTMLGenericElement
    Dim tdObj As MSHTML.HTMLGenericElement
    Dim aRes As Variant
    Dim temp1 As Long, Temp2 As Long
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")
    Dim oRow As MSHTML.IHTMLElementCollection, oCell As MSHTML.IHTMLElementCollection

    temp1 = 0   ' row index into aRes
    Temp2 = 0   ' column index into aRes

    ' Download the sheet as HTML and hand it to the HTML parser
    With CreateObject("MSXML2.ServerXMLHttp")
        .Open "GET", "https://docs.google.com/spreadsheets/d/1Yh6WlJTDxbOLPVaVgzn_mk2OAKYVUYgfnT5Wz-8odi4/gviz/tq?tqx=out:html&tq&gid=1", False
        .send
        oDom.body.innerHTML = .responseText
    End With

    ' Size the array from the row count and the cell count of the first row
    Set oRow = oDom.getElementsByTagName("TR")
    ReDim aRes(0 To oRow.Length - 1, 0 To oRow(0).getElementsByTagName("TD").Length - 1)
    For Each trObj In oRow
        Set oCell = trObj.getElementsByTagName("td")
        For Each tdObj In oCell
            aRes(temp1, Temp2) = tdObj.innerText
            Temp2 = Temp2 + 1
        Next tdObj
        Temp2 = 0
        temp1 = temp1 + 1
    Next trObj

End Sub

I would like the aRes array to contain the exact values in the HTML code, i.e.

aRes(1,0) should be a single space " ". My result is empty, i.e. "".

aRes(1,1) should be a space followed by the character a, " a". My result is only "a".

aRes(1,2) should be "a ". This one is correctly retrieved.

aRes(1,3) should be two spaces "  ". My result is empty, i.e. "".

aRes(1,4) should be empty "". My result is a space, i.e. " ".

I know I can use regex to get the task done. However, I would like to do it in a simple way using the getElementsByTagName method.

I tried innerHTML, outerText, outerHTML and textContent instead of innerText, but no luck.
I also googled keywords such as "innerText with spacing" and "getElementsByTagName properties". Also no luck.

Could someone help, please? Thank you so much.

How to & Answers:

You can't, per se. The HTML parser decides which whitespace is significant and should be retained, and which should be removed. Just as in a browser engine, the HTML parser follows rules that determine which whitespace characters are significant.

Bear in mind that "whitespace" is a mass noun covering a variety of characters, which may be handled differently.

Compare your responseText before and after it has gone through the HTML parser: whitespace that the parser deems insignificant is removed.

You cannot use a method of HTMLFile to get the result you want, because by the time the HTML has been parsed it is too late, and there is no setting with late bound HTMLFile, or early bound MSHTML.HTMLDocument, that changes this. You would have to apply string manipulation to the .responseText first. You might, for example, use Replace$ to swap Chr$(32) for the HTML entity &nbsp; (restricted to the cell contents, since a blanket replace would also alter the spaces inside tags such as <tr style="...">). Or use regex, as you mention, to do a more targeted set of replacements.
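As a rough illustration of working on the unparsed text, the sketch below reads each cell's content straight from the response string with VBScript.RegExp, so the HTML parser never gets a chance to touch the whitespace. The function name RawCellContents and the demo routine are hypothetical, not part of the original answer, and the pattern assumes simple, non-nested <td> cells like those in the question.

```vba
Option Explicit

' Hypothetical helper: returns the raw text between each <td>...</td> pair,
' taken from the unparsed HTML string so whitespace is left exactly as sent.
' Assumes simple cells without nested tables.
Public Function RawCellContents(ByVal html As String) As Variant
    Dim re As Object, matches As Object
    Dim result() As String
    Dim i As Long

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Capture everything (including line breaks) between <td ...> and </td>
    re.Pattern = "<td\b[^>]*>([\s\S]*?)</td>"

    Set matches = re.Execute(html)
    If matches.Count = 0 Then
        RawCellContents = Array()   ' empty array when no cells are found
        Exit Function
    End If

    ReDim result(0 To matches.Count - 1)
    For i = 0 To matches.Count - 1
        result(i) = matches(i).SubMatches(0)
    Next i
    RawCellContents = result
End Function

Public Sub DemoRawCellContents()
    Dim cells As Variant, i As Long
    cells = RawCellContents("<tr><td> </td><td> a</td><td>a </td><td>  </td><td>&nbsp;</td></tr>")
    ' Brackets make leading and trailing spaces visible in the Immediate window
    For i = LBound(cells) To UBound(cells)
        Debug.Print "[" & cells(i) & "]"
    Next i
End Sub
```

With the row from the question, this should print [ ], [ a], [a ], [  ] and [&nbsp;]. Note that entities are returned undecoded, so a cell holding &nbsp; comes back as that literal text and would need its own handling. In practice you would pass .responseText instead of the literal string.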

You can produce the before and after text files for this comparison with:

Option Explicit

Public Sub ExamineHtmlWhenParsed()
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")

    With CreateObject("MSXML2.ServerXMLHTTP")
        .Open "GET", "https://docs.google.com/spreadsheets/d/1Yh6WlJTDxbOLPVaVgzn_mk2OAKYVUYgfnT5Wz-8odi4/gviz/tq?tqx=out:html&tq&gid=1", False
        .send
        ' Let the HTML parser process the response
        oDom.body.innerHTML = .responseText
        ' Raw response, before parsing
        WriteTxtFile .responseText, "C:\Users\User\Desktop\input.txt"
        ' The same markup after the parser has processed it
        WriteTxtFile oDom.body.innerHTML, "C:\Users\User\Desktop\parsed.txt"
    End With

End Sub

Public Sub WriteTxtFile(ByVal aString As String, ByVal filePath As String)
    Dim fso As Object, Fileout As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    ' Arguments: path, overwrite existing file, write as Unicode
    Set Fileout = fso.CreateTextFile(filePath, True, True)
    Fileout.Write aString
    Fileout.Close
End Sub

This gives a worked example of browser whitespace processing.

This discusses it in the context of CSS.

The VBA HTML parsers will be older than the current HTML5 living standard, but the current standard is here. You can review the answers given to this question and the associated comments, e.g.:

@JasonWoof: HTML5 spec says that browsers should only collapse 5 (ascii) whitespace characters (space, tab, cr, lf, ff).
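The collapsing rule in that comment can be sketched in VBA to show why a non-breaking space survives where an ordinary space does not. This is only an illustration of the collapse step for the five listed characters, with the hypothetical names CollapseAsciiWhitespace and DemoCollapse; it does not reproduce what the older MSHTML parser actually does, which (as the results in the question show) also drops whitespace in some positions entirely.

```vba
Option Explicit

' Hypothetical helper: collapses each run of the five ASCII whitespace
' characters (space, tab, CR, LF, FF) into a single space.
' Character 160 (non-breaking space) is not in the set, so it is untouched.
Public Function CollapseAsciiWhitespace(ByVal inputText As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "[ \t\r\n\f]+"
    CollapseAsciiWhitespace = re.Replace(inputText, " ")
End Function

Public Sub DemoCollapse()
    ' Two ordinary spaces collapse to one: length goes from 4 to 3
    Debug.Print Len(CollapseAsciiWhitespace("a" & Chr$(32) & Chr$(32) & "b"))
    ' Two non-breaking spaces are left alone: length stays 4
    Debug.Print Len(CollapseAsciiWhitespace("a" & ChrW$(160) & ChrW$(160) & "b"))
End Sub
```

This is why substituting &nbsp; for ordinary spaces before parsing preserves them: the resulting character 160 falls outside the set that gets collapsed, and can be converted back to Chr$(32) after reading innerText.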