Here is the relevant HTML code.
<tr style="background-color: #f0f0f0"> <td> </td><td> a</td><td>a </td><td> </td><td> </td> </tr>
Here is the VBA code.
Sub gethtmlspace()
    Dim trObj As MSHTML.HTMLGenericElement
    Dim tdObj As MSHTML.HTMLGenericElement
    Dim aRes As Variant, bRes As Variant
    Dim temp1 As Long, Temp2 As Long, temp3 As Long, Temp4 As Long
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")
    Dim oRow As MSHTML.IHTMLElementCollection, oCell As MSHTML.IHTMLElementCollection

    temp1 = 0
    Temp2 = 0

    With CreateObject("MSXML2.ServerXMLHttp")
        .Open "GET", "https://docs.google.com/spreadsheets/d/1Yh6WlJTDxbOLPVaVgzn_mk2OAKYVUYgfnT5Wz-8odi4/gviz/tq?tqx=out:html&tq&gid=1", False
        .send
        oDom.body.innerHTML = .responseText
    End With

    Set oRow = oDom.getElementsByTagName("TR")
    ReDim aRes(0 To oRow.Length - 1, 0 To oRow(0).getElementsByTagName("TD").Length - 1)

    For Each trObj In oRow
        Set oCell = trObj.getElementsByTagName("td")
        For Each tdObj In oCell
            aRes(temp1, Temp2) = tdObj.innerText
            Temp2 = Temp2 + 1
        Next tdObj
        Temp2 = 0
        temp1 = temp1 + 1
    Next trObj
End Sub
I would like the aRes array to contain the exact values in the HTML code, i.e.
aRes(1,0) should be a single space, " ". My result is empty, i.e. "".
aRes(1,1) should be a space followed by the character a, " a". My result is only "a".
aRes(1,2) should be "a ". This one is correctly retrieved.
aRes(1,3) should be two spaces, "  ". My result is empty, i.e. "".
aRes(1,4) should be empty. My result is a space, i.e. " ".
I know I can use regex to get the task done. However, I would like to do it in a simple way using the getElementsByTagName method.
I tried innerHTML, outerText, outerHTML and textContent instead of innerText, but no luck.
I also googled keywords such as innerText with spacing and getElementsByTagName properties. Also no luck.
Could someone help, please? Thank you so much.
You can't, per se. The HTML parser decides which whitespace is useful and should be retained, and which should be removed. I will add some references later (if I can find any), but just as in the browser engine, the HTML parser has rules that determine which whitespace characters are useful.
Bear in mind that whitespace is a broad term covering a variety of characters, which may be handled differently.
Compare what happens to your responseText after it has gone through the HTML parser:
See how whitespace determined not to be useful is removed. You cannot use a method of HTMLFile to get the result you want, as by the time the HTML has been parsed it is too late; and there is no setting with late-bound HTMLFile, or early-bound MSHTML.HTMLDocument, that changes this. You would have to look to other string manipulations first. You might, for example, do a Replace$ on the Chr$(32) with the HTML entity &nbsp;. Or use regex, as you mention, to do a more efficient set of replacements.
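As a rough illustration of the replace idea, here is a minimal sketch that swaps spaces for the non-breaking space entity (&nbsp;) before parsing and turns them back into ordinary spaces when reading innerText. The helper names ProtectCellSpaces and RestoreSpaces are hypothetical, not part of any library. The sketch assumes the cells are plain td elements and that no attribute value contains a ">" character. It only touches spaces inside td cells, because a blanket replace over the whole responseText would also rewrite the spaces inside tags such as the tr style attribute. Any genuine non-breaking spaces already in the source will also come back as ordinary spaces.

```vba
' Hypothetical helper: replaces spaces in the text of <td> cells with &nbsp;
' so the HTML parser does not collapse or drop them.
Public Function ProtectCellSpaces(ByVal html As String) As String
    Dim i As Long, tagStart As Long
    Dim ch As String, tagText As String, result As String
    Dim insideTag As Boolean, inCell As Boolean

    For i = 1 To Len(html)
        ch = Mid$(html, i, 1)
        Select Case ch
            Case "<"
                insideTag = True
                tagStart = i
                result = result & ch
            Case ">"
                If insideTag Then
                    ' Look at the tag that just closed to track whether we are in a cell
                    tagText = LCase$(Mid$(html, tagStart, i - tagStart + 1))
                    If tagText Like "<td*" Then inCell = True
                    If tagText Like "</td*" Then inCell = False
                End If
                insideTag = False
                result = result & ch
            Case " "
                If inCell And Not insideTag Then
                    result = result & "&nbsp;"   ' cell text: protect the space
                Else
                    result = result & ch         ' inside a tag or between tags: leave alone
                End If
            Case Else
                result = result & ch
        End Select
    Next i

    ProtectCellSpaces = result
End Function

' Hypothetical helper: innerText returns a non-breaking space as character 160;
' turn those back into ordinary spaces (character 32).
Public Function RestoreSpaces(ByVal cellText As String) As String
    RestoreSpaces = Replace$(cellText, ChrW$(160), " ")
End Function
```

In the question's code this would mean assigning ProtectCellSpaces(.responseText) to oDom.body.innerHTML and storing RestoreSpaces(tdObj.innerText) in aRes. The character-by-character concatenation is slow on large responses; it is kept simple here for readability.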
You can generate the above image outputs with:
Option Explicit

Public Sub ExamineHtmlWhenParsed()
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")

    With CreateObject("MSXML2.ServerXMLHTTP")
        .Open "GET", "https://docs.google.com/spreadsheets/d/1Yh6WlJTDxbOLPVaVgzn_mk2OAKYVUYgfnT5Wz-8odi4/gviz/tq?tqx=out:html&tq&gid=1", False
        .send
        oDom.body.innerHTML = .responseText
        WriteTxtFile .responseText, "C:\Users\User\Desktop\input.txt"
        WriteTxtFile oDom.body.innerHTML, "C:\Users\User\Desktop\parsed.txt"
    End With
End Sub

Public Sub WriteTxtFile(ByVal aString As String, ByVal filePath As String)
    Dim fso As Object, Fileout As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set Fileout = fso.CreateTextFile(filePath, True, True)
    Fileout.Write aString
    Fileout.Close
End Sub
This gives a worked example of browser white space processing.
This discusses it in the context of CSS.
@JasonWoof: HTML5 spec says that browsers should only collapse 5 (ascii) whitespace characters (space, tab, cr, lf, ff).
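To make the comment above concrete, here is a small sketch of what collapsing runs of those five ASCII characters into a single space could look like. The function name CollapseAsciiWhitespace is hypothetical, and this is only an approximation for illustration: a real engine applies further rules (for example around the start and end of a block), and the legacy parser behind htmlFile may not follow the HTML5 rules exactly. Note that character 160 (the non-breaking space) is not one of the five, so it passes through untouched.

```vba
' Hypothetical helper: collapses each run of space, tab, LF, FF and CR
' into a single space. Other characters, including Chr 160, are kept as-is.
Public Function CollapseAsciiWhitespace(ByVal text As String) As String
    Dim i As Long
    Dim ch As String, result As String
    Dim inRun As Boolean

    For i = 1 To Len(text)
        ch = Mid$(text, i, 1)
        Select Case AscW(ch)
            Case 32, 9, 10, 12, 13          ' space, tab, LF, FF, CR
                If Not inRun Then result = result & " "
                inRun = True
            Case Else
                result = result & ch
                inRun = False
        End Select
    Next i

    CollapseAsciiWhitespace = result
End Function
```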