I have tried the following code to scrape a table from a local HTML file stored on my PC:
    Sub Test()
        Dim mtbl As Object
        Dim tableData As Object
        Dim tRow As Object
        Dim tcell As Object
        Dim trowNum As Integer
        Dim tcellNum As Integer
        Dim webpage As New HTMLDocument
        Dim fPath As String
        Dim strCnt As String
        Dim f As Integer

        fPath = Environ("USERPROFILE") & "\Desktop\LocalHTML.txt"
        f = FreeFile()
        Open fPath For Input As #f
        strCnt = Input(LOF(f), f)
        Close #f

        webpage.body.innerHTML = strCnt
        Set mtbl = webpage.getElementsByTagName("Table")(0)
        Set tableData = mtbl.getElementsByTagName("tr")
        Debug.Print tableData.Item(0).innerText

        On Error GoTo TryAgain:
        trowNum = 1
        For Each tRow In tableData
            For Each tcell In tRow.Children
                tcellNum = tcellNum + 1
                Sheet1.Cells(trowNum, tcellNum) = tcell.innerText
            Next tcell
            trowNum = trowNum + 1
            tcellNum = 0
        Next tRow
        Exit Sub

    TryAgain:
        Application.Wait Now + TimeValue("00:00:02")
        Err.Clear
        Resume
    End Sub
The code runs with no errors, but the results are incorrect in two respects.
First, the Arabic characters appear on the worksheet as question marks; that is, the Unicode characters are not read correctly.
Second, the data is scattered across the sheet in an unorganized structure.
Here’s the link to the local HTML file
Thanks in advance for your help.
So, maybe this will help a little. It is not the complete answer I would like to give. Basically, the HTML is a mess (in my opinion). You don’t have data laid out in rows (tr), with table cells (td) within, in a manner that you can use to easily isolate individual text elements.
I am offering the following really only to demonstrate the oddities of trying to isolate individual text components, and to show reading and writing with the Arabic characters preserved. I borrowed an ADODB stream method from @whom to ensure the file is read as UTF-8.
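As a minimal sketch of the UTF-8 reading step on its own: the function name `ReadUtf8Text` is hypothetical, and it uses late binding (`CreateObject`) so that no ADODB reference needs to be set. It assumes the file really is saved as UTF-8.

```vba
Option Explicit

' Hypothetical helper: returns the whole file as a string, decoded as UTF-8.
' Assumes the file at filePath exists and is UTF-8 encoded.
Public Function ReadUtf8Text(ByVal filePath As String) As String
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")   ' late bound, no reference required
    With stm
        .Type = 2              ' 2 = adTypeText
        .Charset = "utf-8"     ' decode bytes as UTF-8 rather than the ANSI code page
        .Open
        .LoadFromFile filePath
        ReadUtf8Text = .ReadText(-1)   ' -1 = adReadAll
        .Close
    End With
End Function
```

In the question's code, this could stand in for the `Open ... For Input` block, for example `webpage.body.innerHTML = ReadUtf8Text(fPath)`.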
The approach below, looping over table tags with hardcoded numbering, is ugly and really belongs in the sin bin. It relies on the fact that the later tables in the file each hold one of your individual components, and uses them to reconstruct an overall table appearance with rows and columns.
But you may get something from it:
    Option Explicit

    ' Requires references (VBE > Tools > References) to:
    '   Microsoft HTML Object Library
    '   Microsoft ActiveX Data Objects x.x Library
    Public Sub test()
        Dim fStream As ADODB.Stream, html As HTMLDocument
        Set html = New HTMLDocument
        Set fStream = New ADODB.Stream

        ' Read the file as UTF-8 so the Arabic text survives
        With fStream
            .Charset = "UTF-8"
            .Open
            .LoadFromFile "C:\Users\User\Downloads\LocalHTML.html"
            html.body.innerHTML = .ReadText
            .Close
        End With

        Dim hTables As Object, startTableNumber As Long, i As Long, r As Long, c As Long
        Dim counter As Long, endTableNumber As Long, numColumns As Long

        ' Hardcoded for this particular file
        startTableNumber = 43
        endTableNumber = 330
        numColumns = 9   ' values per output row; the check below hardcodes numColumns + 1

        Set hTables = html.getElementsByTagName("table")
        r = 2: c = 1

        ' Every second table holds one value; start a new row after 9 values
        For i = startTableNumber To endTableNumber Step 2
            counter = counter + 1
            If counter = 10 Then
                c = 1: r = r + 1: counter = 1
            End If
            Cells(r, c) = hTables(i).innerText
            c = c + 1
        Next
    End Sub
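The row and column bookkeeping in that loop can also be expressed arithmetically. The following is only a sketch: `CellForIndex` and `DemoCellForIndex` are hypothetical names, and it assumes the same layout as above (9 values per row, output starting at row 2, column 1).

```vba
Option Explicit

' Hypothetical helper: maps the k-th extracted value (0-based) to a worksheet cell,
' assuming numColumns values per output row and output starting at row 2, column 1.
Public Sub CellForIndex(ByVal k As Long, ByVal numColumns As Long, _
                        ByRef r As Long, ByRef c As Long)
    r = 2 + (k \ numColumns)     ' integer division picks the row
    c = 1 + (k Mod numColumns)   ' remainder picks the column
End Sub

' Prints the index alongside its computed row and column to the Immediate window.
Public Sub DemoCellForIndex()
    Dim k As Long, r As Long, c As Long
    For k = 0 To 10
        CellForIndex k, 9, r, c
        Debug.Print k, r, c
    Next k
End Sub
```

With 9 columns, indices 0 to 8 land in row 2 and index 9 wraps to row 3, column 1, which matches the counter reset in the loop above. In that loop, the 0-based index would be `(i - startTableNumber) \ 2`.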