What am I trying to do?
I have a list of URLs that I want to scrape using Excel’s Web Query functionality. I’m trying to completely automate the process, so I’m developing an SSIS package that calls a Script Task for each URL. The Script Task creates a new Excel workbook with a worksheet, activates the worksheet, adds a QueryTable connection, refreshes the QueryTable to get the data, using
XlWebSelectionType.xlAllTables. It then saves the workbook and closes the workbook and the Excel application.
What technologies am I utilizing?
- VS 2015 (Enterprise)
- SQL Server 2016
- Microsoft Excel 16.0 Object library
- Excel local install from Office 365 ProPlus
What’s the problem?
While the script task does save all the data from the tables on the web page, it puts them all into the single worksheet, and does not save the table names. So while my data is correctly grouped in the worksheet, I have no way of knowing which “group” of data corresponds to which table.
What do I want to do about it?
Ideally I would want each QueryTable table to be saved into its own Worksheet, with the table name set as the Worksheet name. Barring that, I need a way to save the table name with the corresponding data. Adding it as a new column in the QueryTable would be the best in this scenario.
What do I have so far?
Here’s the main part of the script:
Public Sub Main() Dim URL As String = Dts.Variables("User::URL").Value.ToString() Dim FileName As String = Dts.Variables("User::FileName").Value.ToString() Dim xlNone As XlWebFormatting = XlWebFormatting.xlWebFormattingNone Dim Format As XlFileFormat = XlFileFormat.xlCSVWindows Dim ScrapeStatus As Integer = 1 Dim excel As New Microsoft.Office.Interop.Excel.ApplicationClass With excel .SheetsInNewWorkbook = 1 .DisplayAlerts = False End With Dim wb As Microsoft.Office.Interop.Excel.Workbook = excel.Workbooks.Add() With wb .Activate() .Worksheets.Select(1) End With Try Dim rnStart As Range = wb.ActiveSheet.Range("A1:Z100") Dim qtQtrResults As QueryTable = wb.ActiveSheet.QueryTables.Add(Connection:="URL;" + URL, Destination:=rnStart) With qtQtrResults .BackgroundQuery = False .WebFormatting = xlNone .WebSelectionType = XlWebSelectionType.xlAllTables .Refresh() End With excel.CalculateUntilAsyncQueriesDone() wb.SaveAs(FileName) wb.Close() excel.Quit() System.Runtime.InteropServices.Marshal.ReleaseComObject(excel) GC.Collect() GC.WaitForPendingFinalizers() Dts.TaskResult = ScriptResults.Success Catch ex As Exception Dts.Variables("User::Error").Value = ex.Message.ToString() wb.Saved = True wb.Close() excel.Quit() System.Runtime.InteropServices.Marshal.ReleaseComObject(excel) GC.Collect() GC.WaitForPendingFinalizers() Dts.TaskResult = ScriptResults.Failure End Try End Sub
What results am I getting?
For the URL
http://athletics.chabotcollege.edu/information/directory/home#directory, if I use the Web Query functionality while inside Excel, I get the following to select from:
All the table names are displayed
I should also note that while most of the web pages have a similar structure, not all are the same. So I can’t assume every page will have the same table names, or structure the tables in the same way. My solution needs to be dynamic and flexible.
I don’t think you can get the table names via web queries, if you check the web page source you can notice that the tables does not have a name attribute. The names that Excel shows in the interface are not related to the tables, they are the title of the section (which is the parent Tag of the Table) so they are not considered as Tablename.
Also, after checking the QueryTable documentation there is no option to retrieve table names or title of the table container, so it is not necessary that Excel uses the web queries to show the Tables and the headers in the Interface (as shown in the screenshots)
I think there is one way to split the data over worksheets (without table names) is:
- You must use Regular Expressions to retrieve the Table count from the web page
- You have to create a worksheet for each Table
- You have to create a QueryTable for each Table
In each QueryTable you have to set the destination worksheet Range and the following properties:
.WebSelectionType = XlWebSelectionType.xlSpecifiedTables .WebTables = i 'Where i is the index of Table
Maybe you should use an HTML parser and Regular expression to collect the Table metadata
.WebSelectionType = XlWebSelectionType.xlAllTables to
.WebSelectionType = XlWebSelectionType.xlEntirePage I’m able to capture the “names” of the tables. They are actually
aria-title values inside the parent
<section> tag of each table. It’s ugly, but it does return the strings I’m looking for.
I ended up saving both the
xlEntirePage QueryTables as text files. Then I split the
xlAllTables file into separate chunks for each table, and then search the
xlEntirePage text file for the “string” that represents the table, and I copy the preceding line, which has the title. I then save the table text as a new file with the copied title as the filename. It’s very hacky, but it did what I needed it to do.