perl – What is a good CPAN parser for HTML MS Excel files?

Posted by: admin, April 23, 2020


I know that regular (binary) Excel files can be processed via Spreadsheet::ParseExcel.

However, I have a file that is HTML formatted:

<html xmlns:x="urn:schemas-microsoft-com:office:excel">
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">
<!--[if gte mso 9]>

Short of manually parsing it as a generic HTML file (e.g. with TreeBuilder), is there a CPAN module that would parse such a file and let me access it as a spreadsheet, similar to Spreadsheet::ParseExcel?

Here’s where the module doesn’t work:

use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser = Spreadsheet::ParseExcel->new();
my $file   = 'file1.xls';
my $workbook;
eval { $workbook = $parser->Parse($file); };
# $workbook is undef at this point

Answers:

I use an XPath parser to extract what I need from files like this, iterating over the ./Cell/Data nodes inside each of the //Row nodes, but that does not give you the same interface as Spreadsheet::ParseExcel.

I also find that you need to do some source filtering before you can use the XML parser. At a minimum you have to run

s/<xml version>/<!-- xml version -->/;

on the input.

Here’s a concise but complete solution, extracting a file like this to a 2-D array:

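use XML::XPath;

# $dirty_file_name is the original file; $clean_file_name receives the filtered copy.
open my $in,  '<', $dirty_file_name or die "Cannot read $dirty_file_name: $!";
open my $out, '>', $clean_file_name or die "Cannot write $clean_file_name: $!";
while (<$in>) {
    s/<xml version>/<!-- xml version -->/;   # the source filter described above
    print $out $_;
}
close $out;
close $in;

# One array reference per Row, holding the text of each ./Cell/Data node in it.
my @table = map {
    [ map { $_->string_value } $_->find('./Cell/Data')->get_nodelist ]
} XML::XPath->new( filename => $clean_file_name )->find('//Row')->get_nodelist;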
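
Once the data is in a 2-D array, reading it back by row and column is straightforward. The following is only a minimal sketch: the sample @table contents and the cell() helper are hypothetical and are not part of XML::XPath or Spreadsheet::ParseExcel. It assumes the same layout as above, one array reference per row with one string per cell, and it allows for rows of different lengths.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the @table built above:
# one array reference per row, one string per cell.
my @table = (
    [ 'Name',  'Qty' ],
    [ 'Apple', '3'   ],
    [ 'Pear' ],              # a row may have fewer cells than the others
);

# Hypothetical helper: return the cell at a zero-based row and column,
# or undef when that position does not exist.
sub cell {
    my ($table, $row, $col) = @_;
    return undef if $row < 0 || $row > $#{$table};
    my $cells = $table->[$row];
    return undef if $col < 0 || $col > $#{$cells};
    return $cells->[$col];
}

# Print every row as tab-separated text.
for my $row (@table) {
    print join("\t", @$row), "\n";
}
```

With a helper like this, cell(\@table, 1, 0) looks up the first cell of the second row, and a missing cell comes back as undef rather than raising an error.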