Processing Excel XML files with Perl and LibXML

Posted by: admin May 14, 2020

Questions:

I’m trying to process data in an Excel file saved as an XML spreadsheet. After doing a fair amount of research (I’ve not done much XML processing before) I still couldn’t make it work. Here is the content of my minimal file:

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40"
 xmlns:fn="http://www.w3.org/2005/xpath-functions"
 xmlns:sbmextension="http://www.serena.com/SBM/XSLT_Extension">
 <Worksheet ss:Name="index">
 </Worksheet>
</Workbook>

And my script:

use XML::LibXML;
use Data::Dumper;
my $filename = $ARGV[0];
my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file($filename);
my $xc = XML::LibXML::XPathContext->new( $doc->documentElement );
my $xpath = '/Workbook/Worksheet/@ss:Name';

print Dumper $xc->findvalue($xpath);

However, if I remove xmlns="urn:schemas-microsoft-com:office:spreadsheet" (the default namespace?), it starts working. Can you tell me what I’m missing? I guess I could just remove it before parsing the document, but I would like to understand what I’ve done wrong :). Thanks in advance.

Answers:

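XPath 1.0, which libxml2 implements, has no notion of a default namespace: an unprefixed name in an XPath expression matches only elements that are in no namespace. Because the xmlns="urn:schemas-microsoft-com:office:spreadsheet" declaration puts Workbook and Worksheet into that namespace, /Workbook/Worksheet matches nothing. To use XPath with namespaced documents, you have to register each namespace under a prefix first, and then use that prefix in every XPath expression that mentions elements of that namespace: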

#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML;
use Data::Dumper;

my $xml = << '__XML__';
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook
   xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40"
 xmlns:fn="http://www.w3.org/2005/xpath-functions"
 xmlns:sbmextension="http://www.serena.com/SBM/XSLT_Extension">
 <Worksheet ss:Name="index">
 </Worksheet>
</Workbook>
__XML__

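my $doc = XML::LibXML->load_xml( string => $xml );
my $xc  = XML::LibXML::XPathContext->new( $doc->documentElement );

# Bind a prefix to the namespace URI. The default namespace needs one too,
# because XPath cannot address it without a prefix.
$xc->registerNs('ss', 'urn:schemas-microsoft-com:office:spreadsheet');

# Every element step carries the registered prefix.
my $xpath = '/ss:Workbook/ss:Worksheet/@ss:Name';

print Dumper $xc->findvalue($xpath);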
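
The prefix passed to registerNs does not have to match any prefix used in the file; only the namespace URI has to match. The following is a minimal sketch of that point, not part of the original answer. The prefix 'sheet', the second worksheet named "data", and the variable names are all made up for illustration. It also runs the unprefixed query from the question to show that it should select nothing while the default namespace is in place.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# Trimmed-down workbook. The second Worksheet ("data") is hypothetical and
# was added only so that the query returns more than one node.
my $xml = <<'__XML__';
<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="index">
 </Worksheet>
 <Worksheet ss:Name="data">
 </Worksheet>
</Workbook>
__XML__

my $ns_uri = 'urn:schemas-microsoft-com:office:spreadsheet';

my $doc = XML::LibXML->load_xml( string => $xml );
my $xc  = XML::LibXML::XPathContext->new( $doc->documentElement );

# 'sheet' is an arbitrary prefix chosen for this sketch; it does not occur
# in the document. Only the URI it is bound to matters.
$xc->registerNs( 'sheet', $ns_uri );

# Unprefixed steps match only elements in no namespace.
my @unprefixed = $xc->findnodes('/Workbook/Worksheet');

# Prefixed steps match the elements in the spreadsheet namespace.
my @worksheets = $xc->findnodes('/sheet:Workbook/sheet:Worksheet');

# Read the namespaced attribute from each node by URI and local name.
my @names = map { $_->getAttributeNS( $ns_uri, 'Name' ) } @worksheets;

printf "unprefixed matches: %d\n", scalar @unprefixed;
print "worksheet: $_\n" for @names;
```

Using findnodes instead of findvalue returns each Worksheet element separately, which is usually more convenient once a workbook has several sheets.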