Context
I am trying to build a Word document browser in Excel to sift trough a large amount of documents (around 1000).
The process of opening a word document proves to be rather slow (around 4 seconds per documents, so in this case it takes 2 hour to look through all the items, which is far too slow for a single query), even by disabling all things that could slow down the opening, hence I open:
- As read only
- Without the open and repair mode (which can happen on some documents)
- Disabling the display of the document
My attempt so far
These documents are tricky to look through because some keywords do appear every single time but not in the same context (not the core of the problem here since I can handle that when the text is loaded in arrays). Hence the often used Windows explorer
solution (like in this link ) cannot be used in my case.
For the moment, I managed to have a working macro that analyze the content of the word documents by opening them.
Code
Here is a sample of the code.
Note that I used the Microsoft Word 14.0 Object Library
reference
' Analyzing all the word document within the same folder '
Sub extractFile()
Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document
Set oWordApp = CreateObject("Word.Application")
sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"
With Application
.DisplayAlerts = False
.ScreenUpdating = False
End With
ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"
While (sAnalyzedDoc <> "")
' Analyzing documents only with the .doc and .docx extension '
If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
' Opening the document as mentionned above, in read only mode, without repair and invisible '
Set oDoc = Word.Documents.Open(sLibName & "\" & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
With oDoc
For i = 1 To .Sentences.Count
' Searching for the keyword within the document '
If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
If Not IsEmpty(aOut(1, 2)) Then
ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
End If
aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
GoTo closingDoc ' A dubious programming choice but that works for the moment '
End If
Next i
closingDoc:
' Intending to make the closing faster by not saving the document '
.Close SaveChanges:=False
End With
End If
'Moving on to the next document '
sAnalyzedDoc = Dir
Wend
exitSub:
With Output
.Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With
With Application
.DisplayAlerts = True
.ScreenUpdating = True
End With
End Sub
My question
The idea I thought was to go via the XML content within the document to access directly to its content (which you can access when renaming any document in newer versions of Word, with a .zip
extension and going for nameOfDocument.zip\word\document.xml
).
It would be a lot faster than loading all the images, charts and tables of the word document which are of no use in a text search.
Thus, I wanted to ask if there was a way in VBA to open a word document like a zip file and access that XML document to then process it like a normal string of characters in VBA, since I already have the path and the name of the file given the above code.
Do note that this is not an easy answer to the above problem and the sole VBA code in my initial question will do perfectly the job as long as you do not have a load of documents to browse, else go for another tool (there is a Python Dynamic Link Library (DLL) that does that very well).
Ok, I’ll try to make my answer as explanatory as possible.
First of all this question lead me to the infinite journey of XML in C# and in XPath which I chose not to pursue at some point.
It reduced the time of analyzing the files from roughly 2 hours to 10 seconds.
Context
The backbone of reading XML documents, and therefore inner word XML documents, is the OpenXML library from Microsoft.
Keep in mind what I said above, that the method I was trying to implement cannot be done solely in VBA and thus must be done in another way.
This is probably due to the fact that VBA is implemented for Office and thus limited in accessing the core structure of Office documents, but I have no information relating to this limitation (any information is welcomed).
The answer I will give here is writing a C# DLL for VBA.
For writing DLL in C# and referencing to it in VBA I redirect you toward the following link which will resume in a better way this specific process: Tutorial for creating DLL in C#
Let’s start
First of all you will need to reference the WindowsBase library and the DocumentFormat.OpenXML in your project to make the solution work as explained in this MSDN article Manipulate Office Open XML Formats Documents and that one Open and add text to a word processing document (Open XML SDK)
These articles explain broadly how works the OpenXML library for manipulating word documents.
The C# code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;
namespace BrowserClass
{
public class SpecificDirectory
{
public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
{
string sKeyWord = nameKeyword;
string sStopWord = nameStopword;
string sDirectory = nameDirectory;
sStopWord = sStopWord.ToLower();
sKeyWord = sKeyWord.ToLower();
string sDocPath = Path.GetDirectoryName(sDirectory);
// Looking for all the documents with the .docx extension
string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
string[] sDocumentList = new string[1];
string[] sDocumentText = new string[1];
// Cycling the documents retrieved in the folder
for (int i = 0; i < sDocName.Count(); i++)
{
string docWord = sDocName[i];
// Opening the documents as read only, no need to edit them
Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);
const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
PackagePart corePart = null;
Uri documentUri = null;
// We are extracting the part with the document content within the files
foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
{
documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
corePart = officePackage.GetPart(documentUri);
break;
}
// Here enter the proper code
if (corePart != null)
{
string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";
// Construction of a namespace manager to handle the different parts of the xml files
NameTable nt = new NameTable();
XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
nsmgr.AddNamespace("dc", dcPropertiesSchema);
nsmgr.AddNamespace("cp", cpPropertiesSchema);
nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);
// Loading the xml document's text
XmlDocument doc = new XmlDocument(nt);
doc.Load(corePart.GetStream());
// I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
string docInnerText = doc.DocumentElement.InnerText;
docInnerText = docInnerText.Replace("\* MERGEFORMAT", ".");
docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
docInnerText = docInnerText.Replace("Glossary.", "");
try
{
Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);
if (iPosStopWord == -1)
{
iPosStopWord = docInnerText.Length;
}
if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
{
// Redimensions the array if there was already a document loaded
if (sDocumentList[0] != null)
{
Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
}
sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
// Taking the small context around the keyword
sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, sKeyWord.Length + 60) + " (...)");
}
}
catch (ArgumentOutOfRangeException)
{
Console.WriteLine("Error reading inner text.");
}
}
// Closing the package to enable opening a document right after
officePackage.Close();
}
if (sDocumentList[0] != null)
{
// Preparing the array for output
string[,] sFinalArray = new string[sDocumentList.Length, 2];
for (int i = 0; i < sDocumentList.Length; i++)
{
sFinalArray[i, 0] = sDocumentList[i].Replace("\", "");
sFinalArray[i, 1] = sDocumentText[i];
}
return sFinalArray;
}
else
{
// Preparing the array for output
string[,] sFinalArray = new string[1, 1];
sFinalArray[0, 0] = "NO MATCH";
return sFinalArray;
}
}
}
}
The VBA code associated
Option Explicit
Const sLibname As String = "C:\pathToYourDocuments\"
Sub tester()
Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory
' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)
' Add here any necessary processing if needed '
End Sub
So in the end you get a tool that can scan .docx documents much faster than in a classic open-read-close approach in VBA at the cost of more code writing.
Above all you get a simple solution for your users that just want to perform simple search, especially when there is a huge number of word documents.
Note
Parsing Word .XML files can be nightmarish in VBA as pointed out by @Mikegrann .
Thankfully OpenXML has an XML parser C# , xml parsing. get data between tags that will do the work for you in C# and take the <w:t></w:t>
tags that are refering to the text of the docment. Though I found these answers so far but couldn’t make them work:
Parsing a MS Word generated XML file in C# , Reading specific XML elements from XML file
So I went for the .InnerText
solution I provided with my code above, to access the internal text, at the cost of having some formatting text input (like \\MERGEFORMAT
).