I am trying to extract the data from a PDF document into a worksheet. The PDFs show and text can be manually copied and pasted into the Excel document.
I am currently doing this through SendKeys and it is not working. I get an error when I try to paste the data from the PDF document. Why is my paste not working? If I paste after the macro has stopped running it pastes as normal.
Dim myPath As String, myExt As String Dim ws As Worksheet Dim openPDF As Object 'Dim pasteData As MSForms.DataObject Dim fCell As Range 'Set pasteData = New MSForms.DataObject Set ws = Sheets("DATA") If ws.Cells(ws.Rows.Count, "A").End(xlUp).Row > 1 Then Range("A3:A" & ws.Cells(ws.Rows.Count, "A").End(xlUp).Row).ClearContents myExt = "\*.pdf" 'When Scan Receipts Button Pressed Scan the selected folder/s for receipts For Each fCell In Range(ws.Cells(1, 1), ws.Cells(1, ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column)) myPath = Dir(fCell.Value & myExt) Do While myPath <> "" myPath = fCell.Value & "\" & myPath Set openPDF = CreateObject("Shell.Application") openPDF.Open (myPath) Application.Wait Now + TimeValue("00:00:2") SendKeys "^a" Application.Wait Now + TimeValue("00:00:2") SendKeys "^c" 'Application.Wait Now + TimeValue("00:00:2") ws.Select ActiveSheet.Paste 'pasteData.GetFromClipboard 'ws.Cells(3, 1) = pasteData.GetText Exit Sub myPath = Dir Loop Next fCell
You can open the PDF file and extract its contents using the Adobe library (which I believe you can download from Adobe as part of the SDK, but it comes with certain versions of Acrobat as well)
Make sure to add the Library to your references too (On my machine it is the Adobe Acrobat 10.0 Type Library, but not sure if that is the newest version)
Even with the Adobe library it is not trivial (you’ll need to add your own error-trapping etc):
Function getTextFromPDF(ByVal strFilename As String) As String Dim objAVDoc As New AcroAVDoc Dim objPDDoc As New AcroPDDoc Dim objPage As AcroPDPage Dim objSelection As AcroPDTextSelect Dim objHighlight As AcroHiliteList Dim pageNum As Long Dim strText As String strText = "" If (objAvDoc.Open(strFilename, "") Then Set objPDDoc = objAVDoc.GetPDDoc For pageNum = 0 To objPDDoc.GetNumPages() - 1 Set objPage = objPDDoc.AcquirePage(pageNum) Set objHighlight = New AcroHiliteList objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page Set objSelection = objPage.CreatePageHilite(objHighlight) If Not objSelection Is Nothing Then For tCount = 0 To objSelection.GetNumText - 1 strText = strText & objSelection.GetText(tCount) Next tCount End If Next pageNum objAVDoc.Close 1 End If getTextFromPDF = strText End Function
What this does is essentially the same thing you are trying to do – only using Adobe’s own library. It’s going through the PDF one page at a time, highlighting all of the text on the page, then dropping it (one text element at a time) into a string.
Keep in mind what you get from this could be full of all kinds of non-printing characters (line feeds, newlines, etc) that could even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.
Hope that helps!
I know this is an old issue but I just had to do this for a project at work, and I am very surprised that nobody has thought of this solution yet:
Just open the .pdf with Microsoft word.
The code is a lot easier to work with when you are trying to extract data from a .docx because it opens in Microsoft Word. Excel and Word play well together because they are both Microsoft programs. In my case, the file of question had to be a .pdf file. Here’s the solution I came up with:
- Choose the default program to open .pdf files to be Microsoft Word
- The first time you open a .pdf file with word, a dialogue box pops up claiming word will need to convert the .pdf into a .docx file. Click the check box in the bottom left stating “do not show this message again” and then click OK.
- Create a macro that extracts data from a .docx file. I used MikeD’s Code as a resource for this.
- Tinker around with the MoveDown, MoveRight, and Find.Execute methods to fit the need of your task.
Yes you could just convert the .pdf file to a .docx file but this is a much simpler solution in my opinion.
Over time, I have found that extracting text from PDFs in a structured format is tough business. However if you are looking for an easy solution, you might want to consider XPDF tool
Pseudocode to extract the text would include:
SHELLVBA statement to extract the text from PDF to a temporary file using XPDF
- Using sequential file read statements to read the temporary file contents into a string
- Pasting the string into Excel
Simplified example below:
Sub ReadIntoExcel(PDFName As String) 'Convert PDF to text Shell "C:\Utils\pdftotext.exe -layout " & PDFName & " tempfile.txt" 'Read in the text file and write to Excel Dim TextLine as String Dim RowNumber as Integer Dim F1 as Integer RowNumber = 1 F1 = Freefile() Open "tempfile.txt" for Input as #F1 While Not EOF(#F1) Line Input #F1, TextLine ThisWorkbook.WorkSheets(1).Cells(RowNumber, 1).Value = TextLine RowNumber = RowNumber + 1 Wend Close #F1 End Sub
Copying and pasting by user interactions emulation could be not reliable (for example, popup appears and it switches the focus). You may be interested in trying the commercial ByteScout PDF Extractor SDK that is specifically designed to extract data from PDF and it works from VBA. It is also capable of extracting data from invoices and tables as CSV using VB code.
Here is the VBA code for Excel to extract text from given locations and save them into cells in the
Private Sub CommandButton1_Click() ' Create TextExtractor object ' Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor") Dim extractor As New Bytescout_PDFExtractor.TextExtractor extractor.RegistrationName = "demo" extractor.RegistrationKey = "demo" ' Load sample PDF document extractor.LoadDocumentFromFile ("c:\sample1.pdf") ' Get page count pageCount = extractor.GetPageCount() Dim wb As Workbook Dim ws As Worksheet Dim TxtRng As Range Set wb = ActiveWorkbook Set ws = wb.Sheets("Sheet1") For i = 0 To pageCount - 1 RectLeft = 10 RectTop = 10 RectWidth = 100 RectHeight = 100 ' check the same text is extracted from returned coordinates extractor.SetExtractionArea RectLeft, RectTop, RectWidth, RectHeight ' extract text from given area extractedText = extractor.GetTextFromPage(i) ' insert rows ' Rows(1).Insert shift:=xlShiftDown ' write cell value Set TxtRng = ws.Range("A" & CStr(i + 2)) TxtRng.Value = extractedText Next Set extractor = Nothing End Sub
Disclosure: I am related to ByteScout
Using Bytescout PDF Extractor SDK is a good option. It is cheap and gives plenty of PDF related functionality. One of the answers above points to the dead page Bytescout on GitHub. I am providing a relevant working sample to extract table from PDF. You may use it to export in any format.
Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor") extractor.RegistrationName = "demo" extractor.RegistrationKey = "demo" ' Load sample PDF document extractor.LoadDocumentFromFile "../../sample3.pdf" For ipage = 0 To extractor.GetPageCount() - 1 ' starting extraction from page #" extractor.PrepareStructure ipage rowCount = extractor.GetRowCount(ipage) For row = 0 To rowCount - 1 columnCount = extractor.GetColumnCount(ipage, row) For col = 0 To columnCount-1 WScript.Echo "Cell at page #" +CStr(ipage) + ", row=" & CStr(row) & ", column=" & _ CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col) Next Next Next
Many more samples available here: https://github.com/bytescout/pdf-extractor-sdk-samples
Since I do not prefer to rely on external libraries and/or other programs, I have extended your solution so that it works.
The actual change here is using the GetFromClipboard function instead of Paste which is mainly used to paste a range of cells.
Of course, the downside is that the user must not change focus or intervene during the whole process.
Dim pathPDF As String, textPDF As String Dim openPDF As Object Dim objPDF As MsForms.DataObject pathPDF = "C:\some\path\data.pdf" Set openPDF = CreateObject("Shell.Application") openPDF.Open (pathPDF) 'TIME TO WAIT BEFORE/AFTER COPY AND PASTE SENDKEYS Application.Wait Now + TimeValue("00:00:2") SendKeys "^a" Application.Wait Now + TimeValue("00:00:2") SendKeys "^c" Application.Wait Now + TimeValue("00:00:1") AppActivate ActiveWorkbook.Windows(1).Caption objPDF.GetFromClipboard textPDF = objPDF.GetText(1) MsgBox textPDF
If you’re interested see my project in github.
To improve the solution of Slinky Sloth I had to add this beforere get from clipboard :
Set objPDF = New MSForms.DataObject
Sadly it didn’t worked for a pdf of 10 pages.
This doesn’t seem to work with the Adobe Type library. As soon as it gets to Open, I get a 429 error. Acrobat works fine though…