I am trying to extract the data from a PDF document into a worksheet. The PDFs show and text can be manually copied and pasted into the Excel document. I am currently doing this through SendKeys and it is not working. I get an error when I try to paste the data from the PDF document. Why is my paste not working? If I paste after the macro has stopped running it pastes as normal.
Dim myPath As String, myExt As String Dim ws As Worksheet Dim openPDF As Object 'Dim pasteData As MSForms.DataObject Dim fCell As Range 'Set pasteData = New MSForms.DataObject Set ws = Sheets("DATA") If ws.Cells(ws.Rows.Count, "A").End(xlUp).Row > 1 Then Range("A3:A" & ws.Cells(ws.Rows.Count, "A").End(xlUp).Row).ClearContents myExt = "\*.pdf" 'When Scan Receipts Button Pressed Scan the selected folder/s for receipts For Each fCell In Range(ws.Cells(1, 1), ws.Cells(1, ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column)) myPath = Dir(fCell.Value & myExt) Do While myPath <> "" myPath = fCell.Value & "\" & myPath Set openPDF = CreateObject("Shell.Application") openPDF.Open (myPath) Application.Wait Now + TimeValue("00:00:2") SendKeys "^a" Application.Wait Now + TimeValue("00:00:2") SendKeys "^c" 'Application.Wait Now + TimeValue("00:00:2") ws.Select ActiveSheet.Paste 'pasteData.GetFromClipboard 'ws.Cells(3, 1) = pasteData.GetText Exit Sub myPath = Dir Loop Next fCell
1 1 1 silver badge
asked Mar 28, 2016 at 20:06
181 1 1 gold badge 1 1 silver badge 6 6 bronze badges
As someone who came across this and did a little more searching I found this to be a valid reference link as an alternate method.
Commented Nov 12, 2020 at 14:58You can open the PDF file and extract its contents using the Adobe library (which I believe you can download from Adobe as part of the SDK, but it comes with certain versions of Acrobat as well)
Make sure to add the Library to your references too (On my machine it is the Adobe Acrobat 10.0 Type Library, but not sure if that is the newest version)
Even with the Adobe library it is not trivial (you'll need to add your own error-trapping etc):
Function getTextFromPDF(ByVal strFilename As String) As String Dim objAVDoc As New AcroAVDoc Dim objPDDoc As New AcroPDDoc Dim objPage As AcroPDPage Dim objSelection As AcroPDTextSelect Dim objHighlight As AcroHiliteList Dim pageNum As Long Dim strText As String strText = "" If (objAvDoc.Open(strFilename, "") Then Set objPDDoc = objAVDoc.GetPDDoc For pageNum = 0 To objPDDoc.GetNumPages() - 1 Set objPage = objPDDoc.AcquirePage(pageNum) Set objHighlight = New AcroHiliteList objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page Set objSelection = objPage.CreatePageHilite(objHighlight) If Not objSelection Is Nothing Then For tCount = 0 To objSelection.GetNumText - 1 strText = strText & objSelection.GetText(tCount) Next tCount End If Next pageNum objAVDoc.Close 1 End If getTextFromPDF = strText End Function
What this does is essentially the same thing you are trying to do - only using Adobe's own library. It's going through the PDF one page at a time, highlighting all of the text on the page, then dropping it (one text element at a time) into a string.
Keep in mind what you get from this could be full of all kinds of non-printing characters (line feeds, newlines, etc) that could even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.