I have thousands of PDF documents that I am trying to comb through to pull out only certain data. I have successfully created a script that goes through each PDF, puts its content into a .txt file, and then searches that final .txt for the requested information. The only part I am stuck on is combining the data from every PDF into this one .txt file. Currently, each successive PDF simply overwrites the previous data, so the search is only performed on the final PDF in the folder. How can I alter this code so that each PDF's text is concatenated into the .txt instead of overwriting it?

$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}

Here is my entire script for reference:

Start-Process powershell.exe -Verb RunAs {

function convert-PDFtoText {
    param(
        [Parameter(Mandatory=$true)][string]$file
    )
    Add-Type -Path "C:\ps\itextsharp.dll"
    $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
    for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
        $text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
        Write-Output $text
    }
    $pdf.Close()
}


$content = Read-Host "What are we looking for?: "
$file1 = Read-Host "Path to search: "

$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}





$text | Out-File "C:\ps\bulk.txt"
Select-String -Path C:\ps\bulk.txt -Pattern $content | Out-File "C:\ps\select.txt"


Start-Sleep -Seconds 60

}

Any help would be greatly appreciated!

  • Have you already tried the -Append switch? $text | Out-File "C:\ps\bulk.txt" -Append Commented Nov 1, 2021 at 1:06

1 Answer

To capture the output of all convert-PDFtoText calls in a single output file, use a single pipeline with the ForEach-Object cmdlet:

Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
  ForEach-Object { convert-PDFtoText $_.FullName } |
    Out-File "C:\ps\bulk.txt"
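
For context on why the original loop kept only the last PDF: `$text` is reassigned on every iteration, so each call replaces the previous result. The sketch below illustrates that, plus one alternative that keeps the foreach statement by assigning the output of the whole loop. `Get-FakeText` and the file names are hypothetical stand-ins so the example runs without iTextSharp; they are not part of the original script.

```powershell
# Hypothetical stand-in for convert-PDFtoText (no iTextSharp needed).
function Get-FakeText {
    param([string] $Name)
    "text from $Name"
}

# Made-up file names, used only for illustration.
$names = 'a.pdf', 'b.pdf', 'c.pdf'

# Original pattern: $text is overwritten on each pass,
# so only the last file's text is left after the loop.
foreach ($n in $names) {
    $text = Get-FakeText $n
}

# Alternative: assign the output of the entire loop,
# which collects every iteration's output into an array.
$collected = foreach ($n in $names) {
    Get-FakeText $n
}

$text              # text from the last file only
$collected.Count   # one entry per file
```

Under these assumptions, `$collected | Out-File "C:\ps\bulk.txt"` would then write everything in one go, though the pipeline form above avoids holding all of the text in memory first.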

A tweak to your convert-PDFtoText function would allow for a more concise and efficient solution:

Make convert-PDFtoText accept Get-ChildItem input directly from the pipeline:

function convert-PDFtoText {
    param(
        [Alias('FullName')]
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string] $file
    )

    begin {
      Add-Type -Path "C:\ps\itextsharp.dll"
    }

    process {
      $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
      for ($page = 1; $page -le $pdf.NumberOfPages; $page++) {
        [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
      }
      $pdf.Close()
    }

}
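
To see the parameter binding in isolation, here is a minimal sketch that uses a hypothetical function, `Show-BoundPath`, and a made-up object with a `FullName` property in place of real Get-ChildItem output. The `FullName` alias combined with `ValueFromPipelineByPropertyName` is what lets the property value bind to `$file`.

```powershell
# Hypothetical demo function: same parameter declaration as above,
# but it only reports which value was bound.
function Show-BoundPath {
    param(
        [Alias('FullName')]
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string] $file
    )

    process {
        "bound: $file"
    }
}

# Made-up object that mimics the FullName property of a file object.
$obj = [pscustomobject]@{ FullName = 'C:\docs\example.pdf' }

$result = $obj | Show-BoundPath
$result
```

The convert-PDFtoText function above binds the same way to the FullName property of the file objects that Get-ChildItem emits.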

This then allows you to simplify the command at the top to:

Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
  convert-PDFtoText |
    Out-File "C:\ps\bulk.txt"
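
Since the end goal in the question is a search, one possible variation is to pipe the extracted text straight into Select-String and skip the intermediate bulk.txt file. The sketch below assumes that variation; `Get-FakePageText` is a hypothetical stand-in for convert-PDFtoText, and the input names and search term are made up so the example runs on its own.

```powershell
# Hypothetical stand-in for convert-PDFtoText: emits two lines per input.
function Get-FakePageText {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string] $Name
    )

    process {
        "invoice total in $Name"
        "unrelated line in $Name"
    }
}

# Stands in for the value the script reads with Read-Host.
$content = 'invoice'

# Search the text as it streams through the pipeline.
$hits = 'a.pdf', 'b.pdf' |
    Get-FakePageText |
    Select-String -Pattern $content

# Each hit is a MatchInfo object; .Line holds the matching text.
$hits.Line
```

Two caveats with this approach: Select-String treats -Pattern as a regular expression unless -SimpleMatch is given, and matches found in piped strings do not record which PDF they came from, so writing bulk.txt first may still be preferable if that file is needed later.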