Concatenating Output from Folder

Question

I have thousands of PDF documents that I am trying to comb through and pull out only certain data. I have successfully created a script that goes through each PDF, puts its content into a .txt, and then searches the final .txt for the requested information. The only part I am stuck on is combining the data from every PDF into this .txt file. Currently, each successive PDF simply overwrites the previous data, so the search is only performed on the final PDF in the folder. How can I alter this code so that each PDF's text is concatenated into the .txt instead of overwriting it?

$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}

Here is my entire script for reference:

Start-Process powershell.exe -Verb RunAs {

function convert-PDFtoText {
    param(
        [Parameter(Mandatory=$true)][string]$file
    )
    Add-Type -Path "C:\ps\itextsharp.dll"
    $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
    for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
        $text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
        Write-Output $text
    }
    $pdf.Close()
}


$content = Read-Host "What are we looking for?: "
$file1 = Read-Host "Path to search: "

$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}





$text | Out-File "C:\ps\bulk.txt"
Select-String -Path C:\ps\bulk.txt -Pattern $content | Out-File "C:\ps\select.txt"


Start-Sleep -Seconds 60

}

Any help would be greatly appreciated!

Have you already tried the -Append switch? $text | Out-File "C:\ps\bulk.txt" -Append
– Abraham Zinala, Commented Nov 1, 2021 at 1:06
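
A minimal sketch of the -Append approach from this comment. To run without iTextSharp it uses a hypothetical stand-in function (Get-StubText) in place of convert-PDFtoText and a temp file in place of C:\ps\bulk.txt; both are assumptions for illustration. Because -Append adds to whatever the file already holds, the sketch deletes the file before the loop starts.

```powershell
# Hypothetical stand-in for convert-PDFtoText: emits one line per "page".
function Get-StubText {
    param([Parameter(Mandatory)][string] $Name)
    "$Name page 1"
    "$Name page 2"
}

# Temp file used instead of C:\ps\bulk.txt (an assumption for this sketch).
$bulk = Join-Path ([System.IO.Path]::GetTempPath()) 'bulk-append-demo.txt'

# Start from a clean slate; otherwise -Append keeps text from earlier runs.
if (Test-Path -LiteralPath $bulk) { Remove-Item -LiteralPath $bulk }

foreach ($name in 'a.pdf', 'b.pdf', 'c.pdf') {
    # Write each file's text as soon as it is produced, adding it to the end.
    Get-StubText -Name $name | Out-File -FilePath $bulk -Append
}

# Read the result back to confirm that nothing was overwritten.
$appendedLines = Get-Content -LiteralPath $bulk
```

Note that for this to work in the original script, the Out-File call has to move inside the foreach loop; appending once after the loop would still write only the last PDF's text.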

mklement0 · Accepted Answer · 2021-11-01 01:22:41Z

To capture the output of all convert-PDFtoText calls in a single output file, use a single pipeline with the ForEach-Object cmdlet:

Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
  ForEach-Object { convert-PDFtoText $_.FullName } |
    Out-File "C:\ps\bulk.txt"
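
For context on why only the last PDF survived in the question's loop: assigning to $text inside the loop body replaces the variable on every iteration. The sketch below contrasts that with collecting the output of the whole loop, which is what the pipeline does implicitly. It uses a hypothetical stand-in function (Get-StubText) and made-up file names so that it runs without iTextSharp.

```powershell
# Hypothetical stand-in for convert-PDFtoText: emits one line per "page".
function Get-StubText {
    param([Parameter(Mandatory)][string] $Name)
    "$Name page 1"
    "$Name page 2"
}

$names = 'a.pdf', 'b.pdf', 'c.pdf'

# Pattern from the question: each assignment replaces the previous value,
# so only the output for the last name is left after the loop.
foreach ($n in $names) { $last = Get-StubText -Name $n }

# Assigning the foreach statement itself collects the output of every iteration.
$collected = foreach ($n in $names) { Get-StubText -Name $n }

# Pipeline form, as in the answer: all output flows to whatever comes next.
$piped = $names | ForEach-Object { Get-StubText -Name $_ }
```

Here $last holds only the two lines for c.pdf, while $collected and $piped each hold all six lines.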

A tweak to your convert-PDFtoText function would allow for a more concise and efficient solution:

Make convert-PDFtoText accept Get-ChildItem input directly from the pipeline:

function convert-PDFtoText {
    param(
        [Alias('FullName')]
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)] 
        [string] $file
    )

    begin {
      Add-Type -Path "C:\ps\itextsharp.dll"
    }

    process {
      $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
      for ($page = 1; $page -le $pdf.NumberOfPages; $page++) {
        [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
      }
      $pdf.Close()
    }

}

This then allows you to simplify the command at the top to:

Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
  convert-PDFtoText |
    Out-File "C:\ps\bulk.txt"
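
A minimal sketch of how that pipeline binding behaves, under stated assumptions: Convert-StubToText is a hypothetical stand-in that declares its parameter the same way as the function above but reads plain text with Get-Content instead of parsing PDFs, and the two *.pdf files it creates in a temp folder are ordinary text files. As an optional variation that is not part of the answer, the output is piped straight to Select-String rather than written to bulk.txt first.

```powershell
# Hypothetical stand-in with the same parameter declaration as the answer's
# function; it reads plain text instead of parsing a PDF with iTextSharp.
function Convert-StubToText {
    param(
        [Alias('FullName')]
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string] $file
    )
    process {
        # Runs once per pipeline input object.
        Get-Content -LiteralPath $file
    }
}

# Throwaway folder with two text files named *.pdf (not real PDFs).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -LiteralPath (Join-Path $dir 'one.pdf') -Value 'invoice 1001', 'total 5'
Set-Content -LiteralPath (Join-Path $dir 'two.pdf') -Value 'invoice 1002', 'total 7'

# Each file object's FullName property binds to -file through the alias.
# -SimpleMatch treats the pattern as literal text rather than a regex.
$hits = Get-ChildItem -Path $dir -Recurse -Filter *.pdf |
    Convert-StubToText |
    Select-String -Pattern 'invoice' -SimpleMatch

# Keep just the matching text, then clean up the temp folder.
$hitLines = $hits | ForEach-Object { $_.Line }
Remove-Item -LiteralPath $dir -Recurse
```

When the input arrives this way, Select-String has no file name to report, so the matches do not say which PDF they came from. Keeping the intermediate bulk.txt, as in the original script, has the same limitation.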
