I have thousands of PDF documents that I am trying to comb through to pull out only certain data. I have written a script that goes through each PDF, puts its content into a .txt file, and then searches the final .txt for the requested information. The only part I am stuck on is combining the data from every PDF into this .txt file. Currently, each successive PDF simply overwrites the previous data, so the search is only performed on the final PDF in the folder. How can I alter this code so that each PDF's text is appended to the .txt instead of overwriting it?
$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}
Here is my entire script for reference:
Start-Process powershell.exe -Verb RunAs {
    function convert-PDFtoText {
        param(
            [Parameter(Mandatory=$true)][string]$file
        )
        Add-Type -Path "C:\ps\itextsharp.dll"
        $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
        for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
            $text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
            Write-Output $text
        }
        $pdf.Close()
    }
    $content = Read-Host "What are we looking for?: "
    $file1 = Read-Host "Path to search: "
    $all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
    foreach ($f in $all){
        $outfile = $f -join ', '
        $text = convert-PDFtoText $outfile
    }
    $text | Out-File "C:\ps\bulk.txt"
    Select-String -Path C:\ps\bulk.txt -Pattern $content | Out-File "C:\ps\select.txt"
    Start-Sleep -Seconds 60
}
Any help would be greatly appreciated!
Have you tried the -Append switch?
$text | Out-File "C:\ps\bulk.txt" -Append
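One detail worth spelling out: in the script above, $text is reassigned on every pass through the loop and Out-File runs only once, after the loop ends, so adding -Append to that line alone would still write only the last PDF. The write has to move inside the loop, or the loop output has to be collected as a whole. Below is a minimal sketch of both options. To keep it runnable without iTextSharp, it makes some assumptions that are not in the original: two small text files in a temporary folder stand in for the PDFs, Get-Content stands in for convert-PDFtoText, and the names $bulkAppend and $bulkOnce are hypothetical.

```powershell
# Demo setup (hypothetical): two text files stand in for the PDFs,
# and Get-Content stands in for convert-PDFtoText from the question.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("append-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'a.txt') -Value 'alpha invoice 17'
Set-Content -Path (Join-Path $dir 'b.txt') -Value 'beta invoice 42'

# Output files use a different extension so the *.txt filter never picks them up.
$bulkAppend = Join-Path $dir 'bulk-append.out'
$bulkOnce   = Join-Path $dir 'bulk-once.out'

$all = Get-ChildItem -Path $dir -Recurse -Filter *.txt

# Option 1: write inside the loop with -Append, so each file adds to the
# output instead of replacing what the previous iteration produced.
foreach ($f in $all) {
    # In the real script this line would be: $text = convert-PDFtoText $f.FullName
    $text = Get-Content -Path $f.FullName
    $text | Out-File -FilePath $bulkAppend -Append
}

# Option 2: capture everything the loop emits, then write the file once.
$allText = foreach ($f in $all) {
    Get-Content -Path $f.FullName
}
$allText | Out-File -FilePath $bulkOnce

# The search then runs over the combined text, as in the original script.
$hits = Select-String -Path $bulkOnce -Pattern 'invoice'
$hits | ForEach-Object { $_.Line }
```

With Option 1, the output file keeps growing across runs, so deleting it (for example with Remove-Item) before the loop avoids searching stale text from an earlier run. Option 2 avoids that because a plain Out-File replaces the file each time. Passing $f.FullName rather than $f makes the full path explicit when the file object is handed to a function that expects a string.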