
I am using a PowerShell script and Regex to turn giant (>1GB) fixed field length text files into importable tab delimited files. The code is very fast. I need to change some of the captured fields (let's say the 4th, 6th, and 7th fields) to 0 if they are empty after trimming. Is there a super-fast way to do this, say as part of the regex capture without slowing this process down much?

DATA

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 


PROGRAM

$proc_yyyymm = '201912'
$match_data_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

while ($line = $stream_in.ReadLine()) {

   if ($line -match $match_data_regex) {
      $new_line = "$proc_yyyymm`t" + ($Matches[1..($Matches.Count-1)].Trim() -join "`t")
      $stream_out.WriteLine($new_line)
   }
}
  • I was thinking of pre-defining an array of field positions (e.g. - @(3,5,6)) that need to be modified from empty to 0 to use at run time depending on the format of the file being processed. Commented Dec 11, 2019 at 19:19
  • Yes, you would add code to test the trimmed value at those indices and substitute another value as appropriate. You could also use the [Regex]::Replace() method to do the same thing. By the way, just by posting this question it goes without saying that it's addressed to "someone who does" "maybe ... know this one". Commented Dec 11, 2019 at 19:32
  • Thanks @BACON I understand the test and assign option. I'm not sure how to use replace on specified capture groups or fields in my code. Do you have an example of that for my case? Commented Dec 11, 2019 at 19:41
  • FYI - in my actual implementation, I also stream an error file and log file. Commented Dec 11, 2019 at 21:20
  • I have a solution there that you can test: if(! $value) { $value = 0 } stackoverflow.com/a/57647495/6654942 Commented Dec 12, 2019 at 15:54

5 Answers


Taking the work that @BACON and @TheMadTechnician provided, I thought I would evaluate the different methods of parsing the file and their relative performance.

Taking the example text:

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   

Copy/paste it to build a 10,000-line file. With that, I can evaluate the relative performance of our various implementations on a small scale.
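
For anyone reproducing the test file, here is a minimal sketch (assuming $header, $row1, and $row2 hold the three sample lines above with their exact padding, and that C:\Temp\perf already exists):

# Hypothetical variables: $header, $row1, $row2 are the three sample lines shown above
Set-Content -Path 'C:\Temp\perf\input.txt' -Value $header
# Array replication repeats the two data rows 5,000 times -> 10,000 data lines
Add-Content -Path 'C:\Temp\perf\input.txt' -Value (@($row1, $row2) * 5000)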


Regular Expressions

@BACON's first example, a straight-up regex plus empty-string check:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'

$delimiter = ','
$indicesToNormalizeToZero = ,2
while ($line = $stream_in.ReadLine()) {
    if ($line -match $match_regex) {
        $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
        foreach ($index in $indicesToNormalizeToZero)
        {
            if ($trimmedMatches[$index] -eq '')
            {
                $trimmedMatches[$index] = '0'
            }
        }

        $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
        $stream_out.WriteLine($new_line)
    }
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 335

Overall - pretty good performance. We'll use this as our baseline for comparison.

Next up, let's try the sweet Match Evaluator solution (kudos to @BACON for thinking this one up):

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")
$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$match_evaluator = {
    param($match)

    # The first element of Groups contains the entire matched text; skip it
    $fields = $match.Groups `
        | Select-Object -Skip 1 `
        | ForEach-Object -Process {
            $field = $_.Value.Trim()
            if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
            {
                $field = '0'
            }
            return $field
        }
    return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}

$delimiter = ','
# Replace with a HashSet/Hashtable for better lookup performance
$groupsToNormalizeToZero = ,'3'

while ($line = $stream_in.ReadLine()) {
    $new_line = $match_regex.Replace($line, $match_evaluator)

    # The original input string is returned if there was no match
    if (-not [Object]::ReferenceEquals($line, $new_line)) {
        $stream_out.WriteLine($new_line)
    }
}
$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 1921

Match groups, cool! Yay... ugh... but for pure performance, maybe not quite there. Most likely the match evaluator itself needs more optimization.
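
One possible optimization, sketched here without benchmarking: index $match.Groups directly with a plain for loop instead of piping through Select-Object/ForEach-Object inside the evaluator (the surrounding variables are the same as above):

$match_evaluator = {
    param($match)

    # Skip Groups[0] (the whole match) and avoid the pipeline overhead with a plain for loop
    $fields = for ($i = 1; $i -lt $match.Groups.Count; $i++) {
        $field = $match.Groups[$i].Value.Trim()
        if ($groupsToNormalizeToZero -contains $match.Groups[$i].Name -and $field -eq '') { '0' } else { $field }
    }
    return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}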

For completeness, let's try @TheMadTechnician's solution:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'

while ($line = $stream_in.ReadLine()) {
    #Pad right to account for truncated lines and get captured values
    $LineMatch=$match_regex.Matches($line.PadRight(160)).groups.value
    #Replace first item in array with line header (first item is entire input line, which we don't want)
    $LineMatch[0] = $proc_yyyymm
    #Replace null columns on 4, 6, and 7 with 0
    switch($true){
        {$LineMatch[3] -notmatch '\S'} {$LineMatch[3]='0'}
    }

    $stream_out.WriteLine(($LineMatch.trim() -join "`t"))

}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 545

Not quite as good as we were expecting. I suspect it's because -notmatch adds another regex evaluation for every field we check.
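
An untested variation that drops the per-field regex would be to trim the captured values up front and use a plain empty-string comparison on the index being normalized (index 3 here, matching the benchmark above):

# Trim all captured values once, then test with a simple string comparison
$LineMatch = $LineMatch.Trim()
if ($LineMatch[3] -eq '') { $LineMatch[3] = '0' }
$stream_out.WriteLine(($LineMatch -join "`t"))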


Pure PowerShell Way

Also, let's try a modified version of @js2010's method. Its main advantage is that it is a "pure" PowerShell approach, using only built-in cmdlets.

$Stopwatch = [system.diagnostics.stopwatch]::startNew()

$proc_yyyymm = '201912'
$delimiter = ','
$cols = 0,11,35,54,71,80 # fake extra column at the end, assumes all rows are that wide, padded with spaces
$colsfile = "C:\Temp\perf\input.txt"
$csvfile = "C:\Temp\perf\out.CSV"

$firstline = get-content $colsfile | select -first 1
$headers = for ($i = 0; $i -lt $cols.count - 1; $i++) {
  $firstline.substring($cols[$i], $cols[$i+1]-$cols[$i]).trim()
}

Get-Content $colsfile | select -skip 1 | ForEach {
  $hash = [ordered]@{}
  $hash += @{ "proc_yyymm" = $proc_yyyymm }
  for ($i = 0; $i -lt $headers.length; $i++) {
    $value = $_.substring($cols[$i], $cols[$i+1]-$cols[$i]).trim()

    if(! $value) { $value = 0 }

    $hash += @{$headers[$i] = $value}
  }
  [pscustomobject]$hash
} | export-csv $csvfile -NoTypeInformation -Delimiter $delimiter

$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 1043

This didn't perform badly; in fact, it's pretty good for a "native" solution. It has the benefit of being flexible (via hashtables), and @js2010 does something I love: by using Export-Csv it writes correctly quoted CSV (or tab-delimited) files. The problem with the other solutions above is that the field values aren't wrapped in quotes to escape embedded commas or quote characters in the text. Export-Csv handles that.
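
A quick illustration of that quoting behavior, using a hypothetical value that contains the delimiter:

[pscustomobject]@{ LAST_NAME = 'MOUSE, MINNIE' } | ConvertTo-Csv -NoTypeInformation
# "LAST_NAME"
# "MOUSE, MINNIE"    <- the embedded comma is safely wrapped in quotes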


Substring Executions

Since we know we are dealing with fixed length columns, we can use Substrings instead of Regular Expressions. Let's see if there is any performance improvement.

@BACON's next example with a basic substring execution:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

while ($line = $stream_in.ReadLine()) {
    $id =                $line.Substring( 0, 11).Trim()
    $firstName =         $line.Substring(11, 24).Trim()
    $lastName =          $line.Substring(35, 19).Trim()
    $columnNameTooLong = $line.Substring(54, 17).Trim()
    $fifthColumn =       $line.Substring(71,  9).Trim()

    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 177

Wow, it looks like substrings run twice as fast as our baseline regular expression. This definitely looks like the path to pursue.

Let's try @BACON's next example with an ExtractField function and multiple write statements:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id                = ExtractField $lineChars  0 11
    $firstName         = ExtractField $lineChars 11 24
    $lastName          = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn       = ExtractField $lineChars 71  9

    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $stream_out.Write($proc_yyyymm)
    $stream_out.Write($delimiter)
    $stream_out.Write($id)
    $stream_out.Write($delimiter)
    $stream_out.Write($firstName)
    $stream_out.Write($delimiter)
    $stream_out.Write($lastName)
    $stream_out.Write($delimiter)
    $stream_out.Write($columnNameTooLong)
    $stream_out.Write($delimiter)
    $stream_out.WriteLine($fifthColumn)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 5675

Wow. That runs 17 times slower. Is it the function calls or the multiple write-to-file calls that are causing the slowness? Let's see what happens if I concatenate the fields and write the line in one call:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id                = ExtractField $lineChars  0 11
    $firstName         = ExtractField $lineChars 11 24
    $lastName          = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn       = ExtractField $lineChars 71  9

    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 5654

The slowness is definitely caused by the additional function calls.
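
A rough micro-benchmark of that overhead (numbers will vary by machine; this only shows the relative cost of a no-op function call versus no call at all):

function NoOp { param($x) $x }
(Measure-Command { for ($i = 0; $i -lt 50000; $i++) { $null = NoOp $i } }).TotalMilliseconds
(Measure-Command { for ($i = 0; $i -lt 50000; $i++) { $null = $i } }).TotalMilliseconds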

EDIT:

Let's check @BACON's latest, extremely flexible and expandable method with no function calls. Most of the time I would be willing to sacrifice some performance for flexibility:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

# This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
# The Offset property could be omitted in favor of calculating it in the loop
# based on the Length, however this way A) avoids the extra variable/addition,
# B) allows fields to be ignored if desired, and C) allows fields to be output
# in a different order than the input.
$fieldDescriptors = @(
    @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $stream_out.Write($proc_yyyymm)

    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        # If space is the only possible whitespace character,
        # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
        $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }

        $stream_out.Write($delimiter)
        $stream_out.Write($fieldText)
    }

    $stream_out.WriteLine()
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 2645

Wow. Removing the explicit function call cut the time in half. It's still slower than the baseline, but is that due to creating multiple strings or doing multiple Write() operations?

Instead of multiple Write() operations, let's accumulate the line in a variable. It should be faster because it's all in memory, right?

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

$fieldDescriptors = @(
    @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $new_line = $proc_yyyymm

    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        $new_line += $delimiter
        $new_line += if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }
    }

    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 2605

Well, that was unexpected. I think in this case the HDD is buffering the write operations and masking what's going on. Larger data sets may change this, but it somewhat confirms the suspicion that the read/write operations have less impact than how we handle the many copies of strings in memory.


So we want to 1) avoid function calls (they seem to generate a lot more overhead), and 2) use substrings. Are there any other ways to speed this up?

I suspect that a lot of the overhead comes from assigning and reassigning memory. Basically, every time you manipulate a string, e.g. with .Substring() or .Trim(), you create another copy of the string in memory.
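
A tiny interactive check of that claim; Trim() on a padded string hands back a brand-new String object on every call:

$s = '  MOUSE  '
$t = $s.Trim()
[object]::ReferenceEquals($s, $t)   # False - Trim() allocated a new string
$u = $s.Trim()
[object]::ReferenceEquals($t, $u)   # False - and every call allocates another one

So what if we made things ugly and mashed everything into one string?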

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

while ($line = $stream_in.ReadLine()) {
    $lastName = $line.Substring(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $new_line = $proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71,  9).Trim() -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 146

Definitely an improvement. Let's try one more thing:


StringBuilder

One other class we can investigate is the StringBuilder class. Let's take the last example and use a StringBuilder:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$sb = [System.Text.StringBuilder]::New(80)

while ($sb.Append($stream_in.ReadLine()).Length) {
    $lastName = $sb.ToString(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }
    $new_line = $proc_yyyymm,$sb.ToString(0, 11).Trim(),$sb.ToString(11, 24).Trim(),$lastName,$sb.ToString(54, 17).Trim(),$sb.ToString(71, 9).Trim() -join $delimiter
    $stream_out.WriteLine($new_line)
    [void]$sb.clear()
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 161

Overall, no real change in performance. The reason is that in this situation we can't take advantage of the benefits StringBuilder normally provides. Because we need to Trim() the data, which StringBuilder can't do natively, we end up converting each field back into a string to perform the trim, and that negates any performance improvement.
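
The alternative hinted at in the comments would be to skip Trim() entirely and walk the characters yourself. Here is a sketch of that idea for the last-name field (untested, and assuming fields are left-aligned and right-padded with spaces as in the sample data):

$start = 35; $len = 19
$end = $start + $len - 1
# Scan backwards over the padding to find the last real character of the field
while ($end -ge $start -and $line[$end] -eq ' ') { $end-- }
$lastName = if ($end -lt $start) { '0' } else { $line.Substring($start, $end - $start + 1) }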


Super Optimize Performance

Now let's get uglier. There is still one more variable assignment we can optimize out. Let's try with the StringBuilder again:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

$sb = [System.Text.StringBuilder]::New(80)

while ($sb.Append($stream_in.ReadLine()).Length) {
    $lastName = $sb.ToString(35, 19).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }
    $stream_out.WriteLine($proc_yyyymm,$sb.ToString(0, 11).Trim(),$sb.ToString(11, 24).Trim(),$lastName,$sb.ToString(54, 17).Trim(),$sb.ToString(71, 9).Trim() -join $delimiter)
    [void]$sb.clear()
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 72

Wow! It seems that even the final variable assignment caused a lot of additional overhead, likely because it allocates yet more memory for the string.

And finally, let's do a pure String.Substring implementation, with the same optimization:

$Stopwatch = [system.diagnostics.stopwatch]::startNew()
$stream_in = [System.IO.StreamReader]::new("C:\Temp\perf\input.txt")
$stream_out = [System.IO.StreamWriter]::new("C:\Temp\perf\out.txt")

$proc_yyyymm = '201912'
$delimiter = ','

while ($line = $stream_in.ReadLine()) {
    $lastName =          $line.Substring(35, 19).Trim()

    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $stream_out.WriteLine($proc_yyyymm,$line.Substring( 0, 11).Trim(),$line.Substring(11, 24).Trim(),$lastName,$line.Substring(54, 17).Trim(),$line.Substring(71,  9).Trim() -join $delimiter)
}

$stream_out.Close()
$Stopwatch.Stop()
$Stopwatch

ElapsedMilliseconds: 57

This is the current best option: String.Substring with as little variable assignment as possible, and it runs almost 6 times faster than the baseline regex. Now, per @Mark's comments, if we could remove the Trim requirement we could try StringBuilder again... It would also be interesting to see whether reading the whole file into memory (e.g. StringBuilder the whole file) brings any improvement...
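
As a starting point for the "whole file in memory" idea, here is a hedged sketch (it assumes the file fits comfortably in RAM, and it has not been benchmarked):

$proc_yyyymm = '201912'
$delimiter = ','
$all_lines = [System.IO.File]::ReadAllLines('C:\Temp\perf\input.txt')
$out = [System.Text.StringBuilder]::new($all_lines.Count * 90)
foreach ($line in $all_lines) {
    $lastName = $line.Substring(35, 19).Trim()
    if ($lastName -eq '') { $lastName = '0' }
    [void]$out.AppendLine(($proc_yyyymm, $line.Substring(0, 11).Trim(), $line.Substring(11, 24).Trim(), $lastName, $line.Substring(54, 17).Trim(), $line.Substring(71, 9).Trim() -join $delimiter))
}
[System.IO.File]::WriteAllText('C:\Temp\perf\out.txt', $out.ToString())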


The Story of the StringBuilder Saga!

To answer one of @Mark's questions in the comments below: why does the while loop need a .Length in it? Isn't that useless?

while ($sb.Append($stream_in.ReadLine()).Length) 

Well, I thought so too.... until I got stuck in an infinite loop.... except I didn't realize it was an infinite loop. I just thought I had gotten something wrong and 10,000 lines of processing were spewing onto the console. That is why you have:

`[void]$sb.clear()`

What you quickly realize when you start using it is that the StringBuilder class is one noisy little bugger (especially at a PowerShell console). It outputs its whole life story any chance it gets. That means when you Clear() it:

PS C:\> $sb.Clear()   

Capacity MaxCapacity Length
-------- ----------- ------
      80  2147483647      0

Or Append data to it:

PS C:\Temp\perf> $sb.Append($stream_in.ReadLine())                                                                      

Capacity MaxCapacity Length
-------- ----------- ------
      80  2147483647     80

Like a drunk guy at a bar, -Every- -single- -time- StringBuilder wants you to know how big it is, how much bigger it could be, and who it is. And it says it to anyone who will listen. PowerShell is an enthusiastic listener, and will listen to anything anyone says, even if it is observing an execution and sees bloody murder happening in front of it (see: the $Error stream; also see: when my boss wanders over and sees a blood-red PowerShell console, and I say to just "ignore all of that stuff... um... it's not important"... then try to convince him to let me run my script against the Exchange server ;-).

In this case StringBuilder is full of itself, and every time it does something, it returns itself. PowerShell, being the receptive listener, will take the StringBuilder object and proclaim its presence to your loyal PowerShell Console host.

When you are processing a 10,000-line test document and have 2 StringBuilder operations:

while ($sb.Append($stream_in.ReadLine())) {
...
    $sb.Clear()
}

You start spewing 20,000 lines of messages to the console. For the experienced PowerShell coder, like pressing a magic hotkey, this triggers a quick Ctrl+C -> open Vivaldi -> Google -> PowerShell StringBuilder Suppress Output -> I'm Feeling Lucky -> Stack Overflow -> Ctrl+C the first answer: Aaand it looks like I'm going to start adding [void] to all of my StringBuilder operations to suppress output from now on:

while ([void]$sb.Append($stream_in.ReadLine())) {
...
    [void]$sb.Clear()
}

Excellent. Let's continue to run tests:

ElapsedMilliseconds: 1

Wow!!! 1 Millisecond! Woot! StringBuilder seems to be quantum level amazing! Let's check the output file:

PS C:\> Get-ChildItem .\out.txt                                                                                

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       2019-12-12   3:44 PM              0 out.txt

0Kb... Well nuts. On a quantum level, we forgot to put the cat into the box.

This is where careful observers will notice the fatal flaw in my Google Stack Overflow Copy Paste skills:

while ([void] ... ) {

Which, while mathematically equivalent to an O(0) operation, is also known as skipping the while loop, or, put another way, not doing anything. Not doing anything sort of doesn't solve the problem.

Well, ok, we'll get rid of the [void] and fall back on our tried-and-true tools for dealing with undesirable output. First, just redirect the output:

while ($sb.Append($stream_in.ReadLine()) >> SomeWhereElse.txt ) {

0Kb... Or um... Maybe pipe it to Out-Null?

while ($sb.Append($stream_in.ReadLine()) | Out-Null) {

0Kb... grr... I should have known they all functionally equate to:

while ( $null ) {

(Breathe in) Ok, let's think about this. If this is being run headless, I don't have to worry about a messy console. So let's just let it spew out 10,000 lines of stuff. I know writing to the console makes the execution take 10 times longer, but what I'm really curious about right now is the speed.

while ($sb.Append($stream_in.ReadLine())) {
...
    [void]$sb.Clear()
}

Aaand run.

It's running. The console is just spewing out messages about the StringBuilder and I'm chuckling about how clever I was to think about using the StringBuilder class. It's running... hee hee hee... so few people think about the differences between Immutable vs. Mutable strings... hee hee hu? ... It's still running.... even if it ran 10 times slower it should be done by now... ok Ctrl+C -> Up Arrow -> Enter. Let's run the same thing again, and expect a different outcome.

Aaaand... What do you know... the same.

This is when you look at your PowerShell console. You silently know that it has a Screen Buffer Size of 9,999, and it has completely wrapped... probably... 3 times. Far more than it should. And when there is a while loop involved, you know that you messed up on the condition statement, and you have expertly crafted an infinite loop.

Let's go back to the previous example that we based it on:

while ($line = $stream_in.ReadLine()) {
...

We are essentially saying: while $stream_in.ReadLine() returns data, loop. Naturally, we do the same with the StringBuilder, only this time we know that we have to append it:

while ($sb.Append($stream_in.ReadLine())) {
...

I'm doing the same thing... right?

After some repeated swearing and fruitless google-fu, I gave up and plugged in a length check:

while ($sb.Append($stream_in.ReadLine()).Length) {
...

And it worked as expected. But I wasn't happy that I had to do the check. It's not supposed to work that way. It was supposed to work the easy way. I had the night to think about it, and in the morning I finally understood what I had done wrong, and why it wasn't working like that.

Remember earlier when I said that StringBuilder is full of itself? Well, it's full of itself and not 0. The thousands of messages in the console were saying what was wrong:

Capacity MaxCapacity Length
-------- ----------- ------
      80  2147483647      0

Length = 0: the StringBuilder is empty. It was empty 21,000 lines ago.

The while loop's condition wasn't evaluating a String that either has data or is empty/$null; it was being fed a StringBuilder object. A StringBuilder object is not a 0 or a bool; it's a big fat object, which always coerces to $true. So the condition evaluated to $true every single time, even long after the StreamReader ran out of data.

Adding .Length made sure the condition handed an actual number to the while loop, ensuring it stops once there is no more data.
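
A two-line demonstration of that coercion:

$sb = [System.Text.StringBuilder]::new()
[bool]$sb          # True  - any non-null object coerces to $true, empty or not
[bool]$sb.Length   # False - 0 coerces to $false, so the loop can actually terminate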


15 Comments

Curiosity was getting the better of me to try this regex vs. non-regex comparison, but you beat me to it. +1 So was the intent with StringBuilder to avoid a new String object for each line? That's clever. If you used ReadBlock() like in my answer you could Append() the Char[], though trading a String for a Char[] is still a new Object, although doing a fixed-size read with no newline checking still might show gains. I was not sure if PowerShell function calls would be slow, and this shows that they are. The alternative I had in mind to avoid that would be an array of objects like [PSCustomObject] @{ Offset = 0; Length = 11; NormalizeIfEmpty = $false } that would be looped over to read each field of a line; that might be tidier code but can't be faster than effectively unrolling that loop with Substring() calls like you've done. The thought I had to eliminate Trim() would be to manually walk through the String/Char[], testing the Char at each index for [Char]::IsWhiteSpace()/-eq [Char] ' ' to find the start of the field. That might be a win if this were C#, but, like I said in another comment, I have my doubts for PowerShell.
Please note, the trim requirement can not be removed. :)
In the real data, the fields are closer to 100 per line and the data is closer to 500,000 to 9,000,000 rows give or take. So, these things will also change which solution is ultimately the fastest.
No problem @BACON I had to edit my answer to more fully answer exactly why I needed .Length when using a StringBuilder in a while loop. That answer morphed into "The Story of the StringBuilder Saga!" section. Enjoy! ;-)

After making some tweaks to your code for demonstration purposes...

  • Truncating the regular expression to match the sample data
  • Changing the output delimiter (now $delimiter) to a , so the results are easy to see
  • Using a StringReader and StringWriter for input and output, respectively

...given...

$text = @'
ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   
'@

...the way you proposed of adjusting the match text at specific indices would look like this...

$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'

$delimiter = ','
$indicesToNormalizeToZero = ,2

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

while ($line = $stream_in.ReadLine()) {
    if ($line -match $match_regex) {
        $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
        foreach ($index in $indicesToNormalizeToZero)
        {
            if ($trimmedMatches[$index] -eq '')
            {
                $trimmedMatches[$index] = '0'
            }
        }

        $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
        $stream_out.WriteLine($new_line)
    }
}

$stream_out.ToString()

An alternative would be to use the [Regex]::Replace() method. This is good for when you need to perform a custom transformation on a match that can't be expressed in a regex substitution. Admittedly, it might be a poor fit here because you're matching an entire line instead of individual fields, so within a match you need to know which field is which.

$proc_yyyymm = '201912'
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$match_evaluator = {
    param($match)

    # The first element of Groups contains the entire matched text; skip it
    $fields = $match.Groups `
        | Select-Object -Skip 1 `
        | ForEach-Object -Process {
            $field = $_.Value.Trim()
            if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
            {
                $field = '0'
            }

            return $field
        }

    return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}

$delimiter = ','
# Replace with a HashSet/Hashtable for better lookup performance
$groupsToNormalizeToZero = ,'3'

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

while ($line = $stream_in.ReadLine()) {
    $new_line = $match_regex.Replace($line, $match_evaluator)

    # The original input string is returned if there was no match
    if (-not [Object]::ReferenceEquals($line, $new_line)) {
        $stream_out.WriteLine($new_line)
    }
}

$stream_out.ToString()

$match_evaluator is a MatchEvaluator delegate that gets called for each successful match found in the input text to Replace() and returns whatever you want the replacement text to be. Inside I'm doing the same kind of index-specific transformation, comparing the group name (which will be its index as a [String]) to a known list ($groupsToNormalizeToZero); you could use named groups instead, although I found that changes the ordering of $match.Groups. There may be better applications of [Regex]::Replace() here that aren't occurring to me now.
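
For the HashSet suggestion in the code comment above, a minimal sketch (group names for numbered groups are their indices as strings):

$groupsToNormalizeToZero = [System.Collections.Generic.HashSet[string]]::new()
[void]$groupsToNormalizeToZero.Add('3')
# ...then inside the evaluator, the -contains test becomes a constant-time lookup:
# if ($groupsToNormalizeToZero.Contains($_.Name) -and $field -eq '') { $field = '0' }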

As an alternative to using regex, since their lengths are known you could extract the fields directly from $line using the Substring() method...

$proc_yyyymm = '201912'
$delimiter = ','

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

while ($line = $stream_in.ReadLine()) {
    $id =                $line.Substring( 0, 11).Trim()
    $firstName =         $line.Substring(11, 24).Trim()
    $lastName =          $line.Substring(35, 19).Trim()
    $columnNameTooLong = $line.Substring(54, 17).Trim()
    $fifthColumn =       $line.Substring(71,  9).Trim()

    if ($lastName -eq '')
    {
        $lastName = '0'
    }

    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}

$stream_out.ToString()

Better still, since the length of each line is known you can avoid ReadLine()'s newline checks and subsequent String allocation by reading each line as a block of Chars and extracting the fields from there.

function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}

$proc_yyyymm = '201912'
$delimiter = ','

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id                = ExtractField $lineChars  0 11
    $firstName         = ExtractField $lineChars 11 24
    $lastName          = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn       = ExtractField $lineChars 71  9

    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $stream_out.Write($proc_yyyymm)
    $stream_out.Write($delimiter)
    $stream_out.Write($id)
    $stream_out.Write($delimiter)
    $stream_out.Write($firstName)
    $stream_out.Write($delimiter)
    $stream_out.Write($lastName)
    $stream_out.Write($delimiter)
    $stream_out.Write($columnNameTooLong)
    $stream_out.Write($delimiter)
    $stream_out.WriteLine($fifthColumn)
}

$stream_out.ToString()

Since @HAL9256's answer confirms that PowerShell functions are (very!) slow, a way to do the same thing without redundant code and without functions would be to define a collection of field descriptors and loop over that to extract each field from the appropriate offset...

$proc_yyyymm = '201912'
$delimiter = ','

$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'

$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength

# This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
# The Offset property could be omitted in favor of calculating it in the loop
# based on the Length, however this way A) avoids the extra variable/addition,
# B) allows fields to be ignored if desired, and C) allows fields to be output
# in a different order than the input.
$fieldDescriptors = @(
    @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }

while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $stream_out.Write($proc_yyyymm)

    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        # If space is the only possible whitespace character,
        # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
        $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }

        $stream_out.Write($delimiter)
        $stream_out.Write($fieldText)
    }

    $stream_out.WriteLine()
}

$stream_out.ToString()

I am assuming that direct string extraction would be faster than regex, but I don't know that to be $true in general let alone as it pertains to PowerShell; only testing would reveal that.

All of the above solutions yield the following output...

201912,ID,FIRST_NAME,LAST_NAME,COLUMN_NM_TOO_LON,5THCOLUMN
201912,10000000001,MINNIE,MOUSE,COLUMN VALUE LONG,STARTS
201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS

10 Comments

FYI - My data sample was incorrect. The fields in my actual data either completely fill the field or are left padded, so it looks like I need to check the last character in the field string for whitespace to decide whether to set it to 0.
@Mark can you update the Question with the updated data? I have some ideas for performance that would benefit from not needing to Trim the data.
@HAL9256 Sorry all. I double-checked and the data is indeed right padded, the same as in the original sample. Let's continue to assume always fixed format always filled or padded with spaces.
@BACON What does the $stream_out.ToString() do at the end of your code?
That just displays the final output that was built in PowerShell, since I'm using StringReader/StringWriter instead of StreamReader/StreamWriter.

I'm not sure how performant switch is, but if you convert your regex match string to a [regex] object and run the results through a switch, this becomes pretty simple. The first group returned by a regex Matches() call is always the entire input, so we can replace the first item in the results with your line header, have a switch check for empty fields, and you'd be all set.

$proc_yyyymm = '201912'
[regex]$match_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

while ($line = $stream_in.ReadLine()) {
    #Pad right to account for truncated lines and get captured values
    $LineMatch=$match_regex.Matches($line.PadRight(160)).groups.value
    #Replace first item in array with line header (first item is entire input line, which we don't want)
    $LineMatch[0] = $proc_yyyymm
    #Replace null columns on 4, 6, and 7 with 0
    switch($true){
        {$LineMatch[4] -notmatch '\S'} {$LineMatch[4]='0'}
        {$LineMatch[6] -notmatch '\S'} {$LineMatch[6]='0'}
        {$LineMatch[7] -notmatch '\S'} {$LineMatch[7]='0'}
    }

    $stream_out.WriteLine(($LineMatch.trim() -join "`t"))

}

If you only want lines that match your regex, and want to ignore shorter lines you can get rid of the .PadRight(160) in that code.

If the switch is slowing things down you could perform a regex replacement as an alternative:

$LineMatch[4] = $LineMatch[4] -replace '^\s*$','0'

That would check if the string for column 4 is nothing but whitespace, and replace it with 0 if that is the case, or just leave it alone if it is not the case.
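
The same replacement can be applied to all three target columns in one small loop (using the indices from the answer above):

foreach ($i in 4, 6, 7) {
    $LineMatch[$i] = $LineMatch[$i] -replace '^\s*$', '0'
}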

2 Comments

I was trying to be brief with my code sample ... but regex is already [regex], the header is a throw away. I edited the above code accordingly, I'll study your answer more thoroughly and test. Thank you!!
It's fixed length fields for every row of the file so no need to pad.

I have a solution there that you can test: if(! $value) { $value = 0 }. There are other solutions on that page too. https://stackoverflow.com/a/57647495/6654942

2 Comments

For completeness, I added one of your methods, to see how a "pure" PowerShell solution would stack up.
@HAL9256 I'm happy to be the Pure PowerShell representative. :)

I tried this but I'm thinking it's going to be too slow ... still testing.

PROGRAM

$proc_yyyymm = '201912'
[regex]$match_data_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{12})'

# deal with header row
if ($has_header_row) {
   $line = $stream_in.ReadLine()
}

while ($line = $stream_in.ReadLine()) {

   if ($line -match $match_data_regex) {

      $Matched = $Matches[1..($Matches.Count-1)].Trim()

      Foreach ($fld in ($file_info.numeric_fields)) {

         if ($Matched[$fld] -eq '') {
            $Matched[$fld] = '0'
         }
      }

      $new_line = ("$proc_yyyymm", "$Matched") -join "`t"
      $stream_out.WriteLine($new_line)
   }
}

