0

I have the following script, which runs a pipeline on each file in a directory to match a specific pattern and appends the matching output to a .csv file. The formatting is what I want; however, each match is printed twice, like this:

Match1
Match2
Match1
Match2

Piping uniq and sort into this script is not fixing the problem so I suspect my syntax is off. I have not been able to find a solution via Google or other answers thus far. Any help is appreciated, thanks!

#!/usr/bin/env bash
FILES=/Users/User1/Desktop/Folder/"*"
for f in $FILES
do
  echo "Processing $f file..."
  # take action on each file. $f store current file name

    sed -n /"New Filters"/,/"Modified Filters"/p "$f" |
      grep -v -e 'Bugtraq ID:' -e 'Common Vulnerabilities and Exposures:' -e 'Android' |
      grep -E '(^|[^0-9])[0-9]{5}($|[^0-9])' |
      sed 's/:/,/1' >> NewFile.csv

   echo "Complete. Check NewFile.csv"
 done;

Sample Input: Expected Result is to extract text in bold

Filters
New Filters
Modified Filters (logic changes)
Modified
Filters (metadata changes only)
Removed Filters

Filters
New Filters:
29722: HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1

Modified Filters (logic changes):
Text I don't want

Modified Filters (metadata changes only):
Text I don't want

Hello, and welcome to Stack Overflow. It would help a lot if you also posted some sample data, so we don't have to try to reverse-engineer them from your code. Without a way to quickly test what's happening, most potential answerers will not even bother trying to decipher it. Commented Jul 4, 2018 at 11:45

3 Answers

2

We can't tell what your problem is without sample input/output so this isn't an answer to that, but here's how to really do what you're trying to do with that script:

awk '
# Announce each file on stderr and reset the flag so a block cannot leak across files.
FNR==1 { inBlock=0; printf "Processing %s file...\n", FILENAME | "cat>&2" }
# No quote characters inside the regex: awk would match them literally.
/New Filters/ { inBlock=1 }
inBlock {
    if ( !/Bugtraq ID:|Common Vulnerabilities and Exposures:|Android/ &&
             /(^|[^0-9])[0-9]{5}($|[^0-9])/ ) {
        sub(/:/,",")
        print
    }
}
/Modified Filters/ { inBlock=0 }
' /Users/User1/Desktop/Folder/* > "NewFile.csv"
echo "Complete. Check NewFile.csv"

Note that there's no shell loop required. See "Why is using a shell loop to process text considered bad practice?".

Any time you find yourself using multiple commands (in particular multiple seds and/or greps) and pipes just to manipulate text, consider just using awk instead.


1

Are you running the script twice? It appends with >> NewFile.csv without truncating the file at the beginning, so if run twice the CSV file would end up with repeated output. You can add > NewFile.csv at the beginning to empty out the output file.

Or, perhaps you have duplicate input files.
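A minimal sketch of the truncation fix described above. The function name `extract_filters`, its arguments, and the simplified five-digit pattern are hypothetical stand-ins for the question's pipeline, not the asker's exact code:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract_filters OUTPUT_CSV INPUT_FILE...
extract_filters() {
  local out=$1
  shift
  # Truncate the output once, before the loop, so a rerun starts from empty.
  : > "$out"
  local f
  for f in "$@"; do
    # Stand-in for the question's sed/grep pipeline (assumption: wanted
    # lines start with a five-digit ID followed by a colon).
    sed -n '/New Filters/,/Modified Filters/p' "$f" |
      grep -E '^[0-9]{5}:' |
      sed 's/:/,/' >> "$out"
  done
}
```

Called as `extract_filters NewFile.csv /Users/User1/Desktop/Folder/*`, a second run should leave the same contents rather than a doubled file. If duplicates still appear after a single run, the repetition is coming from the input, not from the append.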

5 Comments

I'm running the script once on two files. I tested again by running on a single file and am still getting duplicates. Where exactly do you recommend adding > NewFile.csv to?
Why not simply show us the file so we can help you? Right now it's like you're asking a mechanic to diagnose a problem with your car but only letting him see half the car. See How to Ask if that's not clear and in particular pay attention to the part about creating a minimal reproducible example.
Put > NewFile.csv on its own line. It's a standalone command that will truncate the file.
Thank you all for the input. I've added sample input and what I am aiming to extract. While going through the file I found that "New Filters" and "Modified Filters" are each mentioned more than once. I believe I need the first sed command to grab the text between the 2nd match of "New Filters" and the 2nd match of "Modified Filters".
The way to format input, output, and code in questions and answers is by indenting it 4 spaces (the editor's {} button will do that for you), not by placing a > at the start of each line. Though the results look similar in the forum, the former gives us something we can simply copy/paste for testing, while the latter would require us to edit out the >s, which is undesirable.
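The idea from the comments above, taking only the second "New Filters" block, could be sketched roughly as follows. The function name `nth_filter_block` and its arguments are hypothetical, and the pattern assumes (from the sample input) that wanted lines begin with a five-digit ID and a colon:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print matches only from the Nth "New Filters" block.
# Usage: nth_filter_block N FILE
nth_filter_block() {
  awk -v want="$1" '
    # Enter the block only on the Nth "New Filters" line.
    /New Filters/      { inBlock = (++seen == want) }
    # Any "Modified Filters" line closes the block.
    /Modified Filters/ { inBlock = 0 }
    # Five digits are spelled out because some awk builds lack {5}.
    inBlock && /^[0-9][0-9][0-9][0-9][0-9]:/ { sub(/:/, ","); print }
  ' "$2"
}
```

With the sample input from the question, `nth_filter_block 2 file` skips the table-of-contents entry and reads only the block that holds the real filter lines.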
0

If you need to:

  • extract anything between
    • New Filters ... Modified Filters
  • but exclude
    • Bugtraq ID:
    • Common Vulnerabilities and Exposures:
    • Android
  • also match
    • 5 digits through the last digit on the same line
  • plus
    • replace the first : with ,

then you can try

perl -lne 'BEGIN{$/=undef} push @r,$& while /(?<=New Filters).*?(?=Modified Filters)/gs; @r2=grep(!/Bugtraq ID:|Common Vulnerabilities and Exposures:|Android/g,@r); /\d{5}[^\n]+\d/g && ($_=$&) && s/:/,/ && print for @r2' file  

for this sample input file

Modified Filters (logic changes)   
Modified  
Filters (metadata changes only)   
Removed Filters  

Filters     
New Filters:  
29722: HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1  

Modified Filters (logic changes):   
Text I don't want  

Modified Filters (metadata changes only):   
Text I don't want  


New Filters:  
Bugtraq ID:

Modified Filters (logic changes):   


New Filters:  
Common Vulnerabilities and Exposures:


Modified Filters (logic changes):   


New Filters:  
Android
Modified Filters (logic changes):   


New Filters:  

29723: HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1  
Modified Filters (logic changes):   


New Filters:  

29724: HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1  

Modified Filters (logic changes):   

output will be:

29722, HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1
29723, HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1
29724, HTTP: Dragonfly Backdoor.Goodor Go Implant CnC Beacon 1

1 Comment

@Nick if this is not what you need, tell me and I will delete the answer
