
I have a reference file "names.txt" with data as below:

Tom
Jerry
Mickey

Note: there are 20k lines in the file "names.txt".

There is another delimited file containing multiple lines for every key from the reference file "names.txt", with columns as below:

Name~~Id~~Marks~~Column4~~Column5

Note: there are about 30 columns in the delimited file.

The delimited file looks something like this:

Tom~~123~~50~~C4~~C5
Tom~~111~~45~~C4~~C5
Tom~~321~~33~~C4~~C5
.
.
Jerry~~222~~13~~C4~~C5
Jerry~~888~~98~~C4~~C5
.
.

I need to extract, for every key from the file "names.txt", the row from the delimited file that has the highest value in the "Marks" column. So there will be one row in the output file for every key from the file "names.txt".
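For example, given the sample rows above, the output would contain one line per key, e.g. for Tom and Jerry:

Tom~~123~~50~~C4~~C5
Jerry~~888~~98~~C4~~C5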

Below is the Unix shell code snippet that I am using. It works perfectly fine, but it takes around 2 hours to execute.

function getData
{
   name=$1
   # grep narrows down the candidate lines; awk keeps the matching row with the highest Marks
   grep "${name}" "${delimited_file}" |
   awk -F'~~' -v name1="${name}" '$1==name1 && $3>max {op=$0; max=$3} END {print op}' >> output.txt
}

# the function must be defined before the loop calls it
while read -r line; do
   getData "${line// /}"   # strip spaces from the name; no echo/backticks needed
done < names.txt

Is there any way to parallelize this and reduce the execution time? I can only use shell scripting.

1 Answer


Rule of thumb for optimizing bash scripts:
The size of the input shouldn't affect how often a program has to run.

Your script is slow because bash has to run the function 20k times, which involves starting grep and awk. Just starting programs takes a hefty amount of time. Therefore, try an approach where the number of program starts is constant.
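A quick way to see this overhead (an illustrative sketch; exact timings depend on your system) is to compare one program start per line with a single program start:

# one grep process per input line, as in the loop above
time while read -r line; do echo "$line" | grep -c . >/dev/null; done < names.txt
# a single grep process over the whole file
time grep -c . names.txt >/dev/null

The first command starts 20k grep processes and is dramatically slower, even though both do trivial work.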

Here is an approach:

  1. Process the second file, such that for every name only the line with the maximal mark remains.
    Can be done with sort and awk, or sort and uniq -f + Schwartzian transform.
  2. Then keep only those lines whose names appear in names.txt.
    Easy with grep -f

sort -t'~' -k1,1 -k5,5nr file2 |       # sort by name, then by Marks descending ($5, because each ~~ yields an empty field)
awk -F'~~' '$1!=last{print;last=$1}' | # keep the first (= highest-Marks) line of each name
grep -f <(sed 's/.*/^&~~/' names.txt)  # keep only the names listed in names.txt

The sed part turns the names into regexes that ensure only the first field is matched, assuming the names do not contain regex metacharacters like . and *.
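For the sample names above, the sed command produces these anchored patterns:

^Tom~~
^Jerry~~
^Mickey~~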

Depending on the relative sizes of the two files, it might be faster to swap those two steps. The result will be the same.
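The swapped variant would look like this (same assumptions as above; filter first, then reduce):

grep -f <(sed 's/.*/^&~~/' names.txt) file2 |
sort -t'~' -k1,1 -k5,5nr |
awk -F'~~' '$1!=last{print;last=$1}'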


5 Comments

Very nice! I would think grep -Fwf names.txt would be sufficient to match the name -- depends on the data we don't see, of course.
@SupratimDas Glad to hear that. How long did this command run on your files (instead of 2 hours)?
@Socowi around 2 minutes, which is just unbelievable! I did not want to modify the existing files, so I had to create a temp file for the names, since the names.txt file does not contain only the names.
@SupratimDas If your real file names.txt has a different format you should show it in the question. It might be easy to adapt the sed command to extract the names without the need to use a temp file.
@Bodo thanks for the tip. Yeah, ultimately I used a sed command to replace the temp file creation.
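(As an illustration of that adaptation, hypothetical since the real format of names.txt is not shown: if the name were the first whitespace-separated field of each line, the patterns could be built on the fly with something like grep -f <(sed 's/^\([^ ]*\).*/^\1~~/' names.txt) file2, avoiding the temp file.)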
