
I have a reference file "names.txt" with data as below:

Tom
Jerry
Mickey

Note: there are 20k lines in the file "names.txt".

There is another delimited file containing multiple lines for every key from the reference file "names.txt", with columns as below:

Name~~Id~~Marks~~Column4~~Column5

Note: there are about 30 columns in the delimited file.

The delimited file looks something like this:

Tom~~123~~50~~C4~~C5
Tom~~111~~45~~C4~~C5
Tom~~321~~33~~C4~~C5
.
.
Jerry~~222~~13~~C4~~C5
Jerry~~888~~98~~C4~~C5
.
.

I need to extract, for every key from the file "names.txt", the row from the delimited file that has the highest value in the "Marks" column. So there will be one row in the output file for every key from the file "names.txt".
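For example, given the sample rows above, the output would contain one line per key, e.g. for Tom and Jerry:

Tom~~123~~50~~C4~~C5
Jerry~~888~~98~~C4~~C5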

Below is the Unix shell code snippet that I am using. It works perfectly fine, but it takes around 2 hours to execute.

function getData
{
   name=$1
   # grep narrows down the candidate lines; awk keeps the matching row with the highest Marks
   grep "${name}" "${delimited_file}" |
   awk -F'~~' -v name1="${name}" '$1==name1 && $3>max {op=$0; max=$3} END {print op}' >> output.txt
}

# the function must be defined before the loop calls it
while read -r line; do
   getData "${line// /}"   # strip spaces from the name; no echo/backticks needed
done < names.txt

Is there any way to parallelize this and reduce the execution time? I can only use shell scripting.

1 Answer


Rule of thumb for optimizing bash scripts:
The size of the input shouldn't affect how often a program has to run.

Your script is slow because bash has to run the function 20k times, which involves starting grep and awk. Just starting programs takes a hefty amount of time. Therefore, try an approach where the number of program starts is constant.
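A quick way to see this overhead (an illustrative sketch; exact timings depend on your system) is to compare one program start per line with a single program start:

# one grep process per input line, as in the loop above
time while read -r line; do echo "$line" | grep -c . >/dev/null; done < names.txt
# a single grep process over the whole file
time grep -c . names.txt >/dev/null

The first command starts 20k grep processes and is dramatically slower, even though both do trivial work.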

Here is an approach:

  1. Process the second file, such that for every name only the line with the maximal mark remains.
    Can be done with sort and awk, or sort and uniq -f + Schwartzian transform.
  2. Then keep only those lines whose names appear in names.txt.
    Easy with grep -f

sort -t'~' -k1,1 -k5,5nr file2 |       # sort by name, then by Marks descending ($5, because each ~~ yields an empty field)
awk -F'~~' '$1!=last{print;last=$1}' | # keep the first (= highest-Marks) line of each name
grep -f <(sed 's/.*/^&~~/' names.txt)  # keep only the names listed in names.txt

The sed part turns the names into regexes that ensure only the first field is matched, assuming the names do not contain regex metacharacters like . and *.
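For the sample names above, the sed command produces these anchored patterns:

^Tom~~
^Jerry~~
^Mickey~~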

Depending on the relative sizes of the two files, it might be faster to swap those two steps. The result will be the same.
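The swapped variant would look like this (same assumptions as above; filter first, then reduce):

grep -f <(sed 's/.*/^&~~/' names.txt) file2 |
sort -t'~' -k1,1 -k5,5nr |
awk -F'~~' '$1!=last{print;last=$1}'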


5 Comments

Very nice! I would think grep -Fwf names.txt would be sufficient to match the name -- depends on the data we don't see, of course.
@SupratimDas Glad to hear that. How long did this command run on your files (instead of 2 hours)?
@Socowi around 2 minutes, which is just unbelievable! I did not want to modify the existing files, so I had to create a temp file for the names, since the names.txt file does not contain only the names.
@SupratimDas If your real file names.txt has a different format you should show it in the question. It might be easy to adapt the sed command to extract the names without the need to use a temp file.
@Bodo thanks for the tip. Yeah, ultimately I used a sed command to replace the temp file creation.
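(As an illustration of that adaptation, hypothetical since the real format of names.txt is not shown: if the name were the first whitespace-separated field of each line, the patterns could be built on the fly with something like grep -f <(sed 's/^\([^ ]*\).*/^\1~~/' names.txt) file2, avoiding the temp file.)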
