Bash script to store list of files in an array with number of occurrences of each word in all files

Question

So far, my bash script takes two arguments: an input, which can be a file or a directory, and an output, which is the output file. It finds all files recursively, and if the input is a file it counts all occurrences of each word in the files found and lists them in the output file with the count on the left and the word on the right, sorted from greatest to least. Right now it also counts numbers as words, which it shouldn't do. How can I have it count only valid words and no numbers?

Also, in the last branch of the if statement, where the input is a directory, I am having trouble getting it to do the same thing it does for a file. It needs to find all files in that directory, and if that directory contains another directory, find all files in it as well, and so on. Then it needs to count all occurrences of each word in all the files and store them in the output file, just as in the file case. I was thinking of storing the file names in an array, but I'm not sure if it's the best way, and my syntax is off because it's not working. How can I do this? Thanks!

    #!/bin/bash

    INPUT="$1"
    OUTPUT="$2"
    ARRAY=();

    # Check that there are two arguments
    if [ "$#" -ne 2 ]
    then
       echo "Usage: $0 {dir-name}";
       exit 1
    fi

    # Check that INPUT is different from OUTPUT
    if [ "$INPUT" = "$OUTPUT" ]
    then
       echo "$INPUT must be different from $OUTPUT";
    fi

    # Check if INPUT is a file...if so, find number of occurrences of each word
    # and store in OUTPUT file sorted in greatest to least
    if [ -f "$INPUT" ]
    then
       for name in $INPUT; do
          if [ -f "$name" ]
          then
             xargs grep -hoP '\b\w+\b' < "$name" | sort | uniq -c | sort -n -r > "$OUTPUT"
          fi
       done
    # If INPUT is a directory, find number of occurrences of each word
    # and store in OUTPUT file sorted in greatest to least
    elif [ -d "$INPUT" ]
    then
       find $name -type f > "${ARRAY[@]}"
       for name in "${ARRAY[@]}"; do
          if [ -f "$name" ]
          then
             xargs grep -hoP '\b\w+\b' < "$name" | sort | uniq -c | sort -n -r > "$OUTPUT"
          fi
       done
    fi

Can you show examples of your input file, expected input, and expected output. e.g. not clear what for name in $INPUT is supposed to do ... since $INPUT should be one argument?
– Reinstate Monica Please, Commented Sep 14, 2014 at 19:00
Are you doing word frequency analysis? You might want to convert to lowercase first, accept - and a couple of other things. Just a thought. Do you use the regular alphabet, or are there special characters?
– Karoly Horvath, Commented Sep 14, 2014 at 19:11
"how can I have it only find all occurrences of valid words and no numbers?" Use grep -hoP '\b[[:alpha:]]+\b' in place of grep -hoP '\b\w+\b'
– John1024, Commented Sep 14, 2014 at 19:24
@BroSlow Input can be any kind of file or directory. Expected output: 17 word. List of number of occurrences with the word next to it. name in $INPUT is each filename in the input.
– Harley Jones, Commented Sep 14, 2014 at 19:26

Community · Accepted Answer · 2017-05-23 12:05:49Z

I don't recommend specifying the output file as an argument, because you then have to do more validity checking on it, e.g.:

the output shouldn't exist (if you don't want to allow overwriting)
if you do want to allow overwriting and the output exists, it must be a plain file
and so on.
It is also better to be able to pass more than one input directory or file as arguments.

Therefore it is better (and more bash-ish) to write the output to standard output, which you can redirect to a file at invocation, like

bash wordcounter.sh files or directories more than one to count words > to_some_file

e.g.

bash wordcounter.sh some_dir >result.txt
#or
bash wordcounter.sh file1.txt file2.txt .... fileN.txt > result2.txt
#or
bash wordcounter.sh dir1 file1 dir2 file2 >result2.txt

The whole wordcounter.sh could be the following:

for arg
do
    find "$arg" -type f -print0
done |xargs -0 grep -hoP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr

where:

the find searches for plain files under each of the arguments
and xargs runs the counting pipeline on the generated file list
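
To see what the pipeline produces, here is a minimal sketch that runs it on a throwaway directory. The directory layout and file contents are made up for illustration, and it assumes GNU grep (for -P), as the script itself does.

```shell
#!/bin/bash
# Hypothetical sample data in a temporary directory (not from the question).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf 'apple banana apple 42\n' > "$demo_dir/a.txt"
printf 'banana apple 7 cherry\n' > "$demo_dir/sub/b.txt"

# Same pipeline as in the script above, applied to the sample directory:
# find lists the files recursively, grep prints one alphabetic word per line,
# and sort | uniq -c | sort -nr turns that into a frequency table.
counts=$(find "$demo_dir" -type f -print0 |
    xargs -0 grep -hoP '\b[[:alpha:]]+\b' | sort | uniq -c | sort -nr)
echo "$counts"

# Clean up the sample data.
rm -rf "$demo_dir"
```

This should list apple with a count of 3, banana with 2 and cherry with 1. The numbers 42 and 7 do not appear, because [[:alpha:]] matches letters only.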

The script still has some drawbacks, e.g. it will also try to count words in image files and the like. Maybe you will ask about that in the next question in this series ;)
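
One possible way around the binary-file drawback, as a sketch rather than a complete fix: GNU grep's -I option treats binary files as if they contained no matches. The tr step folds case, as suggested in the comments on the question. The function name count_text_words and the sample files are hypothetical.

```shell
#!/bin/bash
# Hypothetical variant of the pipeline (assumes GNU grep):
#   -I  makes grep treat binary files as non-matching
#   tr  folds upper case to lower case so "Word" and "word" count together
count_text_words() {
    for arg
    do
        find "$arg" -type f -print0
    done | xargs -0 grep -hoIP '\b[[:alpha:]]+\b' |
        tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr
}

# Made-up sample data: one text file and one file containing a NUL byte,
# which grep classifies as binary.
demo_dir=$(mktemp -d)
printf 'Word word WORD other\n' > "$demo_dir/text.txt"
printf 'ignored\0ignored\n' > "$demo_dir/blob.bin"

text_counts=$(count_text_words "$demo_dir")
echo "$text_counts"

rm -rf "$demo_dir"
```

Here word should be counted 3 times and other once, while nothing from blob.bin is counted. Note that grep's idea of "binary" is a heuristic (NUL bytes or invalid encoding), so this is not a reliable image-file filter.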

EDIT

If you really want a two-argument script, e.g. script where_to_search output (which isn't very bash-like), put the above script into a function and do whatever you want, e.g.:

#!/bin/bash

wordcounter() {
    for arg
    do
        find "$arg" -type f -print0
    done |xargs -0 grep -hoP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr
}

where="$1"
output="$2"
#do here the necessary checks
#...
#and run the function
wordcounter "$where" > "$output"
#end of script
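
The "necessary checks" are left open above. A minimal sketch of what they might look like follows, based on the checks in the question and the points listed at the top of this answer. The helper name check_args and the exact policy (overwriting a plain file is allowed, anything else is refused) are assumptions, not requirements.

```shell
#!/bin/bash
# check_args is a hypothetical helper; adjust the policy to your needs.
check_args() {
    # Exactly two arguments: where to search and where to write.
    if [ "$#" -ne 2 ]; then
        echo "Usage: wordcounter.sh where_to_search output" >&2
        return 1
    fi
    local where="$1" output="$2"

    # The input must exist (file or directory).
    if [ ! -e "$where" ]; then
        echo "$where does not exist" >&2
        return 1
    fi

    # The input and the output must not be the same file.
    if [ "$where" = "$output" ] || [ "$where" -ef "$output" ]; then
        echo "input and output must be different" >&2
        return 1
    fi

    # If the output already exists, it must be a plain file.
    if [ -e "$output" ] && [ ! -f "$output" ]; then
        echo "$output exists and is not a plain file" >&2
        return 1
    fi
}
```

In the script above, this could be called as check_args "$@" || exit 1 just before the wordcounter line.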

I believe I am required to specify the output file from what I understand.
