
So far, my bash script takes two arguments: an input, which can be a file or a directory, and an output, which is the output file. If the input is a file, it finds all occurrences of each word and lists them in the output file, with the count on the left and the word on the right, sorted from greatest to least. Right now it also counts numbers as words, which it shouldn't do. How can I have it find only valid words and no numbers?

Also, in the last if statement, where the input is a directory, I am having trouble getting it to do the same thing it does for a file. It needs to find all files in that directory recursively (if there is another directory inside, it needs to find all files in that one too, and so on). Then it needs to count all occurrences of each word across all the files and store them in the output file, just as in the file case. I was thinking of storing the file names in an array, but I'm not sure that's the best way, and my syntax is off because it isn't working. How can I do this? Thanks!

    #!/bin/bash

    INPUT="$1"
    OUTPUT="$2"
    ARRAY=();

    # Check that there are two arguments
    if [ "$#" -ne 2 ]
    then
       echo "Usage: $0 {dir-name}";
       exit 1
    fi

    # Check that INPUT is different from OUTPUT
    if [ "$INPUT" = "$OUTPUT" ]
    then
       echo "$INPUT must be different from $OUTPUT";
    fi

    # Check if INPUT is a file...if so, find number of occurrences of each word
    # and store in OUTPUT file sorted in greatest to least
    if [ -f "$INPUT" ]
    then
       for name in $INPUT; do
          if [ -f "$name" ]
          then
             xargs grep -hoP '\b\w+\b' < "$name" | sort | uniq -c | sort -n -r > "$OUTPUT"
          fi
       done
    # If INPUT is a directory, find number of occurrences of each word
    # and store in OUTPUT file sorted in greatest to least
    elif [ -d "$INPUT" ]
    then
       find $name -type f > "${ARRAY[@]}"
       for name in "${ARRAY[@]}"; do
          if [ -f "$name" ]
          then
             xargs grep -hoP '\b\w+\b' < "$name" | sort | uniq -c | sort -n -r > "$OUTPUT"
          fi
       done
    fi
  • Can you show examples of your input file, expected input, and expected output? For example, it's not clear what for name in $INPUT is supposed to do, since $INPUT should be one argument. Commented Sep 14, 2014 at 19:00
  • Are you doing word frequency analysis? You might want to convert to lowercase first, accept - and a couple of other things. Just a thought. Do you use the regular alphabet, or are there special characters? Commented Sep 14, 2014 at 19:11
  • "How can I have it only find all occurrences of valid words and no numbers?" Use grep -hoP '\b[[:alpha:]]+\b' in place of grep -hoP '\b\w+\b'. Commented Sep 14, 2014 at 19:24
  • @BroSlow Input can be any kind of file or directory. Expected output: lines such as 17 word, i.e. a list of occurrence counts with the word next to each. name in $INPUT is each filename in the input. Commented Sep 14, 2014 at 19:26

1 Answer

I don't recommend specifying the output file as an argument, because you then have to do more validity checking for it, e.g.

  • the output shouldn't exist (if you don't want to allow overwriting)
  • if you do want to allow overwriting and the output exists, it must be a plain file
  • and so on
  • besides, it is better to be able to pass several input directories/files as arguments

Therefore it is better (and more bash-ish) to write the result to standard output, which you can redirect to a file at invocation, like

bash wordcounter.sh files or directories, more than one, to count words > to_some_file

e.g.

bash wordcounter.sh some_dir >result.txt
#or
bash wordcounter.sh file1.txt file2.txt .... fileN.txt > result2.txt
#or
bash wordcounter.sh dir1 file1 dir2 file2 >result2.txt

The whole wordcounter.sh could then be:

for arg
do
    find "$arg" -type f -print0
done |xargs -0 grep -hoP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr

where:

  • find searches for plain files under each of the arguments
  • the counting pipeline then runs on the generated file list
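To see what the counting pipeline produces, here is a minimal sketch on made-up input. The helper names count_words and count_words_ci are hypothetical, not part of the script above, and the -P option assumes GNU grep built with PCRE support. The second helper folds case first, following the suggestion in the comments; whether you want that is a design choice.

```shell
#!/bin/bash

# Hypothetical helper: count purely alphabetic words read from standard input.
# Tokens that mix letters and digits (e.g. abc123) are skipped entirely,
# because \b does not match between a letter and a digit.
count_words() {
    grep -hoP '\b[[:alpha:]]+\b' | sort | uniq -c | sort -nr
}

# Hypothetical variant: lowercase everything first so "Foo" and "foo"
# are counted as the same word.
count_words_ci() {
    tr '[:upper:]' '[:lower:]' | count_words
}

# Sample input (made up): "42" and "abc123" are not counted.
printf 'foo 42 bar\nfoo abc123\n' | count_words
```

With this sample input the pipeline reports foo with a count of 2 and bar with a count of 1; the exact column padding comes from uniq -c.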

The script still has some drawbacks, e.g. it will try to count words in image files and the like; maybe you will ask about that in the next question in this series ;)
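One possible way to reduce that drawback, sketched here under the assumption that GNU grep is in use: its -I option treats binary files as if they contained no matches, so most image files are skipped. The function name wordcounter_text is hypothetical, and grep's binary detection is a heuristic rather than a real file-type check.

```shell
#!/bin/bash

# Hypothetical variant of the script above. The added -I flag makes
# GNU grep treat binary files as non-matching, so they contribute no words.
wordcounter_text() {
    local arg
    for arg
    do
        find "$arg" -type f -print0
    done | xargs -0 grep -hoIP '\b[[:alpha:]]+\b' | sort | uniq -c | sort -nr
}
```

It would be called the same way as the original, for example wordcounter_text some_dir file1.txt > result.txt.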

EDIT

If you really want a two-argument script, e.g. script where_to_search output (which isn't very bash-like), put the above script into a function and do whatever you want, e.g.:

#!/bin/bash

wordcounter() {
    for arg
    do
        find "$arg" -type f -print0
    done |xargs -0 grep -hoP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr
}

where="$1"
output="$2"
#do here the necessary checks
#...
#and run the function
wordcounter "$where" > "$output"
#end of script
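The "necessary checks" placeholder could be filled in along these lines. This is only a sketch of the checks listed earlier in this answer (argument count, existing input, plain-file output); the function name check_args and the error messages are assumptions, and you may need stricter or looser rules.

```shell
#!/bin/bash

# Hypothetical validation helper: returns non-zero and prints a message
# on standard error when the two arguments are not usable.
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: $0 where_to_search output" >&2
        return 1
    fi
    local where="$1" output="$2"
    # The input must exist (file or directory).
    if [ ! -e "$where" ]; then
        echo "$where: no such file or directory" >&2
        return 1
    fi
    # Reading from and writing to the same path would clobber the input.
    if [ "$where" = "$output" ]; then
        echo "input and output must be different" >&2
        return 1
    fi
    # Overwriting is allowed here, but only if the target is a plain file.
    if [ -e "$output" ] && [ ! -f "$output" ]; then
        echo "$output exists and is not a plain file" >&2
        return 1
    fi
}
```

In the script above, a line such as check_args "$@" || exit 1 placed before the call to wordcounter would apply these checks.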

1 Comment

I believe I am required to specify the output file from what I understand.
