Pass in argument of directory for find command to ignore in script

Asked 1 year, 3 months ago

Modified 1 year, 3 months ago

Viewed 90 times

When I call the script from the command line, I'd like to pass in an optional list of directories for the find command to ignore. This is what I have, but my output includes data from files in the directory that is supposed to be ignored. The script works as expected when no directory to ignore is passed in.

#!/usr/bin/env bash

# Ensure the script is called with the correct number of arguments
if [ "$#" -lt 2 ]; then
    echo "Usage: $0 <directory> <output_filename> [directories_to_ignore...]"
    exit 1
fi

# Convert the directory to an absolute path
directory=$(realpath "$1")
output_file=$2
shift 2  # Shift the arguments so we can access any ignored directories

# Collect directories to ignore (if provided)
ignore_dirs=()
for dir in "$@"; do
    ignore_dirs+=("-path" "$dir" -prune -o)
done

temp_file=$(mktemp)

# Recursively find all parquet files and save their absolute paths to a temp file
find "$directory" "${ignore_dirs[@]}" -type f -exec file "{}" \; | ug -i -e 'apache parquet' | awk -F: '{print $1}' >> "$temp_file"

# Declare an associative array to store unique MD5s
declare -A md5_dict

# Iterate over each file in the temp file
while IFS= read -r parquet_file; do
    # Extract just the filename from the absolute path
    filename=$(basename "$parquet_file")

    # Search the file for the regex pattern and loop through each match
    matches=$(pqrs cat "$parquet_file" | sed '/^#\{40,\}/d' | rg -o "md5=\w{32}" | sed 's/md5=//' | sort -u)
    
    # Write each unique match on a new line in the output file with the filename prefixed
    while IFS= read -r match; do
        # The "md5=" prefix was already stripped by sed above; skip the
        # empty line that the here-string yields when nothing matched
        [[ -n "$match" ]] || continue
        hash_value=$match

        # Check if the MD5 is already in the dictionary
        if [[ -z "${md5_dict[$hash_value]}" ]]; then
            # If not in the dictionary, add it and write to the output file
            md5_dict["$hash_value"]="$filename"
            echo "$filename,$match" >> "$output_file"
        fi
    done <<< "$matches"
done < "$temp_file"

# Clean up the temporary file
rm "$temp_file"

echo "Search complete. Results saved to $output_file."

Edited to upload correct version of the script.

UPDATE: Here is the directory structure I'm working with

.
├── data
│   ├── outputFiles
│   └── parquetFiles
│       ├── 2023-08-06-07-00_1691276460134_127.0.0.1
│       ├── 2023-08-06-07-01_1691276522106_127.0.0.1
│       ├── ignoreThisDir
│       │   ├── 2023-08-06-07-02_1691276582057_127.0.0.1
│       │   └── testFileNottoSearch.txt
│       └── testFiletoSearch.txt
├── db
│   └── docker-compose.yaml
├── dehasher.sh
├── find_pg_files.sh
├── source
└── test.sh

When I run the script, I expect it to scrape values out of the two files in data/parquetFiles and ignore the file in data/parquetFiles/ignoreThisDir. Instead, when I run the script and check the output file with ripgrep (rg "2023-08-06-07-02_1691276582057" ouput), there are over 200 matches.
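One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: find's -path test compares its pattern against the whole path as find builds it from the starting point. The script resolves the search root with realpath, so an ignore argument given as ignoreThisDir or data/parquetFiles/ignoreThisDir would not match any of the absolute paths find visits. The sketch below resolves each ignore argument the same way and groups the tests in parentheses so that no -o is left dangling. The function name build_prune_expr and the array name prune_expr are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fills the global array prune_expr with find
# arguments that prune every directory passed to the function.
build_prune_expr() {
    prune_expr=()
    local dir abs
    for dir in "$@"; do
        # Resolve to an absolute path so the pattern matches the paths
        # find visits when the search root is itself absolute.
        abs=$(realpath "$dir") || continue
        if (( ${#prune_expr[@]} > 0 )); then
            prune_expr+=(-o)
        fi
        prune_expr+=(-path "$abs")
    done
    if (( ${#prune_expr[@]} > 0 )); then
        # Result: ( -path A -o -path B ) -prune -o
        prune_expr=( '(' "${prune_expr[@]}" ')' -prune -o )
    fi
}

# Possible use inside the script, after "shift 2":
#   build_prune_expr "$@"
#   find "$directory" "${prune_expr[@]}" -type f -exec file "{}" \;
```

When no directories are passed, prune_expr stays empty and the find command is unchanged. Note that -path treats its argument as a glob pattern, so directory names containing characters such as * or [ would need escaping.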

edited Sep 7, 2024 at 0:52

asked Sep 6, 2024 at 16:52

Kaverni


Check out this fine answer over at Stack Overflow. You mention directory in the question, but the script speaks of directories; you may want to clarify that. With a single one you're fine w/ $3 and -path "$3" -prune, if you want to exclude several you'll need to be smarter about this.

– tink, Sep 6, 2024 at 17:54

Thanks for the comments. I uploaded the wrong script. I'll edit with the correct one. It's very similar; there are just some minor changes. The one I originally posted was in the middle of being modified.

– Kaverni, Sep 6, 2024 at 20:36

If I recall correctly, when you use a "list" of ORed-together options, you need to isolate that logic like \( -path dir1 -prune -o -path dir2 -prune -o ... \), which I think you can hard-code into your cmd like find "$directory" \( "${ignore_dirs[@]}" \) -type f ....

– shellter, Sep 6, 2024 at 23:22

@shellter When I do that I get this error: find: -o: no expression after -o. If I use this: find "$directory" \( "${ignore_dirs[@]}" -print \) -type f -exec file "{}" \; | ug -i -e 'apache parquet' | awk -F: '{print $1}' >> "$temp_file" there are no errors, but it still includes the directory that I'm trying to ignore.

– Kaverni, Sep 7, 2024 at 0:47

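When a find expression assembled from an array misbehaves, it can help to print the exact argument list before running it. The sketch below does that for the grouped form suggested in the comments; show_cmd is a hypothetical helper name, and /abs/root and ignoreThisDir are placeholder values.

```shell
#!/usr/bin/env bash
# Hypothetical debugging helper: print a command line, shell-quoted,
# one word per argument, as it would be passed to the program.
show_cmd() {
    printf '%q ' "$@"
    printf '\n'
}

# Build the array the same way the script in the question does.
ignore_dirs=()
for dir in ignoreThisDir; do
    ignore_dirs+=("-path" "$dir" -prune -o)
done

# Show what find would receive with the array wrapped in a group.
show_cmd find /abs/root \( "${ignore_dirs[@]}" \) -type f
```

Because every array entry ends in -o, the printed command has an -o directly before the closing parenthesis, which is consistent with the "no expression after -o" error. The printout also makes it easy to compare the -path pattern with the absolute paths find actually visits.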
".... there are over 200 matches." Yes, >> "$temp_file" will keep appending each run to your $temp_file.

– shellter, Sep 7, 2024 at 0:54
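A related point, assuming stale results are part of the problem: mktemp returns a fresh file on every run, so the temp file itself should not accumulate, but the script also appends to $output_file with >>, and that file does persist between runs. A minimal sketch of emptying it before the main loop follows; the function name reset_output is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start each run with an empty output file so that
# results from earlier runs cannot appear in the new output.
reset_output() {
    local output_file=$1
    # ":" is the shell no-op; redirecting it truncates or creates the file.
    : > "$output_file"
}

# Possible use in the script, before the while loop:
#   reset_output "$output_file"
```

After this, matches for the ignored directory in the output file can only come from the current run, which makes it easier to tell whether the prune expression is working.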
