When I call the script from the command line, I'd like to pass in an optional list of directories for the find command to ignore. This is what I have, but my output includes data from files in the directory that is supposed to be ignored. The script works as expected when no directory to ignore is passed in.
#!/usr/bin/env bash
# Ensure the script is called with the correct number of arguments
if [ "$#" -lt 2 ]; then
    echo "Usage: $0 <directory> <output_filename> [directories_to_ignore...]"
    exit 1
fi
# Convert the directory to an absolute path
directory=$(realpath "$1")
output_file=$2
shift 2 # Shift the arguments so we can access any ignored directories
# Collect directories to ignore (if provided)
ignore_dirs=()
for dir in "$@"; do
    ignore_dirs+=("-path" "$dir" -prune -o)
done
temp_file=$(mktemp)
# Recursively find all parquet files and save their absolute paths to a temp file
find "$directory" "${ignore_dirs[@]}" -type f -exec file "{}" \; | ug -i -e 'apache parquet' | awk -F: '{print $1}' >> "$temp_file"
# Declare an associative array to store unique MD5s
declare -A md5_dict
# Iterate over each file in the temp file
while IFS= read -r parquet_file; do
    # Extract just the filename from the absolute path
    filename=$(basename "$parquet_file")
    # Search the file for the regex pattern and loop through each match
    matches=$(pqrs cat $parquet_file | sed '/^#\{40,\}/d' | rg -o "md5=\w{32}" | sed 's/md5=//' | sort | uniq)
    # Write each unique match on a new line in the output file with the filename prefixed
    while IFS= read -r match; do
        # Extract the actual hash value from the match (remove "md5=")
        hash_value=$(echo "$match")
        # Check if the MD5 is already in the dictionary
        if [[ -z "${md5_dict[$hash_value]}" ]]; then
            # If not in the dictionary, add it and write to the output file
            md5_dict["$hash_value"]="$filename"
            echo "$filename,$match" >> "$output_file"
        fi
    done <<< "$matches"
done < "$temp_file"
# Clean up the temporary file
rm "$temp_file"
echo "Search complete. Results saved to $output_file."
Edited to upload correct version of the script.
UPDATE: Here is the directory structure I'm working with
.
├── data
│   ├── outputFiles
│   └── parquetFiles
│       ├── 2023-08-06-07-00_1691276460134_127.0.0.1
│       ├── 2023-08-06-07-01_1691276522106_127.0.0.1
│       ├── ignoreThisDir
│       │   ├── 2023-08-06-07-02_1691276582057_127.0.0.1
│       │   └── testFileNottoSearch.txt
│       └── testFiletoSearch.txt
├── db
│   └── docker-compose.yaml
├── dehasher.sh
├── find_pg_files.sh
├── source
└── test.sh
When I run the script, I expect it to scrape values out of the two files in data/parquetFiles and ignore the file in data/parquetFiles/ignoreThisDir. However, when I run the script and check the output file using ripgrep:
rg "2023-08-06-07-02_1691276582057" ouput
there are over 200 matches.
directory in the question, but the script speaks of directories; you may want to clarify that. With a single one you're fine w/ $3 and -path "$3" -prune, if you want to exclude several you'll need to be smarter about this.
\( -path dir1 -prune -o -path dir2 -prune -o ... \) which I think you can hard-code into your cmd like find "$directory" \( "${ignore_dirs[@]}" \) -type f ....
find "$directory" \( "${ignore_dirs[@]}" -print \) -type f -exec file "{}" \; | ug -i -e 'apache parquet' | awk -F: '{print $1}' >> "$temp_file" there are no errors, but it still includes the directory that I'm trying to ignore.
>> "$temp_file" will keep appending each run to your $temp_file.
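One possible cause, offered as an assumption rather than a confirmed diagnosis: -path is matched against the whole path that find builds from its starting point, and the script makes that starting point absolute with realpath. If the directories to ignore are passed as relative names, they can never match. A minimal sketch of one way around that is below; find_files_excluding is a hypothetical helper name, not part of the original script, and it assumes each ignored directory exists and is given relative to the current working directory (or as an absolute path).

```shell
#!/usr/bin/env bash

# find_files_excluding ROOT [IGNORED_DIR...]
# Hypothetical helper (not from the original script): prints every regular
# file under ROOT, skipping each IGNORED_DIR and everything below it.
find_files_excluding() {
    local root
    root=$(realpath "$1")
    shift

    local prune_args=()
    local dir
    for dir in "$@"; do
        # Resolve each ignored directory to an absolute path so it has the
        # same form as the paths find generates from the absolute ROOT.
        prune_args+=(-path "$(realpath "$dir")" -prune -o)
    done

    # Each "-path X -prune -o" hands over to the next test when X does not
    # match. The explicit -print stops find from adding an implicit one
    # that would also list the pruned directories themselves.
    find "$root" "${prune_args[@]}" -type f -print
}
```

Run from the top of the tree shown above, a call such as find_files_excluding data/parquetFiles data/parquetFiles/ignoreThisDir should list only the files outside ignoreThisDir. In the script itself, the same effect could come from storing "$(realpath "$dir")" instead of "$dir" in ignore_dirs. Note that -path treats its argument as a shell pattern, so directory names containing characters like * or [ would need escaping.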