When I call the script from the command line, I'd like to pass in an optional list of directories for the find command to ignore. This is what I have, but my output includes data from files in the directory that is supposed to be ignored. The script works as expected when no directory to ignore is passed in.

#!/usr/bin/env bash

# Ensure the script is called with the correct number of arguments
if [ "$#" -lt 2 ]; then
    echo "Usage: $0 <directory> <output_filename> [directories_to_ignore...]"
    exit 1
fi

# Convert the directory to an absolute path
directory=$(realpath "$1")
output_file=$2
shift 2  # Shift the arguments so we can access any ignored directories

# Collect directories to ignore (if provided)
ignore_dirs=()
for dir in "$@"; do
    ignore_dirs+=("-path" "$dir" -prune -o)
done

temp_file=$(mktemp)

# Recursively find all parquet files and save their absolute paths to a temp file
find "$directory" "${ignore_dirs[@]}" -type f -exec file "{}" \; | ug -i -e 'apache parquet' | awk -F: '{print $1}' >> "$temp_file"

# Declare an associative array to store unique MD5s
declare -A md5_dict

# Iterate over each file in the temp file
while IFS= read -r parquet_file; do
    # Extract just the filename from the absolute path
    filename=$(basename "$parquet_file")

    # Search the file for the regex pattern and loop through each match
    matches=$(pqrs cat "$parquet_file" | sed '/^#\{40,\}/d' | rg -o "md5=\w{32}" | sed 's/md5=//' | sort | uniq)

    # Write each unique match on a new line in the output file with the filename prefixed
    while IFS= read -r match; do
        # The "md5=" prefix was already stripped above, so the match is the hash itself
        hash_value=$match

        # Check if the MD5 is already in the dictionary
        if [[ -z "${md5_dict[$hash_value]}" ]]; then
            # If not in the dictionary, add it and write to the output file
            md5_dict["$hash_value"]="$filename"
            echo "$filename,$match" >> "$output_file"
        fi
    done <<< "$matches"
done < "$temp_file"

# Clean up the temporary file
rm "$temp_file"

echo "Search complete. Results saved to $output_file."

Edited to upload correct version of the script.

UPDATE: Here is the directory structure I'm working with:

.
├── data
│   ├── outputFiles
│   └── parquetFiles
│       ├── 2023-08-06-07-00_1691276460134_127.0.0.1
│       ├── 2023-08-06-07-01_1691276522106_127.0.0.1
│       ├── ignoreThisDir
│       │   ├── 2023-08-06-07-02_1691276582057_127.0.0.1
│       │   └── testFileNottoSearch.txt
│       └── testFiletoSearch.txt
├── db
│   └── docker-compose.yaml
├── dehasher.sh
├── find_pg_files.sh
├── source
└── test.sh

When I run the script, I expect it to scrape values out of the two files in data/parquetFiles and ignore the file in data/parquetFiles/ignoreThisDir. When I run the script and check the output file with ripgrep (rg "2023-08-06-07-02_1691276582057" ouput), there are over 200 matches.

  • Check out this fine answer over at Stack Overflow. You mention directory in the question, but the script speaks of directories; you may want to clarify that. With a single one you're fine w/ $3 and -path "$3" -prune, if you want to exclude several you'll need to be smarter about this. Commented Sep 6, 2024 at 17:54
  • Thanks for the comments. I uploaded the wrong script. I'll edit with the correct one. It's very similar, there's just some minor changes. The one I originally posted was in the middle of being modified. Commented Sep 6, 2024 at 20:36
  • If I recall correctly, when you use a "list" of ORed-together options, you need to isolate that logic like \( -path dir1 -prune -o -path dir2 -prune -o ... \), which I think you can hard-code into your cmd like find "$directory" \( "${ignore_dirs[@]}" \) -type f .... Commented Sep 6, 2024 at 23:22
  • @shellter When I do that, I get this error: find: -o: no expression after -o. If I use this instead: find "$directory" \( "${ignore_dirs[@]}" -print \) -type f -exec file "{}" \; | ug -i -e 'apache parquet' | awk -F: '{print $1}' >> "$temp_file" there are no errors, but it still includes the directory that I'm trying to ignore. Commented Sep 7, 2024 at 0:47
  • ".... there are over 200 matches." Yes, >> "$temp_file" will keep appending each run to your $temp_file. Commented Sep 7, 2024 at 0:54
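A note on the find expression discussed in the comments: find tests -path against the full path it builds from its starting point, and the script turns that starting point into an absolute path with realpath. A relative ignore argument such as data/parquetFiles/ignoreThisDir would therefore never match, which is one plausible reason the directory is still searched. The sketch below is an illustration under stated assumptions, not the asker's script: find_pruned is a hypothetical helper, the temporary tree and its file names (keep1, keep2, skipped) are made up, and realpath is assumed to be available as it is in the question. It builds the -o chain with a leading -o and drops it before grouping, which should avoid the "no expression after -o" error.

```shell
#!/usr/bin/env bash
# Build a throwaway tree that loosely mirrors the layout in the question.
# All names below are made up for illustration.
root=$(mktemp -d)
mkdir -p "$root/parquetFiles/ignoreThisDir"
touch "$root/parquetFiles/keep1" "$root/parquetFiles/keep2" \
      "$root/parquetFiles/ignoreThisDir/skipped"

# Hypothetical helper: print regular files under $1, pruning every
# directory named in the remaining arguments.
find_pruned() {
    local top=$1
    shift
    local prune=() dir
    for dir in "$@"; do
        # Resolve each ignore dir to the same absolute form as the
        # starting point, because -path matches the whole path.
        prune+=(-o -path "$(realpath "$dir")")
    done
    if ((${#prune[@]})); then
        # "${prune[@]:1}" drops the leading -o; the parentheses group the
        # OR-ed tests so that a single -prune applies to all of them.
        find "$top" \( "${prune[@]:1}" \) -prune -o -type f -print
    else
        find "$top" -type f -print
    fi
}

found=$(find_pruned "$(realpath "$root/parquetFiles")" \
                    "$root/parquetFiles/ignoreThisDir" | sort)
printf '%s\n' "$found"
rm -rf "$root"
```

In the original script the same idea would mean replacing the trailing -type f ... with -type f -exec file {} \; after the -prune -o part. The explicit action after -o matters: without one, find adds an implicit -print that also prints the pruned directories themselves.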
