How to use awk to split a file and store each filename in a Bash array

Question

Input

A file called input_file.csv, which has 7 columns, and n rows.

Example header and row:

Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer

Output

n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)

Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.

An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)

Example: Array = [golf.csv, soccer.csv]

Situation

The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.

Question:

My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.

#!/bin/bash

ARRAY=()

awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv

for i in "${ARRAY[@]}"
    do
    :
    echo $i
done

The linked answer doesn't explain how to add each filename to an array. I tried exporting to a file, but none of the filenames are being stored anywhere. If I knew how to add each filename to an array, I think I could figure out how to access that array outside of awk.
– jfvasconez, Commented Feb 26, 2016 at 21:34
How do I store the name of each file - what file(s)? If you can provide a better explanation and concise, testable sample input and expected output, I for one would consider voting to reopen, but as it stands it looks like the question that yours is closed as a dup of DOES contain the answer to your question.
– Ed Morton, Commented Feb 26, 2016 at 21:39
Yes, but I don't understand why you'd post a space-separated input file when you say your real one is comma-separated, nor why you didn't create an input file with, say, a couple more lines, plus the output files you'd want generated from it, to make it 100% clear. Oh well, I think I know what you want now.
– Ed Morton, Commented Feb 27, 2016 at 13:38

mklement0 · Accepted Answer · 2016-02-27 03:52:00Z

2

Rather than struggling to get awk to fill your shell array variable, why not:

make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?

awk -F'","' ...  # your original Awk command

for i in *.csv; do  # use globbing to loop over the resulting *.csv files
    echo "$i"       # quoted, so names with spaces survive intact
done
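A minimal sketch of this approach end to end, assuming whitespace-separated input like the sample in the question. The scratch directory, the sample rows, and the names workdir and files are illustrative, not from the original post. The input file is kept outside the output directory so that the glob only matches files awk created.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch area: input lives in $workdir, generated files in $workdir/out.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/out"

# Hypothetical sample data in the shape shown in the question.
cat > "$workdir/input_file.csv" <<'EOF'
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
2016 LA Hawks Bears 1pm $200 Golf
2016 SF Lions Tigers 2pm $300 Soccer
EOF

cd "$workdir/out"

# Split by column 7; each output file starts with the header line.
awk 'NR==1 { header = $0; next }
     NF>1 {
       out = $7 ".csv"
       if (!(out in seen)) { seen[out] = 1; print header > out }
       print > out
     }' ../input_file.csv

# nullglob: if nothing matches, the pattern expands to nothing instead of itself.
shopt -s nullglob
files=(*.csv)

for f in "${files[@]}"; do
    printf '%s\n' "$f"
done
```

The array is filled by the shell itself, so awk never has to hand anything back. The catch is that the directory really must be clean: any unrelated .csv file in it would end up in the array too.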

answered Feb 27, 2016 at 3:52

mklement0


4 Comments

Mark Reed Over a year ago

you can even put the files in an array afterward if you still want to do that: filesArray=(*.csv)

ghoti Over a year ago

Ya, that makes great sense. But this doesn't address the challenges the OP described in the question, does it?

mklement0 Over a year ago

@ghoti: This answer suggests an alternative way to approach the problem (and is hopefully clear in that regard), because I suspect the question to be an instance of the XY problem.

ghoti Over a year ago

I agree that this is an XY problem -- but then, 80% of the questions we answer here probably are, if you look deep enough. Why does the OP want separate files by $7? What's the purpose of the array? I'm certain we could figure out a better way to achieve his goals. But I didn't see those goals in his question.

ghoti · Accepted Answer · 2016-02-27 03:52:43Z

Just off the top of my head, untested because you haven't supplied very much sample data, what about this?

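#!/usr/bin/awk -f

# Remember the header line, then move on to the next record.
FNR==1 {
  header=$0
  next
}

# First row for a sport: record its filename and start the file with the header.
# The parentheses matter: "! $7 in files" would parse as "(!$7) in files".
!($7 in files) {
  files[$7]=sprintf("sport-%s.csv", $7)
  print header > files[$7]
}

# Every data row goes to the file for its sport.
{
  print > files[$7]
}

# Emit a bash array declaration, with a space between the quoted entries.
END {
  printf("declare -a sportlist=( ")
  for (sport in files) {
    printf("\"%s\" ", sport)
  }
  printf(")\n")
}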

The idea here is that we use sport names as the keys of the array files[], and store the filename built for each sport as the value. (You can format the filename inside sprintf() as you see fit.) We step through the file, adding a header line whenever we get a new sport with no recorded filename. Then for non-headers, print to the file based on the sport name.

For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval the output of this awk script via command substitution, and the declare command will effectively be interpreted by your shell:

eval "$(/path/to/awkscript inputfile.csv)"

Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:

/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$

(Don't use this temp file, make a real one with mktemp or the like.)
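A sketch of the temp-file variant using mktemp, under some assumptions: the inline awk program stands in for /path/to/awkscript, the sample rows are made up, and the input is whitespace-separated as in the question's sample. Sourcing executes the file as shell code, so this is only reasonable when the values in column 7 are trusted.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data.
cat > input_file.csv <<'EOF'
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
2016 LA Hawks Bears 1pm $200 Golf
EOF

# mktemp picks an unpredictable name, unlike /tmp/yadda.$$.
decl=$(mktemp)

# Inline stand-in for the awk script: split by column 7, then print a declare line.
awk 'FNR==1 { header = $0; next }
     !($7 in files) {
       files[$7] = sprintf("sport-%s.csv", $7)
       print header > files[$7]
     }
     { print > files[$7] }
     END {
       printf("declare -a sportlist=( ")
       for (sport in files) printf("\"%s\" ", sport)
       printf(")\n")
     }' input_file.csv > "$decl"

# Run the generated declare line in the current shell, then discard the file.
. "$decl"
rm -f "$decl"

for sport in "${sportlist[@]}"; do
    printf '%s -> sport-%s.csv\n' "$sport" "$sport"
done
```

The order of sportlist is whatever order awk's for-in loop happens to produce, so nothing here should rely on it.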

Mark Reed · Accepted Answer · 2016-02-27 03:58:09Z

0

There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.

filesArray=($(awk ... ))

If the file names might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file name on a separate line and use readarray (with -t, so that the trailing newline is stripped from each entry):

readarray -t filesArray < <( awk ... )

If the file names might have newlines in them, too, then things get tricky...
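As a rough illustration of the readarray route, here is a sketch where awk both writes the per-sport files and prints each new file name once on standard output. It assumes bash 4 or later and, so that a value with a space can be shown, tab-separated input; the sample rows and the "Table Tennis" value are invented for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical tab-separated sample; printf reuses the format for each group of 7 fields.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Date Location Team1 Team2 Time 'Prize_$' Sport \
    2016 NY Raptors Gators 12pm '$500' 'Table Tennis' \
    2016 LA Hawks Bears 1pm '$200' Golf \
    2016 SF Lions Tigers 2pm '$300' 'Table Tennis' > input_file.tsv

# awk writes the data files and reports each new file name on stdout, one per line.
# readarray -t stores one line per array element and strips the newline.
readarray -t filesArray < <(
    awk -F'\t' 'NR==1 { header = $0; next }
         {
           out = $7 ".csv"
           if (!(out in seen)) { seen[out] = 1; print header > out; print out }
           print > out
         }' input_file.tsv
)

for f in "${filesArray[@]}"; do
    printf 'generated: %s\n' "$f"
done
```

Because each name arrives on its own line, spaces inside a name are preserved. Names containing newlines would still be split, as noted above.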

answered Feb 27, 2016 at 3:58

Mark Reed


karakfa · Accepted Answer · 2016-02-27 04:02:52Z

0

If your file is not large, you can make a second pass over it with another awk command to get the unique $7 values, for example:

$ awk 'NR>1&&!a[$7]++{print $7}' sports

This prints the values. You can change it to produce your file name format as well, such as:

$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports

The output can then be piped to your other process, here for example to wc:

$ awk ... sports | xargs wc
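A sketch of the two-pass idea, assuming whitespace-separated input in a file named sports as in the commands above. The splitting command, the sample rows, and the count variable are illustrative. A while-read loop is used in place of xargs so that an arbitrary command can be run per file.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data.
cat > sports <<'EOF'
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
2016 LA Hawks Bears 1pm $200 Golf
2016 SF Lions Tigers 2pm $300 Soccer
EOF

# First pass: split into lower-cased <sport>.csv files, header first.
awk 'NR==1 { h = $0; next }
     {
       out = tolower($7) ".csv"
       if (!(out in s)) { s[out] = 1; print h > out }
       print > out
     }' sports

# Second pass: list each file name once and process the files one by one.
count=0
while IFS= read -r f; do
    printf '%s: %s lines\n' "$f" "$(grep -c '' "$f")"
    count=$((count + 1))
done < <(awk 'NR>1 && !a[$7]++ { print tolower($7) ".csv" }' sports)
```

The second pass has to build names exactly the way the first pass did (here, tolower($7) ".csv"), otherwise the loop will look for files that do not exist.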

answered Feb 27, 2016 at 4:02

karakfa


Ed Morton · Accepted Answer · 2016-02-27 13:41:20Z

0

This will do what I THINK you want:

oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"

If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).

If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.

The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
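For awks other than GNU awk, one possible shape for the extra code is sketched below: each output file is closed right after it is written and reopened in append mode for the next row, so at most one output file is open at a time. The header handling and the sample rows are additions for illustration, and the input is assumed to be whitespace-separated.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data.
cat > input_file.csv <<'EOF'
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
2016 LA Hawks Bears 1pm $200 Golf
2016 SF Lions Tigers 2pm $300 Soccer
EOF

# Split the awk output on newlines only, so names with blanks stay whole.
oIFS=$IFS; IFS=$'\n'
array=( $(awk '
    NR==1 { header = $0; next }       # keep the header out of the data rows
    {
        out = $7 ".csv"
        if (!seen[out]++) {           # first row for this value
            print header > out        # ">" truncates any stale file
            close(out)
            print out                 # report the new file name on stdout
        }
        print >> out                  # reopen in append mode
        close(out)                    # never hold more than one file open
    }' input_file.csv) )
IFS=$oIFS

for f in "${array[@]}"; do
    printf '%s\n' "$f"
done
```

Opening and closing a file for every row is slower than keeping the files open, so this trade-off mainly makes sense when the number of distinct values could exceed the awk's open-file limit. The unquoted command substitution is also subject to pathname expansion, so values containing glob characters such as * would need extra care.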

edited Feb 27, 2016 at 13:41

answered Feb 27, 2016 at 13:33

Ed Morton
