I have a bash script that takes a large CSV and splits it into smaller CSVs, based on this blog https://medium.com/swlh/automatic-s3-file-splitter-620d04b6e81c. It works well and is fast because it never downloads the CSVs to disk, which is great for a Lambda. However, the split CSVs have no header row; only the originating CSV does. This is a problem for me because Apache PySpark cannot read a set of files where one has a header row and the rest do not.
I want to add a header row to each CSV written.
What the code does
INFILE
- "s3://test-bucket/test.csv"
OUTFILES - split into 300K-line chunks
- "s3://dest-test-bucket/test.00.csv"
- "s3://dest-test-bucket/test.01.csv"
- "s3://dest-test-bucket/test.02.csv"
- "s3://dest-test-bucket/test.03.csv"
You can pass a dash to aws s3 cp to stream: a dash as the source writes the object to standard output (stdout), and a dash as the destination uploads from standard input (stdin). I don't know if injecting a header row is even possible with an open file stream.
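For reference, this is how the dash behaves on each side of aws s3 cp (the bucket and key names here are just placeholders):

# Dash as source: stream the object to stdout without touching disk
aws s3 cp s3://test-bucket/test.csv - | head -n 1

# Dash as destination: upload whatever arrives on stdin
printf 'a,b,c\n' | aws s3 cp - s3://test-bucket/tiny.csv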
Original code that works
LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"
# Stream the object to stdout; split pipes each 300K-line chunk to the
# --filter command's stdin, which uploads it straight back to S3. $FILE is
# the output name split generates; the trailing echo prints the name so
# the FILES array captures the list of files written.
FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "aws s3 cp - \"${OUTFILE}_\$FILE.csv\" | echo \"\$FILE.csv\""))
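To make the mechanics concrete: GNU split runs the --filter command once per chunk, feeding the chunk to the command's stdin and setting $FILE to the generated output name. A local, S3-free equivalent (the chunk_ file names are just for illustration):

seq 1 10 | split -d -l 3 --filter 'cat > "chunk_$FILE.txt" && echo "$FILE"'

This writes chunk_x00.txt through chunk_x03.txt and echoes each $FILE, which is what the array capture above relies on.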
This was my attempt to prepend the header variable to each outgoing file stream, but it did not work.
LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"
# Capture the first line of the source object as the header.
HEADER=$(aws s3 cp "${INFILE}" - | head -n 1)
# Intended to emit the header before each upload, but the echo appears to go
# to the filter's stdout (and into the FILES array), not into the stream
# that aws s3 cp - uploads.
FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "echo ${HEADER}; aws s3 cp - \"${OUTFILE}_\$FILE.csv\" | echo \"\$FILE.csv\""))
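For what it's worth, the direction I was hoping for is something like the sketch below. It is untested, assumes the header contains no characters that need shell escaping, and would duplicate the header on the first chunk (which already starts with the original header row):

# Group header + chunk so both flow into the upload's stdin
FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "{ echo \"${HEADER}\"; cat; } | aws s3 cp - \"${OUTFILE}_\$FILE.csv\" && echo \"\$FILE.csv\""))

The { echo ...; cat; } group runs with the chunk as its stdin, so the header and then the chunk data should both flow into the upload. Is something along these lines workable with the open stream?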