I have a bash script that takes a large CSV and splits it into smaller CSVs, based on this blog post: https://medium.com/swlh/automatic-s3-file-splitter-620d04b6e81c. It works well and is fast because it streams the data and never downloads the CSVs, which is great for a Lambda. However, after the split only the first output file carries the header row from the original CSV; the rest have none. This is a problem for me because I cannot read a set of files with PySpark when one has a header row and the others do not.

I want to add a header row to each csv written.

What the code does

INFILE

  • "s3://test-bucket/test.csv"

OUTFILES - split into files of 300K lines each

  • "s3://dest-test-bucket/test.00.csv"
  • "s3://dest-test-bucket/test.01.csv"
  • "s3://dest-test-bucket/test.02.csv"
  • "s3://dest-test-bucket/test.03.csv"

AWS documentation states

You can use the dash parameter for file streaming to standard input (stdin) or standard output (stdout).

I don't know if this is even possible with an open file stream.

Original code that works

LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"

FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "aws s3 cp - \"${OUTFILE}_\$FILE.csv\"  | echo \"\$FILE.csv\""))

This was my attempt to prepend a variable holding the header to the outgoing file stream, but it did not work.

LINECOUNT=300000
INFILE=s3://"${S3_BUCKET}"/"${FILENAME}"
OUTFILE=s3://"${DEST_S3_BUCKET}"/"${FILENAME%%.*}"

HEADER=$(aws s3 cp "${INFILE}" - | head -n 1)

FILES=($(aws s3 cp "${INFILE}" - | split -d -l ${LINECOUNT} --filter "echo ${HEADER}; aws s3 cp - \"${OUTFILE}_\$FILE.csv\"  | echo \"\$FILE.csv\""))
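A likely reason the attempt fails: inside the filter, `echo ${HEADER}; aws s3 cp - ...` runs two separate commands, so the header goes to the filter's stdout (and ends up in FILES) instead of into the stdin of `aws s3 cp -`. Grouping the header and the chunk with `{ ...; }` and piping the group into the upload puts both on the same stream. The following is a minimal local sketch of that idea, not a tested S3 script: it assumes GNU `split` (for `--filter`), uses `cat` as a stand-in for both `aws s3 cp` calls, and the sample data, WORKDIR, and OUTPREFIX names are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical local stand-ins for the S3 source and destination prefix.
WORKDIR=$(mktemp -d)
INFILE="${WORKDIR}/test.csv"
OUTPREFIX="${WORKDIR}/test"
LINECOUNT=2   # 300000 in the real script
export OUTPREFIX

# Sample input: one header row plus five data rows.
printf 'id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n' > "${INFILE}"

# `cat "${INFILE}"` stands in for: aws s3 cp "${INFILE}" -
# `cat > "..."`     stands in for: aws s3 cp - "..."
FILES=($(
  cat "${INFILE}" | {
    # Consume only the first line of the stream as the header;
    # split then sees the data rows only.
    IFS= read -r HEADER
    export HEADER
    # For each chunk, emit the header followed by the chunk (cat),
    # and pipe both into the writer. The trailing echo reports the name.
    split -d -l "${LINECOUNT}" --filter='{ printf "%s\n" "$HEADER"; cat; } | cat > "${OUTPREFIX}_${FILE}.csv"; echo "${FILE}.csv"'
  }
))

printf '%s\n' "${FILES[@]}"
```

The filter is single-quoted so that `$HEADER` and `$FILE` are expanded by the shell that `split` launches, with proper quoting, rather than being pasted into the command string. Reading the header from the same stream also avoids fetching the source twice. Note that line-based splitting assumes no CSV field contains an embedded newline.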
  • The script in that article you reference has some beginner mistakes that will cause breakage and/or security issues given some input and some environment settings, don't use it. If you'd like to know how to split a CSV, post a question with a sample CSV and expected output. Commented Sep 10, 2022 at 21:08
  • Is a one-liner mandatory? With several lines and loops it would be more readable and elegant. Commented Sep 18, 2022 at 14:35
  • A one-liner is not necessary, as long as the data is streamed and not downloaded to the local machine, which would cause out-of-memory exceptions. Commented Oct 14, 2022 at 14:32
