CSV to JSON using BASH

Question

I am trying to convert the CSV below into JSON format.

Africa,Kenya,NAI,281
Africa,Kenya,NAI,281
Asia,India,NSI,100
Asia,India,BSE,160
Asia,Pakistan,ISE,100
Asia,Pakistan,ANO,100
European Union,United Kingdom,LSE,100

This is the desired JSON format, and I cannot work out how to create it. My work in progress is posted below. Any help or direction would be appreciated.

  {"name":"Africa",
      "children":[
      {"name":"Kenya",
          "children":[
          {"name":"NAI","size":"109"},
          {"name":"NAA","size":"160"}]}]},
  {"name":"Asia",
      "children":[
      {"name":"India",
          "children":[
          {"name":"NSI","size":"100"},
          {"name":"BSE","size":"60"}]},
      {"name":"Pakistan",
          "children":[
          {"name":"ISE","size":"120"},
          {"name":"ANO","size":"433"}]}]},
  {"name":"European Union",
        "children":[
        {"name":"United Kingdom",
            "children":[
            {"name":"LSE","size":"550"},
            {"name":"PLU","size":"123"}]}]}

Work in Progress.

$1 is the file with the csv values pasted above.

#!/bin/bash

pcountry=$(head -1 $1 | cut -d, -f2)

cat $1 | while read line ; do 

region=$(echo $line|cut -d, -f1)
country=$(echo $line|cut -d, -f2)
code=$(echo $line|cut -d, -f3-)
size=$(echo $line|cut -d, -f4)

if test "$pcountry" == "$country" ;
  then 
  echo -e {\"name\":\"$region\", '\n' \"children\": [ '\n'{\"name\":\"$country\",'\n'\"children\": [ '\n' \{\"name\":\"NAI\",\"size\":\"$size\"\}
  else
      if test "$pregion" == "$region"
      then :
      else 
          echo -e ,'\n'{\"name\":\""$region\", '\n' \"children\": [ '\n'{\"name\":\"$country\",'\n'\"children\": [ '\n' \{\"name\":\"NAI\",\"size\":\"$size\"\},


pcountry=$country
pregion=$region

fi ; done

The problem is that I cannot find a way to tell when a country's values end.

Why bash? Python, which can read and write both CSV and JSON, would be a better choice for this task.
– mouviciel, Commented Jun 19, 2014 at 6:58
You could assume that a country's values end either when you see a new country (risky) or when you hit EOF (safe). Pre-sorting eliminates the risk, provided countries are always categorized in the correct region. The ambiguity is an issue with the provided data format.
– Paul, Commented Jun 19, 2014 at 7:09
Python, Node.js, and Perl would all better support data transformation between CSV and JSON because of library availability.
– Paul, Commented Jun 19, 2014 at 7:10
Thanks for everyone's comments. The reason I used Bash is that I do not know any other languages; I just picked up Bash doing my job. I guess I know what to "pick up" next: Python :) Special thanks to @David Atchley for the script. You are a champion!
– dexnow, Commented Jun 20, 2014 at 7:28
And guys, I am interested in knowing how it could be done in Python. If someone could be kind enough to give me a Python script that does this, it would be a great chance for me to pick up some Python.
– dexnow, Commented Jun 21, 2014 at 10:51
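
Paul's suggestion above (treat a new country, a new region, or EOF as the end of the current group, and pre-sort so that groups are contiguous) can be sketched in shell roughly as follows. This is only a minimal sketch, not the method used in any of the answers: `csv_to_json` is a hypothetical function name, it assumes fields contain no commas, double quotes, or backslashes, and it relies on `sort -s` (stable sort), which GNU and BSD sort provide but POSIX does not require.

```bash
# Sketch: stream sorted CSV rows and close JSON brackets whenever the
# region or country differs from the previous row, or at end of input.
# csv_to_json is a hypothetical helper name (an assumption, not from the post).
csv_to_json() {
    prev_region="" prev_country="" first=1
    printf '[\n'
    # Sorting makes all rows of a region/country contiguous, so a changed
    # key reliably marks the end of the previous group. -s keeps the
    # original order of codes within a country (assumes GNU or BSD sort).
    LC_ALL=C sort -s -t, -k1,1 -k2,2 "$1" | {
        while IFS=, read -r region country code size; do
            [ -n "$region" ] || continue        # skip blank lines
            if [ "$first" -eq 1 ]; then
                # very first row: open the first region and country
                printf '{"name":"%s","children":[\n' "$region"
                printf '  {"name":"%s","children":[\n' "$country"
            elif [ "$region" != "$prev_region" ]; then
                # new region: close the previous country and region first
                printf '\n  ]}\n]},\n'
                printf '{"name":"%s","children":[\n' "$region"
                printf '  {"name":"%s","children":[\n' "$country"
            elif [ "$country" != "$prev_country" ]; then
                # new country in the same region: close the previous country
                printf '\n  ]},\n'
                printf '  {"name":"%s","children":[\n' "$country"
            else
                # same country: just separate the leaf objects
                printf ',\n'
            fi
            printf '    {"name":"%s","size":"%s"}' "$code" "$size"
            prev_region=$region prev_country=$country first=0
        done
        # EOF closes whatever is still open (nothing if the input was empty)
        [ "$first" -eq 1 ] || printf '\n  ]}\n]}\n'
    }
    printf ']\n'
}
```

Called as `csv_to_json input.csv >output.json`, it prints regions and countries in sorted order. Because each row is only ever compared with the previous one, no temporary files are needed.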

David Atchley · Accepted Answer · 2014-06-20 07:21:57Z

As a number of the commenters have said, using the shell for this kind of conversion is a horrible idea. It would be nigh impossible to do with bash builtins alone, and shell scripts are meant to combine standard Unix commands like sed, awk, and cut anyway. You should choose a language better suited to this kind of iterative parsing and processing.

However, because it's late and I've had too much coffee, I threw together a bash script (with a few bits of sed thrown in for parsing help) that takes the example .csv data you have and outputs the JSON in the format you noted. Here's the script:

#! /bin/bash 
# Initial input file format:
#
#         Africa,Kenya,NAI,281
#         Africa,Kenya,NAA,281
#         Asia,India,NSI,100
#         Asia,India,BSE,160
#         Asia,Pakistan,ISE,100
#         Asia,Pakistan,ANO,100
#         European Union,United Kingdom,LSE,100
#
# Intermediate file format for parsing to JSON:
#
#         Africa|Kenya:NAI=281&NAA=281
#         Asia|India:NSI=100&BSE=160|Pakistan:ISE=100&ANO=100
#         European Union|United Kingdom:LSE=100
#
# Call as:
#
#   $ ./script INPUTFILE.csv >OUTPUTFILE.json
#


# temporary files for output/parsing (removed automatically on exit)
TMP=$(mktemp) || exit 1
TMP2=$(mktemp) || exit 1
trap 'rm -f "$TMP" "$TMP2"' EXIT

# read through initial file and output intermediate format
while IFS=, read -r region country code size
do
    # region record already started ("|" is literal in a basic regex)
    if grep -q "^$region|" "$TMP"; then
        >"$TMP2"
        while read -r rec
        do
            if echo "$rec" | grep -q "^$region|"
            then
                # -F: match "|country:" as a fixed string, not a regex
                if echo "$rec" | grep -qF "|$country:"
                then
                    # country already present: append "&code=size" to its entry
                    echo "$rec" | sed -e 's/\(|'"$country"':[^|]*\)/\1\&'"$code"'='"$size"'/' >>"$TMP2"
                else
                    # first row for this country: start a new "|country:" entry
                    echo "$rec|$country:$code=$size" >>"$TMP2"
                fi
            else
                # record for another region: copy through unchanged
                echo "$rec" >>"$TMP2"
            fi
        done < "$TMP"
        mv "$TMP2" "$TMP"
    else
    # new region
        echo "$region|$country:$code=$size" >>"$TMP"
    fi

done < "$1"

# Parse through our intermediary format and output JSON to standard out
echo "["
region_count=$(wc -l < "$TMP")
while read -r line
do
    region=$(echo "$line" | cut -d\| -f1)
    echo "{ \"name\": \"$region\","
    echo "  \"children\": ["
    # everything after the first "|" is a list of country records
    country_count=$(echo "$line" | cut -d\| -f2- | tr '|' '\n' | wc -l)
    echo "$line" | cut -d\| -f2- | tr '|' '\n' |
    while read -r country_rec
    do
        name=$(echo "$country_rec" | cut -d: -f1)
        echo "    { \"name\": \"$name\","
        echo "      \"children\": ["
            # everything after "name:" is a list of code=size pairs
            code_count=$(echo "$country_rec" | cut -d: -f2- | tr '&' '\n' | wc -l)
            echo "$country_rec" | cut -d: -f2- | tr '&' '\n' |
            while read -r code_size
            do
                code=$(echo "$code_size" | cut -d= -f1)
                size=$(echo "$code_size" | cut -d= -f2)
                code_count=$((code_count - 1))
                COMMA=""
                if [ "$code_count" -gt 0 ]; then
                  COMMA=","
                fi
                echo "        { \"name\": \"$code\", \"size\": \"$size\" }$COMMA"
            done
        echo "      ]"
        country_count=$((country_count - 1))
        if [ "$country_count" -gt 0 ]; then
            echo "    },"
        else
            echo "    }"
        fi
    done
    echo "  ]"
    region_count=$((region_count - 1))
    COMMA=""
    if [ "$region_count" -gt 0 ]; then
        COMMA=","
    fi
    echo "}$COMMA"

done < "$TMP"
echo "]"

exit 0

And, here's the resulting output from the above script:

[
{ "name": "Africa",
  "children": [
    { "name": "Kenya",
      "children": [
        { "name": "NAI", "size": "281" },
        { "name": "NAA", "size": "281" }
      ]
    }
  ]
},
{ "name": "Asia",
  "children": [
    { "name": "India",
      "children": [
        { "name": "NSI", "size": "100" },
        { "name": "BSE", "size": "160" }
      ]
    },
    { "name": "Pakistan",
      "children": [
        { "name": "ISE", "size": "100" },
        { "name": "ANO", "size": "100" }
      ]
    }
  ]
},
{ "name": "European Union",
  "children": [
    { "name": "United Kingdom",
      "children": [
        { "name": "LSE", "size": "100" }
      ]
    }
  ]
}
]

Please don't use code like the above in any production environment.
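
One concrete reason for that warning: the script pastes field values straight into the JSON text, so a name containing a double quote or a backslash would produce invalid JSON. A minimal sketch of an escaping helper is shown below. `json_escape` is a hypothetical name, not part of the script above, and it covers only backslashes and double quotes; control characters such as tabs and newlines are left unhandled.

```bash
# json_escape is a hypothetical helper (an assumption, not from the script above).
# It escapes only backslashes and double quotes; control characters
# (tabs, newlines, etc.) are NOT handled in this sketch.
json_escape() {
    # Backslashes must be doubled first, otherwise the backslashes added
    # in front of quotes would themselves be doubled.
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}
```

It would be used wherever a value is interpolated, for example `echo "{ \"name\": \"$(json_escape "$name")\" }"`.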

I feel like there should be a badge for encouraging bad behavior 😈
Embedded systems (such as those using Yocto or DD-WRT) often have only BusyBox available, which includes a surprisingly functional Bash-like shell, but they usually lack a package manager or local compilers. 100% Bash FTW!

jq170727 · Answer · 2017-09-03 17:39:58Z

Here is a solution using jq.

If filter.jq contains the following filter

 reduce (
     split("\n")[]                  # split string into lines
   | split(",")                     # split data
   | select(length>0)               # eliminate blanks
 )  as [$c1,$c2,$c3,$c4] (          # convert to object 
     {}                             #   e.g. "Africa": { "Kenya": {  
   ; setpath([$c1,$c2,"name"];$c3)  #           "name": "NAI",
   | setpath([$c1,$c2,"size"];$c4)  #           "size": "281"        
)                                   #        }, }
| [                                 # then build final array of objects format:
    keys[] as $k1                   # [ {                                               
  | {name: $k1, children: (         #   "name": "Africa",                                  
       .[$k1]                       #   "children": {                                   
     | keys[] as $k2                #     "name": "Kenya",                                 
     | {name: $k2, children:.[$k2]} #     "children": { "name": "NAI", "size": "281" }
    )}                              #   ...
  ]

and the file data contains the sample data, then the command

$ jq -M -Rsr -f filter.jq data

produces

[
  {
    "name": "Africa",
    "children": {
      "name": "Kenya",
      "children": {
        "name": "NAI",
        "size": "281"
      }
    }
  },
  {
    "name": "Asia",
    "children": {
      "name": "India",
      "children": {
        "name": "BSE",
        "size": "160"
      }
    }
  },
  {
    "name": "Asia",
    "children": {
      "name": "Pakistan",
      "children": {
        "name": "ANO",
        "size": "100"
      }
    }
  },
  {
    "name": "European Union",
    "children": {
      "name": "United Kingdom",
      "children": {
        "name": "LSE",
        "size": "100"
      }
    }
  }
]
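
Note that in the output above each `children` value is a single object and only the last code per country survives, whereas the question asks for arrays of children. A possible variant that groups rows instead is sketched below. It is not the answer author's filter: `csv_to_tree` is a hypothetical wrapper name, and the sketch assumes jq 1.5 or later (for `inputs`) and fields without embedded commas.

```bash
# Sketch only: group rows by region, then by country, so that every
# "children" value is an array. csv_to_tree is a hypothetical name.
csv_to_tree() {
    jq -Rn '
      [ inputs | select(length > 0) | split(",") ]   # rows as arrays of fields
      | group_by(.[0])                                # one group per region
      | map({
          name: .[0][0],
          children: (
            group_by(.[1])                            # one group per country
            | map({
                name: .[0][1],
                children: map({ name: .[2], size: .[3] })
              })
          )
        })
    ' "$1"
}
```

Called as `csv_to_tree data`, it prints a single array. `group_by` sorts the groups by key, so regions and countries come out in sorted order rather than input order.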

Reino · Answer · 2025-04-18 13:45:00Z

Please don't use regular expressions or Bash's built-in tools for this. They're not designed to parse or create JSON. Use a dedicated parser like xidel instead:

Assuming 'input.csv':

Africa,Kenya,NAI,109
Africa,Kenya,NAA,160
Asia,India,NSI,100
Asia,India,BSE,60
Asia,Pakistan,ISE,120
Asia,Pakistan,ANO,433
European Union,United Kingdom,LSE,550
European Union,United Kingdom,PLU,123

$ xidel -s "input.csv" -e '
  array{
    let $csv:=x:lines($raw) ! array{tokenize(.,",")}
    for $region in distinct-values($csv(1))
    return {
      "name":$region,
      "children":array{
        for $country in distinct-values($csv[.() = $region](2))
        return {
          "name":$country,
          "children":array{
            $csv[.() = ($country)] ! {
              "name":.(3),
              "size":.(4)
            }
          }
        }
      }
    }
  }
'
[
  {
    "name": "Africa",
    "children": [
      {
        "name": "Kenya",
        "children": [
          {
            "name": "NAI",
            "size": "109"
          },
          {
            "name": "NAA",
            "size": "160"
          }
        ]
      }
    ]
  },
  {
    "name": "Asia",
    "children": [
      {
        "name": "India",
        "children": [
          {
            "name": "NSI",
            "size": "100"
          },
          {
            "name": "BSE",
            "size": "60"
          }
        ]
      },
      {
        "name": "Pakistan",
        "children": [
          {
            "name": "ISE",
            "size": "120"
          },
          {
            "name": "ANO",
            "size": "433"
          }
        ]
      }
    ]
  },
  {
    "name": "European Union",
    "children": [
      {
        "name": "United Kingdom",
        "children": [
          {
            "name": "LSE",
            "size": "550"
          },
          {
            "name": "PLU",
            "size": "123"
          }
        ]
      }
    ]
  }
]

See this gist for intermediate steps leading to this query.
Also see this online xidelcgi demo.
