I am trying to convert the CSV below into JSON format.

Africa,Kenya,NAI,281
Africa,Kenya,NAI,281
Asia,India,NSI,100
Asia,India,BSE,160
Asia,Pakistan,ISE,100
Asia,Pakistan,ANO,100
European Union,United Kingdom,LSE,100

This is the desired JSON format, and I just cannot manage to create it. My work in progress is posted below. Any help or direction would be appreciated.

  {"name":"Africa",
      "children":[
      {"name":"Kenya",
          "children":[
          {"name":"NAI","size":"109"},
          {"name":"NAA","size":"160"}]}]},
  {"name":"Asia",
      "children":[
      {"name":"India",
          "children":[
          {"name":"NSI","size":"100"},
          {"name":"BSE","size":"60"}]},
      {"name":"Pakistan",
          "children":[
          {"name":"ISE","size":"120"},
          {"name":"ANO","size":"433"}]}]},
  {"name":"European Union",
        "children":[
        {"name":"United Kingdom",
            "children":[
            {"name":"LSE","size":"550"},
            {"name":"PLU","size":"123"}]}]}

Work in Progress.

$1 is the file with the csv values pasted above.

#!/bin/bash

pcountry=$(head -1 $1 | cut -d, -f2)

cat $1 | while read line ; do 

region=$(echo $line|cut -d, -f1)
country=$(echo $line|cut -d, -f2)
code=$(echo $line|cut -d, -f3-)
size=$(echo $line|cut -d, -f4)

if test "$pcountry" == "$country" ;
  then 
  echo -e {\"name\":\"$region\", '\n' \"children\": [ '\n'{\"name\":\"$country\",'\n'\"children\": [ '\n' \{\"name\":\"NAI\",\"size\":\"$size\"\}
  else
      if test "$pregion" == "$region"
      then :
      else 
          echo -e ,'\n'{\"name\":\""$region\", '\n' \"children\": [ '\n'{\"name\":\"$country\",'\n'\"children\": [ '\n' \{\"name\":\"NAI\",\"size\":\"$size\"\},


pcountry=$country
pregion=$region

fi ; done

The problem is that I cannot find a way to detect where a country's values end.

  • Why bash? Python, able to read and write csv and json, would be a better choice for this task. Commented Jun 19, 2014 at 6:58
  • You could assume that the countries values end either when you see a new country (risky) or hit EOF (safe). Pre-sorting eliminates the risk if countries are always categorized in the correct region. The ambiguity is an issue with the provided data format. Commented Jun 19, 2014 at 7:09
  • Python, nodeJS, Perl would all better support data transformation between csv and json because of library availability. Commented Jun 19, 2014 at 7:10
  • Thanks for everyone's comments. I used Bash because I do not know any other languages; I just picked it up on the job. I guess I know what to pick up next: Python :) Special thanks to @David Atchley for the script. You are a champion! Commented Jun 20, 2014 at 7:28
  • And I am interested in knowing how it could be done in Python. If someone could be kind enough to give me a Python script that does this, it would be a great chance for me to pick up some Python. Commented Jun 21, 2014 at 10:51
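
Since a Python version was requested, here is a minimal sketch, not a definitive implementation. It follows the pre-sorting idea from the comments: once the rows are sorted by region and country, a country's values end exactly where the key changes or the input ends, and itertools.groupby detects that boundary. The names SAMPLE and csv_to_tree are made up for illustration, the sample uses NAA as the second Kenya code (as in the desired output), and every row is assumed to have exactly four columns.

```python
import csv
import io
import json
from itertools import groupby
from operator import itemgetter

# Hypothetical sample input; the second Kenya code is assumed to be NAA.
SAMPLE = """\
Africa,Kenya,NAI,281
Africa,Kenya,NAA,281
Asia,India,NSI,100
Asia,India,BSE,160
Asia,Pakistan,ISE,100
Asia,Pakistan,ANO,100
European Union,United Kingdom,LSE,100
"""


def csv_to_tree(text):
    """Group rows by region, then by country (assumes 4 columns per row)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # Sorting on (region, country) makes every group contiguous, so a group
    # ends exactly where groupby sees the key change (or at end of input).
    # The sort is stable, so codes keep their original order within a country.
    rows.sort(key=itemgetter(0, 1))
    tree = []
    for region, region_rows in groupby(rows, key=itemgetter(0)):
        countries = []
        for country, country_rows in groupby(region_rows, key=itemgetter(1)):
            leaves = [{"name": code, "size": size}
                      for _, _, code, size in country_rows]
            countries.append({"name": country, "children": leaves})
        tree.append({"name": region, "children": countries})
    return tree


if __name__ == "__main__":
    print(json.dumps(csv_to_tree(SAMPLE), indent=2))
```

Letting json.dumps serialize the nested lists also removes the need to track where commas and closing brackets go by hand.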

3 Answers

As a number of the commenters have said, using the shell for this kind of conversion is a horrible idea. It would be nigh impossible with bash builtins alone, and shell scripts exist mainly to combine standard Unix commands like sed, awk, and cut anyway. You should choose a language that's built for this kind of iterative parsing and processing.

However, because it's late and I've had too much coffee, I threw together a bash script (with a few bits of sed thrown in for parsing help) that takes the example .csv data you have and outputs the JSON in the format you noted. Here's the script:

#! /bin/bash 
# Initial input file format:
#
#         Africa,Kenya,NAI,281
#         Africa,Kenya,NAA,281
#         Asia,India,NSI,100
#         Asia,India,BSE,160
#         Asia,Pakistan,ISE,100
#         Asia,Pakistan,ANO,100
#         European Union,United Kingdom,LSE,100
#
# Intermediate file format for parsing to JSON:
#
#         Africa|Kenya:NAI=281&NAA=281
#         Asia|India:NSI=100&BSE=160|Pakistan:ISE=100&ANO=100
#         European Union|United Kingdom:LSE=100
#
# Call as:
#
#   $ ./script INPUTFILE.csv >OUTPUTFILE.json
#


# temporary files for output/parsing
TMP="./tmp.dat"
TMP2="./tmp2.dat"
>$TMP
>$TMP2

# read through initial file and output intermediate format
# split each CSV line into its four fields directly with read
while IFS=, read -r region country code size
do

    # region record already started
    # anchor on "region|" so one region name cannot match as a prefix of another
    if grep -q "^$region|" "$TMP"; then
        >"$TMP2"
        while read -r rec
        do
            if echo "$rec" | grep -q "^$region|"
            then
                # -F matches "|country:" literally, avoiding regex alternation
                if echo "$rec" | grep -qF "|$country:"
                then
                    echo "$rec" | sed -e 's/\('"$country"':[^|][^|]*\)/\1\&'"$code"'='"$size"'/' >>"$TMP2"
                else
                    echo "$rec|$country:$code=$size" >>"$TMP2"
                fi
            else
                echo "$rec" >>"$TMP2"
            fi
        done < "$TMP"
        mv $TMP2 $TMP
    else
    # new region
        echo "$region|$country:$code=$size" >>$TMP
    fi

done < "$1"

# Parse through our intermediary format and output JSON to standard out
echo "["
country_count=$(cat $TMP | wc -l)
while read line
do
    country=$(echo $line | cut -d\| -f1)
    echo "{ \"name\": \"$country\", "
    echo "  \"children\": ["
    region_count=$(echo $line | cut -d\| -f2- | sed -e 's/|/\n/g' | wc -l)
    echo $line | cut -d\| -f2- | sed -e 's/|/\n/g' | 
    while read region
    do
        name=$(echo $region | cut -d: -f1)
        echo "    { \"name\": \"$name\", "
        echo "      \"children\": ["
            code_count=$(echo $region | sed -e 's/^'"$name"'://' -e 's/&/\n/g'  | wc -l)
            echo $region | sed -e 's/^'"$name"'://' -e 's/&/\n/g'  |
            while read code_size
            do
                code=$(echo $code_size | cut -d= -f1)
                size=$(echo $code_size | cut -d= -f2)
                code_count=$((code_count - 1))
                COMMA=""
                if [ $code_count -gt 0 ]; then
                  COMMA=","
                fi
                echo "        { \"name\": \"$code\", \"size\": \"$size\" }$COMMA " 
            done
        echo "      ]"
        region_count=$((region_count - 1))
        if [ $region_count -gt 0 ]; then
            echo "    },"
        else
            echo "    }"
        fi
    done 
    echo "  ]"
    country_count=$((country_count - 1))
    COMMA=""
    if [ $country_count -gt 0 ]; then
        COMMA=","
    fi    
    echo "}$COMMA"

done < $TMP
echo "]"

exit 0

And, here's the resulting output from the above script:

[
{ "name": "Africa",
  "children": [
    { "name": "Kenya",
      "children": [
        { "name": "NAI", "size": "281" },
        { "name": "NAA", "size": "281" }
      ]
    }
  ]
},
{ "name": "Asia",
  "children": [
    { "name": "India",
      "children": [
        { "name": "NSI", "size": "100" },
        { "name": "BSE", "size": "160" }
      ]
    },
    { "name": "Pakistan",
      "children": [
        { "name": "ISE", "size": "100" },
        { "name": "ANO", "size": "100" }
      ]
    }
  ]
},
{ "name": "European Union",
  "children": [
    { "name": "United Kingdom",
      "children": [
        { "name": "LSE", "size": "100" }
      ]
    }
  ]
}
]

Please don't use code like the above in any production environment.
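
For comparison, the second pass of the script (intermediate format to JSON) can be sketched in Python. This is only an illustration under the assumption that records look exactly like the intermediate format described in the script header; record_to_node is a made-up name. Building nested dicts and letting json.dumps serialize them replaces all of the manual comma counting.

```python
import json


def record_to_node(record):
    """Turn 'Region|Country:CODE=SIZE&CODE=SIZE|...' into a nested dict."""
    region, *country_parts = record.split("|")
    children = []
    for part in country_parts:
        # "India:NSI=100&BSE=160" -> country "India", pairs "NSI=100&BSE=160"
        country, _, pairs = part.partition(":")
        leaves = []
        for pair in pairs.split("&"):
            code, _, size = pair.partition("=")
            leaves.append({"name": code, "size": size})
        children.append({"name": country, "children": leaves})
    return {"name": region, "children": children}


# Records in the intermediate format, assumed to come from the first pass.
records = [
    "Africa|Kenya:NAI=281&NAA=281",
    "Asia|India:NSI=100&BSE=160|Pakistan:ISE=100&ANO=100",
    "European Union|United Kingdom:LSE=100",
]

tree = [record_to_node(record) for record in records]
print(json.dumps(tree, indent=2))
```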


2 Comments

I feel like there should be a badge for encouraging bad behavior 😈
Embedded systems (such as those using Yocto or DD-WRT) often have only BusyBox available, which includes a surprisingly functional Bash-like implementation, but always lack a package manager or local compilers. 100% Bash FTW!

Here is a solution using jq.

If filter.jq contains the following filter

 reduce (
     split("\n")[]                  # split string into lines
   | split(",")                     # split data
   | select(length>0)               # eliminate blanks
 )  as [$c1,$c2,$c3,$c4] (          # convert to object 
     {}                             #   e.g. "Africa": { "Kenya": {  
   ; setpath([$c1,$c2,"name"];$c3)  #           "name": "NAI",
   | setpath([$c1,$c2,"size"];$c4)  #           "size": "281"        
)                                   #        }, }
| [                                 # then build final array of objects format:
    keys[] as $k1                   # [ {                                               
  | {name: $k1, children: (         #   "name": "Africa",                                  
       .[$k1]                       #   "children": {                                   
     | keys[] as $k2                #     "name": "Kenya",                                 
     | {name: $k2, children:.[$k2]} #     "children": { "name": "NAI", "size": "281" }
    )}                              #   ...
  ]

and data contains the sample data then the command

$ jq -M -Rsr -f filter.jq data

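produces the output below. Note that each "children" value here is a single object rather than an array, so only the last code per country survives and Asia is repeated once per country: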

[
  {
    "name": "Africa",
    "children": {
      "name": "Kenya",
      "children": {
        "name": "NAI",
        "size": "281"
      }
    }
  },
  {
    "name": "Asia",
    "children": {
      "name": "India",
      "children": {
        "name": "BSE",
        "size": "160"
      }
    }
  },
  {
    "name": "Asia",
    "children": {
      "name": "Pakistan",
      "children": {
        "name": "ANO",
        "size": "100"
      }
    }
  },
  {
    "name": "European Union",
    "children": {
      "name": "United Kingdom",
      "children": {
        "name": "LSE",
        "size": "100"
      }
    }
  }
]
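
The reason only one code per country survives is that setting the same path twice overwrites the earlier value. A rough Python sketch of the difference, using the question's sample rows: overwrite_index mimics the overwrite behaviour, while append_index collects leaves in a list so nothing is lost. All function names here are hypothetical and not part of jq.

```python
import json

# The question's sample rows, including the repeated Kenya line.
ROWS = [
    ("Africa", "Kenya", "NAI", "281"),
    ("Africa", "Kenya", "NAI", "281"),
    ("Asia", "India", "NSI", "100"),
    ("Asia", "India", "BSE", "160"),
    ("Asia", "Pakistan", "ISE", "100"),
    ("Asia", "Pakistan", "ANO", "100"),
    ("European Union", "United Kingdom", "LSE", "100"),
]


def overwrite_index(rows):
    """Assigning to the same (region, country) key replaces the old leaf."""
    index = {}
    for region, country, code, size in rows:
        index.setdefault(region, {})[country] = {"name": code, "size": size}
    return index


def append_index(rows):
    """Appending to a per-country list keeps every leaf."""
    index = {}
    for region, country, code, size in rows:
        leaves = index.setdefault(region, {}).setdefault(country, [])
        leaves.append({"name": code, "size": size})
    return index


def to_tree(index):
    """Convert region -> country -> leaves into the name/children layout."""
    return [
        {"name": region,
         "children": [{"name": country, "children": leaves}
                      for country, leaves in countries.items()]}
        for region, countries in index.items()
    ]


print(json.dumps(to_tree(append_index(ROWS)), indent=2))
```

Because Python dicts preserve insertion order, regions and countries come out in the order they first appear in the input.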

Please don't use regular expressions or Bash's builtin tools; they're not designed to parse or create JSON. Use a dedicated parser like xidel instead:

Assuming 'input.csv':

Africa,Kenya,NAI,109
Africa,Kenya,NAA,160
Asia,India,NSI,100
Asia,India,BSE,60
Asia,Pakistan,ISE,120
Asia,Pakistan,ANO,433
European Union,United Kingdom,LSE,550
European Union,United Kingdom,PLU,123

$ xidel -s "input.csv" -e '
  array{
    let $csv:=x:lines($raw) ! array{tokenize(.,",")}
    for $region in distinct-values($csv(1))
    return {
      "name":$region,
      "children":array{
        for $country in distinct-values($csv[.() = $region](2))
        return {
          "name":$country,
          "children":array{
            $csv[.() = ($country)] ! {
              "name":.(3),
              "size":.(4)
            }
          }
        }
      }
    }
  }
'
[
  {
    "name": "Africa",
    "children": [
      {
        "name": "Kenya",
        "children": [
          {
            "name": "NAI",
            "size": "109"
          },
          {
            "name": "NAA",
            "size": "160"
          }
        ]
      }
    ]
  },
  {
    "name": "Asia",
    "children": [
      {
        "name": "India",
        "children": [
          {
            "name": "NSI",
            "size": "100"
          },
          {
            "name": "BSE",
            "size": "60"
          }
        ]
      },
      {
        "name": "Pakistan",
        "children": [
          {
            "name": "ISE",
            "size": "120"
          },
          {
            "name": "ANO",
            "size": "433"
          }
        ]
      }
    ]
  },
  {
    "name": "European Union",
    "children": [
      {
        "name": "United Kingdom",
        "children": [
          {
            "name": "LSE",
            "size": "550"
          },
          {
            "name": "PLU",
            "size": "123"
          }
        ]
      }
    ]
  }
]

See this gist for intermediate steps leading to this query.
Also see this online xidelcgi demo.
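
For readers unfamiliar with XQuery, the shape of the query (take the distinct regions, then the distinct countries within each region, then filter the rows for each country) can be approximated in Python. This is a loose sketch, not a translation: distinct and tree_by_filtering are made-up names, and unlike the query, the filters here compare specific columns rather than any member of the row.

```python
import csv
import io
import json

# Same rows as input.csv above.
INPUT_CSV = """\
Africa,Kenya,NAI,109
Africa,Kenya,NAA,160
Asia,India,NSI,100
Asia,India,BSE,60
Asia,Pakistan,ISE,120
Asia,Pakistan,ANO,433
European Union,United Kingdom,LSE,550
European Union,United Kingdom,PLU,123
"""


def distinct(values):
    """Order-preserving de-duplication, standing in for distinct-values()."""
    return list(dict.fromkeys(values))


def tree_by_filtering(rows):
    """Build the tree by repeatedly filtering the full row list."""
    tree = []
    for region in distinct(row[0] for row in rows):
        in_region = [row for row in rows if row[0] == region]
        countries = []
        for country in distinct(row[1] for row in in_region):
            leaves = [{"name": row[2], "size": row[3]}
                      for row in in_region if row[1] == country]
            countries.append({"name": country, "children": leaves})
        tree.append({"name": region, "children": countries})
    return tree


rows = [row for row in csv.reader(io.StringIO(INPUT_CSV)) if row]
print(json.dumps(tree_by_filtering(rows), indent=2))
```

This needs no sorting and keeps input order, at the cost of rescanning the rows for every region and country, which is fine for small files.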
