3

I have a CSV-to-JSON Python script, thanks to the user Petri, that lets me convert a Geonames CSV dump into MongoImport-friendly JSON.

The problem is that Geonames has a field called alternatenames that is currently quoted and treated as one long string, so it cannot be queried properly in MongoDB. I would like to change the field to a string array such as: "alternatenames":["name1", "name2"]

The Python script looks like this:

import csv, simplejson, decimal, codecs

data = open("cities.txt")
reader = csv.DictReader(data, delimiter=",", quotechar='"')

with codecs.open("cities.json", "w", encoding="utf-8") as out:
   for r in reader:
      for k, v in r.items():
         # make sure nulls are generated
         if not v:
            r[k] = None
         # parse and generate decimal arrays
         elif k == "loc":
            r[k] = [decimal.Decimal(n) for n in v.strip("[]").split(",")]
         # generate a number
         elif k == "geonameid":
            r[k] = int(v)
      out.write(simplejson.dumps(r, ensure_ascii=False, use_decimal=True)+"\n")

My CSV has the following fields:

"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,

My current JSON output looks like this:

{"loc": [48.91667, 32.48333], "name": "Zamīn Sūkhteh", "geonameid": 3, "feature_class": "P", "admin3_code": null, "admin2_code": null, "cc2": null, "feature_code": "PPL", "country_code": "IR", "admin1_code": "15", "alternatenames": "Zamin Sukhteh,Zamīn Sūkhteh", "asciiname": "Zamin Sukhteh", "admin4_code": null}
{"loc": [48.9, 32.5], "name": "Yekāhī", "geonameid": 5, "feature_class": "P", "admin3_code": null, "admin2_code": null, "cc2": null, "feature_code": "PPL", "country_code": "IR", "admin1_code": "15", "alternatenames": "Yekahi,Yekāhī", "asciiname": "Yekahi", "admin4_code": null}
{"loc": [48.2, 32.1], "name": "Tarvīḩ ‘Adāī", "geonameid": 7, "feature_class": "P", "admin3_code": null, "admin2_code": null, "cc2": null, "feature_code": "PPL", "country_code": "IR", "admin1_code": "15", "alternatenames": "Tarvih `Adai,Tarvīḩ ‘Adāī", "asciiname": "Tarvih `Adai", "admin4_code": null}

I would like to change the JSON output to add a string array as follows (scroll to the right to alternatenames):

{"loc": [48.91667, 32.48333], "name": "Zamīn Sūkhteh", "geonameid": 3, "feature_class": "P", "admin3_code": null, "admin2_code": null, "cc2": null, "feature_code": "PPL", "country_code": "IR", "admin1_code": "15", "alternatenames": ["Zamin Sukhteh", "Zamīn Sūkhteh"], "asciiname": "Zamin Sukhteh", "admin4_code": null}
{"loc": [48.9, 32.5], "name": "Yekāhī", "geonameid": 5, "feature_class": "P", "admin3_code": null, "admin2_code": null, "cc2": null, "feature_code": "PPL", "country_code": "IR", "admin1_code": "15", "alternatenames": ["Yekahi", "Yekāhī"], "asciiname": "Yekahi", "admin4_code": null}
{"loc": [48.2, 32.1], "name": "Tarvīḩ ‘Adāī", "geonameid": 7, "feature_class": "P", "admin3_code": null, "admin2_code": null, "cc2": null, "feature_code": "PPL", "country_code": "IR", "admin1_code": "15", "alternatenames": ["Tarvih `Adai", "Tarvīḩ ‘Adāī"], "asciiname": "Tarvih `Adai", "admin4_code": null}

Also, should I change my quotechar in my Access 2010-exported CSV to ^ instead of " to avoid double quoting?

Thanks for any help.

3 Answers

2

Add another "elif" to your existing ones to handle the "alternatenames":

     elif k == "alternatenames":
        r[k] = [name.strip() for name in v.split(",")]

This first splits the string on commas, then strips any leading and trailing whitespace from each name.
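As a quick way to check the split-then-strip step outside the conversion loop, here is a minimal sketch. The helper name `split_names` is hypothetical (it is not part of the original script), and the example assumes Python 3 string handling.

```python
def split_names(value):
    """Split a comma-separated string and trim whitespace around each name."""
    return [name.strip() for name in value.split(",")]


# Whitespace around the comma is removed; a single name becomes a one-item list.
print(split_names("Zamin Sukhteh, Zamīn Sūkhteh"))  # ['Zamin Sukhteh', 'Zamīn Sūkhteh']
print(split_names("Yekahi"))                        # ['Yekahi']
```

Note that this treats every comma as a separator, so it only fits data where an individual name never contains a comma itself.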

0

I don't think your quotechar is the issue here. You will have to manually specify that you want that field to be turned into a string list.

Warning: untested code follows

elif k == "alternatenames":
    r[k] = v.split(",")

Calling split on the value itself works whether v is a byte string or a unicode string, so no adjustment is needed for ASCII data.
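On the quotechar point, a small sketch may help show why the default double quote is not the problem: the csv module already returns the quoted alternatenames value as a single field with its commas intact, and the split happens afterwards. This assumes Python 3 and swaps in the standard-library `json` module for `simplejson`; `SAMPLE` and `convert_row` are hypothetical names used only for illustration, and the sample row is cut down to three columns.

```python
import csv
import io
import json

# A cut-down sample in the same quoting style as the Access export.
SAMPLE = (
    '"geonameid","name","alternatenames"\n'
    '5,"Yekāhī","Yekahi,Yekāhī"\n'
)


def convert_row(row):
    """Return a converted copy of one CSV row (hypothetical helper)."""
    out = {}
    for key, value in row.items():
        if not value:
            out[key] = None                 # empty fields become JSON null
        elif key == "alternatenames":
            out[key] = [name.strip() for name in value.split(",")]
        elif key == "geonameid":
            out[key] = int(value)
        else:
            out[key] = value
    return out


reader = csv.DictReader(io.StringIO(SAMPLE), delimiter=",", quotechar='"')
rows = [convert_row(r) for r in reader]
line = json.dumps(rows[0], ensure_ascii=False)
print(line)
```

Because the quotes are consumed by the reader before the value reaches the loop, changing the quote character to ^ should not be needed for this conversion.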

0

Try including this:

elif k == "alternatenames":
   r[k] = v.split(",")
