I'm writing a script to process some CSV data so that I can render it on a d3.js map. The full script follows for context; the important part of my question comes after it.

# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map. 

import csv

# The number of columns
num_headers = 9

# Remove invalid characters from records
def url_escaper(data):
  for line in data:
    yield line.replace('&','&')

# Be sure to configure input & output files
with open("input.txt", 'r') as file_in, open("try_this_output.txt", 'w') as file_out:
    csv_in = csv.reader( url_escaper( file_in ) )
    csv_out = csv.writer(file_out)

    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        for e in row:
            if "|" in e:
                e = e.split("|")[0]
        if not [e for e in row if not e.strip()]:
            if len(row) == num_headers:
                csv_out.writerow(row)
        else:
            print "line %d is malformed" % i

Some of the column values are structured like this:

linux|devops|firewall|vmware|.net-framework|.net|paas

I want to slice them up using the following snippet:

e.split("|")[0]

such that I'm left only with the first part of text before the "|", i.e. in the above example linux.

I need to write this processed data to an output file.

I know that the snippet works, but I can't figure out how to fit that in with my pipeline.

This is the part that concerns me:

for i, row in enumerate(csv_in, start=1):
    for e in row:
        if "|" in e:
            e = e.split("|")[0]
    if not [e for e in row if not e.strip()]:
        if len(row) == num_headers:
            csv_out.writerow(row)
    else:
        print "line %d is malformed" % i

Particularly this:

    for e in row:
        if "|" in e:
            e = e.split("|")[0]

It's clear that isn't the way to achieve this aim.

An example of the input data is this:

http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux|devops|firewall|vmware|.net-framework|.net|paas,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership|database|project management|design|scada,1

and the ideal output would be

http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership,1

1 Answer

You could solve the problem using regular expressions.

I grabbed your input data and put it in a file named 'input.txt':

http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux|devops|firewall|vmware|.net-framework|.net|paas,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership|database|project management|design|scada,1

Your expected result:

http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership,1

Here's the code I used:

import re

# Match one comma-delimited field that contains at least one '|'
pat = re.compile(r'[^,\n]*\|[^,\n]*')

def keep_first(match):
    # Keep only the text before the first '|' in the matched field
    return match.group(0).split("|")[0]

# The with block closes both files even if an error occurs
with open('input.txt') as src, open('output.txt', 'w') as output:
    for line in src:
        # Each matching field is replaced by its own first item;
        # lines without a '|' are written through unchanged
        output.write(pat.sub(keep_first, line))

You'll need to tailor this solution to fit what you're trying to do, but the regex should work.
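
If you'd rather stay inside your existing csv pipeline, here is a rough sketch of one way to do it. The reason your loop has no effect is that `e = e.split("|")[0]` only rebinds the loop variable; the list `row` is never changed. Building a new list fixes that. The helper names `first_tag` and `clean_rows` and the `example.invalid` sample line are made up for illustration, and I'm assuming a row counts as valid when it has `num_headers` columns and no blank values, as in your script.

```python
import csv

NUM_HEADERS = 9  # same column count as num_headers in the question


def first_tag(field):
    """Return the text before the first '|', or the field unchanged."""
    return field.split("|")[0]


def clean_rows(rows, num_headers=NUM_HEADERS):
    """Yield trimmed rows, skipping rows with blank values or a wrong column count."""
    for row in rows:
        # Build a new list; assigning to the loop variable would not touch `row`
        row = [first_tag(e) for e in row]
        if len(row) == num_headers and all(e.strip() for e in row):
            yield row


# Small in-memory demo; csv.reader accepts any iterable of lines.
# The first line is a made-up record, the second has too few columns.
sample_lines = [
    "http://example.invalid/job,Consultant,Capgemini Consulting,Lee,"
    "51.451818,-0.02806,United Kingdom,leadership|database|scada,1\n",
    "too,few,columns\n",
]
cleaned = list(clean_rows(csv.reader(sample_lines)))
print(cleaned[0][7])  # leadership
```

In your script you would pass `csv_in` to `clean_rows` and call `csv_out.writerow(row)` for each row it yields. Unlike the regex version, this lets the csv module handle quoting, which matters if a field ever contains a comma.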
