
I'm web scraping different webpages, and for each webpage I'm writing one row of the CSV file:

import csv

fieldnames = ["Title", "Author", "year"]
counter = 1
for webpage in webpages:
    # On the first pass, create the file and write the header row
    if counter == 1:
        with open('file.csv', 'w', newline='') as f:
            my_writer = csv.DictWriter(f, fieldnames)
            my_writer.writeheader()

    # ... scrape the title, author and year for this webpage ...

    variables = {ele: "NA" for ele in fieldnames}
    variables['Title'] = title
    variables['Author'] = author
    variables['year'] = year

    # Reopen in append mode and add this webpage's row
    with open('file.csv', 'a', newline='') as f:
        dict_writer = csv.DictWriter(f, fieldnames)
        dict_writer.writerow(variables)
    counter += 1

However, there can be more than one author (so after web scraping, author is actually a list), and I would like the headers of the CSV file to be Author1, Author2, Author3, etc. But I don't know in advance what the maximum number of authors will be, so inside the loop I would like to edit the header, adding Author2, Author3, and so on whenever a row needs more author columns.

  • After you write the headers you can't overwrite them. You can keep all the data in memory and write everything once you have collected it all. Or write all the data to a file and, at the end, create a new file, write the headers, and copy the data over from the headerless file. You can then also fill in empty values for rows that are missing some authors (to produce a correctly formatted CSV). Commented Oct 14, 2016 at 18:58
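A minimal sketch of the first suggestion (collect everything in memory, then write once the widest row is known). Here scrape_page() is a hypothetical stand-in for whatever scraping step returns the title, author list, and year:

import csv

rows = []
max_authors = 0
for webpage in webpages:
    title, authors, year = scrape_page(webpage)  # hypothetical scraping step
    rows.append((title, authors, year))
    max_authors = max(max_authors, len(authors))

# Build the header only now that the maximum author count is known
fieldnames = ["Title"] + ["Author%d" % i for i in range(1, max_authors + 1)] + ["year"]

with open('file.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames)
    writer.writeheader()
    for title, authors, year in rows:
        row = {"Title": title, "year": year}
        for i, name in enumerate(authors, start=1):
            row["Author%d" % i] = name
        writer.writerow(row)  # missing AuthorN fields are written as empty strings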

2 Answers


Because "Author" is a variable-length list, you should serialize it in some way to fit inside a single field. For example, use a semicolon as a separator.

Assuming your webpage object has an authors attribute containing all the authors, you would change the assignment line to something like this:

variables['Author'] = ';'.join(webpage.authors)

This is a simple serialization of all the authors. You can of course come up with something else: use a different separator, or serialize to JSON or YAML, or something more elaborate.
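For instance, a quick sketch of the JSON variant (reusing the webpage.authors assumption from above):

import json

# Write: store the whole author list as one JSON string in a single CSV field
variables['Author'] = json.dumps(webpage.authors)

# Read it back later, e.g. from a row produced by csv.DictReader
authors = json.loads(row['Author'])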

Hopefully that gives you some ideas.




It could be something like:

import csv

def write_to_csv(file_name, records, fieldnames=None):
    # Default to the keys of the first record if no fieldnames are given
    with open('/tmp/' + file_name, 'w', newline='') as csvfile:
        if not fieldnames:
            fieldnames = records[0].keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for row in records:
            writer.writerow(row)

def scrape():
    for webpage in webpages:
        webpage_data = [{'title': '', 'author1': 'foo', 'author2': 'bar'}]  # sample data
        write_to_csv(webpage.title + '.csv', webpage_data, webpage_data[0].keys())

I'm assuming:

  • Data will be consistent within the same webpage, but may differ for the next webpage in the loop
  • The webpage data is a list of dictionaries, with values mapped to keys
  • The above code is based on Python 3

So in the loop, we'll just get the data and pass the relevant fieldnames and values to another function, which writes them to CSV.
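To tie this back to the variable-length author list from the question, here is one hedged sketch of how the records could be built before calling write_to_csv; with_numbered_authors() is a hypothetical helper, not part of the answer above:

def with_numbered_authors(title, year, authors):
    # Expand the author list into author1, author2, ... keys
    row = {'title': title, 'year': year}
    for i, name in enumerate(authors, start=1):
        row['author%d' % i] = name
    return row

webpage_data = [with_numbered_authors('Some Title', 2016, ['foo', 'bar'])]
write_to_csv('some_title.csv', webpage_data, webpage_data[0].keys())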

