
I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format, except that some have a different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?

For example, some files require this:

f = pd.read_csv(fname, skiprows=7, parse_dates=[0])

And some require this:

f = pd.read_csv(fname, skiprows=15, parse_dates=[0])

Here is my chunk of code looping over my files:

for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname, skiprows=15, parse_dates=[0])  # could also be 7 depending on the file
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']
  • How do you know when you reach a header? Are the rows above your headers always empty? Commented Jun 18, 2018 at 16:46
  • Yes, there are a few empty rows above the header, but the number of empty rows also changes depending on the file. I know I have reached the header when the first field of the header starts with Date/Time. Commented Jun 18, 2018 at 16:51

2 Answers


One way is to read your file using pure Python I/O to find the header's line index, then feed this into the skiprows argument of pd.read_csv.

This is fairly efficient since the first step uses a generator expression which reads only until the desired row is reached.

from io import StringIO
import pandas as pd

mystr = StringIO("""dasfaf
kgafsda


Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")

# with a real file, use: with open('file.csv') as fin: ... instead of the StringIO
idx = next(i for i, line in enumerate(mystr) if line.startswith('Date/Time'))

# rewind the buffer; with a real file, just pass the filename 'file.csv' to read_csv
mystr.seek(0)

# idx is the 0-based line number of the header row, so skip everything before it
df = pd.read_csv(mystr, skiprows=idx, parse_dates=[0])

print(df)

   Date/Time  num1  num2
0 2018-01-01     0     1
1 2018-01-02     2     3

Wrap this in a function if you need to repeat the task:

def calc_skiprows(fname):
    # return the 0-based line number of the header row
    with open(fname) as fin:
        return next(i for i, line in enumerate(fin) if line.startswith('Date/Time'))

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
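
To connect this to the loop in the question, here is a minimal sketch (assuming the stations list, glob pattern, and column names from the question, plus the calc_skiprows helper above):

import glob
import pandas as pd

for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        # compute the number of rows to skip for this particular file, then read it
        f = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']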

8 Comments

Thank you, but the rows before the headers aren't all blank lines; only a couple of the rows before the header are blank and the rest have words in them.
Is there a typo in the 5th line?
@HM14, sorry - fixed.
I am a bit confused by this. Would this all go in a loop when looping over files? I have edited my question to include the loop I am using to loop over my files.
@HM14, Just wrap the logic in a function, see update.

The first suggestion/answer seemed like a really good way to handle it, but I couldn't get it to work for me for some reason. I did find another way to fix my problem using try and except in Python:

for name, ID in stations:
    # read in each station's .csv files, concatenate together, insert station id column
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
        except:
            f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']

This way, if the first attempt to read in the file fails (skipping 7 rows), it tries again with the other read_csv call (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but it works for my needs right now.
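
For illustration, the same idea could be made a bit less hard-coded by trying a small list of candidate skip counts and keeping the first read whose header actually contains the Date/Time column. This is only a sketch; the candidate values, the column name, and the read_station_csv name are taken or assumed from the question:

import pandas as pd

def read_station_csv(fname, candidates=(7, 15)):
    # try each candidate skip count and return the first result whose
    # header contains the expected 'Date/Time' column
    for n in candidates:
        try:
            f = pd.read_csv(fname, skiprows=n, parse_dates=[0])
        except pd.errors.ParserError:
            continue
        if 'Date/Time' in f.columns:
            return f
    raise ValueError('could not find the Date/Time header in ' + fname)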

