
I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format, except that some have a different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?

For example, some files require this:

f = pd.read_csv(fname, skiprows=7, parse_dates=[0])

And some require this:

f = pd.read_csv(fname, skiprows=15, parse_dates=[0])

Here is my chunk of code looping over my files:

for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname, skiprows=15, parse_dates=[0])  # could also be 7 depending on the file
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']
  • How do you know when you reach a header? Are the rows above your headers always empty? Commented Jun 18, 2018 at 16:46
  • Yes, there are a few empty rows above the header, but the number of empty rows also changes depending on the file. I know I have reached the header when the first field of the header starts with Date/Time. Commented Jun 18, 2018 at 16:51

2 Answers


One way is to read your file using pure Python I/O to find the header's line index, then feed this into the skiprows argument of pd.read_csv.

This is fairly efficient since the first step uses a generator expression which reads only until the desired row is reached.

from io import StringIO
import pandas as pd

mystr = StringIO("""dasfaf
kgafsda


Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")

# with a real file, use: with open('file.csv') as fin: ... instead of the StringIO
idx = next(i for i, line in enumerate(mystr) if line.startswith('Date/Time'))

# rewind the buffer; with a real file, just pass the filename 'file.csv' to read_csv
mystr.seek(0)

# idx is the 0-based line number of the header row, so skip everything before it
df = pd.read_csv(mystr, skiprows=idx, parse_dates=[0])

print(df)

   Date/Time  num1  num2
0 2018-01-01     0     1
1 2018-01-02     2     3

Wrap this in a function if you need to repeat the task:

def calc_skiprows(fname):
    # return the 0-based line number of the header row
    with open(fname) as fin:
        return next(i for i, line in enumerate(fin) if line.startswith('Date/Time'))

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
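
To connect this to the loop in the question, here is a minimal sketch (assuming the stations list, glob pattern, and column names from the question, plus the calc_skiprows helper above):

import glob
import pandas as pd

for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        # compute the number of rows to skip for this particular file, then read it
        f = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']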

8 Comments

Thank you, but the rows before the headers aren't all blank lines; only a couple of the rows before the header are blank and the rest have words in them.
Is there a typo in the 5th line?
@HM14, sorry - fixed.
I am a bit confused by this. Would this all go in a loop when looping over files? I have edited my question to include the loop I am using to loop over my files.
@HM14, Just wrap the logic in a function, see update.

The first suggestion/answer seemed like a really good way to handle it, but I couldn't get it to work for me for some reason. I did find another way to fix my problem using try and except in Python:

for name, ID in stations:
    # read in each station's .csv files, concatenate together, insert station id column
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
        except:
            f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert from km/h to m/s
        dt = f['Date/Time']

This way, if the first attempt to read in the file fails (skipping 7 rows), it tries again with the other read_csv call (skipping 15 rows). This is not 100% correct since I am still hardcoding the number of lines to skip, but it works for my needs right now.
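
For illustration, the same idea could be made a bit less hard-coded by trying a small list of candidate skip counts and keeping the first read whose header actually contains the Date/Time column. This is only a sketch; the candidate values, the column name, and the read_station_csv name are taken or assumed from the question:

import pandas as pd

def read_station_csv(fname, candidates=(7, 15)):
    # try each candidate skip count and return the first result whose
    # header contains the expected 'Date/Time' column
    for n in candidates:
        try:
            f = pd.read_csv(fname, skiprows=n, parse_dates=[0])
        except pd.errors.ParserError:
            continue
        if 'Date/Time' in f.columns:
            return f
    raise ValueError('could not find the Date/Time header in ' + fname)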

