
I ran into some issues while trying to sort the dataframe. The API my code calls only returns 1000 rows at a time and then sends a continuation URL, which my script follows with the while loop. The problem is that on each pass I have it writing and appending to the CSV. That worked fine, but now that I need to sort the whole dataframe it is an issue.

How can I have it write to a dataframe on each pass and then write the dataframe to the CSV at the end? Should I append to the dataframe on each loop, or should I make a new dataframe on each pass and then combine them all at the end somehow?

import requests
import json
import pandas as pd
import time
import os
from itertools import product

#what I need to loop through
instrument = ('btc-usd')
exchange = ('cbse')  
interval = ('1m','3m')  
start_time = '2021-01-14T00:00:00Z'
end_time = '2021-01-16T23:59:59Z'


for (interval) in product(interval):
    page_size = '1000'
    url = f'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/{exchange}/spot/{instrument}/aggregations/count_ohlcv_vwap'
    #params = {'interval': interval, 'page_size': page_size, 'start_time': start_time, 'end_time': end_time }
    params = {'interval': interval, 'page_size': page_size }
    KEY = 'xxx'
    headers = {
        "X-Api-Key": KEY,
        "Accept": "application/json",
        "Accept-Encoding": "gzip"
    }

    csv_file = f"{exchange}-{instrument}-{interval}.csv"
    c_token = True

    while(c_token):
        res = requests.get(url, params=params, headers=headers)
        j_data = res.json()
        parse_data = j_data['data']
        c_token = j_data.get('continuation_token')
        today = time.strftime("%Y-%m-%d")
        params = {'continuation_token': c_token}

        if c_token:   
            url = f'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/cbse/spot/btc-usd/aggregations/count_ohlcv_vwap?continuation_token={c_token}'        

        # create dataframe
        df = pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns')
        df.insert(1, 'time', pd.to_datetime(df.timestamp.astype(int),unit='ms'))          
        df['range'] = df['high'].astype(float) - df['low'].astype(float)
        df.range = df.range.astype(float)

        #sort
        df = df.sort_values(by='range')
        
        #that means file already exists need to append
        if(csv_file in os.listdir()): 
            csv_string = df.to_csv(index=False, encoding='utf-8', header=False)
            with open(csv_file, 'a') as f:
                f.write(csv_string)
        #that means writing file for the first time        
        else: 
            csv_string = df.to_csv(index=False, encoding='utf-8')
            with open(csv_file, 'w') as f:
                f.write(csv_string)

2 Answers


Perhaps the cleanest and most efficient way is to make an empty dataframe and then append to it.

import requests
import json
import pandas as pd
import time
import os
from itertools import product

#what I need to loop through
instruments = ('btc-usd',)
exchanges = ('cbse',)
intervals = ('1m','3m')  
start_time = '2021-01-14T00:00:00Z'
end_time = '2021-01-16T23:59:59Z'
params = {'page_size': 1000}
KEY = 'xxx'
    
headers = {
    "X-Api-Key": KEY,
    "Accept": "application/json",
    "Accept-Encoding": "gzip"
}

for instrument, exchange, interval in product(instruments, exchanges, intervals):
    params['interval'] = interval
    url = f'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/{exchange}/spot/{instrument}/aggregations/count_ohlcv_vwap'
    csv_file = f"{exchange}-{instrument}-{interval}.csv"
    df = pd.DataFrame()   # start with empty dataframe

    while True:
        res = requests.get(url, params=params, headers=headers)
        j_data = res.json()
        parse_data = j_data['data']
        df = df.append(pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns'))  # append to the dataframe
        if 'continuation_token' in j_data:
            params['continuation_token'] = j_data['continuation_token']
        else:
            break
        
    # These parts can be done outside of the while loop, once all the data has been compiled
    df.insert(1, 'time', pd.to_datetime(df.timestamp.astype(int),unit='ms'))          
    df['range'] = df['high'].astype(float) - df['low'].astype(float)
    df.range = df.range.astype(float)
    df = df.sort_values(by='range')
    df.to_csv(csv_file, index=False, encoding='utf-8')  # write the whole CSV at once
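Side note: DataFrame.append has been removed in newer pandas releases (2.0+). If you are on a recent version, the same idea works by collecting each page in a list and concatenating once after the loop, roughly like this:

frames = []
while True:
    res = requests.get(url, params=params, headers=headers)
    j_data = res.json()
    frames.append(pd.json_normalize(j_data['data']))  # keep each page as its own frame
    if 'continuation_token' in j_data:
        params['continuation_token'] = j_data['continuation_token']
    else:
        break
df = pd.concat(frames, ignore_index=True)  # combine all pages at once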

If the size of the combined dataframe is too large for memory, then you could instead read in one page at a time and append it to the CSV, provided the column headings are the same on each page. (You might still need to take care that pandas writes the columns in the same order each time.)
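A rough sketch of that streaming approach, reusing the same url, params, headers and csv_file from above, might look like this. Note that with this approach the final sort by range has to happen separately (for example by reading the finished CSV back in), since rows are written as they arrive:

first_page = True
columns = None

while True:
    res = requests.get(url, params=params, headers=headers)
    j_data = res.json()
    page_df = pd.json_normalize(j_data['data'])

    if columns is None:
        columns = list(page_df.columns)   # lock the column order on the first page

    page_df.to_csv(csv_file,
                   mode='w' if first_page else 'a',
                   header=first_page,     # write the header only once
                   columns=columns,       # same column order for every chunk
                   index=False, encoding='utf-8')
    first_page = False

    if 'continuation_token' in j_data:
        params['continuation_token'] = j_data['continuation_token']
    else:
        break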


5 Comments

Thanks for the reply. I tried running this but I get this error: Traceback (most recent call last): File "kaiko-df.py", line 40, in <module> df.insert(1, 'time', pd.to_datetime(df.timestamp.astype(int),unit='ms')) File "/home/robothead/scripts/python/venvs/kaiko/lib/python3.6/site-packages/pandas/core/generic.py", line 5141, in getattr return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'timestamp'. I've gotten this before when I was trying to get the continuation URL to work, not sure why it's doing it now.
That will only work if your data has a column labelled timestamp - would have to see the raw data to understand what might be going wrong. Try running without that line and look at the resulting dataframe to see if it is in the right shape.
Ok, I commented out that line and then got this error (posted in the code). There is a column for timestamp, high, and low. The original code worked; it produced a full CSV, so I'm certain all those columns exist.
I got it, it was the for statement: since exchange and instrument only have one item in the list, it messes up. I changed it back to for (interval) in product(interval): and now your code works! Thank you, this is showing me a lot. The CSV write worked, but I think this would be better.
ok! but for (interval) in product(interval) doesn't make much sense! maybe for interval in intervals?

You can use df.loc and len to add a list of values as a new row:

win_results_df = pd.DataFrame(columns=['GameId', 'Team', 'TeamOpponent',
                                       'HomeScore', 'VisitorScore', 'Target'])

df_length = len(win_results_df)
win_results_df.loc[df_length] = [teamOpponent['gameId'], key,
                                 teamOpponent['visitorDisplayName'],
                                 teamOpponent['HomeScore'],
                                 teamOpponent['VisitorScore'], True]
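
For example, a minimal self-contained version of this pattern (the column names and rows here are made up for illustration) would be:

import pandas as pd

df = pd.DataFrame(columns=['GameId', 'Team', 'Score'])

# add one row at a time at position len(df)
for game_id, team, score in [(1, 'A', 3), (2, 'B', 5)]:
    df.loc[len(df)] = [game_id, team, score]

print(df)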

