
I ran into some issues while trying to sort the dataframe. The API my code calls only returns 1000 rows at a time and then sends a continuation URL, which my script follows with the while loop. The problem is that on each pass I have it writing and appending to the CSV. That worked fine, but now that I need to sort the whole dataframe it is an issue.

How can I have it write to a dataframe on each pass and then write the dataframe to the CSV at the end? Should I append to the dataframe on each loop, or should I make a new dataframe on each pass and then combine them all at the end somehow?

import requests
import json
import pandas as pd
import time
import os
from itertools import product

#what I need to loop through
instrument = ('btc-usd')
exchange = ('cbse')  
interval = ('1m','3m')  
start_time = '2021-01-14T00:00:00Z'
end_time = '2021-01-16T23:59:59Z'


for (interval) in product(interval):
    page_size = '1000'
    url = f'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/{exchange}/spot/{instrument}/aggregations/count_ohlcv_vwap'
    #params = {'interval': interval, 'page_size': page_size, 'start_time': start_time, 'end_time': end_time }
    params = {'interval': interval, 'page_size': page_size }
    KEY = 'xxx'
    headers = {
        "X-Api-Key": KEY,
        "Accept": "application/json",
        "Accept-Encoding": "gzip"
    }

    csv_file = f"{exchange}-{instrument}-{interval}.csv"
    c_token = True

    while(c_token):
        res = requests.get(url, params=params, headers=headers)
        j_data = res.json()
        parse_data = j_data['data']
        c_token = j_data.get('continuation_token')
        today = time.strftime("%Y-%m-%d")
        params = {'continuation_token': c_token}

        if c_token:   
            url = f'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/cbse/spot/btc-usd/aggregations/count_ohlcv_vwap?continuation_token={c_token}'        

        # create dataframe
        df = pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns')
        df.insert(1, 'time', pd.to_datetime(df.timestamp.astype(int),unit='ms'))          
        df['range'] = df['high'].astype(float) - df['low'].astype(float)
        df.range = df.range.astype(float)

        #sort
        df = df.sort_values(by='range')
        
        #that means file already exists need to append
        if(csv_file in os.listdir()): 
            csv_string = df.to_csv(index=False, encoding='utf-8', header=False)
            with open(csv_file, 'a') as f:
                f.write(csv_string)
        #that means writing file for the first time        
        else: 
            csv_string = df.to_csv(index=False, encoding='utf-8')
            with open(csv_file, 'w') as f:
                f.write(csv_string)

2 Answers


Perhaps the cleanest and most efficient way is to make an empty dataframe and then append to it.

import requests
import json
import pandas as pd
import time
import os
from itertools import product

#what I need to loop through
instruments = ('btc-usd',)
exchanges = ('cbse',)
intervals = ('1m','3m')  
start_time = '2021-01-14T00:00:00Z'
end_time = '2021-01-16T23:59:59Z'
params = {'page_size': 1000}
KEY = 'xxx'
    
headers = {
    "X-Api-Key": KEY,
    "Accept": "application/json",
    "Accept-Encoding": "gzip"
}

for instrument, exchange, interval in product(instruments, exchanges, intervals):
    params['interval'] = interval
    url = f'https://us.market-api.kaiko.io/v2/data/trades.v1/exchanges/{exchange}/spot/{instrument}/aggregations/count_ohlcv_vwap'
    csv_file = f"{exchange}-{instrument}-{interval}.csv"
    df = pd.DataFrame()   # start with empty dataframe

    while True:
        res = requests.get(url, params=params, headers=headers)
        j_data = res.json()
        parse_data = j_data['data']
        df = df.append(pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns'))  # append to the dataframe
        if 'continuation_token' in j_data:
            params['continuation_token'] = j_data['continuation_token']
        else:
            break
        
    # These parts can be done outside of the while loop, once all the data has been compiled
    df.insert(1, 'time', pd.to_datetime(df.timestamp.astype(int),unit='ms'))          
    df['range'] = df['high'].astype(float) - df['low'].astype(float)
    df.range = df.range.astype(float)
    df = df.sort_values(by='range')
    df.to_csv(csv_file, index=False, encoding='utf-8')  # write the whole CSV at once
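Side note: DataFrame.append has been removed in newer pandas releases (2.0+). If you are on a recent version, the same idea works by collecting each page in a list and concatenating once after the loop, roughly like this:

frames = []
while True:
    res = requests.get(url, params=params, headers=headers)
    j_data = res.json()
    frames.append(pd.json_normalize(j_data['data']))  # keep each page as its own frame
    if 'continuation_token' in j_data:
        params['continuation_token'] = j_data['continuation_token']
    else:
        break
df = pd.concat(frames, ignore_index=True)  # combine all pages at once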

If the size of the combined dataframe is too large for memory, then you could instead read in one page at a time and append it to the CSV, provided the column headings are the same on each page. (You might still need to take care that pandas writes the columns in the same order each time.)
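A rough sketch of that streaming approach, reusing the same url, params, headers and csv_file from above, might look like this. Note that with this approach the final sort by range has to happen separately (for example by reading the finished CSV back in), since rows are written as they arrive:

first_page = True
columns = None

while True:
    res = requests.get(url, params=params, headers=headers)
    j_data = res.json()
    page_df = pd.json_normalize(j_data['data'])

    if columns is None:
        columns = list(page_df.columns)   # lock the column order on the first page

    page_df.to_csv(csv_file,
                   mode='w' if first_page else 'a',
                   header=first_page,     # write the header only once
                   columns=columns,       # same column order for every chunk
                   index=False, encoding='utf-8')
    first_page = False

    if 'continuation_token' in j_data:
        params['continuation_token'] = j_data['continuation_token']
    else:
        break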


5 Comments

Thanks for the reply. I tried running this but I get this error: Traceback (most recent call last): File "kaiko-df.py", line 40, in <module> df.insert(1, 'time', pd.to_datetime(df.timestamp.astype(int),unit='ms')) File "/home/robothead/scripts/python/venvs/kaiko/lib/python3.6/site-packages/pandas/core/generic.py", line 5141, in getattr return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'timestamp'. I've gotten this before when I was trying to get the continuation URL to work, not sure why it's doing it now.
That will only work if your data has a column labelled timestamp - would have to see the raw data to understand what might be going wrong. Try running without that line and look at the resulting dataframe to see if it is in the right shape.
Ok, I commented out that line and then got this error (posted in the code). There is a column for timestamp, high, and low. The original code worked; it produced a full CSV, so I'm certain all those columns exist.
I got it, it was the for statement: since exchange and instrument only have one item in the list, it messes up. I changed it back to for (interval) in product(interval): and now your code works! Thank you, this is showing me a lot. The CSV write worked, but I think this would be better.
ok! but for (interval) in product(interval) doesn't make much sense! maybe for interval in intervals?

You can use df.loc and len to add a list of values as a new row:

win_results_df = pd.DataFrame(columns=['GameId', 'Team', 'TeamOpponent',
                                       'HomeScore', 'VisitorScore', 'Target'])

df_length = len(win_results_df)
win_results_df.loc[df_length] = [teamOpponent['gameId'], key,
                                 teamOpponent['visitorDisplayName'],
                                 teamOpponent['HomeScore'],
                                 teamOpponent['VisitorScore'], True]
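
For example, a minimal self-contained version of this pattern (the column names and rows here are made up for illustration) would be:

import pandas as pd

df = pd.DataFrame(columns=['GameId', 'Team', 'Score'])

# add one row at a time at position len(df)
for game_id, team, score in [(1, 'A', 3), (2, 'B', 5)]:
    df.loc[len(df)] = [game_id, team, score]

print(df)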

