
My data is 88200 (rows) × 29403 (columns), approximately 14 GB. It was created in MATLAB using dlmwrite. I have tried the following methods to read the file in Python, and in every attempt I ran out of memory:

My system: Ubuntu 16.04, 32 GB RAM, 20 GB swap, Python 2.7.12, pandas 0.19, GCC 5.4.0.
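
A rough back-of-the-envelope estimate of the final array size (assuming float64, which the csv/loadtxt/read_csv approaches below produce by default, versus the float32 I actually need):

rows, cols = 88200, 29403
# size of the final array alone, before any intermediate copies made while parsing
print(rows * cols * 8 / 1e9)  # ~20.7 GB if parsed as float64 (the default)
print(rows * cols * 4 / 1e9)  # ~10.4 GB if stored as float32

So even the float32 version only fits in 32 GB if the reader avoids building large intermediate lists or a float64 copy first.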

1> Using csv:

import csv
import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')

2a> Using numpy.loadtxt:

import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
data = numpy.loadtxt(raw_data, delimiter=",")

2b> Using numpy.genfromtxt:

import numpy as np
x = np.genfromtxt('vectorized_image_dataset.txt', skip_header=0, skip_footer=0, delimiter=',', dtype='float32')

3> Using pandas.read_csv:

from pandas import read_csv, concat

tp = read_csv(filepath_or_buffer='data.txt', header=None, iterator=True, chunksize=1000)
df = concat(tp, ignore_index=True)

All of the above methods ran out of memory.

The data file was created with dlmwrite in MATLAB: a list of images (list.txt) is read one by one, and each image is vectorized into a row, converted to single precision, divided by 256, and appended to the file. The code is below:

fileID = fopen('list.txt');
N = 88200;
C = textscan(fileID, '%s');
fclose(fileID);

for i = 1:N
    A = imread(C{1}{i});
    % convert the image to a column vector
    B = A(:);
    % convert the above vector to a row
    D = B';
    % divide by 256
    % E = double(D)/double(256);
    E = single(D)/single(256);
    dlmwrite('vectorized_image_dataset.txt', E, '-append');
    clear A; clear B; clear D; clear E;
end
  • Have you tried reading the file line by line? Open it using with open("data.txt", "r") as f: and then process one line at a time with a for loop: for line in f: . Commented Oct 29, 2016 at 10:18
  • I need the whole data in a numpy array. If I read line by line, I would have to append the data for each new line to the numpy array, which means resizing the array on every iteration. In MATLAB array resizing is very slow; I guess it will be slow in numpy as well? Anyway, I will give it a try. Commented Oct 29, 2016 at 10:26
  • Instead of appending one array row per line, try reading chunks of the data in a loop (halves or quarters) and concatenate the arrays afterwards. Commented Oct 29, 2016 at 10:40
  • If the target is a numpy.array and that doesn't fit into memory, all the suggestions about how to better read the file will not help. You might want to look at numpy.memmap and/or PyTables (a rough memmap sketch follows these comments). Commented Oct 29, 2016 at 12:11
  • Are you using 32-bit Python or 64-bit Python? Commented Oct 31, 2016 at 19:40
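
Following up on the numpy.memmap suggestion, a minimal sketch (assuming the 88200×29403 shape and comma-separated data.txt from the question; data.dat is just a placeholder output filename): each text line is parsed on its own and written straight into a disk-backed float32 array, so only one row at a time has to live in RAM.

import numpy as np

rows, cols = 88200, 29403

# disk-backed float32 array (~10.4 GB file); 'data.dat' is a placeholder name
out = np.memmap('data.dat', dtype=np.float32, mode='w+', shape=(rows, cols))

with open('data.txt', 'r') as f:
    for i, line in enumerate(f):
        # parse one comma-separated row and write it into the memmapped array
        out[i, :] = np.fromstring(line, dtype=np.float32, sep=',')

# push any pending writes out to disk
out.flush()

The resulting file can later be reopened with np.memmap(..., mode='r', shape=(rows, cols)) and sliced without loading everything at once.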

2 Answers

def read_line_by_line(file_path: str):
    # yield one line at a time instead of loading the whole file
    with open(file_path) as file:
        for line in file:
            yield line

Maybe this function will help you. I am not very familiar with NumPy/pandas, but it seems like you are trying to load all the data at once and store it in memory. With the function above, you use a generator that yields only one line at a time, so there is no need to hold the whole file in RAM.
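
If the whole dataset does have to end up in one NumPy array, a possible way to use this generator (a sketch, assuming the 88200×29403 shape and float32 values from the question) is to preallocate the array once and fill it row by row, so the raw text is never held in memory all at once:

import numpy as np

rows, cols = 88200, 29403

# allocate the target array once (~10.4 GB as float32)
data = np.empty((rows, cols), dtype=np.float32)

for i, line in enumerate(read_line_by_line('data.txt')):
    # parse one comma-separated line and store it in row i
    data[i, :] = np.fromstring(line, dtype=np.float32, sep=',')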


1 Comment

But people normally load data into numpy arrays or pandas because they want to work with all of it at once, or at least with many lines. Functions like np.genfromtxt do read the data line by line, but they collect those lines into a list of lists and then build an array. They may want, for example, to take the mean over one or more of the columns.

I solved it using pandas.read_csv. I broke data.txt up into four pieces of 22050 lines each, then did:

tp1 = read_csv(filepath_or_buffer='data_first_22050.txt', header=None, iterator=True, chunksize=1000)
df1 = concat(tp1, ignore_index=True)
tp2 = read_csv(filepath_or_buffer='data_second_22050.txt', header=None, iterator=True, chunksize=1000)
df2 = concat(tp2, ignore_index=True)
frames = [df1, df2]
result = concat(frames)
del frames, df1, df2, tp1, tp2
tp3 = read_csv(filepath_or_buffer='data_third_22050.txt', header=None, iterator=True, chunksize=1000)
df3 = concat(tp3, ignore_index=True)
frames = [result, df3]
result2 = concat(frames)
del frames, df3, tp3, result
tp4 = read_csv(filepath_or_buffer='data_fourth_22050.txt', header=None, iterator=True, chunksize=1000)
df4 = concat(tp4, ignore_index=True)
frames = [result2, df4]
result3 = concat(frames)
del frames, tp4, df4, result2
A = result3.as_matrix()
A.shape

(88200, 29403)
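
For what it's worth, the same four-piece procedure can be written as a loop, so each piece is read, appended to the running result, and its temporaries released before the next piece is loaded (a sketch using the same file names and pandas 0.19 calls as above):

from pandas import read_csv, concat

pieces = ['data_first_22050.txt', 'data_second_22050.txt',
          'data_third_22050.txt', 'data_fourth_22050.txt']

result = None
for name in pieces:
    tp = read_csv(filepath_or_buffer=name, header=None, iterator=True, chunksize=1000)
    df = concat(tp, ignore_index=True)
    # grow the result one piece at a time, then drop the temporaries
    result = df if result is None else concat([result, df], ignore_index=True)
    del tp, df

A = result.as_matrix()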

