
My data is 88200 (rows) × 29403 (columns), approximately 14 GB. It was created in MATLAB using dlmwrite. I have tried the following methods to read the file in Python, and in every attempt I ran out of memory:

My system: Ubuntu 16.04, 32 GB RAM, 20 GB swap, Python 2.7.12, pandas 0.19, GCC 5.4.0.
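
A rough back-of-the-envelope estimate of the final array size (assuming float64, which the csv/loadtxt/read_csv approaches below produce by default, versus the float32 I actually need):

rows, cols = 88200, 29403
# size of the final array alone, before any intermediate copies made while parsing
print(rows * cols * 8 / 1e9)  # ~20.7 GB if parsed as float64 (the default)
print(rows * cols * 4 / 1e9)  # ~10.4 GB if stored as float32

So even the float32 version only fits in 32 GB if the reader avoids building large intermediate lists or a float64 copy first.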

1> Using csv:

import csv
import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')

2a> Using numpy.loadtxt:

import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
data = numpy.loadtxt(raw_data, delimiter=",")

2b> Using numpy.genfromtxt:

import numpy as np
x = np.genfromtxt('vectorized_image_dataset.txt', skip_header=0, skip_footer=0, delimiter=',', dtype='float32')

3> Using pandas.read_csv:

from pandas import read_csv, concat

tp = read_csv(filepath_or_buffer='data.txt', header=None, iterator=True, chunksize=1000)
df = concat(tp, ignore_index=True)

All of the above methods ran out of memory.

The data file was created with dlmwrite in MATLAB: a list of images (list.txt) is read one by one, and each image is vectorized into a row, converted to single precision, divided by 256, and appended to the file. The code is below:

fileID = fopen('list.txt');
N = 88200;
C = textscan(fileID, '%s');
fclose(fileID);

for i = 1:N
    A = imread(C{1}{i});
    % convert the image to a column vector
    B = A(:);
    % convert the above vector to a row
    D = B';
    % divide by 256
    % E = double(D)/double(256);
    E = single(D)/single(256);
    dlmwrite('vectorized_image_dataset.txt', E, '-append');
    clear A; clear B; clear D; clear E;
end
  • Have you tried reading the file line by line? Open it using with open("data.txt", "r") as f: and then process one line at a time with a for loop: for line in f: . Commented Oct 29, 2016 at 10:18
  • I need the whole data in a numpy array. If I read line by line, I would have to append the data for each new line to the numpy array, which means resizing the array on every iteration. In MATLAB array resizing is very slow; I guess it will be slow in numpy as well? Anyway, I will give it a try. Commented Oct 29, 2016 at 10:26
  • Instead of appending one array row per line, try reading chunks of the data in a loop (halves or quarters) and concatenate the arrays afterwards. Commented Oct 29, 2016 at 10:40
  • If the target is a numpy.array and that doesn't fit into memory, all the suggestions about how to better read the file will not help. You might want to look at numpy.memmap and/or PyTables (a rough memmap sketch follows these comments). Commented Oct 29, 2016 at 12:11
  • Are you using 32-bit Python or 64-bit Python? Commented Oct 31, 2016 at 19:40
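
Following up on the numpy.memmap suggestion, a minimal sketch (assuming the 88200×29403 shape and comma-separated data.txt from the question; data.dat is just a placeholder output filename): each text line is parsed on its own and written straight into a disk-backed float32 array, so only one row at a time has to live in RAM.

import numpy as np

rows, cols = 88200, 29403

# disk-backed float32 array (~10.4 GB file); 'data.dat' is a placeholder name
out = np.memmap('data.dat', dtype=np.float32, mode='w+', shape=(rows, cols))

with open('data.txt', 'r') as f:
    for i, line in enumerate(f):
        # parse one comma-separated row and write it into the memmapped array
        out[i, :] = np.fromstring(line, dtype=np.float32, sep=',')

# push any pending writes out to disk
out.flush()

The resulting file can later be reopened with np.memmap(..., mode='r', shape=(rows, cols)) and sliced without loading everything at once.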

2 Answers

def read_line_by_line(file_path: str):
    # yield one line at a time instead of loading the whole file
    with open(file_path) as file:
        for line in file:
            yield line

Maybe this function will help you. I am not very familiar with NumPy/pandas, but it seems like you are trying to load all the data at once and store it in memory. With the function above, you use a generator that yields only one line at a time, so there is no need to hold the whole file in RAM.
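
If the whole dataset does have to end up in one NumPy array, a possible way to use this generator (a sketch, assuming the 88200×29403 shape and float32 values from the question) is to preallocate the array once and fill it row by row, so the raw text is never held in memory all at once:

import numpy as np

rows, cols = 88200, 29403

# allocate the target array once (~10.4 GB as float32)
data = np.empty((rows, cols), dtype=np.float32)

for i, line in enumerate(read_line_by_line('data.txt')):
    # parse one comma-separated line and store it in row i
    data[i, :] = np.fromstring(line, dtype=np.float32, sep=',')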


1 Comment

But people normally load data into numpy arrays or pandas because they want to work with all of it at once, or at least with many lines. Functions like np.genfromtxt do read the data line by line, but they collect those lines into a list of lists and then build an array. They may want, for example, to take the mean over one or more of the columns.

I solved it using pandas.read_csv. I broke data.txt up into four pieces of 22050 lines each, then did:

tp1 = read_csv(filepath_or_buffer='data_first_22050.txt', header=None, iterator=True, chunksize=1000)
df1 = concat(tp1, ignore_index=True)
tp2 = read_csv(filepath_or_buffer='data_second_22050.txt', header=None, iterator=True, chunksize=1000)
df2 = concat(tp2, ignore_index=True)
frames = [df1, df2]
result = concat(frames)
del frames, df1, df2, tp1, tp2
tp3 = read_csv(filepath_or_buffer='data_third_22050.txt', header=None, iterator=True, chunksize=1000)
df3 = concat(tp3, ignore_index=True)
frames = [result, df3]
result2 = concat(frames)
del frames, df3, tp3, result
tp4 = read_csv(filepath_or_buffer='data_fourth_22050.txt', header=None, iterator=True, chunksize=1000)
df4 = concat(tp4, ignore_index=True)
frames = [result2, df4]
result3 = concat(frames)
del frames, tp4, df4, result2
A = result3.as_matrix()
A.shape

(88200, 29403)
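
For what it's worth, the same four-piece procedure can be written as a loop, so each piece is read, appended to the running result, and its temporaries released before the next piece is loaded (a sketch using the same file names and pandas 0.19 calls as above):

from pandas import read_csv, concat

pieces = ['data_first_22050.txt', 'data_second_22050.txt',
          'data_third_22050.txt', 'data_fourth_22050.txt']

result = None
for name in pieces:
    tp = read_csv(filepath_or_buffer=name, header=None, iterator=True, chunksize=1000)
    df = concat(tp, ignore_index=True)
    # grow the result one piece at a time, then drop the temporaries
    result = df if result is None else concat([result, df], ignore_index=True)
    del tp, df

A = result.as_matrix()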

