My data has 88200 rows and 29403 columns (approximately 14 GB on disk). It was created in MATLAB using dlmwrite. I have tried the following methods to read the file in Python, and in every attempt I ran out of memory:
My system: Ubuntu 16.04, 32 GB RAM, 20 GB swap, Python 2.7.12, pandas 0.19, GCC 5.4.0.
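For scale, a rough back-of-the-envelope estimate: 88200 * 29403 is about 2.59 billion values, i.e. roughly 10.4 GB as a dense float32 array and about 20.7 GB as float64, before counting any parsing overhead or intermediate Python objects.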
1> using csv:
import csv
import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)                       # list of 88200 lists of string fields
data = numpy.array(x).astype('float')  # 'float' here means float64
2a> using numpy loadtxt:
import numpy
filename = 'data.txt'
raw_data = open(filename, 'rb')
data = numpy.loadtxt(raw_data, delimiter=",")
2b> using numpy genfromtxt:
import numpy as np
x = np.genfromtxt('vectorized_image_dataset.txt', skip_header=0, skip_footer=0, delimiter=',', dtype='float32')
3> using pandas.read_csv:
import pandas as pd
import numpy as np
tp = pd.read_csv(filepath_or_buffer='data.txt', header=None, iterator=True, chunksize=1000)
df = pd.concat(tp, ignore_index=True)
All of the above methods ran out of memory.
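One idea I am considering (an untested sketch, so the details may be off) is to keep the chunked pandas read but, instead of concatenating, write each chunk into a preallocated float32 numpy.memmap on disk, so only one chunk is held in RAM at a time. The output file name and chunk size below are just placeholders:

import numpy as np
import pandas as pd

rows, cols = 88200, 29403                      # known shape of the dataset
# on-disk float32 array (file name is arbitrary), about 10.4 GB
out = np.memmap('data_float32.dat', dtype='float32', mode='w+', shape=(rows, cols))

start = 0
# read 1000 rows at a time; only the current chunk is in memory
for chunk in pd.read_csv('data.txt', header=None, dtype=np.float32, chunksize=1000):
    out[start:start + len(chunk), :] = chunk.values
    start += len(chunk)

out.flush()                                    # make sure everything is written to disk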
For reference, the data file was created with MATLAB's dlmwrite: a list of images (list.txt) is read one file at a time, each image is converted to float, vectorized into a row, and appended to the output file. The code is below:
fileID = fopen('list.txt');
N = 88200;
C = textscan(fileID, '%s');
fclose(fileID);
for i = 1:N
    A = imread(C{1}{i});
    % convert the image to a column vector
    B = A(:);
    % convert the column vector to a row
    D = B';
    % divide by 256
    %E = double(D)/double(256);
    E = single(D)/single(256);
    dlmwrite('vectorized_image_dataset.txt', E, '-append');
    clear A B D E;
end
with open("data.txt", "r") as f:"and then process each line at a time using a for loop:for line in f:.