
I grabbed the KDD Track 1 dataset from Kaggle and decided to load a ~2.5 GB, 3-column CSV file into memory on my 16 GB high-memory EC2 instance:

data = np.loadtxt('rec_log_train.txt')

The Python session ate up all my memory (100%) and was then killed.

I then read the same file using R (via read.table); it used less than 5 GB of RAM, which collapsed to less than 2 GB after I called the garbage collector.

My question is: why did this fail under numpy, and what's the proper way of reading a file into memory? Yes, I can use generators and avoid the problem, but that's not the goal.

2 Comments
  • related stackoverflow.com/questions/8956832/… Commented Apr 22, 2012 at 3:02
  • If single precision will do, np.fromfile / np.loadtxt(dtype=np.float32) will take less memory; then X = X.astype(np.float64) when done. Commented Jul 30, 2013 at 14:53
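Building on the float32 suggestion in the comment above, a minimal sketch (the tiny sample file here is a stand-in for the real dataset, which is an assumption):

```python
import numpy as np

# Tiny stand-in for the real tab-delimited, 3-column file
with open('sample.txt', 'w') as f:
    f.write('1\t2\t3\n4\t5\t6\n')

# Read as single precision to roughly halve peak memory during parsing
X = np.loadtxt('sample.txt', dtype=np.float32, delimiter='\t')
X = X.astype(np.float64)  # upcast afterwards if full precision is needed
```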

3 Answers

import pandas, re, numpy as np

def load_file(filename, num_cols, delimiter='\t'):
    data = None
    try:
        # Fast path: load a previously serialized binary copy
        data = np.load(filename + '.npy')
    except IOError:
        splitter = re.compile(delimiter)

        def items(infile):
            # Yield one field at a time so no intermediate lists are built
            for line in infile:
                for item in splitter.split(line):
                    yield item

        with open(filename, 'r') as infile:
            data = np.fromiter(items(infile), np.float64, -1)
            data = data.reshape((-1, num_cols))
            np.save(filename, data)

    return pandas.DataFrame(data)

This reads in the 2.5 GB file and serializes the output matrix. The input file is read lazily, so no intermediate data structures are built and minimal memory is used. The initial load takes a long time, but each subsequent load (of the serialized .npy file) is fast. Please let me know if you have tips!
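To illustrate the np.fromiter approach in isolation, here is a self-contained sketch of the lazy-parsing core (the in-memory lines stand in for a real open file):

```python
import re
import numpy as np

splitter = re.compile('\t')

def items(lines):
    # Yield one field at a time; the generator is consumed lazily,
    # so no per-row lists are materialized
    for line in lines:
        for item in splitter.split(line):
            yield item

lines = ['1\t2\t3\n', '4\t5\t6\n']  # stand-in for iterating over a file
data = np.fromiter(items(lines), np.float64)
data = data.reshape((-1, 3))
```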


1 Comment

If you're specifying the number of columns a priori, why not do something more like this: gist.github.com/2465280? On a side note, to make an array from a generator, use np.fromiter.

Try out recfile for now: http://code.google.com/p/recfile/ . There are a couple of efforts I know of to make a fast C/C++ file reader for NumPy; it's on my short todo list for pandas because it causes problems like these. Warren Weckesser also has a project here: https://github.com/WarrenWeckesser/textreader . I don't know which one is better, try them both?
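For comparison, pandas' own read_csv is the standard text reader on the pandas side; a minimal sketch (the sample file is a stand-in for the real data):

```python
import pandas as pd

# Stand-in for the real tab-delimited, headerless data file
with open('sample.tsv', 'w') as f:
    f.write('1\t2\t3\n4\t5\t6\n')

# header=None because the data file has no header row;
# float32 keeps memory usage down, as suggested in the comments above
df = pd.read_csv('sample.tsv', sep='\t', header=None, dtype='float32')
arr = df.to_numpy()
```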



You can try numpy.fromfile:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
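np.fromfile is fastest with raw binary data; a sketch of a tofile/fromfile round trip (note that fromfile stores no shape or dtype metadata, so the caller must supply both when reading back):

```python
import numpy as np

data = np.arange(6, dtype=np.float64).reshape(2, 3)
data.tofile('data.bin')  # raw bytes only; no shape or dtype is recorded

# fromfile returns a flat array; reshape to the known column count
loaded = np.fromfile('data.bin', dtype=np.float64).reshape(-1, 3)
```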

