
I have 100000 images and I need to get a vector for each image:

import cPickle

imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1, 2048)))
# dump the full list once, after the loop has finished
cPickle.dump(imageVectors, open('imageVectors.pkl', "w+b"), cPickle.HIGHEST_PROTOCOL)

getvector is a function that takes one image at a time and needs about 1 second to process it. So, basically, my problem reduces to:

for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec per call

The things that I have already tried are (only the pseudo-code is given here):

1) Using numpy's vectorize:

def callFunction1(i):
    return callFunction2(i)

vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))

2) Using Python's map:

def callFunction1(i):
    return callFunction2(i)

imageVectors = map(callFunction1, list(range(100000)))

3) Using python multiprocessing:

import multiprocessing
try:
   cpus = multiprocessing.cpu_count()
except NotImplementedError:
   cpus = 4   # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, xrange(100000000))

4) Using multiprocessing in a different way:

from multiprocessing import Process, Queue

q = Queue()
N = 100000000
# each process handles one quarter of the range and puts its results on the queue
p1 = Process(target=callFunction, args=(0, N//4, q))
p1.start()
p2 = Process(target=callFunction, args=(N//4, N//2, q))
p2.start()
p3 = Process(target=callFunction, args=(N//2, 3*N//4, q))
p3.start()
p4 = Process(target=callFunction, args=(3*N//4, N, q))
p4.start()

results = []
for i in range(4):
    results.append(q.get(True))
p1.join()
p2.join()
p3.join()
p4.join()

All of the above methods take an immensely long time. Is there a more efficient approach, ideally one that processes many elements simultaneously instead of sequentially, or any other way to speed this up?


The time is mainly being taken by the getvector function itself. As a workaround, I have split my data into 8 batches, run the same program on a different part of the loop in each, and launched eight separate instances of Python on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce, or offloading work to GPUs with PyCUDA, might be a good option?

  • Well, it's obviously bounded by your slow processing: if you have N cores, you can only expect a speedup of N. If the original code is too slow and you can use 4 cores, it will probably still be too slow. There is nothing to do besides making the processing function itself faster. (And you should use numpy's internal serialization, or any other sane format like HDF5, instead of cPickle; but that won't change the performance of your preprocessing step.) If your function were faster, I would not recommend this per-iteration IO, but in your case it does not matter. Commented Apr 23, 2017 at 13:58
  • If getvector(fileName) takes 1 second and you have 100 million files, you need to use more cores. One core gets you the result in 1157 days, 4 cores in 289. You need about 1100 cores to get it down to one day. Commented Apr 23, 2017 at 13:58
  • @Luchko It is doing transfer learning using TensorFlow (removing the last fully-connected layer of the convolutional neural network of the Inception v3 model and taking the image vectors from the penultimate layer). Commented Apr 23, 2017 at 15:49
  • Is the indentation of the pickle call right? Do you save the accumulated list on every iteration, or do you call that just once? I don't think you read the documentation for np.vectorize. If your function takes 1 second to produce a 2048-element array, then the problem is in that function, and if that is a complex TensorFlow process, the solution lies in understanding that package, not in trying to make the looping 'more parallel'. It's not the loop that's slowing you down; it's doing that 'getvector' many times on many different files. Commented Apr 23, 2017 at 16:38
  • I changed the tags based on your comment. Tags are most useful when they identify the modules that you use; no one watches for questions about 'loops'. Commented Apr 23, 2017 at 16:42

1 Answer


The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.

BTW, you can skip determining the number of cores yourself: by default multiprocessing.Pool uses as many worker processes as your CPU has cores.

Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that starts returning results as soon as they become available, so your parent process can begin any further processing while the workers are still busy. If ordering is important, you might want to return a tuple (number, array) to identify each result.
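
A minimal sketch of that idea, assuming your getvector from the question is importable by the worker processes (process_one is a hypothetical wrapper that tags each result with its index so the parent can restore the original order):

import multiprocessing

def process_one(i):
    # hypothetical wrapper: return the index together with the vector
    fileName = "Images/" + str(i) + '.jpg'
    return i, getvector(fileName).reshape((1, 2048))

if __name__ == '__main__':
    pool = multiprocessing.Pool()          # defaults to cpu_count() processes
    imageVectors = [None] * 100000
    for i, vec in pool.imap_unordered(process_one, range(100000)):
        imageVectors[i] = vec              # results arrive in completion order
    pool.close()
    pool.join()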

Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process via IPC. On a 4-core machine that results in 4 IPC transfers per second of 2048 * 8 = 16384 bytes each, i.e. 65536 bytes/second. That doesn't sound too bad, but I don't know how much overhead the IPC (which involves pickling and Queues) will incur.

In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines.
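
One possible sketch of that, assuming a fork-based platform (such as the Linux VM mentioned in the question) so that the RawArray allocated in the parent is inherited by the workers; getvector is again assumed to be importable:

import ctypes
import multiprocessing
import numpy as np

N, DIM = 100000, 2048

# one flat shared buffer of N * DIM doubles (~1.5 GiB); workers write into it
# directly, so no per-result IPC transfer is needed
shared = multiprocessing.RawArray(ctypes.c_double, N * DIM)

def worker(i):
    # view the shared buffer as an (N, DIM) array and fill in row i
    vectors = np.frombuffer(shared, dtype=np.float64).reshape(N, DIM)
    vectors[i, :] = getvector("Images/" + str(i) + '.jpg').reshape(DIM)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    pool.map(worker, range(N))
    pool.close()
    pool.join()
    imageVectors = np.frombuffer(shared, dtype=np.float64).reshape(N, DIM)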

For 100000 images, 4 cores, and about one second per image, your program's running time would be on the order of 100000 / 4 = 25000 seconds, i.e. roughly 7 hours.

Your most important optimization task is to look into reducing the runtime of the getvector function itself. For example, would it work just as well if you halved the image dimensions? Assuming the runtime scales linearly with the number of pixels, that should cut the runtime to about 0.25 s per image.
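
For illustration, a hypothetical preprocessing pass with Pillow that halves both dimensions of each file before it is fed to getvector (leaving roughly a quarter of the pixels):

from PIL import Image

def downscale(fileName):
    # halve both dimensions -> about a quarter of the original pixels
    img = Image.open(fileName)
    w, h = img.size
    img.resize((w // 2, h // 2)).save(fileName)

Whether the extracted vectors are still good enough after downscaling depends on the model behind getvector, so it is worth validating on a small sample first.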
