
I'm quite new to Python and I have to handle data files of around 20 GB. Currently I want to understand whether it is possible to write Python code (like the code below) and run it on the GPU. I just need to open those files and do something like:

file = open(fnTar, "w")
for iLine in List:
    iLine = iLine.replace("\\", "")
    file.write(iLine)
file.close()

I know there are high-level APIs like Dask to handle large files much more efficiently, but in the future I will also need to manipulate the data in other ways (some calculations). Is it possible to run such code on the GPU without changing the original script? Something like:

Run this on GPU:
    file = open(fnTar, "w")
    for iLine in List:
        iLine = iLine.replace("\\", "")
        file.write(iLine)
    file.close()

My understanding is that even using CUDA requires some additional changes to the code, and if you use modules like numpy you have to find an equivalent module developed for CUDA. So that is maybe also not the simple and quick solution I'm looking for.
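
(For context, here is roughly what I mean by an equivalent module: a minimal sketch, assuming CuPy is installed and a CUDA GPU is available. It only covers numeric numpy-style code, not the string/file handling above.)

import cupy as cp  # drop-in for "import numpy as np" in many numeric cases

x = cp.arange(10_000_000, dtype=cp.float32)  # array allocated in GPU memory
y = cp.sqrt(x) * 2.0                         # computed on the GPU
total = float(y.sum())                       # copy a single scalar back to the host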

  • GPUs can't do file I/O. What you are asking isn't possible, and there are no "simple and quick" solutions when it comes to GPU computing. Commented Dec 30, 2019 at 14:27

3 Answers


From my understanding of your question, this probably isn't what you're looking for. GPUs are good at matrix manipulation, not at explicitly handling large files. For that you really just need more memory or some method of handling the file in chunks.
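
For the replace-and-write step in your question, a plain-Python sketch like the one below (assuming the input is a text file you can stream line by line; the file names are placeholders) processes an arbitrarily large file without ever loading it into memory:

# Stream the input one line at a time so the 20GB file never has to fit in memory.
with open("input.txt", "r") as src, open("output.txt", "w") as dst:
    for line in src:
        dst.write(line.replace("\\", ""))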




You can do powerful string manipulation within the RapidsAI framework. If you have CSVs then data loading can be extremely fast: https://blog.dask.org/2019/01/13/dask-cudf-first-steps
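
For example, a rough sketch (assuming a RAPIDS install with dask_cudf and a CSV whose string column is named 'a'; adjust names and paths to your data):

import dask_cudf

ddf = dask_cudf.read_csv("data.csv")                    # partitioned, GPU-accelerated CSV read
ddf["a"] = ddf["a"].str.replace("\\", "", regex=False)  # string manipulation runs on the GPU
ddf.to_csv("cleaned-*.csv")                             # one output file per partition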



Thanks for this question. Here is how you would use RAPIDS and an NVIDIA GPU to do this really quickly (both in time and in code!). It will require you to use cudf or nvstrings. These are just toy examples, so adjust them to your use case.

CUDF

1) Create and read the file directly (assuming CSV for your use case):

import cudf

fn = 'test.csv'
lines = """a, b, c
test \\ phase 1, 2, 3
test  \\phase 2, 4, 5
test\\phase 3, 6, 7
"""
with open(fn, 'w') as fp:
    fp.write(lines)

# File Manipulation starts here    
df = cudf.read_csv(fn)
df.head()

Here is the output:

                 a  b   c
0   test \ phase 1  2   3
1   test \phase 2   4   5
2   test\phase 3    6   7

2) Do the replacement:

# regex=False is important in this particular instance, as replace won't work without it
# (created a GitHub issue to fix that)!
df['a'] = df['a'].str.replace("\\", "", regex=False)
df.head()

Your output will be:

      a             b   c
0   test  phase 1   2   3
1   test phase 2    4   5
2   testphase 3     6   7

3) Write the file:

df.to_csv("testdone.csv")

NVSTRINGS: Docs Link for this part

For an nvstrings solution, send your list to the device. I don't have your file or its formatting info, so I would ask you to try it out, if you'd like :). However, watch your implementation's IO, as that latency may cause it to take longer than cudf. It's really just better (faster) to send the file to the GPU and work on it only on the GPU. I don't really recommend this way, but I'm just showing you what you CAN do in a TOY EXAMPLE :).

import nvstrings
s = nvstrings.to_device(["h\\llo", "go\\dbye"])  # example list; s plays the role of iLine
print(s.replace('\\', '', regex=False))

Output will be:

['hllo', 'godbye']

As for the 20GB file size, if you have enough GPU memory, like on a Titan RTX, GV100, or RTX 8000, you should be okay on a single GPU without Dask. If not, apply this to dask_cudf. The dask_cudf string accessor source is here!

Hope this helps!

