6

I have a file format (fastq format) that encodes a string of integers as a string where each integer is represented by an ascii code with an offset. Unfortunately, there are two encodings in common use, one with an offset of 33 and the other with an offset of 64. I typically have several 100 million strings of length 80-150 to convert from one offset to the other. The simplest code that I could come up with for doing this type of thing is:

def phred64ToStdqual(qualin):
    return(''.join([chr(ord(x)-31) for x in qualin]))

This works just fine, but it is not particularly fast. For 1 million strings, it takes about 4 seconds on my machine. If I change to using a couple of dicts to do the translation, I can get this down to about 2 seconds.

ctoi = {}
itoc = {}
for i in xrange(127):
    itoc[i]=chr(i)
    ctoi[chr(i)]=i

def phred64ToStdqual2(qualin):
    return(''.join([itoc[ctoi[x]-31] for x in qualin]))

If I blindly run under cython, I get it down to just under 1 second.
It seems like at the C-level, this is simply a cast to int, subtract, and then cast to char. I haven't written this up, but I'm guessing it is quite a bit faster. Any hints including how to better code a this in python or even a cython version to do this would be quite helpful.

Thanks,

Sean

5
  • Try replacing [] with () to use generators rather than creating and discarding lists. I doubt it will make a huge difference, but it should make some. Commented Sep 27, 2010 at 16:25
  • Re replacing [] with (), the () are redundant with any recent python Commented Sep 27, 2010 at 16:30
  • Fine idea, but string join needs a list, I believe, so that won't work directly I don't think. Commented Sep 27, 2010 at 16:38
  • Um, with Python 2.6.1: ".".join(x for x in ["1", "2", "3"]) -> '1.2.3' Commented Sep 27, 2010 at 16:46
  • Stupid on my part. I ran the generator under cython--not supported there. Python 2.6.2 is 20% slower using the generator version. Commented Sep 27, 2010 at 16:50

1 Answer 1

4

If you look at the code for urllib.quote, there is something that is similar to what you're doing. It looks like:

_map = {}
def phred64ToStdqual2(qualin):
    if not _map:
        for i in range(31, 127):
            _map[chr(i)] = chr(i - 31)
    return ''.join(map(_map.__getitem__, qualin))

Note that the above function works in case the mappings are not the same length (in urllib.quote, you have to take '%' -> '%25'.

But actually, since every translation is the same length, python has a function that does just this very quickly: maketrans and translate. You probably won't get much faster than:

import string
_trans = None
def phred64ToStdqual4(qualin):
    global _trans
    if not _trans:
        _trans = string.maketrans(''.join(chr(i) for i in range(31, 127)), ''.join(chr(i) for i in range(127 - 31)))
    return qualin.translate(_trans)
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks, Mike. That is a blazing fast 0.1 second on the same machine as above and will be fast enough for my purposes. I'll stick with phred64ToStdqual4() as listed above....

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.