How to filter a numpy array by a regex?

Question

I am trying to filter a numpy array by a regex in Python, but not all of the expected values are being matched.

The data I'm working with is a large numpy array of strings of various lengths. Before applying the regex filter, I created an index of all strings of a specific length, and I'd ultimately like to subset this index with the regex filter.

I've created the following function designed to filter this index:

import re
import numpy as np

# Remove all indices corresponding to peptides containing Cs.
def remove_cs(peptide_arr, indices_arr):
    indices_peptide_arr = peptide_arr[indices_arr]
    r = re.compile(b'C')
    v_search = np.vectorize(lambda x: not bool(r.search(x, re.IGNORECASE)))
    indices_filter = v_search(indices_peptide_arr)
    filtered_indices_arr = indices_arr[indices_filter]
    return filtered_indices_arr

The goal of the above function is to subset the input indices array so that it contains only those indices whose corresponding values don't contain any Cs. It takes the full, unfiltered peptide array and the selection index array as input. This filters most of the desired indices; however, when I check the result, a select few are not matched. For instance, the index corresponding to b'ACAAAAAA' is still returned. What's more, in every instance where the regex misses, the corresponding peptide contains a C as one of its first two characters, which I believe is significant to this error.

The following small script reproduces the issue. While I would expect only 2 to be returned (the index corresponding to 'AAAA'), 6 is also returned (the index corresponding to 'ACAA').

peptide_list = ['AA', 'AAA', 'AAAA', 'AAAC', 'AACA', 'AACC', 'ACAA', 'ACAC', 'ACCC']
peptide_byte_list = [i.encode() for i in peptide_list]
peptide_arr = np.array(peptide_byte_list)
indices_arr = np.array([2, 3, 4, 5, 6, 7, 8])
print(remove_cs(peptide_arr, indices_arr))

I'd appreciate any insight on why my current regex would be missing any matches which occur within the first two characters of a string.

I assume you need to find the value (index) in indices_arr for which the peptide_list entry at that index contains all 'A's. Is that right?
– Austin, Commented Apr 27, 2018 at 15:12
I want to find the indices in indices_arr whose peptides don't contain any Cs. In practice I would also match 'GGGG' and 'AGAG' but not 'ACAG' if indices were present for those.
– michaelmccarthy404, Commented Apr 27, 2018 at 15:17

Paul Panzer · Accepted Answer · 2018-04-27 17:11:18Z

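EDIT: The method form of search (called on a compiled pattern) doesn't take a flags argument; its second positional argument is pos, the offset at which the search starts. So re.IGNORECASE (which happens to equal 2) is interpreted as pos=2, and the first two characters of each string are never searched.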
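A quick way to check this explanation, as a minimal sketch (the variable names `skipped` and `found` are illustrative and not from the question): passing re.IGNORECASE as the second argument to the compiled pattern's search makes the search start at offset 2.

```python
import re

r = re.compile(b'C')

# re.IGNORECASE is an integer flag whose value is 2.
print(int(re.IGNORECASE))  # 2

# Passed to the method form, it becomes the `pos` argument,
# so the search skips the first two bytes of the string.
skipped = r.search(b'ACAA', re.IGNORECASE)
print(skipped)  # None: only b'AA' was searched

# Without the stray argument, the C at offset 1 is found.
found = r.search(b'ACAA')
print(found.start())  # 1
```

This also matches the symptom in the question: b'ACAC' and b'ACCC' are still filtered correctly because they contain a C at offset 2 or later, while b'ACAA' slips through.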

Move the flag to the compile call and the error goes away:

# Remove all indices corresponding to peptides containing Cs.
def remove_cs(peptide_arr, indices_arr):
    indices_peptide_arr = peptide_arr[indices_arr]
    r = re.compile(b'C', re.IGNORECASE)
    v_search = np.vectorize(lambda x: not bool(r.search(x)))
    indices_filter = v_search(indices_peptide_arr)
    filtered_indices_arr = indices_arr[indices_filter]
    return filtered_indices_arr

print(remove_cs(peptide_arr, indices_arr))
# [2]
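For reference, here is a self-contained sketch of the same fix that can be run on its own. It is a variant, not the code above: the function name `remove_cs_mask` is hypothetical, and the boolean mask is built with a list comprehension rather than np.vectorize. Since np.vectorize is essentially a Python-level loop, the two should behave alike for this use.

```python
import re
import numpy as np

def remove_cs_mask(peptide_arr, indices_arr):
    # Hypothetical variant of remove_cs: the flag goes in compile(),
    # and search() receives only the string to be searched.
    r = re.compile(b'C', re.IGNORECASE)
    selected = peptide_arr[indices_arr]
    # True where the peptide contains no C (in either case).
    keep = np.array([r.search(p) is None for p in selected], dtype=bool)
    return indices_arr[keep]

# Example data from the question.
peptide_list = ['AA', 'AAA', 'AAAA', 'AAAC', 'AACA', 'AACC', 'ACAA', 'ACAC', 'ACCC']
peptide_arr = np.array([p.encode() for p in peptide_list])
indices_arr = np.array([2, 3, 4, 5, 6, 7, 8])

result = remove_cs_mask(peptide_arr, indices_arr)
print(result)  # [2]
```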

edited Apr 27, 2018 at 17:11

answered Apr 27, 2018 at 16:15

