I am attempting to filter a numpy array with a regex in Python, but I am running into an error where not all expected values are matched.

The data I'm working with is a large numpy array of strings of various lengths. Before applying the regex filter, I created an index of all strings of a specific length, and I'd ultimately like to subset this index with the regex filter.

I've created the following function designed to filter this index:

import re
import numpy as np

# Remove all indices corresponding to peptides containing Cs.
def remove_cs(peptide_arr, indices_arr):
    indices_peptide_arr = peptide_arr[indices_arr]
    r = re.compile(b'C')
    v_search = np.vectorize(lambda x: not bool(r.search(x, re.IGNORECASE)))
    indices_filter = v_search(indices_peptide_arr)
    filtered_indices_arr = indices_arr[indices_filter]
    return filtered_indices_arr

The goal of the above function is to subset the indices array given as input so that it contains only those indices whose corresponding values don't contain any Cs. The inputs are the full unfiltered peptide array and the selection index array. This filters out the majority of the desired indices, but when I check the result, a select few are not matched. For instance, the index corresponding to b'ACAAAAAA' is still returned. What's more, in every instance where the regex misses, the corresponding peptide contains a C as one of its first two characters, which I believe is significant to this error.

The following small example script exemplifies the issue. While I would expect only 2 to be returned (the index corresponding to 'AAAA'), 6 is also returned (the index corresponding to 'ACAA').

peptide_list = ['AA', 'AAA', 'AAAA', 'AAAC', 'AACA', 'AACC', 'ACAA', 'ACAC', 'ACCC']
peptide_byte_list = [i.encode() for i in peptide_list]
peptide_arr = np.array(peptide_byte_list)
indices_arr = np.array([2, 3, 4, 5, 6, 7, 8])
print(remove_cs(peptide_arr, indices_arr))

I'd appreciate any insight on why my current regex would be missing any matches which occur within the first two characters of a string.

  • A sample input and output would be great. Commented Apr 27, 2018 at 14:21
  • Great idea, I can get one prepared within a few minutes. Commented Apr 27, 2018 at 14:22
  • I've added sample input @theausome. Commented Apr 27, 2018 at 15:01
  • I assume you need to find the value (index) in indices_arr for which peptide_list entry at that index contains all 'A's. Is it right? Commented Apr 27, 2018 at 15:12
  • I want to find the indices in indices_arr that don't contain any Cs. In practice I would also match 'GGGG' and 'AGAG' but not 'ACAG' if indices were present for those. Commented Apr 27, 2018 at 15:17

1 Answer


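EDIT: The method form of search (called on a compiled pattern) doesn't take a flags argument. Its second positional parameter is pos, so the IGNORECASE (which happens to equal 2) is interpreted as the start position, and the first two characters of each string are never searched.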
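To see this in isolation, here is a minimal sketch using two of the peptides from the question. The variable names missed and found are only illustrative:

```python
import re

r = re.compile(b'C')

# On a compiled pattern the signature is search(string, pos, endpos),
# so the second positional argument is a start offset, not a flag.
assert int(re.IGNORECASE) == 2

# Equivalent to r.search(b'ACAA', 2): only b'AA' is searched, so the C at index 1 is skipped.
missed = r.search(b'ACAA', re.IGNORECASE)

# Here the C sits at index 2, which is still inside the searched range.
found = r.search(b'AACA', re.IGNORECASE)

print(missed)         # None
print(found.start())  # 2
```

This also explains why only index 6 slips through in the small example: 'ACAA' is the only selected peptide whose sole C falls within the first two characters.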

Move the flag to the compile call and the error goes away:

# Remove all indices corresponding to peptides containing Cs.
def remove_cs(peptide_arr, indices_arr):
    indices_peptide_arr = peptide_arr[indices_arr]
    r = re.compile(b'C', re.IGNORECASE)
    v_search = np.vectorize(lambda x: not bool(r.search(x)))
    indices_filter = v_search(indices_peptide_arr)
    filtered_indices_arr = indices_arr[indices_filter]
    return filtered_indices_arr

print(remove_cs(peptide_arr, indices_arr))
# [2]
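
As a side note, for a single-letter test like this the regex can be avoided entirely. The following is a sketch of an alternative, not part of the fix above; it assumes the peptides are stored as a bytes array as in the question, and the function name remove_cs_no_regex is made up for illustration:

```python
import numpy as np

def remove_cs_no_regex(peptide_arr, indices_arr):
    # Select the peptides that the index array points at.
    selected = peptide_arr[indices_arr]
    # Uppercase first so the test is case-insensitive, then look for b'C';
    # np.char.find returns -1 where the substring is absent.
    has_c = np.char.find(np.char.upper(selected), b'C') >= 0
    # Keep only the indices whose peptide has no C.
    return indices_arr[~has_c]

peptide_list = ['AA', 'AAA', 'AAAA', 'AAAC', 'AACA', 'AACC', 'ACAA', 'ACAC', 'ACCC']
peptide_arr = np.array([p.encode() for p in peptide_list])
indices_arr = np.array([2, 3, 4, 5, 6, 7, 8])

print(remove_cs_no_regex(peptide_arr, indices_arr))
# [2]
```

Whether this is faster than np.vectorize with a compiled pattern on a large array is something to measure on the real data rather than assume.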