I am attempting to filter a numpy array by a regex in Python, but not all of the expected values are being matched.
The data I'm working with is a large numpy array of byte strings of various lengths. Before applying the regex filter, I've created an index of all strings of a specific length, and I'd ultimately like to subset this index with the regex filter.
I've created the following function designed to filter this index:
import re
import numpy as np

# Remove all indices corresponding to peptides containing Cs.
def remove_cs(peptide_arr, indices_arr):
    indices_peptide_arr = peptide_arr[indices_arr]
    r = re.compile(b'C')
    v_search = np.vectorize(lambda x: not bool(r.search(x, re.IGNORECASE)))
    indices_filter = v_search(indices_peptide_arr)
    filtered_indices_arr = indices_arr[indices_filter]
    return filtered_indices_arr
The goal of the above function is to subset the indices array given as input so that it only contains those indices whose corresponding values don't contain any Cs. The inputs are the full unfiltered peptide array and the selection index array. This filters out most of the desired indices, but when I check the result, a few are not matched. For instance, the index corresponding to b'ACAAAAAA' is still returned. What's more, in every instance where the regex misses, the corresponding peptide contains a C as one of its first two characters, which I believe is significant.
The following small example script exemplifies the issue. While I would expect only 2 to be returned (the index corresponding to 'AAAA'), 6 is also returned (the index corresponding to 'ACAA').
peptide_list = ['AA', 'AAA', 'AAAA', 'AAAC', 'AACA', 'AACC', 'ACAA', 'ACAC', 'ACCC']
peptide_byte_list = [i.encode() for i in peptide_list]
peptide_arr = np.array(peptide_byte_list)
indices_arr = np.array([2, 3, 4, 5, 6, 7, 8])
print(remove_cs(peptide_arr, indices_arr))
I'd appreciate any insight on why my current regex would be missing any matches which occur within the first two characters of a string.
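One detail that may explain the pattern: on a compiled pattern object, the second positional argument of `search()` is `pos` (the index where scanning starts), not a flags argument, and `re.IGNORECASE` has the integer value 2. The following is a minimal sketch of that behaviour using plain byte strings instead of the numpy array; the names `pattern_ci`, `kept_buggy` and `kept_fixed` are illustrative and not part of the original function.

```python
import re

# re.IGNORECASE is an integer-valued flag; its value is 2.
flag_value = int(re.IGNORECASE)

pattern = re.compile(b"C")

# On a compiled pattern, search(string, pos, endpos) takes a start position
# as its second argument, so passing the flag here starts the scan at index 2.
missed = pattern.search(b"ACAA", re.IGNORECASE)

# Passing the flag to re.compile() instead lets search() scan the whole string.
pattern_ci = re.compile(b"C", re.IGNORECASE)
found = pattern_ci.search(b"ACAA")

# The byte strings at indices 2 through 8 of the example list.
peptides = [b"AAAA", b"AAAC", b"AACA", b"AACC", b"ACAA", b"ACAC", b"ACCC"]

# Same call shape as in remove_cs: the first two characters are skipped.
kept_buggy = [p for p in peptides if not pattern.search(p, re.IGNORECASE)]

# Flag supplied at compile time: every character is checked.
kept_fixed = [p for p in peptides if not pattern_ci.search(p)]

print(flag_value, missed, found)
print(kept_buggy)
print(kept_fixed)
```

If this is indeed the cause, moving `re.IGNORECASE` into the `re.compile()` call and calling `r.search(x)` with a single argument inside the vectorized lambda should make the example return only index 2.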
indices_arr for which the peptide_list entry at that index contains all 'A's. Is it right?