Input:

A string list like this:

['a', 'a', 'a', 'b', 'b', 'a', 'b']

Output I want:

A numpy array like this:

array([[ 1,  0],
       [ 1,  0],
       [ 1,  0],
       [ 0,  1],
       [ 0,  1],
       [ 1,  0],
       [ 0,  1]])

What I tried:

Try 1 - My starting data is actually stored as a single column in a CSV file, so I tried the following:

data1 = genfromtxt('csvname.csv', delimiter=',')

I did this because I thought I could manipulate the CSV data into the form I want once it was loaded into a numpy array. However, every value comes back as nan (not a number). I'm not sure how else to go about this efficiently, because I need to do it for a large data set.
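
For context on the nan values: genfromtxt parses every field as a float by default, so labels such as 'a' and 'b' cannot be converted. A minimal sketch, assuming the file holds one label per line with no header, is to pass dtype=str. Here io.StringIO stands in for 'csvname.csv' so the snippet runs on its own, and csv_text and labels are illustrative names.

```python
import io
import numpy as np

# Stand-in for the CSV file (assumption: one label per line, no header).
csv_text = "a\na\na\nb\nb\na\nb\n"

# dtype=str keeps the labels as text instead of coercing them to float.
labels = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=str)
print(labels)
```

The resulting array of strings can then be fed to any of the encoding approaches in the answers.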

Try 2 - The inefficient method I was thinking of using:

For each element of the list, append [1,0] if a and append [0,1] if b.

Is there a better method?

3 Answers

Using a list comprehension

Code:

import numpy
lst = ['a', 'a', 'a', 'b', 'b', 'a', 'b']
numpy.array([[1, 0] if val == "a" else [0, 1] for val in lst])

Output:

array([[1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0],
       [0, 1]])

Note:

  • Rather than appending to a list or numpy array one element at a time, building the whole list in a single comprehension is faster
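
The same build-once idea can be sketched with a lookup table, which extends to more than two labels without chaining conditionals. The rows dict and the encoded name are illustrative assumptions, not part of the original answer.

```python
import numpy as np

lst = ['a', 'a', 'a', 'b', 'b', 'a', 'b']

# Hypothetical lookup table: one output row per label.
rows = {'a': [1, 0], 'b': [0, 1]}

# Build the full list of rows first, then convert to an array once.
encoded = np.array([rows[val] for val in lst])
print(encoded)
```

A label missing from the table raises a KeyError, which may be preferable to silently mapping it to [0, 1].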

Building List

import numpy as np
lst = ['a', 'a', 'a', 'b', 'b', 'a', 'b']  # named lst to avoid shadowing the built-in list
np.array([[ch == 'a', ch == 'b'] for ch in lst]).astype(int)

Output

array([[1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0],
       [0, 1]])

Does this solve it for you?
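
The boolean comparisons can also be done on whole arrays rather than per element. This is a sketch of that variant, not the code from the answer above; arr, flags and encoded are illustrative names.

```python
import numpy as np

lst = ['a', 'a', 'a', 'b', 'b', 'a', 'b']
arr = np.array(lst)

# Each comparison yields a boolean vector; stack the two as columns.
flags = np.column_stack([arr == 'a', arr == 'b'])

# astype(int) maps True to 1 and False to 0.
encoded = flags.astype(int)
print(encoded)
```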

4 Comments

I didn't refresh the page to see I was second. Is my answer different enough to keep? Or do I delete my post when this happens?
Yes I think it is different enough to keep! Thank you for your input!! Although both answers answer my question, who knows, your method may prove to be more useful for the next person who views this question.
@thundergolfer I feel that your answer may be more efficient than mine :). So just keep it.
And answering second or last does not matter; providing a better output is what matters.

NumPythonic vectorized method using np.unique -

((np.unique(A)[:,None] == A).T).astype(int)

Sample run -

In [9]: A
Out[9]: ['a', 'a', 'a', 'b', 'b', 'a', 'b']

In [10]: ((np.unique(A)[:,None] == A).T).astype(int)
Out[10]: 
array([[1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0],
       [0, 1]])
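
To unpack the one-liner: np.unique returns the sorted distinct labels, [:, None] turns them into a column, and the comparison broadcasts against the input to give one boolean row per label. A step-by-step sketch follows, using an assumed three-label sample to show that the number of columns follows the number of distinct labels; the intermediate names are illustrative.

```python
import numpy as np

A = ['a', 'c', 'b', 'a']  # assumed sample with three distinct labels
arr = np.asarray(A)

labels = np.unique(arr)        # sorted distinct labels: a, b, c
mask = labels[:, None] == arr  # broadcast to shape (3, 4): one row per label
one_hot = mask.T.astype(int)   # transpose to (4, 3): one row per input item
print(one_hot)
```

Columns follow the sorted order of the labels, so column 0 is 'a', column 1 is 'b', and column 2 is 'c'.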

2 Comments

I have already upvoted it, but I have two doubts: 1. Since there are only two values, a and b, why do you need np.unique at all? Isn't that overcomplicating things? 2. Is this as efficient as thunder's answer?
@The6thSense Well thanks for the up! On the questions - 1) I am assuming OP has posted a sample case in the question, so there could be more than just a and b in it. 2) On efficiency, being a vectorized approach I would think this should be pretty fast, given enough unique letters to iterate with.
