
I would really appreciate help with the following. I read data from a CSV file into lists of lists, and then convert that to a NumPy array. However, I am really struggling to change a set of values in the array to floats, as I would like to add up a set of numbers for each row and insert the total as a new element in each row.

I am able to change them and create a copy with the changed data type, but I can't seem to do it in place (in the original NumPy array).

Here is a small example of what the data from the CSV looks like, and what I am trying to achieve.

list_of_lists = [["Africa", "1990", "0", "", "32.6"], ["Asia", "2006", "32.4", "5.5", "46.6"],
                 ["Europe", "2011", "5.4", "", "55.4"]]

import numpy as np

array = np.array(list_of_lists)

array[array == ""] = np.nan

print(array)

# This doesn't change it in place

array[:, 2:].astype(np.float32, copy=False)

# And neither does this

array[:, 2:] = array[:,2:].astype(np.float32)

I read several similar questions, but none of the methods worked for me. I thought it would be as easy as setting copy=False, but apparently it isn't...

I would really appreciate a hand, and an explanation of what is going on.

  • insert a total in each row - can't do that in-place. Adding a new column makes a new array. Commented Mar 31, 2020 at 18:21

2 Answers


You can't change the dtype in-place.

In [59]: arr = np.array(list_of_lists)                                                         
In [60]: arr                                                                                   
Out[60]: 
array([['Africa', '1990', '0', '', '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', '', '55.4']], dtype='<U6')

The common dtype of the inputs is a string.

Replacing the "" with nan puts the string representation in the array:

In [62]: arr[arr == ""] = np.nan                                                                                       
In [63]: arr                                                                                   
Out[63]: 
array([['Africa', '1990', '0', 'nan', '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', 'nan', '55.4']], dtype='<U6')

Look at a portion of the underlying databuffer:

In [64]: arr.tobytes()                                                                         
Out[64]: b'A\x00\x00\x00f\x00\x00\x00r\x00\x00\x00i\x00\x00\x00c\x00\x00\x00a\x00\x00\x001\x00\x00\x009\x00\x00\x009\x00\x00\....'

See the actual characters.

A slice of the array is a view, but the astype conversion is a new array, with its own data buffer.

In [65]: arr[:,2:]                                                                             
Out[65]: 
array([['0', 'nan', '32.6'],
       ['32.4', '5.5', '46.6'],
       ['5.4', 'nan', '55.4']], dtype='<U6')
In [66]: arr[:,2:].astype(float)                                                               
Out[66]: 
array([[ 0. ,  nan, 32.6],
       [32.4,  5.5, 46.6],
       [ 5.4,  nan, 55.4]])

You can't write Out[66] back to arr without it being converted back to string.
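Outside the transcript, a quick way to check both claims is `np.shares_memory` (my addition, not part of the session above): the slice shares `arr`'s buffer, the `astype` result does not, and writing the floats back simply casts them to strings again:

```python
import numpy as np

arr = np.array([['0', 'nan', '32.6'],
                ['32.4', '5.5', '46.6']])

view = arr[:, 1:]                      # a slice is a view of the same buffer
print(np.shares_memory(arr, view))     # True
floats = arr.astype(float)             # astype allocates a new buffer
print(np.shares_memory(arr, floats))   # False

arr[:, :] = floats                     # writing back casts to the string dtype
print(arr.dtype.kind)                  # 'U' (still unicode strings)
```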

You could make an object dtype array:

In [67]: arr = np.array(list_of_lists, dtype=object)                                           
In [68]: arr                                                                                   
Out[68]: 
array([['Africa', '1990', '0', '', '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', '', '55.4']], dtype=object)
In [70]: arr[arr == ""] = np.nan                                                               
In [71]: arr                                                                                   
Out[71]: 
array([['Africa', '1990', '0', nan, '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', nan, '55.4']], dtype=object)
In [72]: arr[:,2:] = arr[:,2:].astype(float)                                                   
In [73]: arr                                                                                   
Out[73]: 
array([['Africa', '1990', 0.0, nan, 32.6],
       ['Asia', '2006', 32.4, 5.5, 46.6],
       ['Europe', '2011', 5.4, nan, 55.4]], dtype=object)

dtype remains object, but the type of the elements can change - that's because an object dtype array is a glorified (or debased) list. You gain some flexibility, but lose most of the numpy numeric speed.
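For the row totals the question actually asks about: as the comment under the question notes, a total can't be inserted in place; you have to build a new array. A minimal sketch with the object-dtype array, using `np.column_stack` (my choice of helper, not shown above):

```python
import numpy as np

arr = np.array([['Africa', '1990', '0', 'nan', '32.6'],
                ['Asia', '2006', '32.4', '5.5', '46.6']], dtype=object)

nums = arr[:, 2:].astype(float)        # numeric columns as a float array
totals = np.nansum(nums, axis=1)       # nan entries are treated as 0
out = np.column_stack([arr, totals])   # new (2, 6) object array with totals
```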

Structured array (compound dtype) as shown in the other answer is another possibility. It's easy to make this kind of array when loading a csv (with np.genfromtxt). You still can't change dtypes in-place. And you can't do math across fields of a structured array.
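For reference, a sketch of loading the same data with `np.genfromtxt` (the field names here are made up): with `dtype=None` it infers a dtype per column, and empty fields in a float column become nan:

```python
import io
import numpy as np

csv = ("Africa,1990,0,,32.6\n"
       "Asia,2006,32.4,5.5,46.6\n"
       "Europe,2011,5.4,,55.4")

# dtype=None -> infer per-column dtypes; empty fields are treated as missing
arr = np.genfromtxt(io.StringIO(csv), delimiter=',', dtype=None,
                    encoding='utf-8',
                    names=['region', 'year', 'v1', 'v2', 'v3'])
print(arr.dtype)
print(arr['v2'])   # the empty fields come out as nan
```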

pandas

In [153]: df = pd.DataFrame(list_of_lists)                                                     
In [154]: df                                                                                   
Out[154]: 
        0     1     2    3     4
0  Africa  1990     0       32.6
1    Asia  2006  32.4  5.5  46.6
2  Europe  2011   5.4       55.4
In [156]: df.info()                                                                            
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
0    3 non-null object
1    3 non-null object
2    3 non-null object
3    3 non-null object
4    3 non-null object
dtypes: object(5)
memory usage: 248.0+ bytes

Convert column dtypes:

In [158]: df[2]=df[2].astype(float)   
In [162]: df[4]=df[4].astype(float) 

Column 3 needs the nan conversion before we can convert that.

In [164]: df                                                                                   
Out[164]: 
        0     1     2    3     4
0  Africa  1990   0.0       32.6
1    Asia  2006  32.4  5.5  46.6
2  Europe  2011   5.4       55.4
In [165]: df.dtypes                                                                            
Out[165]: 
0     object
1     object
2    float64
3     object
4    float64
dtype: object
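One way to handle column 3 (my suggestion, not in the session above): `pd.to_numeric` with `errors='coerce'` turns the unparseable empty strings straight into NaN, so no separate replacement step is needed:

```python
import pandas as pd

s = pd.Series(['', '5.5', ''])          # column 3 from the example
s = pd.to_numeric(s, errors='coerce')   # '' fails to parse -> NaN
print(s.dtype)                          # float64
```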

There are better pandas programmers here; I've focused more on numpy.


4 Comments

Thank you very much for the detailed explanation. I think pandas is more suited for the purpose, then. Would you say it is best to use numpy solely with integers/floats?
pandas uses numpy arrays to store its data and indices. In some cases data for a whole dataframe appears to be a 2d array, in others, each column/Series appears to have its own 1d array. It readily uses object dtype arrays (for example anything containing strings).
I've added a rudimentary pandas conversion.
Thank you very much for the pandas code as well! Really useful.

It seems you need a structured array to handle multiple datatypes.

list_of_lists = [["Africa", "1990", "0", "", "32.6"], ["Asia", "2006", "32.4", "5.5", "46.6"],
                 ["Europe", "2011", "5.4", "", "55.4"]]

temp = np.array(list_of_lists)
temp[temp==''] = 0

dtypes = np.dtype([('name', 'S10'),
                   ('val1', np.float64),
                   ('val2', np.float64),
                   ('val3', np.float64),
                   ('val4', np.float64)])

array = np.array(list(map(tuple, temp)), dtype=dtypes)

# Now you can modify the structured array
array[['val3', 'val4']]=20
array[0]['name'] = 'Australia'
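A hedged aside (my addition): if you do need row totals from a structured array despite the no-math-across-fields limitation, `numpy.lib.recfunctions.structured_to_unstructured` can pull the float fields out into a plain 2-D array first:

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

dt = np.dtype([('name', 'S10'), ('val1', np.float64), ('val2', np.float64)])
arr = np.array([(b'Africa', 1.0, 2.0), (b'Asia', 3.0, 4.0)], dtype=dt)

# select the float fields, flatten them into an ordinary (2, 2) float array
vals = structured_to_unstructured(arr[['val1', 'val2']])
totals = vals.sum(axis=1)
```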

The problem with this is that you can pretend these are columns, but really it's just a structure with shape (3,), not a 2-D table. I would recommend switching to a pandas DataFrame.

import pandas as pd

array = pd.DataFrame(list_of_lists)
array.replace('', '0', inplace=True)
array[array.columns[2:]] = array[array.columns[2:]].astype(float)

array.dtypes

# 0 object
# 1 object
# 2 float64
# 3 float64
# 4 float64
# dtype: object

1 Comment

Thank you very much for your answer, Jose! I think pandas is more fit for the purpose.
