String-join operation in python numpy or pandas objects

Question

I want to join the string columns of a pandas DataFrame or NumPy ndarray into a new last column, like this:

        a   b   c                          a   b   c   d
        ----------         --->            ---------------
        a   b   c                          a   b   c   a_b_c             
        d   e   f                          d   e   f   d_e_f
        g   h   i                          g   h   i   g_h_i

I can think of two representative options:

import pandas as pd

# Compose data
a = ['a','b','c']
b = ['d','e','f']
c = ['g','h','i']

pdf = pd.DataFrame([a,b,c], columns=['a','b','c'])


# One option
%%timeit
pdf.loc[:,'d'] = [i for i in map(lambda x: '_'.join([x.a, x.b, x.c]), pdf.itertuples())]
>>>1.08 ms ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Another option
%%timeit
tmp=[]
for i in pdf.itertuples():
    tmp.append('_'.join([i.a, i.b, i.c]))

pdf.loc[:,'d'] = tmp
>>>1.08 ms ± 5.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I understand that there might be too little data to see any difference between these methods, but my question is: is there a smarter built-in method in numpy or pandas that I can call? Also, is there any problem with either of the two methods I thought of?

Thank you!

In my experience, pandas uses numpy underneath, so pandas adds overhead to operations. Nearly all operations are faster in plain numpy than in pandas.
– Scott Boston, Commented Jul 13, 2020 at 18:31

NYC Coder · Accepted Answer · 2020-07-13 18:06:43Z

4

You can try either of the two options below; neither needs an explicit loop:

df['combined'] = df['a'] + '_' + df['b'] + '_' + df['c']

or:

df['combined'] = df[['a', 'b', 'c']].agg('_'.join, axis=1)

   a  b  c combined
0  a  b  c    a_b_c
1  d  e  f    d_e_f
2  g  h  i    g_h_i


2 Comments

sushanth Over a year ago

The first one looks good, but the latter one (agg) performs very poorly.

FBruzzesi Over a year ago

Explicitly converting to numpy arrays (i.e. pdf['a'].to_numpy() + '_' + pdf['b'].to_numpy() + '_' + pdf['c'].to_numpy()) will yield better time performance (on my local machine it is 20x faster).
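The NumPy route from this comment can be sketched as follows. This is only a minimal illustration, assuming every column holds plain Python strings with no missing values; passing dtype=object to to_numpy() is an addition here (not part of the comment) to make sure "+" falls back to element-wise Python string concatenation.

```python
import pandas as pd

pdf = pd.DataFrame(
    [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]],
    columns=["a", "b", "c"],
)

# Convert each column to an object-dtype NumPy array so that "+" performs
# element-wise string concatenation without pandas index alignment.
a = pdf["a"].to_numpy(dtype=object)
b = pdf["b"].to_numpy(dtype=object)
c = pdf["c"].to_numpy(dtype=object)

joined = a + "_" + b + "_" + c

# Assign the resulting array back as a new column.
pdf["d"] = joined
print(pdf)
```

Any speed-up depends on the machine and the data size, so it is worth timing this against the pandas version on your own data.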

Scott Boston · Accepted Answer · 2020-07-13 18:25:32Z

I'd like to throw another option out there:

pdf['a'].str.cat([pdf['b'], pdf['c']], sep='_')

Output:

0    a_b_c
1    d_e_f
2    g_h_i
Name: a, dtype: object
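If the columns to join are held in a list, the same str.cat call can be built from it. The sketch below is illustrative only: the cols list and the missing value in the sample data are assumptions added here, and na_rep is the str.cat parameter that controls how missing values are rendered.

```python
import pandas as pd

# Sample data with one missing value (an assumption, to show na_rep).
pdf = pd.DataFrame(
    [["a", "b", "c"], ["d", None, "f"], ["g", "h", "i"]],
    columns=["a", "b", "c"],
)

cols = ["a", "b", "c"]  # hypothetical list of columns to join
first, rest = cols[0], cols[1:]

# Without na_rep, a row with a missing value in any column yields a missing result.
with_nan = pdf[first].str.cat([pdf[c] for c in rest], sep="_")

# With na_rep, missing values are replaced by the given placeholder before joining.
filled = pdf[first].str.cat([pdf[c] for c in rest], sep="_", na_rep="?")

print(with_nan)
print(filled)
```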

Timings

import pandas as pd
from timeit import timeit

# Compose data
a = ['a','b','c']
b = ['d','e','f']
c = ['g','h','i']

pdf = pd.DataFrame([a,b,c], columns=['a','b','c'])

# Each method works on the DataFrame d passed in, so the timings scale with its size.
def met_add(d):
    return d['a'] + '_' + d['b'] + '_' + d['c']

def met_agg_axis1(d):
    return d[['a', 'b', 'c']].agg('_'.join, axis=1)

def met_str_cat(d):
    return d['a'].str.cat([d['b'], d['c']], sep='_')

def met_map_join(d):
    return pd.Series([i for i in map(lambda x: '_'.join([x.a, x.b, x.c]), d.itertuples())])

def met_iter_join(d):
    tmp = []
    for i in d.itertuples():
        tmp.append('_'.join([i.a, i.b, i.c]))
    return pd.Series(tmp)

def met_numpy_add(d):
    return pd.Series(d['a'].to_numpy() + '_' + d['b'].to_numpy() + '_' + d['c'].to_numpy())

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000],
    columns='met_add met_agg_axis1 met_str_cat met_map_join met_iter_join met_numpy_add'.split(),
    dtype=float
)

for i in res.index:
    # Repeat the 3-row frame i times; column names stay a, b, c so the methods can find them.
    d = pd.concat([pdf]*i, ignore_index=True)
    print(d.shape)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

# Plotting requires matplotlib.
res.plot(loglog=True, figsize=(10,8));

Chart Output: (image not included; a log-log plot of the timings in res, one line per method)

Celius Stingher · Accepted Answer · 2020-07-13 18:11:18Z

Given the data you provide and the small number of columns you are working with, you might find it easier (but not scalable) to simply use the + operator on the columns you wish to join:

pdf['d'] = pdf['a'] + '_' + pdf['b'] + '_' + pdf['c']

It's not scalable if you have 200 columns, but it is certainly faster than the two methods you propose. I tested it on a 30,000-row dataframe:

a = ['a','b','c']
b = ['d','e','f']
c = ['g','h','i']

pdf = pd.DataFrame([a,b,c]*10000, columns=['a','b','c'])

And here are the time results:

Method 1:  0.041734933853149414
Method 2:  0.04217410087585449
Method 3:  0.011157751083374023

Methods 1 and 2 are the ones proposed in the question, and method 3 is the + approach above.
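For the many-columns case mentioned above, writing out each + by hand stops being practical. Two possible ways to drive the join from a list of column names are sketched here; the cols variable is a hypothetical placeholder, and both options assume every selected column contains only strings. No claim is made about which is faster.

```python
from functools import reduce

import pandas as pd

pdf = pd.DataFrame(
    [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]] * 2,
    columns=["a", "b", "c"],
)

cols = list(pdf.columns)  # hypothetical: whichever columns should be joined

# Option 1: fold the "+" operator over the columns, inserting the separator each time.
joined_plus = reduce(lambda left, right: left + "_" + right, (pdf[c] for c in cols))

# Option 2: join each row of the underlying array with str.join.
joined_rows = ["_".join(row) for row in pdf[cols].to_numpy()]

pdf["d"] = joined_rows
print(pdf.head(3))
```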
