1

I want to join columns of type string, in a pandas dataframe or NumPy ndarray, into a new last column, like this:

        a   b   c                          a   b   c   d
        ----------         --->            ---------------
        a   b   c                          a   b   c   a_b_c             
        d   e   f                          d   e   f   d_e_f
        g   h   i                          g   h   i   g_h_i

I can think of two representative options:

import pandas as pd

# Compose data
a = ['a','b','c']
b = ['d','e','f']
c = ['g','h','i']

pdf = pd.DataFrame([a,b,c], columns=['a','b','c'])


# One option
%%timeit
pdf.loc[:,'d'] = [i for i in map(lambda x: '_'.join([x.a, x.b, x.c]), pdf.itertuples())]
>>>1.08 ms ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Another option
%%timeit
tmp=[]
for i in pdf.itertuples():
    tmp.append('_'.join([i.a, i.b, i.c]))

pdf.loc[:,'d'] = tmp
>>>1.08 ms ± 5.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
 

I understand that there might be too little data to see any difference between these methods, but my question is: is there a smarter method built into NumPy or pandas that I can call? Also, is there any problem with either of the two methods I thought of?

Thank you!

1
  • In my experience, pandas uses NumPy underneath and therefore adds overhead to operations. Nearly all operations are faster in plain NumPy than in pandas. Commented Jul 13, 2020 at 18:31

3 Answers

4

You can try either of the two options below; neither needs an explicit loop:

df['combined'] = df['a'] + '_' + df['b'] + '_' + df['c']

or:

df['combined'] = df[['a', 'b', 'c']].agg('_'.join, axis=1)

   a  b  c combined
0  a  b  c    a_b_c
1  d  e  f    d_e_f
2  g  h  i    g_h_i
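
For anyone who wants to run both options end to end, here is a minimal self-contained sketch using the sample data from the question. The names combined_add and combined_agg are only illustrative, and the equality check is there to confirm that the two options agree on this small frame, not to say anything about their speed.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame(
    [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
    columns=['a', 'b', 'c'],
)

# Option 1: vectorized string concatenation with +
combined_add = df['a'] + '_' + df['b'] + '_' + df['c']

# Option 2: row-wise join via agg (calls '_'.join once per row)
combined_agg = df[['a', 'b', 'c']].agg('_'.join, axis=1)

# Both options should give the same values on this data
same_values = combined_add.tolist() == combined_agg.tolist()

df['combined'] = combined_add
print(df)
print('options agree:', same_values)
```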

2 Comments

The first one looks good, but the latter (agg) performs very poorly.
Explicitly converting to NumPy arrays (i.e. pdf['a'].to_numpy() + '_' + pdf['b'].to_numpy() + '_' + pdf['c'].to_numpy()) yields better time performance (on my local machine it is 20x faster).
1

I'd like to throw another option out there:

pdf['a'].str.cat([pdf['b'], pdf['c']], sep='_')

Output:

0    a_b_c
1    d_e_f
2    g_h_i
Name: a, dtype: object
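
One thing that may matter in practice is how str.cat treats missing values. The sketch below is not from the original data: it assumes a hypothetical missing entry in column b and shows the effect of the na_rep parameter. By default the whole row becomes missing, while na_rep='' keeps the row and leaves an empty slot between the separators.

```python
import pandas as pd

# Hypothetical variant of the sample data with one missing value in column b
pdf = pd.DataFrame({
    'a': ['a', 'd', 'g'],
    'b': ['b', None, 'h'],
    'c': ['c', 'f', 'i'],
})

# Default: a row with any missing value yields a missing result
default = pdf['a'].str.cat([pdf['b'], pdf['c']], sep='_')

# With na_rep, missing values are replaced before concatenation
filled = pdf['a'].str.cat([pdf['b'], pdf['c']], sep='_', na_rep='')

# Store the result as a new last column, as asked in the question
pdf['d'] = filled
print(default)
print(pdf)
```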

Timings

import pandas as pd
from timeit import timeit

# Compose data
a = ['a','b','c']
b = ['d','e','f']
c = ['g','h','i']

pdf = pd.DataFrame([a,b,c], columns=['a','b','c'])

# Each method takes the dataframe under test, d, and returns the joined Series

def met_add(d):
    return d['a'] + '_' + d['b'] + '_' + d['c']

def met_agg_axis1(d):
    return d[['a', 'b', 'c']].agg('_'.join, axis=1)

def met_str_cat(d):
    return d['a'].str.cat([d['b'], d['c']], sep='_')

def met_map_join(d):
    return pd.Series([i for i in map(lambda x: '_'.join([x.a, x.b, x.c]), d.itertuples())])

def met_iter_join(d):
    tmp=[]
    for i in d.itertuples():
        tmp.append('_'.join([i.a, i.b, i.c]))
    return pd.Series(tmp)

def met_numpy_add(d):
    return pd.Series(d['a'].to_numpy() + '_' + d['b'].to_numpy() + '_' + d['c'].to_numpy())

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000],
    columns='met_add met_agg_axis1 met_str_cat met_map_join met_iter_join met_numpy_add'.split(),
    dtype=float
)

for i in res.index:
    # Repeat the 3-row frame i times, keeping the column names a, b, c
    # and a unique index so that str.cat can align the columns
    d = pd.concat([pdf]*i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.plot(loglog=True, figsize=(10,8));

Chart Output:

[Image: log-log plot of the timings for each method]


0

Given the data you provide and the small number of columns you are working with, you might find it easier (but not scalable) to simply use the + operator on the columns you wish to join:

pdf['d'] = pdf['a'] + '_' + pdf['b'] + '_' + pdf['c']

It's not scalable if you have 200 columns, but it sure is faster than the two other methods you propose. To compare them, I used a 30,000-row dataframe:

a = ['a','b','c']
b = ['d','e','f']
c = ['g','h','i']

pdf = pd.DataFrame([a,b,c]*10000, columns=['a','b','c'])

And here are the time results:

Method 1:  0.041734933853149414
Method 2:  0.04217410087585449
Method 3:  0.011157751083374023

Methods 1 and 2 are the ones proposed in the question, and method 3 is the one above.
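
On the scalability concern, one possible workaround is to fold the + operator over a list of column names instead of writing every column out by hand. This is only a sketch: join_columns is a hypothetical helper, not a pandas function, and it assumes every listed column holds strings. I have not timed it against the methods above.

```python
from functools import reduce

import pandas as pd


def join_columns(frame, columns, sep='_'):
    """Hypothetical helper: concatenate string columns row-wise with sep."""
    # Fold "left + sep + right" over the selected columns, left to right
    return reduce(lambda left, right: left + sep + right,
                  (frame[name] for name in columns))


pdf = pd.DataFrame(
    [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
    columns=['a', 'b', 'c'],
)

# The column list could just as well hold 200 names
pdf['d'] = join_columns(pdf, ['a', 'b', 'c'])
print(pdf)
```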
