1

I have a DataFrame that has a json array as the values of one column. I want to select one of those as the values of the column and get rid of the rest. I have gotten the desired values into a series but I can't figure out how to join them back to the DataFrame in place of the existing column:

import json
from pandas.io.json import json_normalize

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)

lst = []
for d in df['data']:
    d = pd.read_json(d, lines=True)['uid'].values[0]
    lst.append(d)
s = pd.DataFrame(lst)
df['data'] = s
print(s)
print(df)  

returns

     0
0  100
1  100
2  200
3  300
       bank_account  data
Alice           101   NaN
Alice           102   NaN
Bob             201   NaN
Carol           301   NaN

currently and I don't know why the data column shows all nan values. Any help appreciated.

Updated Issue: Some of the rows have lists of json arrays instead of just one. Here is what I have so far:

import json
from pandas.io.json import json_normalize

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '[{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},{"uid": 150, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "jer"}]',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)

# df["data"] = df["data"].apply(lambda x: pd.read_json(x, lines=True)["uid"][0])

df["data"] = df["data"].apply(lambda array : (",".join(list(map(lambda x : pd.read_json(x, lines=True)["uid"][0], array),(df['data'])))))
print(df)

2 Answers 2

1

This works for me:

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)

df["data"] = df["data"].apply(lambda x: pd.read_json(x, lines=True)["uid"][0])

Your code does not work because df and s have different indices. If you want to fix your code set df['data'] = s[0].values (instead of df['data'] = s) before your two print statements.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks! Using apply looks much cleaner than a loop.
1

As @rachwa notes, the issue is that the indexes don't match because the index of s is numbers while the index of df is names. If you assign lst directly instead of casting it to a DataFrame, you will get the desired outcome, i.e.

df['data'] = lst

would work as expected.

You could also use json.loads instead of read_json (it should be faster):

import json
df['data'] = [json.loads(d)['uid'] for d in df['data']]

Output:

       bank_account  data
Alice           101   100
Alice           102   100
Bob             201   200
Carol           301   300

3 Comments

Hi @enke, thanks for the suggestion! I ran this on the bigger dataset (the one I posted was a test set) and realized that some of the values are lists of json arrays. Any Idea how to tackle that. I have updated the original post with a sample row to create the issue as well as my best stab at it so far.
@JustinBenfit maybe use a nested list comprehension like: [[json.loads(d)['uid'] for d in li] for li in df['data']]
I tried this (maybe I have it wrong) df["data"] = [[json.loads(d)['uid'] for d in li] for li in df['data']] and got JSONDecodeError: Expecting value: line 1 column 2 (char 1)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.