I have a large data set containing around 500k observations. It has a string variable that I want to create embeddings for. I used the OpenAI API to create the embeddings and, because of the large number of observations, I used their script for parallel requests:

https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

Everything worked fine, but I'm struggling to load the results into a pandas DataFrame. The jsonl file with the results has the following structure, with each line corresponding to one of the 500k observations:

[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"}, {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}]

[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 2"}, {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}]

Now I want to read these results into a pandas DataFrame with the following structure: one variable that contains the "INPUT STRING" and 1536 additional variables that contain the embedding values.

I'm new to Python and JSON files. I usually work with CSV files and R.

I tried to use the read_json function from pandas, but that did not work:

import pandas as pd
openai_results = pd.read_json("results.jsonl", lines=True)

But this gives me a data set with only 2 variables. For the first observation, for example, the first variable contains {"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"} and the second variable contains {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}

1 Answer

You can use something like this:

import pandas as pd

df = pd.read_json('your_file.json', lines=True)
df
'''
   0                                                  1
0  {'model': 'text-embedding-ada-002', 'input': '...  {'object': 'list', 'data': [{'object': 'embedd...
1  {'model': 'text-embedding-ada-002', 'input': '...  {'object': 'list', 'data': [{'object': 'embedd...
'''

Access values:

df["input"] = df[0].str["input"]  # request dict -> input string
# Equivalent: df[1].apply(lambda x: x["data"][0]["embedding"])
df["embedding"] = df[1].str["data"].str[0].str["embedding"]
df = df[["input", "embedding"]]  # keep only the two new columns

Out:

               input           embedding
0  INPUT STRING NR 1  [1, 2, 3, 4, 1536]
1  INPUT STRING NR 2  [1, 2, 3, 4, 1536]
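
If holding the intermediate two-column frame for 500k rows is a concern, the same two fields can also be pulled out line by line with the standard json module. This is only a sketch: the two sample lines are shortened, made-up stand-ins for the real file, and iter_records is a hypothetical helper name, not part of pandas or the OpenAI script.

```python
import io
import json

# Two shortened sample lines in the same layout as the results file
# (real embeddings have 1536 values; these numbers are made up).
sample = io.StringIO(
    '[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"}, '
    '{"object": "list", "data": [{"object": "embedding", "index": 0, '
    '"embedding": [0.1, 0.2, 0.3]}]}]\n'
    '[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 2"}, '
    '{"object": "list", "data": [{"object": "embedding", "index": 0, '
    '"embedding": [0.4, 0.5, 0.6]}]}]\n'
)


def iter_records(lines):
    """Yield (input, embedding) pairs from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # record[0] is the request, record[1] is the API response.
        yield record[0]["input"], record[1]["data"][0]["embedding"]


records = list(iter_records(sample))
print(records)
```

With the real file, an open file handle can be passed in place of sample, and the resulting list can be handed to pd.DataFrame(records, columns=["input", "embedding"]). A line whose second element does not have the expected shape would raise an error here, so such lines would need separate handling.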

If you want one row per embedding value (long format), use explode():

df = df.explode("embedding")
df
'''
               input embedding
0  INPUT STRING NR 1         1
0  INPUT STRING NR 1         2
0  INPUT STRING NR 1         3
0  INPUT STRING NR 1         4
0  INPUT STRING NR 1      1536
1  INPUT STRING NR 2         1
1  INPUT STRING NR 2         2
1  INPUT STRING NR 2         3
1  INPUT STRING NR 2         4
1  INPUT STRING NR 2      1536
'''
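
The question asks for one column per embedding dimension rather than one row per value. A possible way to get that wide layout, starting from the two-column frame as it is before explode(), is sketched here. The toy frame uses three made-up values per embedding instead of 1536, and the emb_ column prefix is just an illustrative choice.

```python
import pandas as pd

# Toy stand-in for the frame with "input" and "embedding" columns
# (real embeddings have 1536 values; these numbers are made up).
df = pd.DataFrame({
    "input": ["INPUT STRING NR 1", "INPUT STRING NR 2"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Turn each list into its own row of columns; reuse the original index
# so the concat below lines up row by row.
emb = pd.DataFrame(df["embedding"].tolist(), index=df.index).add_prefix("emb_")

# Final layout: the input string followed by one column per dimension.
wide = pd.concat([df[["input"]], emb], axis=1)
print(wide)
```

For the real data this should give a frame with 1 + 1536 columns, which can then be written out with wide.to_csv(...) if a CSV is easier to work with.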