I have a dataset with the personal characteristics of customers who purchase from a fictional company. I don't have a target variable, only the characteristics themselves. My goal is to find a pattern, which need not be the most frequent value in each column. Is it possible to do this with RandomForest, for example, or should I use another technique?
The dataset has a structure similar to the following. All columns have object dtype, and missing values are stored as the string 'Blank':
Date          Name      Salary    Position        Age
'05/10/2023'  'Daniel'  '10,000'  'IT'            32
'05/12/2024'  'John'    '9,000'   'Blank'         27
'03/01/2023'  'Niel'    'Blank'   'Data Scients'  21
'03/01/2023'  'Isa'     '10,000'  'Engineer'      51
'05/10/2023'  'Ana'     '11,000'  'Data Scients'  52
'05/12/2024'  'Ian'     '9,500'   'Doctor'        48
'03/01/2023'  'Fred'    'Blank'   'IT'            21
'03/01/2023'  'Carol'   '15,000'  'Blank'         30
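For reproducibility, the sample rows can be rebuilt and cleaned with a minimal sketch like the one below. The names `customers_raw`, `customers`, and `clean` are only illustrative, and it assumes that Age is stored as strings like the other columns and that 'Blank' is the only missing-value placeholder.

```python
import pandas as pd

# Sample rows copied from the table above; every column is stored as strings (object dtype).
customers_raw = pd.DataFrame({
    "Date": ["05/10/2023", "05/12/2024", "03/01/2023", "03/01/2023",
             "05/10/2023", "05/12/2024", "03/01/2023", "03/01/2023"],
    "Name": ["Daniel", "John", "Niel", "Isa", "Ana", "Ian", "Fred", "Carol"],
    "Salary": ["10,000", "9,000", "Blank", "10,000", "11,000", "9,500", "Blank", "15,000"],
    "Position": ["IT", "Blank", "Data Scients", "Engineer",
                 "Data Scients", "Doctor", "IT", "Blank"],
    "Age": ["32", "27", "21", "51", "52", "48", "21", "30"],  # assumption: strings
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'Blank' turned into NaN and numeric columns parsed."""
    out = frame.copy()
    # Strip thousands separators; 'Blank' cannot be parsed and is coerced to NaN.
    out["Salary"] = pd.to_numeric(
        out["Salary"].str.replace(",", "", regex=False), errors="coerce"
    )
    out["Age"] = pd.to_numeric(out["Age"], errors="coerce")
    # mask() replaces the 'Blank' placeholder with NaN in the categorical column.
    out["Position"] = out["Position"].mask(out["Position"] == "Blank")
    return out


customers = clean(customers_raw)
print(customers.dtypes)
print(customers.isna().sum())
```

The Date column is left as text here because the sample does not make clear whether it is day/month or month/day.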
I'm thinking of something that returns an output, for example, stating the characteristics that form the most standard profile, such as:
The most standard profile is: Salary x, Position y, and Age z.
I thought about using clustering, but I don't believe it is the best method (the output for the salary, for example, was just the simple average). I believe the best approach would be to build a profile that need not match any actual customer and is based on studying the pattern of each variable (Salary, Position, and Age).
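import pandas as pd
from sklearn.cluster import KMeans
# Convert Salary ('10,000') and Age to numbers; 'Blank' is coerced to NaN
df['Salary'] = pd.to_numeric(df['Salary'].str.replace(',', ''), errors='coerce')
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Encode categorical variables ('Blank' is treated as missing and gets code -1)
df['Position'] = pd.Categorical(df['Position'].mask(df['Position'] == 'Blank')).codes
# Perform clustering on rows without NaN (KMeans does not accept missing values)
features = df[['Salary', 'Position', 'Age']].dropna()
kmeans = KMeans(n_clusters=1, n_init=10, random_state=42)
kmeans.fit(features)
# Get the centroid of the cluster (with a single cluster this is just the column means)
centroid = kmeans.cluster_centers_[0]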
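To make the per-variable idea concrete, a baseline profile could be sketched as follows: the median for numeric columns and the most frequent value for categorical ones. This is only one possible reading of "the pattern of each variable", not a recommended method. The `sample` frame and the `typical_profile` helper are hypothetical, and the data is assumed to be already cleaned ('Blank' converted to missing values, Salary parsed to numbers).

```python
import pandas as pd

# Small cleaned sample (assumption: 'Blank' already replaced by None/NaN).
sample = pd.DataFrame({
    "Salary": [10000, 9000, None, 10000, 11000, 9500, None, 15000],
    "Position": ["IT", None, "Data Scients", "Engineer",
                 "Data Scients", "Doctor", "IT", None],
    "Age": [32, 27, 21, 51, 52, 48, 21, 30],
})


def typical_profile(frame: pd.DataFrame) -> dict:
    """Median for numeric columns, most frequent value otherwise (NaN ignored)."""
    profile = {}
    for col in frame.columns:
        values = frame[col].dropna()
        if pd.api.types.is_numeric_dtype(values):
            profile[col] = float(values.median())
        else:
            # mode() returns tied values in sorted order; keep the first one.
            profile[col] = values.mode().iloc[0]
    return profile


profile = typical_profile(sample)
print(f"The most standard profile is: Salary {profile['Salary']}, "
      f"Position {profile['Position']}, and Age {profile['Age']}.")
```

Unlike the single-cluster centroid, this treats each column independently and never averages category codes, but it also ignores any relationship between the columns.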
Is there a better way to do this? Would NLP or RandomForest be an option?