How to use Machine Learning to find the pattern customer profile? [closed]

Question

Closed. This question is not about programming or software development. It is not currently accepting answers.

This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.

Closed last year.

I have a dataset with personal characteristics of customers who purchase from a fictional company. Initially, I don't have any target variable, only their characteristics. My goal is to find a pattern, which may not necessarily be the most frequent characteristic in each column. Is it possible to do this with RandomForest, for example? Or should I use another technique?

The dataset has a structure similar to the following. The columns are all in object format and there are some NaN values represented as 'Blank':

Date            Name     Salary      Position            Age
'05/10/2023'   'Daniel'  '10,000'    'IT'                32
'05/12/2024'   'John'    '9,000'     'Blank'             27
'03/01/2023'   'Niel'    'Blank'     'Data Scients'      21
'03/01/2023'   'Isa'     '10,000'    'Engineer'          51
'05/10/2023'   'Ana'     '11,000'    'Data Scients'      52
'05/12/2024'   'Ian'     '9,500'     'Doctor'            48
'03/01/2023'   'Fred'    'Blank'     'IT'                21
'03/01/2023'   'Carol'   '15,000'    'Blank'             30
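
A minimal sketch of getting this sample into a usable shape, assuming pandas is available. The names `raw`, `clean_customers`, and `blank_token` are illustrative and not part of the original dataset; the `Date` column is left as text because the sample does not say whether it is day-first or month-first.

```python
import pandas as pd

# Sample rows copied from the table above; every column starts as strings (object dtype).
raw = pd.DataFrame({
    "Date": ["05/10/2023", "05/12/2024", "03/01/2023", "03/01/2023",
             "05/10/2023", "05/12/2024", "03/01/2023", "03/01/2023"],
    "Name": ["Daniel", "John", "Niel", "Isa", "Ana", "Ian", "Fred", "Carol"],
    "Salary": ["10,000", "9,000", "Blank", "10,000",
               "11,000", "9,500", "Blank", "15,000"],
    "Position": ["IT", "Blank", "Data Scients", "Engineer",
                 "Data Scients", "Doctor", "IT", "Blank"],
    "Age": ["32", "27", "21", "51", "52", "48", "21", "30"],
})


def clean_customers(frame, blank_token="Blank"):
    """Return a copy with the blank token turned into NaN and numeric columns parsed."""
    out = frame.copy()
    # Cells equal to the blank token become NaN.
    out = out.mask(out == blank_token)
    # Strip thousands separators before converting Salary to numbers.
    out["Salary"] = pd.to_numeric(out["Salary"].str.replace(",", "", regex=False))
    out["Age"] = pd.to_numeric(out["Age"])
    return out


df = clean_customers(raw)
print(df.dtypes)
print(df.isna().sum())
```

After this step `Salary` and `Age` are numeric and the 'Blank' cells are real missing values, so they can be counted, imputed, or dropped explicitly.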

I'm thinking of something that returns an output, for example, stating the characteristics that form the most standard profile, such as:

The most standard profile is: Salary x, Position y, and Age z.

I thought about using clustering, but I don't believe it is the best method (the output for the salary, for example, was a simple average). I believe the best approach would be to create a profile that may not necessarily exist and is based on studying the pattern of each variable (Salary, Position, and Age).

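import pandas as pd
from sklearn.cluster import KMeans

# Convert Salary strings such as '10,000' to numbers; 'Blank' becomes NaN
df['Salary'] = pd.to_numeric(df['Salary'].str.replace(',', ''), errors='coerce')

# Encode categorical variables
df['Position'] = pd.Categorical(df['Position']).codes

# Perform clustering (KMeans rejects NaN, so drop incomplete rows first)
features = df[['Salary', 'Position', 'Age']].dropna()
kmeans = KMeans(n_clusters=1, random_state=42)
kmeans.fit(features)

# Get the centroid of the cluster
centroid = kmeans.cluster_centers_[0]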

Is there a better way to do this? Would NLP or RandomForest be an option?

Please see the Note in the tag wiki for machine-learning.

– Ben Reiniger, Commented Aug 30, 2024 at 23:16

Javeria Asim · Accepted Answer · 2024-08-30 13:42:48Z

0

To find the most "standard" profile without a target variable, clustering is a good idea, but KMeans with a single cluster only returns the mean of each column, which oversimplifies things. Instead, try KMeans with multiple clusters (e.g., 3-5) and then analyze the centroids to find a representative profile. Each centroid gives you an average profile for that cluster.
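
As a rough sketch of the multi-cluster idea, assuming scikit-learn and a cleaned copy of the question's sample rows: the numeric columns are standardized, `Position` is one-hot encoded, and the centroid of the largest cluster is reported as the "standard" profile. The median imputation, the `"Unknown"` category, the choice of 3 clusters, and the `profile` dictionary are all assumptions made for illustration, not something the dataset dictates.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Cleaned version of the question's sample rows (NaN/None mark the 'Blank' cells).
df = pd.DataFrame({
    "Salary": [10000, 9000, np.nan, 10000, 11000, 9500, np.nan, 15000],
    "Position": ["IT", None, "Data Scients", "Engineer",
                 "Data Scients", "Doctor", "IT", None],
    "Age": [32, 27, 21, 51, 52, 48, 21, 30],
})

numeric_cols = ["Salary", "Age"]
# Assumption: fill missing numbers with the column median and
# treat a missing Position as its own category.
numeric = df[numeric_cols].fillna(df[numeric_cols].median())
position = df["Position"].fillna("Unknown")

# Put the numeric columns on a common scale and one-hot encode Position.
scaler = StandardScaler()
scaled = scaler.fit_transform(numeric)
dummies = pd.get_dummies(position).to_numpy(dtype=float)
X = np.hstack([scaled, dummies])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# Take the cluster with the most customers as the "standard" group.
largest = np.bincount(labels).argmax()
in_largest = labels == largest

# Map the numeric part of that centroid back to the original units.
center_scaled = kmeans.cluster_centers_[[largest], :len(numeric_cols)]
center = scaler.inverse_transform(center_scaled)[0]

profile = {
    "Salary": float(center[0]),
    "Age": float(center[1]),
    # Most frequent Position among the members of the largest cluster.
    "Position": position[in_largest].mode().iloc[0],
    "Share": float(in_largest.mean()),
}
print(profile)
```

With only eight rows the clusters are not meaningful; on real data, the number of clusters would need to be checked, for example with silhouette scores.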

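Alternatively, you could use Principal Component Analysis (PCA). PCA identifies the directions along which the data vary the most, so it shows which characteristics differ between customers rather than which stay the same; the directions with the least variance are the ones that hint at the "standard" features across the dataset.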
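
A small sketch of what PCA reports, assuming scikit-learn and only the numeric columns of the question's sample (the median imputation is an assumption). It prints the share of variance per component and the loadings of each column; it does not output a profile by itself.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric columns from the question's sample; NaN marks the 'Blank' salaries.
df = pd.DataFrame({
    "Salary": [10000, 9000, np.nan, 10000, 11000, 9500, np.nan, 15000],
    "Age": [32, 27, 21, 51, 52, 48, 21, 30],
})

# Assumption: fill missing values with the column median before scaling.
numeric = df.fillna(df.median())
scaled = StandardScaler().fit_transform(numeric)

pca = PCA().fit(scaled)

# Share of total variance captured by each component, largest first.
component_names = [f"PC{i + 1}" for i in range(pca.n_components_)]
explained = pd.Series(pca.explained_variance_ratio_, index=component_names)

# Loadings: how strongly each original column contributes to each component.
loadings = pd.DataFrame(pca.components_.T, index=numeric.columns, columns=component_names)

print(explained.round(3))
print(loadings.round(3))
```

With just two standardized columns the loadings carry little information; the output becomes more useful when the dataset has many numeric characteristics.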

RandomForest is more about classification or regression with a target variable, so it's less useful here. For an NLP approach, if you have a lot of text data, you could try Topic Modeling (like LDA) to find patterns in descriptions or job titles.

So, stick with KMeans clustering or PCA for now!

answered Aug 30, 2024 at 13:42

Javeria Asim


Richard Keene · Accepted Answer · 2024-08-30 13:57:33Z

0

Maybe Regression or Principal Component Analysis. Regression lets you find which variables or combinations of variables are most significant and which don't matter.

(Note: "Regression" is because the original invention of the math was to show how species 'regressed' back to their perfect form. This was before evolution was accepted and it was thought that individuals born with deviations from the perfect form of the species were culled out by natural selection. So the 'perfect' leg length vs. overall height would be found from regression of many examples of the species. Darwin's genius was to see natural selection as a creative force, not a stabilizing force.)

answered Aug 30, 2024 at 13:57

Richard Keene
