Thanks for your help in advance! I'm new to tidymodels (and modeling in general) and am having a hard time identifying what's going wrong in my workflow setup.

I'm running four different models to predict baseball win percentages based on a historical dataset. They are a linear model, elastic net model, random forest model, and XGBoost model. I know all the models work (I have tested them individually), but I am trying to use a workflow to test, cross-validate, and select the best models.

I have two recipes. The basic recipe, used for the random forest and XGBoost models, includes some preprocessing steps (selecting variables, step_zv, step_nzv, step_interact, step_corr, and step_impute_bag). The linear and elastic net models use a recipe that adds a normalization step to those.

After setting up my workflows and grids, when I try to run workflow_map(), I get two errors:

  1. "Error in summary.connection(connection) : invalid connection"
  2. "2 arguments have been tagged for tuning in these components: model_spec. Please use one of the tuning functions (e.g. 'tune_grid()') to optimize them"

My questions:

  1. What does the first error indicate?
  2. As for the second, where should I be adding/incorporating tune_grid() into the workflow?

--

For reference, here is some of the relevant code:

Some initial set up

# Split data
team_split <- initial_split(mlb_final)

# Extract training and testing data
team_train <- training(team_split)
team_test <- testing(team_split)

# Resampling strategy
team_rs <- vfold_cv(team_train)

Model specification

# Random forest model 
mlb_forest <- rand_forest(min_n = tune()) %>% 
  set_engine("ranger",
             importance = "permutation") %>% 
  set_mode("regression")

# Linear model
mlb_linear <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# XGBoost
mlb_xgb <- boost_tree(
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Elastic Net
mlb_elastic <- linear_reg(
  penalty = tune(),    
  mixture = tune()     
) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

I've set up my workflows like this:

linear_workflow <- workflow() |> 
  add_model(mlb_linear) |> 
  add_recipe(normalized_recipe)
  
elastic_workflow <- workflow() |> 
  add_model(mlb_elastic) |> 
  add_recipe(normalized_recipe)
  
rf_workflow <- workflow() |> 
  add_model(mlb_forest) |> 
  add_recipe(basic_recipe)

xgb_workflow <- workflow() |> 
  add_model(mlb_xgb) |> 
  add_recipe(basic_recipe)

And my grids like this:

grid_ctrl <- control_grid(
  save_pred = TRUE,
  parallel_over = NULL,
  save_workflow = TRUE,
  verbose = TRUE
)

rf_grid <- grid_regular(
  min_n(range = c(5, 50)),  # Min number of observations per leaf (tuning parameter)
  mtry(range = c(2, 10)),   # Number of variables to randomly sample at each split
  levels = 5                # Levels of grid search
)

xgb_grid <- grid_regular(
  trees(range = c(100, 500)),      
  min_n(range = c(5, 15)),        
  tree_depth(range = c(3, 6)),    
  learn_rate(range = c(0.05, 0.1), trans = NULL),  # trans = NULL: range is in natural units (default is log10)
  levels = 5
)

elastic_grid <- grid_regular(
  penalty(range = c(-2, 1), trans = log10_trans()),  
  mixture(range = c(0, 1)),                          
  levels = 5
)

linear_grid <- 5

I then combined into normalized and basic workflow sets.

normalized_mlb <- workflow_set(
  preproc = list(normalized = normalized_recipe), 
  models = list(linear = mlb_linear, 
                elastic = mlb_elastic)
  )

basic_mlb <- workflow_set(
  preproc = list(basic = basic_recipe),
  models = list(rf = mlb_forest, 
                xgb = mlb_xgb)
)

And then tried to use workflow_map() for both normalized and basic workflows

lm_models <- normalized_mlb |> 
  workflow_map("fit_resamples",
               seed = 100,
               verbose = TRUE,
               resamples = team_rs, 
               control = grid_ctrl)

basic_models <- basic_mlb  |> 
  workflow_map("fit_resamples",
               seed = 100,
               verbose = TRUE,
               resamples = team_rs, 
               control = grid_ctrl)

The workflows are split into normalized and basic workflows because, initially, I was trying to run them together and running into issues. However, I'm still not sure how to address these errors.

1 Answer

I was able to reproduce the second error using simulated data.

Some of the workflows have tuning parameters and some don't. workflow_map() has the default argument of fn = "tune_grid" but will fall back to "fit_resamples" if the workflow doesn't have tuning parameters.

If you take out "fit_resamples" from your workflow_map() calls, so that the default fn = "tune_grid" is used, the code runs.
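To see which workflows in a set carry tuning parameters, you can count the arguments marked with tune() in each one. The sketch below is only an illustration on simulated data: sim, rec, and the other object names are stand-ins, not the asker's data or recipes.

```r
library(tidymodels)

# Hypothetical stand-in data and recipe (not the asker's mlb_final)
set.seed(100)
sim <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
rec <- recipe(y ~ ., data = sim)

lin_spec  <- linear_reg() |> set_engine("lm")
enet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

wf_set <- workflow_set(
  preproc = list(normalized = rec),
  models  = list(linear = lin_spec, elastic = enet_spec)
)

# Count the arguments marked with tune() in each workflow
n_tuned <- vapply(
  wf_set$wflow_id,
  function(id) {
    wf <- extract_workflow(wf_set, id = id)
    nrow(extract_parameter_set_dials(wf))
  },
  integer(1)
)
n_tuned
```

The elastic net workflow has two tuned arguments (penalty and mixture), which would be consistent with the "2 arguments have been tagged for tuning" message when fit_resamples() is forced on it.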

I can't reproduce the first error:

"Error in summary.connection(connection) : invalid connection"

I assume it is related to parallel processing? If you are working over a remote session, it could be related to a connection problem too.
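If the error does come from parallel processing, one commonly reported cause is a foreach backend that is still registered after its cluster has been stopped. The following is a sketch under that assumption: it presumes a doParallel setup, which the question does not show, and newer versions of tune may use a different parallel backend. Re-registering the sequential backend clears the stale cluster.

```r
library(doParallel)

# Assumed setup: a cluster registered earlier in the session
cl <- parallel::makePSOCKcluster(2)
registerDoParallel(cl)

# ... tuning code would run here ...

# Stopping the cluster leaves foreach pointing at closed connections
parallel::stopCluster(cl)

# Fall back to sequential execution so later calls don't use the dead cluster
foreach::registerDoSEQ()
foreach::getDoParName()
```

Restarting the R session has the same effect if you are unsure what is registered.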

One other thing: we don't have an obvious way of passing a different custom grid to each workflow in workflow_map() (yet). You can attach them with option_add(), though:

basic_models <- basic_mlb |>
  # Attach the custom grids first; workflow_map() reads the options when it runs
  option_add(grid = xgb_grid, id = "basic_xgb") |>
  option_add(grid = rf_grid,  id = "basic_rf") |>
  workflow_map(seed = 100,         #<- removed "fit_resamples"
               verbose = TRUE,
               resamples = team_rs,
               control = grid_ctrl)
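A custom grid is expected to have one column per argument marked with tune(). As a hedged check, using the random forest model and grid from the question (redefined here under the hypothetical name rf_spec so the block is self-contained), you can compare the grid's column names with the model's tuned parameters before tuning:

```r
library(tidymodels)

# Same definition as mlb_forest in the question
rf_spec <- rand_forest(min_n = tune()) |>
  set_engine("ranger", importance = "permutation") |>
  set_mode("regression")

# Same definition as rf_grid in the question
rf_grid <- grid_regular(
  min_n(range = c(5, 50)),
  mtry(range = c(2, 10)),
  levels = 5
)

# Parameter ids the model has marked with tune()
tuned_ids <- extract_parameter_set_dials(rf_spec)$id

# Grid columns with no matching tune() argument, and vice versa
extra_cols   <- setdiff(names(rf_grid), tuned_ids)
missing_cols <- setdiff(tuned_ids, names(rf_grid))
extra_cols
missing_cols
```

Here rf_grid has an mtry column but the model only marks min_n for tuning, so tune_grid() may reject the grid. Either add mtry = tune() to rand_forest() or drop mtry from the grid.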