Pyspark - Get All Parameters Of Models Created With ParamGridBuilder

July 28, 2022 Post a Comment

I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest) depending on different parameters. ParamGridBuilder() allows to specify diff

Solution 1:

Spark 2.4+

SPARK-21088 CrossValidator, TrainValidationSplit should collect all models when fitting - adds support for collecting submodels.

By default this behavior is disabled, but can be controlled using CollectSubModels Param (setCollectSubModels).

valid = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,            
    collectSubModels=True)

model = valid.fit(df)

model.subModels

Spark < 2.4

Long story short you simply cannot get parameters for all models because, similarly to CrossValidator, TrainValidationSplitModel retains only the best model. These classes are designed for semi-automated model selection not exploration or experiments.

What are the parameters of all models?

While you cannot retrieve actual models validationMetrics correspond to input Params so you should be able to simply zip both:

from typing import Dict, Tuple, List, Any
from pyspark.ml.param import Param
from pyspark.ml.tuning import TrainValidationSplitModel

EvalParam = List[Tuple[float, Dict[Param, Any]]]

def get_metrics_and_params(model: TrainValidationSplitModel) -> EvalParam:
    return list(zip(model.validationMetrics, model.getEstimatorParamMaps()))

to get some about relationship between metrics and parameters.

If you need more information you should use Pipeline Params. It will preserve all model which can be used for further processing:

models = pipeline.fit(df, params=paramGrid)

It will generate a list of the PipelineModels corresponding to the params argument:

zip(models, params)

Solution 2:

I think I've found a way to do this. I wrote a function that specifically pulls out hyperparameters for a logistic regression that has two parameters, created with a CrossValidator:

def hyperparameter_getter(model_obj,cv_fold = 5.0):

    enet_list = []
    reg_list  = []

    ## Get metrics

    metrics = model_obj.avgMetrics
    assert type(metrics) is list
    assert len(metrics) > 0

    ## Get the paramMap element

    for x in range(len(model_obj._paramMap.keys())):
    if model_obj._paramMap.keys()[x].name=='estimatorParamMaps':
        param_map_key = model_obj._paramMap.keys()[x]

    params = model_obj._paramMap[param_map_key]

    for i in range(len(params)):
    for k in params[i].keys():
        if k.name =='elasticNetParam':
        enet_list.append(params[i][k])
        if k.name =='regParam':
        reg_list.append(params[i][k])

    results_df =  pd.DataFrame({'metrics':metrics, 
             'elasticNetParam': enet_list, 
             'regParam':reg_list})

    # Because of [SPARK-16831][PYTHON] 
    # It only sums across folds, doesn't average
    spark_version = [int(x) for x in sc.version.split('.')]

    if spark_version[0] <= 2:
    if spark_version[1] < 1:
        results_df.metrics = 1.0*results_df['metrics'] / cv_fold

    return results_df

Python Playground

Pyspark - Get All Parameters Of Models Created With ParamGridBuilder

Solution 1:

Solution 2:

Post a Comment for "Pyspark - Get All Parameters Of Models Created With ParamGridBuilder"