Optimizing Over Multiple Objectives

Chocolate offers multi-objective optimization. This means you can optimize precision and recall without averaging them into an F1 score, or even optimize the precision and inference time of a model! Let's go straight to how to do that. First, as always, we import the necessary modules.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

import chocolate as choco

Note that we imported both the sklearn.metrics.precision_score() and sklearn.metrics.recall_score() metrics. The train function is almost identical to the one in the realistic tutorial, except that it returns two losses.

def score_gbt(X, y, params):
    # Hold out 20% of the data for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    gbt = GradientBoostingClassifier(**params)
    gbt.fit(X_train, y_train)
    y_pred = gbt.predict(X_test)

    # Chocolate minimizes losses, so return the negated scores, one per objective.
    return -precision_score(y_test, y_pred), -recall_score(y_test, y_pred)

Is that it? Yes! This is the only modification required to optimize over multiple objectives (in addition to using a multi-objective capable search algorithm).
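
If you also wanted to trade prediction quality against inference time, as mentioned at the top of this tutorial, simply return one more loss. Here is a minimal sketch reusing the imports above; the timing approach is our own choice, not something prescribed by Chocolate, and you should check that your search algorithm supports the number of objectives you return.

import time

def score_gbt_with_time(X, y, params):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    gbt = GradientBoostingClassifier(**params)
    gbt.fit(X_train, y_train)

    # Time only the prediction step; Chocolate minimizes, so a duration in
    # seconds can be used directly as a third loss.
    start = time.perf_counter()
    y_pred = gbt.predict(X_test)
    inference_time = time.perf_counter() - start

    return -precision_score(y_test, y_pred), -recall_score(y_test, y_pred), inference_time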

Then we will load our dataset (or make it).

X, y = make_classification(n_samples=80000, random_state=1)

And just as in the Basics tutorial, we’ll decide where the data is stored and define the search space for the algorithm. We will optimize over a mix of continuous and discrete variables.

conn = choco.SQLiteConnection(url="sqlite:///db.db")
s = {"learning_rate" : choco.uniform(0.001, 0.1),
     "n_estimators"  : choco.quantized_uniform(25, 525, 25),
     "max_depth"     : choco.quantized_uniform(2, 10, 2),
     "subsample"     : choco.quantized_uniform(0.7, 1.05, 0.05)}

Finally, we will define our search algorithm, request a set of parameters to test, compute the losses for that set, and report them to the database.

sampler = choco.MOCMAES(conn, s, mu=5)
token, params = sampler.next()
losses = score_gbt(X, y, params)
sampler.update(token, losses)
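
In practice you would run this script many times, possibly from several processes sharing the same database, with each run evaluating a single candidate. For a quick local experiment you could also loop over the exact same three calls; a small sketch:

for _ in range(50):
    token, params = sampler.next()
    losses = score_gbt(X, y, params)
    sampler.update(token, losses)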

Once this script has run a couple of times, the results can be retrieved. Obviously, we cannot find THE ULTIMATE configuration in our database, since multi-objective optimization is all about compromise. In fact, the result of the optimization is a Pareto front containing all non-dominated compromises between the objectives. You can easily retrieve these compromises using the results_as_dataframe() method of your connection. To find the Pareto-optimal solutions, use the chocolate.mo.argsortNondominated() function as follows.

import chocolate as choco
from chocolate.mo import argsortNondominated

conn = choco.SQLiteConnection(url="sqlite:///db.db")
results = conn.results_as_dataframe()
losses = results[["_loss_0", "_loss_1"]].values
first_front = argsortNondominated(losses, len(losses), first_front_only=True)
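
For intuition, a candidate is non-dominated when no other candidate is at least as good on every objective and strictly better on at least one. With losses that are minimized, a dominance check between two loss vectors can be written as follows (an illustrative helper, not part of Chocolate):

def dominates(a, b):
    """Return True if loss vector a dominates loss vector b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))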

This front can be plotted using matplotlib.

import matplotlib.pyplot as plt

# The plotted losses are the negated precision and recall, so better candidates
# lie toward the lower left.
plt.scatter(losses[:, 0], losses[:, 1], label="All candidates")
plt.scatter(losses[first_front, 0], losses[first_front, 1], label="Optimal candidates")
plt.xlabel("precision")
plt.ylabel("recall")
plt.legend()

plt.show()

And we get this nice graph:

[Image: ../_images/precision_recall_pareto.png — Pareto front of all candidates and the optimal candidates]
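
The front only tells us which compromises are achievable; usually we also want the hyperparameters that produced them. Assuming the parameter columns in the dataframe use the same names as the search space keys, the Pareto-optimal configurations can be inspected like this:

optimal = results.iloc[first_front]
print(optimal[["learning_rate", "n_estimators", "max_depth", "subsample", "_loss_0", "_loss_1"]])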