What Is The Difference Between Cross_val_score With Scoring='roc_auc' And Roc_auc_score?
Solution 1:
This happens because you supplied the predicted labels instead of probabilities to roc_auc_score. That function expects a score, not a classified label. Try this instead:
print(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
It should give a result similar to the one you got from cross_val_score. Refer to this post for more info.
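For reference, here is a minimal self-contained sketch of the same contrast (the toy dataset and the RandomForestClassifier setup are illustrative assumptions, not the asker's actual code):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary problem; substitute your own X, y.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Hard 0/1 predictions throw away the ranking information ROC AUC needs...
print(roc_auc_score(y_test, rf.predict(X_test)))
# ...while class-1 probabilities give the kind of score the metric expects.
print(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))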
Solution 2:
I just ran into a similar issue here. The key takeaway there was that cross_val_score uses the KFold strategy with default parameters for making the train-test splits, which means splits into consecutive chunks rather than shuffled ones. train_test_split, on the other hand, does a shuffled split.
The solution is to make the split strategy explicit and specify shuffling, like this:
from sklearn.model_selection import KFold, cross_val_score

shuffle = KFold(n_splits=3, shuffle=True)
scores = cross_val_score(rf, X, y, cv=shuffle, scoring='roc_auc')
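To see what that default KFold behaviour looks like in practice, here is a small self-contained sketch (my own toy labels, not from the original answer) that prints the class make-up of each test fold with and without shuffling:
import numpy as np
from sklearn.model_selection import KFold

# Toy labels that are sorted by class, as often happens in real datasets.
y_demo = np.array([0] * 15 + [1] * 15)
X_demo = np.arange(30).reshape(-1, 1)

for shuffle in (False, True):
    cv = KFold(n_splits=3, shuffle=shuffle, random_state=0 if shuffle else None)
    counts = [np.bincount(y_demo[test], minlength=2) for _, test in cv.split(X_demo)]
    print("shuffle=%s -> class counts per test fold: %s" % (shuffle, counts))

# Without shuffling, the first and last folds contain a single class, which
# makes ROC AUC undefined on those folds; shuffling mixes the classes.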
Solution 3:
I ran into this problem myself and, after digging a bit, found the answer. Sharing for the love.
There are actually two and a half problems:
- you need to use the same KFold (the same train/test splits) to compare the scores;
- you need to feed probabilities into roc_auc_score (using the predict_proba() method). BUT some estimators (like SVC) do not have a predict_proba() method; in that case you use the decision_function() method instead.
Here's a full example:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Let's use the Digits dataset
digits = load_digits(n_class=4)
X, y = digits.data, digits.target
y[y == 2] = 0  # Increase problem difficulty
y[y == 3] = 1  # even more
Using two estimators
LR = LogisticRegression()
SVM = LinearSVC()
Split the train/test set, but keep the splitter in a variable we can reuse.
fourfold = StratifiedKFold(n_splits=4, shuffle=True, random_state=4)
Feed it to GridSearchCV and save the scores. Note we are passing fourfold.
gs = GridSearchCV(LR, param_grid={}, cv=fourfold, scoring='roc_auc', return_train_score=True)
gs.fit(X, y)
gskeys = ['split%d_test_score' % k for k in range(4)]  # per-fold test scores in cv_results_
gs_scores = np.array([gs.cv_results_[k][0] for k in gskeys])
Feed it to cross_val_score and save the scores.
cv_scores = cross_val_score(LR, X, y, cv=fourfold, scoring='roc_auc')
Sometimes you want to loop over the folds yourself and compute several different scores; in that case, this is what you use.
loop_scores = list()
for idx_train, idx_test in fourfold.split(X, y):
    X_train, y_train, X_test, y_test = X[idx_train], y[idx_train], X[idx_test], y[idx_test]
    LR.fit(X_train, y_train)
    y_prob = LR.predict_proba(X_test)
    auc = roc_auc_score(y_test, y_prob[:, 1])
    loop_scores.append(auc)
Do we have the same scores across the board?
print([((a == b) and (b == c)) for a, b, c in zip(gs_scores, cv_scores, loop_scores)])
>>> [True, True, True, True]
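The exact == comparison works here only because all three paths fit identical models on identical folds; if anything differs even slightly, a tolerance-based check is more robust. A tiny sketch using the arrays built above:
import numpy as np

# Compare the three sets of per-fold AUCs up to floating-point tolerance.
print(np.allclose(gs_scores, cv_scores) and np.allclose(cv_scores, loop_scores))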
BUT, sometimes our estimator does not have a predict_proba() method. So, according to this example, we do this:
svm_scores = list()
for idx_train, idx_test in fourfold.split(X, y):
    X_train, y_train, X_test, y_test = X[idx_train], y[idx_train], X[idx_test], y[idx_test]
    SVM.fit(X_train, y_train)
    y_prob = SVM.decision_function(X_test)
    prob_pos = (y_prob - y_prob.min()) / (y_prob.max() - y_prob.min())  # scale to [0, 1]
    auc = roc_auc_score(y_test, prob_pos)
    svm_scores.append(auc)
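Incidentally, the min-max rescaling is harmless but not strictly needed: ROC AUC depends only on the ordering of the scores, so the raw decision_function output can be passed to roc_auc_score directly. A quick check, reusing the variables left over from the last fold of the loop above:
import numpy as np
from sklearn.metrics import roc_auc_score

# ROC AUC is rank-based, so a strictly increasing rescaling does not change it.
print(np.isclose(roc_auc_score(y_test, y_prob),
                 roc_auc_score(y_test, prob_pos)))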