
How Does Sklearn Calculate the Area Under the ROC Curve for Two Binary Inputs?

I noticed that sklearn has the following function: sklearn.metrics.roc_auc_score() which takes as input ground_truth and prediction. For example, ground_truth = [1,1,0,0,0] pred

Solution 1:

You're correct that with binary predictions you only get a single operating point on the curve between its two endpoints. I didn't understand it myself, so I ran the code with a ton of print statements, first for the sklearn tutorial and then with a purely binary example. All the magic happens in the private helper sklearn.metrics._binary_clf_curve (the exact module it lives in differs between sklearn versions).

The "thresholds" are the distinct prediction scores. For any binary classifier that outputs purely ones and zeros you get two thresholds, 1 and 0 (they're sorted internally from highest to lowest). At the 1 threshold, a prediction score >= 1 counts as positive and anything below it (only 0 in this case) counts as negative, and the TP and FP rates are calculated from that. In all cases the last threshold classifies everything as positive, so the TP and FP rates are both 1. roc_curve also prepends a starting point at (0, 0), so the "curve" for binary predictions is just two line segments: (0, 0) to (FPR, TPR) to (1, 1). The AUC is the trapezoidal area under those segments, which works out to (TPR + TNR) / 2.
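To make the threshold sweep concrete, here is a minimal pure-Python sketch of the procedure described above. It is not sklearn's actual code, and the helper names roc_points and trapezoid_area are made up for illustration. The question's prediction vector is cut off, so y_pred below is a hypothetical stand-in.

```python
def roc_points(y_true, y_score):
    """Sweep the distinct scores from highest to lowest and collect (FPR, TPR) points."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]  # starting point: nothing is predicted positive
    for thr in sorted(set(y_score), reverse=True):
        # A sample counts as predicted positive when its score is >= the threshold.
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2 for i in range(len(x) - 1))


y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]  # hypothetical hard 0/1 predictions, not from the question

fpr, tpr = roc_points(y_true, y_pred)
auc = trapezoid_area(fpr, tpr)

print(list(zip(fpr, tpr)))  # three points: (0, 0), (1/3, 0.5), (1, 1)
print(auc)                  # about 0.583, i.e. (TPR + TNR) / 2 = (1/2 + 2/3) / 2
```

With these inputs, sklearn.metrics.roc_auc_score(y_true, y_pred) should report the same area, since it integrates the curve with the trapezoidal rule.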

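It appears, then, that to generate a proper ROC curve for a sklearn classifier you'd pass the scores from clf.predict_proba() (the column for the positive class) rather than the hard labels from predict(). predict_log_proba() should not make any difference to the AUC: the logarithm is monotonic, and the ROC curve depends only on how the scores rank the samples, not on their actual values.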
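As a rough check of that point, the sketch below fits a classifier on synthetic data and compares the three kinds of output. The dataset, the choice of LogisticRegression, and all parameter values are arbitrary assumptions for illustration, not something from the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (arbitrary sizes and seed, chosen only for the demo).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

hard = clf.predict(X_test)                       # labels: only 0s and 1s
proba = clf.predict_proba(X_test)[:, 1]          # score for the positive class
log_proba = clf.predict_log_proba(X_test)[:, 1]  # same ranking, log scale

auc_hard = roc_auc_score(y_test, hard)
auc_proba = roc_auc_score(y_test, proba)
auc_log = roc_auc_score(y_test, log_proba)

# Number of points on each curve: hard labels give only three.
n_points_hard = len(roc_curve(y_test, hard)[0])
n_points_proba = len(roc_curve(y_test, proba)[0])

print("AUC from predict():          ", auc_hard)
print("AUC from predict_proba():    ", auc_proba)
print("AUC from predict_log_proba():", auc_log)
print("curve points:", n_points_hard, "vs", n_points_proba)
```

The hard labels collapse the curve to a single operating point, while the probability scores trace out many thresholds. The two probability-based AUCs should agree because they rank the samples identically.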
