Human Decisions and Machine Predictions

Jure Leskovec (@jure). Including joint work with Himabindu Lakkaraju, Jon Kleinberg, Jens Ludwig, and Sendhil Mullainathan.

Humans vs. Machines

§ Given the same data/features X, machines tend to beat humans:
§ Games: Chess, AlphaGo, Poker
§ Image classification
§ Question answering (IBM Watson)
§ But humans tend to see more than machines do. Humans make decisions based on (X, Z)!

Humans Make Decisions

Humans make decisions based on (X, Z)!
§ The machine is trained on P(Y|X)
§ But humans use P(Y|X, Z) to make decisions
§ Could this be a problem? Yes! Why? Because the data is selectively labeled, so cross-validation can severely overestimate model performance.

Plan for the Talk

1) Can machines make better decisions than humans?

2) Why do humans make mistakes?

3) Can machines help humans make better decisions?

Example Problem: Criminal Court Bail Assignment

Example Decision Problem

U.S. police make >12M arrests per year. Q: Where do people wait for trial?
Release vs. detain is high stakes:
§ Pre-trial detention spells average 2-3 months (can be up to 9-12 months)
§ Nearly 750,000 people are in jails in the US
§ Consequential for jobs and families as well as crime

Judge's Decision Problem

The judge must decide whether to release (bail) or not (jail).
Outcomes: a defendant out on bail can behave badly:
§ Fail to appear at trial
§ Commit a non-violent crime
§ Commit a violent crime
The judge is making a prediction!

Bail as a Decision Problem

§ Not assessing guilt on this crime
§ Not a punishment
§ The judge can only pay attention to failure-to-appear and crime risk

The ML Task

We want to build a predictor that, based on a defendant's criminal history, predicts the defendant's future bad behavior.

Defendant characteristics → Learning algorithm → Prediction of bad behavior in the future

What Characteristics to Use?

Only administrative features:
§ Common across regions
§ Easy to get
§ No "gaming the system" problem
Only features that are legal to use:
§ No race, gender, religion

Note: Judge predictions may not obey either of these

Data: Defendant Features

§ Age at first arrest
§ Times sentenced to residential correction
§ Level of charge
§ Number of misdemeanor cases
§ Number of past revocations
§ Current charge domestic violence
§ Is first arrest
§ Prior jail sentence
§ Prior prison sentence
§ Employed at first arrest
§ Currently on supervision
§ Arrest for new offense while on supervision or bond
§ Has active warrant
§ Has active misdemeanor warrant
§ Number of active warrants
§ Has other pending charge
§ Had previous adult conviction
§ Had previous adult misdemeanor conviction
§ Had previous adult felony conviction
§ Had previous Failure to Appear
§ Prior supervision within 10 years
§ Had previous revocation
(only legal features that are easy to get)

Data: Kentucky & Federal

Jurisdiction | Number of cases | Fraction of people released | Crime | Failure to Appear at trial | Non-violent crime | Violent crime
Kentucky | 362k | 73% | 17% | 10% | 4.2% | 2.8%
Federal Pretrial System | 1.1m | 78% | 19% | 12% | 5.4% | 1.9%

Applying Machine Learning

§ Just apply machine learning:
§ Use the labeled dataset (released defendants)
§ Train a model to predict whether a defendant, when out on bail, will misbehave
§ Report accuracy on a held-out test set
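As a rough sketch of this naive "just apply ML" pipeline (not the actual system built for the study), the following assumes a hypothetical CSV of released defendants with made-up column names:

```python
# Minimal sketch of the naive pipeline described above. The file name and
# column names ("committed_crime", feature columns) are illustrative only.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

released = pd.read_csv("released_defendants.csv")   # only labeled (released) cases
X = released.drop(columns=["committed_crime"])       # administrative features
y = released["committed_crime"]                      # 1 = misbehaved while on bail

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Caveat (the point of the next slides): this held-out score is computed only on
# people the judges chose to release, so it can badly overstate real performance.
```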

Prediction Model

[Figure: top four levels of the learned decision tree; leaf values give the probability of NOT committing a crime. Typical tree depth: 20-25.]

But there is a problem with this…

The Problem

§ Issue: The judge sees factors the machine does not
§ The judge makes decisions based on P(Y|X, Z)
§ X … prior crime history
§ Z … other factors not seen by the machine
§ The machine makes decisions based on P(Y|X)
§ Problem: The judge is selectively labeling the data based on P(Y|X, Z)!

Problem: Selective Labels

Judge is selectively labeling the dataset

Judge → Release → Crime? Yes / No (outcome observed)
Judge → Jail → Crime? ? / ? (outcome hidden)
§ We can only train on released people
§ By jailing, the judge is selectively hiding labels!

Selection on Unobservables

Selective labels introduce bias:
§ Say young people with no tattoos have no risk of crime. The judge releases them.
§ But the machine does not observe whether defendants have tattoos.
§ So the machine would falsely conclude that all young people commit no crime.
§ And falsely presume that, by releasing all young people, it does better than the judge!
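To make the tattoo example concrete, here is a small synthetic simulation (all variables and numbers are invented) showing how evaluating only on the judge-released population flatters the machine:

```python
# Toy simulation of "selection on unobservables". Everything here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
young = rng.random(n) < 0.5      # X: observed by both judge and machine
tattoo = rng.random(n) < 0.5     # Z: observed by the judge only

# True crime risk depends on the unobservable Z: young people WITH tattoos are risky.
p_crime = np.where(young & tattoo, 0.6, np.where(young, 0.0, 0.3))
crime = rng.random(n) < p_crime

# Judge uses (X, Z): releases young defendants without tattoos, plus half of the rest.
released = (young & ~tattoo) | (~young & (rng.random(n) < 0.5))

# The machine sees only X. Evaluated on released people, young defendants look safe,
# because the risky young-with-tattoo group was selectively jailed.
print(f"crime rate of released young defendants: {crime[released & young].mean():.3f}")
print(f"true crime rate of all young defendants:  {crime[young].mean():.3f}")
# The first number is near zero; the second is not. A model trained and evaluated
# only on the released population would falsely conclude "release all young people".
```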

Jure Leskovec, Stanford University 18 Challenges: (1) Selective Labels (2) Unobservables Z

Solution: 3 Properties

Our solution depends on three key properties:
§ Multiple judges
§ Natural variability between judges
§ Random assignment of cases to judges
§ One-sidedness of the problem

Solution: Three Observations

1) The problem is one-sided:
§ Releasing a jailed person is a problem
§ Jailing a released person is not

                 Algorithm: Release | Algorithm: Jail
Judge: Release          ✓          |       ✓
Judge: Jail             ✗          |       ✓

Solution: Three Observations

2) In the Federal system, cases are randomly assigned
§ This means that, on average, all judges have similar cases

Solution: Three Observations

3) Natural variability between judges
§ Due to human nature, there will be some variability between judges
§ Some judges are stricter while others are more lenient

Solution: Contraction

Solution: use the most lenient judges!
§ Take the released population of a lenient judge
§ Ask which of those the algorithm would jail to become less lenient
§ Compare the resulting crime rate to a strict human judge

Note: Does not rely on judges having “similar” predictions

Solution: Contraction

Making lenient judges strict:
[Figure: defendants released by a lenient judge, ordered by predicted crime probability; the algorithm makes the lenient judge more strict by "jailing" the riskiest of them, down to the release rate of a strict human judge.]
What makes contraction work:
§ 1) Judges have similar cases (random assignment of cases)
§ 2) Selective labels are one-sided
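A minimal sketch of the contraction computation under assumed data structures (boolean release/crime arrays and model risk scores for the most lenient judge's caseload). This illustrates the idea, not the authors' implementation:

```python
# Contraction sketch: "jail" the riskiest of the lenient judge's released defendants
# until only `target_release_rate` of the caseload remains out, then measure crime.
# `released` and `crime` are hypothetical boolean arrays; `risk` holds model scores.
import numpy as np

def contraction_crime_rate(released, crime, risk, target_release_rate):
    n_cases = len(released)
    n_keep = int(target_release_rate * n_cases)          # how many stay released
    idx_released = np.where(released)[0]
    assert n_keep <= len(idx_released), "target must be below the lenient judge's rate"
    # Keep the n_keep released defendants with the lowest predicted risk.
    keep = idx_released[np.argsort(risk[idx_released])[:n_keep]]
    # Crime rate over the whole caseload (defendants the rule jails count as no crime).
    return crime[keep].sum() / n_cases

# Compare, say, contraction_crime_rate(released, crime, risk, 0.73) against the
# observed crime rate of a strict judge releasing 73% of a comparable caseload.
```

Random assignment makes the caseloads comparable, and one-sidedness means every defendant the rule keeps released was actually released, so their outcome is observed.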

Evaluating the Prediction

Predictions create a new release rule:
§ Parameterized by the fraction released
§ For every defendant, predict P(crime)
§ Sort defendants by increasing P(crime) and "release" the bottom k
§ Note: this is different from AUC
What is the fraction-released vs. crime-rate tradeoff?
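A minimal sketch of this ranking-based release rule (illustrative only; the function and variable names are made up):

```python
# Release rule: rank defendants by predicted P(crime) and "release" the
# lowest-risk fraction. Sweeping the fraction traces the tradeoff curve.
import numpy as np

def release_rule(p_crime, release_fraction):
    """Return a boolean mask: True = release, for the lowest-risk defendants."""
    k = int(release_fraction * len(p_crime))
    order = np.argsort(p_crime)                 # ascending predicted risk
    release = np.zeros(len(p_crime), dtype=bool)
    release[order[:k]] = True
    return release

# Sweeping release_fraction from 0 to 1 and measuring the crime rate among the
# released gives the fraction-released vs. crime-rate curve (not an ROC curve).
```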

Overall Predicted Crime

Holding the release rate constant, the overall crime rate is reduced by 0.102 (a 58.45% decrease).
[Figure: overall crime rate (includes FTA, NCA, NVCA) vs. release rate. Judges: 17% crime rate at a 73% release rate; all features + ML algorithm: 11%. The x-axis marks release rates of 73% and 88%. Standard errors too small to display.]
Decision trees beat logistic regression by 30%.

(Contrast with AUC/ROC.)

Predicted Failure to Appear

Holding the release rate constant (73%), the FTA rate is reduced by 0.06 (a 60% decrease).
[Figure: FTA rate vs. release rate. Judges' decisions: 10%; ML algorithm: 4.1%.]

Predicted Non-Violent Crime

Holding the release rate constant (73%), the non-violent crime (NCA) rate is reduced by 0.025 (a 58% decrease).
[Figure: NCA rate vs. release rate. Judges' decisions: 4.2%; ML algorithm: 1.7%. Standard errors too small to display.]

Predicted Violent Crime

Holding the release rate constant (73%), the violent crime (NVCA) rate is reduced by 0.014 (a 50% decrease).
[Figure: NVCA rate vs. release rate. Judges' decisions: 2.8%; ML algorithm: 1.4%. Standard errors too small to display.]

Effect of Race

ML does not beat the judge by racially discriminating.

Table 7: Racial Fairness

Release rule | Crime rate | Share of jail population: Black | Hispanic | Minority (Black or Hispanic) | Drop relative to judge
Distribution of defendants (base rate) | — | .4877 | .3318 | .8195 | —
Judge | .1134 (.0010) | .573 (.0029) | .3162 (.0027) | .8892 (.0018) | 0%
Algorithm, usual ranking | .0854 (.0008) | .5984 (.0029) | .3023 (.0027) | .9007 (.0017) | −24.68%
Match judge on race | .0855 (.0008) | .573 (.0029) | .3162 (.0027) | .8892 (.0018) | −24.64%
Equal release rates for all races | .0873 (.0008) | .4877 (.0029) | .3318 (.0028) | .8195 (.0023) | −23.02%
Match lower of base rate or judge | .0876 (.0008) | .4877 (.0029) | .3162 (.0027) | .8039 (.0023) | −22.74%

Notes: The table reports the potential gains of the algorithmic release rule relative to the judge, at the judges' release rate, in terms of crime reductions and the share of the jail population that is Black, Hispanic, or either. The first row shows the share of the overall defendant population that is Black or Hispanic. The second row shows the observed judge decisions. The third row shows the usual algorithmic re-ranking release rule, which does not use race in predicting defendant risk and makes no post-prediction adjustments for race. The fourth row adjusts the algorithm's ranking so that the shares of the jail population that are Black and Hispanic are no higher than under current judge decisions. The next row constrains the algorithmic release rule's jail population to have no higher share Black or Hispanic than the general defendant pool, while the final row constrains it to be no higher than under either the judge decisions or the overall defendant pool.

Source of Errors
[Figure: judge release rate vs. predicted risk. Judges err on the riskiest defendants.]

Predicted Risk vs. Jailing
[Figure: predicted risk vs. how likely the judge is to jail, ordered from least likely to most likely to be jailed by the judge.]

Why do Judges do Poorly?

The odds are against the algorithm: the judge sees many things the algorithm does not:
§ Offender history: observable by both
§ Demeanor ("I looked in his eyes and saw his soul"): unobservable by the algorithm
Can we diagnose judges' mistakes and help them make better decisions?

Why do Judges do Poorly?

Two possible reasons why judges are making mistakes:
§ Misuse of observable features
  – These are the administrative features available to the algorithm, e.g. previous crime history
§ Misuse of unobservable features
  – Features not seen by the algorithm: "I looked in his eyes and saw his soul"

A Different Test

Predict the judge's decision
§ The training data does NOT include whether the defendant actually commits a crime
This gives a release rule: release if the predicted judge would release
§ This weights features the way the judge does
What does it do differently, then?
§ It does not use the unobservables!

Predicting the Judge

Build a model to predict the judge's behavior
§ We predict what the judge will do (and NOT whether a defendant will commit a crime)
Mix the judge and the model:
(1 − α) · (judge's actual decision) + α · (model's predicted decision)
Does this beat the judge (for some α)?
§ The model will do better only if the judge misweighs unobservable features
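A sketch of this mixing test under assumed inputs (the judge's actual decisions and a model's predicted probability that the judge would release). It illustrates the idea rather than reproducing the paper's procedure:

```python
# "Artificial judge" sketch: blend the judge's actual decisions with a model of the
# judge (trained WITHOUT crime labels). Inputs are hypothetical numpy arrays.
import numpy as np

def mixed_release_score(judge_release, pred_judge_release, alpha):
    """(1 - alpha) * actual judge decision + alpha * predicted judge decision."""
    judge_release = np.asarray(judge_release, dtype=float)        # 1 = released
    pred_judge_release = np.asarray(pred_judge_release, dtype=float)
    return (1 - alpha) * judge_release + alpha * pred_judge_release

# For each alpha in [0, 1]: turn the blended score into a release rule (release the
# top-scoring defendants at the judge's release rate) and measure crime among the
# released. If some alpha > 0 beats alpha = 0 (the pure judge), the judge must be
# misweighing information the model never sees.
```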

Artificial Judge Beats the Judge

[Figure: performance of the mixture as it moves from the pure judge to the pure ML model of the judge.]
The model of the judge beats the judge by 35%.

Putting It All Together

[Figure: percent change in FTA (0.0 to −0.5) vs. percentage point change in release rate (−0.20 to 0.00) for four release rules: Judges, Lenient Judge + ML, Predicted Judge, and Random.]

Recap so far…

The algorithm beats the human judge on the bail assignment problem.
The artificial judge beats the real judge.
This means the judge is misweighing unobservable features.

Can we model human mistakes?

Modeling Human Decisions

Judge j makes decisions on items (defendants) i, each of which has a true label:

Decision of j | True label
Release | Crime
Release | No crime
Jail | Crime
Jail | No crime

Modeling Human Decisions

For judge j and defendant i, we observe the decision of judge j on defendant i and the true label t_i.
Each judge is described by a 2×2 confusion matrix over (truth, decision) pairs:

                    Decision: "crime" | Decision: "no crime"
Truth: "crime"           p_{c,c}      |       p_{c,n}
Truth: "no crime"        p_{n,c}      |       p_{n,n}

For example, one entry is the probability that j decides "no crime" for a defendant who will actually do "crime".

Problem Setting

Input:
§ Judges and defendants described by features
Output:
§ Discover groups of judges and defendants that share common confusion matrices
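As a highly simplified illustration of this output (the paper uses a Bayesian model; here the cluster assignments are assumed to be given and the shared confusion matrices are just empirical counts):

```python
# Simplified sketch: each (judge-cluster, defendant-cluster) pair shares a 2x2
# confusion matrix P(decision | truth), estimated by counting observed decisions.
# Inputs are hypothetical integer arrays; 0 = "no crime", 1 = "crime".
import numpy as np
from collections import defaultdict

def estimate_confusions(judge_cluster, item_cluster, decision, truth):
    counts = defaultdict(lambda: np.zeros((2, 2)))
    for cj, di, d, t in zip(judge_cluster, item_cluster, decision, truth):
        counts[(cj, di)][t, d] += 1              # rows: truth, columns: decision
    # Normalize each truth row into a conditional distribution over decisions.
    return {k: m / np.clip(m.sum(axis=1, keepdims=True), 1, None)
            for k, m in counts.items()}
```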

Joint Confusion Model

[Graphical model: judge attributes (a1…a5) determine a judge cluster c_j; item attributes (b1…b4) determine an item cluster d_i; the cluster pair, together with the item's true label z_i, generates the decision of judge j on item i through a shared confusion matrix.]

Similar judges share confusion matrices when deciding on similar items.

What do we learn on Bail?

Judge attributes:
§ Year, county, number of previous cases, number of previous felony cases, number of previous misdemeanor cases, number of previous minor cases
Defendant attributes:
§ Previous crime history, personal attributes (children/married etc.), social status (education / home ownership / moved a lot in the past 10 years), current offense details

Why judges make mistakes

Judges with a high number of previous cases:  [0.9 0.1; 0.1 0.9]
Judges with a low number of previous cases:   [0.6 0.4; 0.4 0.6]
(confusion matrices; rows: truth, columns: decision)

Less experienced judges make more mistakes.

Why judges make mistakes

Judges with many felony cases:  [0.8 0.2; 0.5 0.5]
Judges with few felony cases:   [0.8 0.2; 0.3 0.7]
(confusion matrices; rows: truth, columns: decision)

Judges with many felony cases are stricter.

Why judges make mistakes

Defendants who are single, did felonies, and moved a lot:  [0.9 0.1; 0.1 0.9] (accurately judged)
Defendants who have kids:                                   [0.5 0.5; 0.5 0.5] (confusing to judges)
(confusion matrices; rows: truth, columns: decision)

Conclusion

A new tool to understand human decision making.
The framework can be applied to various other questions:
§ Decisions about patient treatments
§ Assigning students to intervention programs
§ Hiring and promotion decisions

Power of Algorithms

Algorithms can help us understand if human judges are biased

Algorithms can be used to diagnose the reasons for bias

Key insight § More than prediction tools § Serve as behavioral diagnostic tools

Focusing on Decisions

Not just about prediction; the key is starting with the decision:
§ Performance benchmark: current "human" decisions
§ Not ROC but "human ROC"
§ What are we really optimizing?
Focusing on the decision also raises new concerns.

Reflections: New Concerns

1) Selective labels
§ We don't see labels of people that are jailed
§ Extremely common problem: Prediction → Decision → Action
§ Which outcomes we see depends on our decisions

Reflections: New Concerns

2) Data responds to prediction
§ Whenever we make a prediction/decision, we affect the labels/data we see in the future
§ This creates a self-reinforcing feedback loop

Reflections: New Concerns

3) Constraints on input features
§ For example, race is an illegal feature, so our models don't use it
§ But is that enough? Many other characteristics correlate with race
§ How do we deal with this additional constraint?

Reflections: New Concerns

4) Omitted payoff bias
§ Is minimizing the crime rate really the right goal?
§ There are other important factors:
  – Consequences of jailing on the family
  – Jobs and the workplace
  – Future behavior of the defendant
§ How could we model these?

Conclusion

Bail is not the only prediction problem

Many important problems in public policy are predictive problems!

Potential for large impact

References

§ The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, S. Mullainathan. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2017.
§ Human Decisions and Machine Predictions. J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, S. Mullainathan. NBER Working Paper, 2017.
§ A Bayesian Framework for Modeling Human Evaluations. H. Lakkaraju, J. Leskovec, J. Kleinberg, S. Mullainathan. SIAM International Conference on Data Mining (SDM), 2015.
§ Confusions over Time: An Interpretable Bayesian Model to Characterize Trends in Decision Making. H. Lakkaraju, J. Leskovec. Neural Information Processing Systems (NIPS), 2016.
