<<

Gender-based homophily in collaborations across a heterogeneous scholarly landscape

Carole Lee 2019 Symposium Department of Stanford University University of Washington September 5-8 Joint work with:

Y. Samuel Wang Elena Erosheva Jevin West Carl Bergstrom

Business School ; Information Biology Center for Social School U Chicago and U Washington Statistics; Social U Washington Work

U Washington Where do ideas come from? Where do ideas come from?

Ideas come from people. . . Where do ideas come from?

Different people have different forms of expertise. Where do ideas come from?

We can create new ideas by combining folks with different forms of expertise into collaborative teams. Where do ideas come from?

Collaborative teams are more likely than solo authors to create novel combinations of old ideas.

Combinations that are atypical and span a longer interdisciplinary distance have higher impact.

Uzzi, Brian, Satyam Mukherjee, Michael Stringer, & Ben Jones. "Atypical combinations and scientific impact." 342 (2013): 468-472. Larivière, Vincent, Stefanie Haustein, & Katy Börner. "Long-distance interdisciplinarity leads to higher scientific impact." Plos one 10 (2015): e0122565. Larivière, Vincent, Yves Gingras, Cassidy R. Sugimoto, and Andrew Tsou. "Team size matters: Collaboration and scientific impact since 1900." JASIST 66 (2015): 1323-1332. How do teams self-assemble?

• Intellectual factors: expertise

• Instrumental factors: resources

• Social factors: respect, friendship, curiosity-seeking, fun

Katz, J. Sylvan, and Ben R. Martin. "What is collaboration?." Research policy 26 (1997): 1-18.

Melin, Göran. "Pragmatism and self-organization: Research collaboration on the individual level." Research policy 29 (2000): 31-40.

Maglaughlin, Kelly L., and Diane H. Sonnenwald. "Factors that impact interdisciplinary scientific research collaboration: Focus on the natural sciences in academia." (2005).

Hara, Noriko, Paul Solomon, Seung-Lye Kim, and Diane H. Sonnenwald. "An emerging view of scientific collaboration: Scientists' perspectives on collaboration and factors that impact collaboration." Journal of the American Society for Information science and 54 (2003): 952-965. Collaboration involves social discretion

Does decreasing social distance increase collaboration? Collaborative teams and social distance

• Homophily: Social similarly breeds connection.

• Gender homophily divides work environments, voluntary associations, and friendships.

McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. "Birds of a feather: Homophily in social networks." Annual review of 27 (2001): 415-444.

Kalleberg, Arne L., David Knoke, Peter V. Marsden, and Joe L. Spaeth, eds. Organizations in America. Sage, 1996.

McPherson, J. Miller, and Lynn Smith-Lovin. "Sex segregation in voluntary associations." American Sociological Review (1986): 61-79.

Ibarra, Herminia. "Homophily and differential returns: Sex differences in network structure and access in an advertising firm." Administrative science quarterly (1992): 422-447.

Ibarra, Herminia. "Paving an alternative route: Gender differences in managerial networks." Social quarterly (1997): 91-102. Question

• Is there gender-based homophily in collaborations across the heterogeneous scholarly landscape at varying levels of granularity?

• Women are less likely to co-author and serve as first or last author

• Given professional advantages of collaboration, it’s important to understand how and under what conditions women collaborate

Larivière, Vincent, Chaoqun Ni, Yves Gingras, Blaise Cronin, and Cassidy R. Sugimoto. ": Global gender disparities in science." News 504, no. 7479 (2013): 211.

West, Jevin D., Jennifer Jacquet, Molly M. King, Shelley J. Correll, and Carl T. Bergstrom. "The role of gender in scholarly authorship." PloS one 8, no. 7 (2013): e66212. Data

• JSTOR, a repository of papers across the humanities, social sciences, and natural sciences

• We consider cited papers from 1960 onward

• 252,413 multi-author papers with 807,588 authorships (i.e., instances of co-authoring) Hierarchical clustering

• Apply hierarchical implementation of the InfoMap network algorithm to the citation network on the JSTOR corpus

• This reveals the hierarchical structure of corpus through the efficient coding of random walks on the citation network

Rosvall, Martin, and Carl T. Bergstrom. "Maps of random walks on complex networks reveal community structure." Proceedings of the National Academy of Sciences 105, no. 4 (2008): 1118-1123.

Rosvall, Martin, and Carl T. Bergstrom. "Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems." PloS one 6, no. 4 (2011): e18209. Hierarchical clustering

At the lowest level of clustering, each paper is grouped

Status Exchange into one of 1,450 terminal fields which form the finest characteristics networks partition of the data. Hierarchical clustering

Each higher clustering forms progressively coarser Group partitions of the documents by aggregating terminal fields interactions into the 280 composite fields.

Status Exchange characteristics networks Hierarchical clustering

Sociology of communication

Group Role performance Conversation Motive interactions

Status Exchange characteristics networks Hierarchical clustering

At the top-level, we have 24 major fields.

Sociology

Sociology of Status, Sociology of Delinquency, Religion Etc . . . family education, wages communication deviance

Group Role performance Conversation Motive interactions

Status Exchange characteristics networks Hierarchical clustering

Sociology

Sociology of Status, Sociology of Delinquency, Religion Etc . . . family education, wages communication deviance

Group Role performance Conversation Motive interactions

Status Exchange Hierarchical structure has up to 6 levels. characteristics networks Hierarchical clustering

At any level, papers in a common field are more connected via citations than to papers from neighboring fields.

Fields at finer levels of the hierarchy are more connected than fields at coarser levels.

Sociology

Sociology of Status, Sociology of Delinquency, Religion Etc . . . family education, wages communication deviance

Group Role performance Conversation Motive interactions

Status Exchange characteristics networks Determining gender

• Gender inferred from first names (>95% )

• 75.3% gender determined via Social Security records

• 12.6% determined via genderizeR (user profiles from social networks)

• 12.1% gender unknown and omitted from our analysis:

• 7.6% appear in neither data base

• 4.5% are used for males and females with > 5% certainty

• Rate of missingness compares favorably to previous studies

West, Jevin D., Jennifer Jacquet, Molly M. King, Shelley J. Correll, and Carl T. Bergstrom. "The role of gender in scholarly authorship." PloS one 8, no. 7 (2013): e66212.

Larivière, Vincent, Chaoqun Ni, Yves Gingras, Blaise Cronin, and Cassidy R. Sugimoto. "Bibliometrics: Global gender disparities in science." Nature News 504, no. 7479 (2013): 211. 621 683 622 684 623 685 624 686 625 687 626 688 627 689 628 Demography 690

629 7 691 1.0 1.0 females 630 unestimated 692

631 6 693

632 0.8 0.8 694

633 5 695

634 0.6 0.6 696

635 4 697 author Papers

636 − 698 0.4 0.4

637 3 699

638 Num of Auth Avg 700 % of author gender 0.2 0.2 639 Gender2 and authorship practices across fields 701 640 % of Multi 702 1 641 0.0 0.0 703 642 1960 1980 2000 1960 1980 2000 1960 1980 2000 704 643 705 644 Ecology and evolution 706

645 7 707 1.0 1.0 females 646 unestimated 708

647 6 709 648 0.8 0.8 710

649 5 711

650 0.6 0.6 712

651 4 713 author Papers

652 − 714 0.4 0.4

653 3 715

654 Num of Auth Avg 716 % of author gender 0.2 0.2 655 2 717 656 % of Multi 718 1 657 0.0 0.0 719 658 1960 1980 2000 1960 1980 2000 1960 1980 2000 720 659 721 660 722

661 7 723 1.0 1.0 females 662 unestimated 724

663 6 725 664 0.8 0.8 726

665 5 727

666 0.6 0.6 728 4

667 author Papers 729 −

668 0.4 0.4 730 669 3 731

670 Num of Auth Avg 732 % of author gender 0.2 0.2 671 2 733 672 % of Multi 734 1 673 0.0 0.0 735 674 1960 1980 2000 1960 1980 2000 1960 1980 2000 736 675 737 676 738 677 739 678 740 679 741 680 742 681 743 682 744

6 of 22 Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 1365 1427 1366 1428 1367 1429 1368 1430 1369 1431 1370 1432 1371 1433 1372 Radiation damage 1434

1373 7 1435 1.0 1.0 females 1374 unestimated 1436

1375 6 1437

1376 0.8 0.8 1438

1377 5 1439

1378 0.6 0.6 1440

1379 4 1441 author Papers

1380 − 1442 0.4 0.4

1381 3 1443

1382 Num of Auth Avg 1444 % of author gender 0.2 0.2 1383 Gender2 and authorship practices across fields 1445 1384 % of Multi 1446 1 1385 0.0 0.0 1447 1386 1960 1980 2000 1960 1980 2000 1960 1980 2000 1448 1387 1449 1388 Sociology 1450

1389 7 1451 1.0 1.0 females 1390 unestimated 1452

1391 6 1453 1392 0.8 0.8 1454

1393 5 1455

1394 0.6 0.6 1456

1395 4 1457 author Papers

1396 − 1458 0.4 0.4

1397 3 1459

1398 Num of Auth Avg 1460 % of author gender 0.2 0.2 1399 2 1461 1400 % of Multi 1462 1 1401 0.0 0.0 1463 1402 1960 1980 2000 1960 1980 2000 1960 1980 2000 1464 1403 1465 1404 Veterinary 1466

1405 7 1467 1.0 1.0 females 1406 unestimated 1468

1407 6 1469 1408 0.8 0.8 1470

1409 5 1471

1410 0.6 0.6 1472 4

1411 author Papers 1473 −

1412 0.4 0.4 1474 1413 3 1475

1414 Num of Auth Avg 1476 % of author gender 0.2 0.2 1415 2 1477 1416 % of Multi 1478 1 1417 0.0 0.0 1479 1418 1960 1980 2000 1960 1980 2000 1960 1980 2000 1480 1419 1481 1420 1482 1421 1483 1422 1484 1423 1485 1424 1486 1425 1487 1426 1488

12 of 22 Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 993 1055 994 1056 995 1057 996 1058 997 1059 998 1060 999 1061 1000 Operations research 1062

1001 7 1063 1.0 1.0 females 1002 unestimated 1064

1003 6 1065

1004 0.8 0.8 1066

1005 5 1067

1006 0.6 0.6 1068

1007 4 1069 author Papers

1008 − 1070 0.4 0.4

1009 3 1071

1010 Num of Auth Avg 1072 % of author gender 0.2 0.2 1011 2 1073 1012 % of Multi 1074 1 1013 0.0 0.0 1075 1014 1960 1980 2000 1960 1980 2000 1960 1980 2000 1076 1015 1077 1016 Organizational and marketing 1078

1017 7 1079 1.0 1.0 females 1018 unestimated 1080

1019 6 1081 1020 0.8 0.8 1082

1021 5 1083

1022 0.6 0.6 1084

1023 4 1085 author Papers

1024 − 1086 0.4 0.4

1025 3 1087

1026 Num of Auth Avg 1088 % of author gender 0.2 0.2 1027 Gender2 and authorship practices across fields 1089 1028 % of Multi 1090 1 1029 0.0 0.0 1091 1030 1960 1980 2000 1960 1980 2000 1960 1980 2000 1092 1031 1093 1032 Philosophy 1094

1033 7 1095 1.0 1.0 females 1034 unestimated 1096

1035 6 1097 1036 0.8 0.8 1098

1037 5 1099

1038 0.6 0.6 1100 4

1039 author Papers 1101 −

1040 0.4 0.4 1102 1041 3 1103

1042 Num of Auth Avg 1104 % of author gender 0.2 0.2 1043 2 1105 1044 % of Multi 1106 1 1045 0.0 0.0 1107 1046 1960 1980 2000 1960 1980 2000 1960 1980 2000 1108 1047 1109 1048 1110 1049 1111 1050 1112 1051 1113 1052 1114 1053 1115 1054 1116

Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 9 of 22 Question

• Do males co-author with males (and females with females) more often than we would otherwise expect?

• We need to be careful about how we measure:

1. How often do males co-author w/ males (and females w/ females)?

2. Counterfactual: Is this more often than we would otherwise expect?

3. If yes, then we have behavioral homophily. 1. Tendency for same-gender authorships to co-author

• We measure homophily by computing the difference in risks:

• α = P(random co-author of random male is male) - P (random co-author of a random female is male) 1. Tendency for same-gender authorships to co-author

• We measure homophily by computing the difference in risks:

• α = P(random co-author of random male is male) - P (random co-author of a random female is male)

• This measure is equivalent to:

• Pearson correlation of gender indicators for random co-authorship pairs

• Wright’s F Coefficient of Inbreeding when all papers have two authors

• Newman’s network-based assortativity coefficient in an appropriately weighted network

Bergstrom, Theodore C. "The algebra of assortative encounters and the evolution of cooperation." International Game Theory Review 5, no. 03 (2003): 211-228.

Wright, Sewall. "The genetical structure of populations." Annals of eugenics 15, no. 1 (1949): 323-354.

Bergstrom, T., M. Bergstrom, M. King, J. Jacquet, J. West, and S. Correll. "A note on measuring gender homophily among scholarly authors." (2016).

Wang, Y. Samuel, and Elena A. Erosheva. "On the relationship between set-based and network-based measures of gender homophily in scholarly publications." arXiv preprint arXiv:1610.09026 (2016).

Newman, Mark EJ. "Mixing patterns in networks." Physical Review E 67, no. 2 (2003): 026126. 1. Tendency for same-gender authorships to co-author

Sociology

Sociology of Status, Sociology of Delinquency, Religion Etc . . . family education, wages communication deviance

Group Role performance Conversation interactions Motive

Status Exchange characteristics networks 1. Tendency for same-gender authorships to co-author

Sociology 1. Tendency for same-gender authorships to co-author

Sociology 1. Tendency for same-gender authorships to co-author

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8 1. Tendency for same-gender authorships to co-author

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

female male 1. Tendency for same-gender authorships to co-author

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

P(co-author = male | author = male) = 3/4

P(co-author = male | author = female) = 3/10

α = 0.45 Question

1. How often do males co-author w/ males (and females w/ females)?

2. Counterfactual: Is this more often than we would otherwise expect?

1. To do this, we must account for:

1. Structural Homophily

2. Compositional Homophily 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

P(co-author = male | author = male) = 3/4

P(co-author = male | author = female) = 3/10

α = 0.45 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Fix gender ratio, number of authorships, number of authorships per paper. 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Randomly reassign authorships and measure counterfactual alpha value. 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Randomly reassign authorships and measure counterfactual alpha value. 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Randomly reassign authorships and measure counterfactual alpha value. Measuring Homophily

Doc 1 Doc 2 Doc 3 Doc 4 2. Is this more often than we would otherwise expect?

Structural homophily = Doc 5 Doc 6 Doc 7 Doc 8 Deviation of alpha from zero due to structural aspects (gender ratio, number of authorships, number of authorships per paper)

Null Distribution of alpha P−value: 0.035

−1.0 −0.5 0.0 0.5 1.0

alpha

Y. Samuel Wang Homophily in Co-authorships 08/11/2019 9/17 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Null Distribution of alpha P−value: 0.035

−1.0 −0.5 0.0 0.5 1.0 2. Is this more often than we would otherwise expect?

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

p value = proportion of α values generated from the null distribution which are greater Null Distribution of alpha than or equal to the observed value of α. P−value: 0.035

Small p-value implies that the observed α is unlikely to occur.

−1.0 −0.5 0.0 0.5 1.0 2. Is this more often than we would otherwise expect?

Sociology

Sociology of Status, Sociology of Delinquency, Religion Etc . . . family education, wages communication deviance

Group Role performance Conversation interactions Motive

Status Exchange characteristics networks 2. Is this more often than we would otherwise expect?

Sociology 2. Is this more often than we would otherwise expect? Sociology

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8 2. Is this more often than we would otherwise expect? Sociology

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Each terminal field has different gender ratios, numbers of authorships, and numbers of authorships per paper. 2. Is this more often than we would otherwise expect? Sociology

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Fix gender ratio, number of authorships, number of authorships per paper within each terminal field. 2. Is this more often than we would otherwise expect? Sociology

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Randomly reassign authorships within terminal fields. 2. Is this more often than we would otherwise expect? Sociology

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Randomly reassign authorships within terminal fields. Compositional Homophily

Molecular Biology Doc 1 Doc 2 Doc 3 Doc 4

2. Is this more often than we would otherwise expect?

Compositional homophily = Sociology Deviation ofDoc alpha 5 from theDoc 6expectedDoc alpha 7 underDoc structural 8 homophily that arises due to the tendency for authorships to co- author with those who are intellectually closer.

Null Distribution of alpha P−value: 1

−1.0 −0.5 0.0 0.5 1.0

alphaComp

Y. Samuel Wang Homophily in Co-authorships 08/11/2019 11 / 18 Field

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Null Distribution of alpha P−value: 1

−1.0 −0.5 0.0 0.5 1.0 Field

Paper 1 Paper 2 Paper 3 Paper 4

Paper 5 Paper 6 Paper 7 Paper 8

Null Distribution of alpha P−value: 1

Reversal due to gender imbalances and other structural differences across sub- populations:

• Simpson’s paradox (statistics) • Wahlund effect (population genetics)

−1.0 −0.5 0.0 0.5 1.0 Testing procedure

• Generate counterfactuals:

• Assume authorships in the same terminal field are exchangeable • Assume authorships in different terminal fields are less likely to swap and use citation flows to determine swap probabilities

• Use a Markov Chain Monte Carlo Metropolis Hastings sampler to generate counterfactuals

• 75,000 draws form null distribution after burn-in

• Compare counterfactual alpha values to observed alpha: • Hypothesis test for all levels (terminal fields, composite fields, major fields)

• Use p-values adjusted by the Benjamini-Yuketieli procedure to control False Discovery Rate at 0.05 Results:Top Behavioral LevelTop Fields Levelhomophily Fields Observed Avg counterfactual Obs / Exp Obs / Exp Significant / TotalSignificant / Total Field Field P-value P-value ↵ FM↵ FM Term CompTerm Comp JSTOR JSTOR.11/.05 38.6/41.1.11/.05 38.6/41.1.00 124/1450.00 82/280124/1450 82/280 Mol/Cell Bio Mol/Cell.05/.01 Bio 38.2/39.8.05/.01 38.2/39.8.00 29/178.00 19/4429/178 19/44 • 84% of top fields Eco/Evol Eco/Evol.06/.02 31.9/33.3.06/.02 31.9/33.3.00 17/257.00 15/5617/257 15/56 Economics Economics.11/.02 18.7/20.8.11/.02 18.7/20.8.00 9/136.00 11/289/136 11/28 Sociology Sociology.19/.07 38.5/44.2.19/.07 38.5/44.2.00 13/94.00 12/2113/94 12/21 Prob/Stat Prob/Stat.09/.03 26/27.8.09/.03 26/27.8.00 1/90.00 2/231/90 2/23 Org/mkt Org/mkt.16/.04 29/33.2.16/.04 29/33.2.00 8/68.00 3/48/68 3/4 Education Education.16/.04 41.2/47.2.16/.04 41.2/47.2.00 12/42.00 6/1012/42 6/10 Occ Health Occ.10/.02 Health 41.7/45.5.10/.02 41.7/45.5.00 12/24.00 1/112/24 1/1 Anthro Anthro.12/.03 38.5/42.12/.03 38.5/42.00 5/63.00 2/85/63 2/8 Law Law.17/.08 29.7/32.9.17/.08 29.7/32.9.00 0/98.00 1/160/98 1/16 History.16/.07 32.9/36.5.16/.07 32.9/36.5.00 0/49.00 1/60/49 1/6 Phys Anthro Phys.07/.01 Anthro 34.7/36.8.07/.01 34.7/36.8.00 1/32.00 2/101/32 2/10 Intl Poli Sci Intl.09/.02 Poli Sci 27.3/29.1.09/.02 27.3/29.1.03 0/34.03 0/20/34 0/2 US Poli Sci US.15/.07 Poli Sci 25.2/27.4.15/.07 25.2/27.4.00 2/37.00 1/62/37 1/6 Philosophy Philosophy.10/.03 18.7/20.3.10/.03 18.7/20.3.03 0/45.03 0/80/45 0/8 Math Math.04/.01 14.3/14.7.04/.01 14.3/14.71.00 0/461.00 0/90/46 0/9 Vet Med Vet.09/.01 Med 38.3/41.6.09/.01 38.3/41.6.00 7/19.00 1/27/19 1/2 Cog Sci Cog.18/.09 Sci 35.7/39.4.18/.09 35.7/39.4.00 4/14.00 3/34/14 3/3 Radiation Radiation.09/.01 34/36.8.09/.01 34/36.8.00 3/14.00 1/53/14 1/5 Demography Demography.15/.06 40.3/44.4.15/.06 40.3/44.4.00 0/20.00 1/20/20 1/2 Classics Classics.07/.01 38.7/41.2.07/.01 38.7/41.2.27 0/35.27 0/80/35 0/8 Opr Res Opr.03/.00 Res 16.6/17.1.03/.00 16.6/17.1.73 0/18.73 0/40/18 0/4 Plant Phys Plant.08/.02 Phys 29.1/31.08/.02 29.1/31.03 1/21.03 0/31/21 0/3 Mycology Mycology.03/.01 36.8/37.5.03/.01 36.8/37.51.00 0/161.00 0/10/16 0/1

Y. Samuel WangY. Samuel Wang Homophily in Co-authorshipsHomophily in Co-authorships 08/11/2019 08/11/201916 / 18 16 / 18 TopResults: Level Fields BehavioralTop Levelhomophily Fields

Obs / Exp SignificantObs / Exp / Total Significant / Total Field Field P-value P-value ↵ FM Term↵ CompFM Term Comp JSTOR .11/.05 38.6/41.1JSTOR .00 .11/.05124/1450 38.6/41.182/280 .00 124/1450 82/280 Mol/Cell Bio .05/.01 38.2/39.8Mol/Cell.00 Bio .05/.0129/178 38.2/39.819/44 .00 29/178 19/44 Eco/Evol .06/.02 31.9/33.3Eco/Evol.00 .06/.0217/257 31.9/33.315/56 .00 17/257 15/56 • 29% ofEconomics composite.11/.02 fields18.7/20.8 Economics.00 .11/.029/136 18.7/20.811/28 .00 9/136 11/28 Sociology .19/.07 38.5/44.2Sociology.00 .19/.0713/94 38.5/44.212/21 .00 13/94 12/21 Prob/Stat .09/.03 26/27.8Prob/Stat.00 .09/.031/90 26/27.82/23 .00 1/90 2/23 8% of terminalOrg/mkt fields.16/.04 29/33.2Org/mkt.00 .16/.048/68 29/33.23/4 .00 8/68 3/4 • Education .16/.04 41.2/47.2Education.00 .16/.0412/42 41.2/47.26/10 .00 12/42 6/10 Occ Health .10/.02 41.7/45.5Occ Health.00 .10/.0212/24 41.7/45.51/1 .00 12/24 1/1 Anthro .12/.03 38.5/42Anthro .00 .12/.035/63 38.5/422/8 .00 5/63 2/8 Law .17/.08 29.7/32.9Law .00 .17/.080/98 29.7/32.91/16 .00 0/98 1/16 History .16/.07 32.9/36.5History .00 .16/.070/49 32.9/36.51/6 .00 0/49 1/6 • Trade-off Physbetween Anthro increasing.07/.01 34.7/36.8Phys Anthro.00 .07/.011/32 34.7/36.82/10 .00 1/32 2/10 testing powerIntl Poli by Sci aggregating.09/.02 27.3/29.1 Intl Poli.03 Sci .09/.020/34 27.3/29.10/2 .03 0/34 0/2 US Poli Sci .15/.07 25.2/27.4US Poli.00 Sci .15/.072/37 25.2/27.41/6 .00 2/37 1/6 data versusPhilosophy controlling.10/.03 for 18.7/20.3Philosophy.03 .10/.030/45 18.7/20.30/8 .03 0/45 0/8 confoundersMath by analyzing.04/.01 at a14.3/14.7 Math 1.00 .04/.010/46 14.3/14.70/9 1.00 0/46 0/9 Vet Med .09/.01 38.3/41.6Vet Med.00 .09/.017/19 38.3/41.61/2 .00 7/19 1/2 fine-grainCog level. Sci .18/.09 35.7/39.4Cog Sci.00 .18/.094/14 35.7/39.43/3 .00 4/14 3/3 Radiation .09/.01 34/36.8Radiation.00 .09/.013/14 34/36.81/5 .00 3/14 1/5 Demography .15/.06 40.3/44.4Demography.00 .15/.060/20 40.3/44.41/2 .00 0/20 1/2 Classics .07/.01 38.7/41.2Classics.27 .07/.010/35 38.7/41.20/8 .27 0/35 0/8 Opr Res .03/.00 16.6/17.1Opr Res.73 .03/.000/18 16.6/17.10/4 .73 0/18 0/4 Plant Phys .08/.02 29.1/31Plant Phys.03 .08/.021/21 29.1/310/3 .03 1/21 0/3 Mycology .03/.01 36.8/37.5Mycology1.00 .03/.010/16 36.8/37.50/1 1.00 0/16 0/1

Y. Samuel Wang Y.Homophily Samuel Wang in Co-authorships Homophily in Co-authorships08/11/2019 16 / 18 08/11/2019 16 / 18 Visualization

http://eigenfactor.org/projects/gender_homophily/ How sensitive are our results to missing gender indicators?

• To evaluate, we impute gender for authorships under two scenarios:

• Low homophily: Impute each missing gender indicator at random according to the proportions of assigned genders in the authorship’s original terminal field. Assumes no behavioral homophily in the imputed data.

• High homophily: Impute each missing gender indicator at random according to the proportions of assigned gender in the original paper. If a paper contains only unassigned authorships, impute a single gender for all according to the proportions of assigned gender authorships for the terminal field. Assumes behavioral homophily since, by construction, papers with one or no assigned authorships are always gender homophilous. 2481 E. Sensitivity Analysis: Missing Gender Indicators. To evaluate how sensitive our main results are to the missing gender 2543 2482 indicators, we impute gender for authorships with missing gender indicators under two scenarios: 2544 2483 2545 • Low homophily: Each authorship with a missing gender indicator is assigned a gender at random according to the 2484 2546 proportions of observed genders on its original terminal field. This procedure assumes that there is no behavioral 2485 2547 homophily in the imputed data because the imputed genders are conditionally independent given the terminal field. Thus, 2486 2548 it gives a reasonable lower bound on the homophily we might have observed given the full data. 2487 2549 2488 • High homophily: Each authorship with a missing gender indicator is assigned a gender at random according to the 2550 2489 proportions of observed genders on its original paper. If the original paper contains only authorships with missing gender 2551 2490 indicators, we assign all authorships on the paper the same gender indicator which is drawn randomly according to the 2552 2491 proportions of observed genders for its original terminal field. Because papers with at most one assigned gender indicator 2553 2492 are homophilous by construction, this provides a reasonable upper bound on the homophily we might have observed 2554 2493 given the full data. 2555 How sensitive2494 are our results to missing gender indicators? 2556 2495 For each scenario, we carry out 10 imputations and then repeat the entire sampling and testing procedures used for the 2557 2496 main analysis. Table S8 gives the resulting percentages of terminal, composite, and top level fields with significant behavioral 2558 2497 homophily under the Benjamini-Yekutieli FDR procedure with – = .05 under the low and high homophily missing data 2559 2498 imputation scenarios. We observed that, on average, 7%, 25%, and 78% of terminal, composite, and top level fields exhibit 2560 2499 statistically significant respectively in the low homophily scenario; for the high homophily procedure the corresponding averages 2561 • For each scenario,2500 are 54%, carry 82%, and out 100%. 10 imputations and then repeat the 2562 2501 2563 entire sampling2502 Table and S8. Each testing column shows procedures the percentage of fieldsused which for exhibit the statistically main significant analysis. homophily for each of the individual imputations 2564 2503 2565 Terminal Composite Top 2504 2566 Main Analysis 0.09 0.29 0.83 2505 2567 • Original results: Low Imputation 1 0.06 0.23 0.83 2506 Low Imputation 2 0.07 0.25 0.75 2568 2507 Low Imputation 3 0.06 0.25 0.75 2569 2508 Low Imputation 4 0.08 0.25 0.75 2570 Top fields:2509 84% Low Imputation 5 0.06 0.26 0.75 2571 • 2510 Low Imputation 6 0.07 0.28 0.75 2572 2511 Low Imputation 7 0.07 0.25 0.83 2573 2512 Low Imputation 8 0.07 0.26 0.79 2574 • Composite2513 fields: 28% Low Imputation 9 0.07 0.25 0.75 2575 2514 Low Imputation 10 0.06 0.26 0.79 2576 2515 Low Imputation Avg 0.07 0.25 0.78 2577 High Imputation 1 0.54 0.82 1.00 2516 2578 Terminal fields: 8% High Imputation 2 0.53 0.82 1.00 • 2517 High Imputation 3 0.53 0.82 1.00 2579 2518 High Imputation 4 0.53 0.82 1.00 2580 2519 High Imputation 5 0.54 0.82 1.00 2581 2520 High Imputation 6 0.53 0.81 1.00 2582 2521 High Imputation 7 0.54 0.82 1.00 2583 2522 High Imputation 8 0.54 0.83 1.00 2584 2523 High Imputation 9 0.54 0.83 1.00 2585 2524 High Imputation 10 0.54 0.82 1.00 2586 2525 High Imputation Avg 0.54 0.82 1.00 2587 2526 2588 2527 2589 2528 2590 2529 2591 2530 2592 2531 2593 2532 2594 2533 2595 2534 2596 2535 2597 2536 2598 2537 2599 2538 2600 2539 2601 2540 2602 2541 2603 2542 2604

Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 21 of 22 Secondary analysis

• Fit logistic regression for all terminal fields where the outcome is whether there is statistically significant behavioral homophily

• Result: behavioral homophily has a statistically significant positive association with the proportion of females and terminal field size

• In economics, behavioral homophily was also found to be more prevalent in sub-fields with a higher proportion of females

• Homophily: as representation of women increases, more likely that same-gender individuals who are sufficiently compatible along key dimensions become available as co-authors

• Caveat: Larger field size and balanced gender representation increase power of testing procedure (i.e., ability to detect behavioral gender homophily).

Boschini, Anne, and Anna Sjögren. "Is team formation gender neutral? Evidence from coauthorship patterns." Journal of Labor Economics 25, no. 2 (2007): 325-365. Conclusion

• We detect behavioral gender homophily in 84% of top fields, 29% of composite fields, and 8% of terminal fields

• Since behavioral gender homophily is endemic to even some of the smallest intellectual communities, it might only be mitigated by changing the cultural norms and perceptions that drive behavioral gender homophily within intellectual communities Future work: Strategic value of gender homophily?

• Short-term: does gender homophily increase retention, productivity, and impact of female authors?

: presence of other women in male- stereotyped domains enhances confidence, performance, and motivation

• Long-term: do gender homophilous co-authorships lead to gender-homophilous intellectual communities? If so, does increasing the ratio of women in an intellectual community decrease its value/impact, just as increasing the ratio of women in an occupation decreases its prestige?

Murphy, Mary C., Claude M. Steele, and James J. Gross. "Signaling threat: How situational cues affect women in math, science, and engineering settings." Psychological science 18, no. 10 (2007): 879-885.

Stout, Jane G., Nilanjana Dasgupta, Matthew Hunsinger, and Melissa A. McManus. "STEMing the tide: using ingroup experts to inoculate women's self-concept in science, technology, engineering, and mathematics (STEM)." Journal of personality and 100, no. 2 (2011): 255.

Goldin, Claudia. "A pollution theory of discrimination: male and female differences in occupations and earnings." In Human capital in history: The American record, pp. 313-348. University of Chicago Press, 2014. Future work: Fine-tuning methods

• Temporal aspects: Gender representation changes over time. Incorporate temporal info into the null distribution.

• Disambiguating authors: In terminal fields with few female authors, we may overestimate structural and compositional homophily (and underestimate behavioral homophily) by allowing multiple female authorships corresponding to the same author to be reassigned to the same paper. Thank you

• For early discussions:

• Jennifer Jacquet, Molly King, Shelley Correll, & Ted Bergstrom

• Funders:

• Royalty Research Fund Grant #A118374, Elena Erosheva (PI) and Carole Lee (co-PI)

• NSF Grant #1735194, Jevin West (co-PI) For more information

• Link to paper: http://arxiv.org/abs/1909.01284

• Visualization: http://eigenfactor.org/projects/gender_homophily/

• Code for analysis and plots: https://github.com/ysamwang/genderHomophily

• Data: Under license by JSTOR to the authors. Requests for raw data should be made to JSTOR directly. 249 2. JSTOR Description 311 250 312 Table S1 shows the size of each of the 24 top level fields identified by the map equation. The values are calculated for all papers 251 313 published after 1959. Note that the table describes the data prior to the data cleaning procedure, so counts of authorships, 252 314 papers, terminal fields and composite fields shown here may dier from those given in the main manuscript which refer to data 253 315 after the cleaning procedure. Specifically, Classics, Law, and Philosophy have entire terminal fields which are removed by the 254 316 cleaning procedure. Table S2 presents the structural characteristics of each top level field. For the multi-author columns, we 255 317 report the proportion amongst all authorships on a multi-author paper; e.g., all female authorships on multi-authored papers 256 318 divided by the total count of authorships on multi-authored papers. For the intraclass correlation (ICC) of individuals with 257 319 unimputed genders, we use the flAOV statistic from (1). This gives a measure of how unimputed authorships cluster by paper. 258 320 Anecdotally, unimputed authorships are often names which have been Romanized. Thus a high ICC may indicate homophily 259 321 by race or ethnicity. 260 322 261 Table S1. Size ofSize each of the of top leveleach fields identified top by level the map equation field hierarchical clustering 323 262 324 263 Authors Papers Terminal Composite 325 264 Label (Count) (Count) Fields Fields 326 265 Anthropology 37588 30499 63 8 327 266 Classical studies 10596 9061 37 8 328 267 Cognitive science 15715 5553 14 3 329 268 Demography 9653 5509 20 2 330 269 Ecology and evolution 264853 116327 257 56 331 Economics 95934 59096 136 28 270 332 Education 40188 23065 42 10 271 333 History 26449 24043 49 6 272 Law 23974 19779 105 16 334 273 Mathematics 18348 14125 46 9 335 274 Molecular & Cell biology 382971 92528 178 44 336 275 Mycology 7469 3679 16 1 337 276 Operations research 13716 7780 18 4 338 277 Organizational and marketing 34254 17963 68 4 339 278 Philosophy 21738 19126 46 8 340 279 Physical anthropology 29693 16703 32 10 341 280 Plant physiology 9159 5436 21 3 342 Political science - international 15283 11835 34 2 281 343 Political science-US domestic 12581 7824 37 6 282 Pollution and occupational health 50967 12359 24 1 344 283 Probability and Statistics 37471 22094 90 23 345 284 Radiation damage 14118 4215 14 5 346 285 Sociology 57146 31662 94 21 347 286 Veterinary medicine 17756 4796 19 2 348 287 349 288 350 289 351 290 352 291 353 292 354 293 355 294 356 295 357 296 358 297 359 298 360 299 361 300 362 301 363 302 364 303 365 304 366 305 367 306 368 307 369 308 370 309 371 310 372

Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 3 of 22 373 Table S2. The structural characteristics of each top level field identified by the map equation hierarchical clustering. The “Prop Single-Author” 435 374 columns display the proportion of papers (authorships) which are single-authored out of all papers (authorships). The “Single-author” (“Multi- 436 Author”) column give the proportions of all authorships on single-authored (multi-authored) papers which were imputed a gender based on 375 first name. F- Female; M- Male; U- Unimputed. The “ICC” column displays the intraclass correlation of unimputed authorships on multi-author 437 376 papers. Structural characteristics of each field 438 377 439 378 Prop Single Author Single-Author Multi-Author 440 379 Label Papers Auth %F %M %U % F % M % U ICC 441 380 Anthropology 0.86 0.70 0.27 0.63 0.10 0.28 0.61 0.12 0.10 442 381 Classical studies 0.93 0.79 0.22 0.70 0.08 0.27 0.65 0.08 0.05 443 382 Cognitive science 0.25 0.09 0.29 0.64 0.07 0.28 0.62 0.10 0.09 444 Demography 0.57 0.32 0.24 0.61 0.15 0.30 0.53 0.16 0.11 383 445 Ecology and evolution 0.37 0.16 0.14 0.79 0.08 0.20 0.70 0.11 0.10 384 Economics 0.55 0.34 0.08 0.81 0.11 0.11 0.77 0.12 0.10 446 385 Education 0.55 0.31 0.35 0.58 0.07 0.41 0.50 0.08 0.07 447 386 History 0.92 0.84 0.24 0.70 0.06 0.23 0.69 0.08 0.05 448 387 Law 0.85 0.70 0.17 0.78 0.06 0.22 0.71 0.06 0.05 449 388 Mathematics 0.75 0.58 0.06 0.76 0.18 0.06 0.73 0.21 0.19 450 389 Molecular & Cell biology 0.14 0.03 0.19 0.70 0.10 0.23 0.61 0.16 0.09 451 390 Mycology 0.45 0.22 0.20 0.71 0.09 0.22 0.65 0.13 0.08 452 391 Operations research 0.48 0.27 0.05 0.81 0.14 0.08 0.72 0.20 0.21 453 392 Organizational and marketing 0.40 0.21 0.18 0.72 0.10 0.19 0.68 0.12 0.12 454 Philosophy 0.89 0.78 0.09 0.82 0.08 0.10 0.78 0.11 0.12 393 455 Physical anthropology 0.66 0.37 0.22 0.72 0.06 0.22 0.68 0.09 0.07 394 Plant physiology 0.53 0.31 0.13 0.79 0.08 0.17 0.72 0.10 0.07 456 395 Political science - international 0.78 0.60 0.16 0.74 0.10 0.17 0.73 0.10 0.08 457 396 Political science-US domestic 0.57 0.35 0.17 0.76 0.07 0.18 0.75 0.07 0.05 458 397 Pollution and occupational health 0.22 0.05 0.24 0.65 0.11 0.31 0.53 0.17 0.19 459 398 Probability and Statistics 0.54 0.32 0.08 0.75 0.17 0.13 0.69 0.19 0.20 460 399 Radiation damage 0.23 0.07 0.22 0.66 0.11 0.22 0.62 0.16 0.14 461 400 Sociology 0.52 0.29 0.30 0.63 0.08 0.38 0.53 0.08 0.07 462 401 Veterinary medicine 0.26 0.07 0.27 0.60 0.12 0.25 0.59 0.15 0.22 463 402 464 403 465 The plots below show how the following quantities have changed over time for each top level fields- average number of 404 466 authors per paper, proportion of papers with multiple authors, and the imputed gender proportions. The values are calculated 405 467 on the data before the data-cleaning procedure which removes authorship instances with unimputed genders. 406 468 407 469 408 470 409 471 410 472 411 473 412 474 413 475 414 476 415 477 416 478 417 479 418 480 419 481 420 482 421 483 422 484 423 485 424 486 425 487 426 488 427 489 428 490 429 491 430 492 431 493 432 494 433 495 434 496

4 of 22 Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 2233 C. Calculating and adjusting P-values. In a sampled configuration, if a field only contains authorships of a single gender, – is 2295 2234 undefined. When calculating a p-value, we consider this as – =1. This approach is conservative because it increases p-values, 2296 2235 but in practice has very little eect on our results. 2297 2236 In the main manuscript, we control the false discovery rate at .05 with the Benjamini-Yekutieli procedure (5) which allows 2298 2237 for arbitrary dependence of the p-values, but is more conservative than the Benjamini-Hochberg procedure (6), which only 2299 2238 allows for certain types of positive dependence. Table S5 replicates the last 3 columns of Table 1 of the main manuscript using 2300 2239 the Benjamini-Yekutieli procedure with an FDR of .005 as well as the Benjamini-Hochberg procedure with FDR rates of .05 2301 2240 and .005. 2302 2241 2303 2242 Table S5. Main results using different FDR procedures. “BY” indicated Benjamini-Yekutieli and “BH” indicates Benjamini-Hochberg. 2304 2243 2305 BY; Rate = .005 BH; Rate = .005 BH; Rate = .05 Readjust the Benjamini-Yekutieli2244 2306 Field P-value Term Comp P-value Term Comp P-value Term Comp 2245 2307 procedure to control the false JSTOR .00 68/1450 65/280 .00 114/1450 81/280 .00 261/1450 125/280 discovery rate at 0.005 instead2246 of 0.05 Mol/Cell Bio .00 16/178 18/44 .00 28/178 19/44 .00 53/178 29/44 2308 2247 Eco/Evol .00 7/257 13/56 .00 14/257 15/56 .00 43/257 25/56 2309 2248 Economics .00 4/136 8/28 .00 8/136 11/28 .00 25/136 17/28 2310 2249 Sociology .00 9/94 9/21 .00 12/94 12/21 .00 26/94 14/21 2311 2250 Prob/Stat .00 0/90 1/23 .00 1/90 2/23 .00 7/90 7/23 2312 2251 Org/mkt .00 5/68 3/4 .00 6/68 3/4 .00 15/68 3/4 2313 2252 Education .00 6/42 4/10 .00 10/42 6/10 .00 17/42 6/10 2314 2253 Occ Health .00 8/24 1/1 .00 12/24 1/1 .00 15/24 1/1 2315 2254 Anthro .00 2/63 1/8 .00 5/63 2/8 .00 8/63 2/8 2316 Law .00 0/98 0/16 .00 0/98 1/16 .00 1/98 4/16 2255 2317 History .00 0/49 0/6 .00 0/49 1/6 .00 1/49 1/6 2256 Phys Anthro .00 1/32 2/10 .00 1/32 2/10 .00 6/32 3/10 2318 2257 Intl Poli Sci .03 0/34 0/2 .00 0/34 0/2 .00 3/34 0/2 2319 2258 US Poli Sci .00 0/37 0/6 .00 2/37 1/6 .00 4/37 3/6 2320 2259 Philosophy .03 0/45 0/8 .00 0/45 0/8 .00 5/45 0/8 2321 2260 Math 1.00 0/46 0/9 .16 0/46 0/9 .16 2/46 0/9 2322 2261 Vet Med .00 5/19 1/2 .00 7/19 1/2 .00 8/19 2/2 2323 2262 Cog Sci .00 2/14 3/3 .00 4/14 3/3 .00 7/14 3/3 2324 2263 Radiation .00 3/14 1/5 .00 3/14 1/5 .00 5/14 3/5 2325 2264 Demography .00 0/20 0/2 .00 0/20 0/2 .00 3/20 2/2 2326 2265 Classics .27 0/35 0/8 .03 0/35 0/8 .03 1/35 0/8 2327 Opr Res .73 0/18 0/4 .09 0/18 0/4 .09 1/18 0/4 2266 2328 Plant Phys .03 0/21 0/3 .00 1/21 0/3 .00 4/21 0/3 2267 Mycology 1.00 0/16 0/1 .27 0/16 0/1 .27 1/16 0/1 2329 2268 2330 2269 2331 2270 2332 2271 2333 2272 2334 2273 2335 2274 2336 2275 2337 2276 2338 2277 2339 2278 2340 2279 2341 2280 2342 2281 2343 2282 2344 2283 2345 2284 2346 2285 2347 2286 2348 2287 2349 2288 2350 2289 2351 2290 2352 2291 2353 2292 2354 2293 2355 2294 2356

Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 19 of 22 Previous work • Field-level:

• Economics: study of publications from cohort of 178 PhDs found women > 5x more likely than men to have female co- authors

• A study of 1,045,401 multi-authored papers finds gender homophily at what we’d characterize as the field level

• Sub-field studies:

• Economics: Study of 3,090 articles in the top three journals (1991-2002) found gender homophily at the sub-field level

McDowell, John M., and Janet Kiholm Smith. "The effect of gender-sorting on propensity to coauthor: Implications for academic promotion." Economic Inquiry 30, no. 1 (1992): 68-82.

AlShebli, Bedoor K., Talal Rahwan, and Wei Lee Woon. "The preeminence of ethnic diversity in scientific collaboration." Nature communications 9, no. 1 (2018): 5163.

Boschini, Anne, and Anna Sjögren. "Is team formation gender neutral? Evidence from coauthorship patterns." Journal of Labor Economics 25, no. 2 (2007): 325-365. 2109 B. Comparison with Naive Approach. We can compare the expected value of – from the null distribution which accounts 2171 2110 for structural and compositional homophily (Eq. (2) in the main document) to the expected value of – from a naive null 2172 2111 distribution which only accounts for structural homophily. We construct a naive null distribution for each level l =1,...,6 in 2173 2112 the full hierarchical clustering by preserving all structure of (terminal or composite) fields with depth less than l, but treating 2174

2113 all fields of depth l as a terminal field. We then recalculate the swap probabilities pfa,faú given the citation flows, and then 2175 2114 run the sampler for 5,000 samples. We discard the first 1,000 as burn in and use the remaining 4,000 samples to calculate an 2176 2115 expected value and calculate p-values. 2177 2116 Columns labeled “Struct” provide the expected value of –, the expected number of female-male papers, the p-value for 2178 2117 behavioral homophily, and the number of significant composite fields under the null hypothesis of only structural (but not 2179 2118 compositional) homophily. Under only structural homophily, the expected – value is smaller than the expected – when also 2180 2119 preserving compositional homophily. In addition, the p-value decreases for all top-level fields when only considering structural 2181 2120 homophily. In many top-level fields, the number of composite fields with behavioral increases when only capturing structural 2182 2121 homophily and never decreases. 2183 2122 2184 2123 StructuralTable S4. Comparison versus of naive analysis compositional only preserving structural homophily homophily to main analysis. 2185 2124 2186 2125 – FM P-values Signif Comp 2187 2126 Field Obs Exp Struct Obs Exp Struct Main Struct Main Struct 2188 2127 JSTOR .11 .05 .00 38.6 41.1 43.3 .00 .00 82/280 157/280 2189 Mol/Cell Bio .05 .01 .00 38.2 39.8 40.2 .00 .00 19/44 35/44 2128 2190 Eco/Evol .06 .02 .00 31.9 33.3 34.0 .00 .00 15/56 33/56 2129 Economics .11 .02 .00 18.7 20.8 21.1 .00 .00 11/28 18/28 2191 2130 Sociology .19 .07 .00 38.5 44.2 47.4 .00 .00 12/21 19/21 2192 2131 Prob/Stat .09 .03 .00 26.0 27.8 28.6 .00 .00 2/23 12/23 2193 2132 Org/mkt .16 .04 .00 29.0 33.2 34.7 .00 .00 3/4 4/4 2194 2133 Education .16 .04 .00 41.2 47.2 49.2 .00 .00 6/10 9/10 2195 2134 Occ Health .10 .02 .00 41.7 45.5 46.3 .00 .00 1/1 1/1 2196 2135 Anthro .12 .03 .00 38.5 42.0 43.5 .00 .00 2/8 4/8 2197 2136 Law .17 .08 .00 29.7 32.9 35.7 .00 .00 1/16 4/16 2198 2137 History .16 .07 .00 32.9 36.5 39.1 .00 .00 1/6 2/6 2199 2138 Phys Anthro .07 .01 .00 34.7 36.8 37.1 .00 .00 2/10 2/10 2200 Intl Poli Sci .09 .02 .00 27.3 29.1 29.8 .03 .00 0/2 1/2 2139 2201 US Poli Sci .15 .07 .00 25.2 27.4 29.6 .00 .00 1/6 2/6 2140 Philosophy .10 .03 .00 18.7 20.3 20.8 .03 .00 0/8 0/8 2202 2141 Math .04 .01 .00 14.3 14.7 14.9 1.00 .12 0/9 0/9 2203 2142 Vet Med .09 .01 .00 38.3 41.6 42.0 .00 .00 1/2 2/2 2204 2143 Cog Sci .18 .09 .00 35.7 39.4 43.3 .00 .00 3/3 3/3 2205 2144 Radiation .09 .01 .00 34.0 36.8 37.3 .00 .00 1/5 4/5 2206 2145 Demography .15 .06 .00 40.3 44.4 47.3 .00 .00 1/2 1/2 2207 2146 Classics .07 .01 .00 38.7 41.2 41.7 .27 .01 0/8 0/8 2208 2147 Opr Res .03 .00 .00 16.6 17.1 17.1 .73 .33 0/4 0/4 2209 2148 Plant Phys .08 .02 .00 29.1 31.0 31.8 .03 .00 0/3 1/3 2210 Mycology .03 .01 .00 36.8 37.5 37.9 1.00 .60 0/1 0/1 2149 2211 2150 2212 2151 2213 2152 2214 2153 2215 2154 2216 2155 2217 2156 2218 2157 2219 2158 2220 2159 2221 2160 2222 2161 2223 2162 2224 2163 2225 2164 2226 2165 2227 2166 2228 2167 2229 2168 2230 2169 2231 2170 2232

18 of 22 Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 1489 3. Data Cleaning Procedures 1551 1490 1552 For the main analysis, we impute gender indicators for authorships with first names that are used for a single gender with at 1491 1553 least 95% frequency in either the U.S. Social Security records or in the genderizeR database. We consider the gender indicator 1492 1554 to be missing for authorships that either do not appear in those databases or are not used with at least 95% frequency for 1493 1555 one gender. We subsequenty remove authorships with unimputed genders from our main analysis. This removal results in 1494 1556 some articles which originally had multiple authors becoming single author papers, which are excluded from the analysis. 1495 1557 The following table shows the proportion of authorships and papers which are lost solely due to unimputed genders. The 1496 1558 denominator only includes papers which have multiple authors which were published from 1960-2012. The % unimputed 1497 1559 column is the % of authors for which we do not impute a gender indicator. The % Lost column is the % of authors (or papers) 1498 1560 which are lost after removing the authorships with unimputed gender indicators and then removing the resulting single author 1499 1561 papers. For authorships, this percentage includes the authorships with unimputed genders. 1500 1562 1501 Data reductionTable S3. Data reduction due due to to unimputed unimputed gender indicators gender 1563 1502 1564 1503 Prop Authors with Authors Papers 1565 1504 Label Unimputed Gender Remaining Prop Lost Remaining Prop Lost 1566 1505 Anthropology 0.12 9326 0.17 3466 0.15 1567 1506 Classical studies 0.08 1976 0.11 610 0.09 1568 1507 Cognitive science 0.10 12510 0.13 3814 0.07 1569 1508 Demography 0.16 5069 0.22 1930 0.17 1570 1509 Ecology and evolution 0.11 192091 0.13 66152 0.08 1571 Economics 0.12 51691 0.19 22178 0.15 1510 1572 Education 0.08 24356 0.12 9396 0.09 1511 1573 History 0.08 3699 0.12 1596 0.11 1512 Law 0.06 6526 0.10 2765 0.09 1574 1513 Mathematics 0.21 5319 0.31 2459 0.25 1575 1514 Molecular & Cell biology 0.16 303761 0.18 73357 0.07 1576 1515 Mycology 0.13 4828 0.17 1759 0.11 1577 1516 Operations research 0.20 7217 0.28 3025 0.21 1578 1517 Organizational and marketing 0.12 22299 0.18 9137 0.13 1579 1518 Philosophy 0.11 3897 0.18 1770 0.15 1580 1519 Physical anthropology 0.09 16463 0.12 5175 0.07 1581 1520 Plant physiology 0.10 5388 0.14 2287 0.10 1582 Political science - international 0.10 5128 0.16 2247 0.14 1521 1583 Political science-US domestic 0.07 7269 0.11 3068 0.09 1522 Pollution and occupational health 0.17 39703 0.18 8845 0.06 1584 1523 Probability and Statistics 0.19 18763 0.27 7600 0.21 1585 1524 Radiation damage 0.16 10710 0.18 2902 0.09 1586 1525 Sociology 0.08 35858 0.12 13600 0.09 1587 1526 Veterinary medicine 0.15 13741 0.17 3275 0.06 1588 1527 Total 0.14 807588 0.16 252413 0.11 1589 1528 1590 1529 1591 1530 1592 1531 1593 1532 1594 1533 1595 1534 1596 1535 1597 1536 1598 1537 1599 1538 1600 1539 1601 1540 1602 1541 1603 1542 1604 1543 1605 1544 1606 1545 1607 1546 1608 1547 1609 1548 1610 1549 1611 1550 1612

Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva 13 of 22