Arxiv:2010.12741V1 [Cs.CL] 24 Oct 2020

Total Page:16

File Type:pdf, Size:1020Kb

Arxiv:2010.12741V1 [Cs.CL] 24 Oct 2020 An Evaluation Protocol for Generative Conversational Systems Seolhwa Lee∗ Heuiseok Lim Korea University Korea University Seoul, South Korea Seoul, South Korea [email protected] [email protected] Joao˜ Sedoc ∗ New York University New York, USA [email protected] Abstract reason, human evaluation is standard practice. Al- though there is a tremendous amount of research in There are a multitude of novel generative mod- generative conversational systems recently, there els for open domain conversational systems; are no standard experimental design or evaluation however, there is no systematic evaluation of different systems. Systematic comparisons re- methods. This is a crucial issue. A systematic quire consistency in experimental design, eval- evaluation requires the exact same 1) experimental uation sets, conversational systems and their design, 2) evaluation datasets 3) models and their outputs, and statistical analysis. In this paper response utterances, and 4) statistical analyses. To layout a protocol for the evaluation of conver- solve this issue, we present an evaluation protocol sational models using head-to-head pairwise with code and template evaluation. comparison. We analyze ten recent models We call for an evaluation protocol and provide which claim state-of-the-art performance us- ing a paired head-to-head performance (win- a partial solution. In order to forward this, we loss-tie) on five evaluation datasets. Our find- present a full work through of our proposal by ex- ings show that DialoGPT and Blender are su- amining an experimental design on various models perior systems using Bradley-Terry model and and evaluation sets, and then we present a thorough TrueSkill ranking methods. These findings analysis of results. Specifically, we use a next ut- demonstrate the feasibility of our protocol to terance generation task through the ChatEval1 A/B evaluate conversational agents and evaluation paired comparison approach (Sedoc et al., 2019). sets. Finally, we make all code and evalua- tions publicly available for researchers to com- ChatEval has a standard experimental design and pare their model to other state-of-the-art dialog evaluation datasets. models. One limitation with this experimental design is the lack of interactive evaluation. There is a trade- 1 Introduction off between truly interactive evaluation (or deploy- ment A/B testing) and statistical significance. As There has been a flurry of recent work in open seen in Venkatesh et al.(2018), thousands of in- domain conversational systems which can ideally teractive conversations are required for statistical converse about any topic (Csaky, 2019; Gao et al., significance. While head-to-head next utterance 2019). Recent generative conversational systems generation comparison is further from the end-task arXiv:2010.12741v1 [cs.CL] 24 Oct 2020 use end-to-end trained neural network encoder- of conversation, it has more statistical power due to decoder models (Vinyals and Le, 2015). Evalu- the fact that it supports paired tests. Novikova et al. ating the improvement between models is difficult (2018) found that relative rankings yield more dis- as different system are rarely compared to other criminative results than absolute assessments when state-of-the-art models using the same evaluation evaluating natural language generation. datasets with the same evaluation setup. Arguably, the lack of standardized comparisons of systems In this setting, dialog systems can be viewed impedes progress in the field. as addressing a natural language generation task and not as being an interactive agent that carries The evaluation of generative conversational sys- the conversation further. Arguably, due to their tems is challenging due to a lack of automatic met- lack of planning and reasoning abilities, many cur- rics (Li et al., 2017a; Lowe et al., 2017). For this Work∗ performed while at Johns Hopkins University. 1http://chateval.org rent dialog systems are actually sentence level lan- 2.2 Ranking guage models. Although next utterance generation We follow the Workshop on Machine Translation is a more artificial task, Logacheva et al.(2018) (WMT) ranking score method (Bojar, 2018, Thesis observed a Pearson correlation of 0.6 between chapter 7.1): conversation-level and utterance-level ratings. In order to analyze our evaluation protocol, we win Majorscore(win) = ; perform a large-scale human evaluation of ten base- win + loss line and state-of-the-art systems. We find that while loss Majorscore(loss) = ; this is a O(kn2) problem (k evaluation sets and n win + loss systems), it is not prohibitively expensive.2 For our win Distinctscore(win) = ; evaluation protocol: 1) We evaluate ten state-of-the- win + loss + tie art models under the same evaluation conditions, loss Distinctscore(loss) = : 2) we use publicly available evaluation datasets win + loss + tie with both single-turn and multi-turn conversational The major score (i.e., Majorscore) is for the build- prompts and 3) we carefully analyze human an- ing pairwise ranking, which accepts only both the notations and multiple single-turn and multi-turn win and loss count ignoring tie. On the other hand, evaluation sets. the distinct score (i.e., Distinctscore) includes tie to assess how frequently the systems were judged 2 Evaluation Methodology to be better than or equal to the others. This pe- nalization allows one to differentiate systems more We utilize the ChatEval A/B paired testing frame- carefully. We also consider the total system win work to evaluate all the systems in our study. Sub- count (i.e., frequency) as rank method, for example sequently, we analyze the results using win scores if Blender “wins” over DialoGPT and ConvAI2, and system ranking analyses. then its system win count is two. Furthermore, we analyze our system compar- 2.1 ChatEval A/B Paired Test isons using two standard statistical ranking meth- ChatEval (Sedoc et al., 2019) is an online platform ods: TrueSkill (Herbrich et al., 2007) and Bradley- that compares responses of two different systems Terry (BT) model (Bradley and Terry, 1952). under the same dialog context. ChatEval interfaces TrueSkill is a non-parametric online algorithm to with Amazon Mechanical Turk3 to assess models evaluate a relative skills of players through the (See the AppendixA for further details). Annota- competitions such as Microsoft’s Xbox Live. For tors are presented the next utterance in a conversa- TrueSkill, we follow the WMT-TrueSkill (Sak- tion given the context of some number of previous aguchi et al., 2014) approach, which is used for turns. Then, the annotator (i.e., crowd worker) de- ranking MT systems in WMT by measuring the cides which answer (i.e., two possible responses ‘relative ability’ from the space of system pair A/B.) is the best or if there is a tie between both sys- matchings. The Bradley-Terry model is a paramet- tems. The platform randomly distributes the tasks ric probability model that can predict the outcome among different annotators allowing an unbiased of a paired comparison. We carry out experiments pairwise evaluation. Table1 illustrates the different with these different methods. models, responses as A and B, given the prompt. 3 Evaluation Datasets Prompt: Is the sky blue or black? We evaluate generative conversational systems A: It’s a black sky. using four datasets as single-turn (i.e., NCME, B: The sky is blue because of an optical effect DBDC, Twitter, Cornell Movie DC) and one known as Rayleigh scattering. dataset (ESL) for the multi-turn evaluation. This al- lows us to compare between single-turn and multi- Table 1: The example of Chateval A/B paired test. turn capability of each model. For the evalua- tion on the multi-turn datasets we only use models that can capture conversational history (DialoGPT, 2The total cost of all crowdsourcing experiments was ap- proximately $1,300. Blender, CakeChat (HRED implementation), Con- 3https://www.mturk.com/ vAI2 (seq2seq), ConvAI2 (KV-MemNN), ParlAI (controllable)). These evaluation datasets are pub- 2020). In addition, we evaluate several other sys- licly available on the ChatEval web portal. Aside tems: Controllable dialogue, ConvAI2 (seq2seq, from Cornell Movie DC evaluation dataset which KV-MemNN), Transformer, OpenNMT(Twitter, has 1000 prompts, all other evaluation datasets OS), Cakechat and DC-NeuralConversation. Next, have 200 prompts. we summarize these systems (see AppendixB for further details): i. Neural Conversational Model Evaluation set Human baselines We use human baselines: (NCME) Vinyals and Le(2015) conducted hu- NCME human 1, NCME human 2, DBDC human, man evaluation using a hand-crafted set of 200 Twitter baseline and Cornell movie DC baseline.1 single turn prompts. A large portion of these NCME and DBDC have two human baselines prompts are questions. The NCME dataset includes which are created post-prompt selection. Whereas both specific domains, noisy and general domain Twitter, DBDC, and ESL baselines are from the morality gen- prompts, such as questions about and next turn in the conversation. ParlAI (Blender) eral knowledge in math . (Roller et al., 2020) is recently presented as open- ii. Dialog Breakdown Detection Challenge Eval- domain generative conversational model from the 5 uation set (DBDC) The DBDC dataset consists ParlAI platform . Blender uses a ensemble of vari- of a series of text-based conversations between a ous models to create a conversational system. This human and a chatbot where the human was aware leads to high quality response generation. Blender they were chatting with a computer (Higashinaka achieves the state-of-the-art on existing approaches et al., 2016). The evaluation set is a selection of 200 in multi-turn dialogue yielding humanness and en- single turns from the DBDC 3 dataset (Higashinaka gagingness measurements. Notably, to our knowl- et al., 2017). edge Blender was not compared to DialoGPT until this work. iii. Twitter A set of 200 prompts from conver- DialoGPT (Zhang et al., 2019) is another state-of- sational threads were randomly drawn from the the-art model which uses a GPT framework trained ParlAI (Miller et al., 2017) Twitter derived test set.
Recommended publications
  • PERVASIVE BEHAVIOR INTERVENTIONS Using Mobile Devices for Overcoming Barriers for Physical Activity
    PERVASIVE BEHAVIOR INTERVENTIONS Using Mobile Devices for Overcoming Barriers for Physical Activity Vom Fachbereich Elektrotechnik und Informationstechnik der Technischen Universität Darmstadt zur Erlangung des akademischen Grades eines Doktor-Ingenieurs (Dr.-Ing.) genehmigte Dissertation von DIPL.-INF. UNIV. TIM ALEXANDER DUTZ Geboren am 20. Juli 1978 in Darmstadt Vorsitz: Prof. Dr. techn. Heinz Koeppl Referent: Prof. Dr.-Ing. habil. Ralf Steinmetz Korreferent: Prof. Dr. rer. nat. Rainer Malaka Tag der Einreichung: 14. September 2016 Tag der Disputation: 28. November 2016 Hochschulkennziffer D17 Darmstadt 2017 Dieses Dokument wird bereitgestellt von tuprints, E-Publishing-Service der Technischen Universität Darmstadt. http://tuprints.ulb.tu-darmstadt.de [email protected] Bitte zitieren Sie dieses Dokument als: URN: urn:nbn:de:tuda-tuprints-61270 URL: http://tuprints.ulb.tu-darmstadt.de/id/eprint/6127 Die Veröffentlichung steht unter folgender Creative Commons Lizenz: International 4.0 – Namensnennung, nicht kommerziell, keine Bearbeitung https://creativecommons.org/licenses/by-nc-nd/4.0/ Für meine Eltern Abstract Extensive cohort studies show that physical inactivity is likely to have negative consequences for one’s health. The World Health Organization thus recommends a minimum of thirty minutes of medium- intensity physical activity per day, an amount that can easily be reached by doing some brisk walking or leisure cycling. Recently, a Taiwanese-American team of scientists was able to prove that even less effort is required for positive health effects and that as little as fifteen minutes of physical activity per day will increase one’s life expectancy by up to three years on the average. However, simply spreading this knowledge is not sufficient.
    [Show full text]
  • Competition-Based User Expertise Score Estimation
    Competition-based User Expertise Score Estimation Jing Liu†*, Young-In Song‡, Chin-Yew Lin‡ † ‡ Harbin Institute of Technology Microsoft Research Asia No. 92, West Da-Zhi St, Nangang Dist. Building 2, No. 5 Dan Ling St, Haidian Dist. Harbin, China 150001 Beijing, China 100190 [email protected] {yosong, cyl}@microsoft.com ABSTRACT covered by existing web pages. With the explosive growth of web 2.0 sites, community question and answering services (denoted as In this paper, we consider the problem of estimating the relative 1 2 expertise score of users in community question and answering CQA) such as Yahoo! Answers and Baidu Zhidao , have become services (CQA). Previous approaches typically only utilize the important services where people can use natural language rather explicit question answering relationship between askers and an- than keywords to ask questions and seek advice or opinions from swerers and apply link analysis to address this problem. The im- real people who have relevant knowledge or experiences. CQA plicit pairwise comparison between two users that is implied in services provide another way to satisfy a user’s information needs the best answer selection is ignored. Given a question and answer- that cannot be met by traditional search engines. Users are the ing thread, it’s likely that the expertise score of the best answerer unique source of knowledge in CQA sites and all users from ex- is higher than the asker’s and all other non-best answerers’. The perts to novices can generate content arbitrarily. Therefore, it is goal of this paper is to explore such pairwise comparisons inferred desirable to have a system that can automatically estimate the user from best answer selections to estimate the relative expertise expertise score and identify experts who can provide good quality scores of users.
    [Show full text]
  • Empirical Software Engineering at Microsoft Research
    Software Analytics for Digital Games Thomas Zimmermann, Microsoft Research, USA Joint work with Nachi Nagappan and many others. © Microsoft Corporation analytics is the use of analysis, data, and systematic reasoning to make decisions. Definition by Thomas H. Davenport, Jeanne G. Harris Analytics at Work – Smarter Decisions, Better Results © Microsoft Corporation history of software analytics Tim Menzies, Thomas Zimmermann: Software Analytics: So What? IEEE Software 30(4): 31-37 (2013) © Microsoft Corporation © Microsoft Corporation trinity of software analytics Dongmei Zhang, Shi Han, Yingnong Dang, Jian-Guang Lou, Haidong Zhang, Tao Xie: Software Analytics in Practice. IEEE Software 30(5): 30-37, September/October 2013. MSR Asia Software Analytics group: http://research.microsoft.com/en-us/groups/sa/ © Microsoft Corporation software analytics is © Microsoft Corporation software analytics is diversity © Microsoft Corporation The Stakeholders The Tools The Questions © Microsoft Corporation http://aka.ms/145Questions Andrew Begel, Thomas Zimmermann. Analyze This! 145 Questions for Data Scientists in Software Engineering. ICSE 2014 © Microsoft Corporation Essential + Microsoft’s Top 10 Questions Essential Worthwhile How do users typically use my application? 80.0% 99.2% What parts of a software product are most used and/or loved by 72.0% 98.5% customers? How effective are the quality gates we run at checkin? 62.4% 96.6% How can we improve collaboration and sharing between teams? 54.5% 96.4% What are the best key performance indicators (KPIs) for 53.2% 93.6% monitoring services? What is the impact of a code change or requirements change to 52.1% 94.0% the project and its tests? What is the impact of tools on productivity? 50.5% 97.2% How do I avoid reinventing the wheel by sharing and/or searching 50.0% 90.9% for code? What are the common patterns of execution in my application? 48.7% 96.6% How well does test coverage correspond to actual code usage by 48.7% 92.0% our customers? © Microsoft Corporation Obsessing over our customers is everybody's job.
    [Show full text]
  • The Evaluation of Rating Systems in Online Free-For-All Games
    The Evaluation of Rating Systems in Online Free-for-All Games Arman Dehpanah Muheeb Faizan Ghori Jonathan Gemmell Bamshad Mobasher School of Computing School of Computing School of Computing School of Computing DePaul University DePaul University DePaul University DePaul University Chicago, USA Chicago, USA Chicago, USA Chicago, USA [email protected] [email protected] [email protected] [email protected] Abstract—Online competitive games have become increasingly might also be hampered by the inclusion of new players since popular. To ensure an exciting and competitive environment, these the system does not possess any knowledge of these players. games routinely attempt to match players with similar skill levels. In this paper, we consider six evaluation metrics. We include Matching players is often accomplished through a rating system. There has been an increasing amount of research on developing traditional metrics such as accuracy, mean absolute error, such rating systems. However, less attention has been given to the and Kendall’s rank correlation coefficient. We further include evaluation metrics of these systems. In this paper, we present an metrics adapted from the domain of information retrieval, exhaustive analysis of six metrics for evaluating rating systems in including mean reciprocal rank (MRR), average precision online competitive games. We compare traditional metrics such as (AP), and normalized discounted cumulative gain (NDCG). accuracy. We then introduce other metrics adapted from the field of information retrieval. We evaluate these metrics against several We analyze the ability of these metrics to capture meaningful well-known rating systems on a large real-world dataset of over insights when they are used to evaluate the performance of 100,000 free-for-all matches.
    [Show full text]
  • Application and Further Development of Trueskill™ Ranking in Sports
    TVE-F 19019 Examensarbete 15 hp Juni 2019 Application and Further Development of TrueSkill™ Ranking in Sports Julia Ibstedt Elsa Rådahl Erik Turesson Magdalena vande Voorde Abstract Application and Further Development of TrueSkill™ Ranking in Sports Julia Ibstedt, Elsa Rådahl, Erik Turesson, Magdalena vande Voorde Teknisk- naturvetenskaplig fakultet UTH-enheten The aim of this study was to explore the ranking model TrueSkill™ developed by Microsoft, applying it on various sports and Besöksadress: constructing extensions to the model. Two different inference Ångströmlaboratoriet Lägerhyddsvägen 1 methods for TrueSkill was constructed using Gibbs sampling and Hus 4, Plan 0 message passing. Additionally, the sequential method using Gibbs sampling was successfully extended into a batch method, in order Postadress: to eliminate game order dependency and creating a fairer, although Box 536 751 21 Uppsala computationally heavier, ranking system. All methods were further implemented with extensions for taking home team advantage, score Telefon: difference and finally a combination of the two into 018 – 471 30 03 consideration. The methods were applied on football (Premier Telefax: League), ice hockey (NHL), and tennis (ATP Tour) and evaluated on 018 – 471 30 00 the accuracy of their predictions before each game. Hemsida: On football, the extensions improved the prediction accuracy from http://www.teknat.uu.se/student 55.79% to 58.95% for the sequential methods, while the vanilla Gibbs batch method reached the accuracy of 57.37%. Altogether, the extensions improved the performance of the vanilla methods when applied on all data sets. The home team advantage performed better than the score difference on both football and ice hockey, while the combination of the two reached the highest accuracy.
    [Show full text]
  • Thesis Template
    Tailoring a Psychophysiologically Driven Rating System MASTER DISSERTATION Harryharasuthan Vasantharajah MASTER IN COMPUTER ENGINEERING SUPERVISOR Sergi Bermúdez I Badia TAILORING A PSYCHOPHYSIOLOGICALLY DRIVEN RATING SYSTEM Harryharasuthan Vasantharajah B.Sc. (Hons) Supervised by Sergi Bermúdez I Badia, Ph.D. Submitted in fulfillment of the requirements for the degree of Masters in Computer Engineering. Faculty of Exact Sciences and Engineering University of Madeira 2019 Abstract Humans have always been interested in ways to measure and compare their performances to establish who is best at a particular activity. The first Olympic Games, for instance, were carried out in 776 BC, and it was a defining moment in history where ranking based competitive activities managed to reach the general populous. Every competition must face the issue of how to evaluate and rank competitors, and often rules are required to account for many different aspects such as variations in conditions, the ability to cheat, and, of course, the value of entertainment. Nowadays, measurements are performed out through various rating systems, which considers the outcomes of the activity to rate the participants. However, they do not seem to address the psychological aspects of an individual in a competition. This dissertation employs several psychophysiological assessment instruments intending to facilitate the acquisition of skill level rating in competitive gaming. To do so, an exergame that uses non-conventional inputs, such as body tracking to prevent input biases, was developed. The sample size of this study is ten, and the participants were put on a round-robin tournament to provide equal intervals between games for each player. After analyzing the outcome of the competition, it revealed some critical insights on the psychophysiological instruments; Especially the significance of Flow in terms of the prolificacy of a player.
    [Show full text]
  • Ranking (Trueskill) Map #1 #1 Player Map #2 Player #2 Player Vs Player #3 Map Map #3 Map #4 Map #5 #4 Player Ranking Systems ~ 30 Mins Talk
    Math for Game Programmers: Ranking Systems; Elo, TrueSkill and Your Own Mario Izquierdo Sr. Software Engineer at Ranking (TrueSkill) map #1 #1 player map #2 player #2 player vs player #3 map map #3 map #4 map #5 #4 player Ranking Systems ~ 30 mins talk • Elo • TrueSkill • Practical Considerations Elo Rating System Árpád Élő (1903 - 1992) • Physics profesor and master chess player. • Elo's system constituted an improvement on the previous Harkness System. • Elo's system was adopted by the FIDE (World Chess Federation) in 1970. • Published "The Rating of Chessplayers, Past and Present” in 1978. • Fun fact: Up until the mid-80’s, Elo himself made the rating calculations! Elo Rating System: Normal Distribution Assumption: Chess performance is a normally distributed random variable. Using some simplifications (i.e. constant standard deviation) makes easy to calculate the Expected score of a match (probability of win) for two given player skill levels. Elo Rating System: Normal Distribution “Slime Curve” In the eyes of ELO, you are all “slime people” Elo Rating System: Normal Distribution “Slime Curve” Elo Rating System: Formula After a given match, rating points are transferred between players: RatingDiff = (Score - Expected) * K-factor Where: Score is 0 = loss, 0.5 = draw, 1 = win Expected is 0 to 1, the probability of winning K-factor is a constant for maximum change (update “speed”) Elo Rating System: Formula After a given match, rating points are transferred between players: RatingDiff = (Score - Expected) * K-factor Much of the trick is in figuring out what the Expected result of a game is. The original ELO system uses the following formula (from the Normal dist.): Expected[A] = 1/(1+10^(Rating[B-A]/400)) Elo Rating System: Formula After a given match, rating points are transferred between players: RatingDiff = (Score - Expected) * K-factor Much of the trick is in figuring out what the Expected result of a game is.
    [Show full text]
  • Trueskill 2: an Improved Bayesian Skill Rating System
    TrueSkill 2: An improved Bayesian skill rating system Tom Minka Ryan Cleven Yordan Zaykov Microsoft Research The Coalition Microsoft Research March 22, 2018 Abstract Online multiplayer games, such as Gears of War and Halo, use skill-based matchmaking to give players fair and enjoyable matches. They depend on a skill rating system to infer accurate player skills from historical data. TrueSkill is a popular and effective skill rating system, working from only the winner and loser of each game. This paper presents an extension to TrueSkill that incorporates additional information that is readily available in online shooters, such as player experience, membership in a squad, the number of kills a player scored, tendency to quit, and skill in other game modes. This extension, which we call TrueSkill2, is shown to significantly improve the accuracy of skill ratings computed from Halo 5 matches. TrueSkill2 predicts historical match outcomes with 68% accuracy, compared to 52% accuracy for TrueSkill. 1 Introduction When a player wants to play an online multiplayer game, such as Halo or Gears of War, they join a queue of waiting players, and a matchmaking service decides who they will play with. The matchmaking service makes its decision based on several criteria, including geographic location and skill rating. Our goal is to improve the fairness of matches by improving the accuracy of the skill ratings flowing into the matchmaking service. The skill rating of a player is an estimate of their ability to win the next match, based on the results of their previous matches. A typical match result lists the players involved, their team assignments, the length of the match, how long each player played, and the final score of each team.
    [Show full text]
  • Influence of Gameplay on Skill in Halo Reach
    Influence of Gameplay on Skill in Halo Reach Jeff Huang Abstract University of Washington We study the question of how skill develops in video [email protected] game through a rating called TrueSkill. In a previous paper [1] we used the skill ratings from 7 months of Thomas Zimmermann games from over 3 million players to look at how play Microsoft Corporation intensity, breaks in play, other titles played, and skill [email protected] change over time affect skill. In this paper, we briefly summarize our findings and discuss how we plan to Nachiappan Nagappan continue our research. Microsoft Corporation [email protected] Keywords Games, Analytics, Game Usage, Player Progression Bruce Phillips Microsoft Corporation ACM Classification Keywords [email protected] K.8.0 [Personal Computing]: Games. Chuck Harrison General Terms Microsoft Corporation Human Factors, Measurement. [email protected] Introduction In this paper we present a brief overview of player game characteristics in Halo Reach. We discuss play intensity and play patterns with respect to skill development. We present a general method of our analysis that can be applied to other games. We Copyright is held by the author/owner(s). conclude with some of our future plans for mining CHI’13, April 27 – May 2, 2013, Paris, France. game-player data. ACM 978-1-XXXX-XXXX-X/XX/XX. Analysis of Skill Data Skill in Halo Reach Other Titles Played. Players who did not play Halo 3 For our study, we selected a cohort of 3.2 million Halo previously were less skilled but gained skill at about the Step 1: Select a population of Reach players who started playing the game in its first same rate as everyone else in Halo Reach.
    [Show full text]
  • Measuring Cooperative Behavior in Contemporary Multiplayer Games
    MEASURING COOPERATIVE BEHAVIOR IN CONTEMPORARY MULTIPLAYER GAMES Martin Ashton Master of Science School of Computer Science McGill University Montreal, Quebec 2012-08-12 A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science Copyright c 2012 Martin Ashton DEDICATION This work is dedicated to my friends Pete, Dave, Kevin, Vanessa, Steve, Mike, Nadina, John, Chris, Max, Dan, Francine, Peter, Daniel, Michael, Dahlia, Sabrina, Michaela, Katrina, Diana, Stevie, Kathy, Bronson, Jess, Theo, Cynthia, Ben, Shayne, Paul, Andre, Fred, Justin, J.P., Aly, Carl, Udhay, Phil, Matt, Simon, Julien, and the rest of the Koalas. ii ACKNOWLEDGEMENTS This work was supported, in part, by the Natural Sciences and Engineering Council of Canada (NSERC), and Le Fonds de Recherche du Qu´ebec - Nature et Technologies (FQRNT). My particular thanks are extended to Prof. Clark Verbrugge for giving me the opportunity to research the exciting domain of modern video games, and to my parents for teaching me the importance of continuously exceeding my own limits. Lastly, I would like to thank the developers at Bungie, Blizzard, and Valve for making such excellent games. iii ABSTRACT Social aspects of multiplayer games are well known as contributors to game success, with online friendships and socialization expected to expand and strengthen a player-base. Understanding the nature of social behavior and determining the impact of cooperation on gameplay is thus important to game design. In this work, we make use of data exposed through in-game and web-based API’s of two contemporary multiplayer games, World of Warcraft and Halo: Reach.
    [Show full text]
  • Model-Based Machine Learning
    Model-Based Machine Learning Christopher M. Bishop Microsoft Research, Cambridge, CB3 0FB, U.K. [email protected] Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this paper we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning don’t have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this paper we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. We also describe the concept of probabilistic programming as a powerful software environment for model-based machine learning, and we discuss a specific probabilistic programming language called Infer.NET, which has been widely used in practical applications.
    [Show full text]
  • Microsoft Research Cambridge
    Welcome to Microsoft Research Cambridge 21st November Alex Butler Cambridge Wireless UX SIG Sensors & Devices Group “No Free Lunch: The Consumer as Product in a Data-Driven Economy” Microsoft Research Presence Mission • Advance the state of the art in Computer Science • Transfer technology to Microsoft business • Lead Microsoft into the future Microsoft Research Cambridge The Numbers STAFF HONOURS OUTPUTS 121 RESEARCHERS 1 KNIGHTHOOD HUNDREDS OF TIER PUBLICATIONS PER YEAR 100 INTERNS PER YEAR 1 TURING AWARD WINNER 45+ PATENTS FILED PER YEAR 17 R&D STAFF FROM OTHER 1 KYOTO PRIZE WINNER MICROSOFT R&D GROUPS 2 MARR PRIZE WINNERS 11 VISITING RESEARCHERS PER YEAR 2 ACM FELLOWS OVER 150 DAY VISITORS, 1 IEEE FELLOW SEMINAR SPEAKERS PER YEAR 3 ROYAL SOCIETY FELLOWS 1 ROYAL SOCIETY OF EDINBURGH FELLOW 4 ROYAL ACADEMY OF ENGINEERING FELLOWS 1 MACROBERT AWARD 1 ROOKE MEDAL 1 VON NEUMANN MEDAL 1 EADS GRAND PRIZE Research Areas Machine Learning & Perception Computer Vision. Machine Learning. Online Services & Advertising. Constraint Reasoning. Programming Principles & Tools Formal Methods. Programme Structures. Programming Systems. Constructive Security. Systems & Networking Distributed Systems. Networking. Networks, Economics & Algorithms. Operating Systems. Computer-Mediated Living Integrated Systems. Sensors & Devices. Socio-Digital Systems. i3D. Computational Science Biological Computation. Computational Ecology & Environmental Sciences. Research Connections Advanced Research Tools and Services Community, Intellectual Capital Development Earth,
    [Show full text]