The Transformative Potential of AI for Science: Will AI Write the Scientific Papers of the Future?
Yolanda Gil
Information Sciences Institute and Department of Computer Science, University of Southern California
https://tinyurl.com/yolandagil-aaai2020
This license lets others distribute, remix, tweak, and build upon your work, even commercially, as long as the original source is credited. Source: “Will AI Write the Scientific Papers of the Future?”, Yolanda Gil, 2020 Presidential Address, Association for the Advancement of Artificial Intelligence (AAAI), February 2020.
A Personal Perspective on AI
• The AI community has always been
  • Visionary
  • Broad
  • Inclusive
  • Interdisciplinary
  • Determined
• And dare I say
  • Successful
A Personal View of Watershed Moments in AI: (I) 1980s
Scripts, Plans, Goals and Understanding, Roger C. Schank and Robert P. Abelson
A Personal View of Watershed Moments in AI: 1990s
1990: Sphinx shows speaker-independent large-vocabulary continuous speech recognition (I used it to write my PhD thesis)
1992: Soar flies helicopter teams in simulations and is indistinguishable to commanders from human-controlled aircraft
1992: TD-Gammon autonomously learns to play backgammon at human player levels
1995: SKICAT identifies five new quasars in the Second Palomar Sky Survey
1995: Navlab is the first trained car to drive autonomously on highways to cross the United States
1997: Deep Blue defeats human chess world champion and gets grandmaster-level rating
1999: RAX flew a spacecraft autonomously, demonstrating planning, monitoring, and fault repair
https://doi.org/10.1609/aimag.v16i1.1121
https://www.cs.cmu.edu/afs/cs/project/alv/www/navlab_video_page_files/Navlab_Kanade_version97.mpg
https://flic.kr/p/9Dpww
A Personal View of Watershed Moments in AI: 2000s
2000: The Gene Ontology is shown to describe over 15,000 gene products for drosophila, mouse, and yeast
2000: Kismet demonstrates and recognizes emotions
2003: USC ISI’s statistical machine translation prototype beats hand-crafted commercial systems
2004: RDF semantic specifications become W3C recommendations for the GGG (Giant Global Graph)
2007: Stanley wins $1M for first autonomous high-speed off-road driving
2007: First robot soccer team against human players in exhibition game at RoboCup
2009: Pragmatic Chaos ensemble learning wins $1M competition to predict user film ratings
https://commons.wikimedia.org/w/index.php?curid=20798358
https://commons.wikimedia.org/w/index.php?curid=374949
A Personal View of Watershed Moments in AI: 2010s
2010: Siri voice-activated personal assistant is released as a smartphone app
2011: CMDragons team of 10 soccer robots coordinate plays for routine passing, interception, and goal scoring
2011: Watson takes first place in Jeopardy Q/A game, defeating two human champions
2012: AlexNet scores improved 10 percentage points in the ImageNet visual recognition challenge
2013: Cognitive Tutor shows 8 percentile points average improvement in algebra in a 25,000-student study
2016: Knowledge Graph used as semantic backbone in one third of 100B monthly searches
2019: Wikidata records 8 billion triples and over 880 billion edits, surpassing Wikipedia as most edited Wikimedia site
https://en.wikipedia.org/w/index.php?curid=54391045
https://www.youtube.com/watch?v=4QtBSDSC2pk
Diversity and Breadth of Advances in AI
Creativity Vision Perception Commonsense Robotics Ethics Speech Sensing Explanation Language Discovery
Representation Knowledge
Search Planning Learning Cognition Reasoning Teamwork Matching Prediction Coordination Constraints Dialogue Understanding
https://commons.wikimedia.org/w/index.php?curid=70268939
AI to Address Major Future Challenges
The Imperative for AI in Science
AI and Scientific Discovery
• [Lenat 1976]
• [Lindsay et al 1980]
• [Langley 1981]
• [Falkenhainer 1985]
• [Kulkarni and Simon 1988]
• [Cheeseman et al 1989]
• [Zytkow et al 1990]
• [Simon 1996]
• [Valdes-Perez 1997]
• [Todorovski et al 2000]
• [Schmidt and Lipson 2009]
Human Limitations Curb Scientific Progress [Gil DSJ’17]
• Not systematic • e.g., [Peters et al PLOS 2014]
• Errors • e.g., [Herndon et al CJE 2013]
• Biases • e.g., [Anderson et al ACS 2014]
• Poor reporting • e.g., [Garijo et al PLOS 2013]
https://doi.org/10.1145/3339399
https://doi.org/10.1038/s41586-019-1335-8
https://doi.org/10.1038/s41586-019-1923-7
https://arxiv.org/pdf/1910.06710
http://commons.wikimedia.org/wiki/File:MRI_brain_sagittal_section.jpg
http://commons.wikimedia.org/wiki/File:Earth_Eastern_Hemisphere.jpg
http://www.nasa.gov/mission_pages/swift/bursts/uv_andromeda.html
Capturing Scientific Knowledge
Scientific Knowledge
http://www.pihm.psu.edu/pihm_home.html
Supporting Compositionality of Scientific Knowledge
Ø Data formats
Ø Physical variables
Ø Constraints for use
Ø Adjustable parameters
Ø Interventions
Capturing Scientific Knowledge
[Diagram labels: Crowdsourced vocabularies; Common entities; Software and models; Provenance; Scientific data repositories; Data analysis processes; Collaborative methods; Open reproducible publications; Automating discoveries; Model integration]
Low-Cost Creation of Scientific Vocabulary Standards
[Gil et al ISWC 2017; Khider et al PP 2019; Emile-Geay et al PAGES 2018]
Work with Deborah Khider, Julien Emile-Geay, Daniel Garijo (USC); Nick McKay (NAU)
Problem: Diversity of requirements for metadata
Approach: Semantic technologies used for controlled crowdsourcing facilitate creation of community standards to describe highly heterogeneous scientific data
• Organic growth: As scientists annotate their datasets, they propose new metadata properties
• Crowdsourcing: Scientists propose properties for reuse and vote on priorities
• Editorial oversight: Editors decide what properties will be in future versions
Results: A new standard for paleoclimate (PaCTS 1.0) with one (!!) single initial face-to-face meeting
https://commons.wikimedia.org/wiki/File:An_ice_core_segment.jpg
https://commons.wikimedia.org/wiki/File:Gravity-corer_hg.png
Controlled Crowdsourcing: Continuous Ontology Growth
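The organic-growth, crowdsourcing, and editorial-oversight steps of controlled crowdsourcing can be illustrated with a small sketch. This is not the Linked Earth implementation: the class `CrowdVocabulary`, its methods, and the example property names are hypothetical, chosen only to show how proposals, votes, and editorial promotion might fit together.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CrowdVocabulary:
    """Toy model of a core ontology plus a crowd vocabulary (hypothetical names)."""

    core: set[str]
    crowd: dict[str, int] = field(default_factory=dict)  # proposed property -> votes

    def suggest(self, typed: str) -> list[str]:
        """Offer existing terms that start with what the user is typing."""
        prefix = typed.strip().lower()
        terms = self.core | set(self.crowd)
        return sorted(t for t in terms if t.lower().startswith(prefix))

    def propose(self, name: str) -> str:
        """Organic growth: reuse a matching term, otherwise add a crowd property."""
        wanted = name.strip()
        for term in self.core | set(self.crowd):
            if term.lower() == wanted.lower():
                return term  # normalize to the term that already exists
        self.crowd[wanted] = 0
        return wanted

    def vote(self, name: str) -> None:
        """Crowdsourcing: scientists vote on the priority of a proposed property."""
        self.crowd[name] += 1

    def promote(self, approved: list[str]) -> None:
        """Editorial oversight: editors move approved properties into the core."""
        for name in approved:
            del self.crowd[name]
            self.core.add(name)


vocab = CrowdVocabulary(core={"archiveType", "variableName"})
vocab.propose("speciesName")      # a scientist adds a property while annotating
vocab.vote("speciesName")         # others vote on its priority
print(vocab.suggest("sp"))        # completions offered to the next annotator
vocab.promote(["speciesName"])    # editors include it in the next core version
```

Matching proposals case-insensitively against existing terms is one simple way to limit near-duplicate properties; a real system would likely use richer similarity measures.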
Figure 4: An overview of the core Linked Earth Ontology and its extensions.
We also took into account relevant standards and widely used models. We used several vocabularies³: Schema.org and Dublin Core Terms (DC) for representing the basic metadata of a dataset and its associated publications (e.g., title, description, authors, contributors, license, etc.), the wgs_84 and GeoSparql specifications for representing locations where samples are collected, the Semantic Sensor Network (SSN) to represent observation-related metadata, the FOAF vocabulary to represent basic information about contributors, and PROV-O to represent the derivation of models from raw datasets. Figure 4 shows an overview of the ontology, which is layered and has a modular structure. The existing standards just mentioned provide an upper ontology for basic terms. We used the LiPD format, mentioned in Section 2, to develop the LiPD ontology⁴ which contains the main terms useful to describe any paleoclimate dataset (e.g., data tables, variables, instrument used to measure them, calibration, uncertainty, etc.).

A set of extensions of LiPD cover more specific aspects of the domain. The Proxy Archive extension defines the types of medium in which measurements are taken, such as marine sediments or coral. The Proxy Observation extension describes the types of observations (e.g., tree ring width, trace metal ratio, etc.) that can be measured. The Proxy Sensor extension describes the types of biological or non-biological components that react to environmental conditions and reflect the climate at the time. The Instrument extension enumerates the instruments used for taking measurements, such as a mass spectrometer. The Inferred Variable extension describes the types of climate variables that can be inferred from measurements or from other inferred variables (e.g. temperature). The crowd vocabulary builds on these extensions. The core ontology and the crowd vocabulary share a common namespace for all the extensions (http://linked.earth/ontology/), in order to simplify querying as well as [...]

4.1 Annotation Framework: The Linked Earth Platform

The Linked Earth Platform [21] implements the Annotation Framework as an extension of the Organic Data Science framework [9], which is built on MediaWiki [15] and Semantic MediaWiki [13]. There are several reasons for this. First, a wiki provides a collaborative environment where multiple users can edit pages, and where the history of edits is automatically tracked. Second, MediaWiki is easily extensible, allowing us to easily create special types of pages, generate dynamic user input forms, and create many other extensions. Third, because MediaWiki is well maintained and has a strong community, there are numerous plug-ins available. Finally, the Semantic MediaWiki API makes it easy to export content and interoperate with other systems.

Each dataset is a page in the Linked Earth Platform. “Dataset” is a special class, or category in wiki parlance. When a user creates a new page for a dataset, all the properties that apply are shown in a table where the user can fill their values. For each variable they indicate if it is observed or inferred, its value, uncertainty, and how it was measured. The order of the variables as columns in the data file is also specified. Figure 2 shows the metadata annotation interface for a lake sediment dataset. The user has provided some of the values of the metadata properties, others have not been filled out yet. The core ontology properties are shown at the top, and the crowd vocabulary properties are shown near the bottom (under “Extra Information”). The user can also specify a new subcategory for this dataset, as shown at the bottom.

Figure 2: Overview of the metadata annotation interface, with core ontology terms marked with an “L”. The properties under “Extra Information” are part of the crowd vocabulary.

When annotating metadata, the system offers in a pull-down menu the possible completions of what the user is typing based on similar terms proposed by other users. This helps avoid proliferation of unnecessary terms and helps normalize the new terms created. If none represents what the user wants to specify, then a new property will be added. The property becomes part of the crowd vocabulary, and a new wiki page is created for it. The user, or perhaps others, can edit that page to add documentation. As a result, users build the crowd vocabulary while curating their own datasets.

Figure 3 highlights the main features of the Linked Earth Platform. The map-based visualizations show datasets already annotated with location metadata. Author pages show their contributions, which help track credit and create incentives. Other pages are devoted to foster community discussions and take polls. The annotation interface is designed to be intuitive, and provides detailed documentation with examples¹.

Figure 3: Overview of the main features of the Linked Earth Platform.

4.2 Initial Core Ontology

To ensure that most changes would be crowd extensions that would not cause major redesigns of the core ontology, the initial core ontology was carefully designed. First, the ontology was developed using a traditional methodology for ontology engineering [23]. We started by collecting terms to be included by the ontology in collaboration with a select group of domain scientists. These terms were extracted from examples provided by the community², and from previous workshops where the community had discussed dataset annotation [4]. The ontology development process was also informed by previous efforts to represent basic paleoclimate metadata [14], and by prior community proposals to unify terminology in the paleo-climate domain [5].

¹ http://wiki.linked.earth/Best_Practices
² https://github.com/LinkedEarth/Ontology/tree/master/Example
³ http://schema.org/, http://dublincore.org/documents/dcmi-terms/, https://www.w3.org/2003/01/geo/wgs84_pos, http://schemas.opengis.net/geosparql/1.0/geosparql_vocab_all.rdf, https://www.w3.org/2005/Incubator/ssn/ssnx/ssn#, http://xmlns.com/foaf/spec/, http://www.w3.org/TR/prov-o/
⁴ http://wiki.linked.earth/Linked_Paleo_Data

Living with a Live Ontology
Challenge: Continuous revisions of ontologies + annotated data
• Maintain core + crowd ontology
• Editors decide when to update
• Automated upgrades when monotonic changes
• Semi-automated upgrades when non-monotonic changes
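The upgrade policy in the bullets above can be sketched as a simple classification step. This is a minimal illustration under stated assumptions, not the authors' system: the `Change` record, the `ChangeKind` values, and `plan_upgrade` are hypothetical names, and the rule that a rename or domain/range change is non-monotonic only when the term is already used in annotations is an assumption made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    ADD_TERM = "add_term"          # a new term proposed for the ontology
    RENAME_TERM = "rename_term"    # an existing term gets a new name
    CHANGE_RANGE = "change_range"  # an existing property's domain/range changes


@dataclass(frozen=True)
class Change:
    kind: ChangeKind
    term: str


def is_monotonic(change: Change, used_terms: set[str]) -> bool:
    """Return True if the change cannot affect existing metadata descriptions.

    Assumption: adding a term is always monotonic; modifying a term is
    non-monotonic only if that term already annotates some dataset.
    """
    if change.kind is ChangeKind.ADD_TERM:
        return True
    return change.term not in used_terms


def plan_upgrade(changes: list[Change], used_terms: set[str]) -> dict[str, list[Change]]:
    """Split changes into automated and semi-automated (editor-reviewed) upgrades."""
    plan: dict[str, list[Change]] = {"automated": [], "semi_automated": []}
    for change in changes:
        key = "automated" if is_monotonic(change, used_terms) else "semi_automated"
        plan[key].append(change)
    return plan


changes = [
    Change(ChangeKind.ADD_TERM, "speciesName"),
    Change(ChangeKind.RENAME_TERM, "archiveType"),
]
plan = plan_upgrade(changes, used_terms={"archiveType"})
for bucket, items in plan.items():
    print(bucket, [c.term for c in items])
```

Keeping the classification separate from the upgrade itself means editors can inspect the semi-automated bucket before any existing annotations are rewritten.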
Figure 1: Overview of our approach for controlled crowdsourcing.
Figure 1 highlights the main aspects of the proposed controlled crowdsourcing process. There are three major components of the process: 1) annotation and vocabulary crowdsourcing, shown at the top; 2) editorial revision for ontology extensions, shown at the bottom; and 3) updates to the metadata repository, shown on the right.

An Annotation Framework supports the annotation and crowdsourcing process, shown in the top left of Figure 1. The framework is initialized with a core ontology (A). The core ontology represents a standard that the community has agreed to use. Users, who form the crowd (B), interact with a metadata annotation system (C) to select terms from the core ontology for annotating datasets (D). For example, a term such as “archive type” from the core ontology could be used to express that a dataset contains coral. If a term they want to specify is missing, users may propose extensions by simply adding the term, which becomes a property in the crowd vocabulary (E). That new term is immediately available to other users when they are annotating their datasets. Users may also have requests and issues (F), such as requests for changes to the core ontology, comments to discuss new proposed terms, and other issues. There are two main types of changes to the core ontology that users may request:
a) Monotonic changes: these are proposed new terms for the core ontology that do not affect any prior metadata descriptions made by others. For example, a user may extend the ontology for coral and add the property “species name”.
b) Non-monotonic changes: these concern existing terms in the core ontology or the crowd vocabulary that have already been used to describe datasets. For example, a request to rename an existing property or change its domain/range.

A Revision Framework supports the ontology revision process, shown at the bottom of Figure 1. A select group of users form an editorial board (G) that is continu-
The Paleoclimate Community reporTing Standard (PaCTS)
[Figure labels: All cores; Ice cores; Chronologies; Trees; Marine sediments; Speleothems]
Workflows
Semantic Workflows [Gil et al JETAI 2011; Gil et al IEEE IS 2011; Kim et al JETAI 2010; Gil IEMS 2014]
•