<<

Logic: overview

Question

If X1 + X2 = 10 and X1 − X2 = 4, what is X1?
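The question can be answered in two ways: search over candidate assignments and check the constraints, or manipulate the equations directly. The following is a minimal Python sketch of both, not part of the original slides; the function names and the candidate range 0..10 are assumptions made for illustration.

```python
from itertools import product

def solve_by_model_checking(candidates=range(0, 11)):
    """Search over assignments (X1, X2) and keep those that satisfy both constraints."""
    return [(x1, x2) for x1, x2 in product(candidates, repeat=2)
            if x1 + x2 == 10 and x1 - x2 == 4]

def solve_by_inference():
    """Add the two equations to get 2 * X1 = 10 + 4, then divide both sides by 2."""
    x1 = (10 + 4) / 2
    x2 = 10 - x1
    return x1, x2

print(solve_by_model_checking())  # [(7, 3)]
print(solve_by_inference())       # (7.0, 3.0)
```

The search version checks 121 candidate pairs, while the algebraic version takes two arithmetic steps and does not depend on guessing a candidate range.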

CS221 2
• Think about how you solved this problem. You could treat it as a CSP with variables X1 and X2, and search through the set of candidate solutions, checking the constraints.
• However, more likely, you just added the two equations and divided both sides by 2 to find that X1 = 7. This is the power of logical inference, where we apply a set of truth-preserving rules to arrive at the answer. This is in contrast to what is called model checking (for reasons that will become clear), which tries to directly find assignments.
• We’ll see that logical inference allows you to perform very powerful manipulations in a very compact way. This allows us to vastly increase the representational power of our models.

Course plan

Reflex
States: Search problems, Markov decision processes, Adversarial games
Variables: Constraint satisfaction problems, Markov networks, Bayesian networks
“Low-level intelligence” to “High-level intelligence”
Machine learning

• We are at the last stage of our journey through the AI topics of this course: logic. Before launching in, let’s take a moment to reflect.

Taking a step back

[Diagram: data → Learning → model; question → Inference (using the model) → answer]

Examples: search problems, MDPs, games, CSPs, Bayesian networks

• For each topic (e.g., MDPs) that we’ve studied, we followed the modeling-inference-learning paradigm: we take some data and feed it into a learning algorithm to produce a model with tuned parameters. Then we take this model and use it to perform inference (turning questions into answers).
• For search problems, the question is “what is the minimum cost path?” Inference algorithms such as DFS, UCS, or A* produce the minimum cost path. Learning algorithms such as the structured Perceptron fill in the action costs based on data (minimum cost paths).
• For MDPs and games, the question is “what is the maximum value policy?” Inference algorithms such as value iteration or minimax produce this. Learning algorithms such as Q-learning or TD learning let us proceed when we don’t know the transitions and rewards.
• For CSPs, the question is “what is the maximum weight assignment?” Inference algorithms such as backtracking search, beam search, or variable elimination find such an assignment. We did not discuss learning algorithms here, but something similar to the structured Perceptron works.
• For Bayesian networks, the question is “what is the probability of a query given evidence?” Inference algorithms such as Gibbs sampling and particle filtering compute these probabilistic inference queries. For learning, if we don’t know the local conditional distributions, we can learn them using maximum likelihood.
• We can think of learning as induction, where we need to generalize, and inference as deduction, where it’s about computing the best predicted answer under the model.

Modeling paradigms

State-based models: search problems, MDPs, games
Applications: route finding, game playing, etc.
Think in terms of states, actions, and costs

Variable-based models: CSPs, Bayesian networks
Applications: scheduling, tracking, medical diagnosis, etc.
Think in terms of variables and factors

Logic-based models: propositional logic, first-order logic
Applications: theorem proving, verification, reasoning
Think in terms of logical formulas and inference rules

• Each topic corresponded to a modeling paradigm. The way the modeling paradigm is set up influences the way we approach a problem.
• In state-based models, we thought about inference as finding minimum cost paths in a graph. This leads us to think in terms of states, actions, and costs.
• In variable-based models, we thought about inference as finding maximum weight assignments or computing conditional probabilities. There we thought about variables and factors.
• Now, we will talk about logic-based models, where inference is applying a set of rules. For these models, we will think in terms of logical formulas and inference rules.

A historical note

• Logic was the dominant paradigm in AI before the 1990s

• Problem 1: deterministic, didn’t handle uncertainty (probability addresses this)

• Problem 2: rule-based, didn’t allow fine tuning from data (machine learning addresses this)

• Strength: provides expressiveness in a compact way

• Historically, in AI, logic was the dominant paradigm before the 1990s, but this tradition fell out of favor with the rise of probability and machine learning.
• There were two reasons for this: First, logic as an inference mechanism was brittle and did not handle uncertainty, whereas probability offered a coherent framework for dealing with uncertainty.
• Second, people built rule-based systems which were tedious and did not scale up, whereas machine learning automated much of the fine-tuning of a system by using data.
• However, there is one strength of logic which has not quite yet been recouped by existing probability and machine learning methods, and that is the expressivity of the model.

Motivation: smart personal assistant

• How can we motivate logic-based models? We will take a little bit of a detour and think about an AI grand challenge: building smart personal assistants.
• Today, we have systems like Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, and Google Assistant.

Motivation: smart personal assistant

Tell information
Ask questions

Use natural language!

Need to:
• Digest heterogeneous information
• Reason deeply with that information

• We would like to have more intelligent assistants such as Data from Star Trek. What is the functionality that’s missing in between?
• At an abstract level, one fundamental thing a good personal assistant should be able to do is to take in information from people and be able to answer questions that require drawing inferences from the facts.
• In some sense, telling the system information is like machine learning, but it feels like a very different form of learning than seeing 10M images and their labels or 10M sentences and their translations. The type of information we get here is more heterogeneous and more abstract, and the expectation is that we process it more deeply (we don’t want to have to tell our personal assistant 100 times that we prefer morning meetings).
• And how do we interact with our personal assistants? Let’s use natural language, the very tool that was built for communication!

Natural language

Example:

• A dime is better than a nickel.
• A nickel is better than a penny.
• Therefore, a dime is better than a penny.

Example:

• A penny is better than nothing.
• Nothing is better than world peace.
• Therefore, a penny is better than world peace???

Natural language is slippery...

• But natural language is tricky, because it is replete with ambiguities and vagueness. And drawing inferences using natural languages can be quite slippery. Of course, some concepts are genuinely vague and slippery, and natural language is as good as it gets, but that still leaves open the question of how a computer would handle those cases.

Language

Language is a mechanism for expression.

Natural languages (informal):
English: Two divides even numbers.
German: Zwei dividieren geraden zahlen.

Programming languages (formal):
Python: def even(x): return x % 2 == 0
C++: bool even(int x) { return x % 2 == 0; }

Logical languages (formal):
First-order logic: ∀x.Even(x) → Divides(x, 2)

• Let’s think about language a bit more deeply. What does it really buy you? Primarily, language is this wonderful human creation that allows us to express and communicate complex ideas and thoughts.
• We have mostly been talking about natural languages such as English and German. But as you all know, there are programming languages as well, which allow one to express computation formally so that a computer can understand it.
• This lecture is mostly about logical languages such as propositional logic and first-order logic. These are formal languages, but they are better suited to capturing declarative knowledge than concrete procedures, and are better connected with natural language.

Two goals of a logic language

• Represent knowledge about the world

• Reason with that knowledge

• Some of you already know about logic, but it’s important to keep the AI goal in mind: We want to use it to represent knowledge, and we want to be able to reason (or do inference) with that knowledge.
• Finally, we need to keep in mind that our goal is to get computers to use logic automatically, not for you to do it. This means that we need to think very mechanistically.

Ingredients of a logic

Syntax: defines a set of valid formulas (Formulas)
Example: Rain ∧ Wet

Semantics: for each formula, specify a set of models (assignments / configurations of the world)
Example: [figure: 2×2 grid of assignments to Rain ∈ {0, 1} and Wet ∈ {0, 1}, with the models of the formula marked]

Inference rules: given f, what new formulas g can be added that are guaranteed to follow ($\frac{f}{g}$)?
Example: from Rain ∧ Wet, derive Rain
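The three ingredients can be sketched in a few lines of Python. This is only an illustrative sketch: representing a formula as a Python predicate over an assignment, and the helper names `models` and `follows`, are assumptions and not notation from the slides.

```python
from itertools import product

SYMBOLS = ["Rain", "Wet"]  # propositional symbols from the slide's example

def models(formula):
    """Return every assignment over SYMBOLS in which `formula` is true."""
    result = []
    for values in product([0, 1], repeat=len(SYMBOLS)):
        w = dict(zip(SYMBOLS, values))
        if formula(w):
            result.append(w)
    return result

# Formulas are represented as predicates over an assignment w (an assumption).
def rain_and_wet(w):
    return bool(w["Rain"] and w["Wet"])

def rain(w):
    return bool(w["Rain"])

def follows(f, g):
    """g follows from f if every model of f is also a model of g."""
    return all(g(w) for w in models(f))

print(models(rain_and_wet))         # [{'Rain': 1, 'Wet': 1}]
print(follows(rain_and_wet, rain))  # True
print(follows(rain, rain_and_wet))  # False
```

Here `follows` enumerates models, which is the semantic check; an inference rule such as "from Rain ∧ Wet, derive Rain" reaches the same conclusion by operating on the syntax alone.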

• The syntax defines a set of valid formulas, which are things which are grammatical to say in the language.
• Semantics usually doesn’t receive much attention if you have a casual exposure to logic, but this is really the important piece that makes logic rigorous. Formally, semantics specifies the meaning of a formula, which in our setting is a set of configurations of the world in which the formula holds. This is what we care about in the end.
• But in order to get there, it’s helpful to operate directly on the syntax using a set of inference rules. For example, if I tell you that it’s raining and wet, then you should be able to conclude that it is also raining (obviously) without even explicitly mentioning semantics. Most of the time when people do logic casually, they are really just applying inference rules.

Syntax versus semantics

Syntax: what are valid expressions in the language?

Semantics: what do these expressions mean?

Different syntax, same semantics (5):

2 + 3 ⇔ 3 + 2

Same syntax, different semantics (1 versus 1.5):

3 / 2 (Python 2.7) ⇎ 3 / 2 (Python 3)
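Both cases can be checked directly. The snippet below is a small sketch that runs under Python 3; since Python 2.7 is not available in the same interpreter, it uses the `//` operator to reproduce what `3 / 2` meant for integer operands there.

```python
# Different syntax, same semantics: both expressions evaluate to 5.
print(2 + 3 == 3 + 2)     # True

# Same syntax, different semantics, depending on the language version.
true_division = 3 / 2     # Python 3 meaning of the expression 3 / 2
floor_division = 3 // 2   # Python 2.7 meaning of 3 / 2 for int operands

print(true_division)      # 1.5
print(floor_division)     # 1
```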

• Just to hammer in the point that syntax and semantics are different, consider two examples from programming languages.
• First, the formulas 2 + 3 and 3 + 2 are superficially different (a syntactic notion), but they have the same semantics (5).
• Second, the formula 3 / 2 means something different depending on the language. In Python 2.7, the semantics is 1 (integer division), and in Python 3 the semantics is 1.5 (floating point division).

Propositional logic

[Diagram: Syntax (formula), Semantics (models), Inference rules]

Logics

• Propositional logic
• Propositional logic with only Horn clauses
• Modal logic
• First-order logic
• First-order logic with only Horn clauses
• Second-order logic
• ...

Key idea: tradeoff

Balance expressivity and computational efficiency.

• There are many different logical languages, just like there are programming languages. Whereas most programming languages have the same expressive power (all Turing complete), logical languages exhibit a larger spectrum of expressivity.
• The bolded items are the ones we will discuss in this class.

Roadmap

Modeling:
• Propositional Logic Syntax
• Propositional Logic Semantics
• First-order Logic

Inference:
• Inference Rules
• Propositional modus ponens
• Propositional resolution
• First-order modus ponens
• First-order resolution

• Here are the rest of the modules under the logic unit.
• We will begin with the core elements of a logic: syntax, semantics, and inference rules. We will define syntax and semantics first for propositional logic.
• We will then discuss inference rules and their soundness and correctness.
• We then describe a more expressive logic, i.e., first-order logic, and go over its syntax and semantics.

Conclusion

Roadmap

Summary of CS221

Next courses

History of AI

Food for thought


Paradigm

Modeling

Inference Learning


Course plan

Reflex
States: Search problems, Markov decision processes, Adversarial games
Variables: Constraint satisfaction problems, Markov networks, Bayesian networks
Logic
“Low-level intelligence” to “High-level intelligence”
Machine learning


Machine learning

Objective: loss minimization

$$\min_{w} \sum_{(x,y) \in D_{\text{train}}} \text{Loss}(x, y, w)$$

Algorithm: stochastic gradient descent

$$w \leftarrow w - \eta_t \underbrace{\nabla \text{Loss}(x, y, w)}_{\text{prediction} - \text{target}}$$

Applies to wide range of models!
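As a concrete instance of the update above, here is a minimal sketch of stochastic gradient descent in Python. The squared loss, the scalar linear predictor, the decaying step-size schedule, and the toy data are all assumptions chosen for illustration; the slides do not fix any of them.

```python
import random

def sgd(train, num_epochs=200, seed=0):
    """Minimize the sum of squared losses (w * x - y)^2 over a scalar weight w."""
    rng = random.Random(seed)
    w = 0.0
    t = 0
    for _ in range(num_epochs):
        examples = list(train)
        rng.shuffle(examples)                # visit examples in random order
        for x, y in examples:
            t += 1
            eta = 0.1 / (1 + 0.01 * t)       # decaying step size eta_t (assumed schedule)
            residual = w * x - y             # prediction - target
            gradient = 2 * residual * x      # derivative of (w * x - y)^2 with respect to w
            w = w - eta * gradient           # the SGD update
    return w

# Hypothetical toy data generated by y = 3x, so the minimizer is w = 3.
train = [(1.0, 3.0), (2.0, 6.0), (0.5, 1.5)]
w = sgd(train)
print(round(w, 3))  # close to 3.0
```

The gradient is a multiple of the residual (prediction minus target), which is the term labeled under the brace in the update rule.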


Reflex-based models

Models: linear models, neural networks, nearest neighbors

Inference: feedforward

Learning: SGD, alternating minimization


State-based models

[Figure: example graph with nodes S, B, C, D, E, F, G]

Key idea: state

A state is a summary of all the past actions sufficient to choose future actions optimally.

Models: search problems, MDPs, games
Inference: UCS/A*, DP, value iteration, minimax
Learning: structured Perceptron, Q-learning, TD learning

Variable-based models

[Figure: graph over the variables WA, NT, SA, Q, NSW, V, T (the Australia map-coloring example)]

Key idea: factor graphs

Graph structure captures conditional independence.

Models: CSPs, Markov networks, Bayesian networks

Inference: backtracking, forward-backward, beam search, Gibbs sampling

Learning: maximum likelihood (closed form, EM)

Logic-based models

[Diagram: Syntax (formula), Semantics (models), Inference rules]

Key idea: logic

Formulas enable more powerful models (infinite).

Models: propositional logic, first-order logic
Inference: model checking, modus ponens, resolution
Learning: ???

Tools

• CS221 provides a set of tools

• Start with the problem, and figure out what tool to use

• Keep it simple!



Other AI-related courses

http://ai.stanford.edu/courses/

Foundations:
• CS228: Probabilistic Graphical Models
• CS229: Machine Learning
• CS229T: Statistical Learning Theory
• CS230: Deep Learning
• CS238: Decision Making Under Uncertainty
• CS246: Mining Massive Data Sets
• CS329D: Machine Learning Under Distributional Shifts
• CS330: Deep Multi-Task and Meta Learning

Other AI-related courses

http://ai.stanford.edu/courses/

Applications:
• CS224N: Natural Language Processing (with Deep Learning)
• CS224U: Natural Language Understanding
• CS231A: From 3D Reconstruction to Recognition
• CS231N: Convolutional Neural Networks for Visual Recognition
• CS223A: Introduction to Robotics
• CS237A-B: Robot Autonomy
• CS227B: General Game Playing
• CS348I: Computer Graphics in the Era of AI

Probabilistic graphical models (CS228)

[Figure: graphical model over variables h and x]

• Forward-backward, variable elimination ⇒ belief propagation, variational inference

• Gibbs sampling ⇒ Markov Chain Monte Carlo (MCMC)

• Learning the structure


Machine learning (CS229)

• Discrete ⇒ continuous

• Linear models ⇒ kernel methods, decision trees

• Boosting, bagging, feature selection

• K-means ⇒ mixture of Gaussians, PCA, ICA


Statistical learning theory (CS229T)

Question: what are the mathematical principles behind learning?

Uniform convergence: with probability at least 0.95, your algorithm will return a predictor h ∈ H such that

$$\text{TestError}(h) \le \text{TrainError}(h) + \sqrt{\frac{\text{Complexity}(H)}{n}}$$

[figure credit: Fei-Fei Li] Vision (CS231A, CS231N)

• Challenges: variation in viewpoint, illumination, intra-class variation
• Tasks: object recognition/detection/segmentation, pose estimation, 3D reconstruction, image captioning, visual question answering, activity recognition

[figure credit: Karen Liu] Graphics (CS348I)

• Goal: content creation for images, shapes, and animations
• Tasks: differentiable rendering, neural rendering, BRDF estimation, texture synthesis, procedural modeling, sketch simplification, character animation, physics simulation, facial animation

Robotics (CS223A, CS225A)

• Tasks: manipulation, grasping, navigation

• Applications: self-driving cars, medical robotics

• Physical models: kinematics, control


Robotics (CS237A, CS237B)

• Tasks: interaction, robot learning, autonomy

• Applications: mobile manipulation


Language (CS224N, CS224U)

• Designed by humans for communication

• World: continuous, words: discrete, meanings: continuous

• Properties: compositionality, grounding

• Tasks: syntactic parsing, semantic parsing, information extraction, coreference resolution, machine translation, question answering, summarization, dialogue


Cognitive science

Question: How does the human mind work?

Cognitive science and AI grew up together

• Humans can learn from few examples on many tasks

Computation and cognitive science (PSYCH204, CS428):
• Cognition as Bayesian modeling: probabilistic program [Tenenbaum, Goodman, Griffiths]

Neuroscience

• Neuroscience: hardware; cognitive science: software
• Artificial neural networks as computational models of the brain
• Modern neural networks (GPUs + backpropagation) not biologically plausible
• Analogy: birds versus airplanes; what are principles of intelligence?


Birth of AI

1956: Workshop at Dartmouth College; attendees: John McCarthy, Marvin Minsky, Claude Shannon, etc.

Aim for general principles: Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.


Birth of AI, early successes

Checkers (1952): Samuel’s program learned weights and played at strong amateur level

Problem solving (1955): Newell & Simon’s Logic Theorist: prove theorems in Principia Mathematica using search + heuristics; later, General Problem Solver (GPS)

Overwhelming optimism...

Machines will be capable, within twenty years, of doing any work a man can do. —Herbert Simon

Within 10 years the problems of artificial intelligence will be substantially solved. —Marvin Minsky

I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines. —Claude Shannon


...underwhelming results

Example: machine translation

The spirit is willing but the flesh is weak.

(Russian)

The vodka is good but the meat is rotten.

1966: ALPAC report cut off government funding for MT


AI is overhyped...

We tend to overestimate the effect of a technology in a short run and underestimate the effect in a long run. —Roy Amara (1925-2007)


Implications of early era

Problems:
• Limited computation: search space grew exponentially, outpacing hardware ($100! \approx 10^{157} > 10^{80}$)
• Limited information: complexity of AI problems (number of words, objects, concepts in the world)
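The magnitude of 100! is easy to verify with exact integer arithmetic. The following is a small sketch; reading 100! as the number of orderings of 100 items, and reading the 10^80 figure as the commonly cited rough count of atoms in the observable universe, are interpretations assumed here and not stated on the slide.

```python
import math

search_space = math.factorial(100)    # number of orderings of 100 items
digits = len(str(search_space))       # 158 digits, so 100! is about 9.3 * 10**157
reference = 10 ** 80                  # the 10^80 figure quoted on the slide

print(digits)                         # 158
print(search_space > reference)       # True
```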

Contributions:
• Lisp, garbage collection, time-sharing (John McCarthy)
• Key paradigm: separate modeling (declarative) and inference (procedural)

Knowledge-based systems (70-80s)

Expert systems: elicit specific domain knowledge from experts in form of rules:

if [premises] then [conclusion]


Knowledge-based systems (70-80s)

DENDRAL: infer molecular structure from mass spectrometry

MYCIN: diagnose blood infections, recommend antibiotics

XCON: convert customer orders into parts specification; saved DEC $40 million a year by 1986

Knowledge-based systems

Contributions:
• First real application that impacted industry
• Knowledge helped curb the exponential growth

Problems:
• Knowledge is not deterministic rules, need to model uncertainty
• Requires considerable manual effort to create rules, hard to maintain

SHRDLU [Winograd 1971]

Person: Pick up a big red block.
Computer: OK.
Person: Grasp the pyramid.
Computer: I don’t understand which pyramid you mean.
Person (changing their mind): Find a block which is taller than the one you are holding and put it into the box.
Computer: By “it”, I assume you mean the block which is taller than the one I am holding.
Computer: OK.
Person: What does the box contain?
Computer: The blue pyramid and the blue block.
Person: What is the pyramid supported by?
Computer: The box.

The Complexity Barrier

A number of people have suggested to me that large programs like the SHRDLU program for understanding natural language represent a kind of dead end in AI programming. Complex interactions between its components give the program much of its power, but at the same time they present a formidable obstacle to understanding and extending it. In order to grasp any part, it is necessary to understand how it fits with other parts, presents a dense mass, with no easy footholds. Even having written the program, I find it near the limit of what I can keep in mind at once.

— Terry Winograd (1972)


Modern AI (90s-present)

• Probability: Pearl (1988) promoted Bayesian networks in AI to model uncertainty (based on Bayes rule from 1700s)

[Diagram: model → predictions]

• Machine learning: Vapnik (1995) invented support vector machines to tune parameters (based on statistical models in early 1900s)

[Diagram: data → model]

A melting pot

• Bayes rule (Bayes, 1763) from probability
• Least squares regression (Gauss, 1795) from astronomy
• First-order logic (Frege, 1893) from logic
• Maximum likelihood (Fisher, 1922) from statistics
• Artificial neural networks (McCulloch/Pitts, 1943) from neuroscience
• Minimax games (von Neumann, 1944) from economics
• Stochastic gradient descent (Robbins/Monro, 1951) from optimization
• Uniform cost search (Dijkstra, 1956) from algorithms
• Value iteration (Bellman, 1957) from control theory


Outlook

AI is everywhere: consumer services, advertising, transportation, manufacturing, etc.

AI being used to make decisions for: education, credit, employment, advertising, healthcare and policing


[Prates+ 2018] Biases


[figure credit: EFF] Image classification


[Szegedy et al., 2013; Goodfellow et al., 2014] Adversaries

AlexNet predicts correctly on the left

AlexNet predicts ostrich on the right


Security

[Evtimov+ 2017]

[Sharif+ 2016]


Optimizing for clicks

Is this a good objective function for society?


How to model human objectives?

Is this a good objective function for the human?


How to model human objectives?

Be aware of the mismatch between human preferences and what the robot thinks are the human preferences.


Generating fake content

Can build it ≠ should build it?

Figure from Moritz Hardt Fairness

• Most ML training objectives will produce a model that is accurate for the majority class, at the expense of the minority one.

Figure from Moritz Hardt Fairness

Two classifiers with 5% error:


Are algorithms neutral?

[Diagram: training data Dtrain → Learner → predictor f, which maps input x to output y]

By design: picks up patterns in training data, including biases

Privacy

• Not reveal sensitive information (income, health, communication)

• Compute average statistics (how many people have cancer?)

yes no yes no no no no yes no yes


[Warner, 1965] Privacy: Randomized response

Do you have a sibling?

Method:
• Flip two coins.
• If both heads: answer yes/no randomly
• Otherwise: answer yes/no truthfully

Analysis:

$$\text{true-prob} = \frac{4}{3} \times \left(\text{observed-prob} - \frac{1}{8}\right)$$
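The analysis follows from observed-prob = 3/4 × true-prob + 1/4 × 1/2, since a respondent answers truthfully with probability 3/4 and randomly with probability 1/4. A minimal simulation sketch in Python is given below; the population size, the true rate of 0.3, and the function names are hypothetical choices made only for this illustration.

```python
import random

def randomized_response(truth, rng):
    """Flip two coins; if both are heads answer randomly, otherwise truthfully."""
    coin1 = rng.random() < 0.5
    coin2 = rng.random() < 0.5
    if coin1 and coin2:
        return rng.random() < 0.5
    return truth

def estimate_true_prob(responses):
    """Invert observed = 3/4 * true + 1/8 to recover the true proportion."""
    observed = sum(responses) / len(responses)
    return 4 / 3 * (observed - 1 / 8)

rng = random.Random(0)
true_prob = 0.3  # hypothetical population rate, used only for this simulation
population = [rng.random() < true_prob for _ in range(200_000)]
responses = [randomized_response(t, rng) for t in population]
estimate = estimate_true_prob(responses)
print(round(estimate, 2))
```

No individual response reveals the respondent's true answer with certainty, yet the aggregate estimate lands close to the true rate.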


Causality

Goal: figure out the effect of a treatment on survival

Data:

For untreated patients, 80% survive For treated patients, 30% survive

Does the treatment help?

Sick people are more likely to undergo treatment...


[Mykel Kochenderfer] Interpretability versus accuracy

• For air-traffic control, threshold level of safety: probability $10^{-9}$ for a catastrophic failure (e.g., collision) per flight hour

• Move from human designed rules to a numeric Q-value table?


Societal and industrial impact

Enormous potential for positive impact, use responsibly!


Reflex
States: Search problems, Markov decision processes, Adversarial games
Variables: Constraint satisfaction problems, Markov networks, Bayesian networks
Logic
“Low-level intelligence” to “High-level intelligence”
Machine learning

Please fill out course evaluations on Axess.

Thanks for an exciting quarter!
