Logic: overview
answer in chat Question
If X1 + X2 = 10 and X1 − X2 = 4, what is X1?
CS221 2 • Think about how you solved this problem. You could treat it as a CSP with variables X1 and X2, and search through the set of candidate solutions, checking the constraints. • However, more likely, you just added the two equations, divided both sides by 2 to easily find out that X1 = 7. This is the power of logical inference, where we apply a set of truth-preserving rules to arrive at the answer. This is in contrast to what is called model checking (for reasons that will become clear), which tries to directly find assignments. • We’ll see that logical inference allows you to perform very powerful manipulations in a very compact way. This allows us to vastly increase the representational power of our models. Course plan
Search problems Constraint satisfaction problems Markov decision processes Markov networks Adversarial games Bayesian networks Reflex States Variables Logic
”Low-level intelligence” ”High-level intelligence” Machine learning
CS221 4 • We are at the last stage of our journey through the AI topics of this course: logic. Before launching in, let’s take a moment to reflect. Taking a step back
question
data Learning model Inference
answer
Examples: search problems, MDPs, games, CSPs, Bayesian networks
CS221 6 • For each topic (e.g., MDPs) that we’ve studied, we followed the modeling-inference-learning paradigm: We take some data, feed it into a learning algorithm to produce a model with tuned parameters. Then we take this model and use it to perform inference (turning questions into answers). • For search problems, the question is ”what is the minimum cost path?” Inference algorithms such as DFS, UCS or A* produced the minimum cost path. Learning algorithms such as the structured Perceptron filled in the action costs based on data (minimum cost paths). • For MDPs and games, the question is ”what is the maximum value policy?” Inference algorithms such as value iteration or minimax produced this. Learning algorithms such as Q-learning or TD learning allow you to work when we don’t know the transitions and rewards. • For CSPs, the question is ”what is the maximum weight assignment?” Inference algorithms such as backtracking search, beam search, or variable elimination find such an assignment. We did not discuss learning algorithms here, but something similar to the structured Perceptron works. • For Bayesian networks, the question is ”what is the probability of a query given evidence?” Inference algorithms such as Gibbs sampling and particle filtering compute these probabilistic inference queries. Learning: if we don’t know the local conditional distributions, we can learn them using maximum likelihood. • We can think of learning as induction, where we need to generalize, and inference as deduction, where it’s about computing the best predicted answer under the model. Modeling paradigms
State-based models: search problems, MDPs, games Applications: route finding, game playing, etc. Think in terms of states, actions, and costs Variable-based models: CSPs, Bayesian networks Applications: scheduling, tracking, medical diagnosis, etc. Think in terms of variables and factors Logic-based models: propositional logic, first-order logic Applications: theorem proving, verification, reasoning Think in terms of logical formulas and inference rules
CS221 8 • Each topic corresponded to a modeling paradigm. The way the modeling paradigm is set up influences the way we approach a problem. • In state-based models, we thought about inference as finding minimum cost paths in a graph. This leads us to think in terms of states, actions, and costs. • In variable-based models, we thought about inference as finding maximum weight assignments or computing conditional probabilities. There we thought about variables and factors. • Now, we will talk about logic-based models, where inference is applying a set of rules. For these models, we will think in terms of logical formulas and inference rules. A historical note
• Logic was dominant paradigm in AI before 1990s
• Problem 1: deterministic, didn’t handle uncertainty (probability addresses this) • Problem 2: rule-based, didn’t allow fine tuning from data (machine learning addresses this)
• Strength: provides expressiveness in a compact way
CS221 10 • Historically, in AI, logic was the dominant paradigm before the 1990s, but this tradition fell out of favor with the rise of probability and machine learning. • There were two reasons for this: First, logic as an inference mechanism was brittle and did not handle uncertainty, whereas probability offered a coherent framework for dealing with uncertainty. • Second, people built rule-based systems which were tedious and did not scale up, whereas machine learning automated much of the fine-tuning of a system by using data. • However, there is one strength of logic which has not quite yet been recouped by existing probability and machine learning methods, and that is the expressivity of the model. Motivation: smart personal assistant
CS221 12 • How can we motivate logic-based models? We will take a little bit of a detour and think about an AI grand challenge: building smart personal assistants. • Today, we have systems like Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, and Google Assistant. Motivation: smart personal assistant
Tell information Ask questions
Use natural language!
Need to: • Digest heterogenous information • Reason deeply with that information
CS221 14 • We would like to have more intelligent assistants such as Data from Star Trek. What is the functionality that’s missing in between? • At an abstract level, one fundamental thing a good personal assistant should be able to do is to take in information from people and be able to answer questions that require drawing inferences from the facts. • In some sense, telling the system information is like machine learning, but it feels like a very different form of learning than seeing 10M images and their labels or 10M sentences and their translations. The type of information we get here is both more heterogenous, more abstract, and the expectation is that we process it more deeply (we don’t want to have to tell our personal assistant 100 times that we prefer morning meetings). • And how do we interact with our personal assistants? Let’s use natural language, the very tool that was built for communication! Natural language
Example:
• A dime is better than a nickel. • A nickel is better than a penny. • Therefore, a dime is better than a penny. Example: • A penny is better than nothing. • Nothing is better than world peace. • Therefore, a penny is better than world peace??? Natural language is slippery...
CS221 16 • But natural language is tricky, because it is replete with ambiguities and vagueness. And drawing inferences using natural languages can be quite slippery. Of course, some concepts are genuinely vague and slippery, and natural language is as good as it gets, but that still leaves open the question of how a computer would handle those cases. Language
Language is a mechanism for expression. Natural languages (informal): English: Two divides even numbers. German: Zwei dividieren geraden zahlen. Programming languages (formal):
Python: def even(x): return x % 2 == 0 C++: bool even(int x) { return x % 2 == 0; } Logical languages (formal): First-order-logic: ∀x.Even(x) → Divides(x, 2)
CS221 18 • Let’s think about language a bit deeply. What does it really buy you? Primarily, language is this wonderful human creation that allows us to express and communicate complex ideas and thoughts. • We have mostly been talking about natural languages such as English and German. But as you all know, there are programming languages as well, which allow one to express computation formally so that a computer can understand it. • This lecture is mostly about logical languages such as propositional logic and first-order logic. These are formal languages, but are a more suitable way of capturing declarative knowledge rather than concrete procedures, and are better connected with natural language. Two goals of a logic language
• Represent knowledge about the world
• Reason with that knowledge
CS221 20 • Some of you already know about logic, but it’s important to keep the AI goal in mind: We want to use it to represent knowledge, and we want to be able to reason (or do inference) with that knowledge. • Finally, we need to keep in mind that our goal is to get computers to use logic automatically, not for you to do it. This means that we need to think very mechanistically. Ingredients of a logic
Syntax: defines a set of valid formulas (Formulas) Example: Rain ∧ Wet Semantics: for each formula, specify a set of models (assignments / configurations of the world)
Wet 0 1 Example: 0
Rain 1 Inference rules: given f, what new formulas g can be added that are guaranteed to follow f ( g )? Example: from Rain ∧ Wet, derive Rain
CS221 22 • The syntax defines a set of valid formulas, which are things which are grammatical to say in the language. • Semantics usually doesn’t receive much attention if you have a casual exposure to logic, but this is really the important piece that makes logic rigorous. Formally, semantics specifies the meaning of a formula, which in our setting is a set of configurations of the world in which the formula holds. This is what we care about in the end. • But in order to get there, it’s helpful to operate directly on the syntax using a set of inference rules. For example, if I tell you that it’s raining and wet, then you should be able to conclude that it is also raining (obviously) without even explicitly mentioning semantics. Most of the time when people do logic casually, they are really just applying inference rules. Syntax versus semantics
Syntax: what are valid expressions in the language?
Semantics: what do these expressions mean?
Different syntax, same semantics (5):
2 + 3 ⇔ 3 + 2
Same syntax, different semantics (1 versus 1.5):
3 / 2 (Python 2.7) 6⇔ 3 / 2 (Python 3)
CS221 24 • Just to hammer in the point that syntax and semantics are different, consider two examples from programming languages. • First, the formula 2 + 3 and 3 + 2 are superficially different (a syntactic notion), but they have the same semantics (5). • Second, the formula 3 / 2 means something different depending on which language. In Python 2.7, the semantics is 1 (integer division), and in Python 3 the semantics is 1.5 (floating point division). Propositional logic
Syntax Semantics
formula
models
Inference rules
CS221 26
Logics
• Propositional logic • Propositional logic with only Horn clauses • Modal logic • First-order logic • First-order logic with only Horn clauses • Second-order logic • ...
Key idea: tradeoff
Balance expressivity and computational efficiency.
CS221 28 • There are many different logical languages, just like there are programming languages. Whereas most programming languages have the expressive power (all Turing complete), logical languages exhibit a larger spectrum of expressivity. • The bolded items are the ones we will discuss in this class. Roadmap
Modeling Inference Propositional Logic Syntax Inference Rules
Propositional Logic Semantics Propositional modus ponens
First-order Logic Propositional resolution
First-order modus ponens
First-order resolution
CS221 30 • Here are the rest of the modules under the logic unit. • We will start by talking about the core elemenst of logics: syntax, semantics, and inference rules. We will start by defining syntax and semantics for propositional logic. • We will then inferece rules and their discuss soundness and correctness. • We then describe a more expressive logic, i.e., first order logic, and go over its syntax and semantics. Conclusion
Roadmap
Summary of CS221
Next courses
History of AI
Food for thought
CS221 34
Paradigm
Modeling
Inference Learning
CS221 36
Course plan
Search problems Constraint satisfaction problems Markov decision processes Markov networks Adversarial games Bayesian networks Reflex States Variables Logic
”Low-level intelligence” ”High-level intelligence” Machine learning
CS221 38
Machine learning
Objective: loss minimization X min Loss(x, y, w) w (x,y)∈Dtrain
Algorithm: stochastic gradient descent
w → w − ηt ∇Loss(x, y, w) | {z } prediction−target
Applies to wide range of models!
CS221 40
Reflex-based models
Models: linear models, neural networks, nearest neighbors
Inference: feedforward
Learning: SGD, alternating minimization
CS221 42
State-based models
S D F
B E C G
Key idea: state
A state is a summary of all the past actions sufficient to choose future actions opti- mally.
Models: search problems, MDPs, games Inference: UCS/A*, DP, value iteration, minimax Learning: structured Perceptron, Q-learning, TD learning CS221 44
Variable-based models
NT Q WA SA NSW
V
T
Key idea: factor graphs
Graph structure captures conditional independence.
Models: CSPs, Markov networks, Bayesian networks
Inference: backtracking, forward-backward, beam search, Gibbs sampling
Learning: maximum likelihood (closed form, EM) CS221 46
Logic-based models
Syntax Semantics
formula
models
Inference rules
Key idea: logic
Formulas enable more powerful models (infinite).
Models: propositional logic, first-order logic Inference: model checking, modus ponens, resolution Learning: ??? CS221 48
Tools
• CS221 provides a set of tools
• Start with the problem, and figure out what tool to use
• Keep it simple!
CS221 50
Roadmap
Summary of CS221
Next courses
History of AI
Food for thought
CS221 52
Other AI-related courses
http://ai.stanford.edu/courses/ Foundations: • CS228: Probabilistic Graphical Models • CS229: Machine Learning • CS229T: Statistical Learning Theory • CS230: Deep Learning • CS238: Decision Making Under Uncertainty • CS246: Mining Massive Data Sets • CS329D: Machine Learning Under Distributional Shifts • CS330: Deep Multi-Task and Meta Learning CS221 54
Other AI-related courses
http://ai.stanford.edu/courses/ Applications: • CS224N: Natural Language Processing (with Deep Learning) • CS224U: Natural Language Understanding • CS231A: From 3D Reconstruction to Recognition • CS231N: Convolutional Neural Networks for Visual Recognition • CS223A: Introduction to Robotics • CS237A-B: Robot Autonomy • CS227B: General Game Playing • CS348I: Computer Graphics in the Era of AI CS221 56
Probabilistic graphical models (CS228)
h
x
• Forward-backward, variable elimination ⇒ belief propagation, variational inference
• Gibbs sampling ⇒ Markov Chain Monte Carlo (MCMC)
• Learning the structure
CS221 58
Machine learning (CS229)
• Discrete ⇒ continuous
• Linear models ⇒ kernel methods, decision trees
• Boosting, bagging, feature selection
• K-means ⇒ mixture of Gaussians, PCA, ICA
CS221 60
Statistical learning theory (CS229T)
Question: what are the mathematical principles behind learning?
Uniform convergence: with probability at least 0.95, your algorithm will return a predictor h ∈ H such that
q Complexity(H) TestError(h) ≤ TrainError(h) + n
CS221 62
[figure credit: Fei-Fei Li] Vision (CS231A, CS231N)
• Challenges: variation in viewpoint, illumination, intra-class variation • Tasks: object recognition/detection/segmentation, pose estimation, 3D reconstruction, image captioning, visual question answering, activity recognition
CS221 64
[figure credit: Karen Liu] Graphics (CS348I)
• Goal: content creation for images, shapes, and animations • Tasks: differentiable rendering, neural rendering, BRDF estimation, texture synthesis, procedural modeling, sketch simplification, character animation, physics simulation, facial animation
CS221 66
Robotics (CS223A, CS225A)
• Tasks: manipulation, grasping, navigation
• Applications: self-driving cars, medical robotics
• Physical models: kinematics, control
CS221 68
Robotics (CS237A, CS237B)
• Tasks: interaction, robot learning, autonomy
• Applications: mobile manipulation
CS221 70
Language (CS224N, CS224U)
• Designed by humans for communication
• World: continuous, words: discrete, meanings: continuous
• Properties: compositionality, grounding
• Tasks: syntactic parsing, semantic parsing, information extraction, coreference resolution, machine translation, question answering, summarization, dialogue
CS221 72
Cognitive science
Question: How does the human mind work?
• Cognitive science and AI grew up together
• Humans can learn from few examples on many tasks Computation and cognitive science (PSYCH204, CS428): • Cognition as Bayesian modeling — probabilistic program [Tenenbaum, Goodman, Grif- fiths]
CS221 74
Neuroscience
• Neuroscience: hardware; cognitive science: software • Artificial neural network as computational models of the brain • Modern neural networks (GPUs + backpropagation) not biologically plausible • Analogy: birds versus airplanes; what are principles of intelligence?
CS221 76
Roadmap
Summary of CS221
Next courses
History of AI
Food for thought
CS221 78
Birth of AI
1956: Workshop at Dartmouth College; attendees: John McCarthy, Marvin Minsky, Claude Shannon, etc.
Aim for general principles: Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.
CS221 80
Birth of AI, early successes
Checkers (1952): Samuel’s program learned weights and played at strong amateur level
Problem solving (1955): Newell & Simon’s Logic Theorist: prove theorems in Principia Mathematica using search + heuristics; later, General Problem Solver (GPS)
CS221 82
Overwhelming optimism...
Machines will be capable, within twenty years, of doing any work a man can do. —Herbert Simon
Within 10 years the problems of artificial intelligence will be substantially solved. —Marvin Minsky
I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines. —Claude Shannon
CS221 84
...underwhelming results
Example: machine translation
The spirit is willing but the flesh is weak.
(Russian)
The vodka is good but the meat is rotten.
1966: ALPAC report cut off government funding for MT
CS221 86
AI is overhyped...
We tend to overestimate the effect of a technology in a short run and underestimate the effect in a long run. —Roy Amara (1925-2007)
CS221 88
Implications of early era
Problems: • Limited computation: search space grew exponentially, outpacing hardware (100! ≈ 10157 > 1080) • Limited information: complexity of AI problems (number of words, objects, concepts in the world)
Contributions: • Lisp, garbage collection, time-sharing (John McCarthy) • Key paradigm: separate modeling (declarative) and inference (procedural)
CS221 90
Knowledge-based systems (70-80s)
Expert systems: elicit specific domain knowledge from experts in form of rules:
if [premises] then [conclusion]
CS221 92
Knowledge-based systems (70-80s)
DENDRAL: infer molecular structure from mass spectrometry
MYCIN: diagnose blood infections, recommend antibiotics
XCON: convert customer orders into parts specification; save DEC $40 million a year by 1986
CS221 94
Knowledge-based systems
Contributions: • First real application that impacted industry • Knowledge helped curb the exponential growth
Problems: • Knowledge is not deterministic rules, need to model uncertainty • Requires considerable manual effort to create rules, hard to maintain
CS221 96
SHRDLU [Winograd 1971]
Person: Pick up a big red block. Computer: OK. Person: Grasp the pyramid. Computer: I don’t understand which pyramid you mean. Person (changing their mind): Find a block which is taller than the one you are holding and put it into the box. Computer: By ”it”, I assume you mean the block which is taller than the one I am holding. Computer: OK. Person: What does the box contain? Computer: The blue pyramid and the blue block. Person: What is the pyramid supported by? Computer: The box.
CS221 98
The Complexity Barrier
A number of people have suggested to me that large programs like the SHRDLU program for understanding natural language represent a kind of dead end in AI programming. Complex interactions between its components give the program much of its power, but at the same time they present a formidable obstacle to understanding and extending it. In order to grasp any part, it is necessary to understand how it fits with other parts, presents a dense mass, with no easy footholds. Even having written the program, I find it near the limit of what I can keep in mind at once.
— Terry Winograd (1972)
CS221 100
Modern AI (90s-present)
• Probability: Pearl (1988) promote Bayesian networks in AI to model uncertainty (based on Bayes rule from 1700s)
model predictions
• Machine learning: Vapnik (1995) invented support vector machines to tune parameters (based on statistical models in early 1900s)
data model
CS221 102
A melting pot
• Bayes rule (Bayes, 1763) from probability • Least squares regression (Gauss, 1795) from astronomy • First-order logic (Frege, 1893) from logic • Maximum likelihood (Fisher, 1922) from statistics • Artificial neural networks (McCulloch/Pitts, 1943) from neuroscience • Minimax games (von Neumann, 1944) from economics • Stochastic gradient descent (Robbins/Monro, 1951) from optimization • Uniform cost search (Dijkstra, 1956) from algorithms • Value iteration (Bellman, 1957) from control theory
CS221 104
Roadmap
Summary of CS221
Next courses
History of AI
Food for thought
CS221 106
Outlook
AI is everywhere: consumer services, advertising, transportation, manufacturing, etc.
AI being used to make decisions for: education, credit, employment, advertising, healthcare and policing
CS221 108
[Prates+ 2018] Biases
CS221 110
[figure credit: EFF] Image classification
CS221 112
[Szegedy et al., 2013; Goodfellow et al., 2014] Adversaries
AlexNet predicts correctly on the left
AlexNet predicts ostrich on the right
CS221 114
CS221 116
Security
[Evtimov+ 2017]
[Sharif+ 2016]
CS221 118
Optimizing for clicks
Is this a good objective function for society?
CS221 120
How to model human objectives?
Is this a good objective function for the human?
CS221 122
How to model human objectives?
Be aware of the mismatch between human preferences and what the robot thinks are the human preferences.
CS221 124
Generating fake content
Can build it 6= should build it?
CS221 126
Figure from Moritz Hardt Fairness
• Most ML training objectives will produce model accurate for majority class, at the expense of the minority one.
CS221 128
Figure from Moritz Hardt Fairness
Two classifiers with 5% error:
CS221 130
Are algorithms neutral?
x
Dtrain Learner f
y By design: picks up patterns in training data, including biases
CS221 132
Privacy
• Not reveal sensitive information (income, health, communication)
• Compute average statistics (how many people have cancer?)
yes no yes no no no no yes no yes
CS221 134
[Warner, 1965] Privacy: Randomized response
Do you have a sibling?
Method: • Flip two coins. • If both heads: answer yes/no randomly • Otherwise: answer yes/no truthfully
Analysis:
4 1 true-prob = 3 × (observed-prob − 8 )
CS221 136
Causality
Goal: figure out the effect of a treatment on survival
Data:
For untreated patients, 80% survive For treated patients, 30% survive
Does the treatment help?
Sick people are more likely to undergo treatment...
CS221 138
[Mykel Kochdorfer] Interpretability versus accuracy
• For air-traffic control, threshold level of safety: probability 10−9 for a catastrophic failure (e.g., collision) per flight hour
• Move from human designed rules to a numeric Q-value table?
CS221 140
CS221 142
Societal and industrial impact
Enormous potential for positive impact, use responsibly!
CS221 144
Search problems Constraint satisfaction problems Markov decision processes Markov networks Adversarial games Bayesian networks Reflex States Variables Logic
”Low-level intelligence” ”High-level intelligence” Machine learning
Please fill out course evaluations on Axess.
Thanks for an exciting quarter!
CS221 146