Artificial Intelligence (MCA-413)

Unit No: III

Lecture : I

Forward chaining (or forward reasoning) is one of the two main methods of reasoning when using an inference engine, and can be described logically as repeated application of modus ponens. Forward chaining is a popular implementation strategy for expert systems, business rule systems, and production rule systems. The opposite of forward chaining is backward chaining.

Forward chaining starts with the available data and uses inference rules to extract more data (from an end user, for example) until a goal is reached. An inference engine using forward chaining searches the inference rules until it finds one where the antecedent (If clause) is known to be true. When such a rule is found, the engine can conclude, or infer, the consequent (Then clause), resulting in the addition of new information to its data.[1]

Inference engines will iterate through this process until a goal is reached.

Example

Suppose that the goal is to conclude the color of a pet named Fritz, given that he croaks and eats flies, and that the rule base contains the following four rules:

If X croaks and X eats flies - Then X is a frog

If X chirps and X sings - Then X is a canary

If X is a frog - Then X is green

If X is a canary - Then X is yellow

Let us illustrate forward chaining by following the pattern of a computer as it evaluates the rules. Assume the following facts:

Fritz croaks

Fritz eats flies

With forward reasoning, the inference engine can derive that Fritz is green in a series of steps:

1. Since the base facts indicate that "Fritz croaks" and "Fritz eats flies", the antecedent of rule #1 is satisfied by substituting Fritz for X, and the inference engine concludes:

Fritz is a frog

2. The antecedent of rule #3 is then satisfied by substituting Fritz for X, and the inference engine concludes:

Fritz is green

The name "forward chaining" comes from the fact that the inference engine starts with the data and reasons its way to the answer, as opposed to backward chaining, which works the other way around. In the derivation, the rules are used in the opposite order as compared to backward chaining. In this example, rules #2 and #4 were not used in determining that Fritz is green.

Because the data determines which rules are selected and used, this method is called data-driven, in contrast to goal-driven backward chaining inference. The forward chaining approach is often employed by expert systems, such as CLIPS.

One of the advantages of forward chaining over backward chaining is that the reception of new data can trigger new inferences, which makes the engine better suited to dynamic situations in which conditions are likely to change.
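To make the procedure concrete, here is a minimal Python sketch of forward chaining on the Fritz rule base; the set-based encoding of rules and facts is an illustrative choice, not how a production engine such as CLIPS represents them.

rules = [
    ({"croaks", "eats flies"}, "is a frog"),   # rule 1
    ({"chirps", "sings"}, "is a canary"),      # rule 2
    ({"is a frog"}, "is green"),               # rule 3
    ({"is a canary"}, "is yellow"),            # rule 4
]

facts = {"croaks", "eats flies"}  # the base facts about Fritz

# Fire every rule whose antecedents all hold; repeat until no new fact
# can be inferred (a fixed point), which is when the engine stops.
changed = True
while changed:
    changed = False
    for antecedents, consequent in rules:
        if antecedents <= facts and consequent not in facts:
            facts.add(consequent)
            changed = True

print("is green" in facts)  # True: the engine has derived that Fritz is green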

Lecture: II

Backward chaining


Backward chaining (or backward reasoning) is an inference method described colloquially as working backward from the goal. It is used in automated theorem provers, inference engines, proof assistants, and other applications.

In game theory, researchers apply it to (simpler) subgames to find a solution to the game, in a process called backward induction. In chess, it is called retrograde analysis, and it is used to generate endgame tablebases for computer chess.

Backward chaining is implemented in logic programming by SLD resolution. Both rules are based on the modus ponens inference rule. It is one of the two most commonly used methods of reasoning with inference rules and logical implications – the other is forward chaining. Backward chaining systems usually employ a depth-first search strategy, e.g. Prolog.

How it works

Backward chaining starts with a list of goals (or a hypothesis) and works backwards from the consequent to the antecedent to see if any data supports any of these consequents.[3] An inference engine using backward chaining would search the inference rules until it finds one with a consequent (Then clause) that matches a desired goal. If the antecedent (If clause) of that rule is not known to be true, then it is added to the list of goals (for one's goal to be confirmed, one must also provide data that confirms this new rule).

For example, suppose a new pet, Fritz, is delivered in an opaque box along with two facts about Fritz:

Fritz croaks

Fritz eats flies

The goal is to decide whether Fritz is green, based on a rule base containing the following four rules:


If X croaks and X eats flies – Then X is a frog

If X chirps and X sings – Then X is a canary

If X is a frog – Then X is green

If X is a canary – Then X is yellow

With backward reasoning, an inference engine can determine whether Fritz is green in four steps. To start, the query is phrased as a goal assertion that is to be proved: "Fritz is green".

1. Fritz is substituted for X in rule #3 to see if its consequent matches the goal, so rule #3 becomes:

If Fritz is a frog – Then Fritz is green

Since the consequent matches the goal ("Fritz is green"), the rules engine now needs to see if the antecedent ("Fritz is a frog") can be proved. The antecedent therefore becomes the new goal:

Fritz is a frog

2. Again substituting Fritz for X, rule #1 becomes:

If Fritz croaks and Fritz eats flies – Then Fritz is a frog

Since the consequent matches the current goal ("Fritz is a frog"), the inference engine now needs to see if the antecedent ("Fritz croaks and eats flies") can be proved. The antecedent therefore becomes the new goal:

Fritz croaks and Fritz eats flies

3. Since this goal is a conjunction of two statements, the inference engine breaks it into two sub-goals, both of which must be proved:

Fritz croaks

Fritz eats flies

4. To prove both of these sub-goals, the inference engine sees that both of these sub-goals were given as initial facts. Therefore, the conjunction is true:

Fritz croaks and Fritz eats flies; therefore the antecedent of rule #1 is true, and the consequent must be true:

Fritz is a frog; therefore the antecedent of rule #3 is true, and the consequent must be true:

Fritz is green

This derivation therefore allows the inference engine to prove that Fritz is green. Rules #2 and #4 were not used.

Note that the goals always match the affirmed versions of the consequents of implications (and not the negated versions as in modus tollens), and even then, their antecedents are then considered as the new goals (and not the conclusions as in affirming the consequent), which ultimately must match known facts (usually defined as consequents whose antecedents are always true); thus, the inference rule used is modus ponens.

Because the list of goals determines which rules are selected and used, this method is called goal-driven, in contrast to data-driven forward chaining. The backward chaining approach is often employed by expert systems.
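For symmetry with the forward chaining sketch, here is a minimal goal-driven version in Python; again the set-based encoding is an illustrative choice, and the plain recursion works here only because this rule base has no cycles.

rules = [
    ({"croaks", "eats flies"}, "is a frog"),   # rule 1
    ({"chirps", "sings"}, "is a canary"),      # rule 2
    ({"is a frog"}, "is green"),               # rule 3
    ({"is a canary"}, "is yellow"),            # rule 4
]

facts = {"croaks", "eats flies"}  # the base facts about Fritz

def prove(goal):
    # A goal is proved if it is a known fact, or if some rule concludes it
    # and every antecedent of that rule can itself be proved.
    if goal in facts:
        return True
    return any(consequent == goal and all(prove(a) for a in antecedents)
               for antecedents, consequent in rules)

print(prove("is green"))  # True: the goal "Fritz is green" is confirmed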

Date : 3/4/2020

Unit III

Lecture No: 3 , 4

Resolution (logic)

In mathematical logic and automated theorem proving, resolution is a rule of inference leading to a refutation theorem-proving technique for sentences in propositional logic and first-order logic. In other words, iteratively applying the resolution rule in a suitable way allows for telling whether a propositional formula is satisfiable and for proving that a first-order formula is unsatisfiable. Attempting to prove a satisfiable first-order formula as unsatisfiable may result in a nonterminating computation; this problem doesn't occur in propositional logic.

The resolution rule can be traced back to Davis and Putnam (1960);[1] however, their algorithm required trying all ground instances of the given formula. This source of combinatorial explosion was eliminated in 1965 by John Alan Robinson's syntactical unification algorithm, which allowed one to instantiate the formula during the proof "on demand" just as far as needed to keep refutation completeness.

The clause produced by a resolution rule is sometimes called a resolvent.

Resolution in propositional logic

Resolution rule

The resolution rule in propositional logic is a single valid inference rule that produces a new clause implied by two clauses containing complementary literals. A literal is a propositional variable or the negation of a propositional variable. Two literals are said to be complements if one is the negation of the other (in the following, $\lnot c$ is taken to be the complement to $c$). The resulting clause contains all the literals that do not have complements. Formally:

$$\frac{a_{1}\lor a_{2}\lor \cdots \lor c,\quad b_{1}\lor b_{2}\lor \cdots \lor \lnot c}{a_{1}\lor a_{2}\lor \cdots \lor b_{1}\lor b_{2}\lor \cdots }$$

where all $a_{i}$, $b_{i}$, and $c$ are literals, and the dividing line stands for "entails".

The above may also be written as:

$$\frac{(\lnot a_{1}\land \lnot a_{2}\land \cdots )\rightarrow c,\quad c\rightarrow (b_{1}\lor b_{2}\lor \cdots )}{(\lnot a_{1}\land \lnot a_{2}\land \cdots )\rightarrow (b_{1}\lor b_{2}\lor \cdots )}$$

The clause produced by the resolution rule is called the resolvent of the two input clauses. It is the principle of consensus applied to clauses rather than terms.

When the two clauses contain more than one pair of complementary literals, the resolution rule can be applied (independently) for each such pair; however, the result is always a tautology.

Modus ponens can be seen as a special case of resolution (of a one-literal clause and a two-literal clause).

$$\frac{p\rightarrow q,\quad p}{q} \quad \text{is equivalent to} \quad \frac{\lnot p\lor q,\quad p}{q}$$

A resolution technique

When coupled with a complete search algorithm, the resolution rule yields a sound and complete algorithm for deciding the satisfiability of a propositional formula, and, by extension, the validity of a sentence under a set of axioms.

This resolution technique uses proof by contradiction and is based on the fact that any sentence in propositional logic can be transformed into an equivalent sentence in conjunctive normal form.

The steps are as follows.

All sentences in the knowledge base and the negation of the sentence to be proved (the conjecture) are conjunctively connected.

The resulting sentence is transformed into a conjunctive normal form with the conjuncts viewed as elements in a set, S, of clauses.

For example, $(A_{1}\lor A_{2})\land (B_{1}\lor B_{2}\lor B_{3})\land (C_{1})$ gives rise to the set $S=\{A_{1}\lor A_{2},\,B_{1}\lor B_{2}\lor B_{3},\,C_{1}\}$.

The resolution rule is applied to all possible pairs of clauses that contain complementary literals. After each application of the resolution rule, the resulting sentence is simplified by removing repeated literals. If the clause contains complementary literals, it is discarded (as a tautology). If not, and if it is not yet present in the clause set S, it is added to S, and is considered for further resolution inferences.

If after applying a resolution rule the empty clause is derived, the original formula is unsatisfiable (or contradictory), and hence it can be concluded that the initial conjecture follows from the axioms.

If, on the other hand, the empty clause cannot be derived, and the resolution rule cannot be applied to derive any more new clauses, the conjecture is not a theorem of the original knowledge base.

One instance of this algorithm is the original Davis–Putnam algorithm that was later refined into the DPLL algorithm that removed the need for explicit representation of the resolvents.

This description of the resolution technique uses a set S as the underlying data structure to represent resolution derivations. Lists, trees and directed acyclic graphs are other possible and common alternatives. Tree representations are more faithful to the fact that the resolution rule is binary. Together with a sequent notation for clauses, a tree representation also makes it easy to see how the resolution rule is related to a special case of the cut-rule, restricted to atomic cut-formulas. However, tree representations are not as compact as set or list representations, because they explicitly show redundant subderivations of clauses that are used more than once in the derivation of the empty clause. Graph representations can be as compact in the number of clauses as list representations and they also store structural information regarding which clauses were resolved to derive each resolvent.

A simple example:

$$\frac{a\lor b,\quad \lnot a\lor c}{b\lor c}$$

In plain language: Suppose $a$ is false. In order for the premise $a\lor b$ to be true, $b$ must be true. Alternatively, suppose $a$ is true. In order for the premise $\lnot a\lor c$ to be true, $c$ must be true. Therefore, regardless of the falsehood or veracity of $a$, if both premises hold, then the conclusion $b\lor c$ is true.
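To see the technique end to end, here is a minimal Python sketch of propositional resolution refutation; representing clauses as frozensets of literal strings (with "~" marking negation) is an illustrative choice, not how production provers work.

from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolvents(c1, c2):
    # Resolve on each complementary pair of literals independently.
    return [(c1 - {lit}) | (c2 - {negate(lit)})
            for lit in c1 if negate(lit) in c2]

def unsatisfiable(clauses):
    # Saturate the clause set; report True if the empty clause appears.
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:                           # empty clause: contradiction
                    return True
                if any(negate(l) in r for l in r):  # discard tautologies
                    continue
                new.add(frozenset(r))
        if new <= clauses:                          # no new clauses: satisfiable
            return False
        clauses |= new

# KB plus negated conjecture: (a or b), (~a or c), ~b, ~c has no model.
print(unsatisfiable({frozenset({"a", "b"}), frozenset({"~a", "c"}),
                     frozenset({"~b"}), frozenset({"~c"})}))  # True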

Example:

Consider the following axioms:

All hounds howl at night.

Anyone who has any cats will not have any mice.

Light sleepers do not have anything which howls at night.

John has either a cat or a hound.

(Conclusion) If John is a light sleeper, then John does not have any mice.

The conclusion can be proved using resolution. The first step is to write each axiom as a well-formed formula in first-order predicate calculus. The formulas written for the above axioms are shown below, using LS(x) for 'light sleeper':

∀ x (HOUND(x) → HOWL(x))

∀ x ∀ y (HAVE (x,y) ∧ CAT (y) → ¬ ∃ z (HAVE(x,z) ∧ MOUSE (z)))

∀ x (LS(x) → ¬ ∃ y (HAVE (x,y) ∧ HOWL(y)))

∃ x (HAVE (John,x) ∧ (CAT(x) ∨ HOUND(x)))

LS(John) → ¬ ∃ z (HAVE(John,z) ∧ MOUSE(z))
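To finish the proof, the formulas must be converted to clause form and the negation of the conclusion added; the refutation sketched below is our own working under the standard conversion (a and b are Skolem constants introduced for the existential quantifiers):

1. ¬ HOUND(x) ∨ HOWL(x)

2. ¬ HAVE(x,y) ∨ ¬ CAT(y) ∨ ¬ HAVE(x,z) ∨ ¬ MOUSE(z)

3. ¬ LS(x) ∨ ¬ HAVE(x,y) ∨ ¬ HOWL(y)

4. HAVE(John,a)

5. CAT(a) ∨ HOUND(a)

6. LS(John), 7. HAVE(John,b), 8. MOUSE(b) (from the negated conclusion)

Resolving 3 with 6 gives ¬ HAVE(John,y) ∨ ¬ HOWL(y); with 4 {y/a} this gives ¬ HOWL(a); with 1 {x/a} this gives ¬ HOUND(a); with 5 this gives CAT(a). Resolving 2 with 4 {x/John, y/a} and with CAT(a) gives ¬ HAVE(John,z) ∨ ¬ MOUSE(z); with 7 {z/b} this gives ¬ MOUSE(b), which resolves with 8 to the empty clause, proving the conclusion.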

LECTURE NO: 5 & 6

UNIT: 3

DATE : 6/4/2020

TOPIC:

PROBABILISTIC REASONING/LOGIC

The aim of a probabilistic logic (probability logic and probabilistic reasoning) is to combine the capacity of probability theory to handle uncertainty with the capacity of deductive logic to exploit structure of formal argument. The result is a richer and more expressive formalism with a broad range of possible application areas. Probabilistic logics attempt to find a natural extension of traditional logic truth tables:

The results they define are derived through probabilistic expressions instead. A difficulty with probabilistic logics is that they tend to multiply the computational complexities of their probabilistic and logical components. Other difficulties include the possibility of counter-intuitive results, such as those of Dempster-Shafer theory in evidence-based subjective logic. The need to deal with a broad variety of contexts and issues has led to many different proposals.

Modern proposals

Below is a list of proposals for probabilistic and evidentiary extensions to classical and predicate logic.

The term "probabilistic logic" was first used in a paper by Nils Nilsson published in 1986, where the truth values of sentences are probabilities.[2] The proposed semantical generalization induces a probabilistic logical entailment, which reduces to ordinary logical entailment when the probabilities of all sentences are either 0 or 1.

This generalization applies to any logical system for which the consistency of a finite set of sentences can be established.

The central concept in the theory of subjective logic[3] is that of opinions about some of the propositional variables involved in the given logical sentences. A binomial opinion applies to a single proposition and is represented as a 3-dimensional extension of a single probability value, to express various degrees of ignorance about the truth of the proposition. For the computation of derived opinions based on a structure of argument opinions, the theory proposes operators for various logical connectives, such as multiplication (AND), comultiplication (OR), division (UN-AND) and co-division (UN-OR) of opinions, as well as conditional deduction (MP) and abduction (MT).

The approximate reasoning formalism proposed by fuzzy logic can be used to obtain a logic in which the models are the probability distributions and the theories are the lower envelopes. In such a logic the question of the consistency of the available information is strictly related to that of the coherence of a partial probabilistic assignment, and therefore to the Dutch book phenomenon.

Markov logic networks implement a form of uncertain inference based on the maximum entropy principle—the idea that probabilities should be assigned in such a way as to maximize entropy, in analogy with the way that Markov chains assign probabilities to finite state machine transitions.

Systems such as Pei Wang's Non-Axiomatic Reasoning System (NARS) or Ben Goertzel's Probabilistic Logic Networks (PLN) add an explicit confidence ranking, as well as a probability, to atoms and sentences. The rules of deduction and induction incorporate this uncertainty, thus side-stepping difficulties in purely Bayesian approaches to logic (including Markov logic), while also avoiding the paradoxes of Dempster-Shafer theory. The implementation of PLN attempts to use and generalize algorithms from logic programming, subject to these extensions.

In the theory of probabilistic argumentation, probabilities are not directly attached to logical sentences. Instead it is assumed that a particular subset $W$ of the variables $V$ involved in the sentences defines a probability space over the corresponding sub-σ-algebra. This induces two distinct probability measures with respect to $V$, which are called degree of support and degree of possibility, respectively. Degrees of support can be regarded as non-additive probabilities of provability, which generalizes the concepts of ordinary logical entailment (for $V=\{\}$) and classical posterior probabilities (for $V=W$). Mathematically, this view is compatible with the Dempster-Shafer theory.

The theory of evidential reasoning[9] also defines non-additive probabilities of probability (or epistemic probabilities) as a general notion for both logical entailment (provability) and probability. The idea is to augment standard propositional logic by considering an epistemic operator K that represents the state of knowledge that a rational agent has about the world. Probabilities are then defined over the resulting epistemic universe Kp of all propositional sentences p, and it is argued that this is the best information available to an analyst. From this view, Dempster-Shafer theory appears to be a generalized form of probabilistic reasoning.

Possible application areas:

Argumentation theory

Artificial intelligence

Artificial general intelligence

Bioinformatics

Explainable artificial intelligence

Formal epistemology

Game theory

Philosophy of science

Psychology

Statistics


Lecture no: 7 & 8

9/4/2020

UNIT III

LECTURE: 7

Utility Theory

Utility functions are one of the elements of artificial intelligence (AI) solutions that are frequently mentioned but seldom discussed in detail in AI articles. That basic piece of AI theory has become an essential element of modern AI solutions. In some contexts, we could generalize the complete spectrum of AI applications as scenarios that involve a utility function that needs to be maximized by a rational agent. Before venturing that far, we should answer a more basic question: what is a utility function?

Utility functions are a product of Utility Theory which is one of the disciplines that helps to address the challenges of building knowledge under uncertainty. Utility theory is often combined with probabilistic theory to create what we know as decision-theoretic agents. Conceptually, a decision-theoretic agent is an AI program that can make rational decisions based on what it believes and what it wants. Sounds rational, right? :)

Ok, let's get a bit more practical. In many AI scenarios, agents don't have the luxury of operating in an environment in which they know the final outcome of every possible state. Those agents operate under a certain degree of uncertainty and need to rely on probabilities to quantify the outcomes of possible states. Those probabilistic functions are what we call utility functions.

Diving Into Utility Theory and MEU

Utility Theory is the discipline that lays out the foundation for creating and evaluating utility functions. Typically, utility theory uses the notion of Expected Utility (EU): a value that represents the average utility of all possible outcomes of a state, weighted by the probability that each outcome occurs. The other key concept of utility theory is the Principle of Maximum Expected Utility (MEU), which states that any rational agent should choose the action that maximizes the agent's EU.

The principle of MEU seems like an obvious way to make decisions until you start digging into it and run into all sorts of interesting questions. Why use the average utility after all? Why not try to minimize the loss instead of maximizing the utility? There are dozens of similar questions that challenge the principle of MEU. However, in order to validate the principle of MEU, we should go back to the axioms of utility theory.
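As a small, concrete illustration of MEU, the sketch below scores two actions by expected utility and picks the better one; the probabilities and utility values are made up for this example.

# Each action maps to a list of (probability of outcome, utility of outcome).
actions = {
    "salmon":  [(0.7, 8), (0.3, 2)],
    "chicken": [(0.9, 5), (0.1, 3)],
}

def expected_utility(outcomes):
    # EU = sum of utilities weighted by their outcome probabilities.
    return sum(p * u for p, u in outcomes)

# MEU: a rational agent picks the action with the highest expected utility.
best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, expected_utility(actions[best]))  # salmon 6.2 (vs. chicken 4.8)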

Utility Theory Axioms

There are six fundamental axioms that set up the foundation of utility theory. In order to explain those, let's pick a scenario in which you are having dinner at a restaurant and are trying to decide between a salmon and a chicken dish. There are many factors that go into that simple decision. Which dish goes better with this gorgeous Montrachet we just ordered? How about the dessert? How would one feel if the rice is overcooked? Is the salmon's portion too small? Hopefully you get the point.

Utility theory assigns a probability to each one of those possible states and tries to orchestrate decisions based on that. However, those decisions are governed by a group of six fundamental axioms: Orderability, Transitivity, Continuity, Substitutability, Monotonicity and Decomposability. Together these six axioms help to enforce the principle of MEU.

LECTURE NO: 08

Hidden Markov Models

HMMs are the most common models used for dealing with temporal data. They also frequently come up, in different guises, in data science interviews, usually without the label HMM attached. In such a scenario it is necessary to recognize the problem as an HMM problem by knowing the characteristics of HMMs.

In the Hidden Markov Model we are constructing an inference model based on the assumptions of a Markov process.

The Markov process assumption is that the “future is independent of the past given that we know the present”.

It means that the future state is related only to the immediately previous state, and not to the states before that. These are first-order HMMs.

What is Hidden?

With HMMs, we don't know which state matches which physical event; instead, each state produces a given output. We observe the outputs over time to determine the sequence of states.

Example: If you are staying indoors you will be dressed a certain way. Let's say you want to step outside. Depending on the weather, your clothing will change. Over time, you will observe the weather and make better judgements on what to wear as you get familiar with the area/climate. In an HMM, we observe the outputs over time to determine the most likely sequence of states based on how likely each state was to produce that output.

Let us consider the situation where you have no view of the outside world when you are in a building. The only way for you to know if it is raining outside is to see someone carrying an umbrella when they come in. Here, the evidence variable is Umbrella, while the hidden variable is Rain.

HMM representation:

Since this is a Markov model, R(t) depends only on R(t-1).

A number of related tasks ask about the probability of one or more of the latent variables, given the model's parameters and a sequence of observations, which in our scenario is the sequence of umbrella observations. Some tasks related to this example are also similar to those asked in a data science interview:

If I see someone having an umbrella for the last three days, what is the probability that it is raining today? (Inference type — Filtering)

If I see someone having an umbrella for the last three days, what is the probability it will rain the day after tomorrow? (Prediction type)

If I see someone having an umbrella for the last three days, what is the probability it rained yesterday? (Hindsight type — Smoothing)

If I see someone having an umbrella for the last three days, what could be the weather like since the past three days? (Sequence type — Most Likely explanation)

It is worth spending time learning HMMs in detail. The same umbrella problem can also be written in matrix form, with a transition probability matrix and an emission (sensor) probability matrix. In Python, the hmmlearn library (HMM support that was split out of scikit-learn) provides a framework for working with HMMs.
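As an illustration of the filtering task above, here is a minimal forward-algorithm sketch for the umbrella world; the transition and sensor probabilities (0.7, 0.9, 0.2) are commonly used textbook values for this example and are assumptions here, not given in these notes.

# Transition model P(Rain_t | Rain_{t-1}) and sensor model P(Umbrella | Rain).
transition = {True: {True: 0.7, False: 0.3},
              False: {True: 0.3, False: 0.7}}
emission = {True: 0.9, False: 0.2}  # P(umbrella seen | rain state)

def filter_rain(observations, prior=0.5):
    # Return P(Rain_t = True | umbrella observations up to time t).
    belief = {True: prior, False: 1 - prior}
    for saw_umbrella in observations:
        # Predict: push the current belief through the transition model.
        predicted = {r: sum(belief[p] * transition[p][r] for p in (True, False))
                     for r in (True, False)}
        # Update: weight by the evidence likelihood, then normalize.
        weighted = {r: (emission[r] if saw_umbrella else 1 - emission[r]) * predicted[r]
                    for r in (True, False)}
        total = sum(weighted.values())
        belief = {r: weighted[r] / total for r in (True, False)}
    return belief[True]

# Umbrella seen on each of the last three days:
print(filter_rain([True, True, True]))  # ~0.89: rain today is very likely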

Conclusion:

HMMs allow us to model processes with a hidden state, based on observable parameters. The main problems solved with HMMs include determining how likely it is that a set of observations came from a particular model, and determining the most likely sequence of hidden states. They are a valuable tool in temporal pattern recognition. Within the temporal pattern recognition area, the HMMs find application in speech, handwriting and gesture recognition, musical score following and SONAR detection.

LECTURE NO:9,10 & 11

DATE: 11/4/2020

UNIT III

ARTIFICIAL INTELLIGENCE-MCA-413

MCA IV Sem

Topic:

BAYESIAN NETWORKS

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.

Graphical model:

Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no path connects one node to another) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if $m$ parent nodes represent $m$ Boolean variables, then the probability function could be represented by a table of $2^{m}$ entries, one entry for each of the $2^{m}$ possible parent combinations. Similar ideas may be applied to undirected, and possibly cyclic, graphs such as Markov networks.

Example

A simple Bayesian network with conditional probability tables:

Two events can cause grass to be wet: an active sprinkler or rain. Rain has a direct effect on the use of the sprinkler (namely that when it rains, the sprinkler usually is not active). This situation can be modeled with a Bayesian network. Each variable has two possible values, T (for true) and F (for false).

The joint probability function is:

$\Pr(G,S,R)=\Pr(G\mid S,R)\Pr(S\mid R)\Pr(R)$

where G = "Grass wet (true/false)", S = "Sprinkler turned on (true/false)", and R = "Raining (true/false)".

The model can answer questions about the presence of a cause given the presence of an effect (so-called inverse probability), like "What is the probability that it is raining, given the grass is wet?", by using the conditional probability formula and summing over all nuisance variables:

$\Pr(R=T\mid G=T)={\frac {\Pr(G=T,R=T)}{\Pr(G=T)}}={\frac {\sum _{S\in \{T,F\}}\Pr(G=T,S,R=T)}{\sum _{S,R\in \{T,F\}}\Pr(G=T,S,R)}}$

Using the expansion for the joint probability function $\Pr(G,S,R)$ and the conditional probabilities from the conditional probability tables (CPTs) stated in the diagram, one can evaluate each term in the sums in the numerator and denominator. For example,

$\begin{aligned}\Pr(G=T,S=T,R=T)&=\Pr(G=T\mid S=T,R=T)\Pr(S=T\mid R=T)\Pr(R=T)\\&=0.99\times 0.01\times 0.2\\&=0.00198.\end{aligned}$

Then the numerical results (subscripted by the associated variable values) are

$\Pr(R=T\mid G=T)={\frac {0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}}={\frac {891}{2491}}\approx 35.77\%.$
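The same posterior can be checked with a few lines of brute-force enumeration over the joint distribution; the CPT values below are reconstructed from the worked numbers above (e.g. 0.99 × 0.01 × 0.2 = 0.00198), since the diagram itself is not reproduced in these notes.

P_R = {True: 0.2, False: 0.8}                      # P(R)
P_S_given_R = {True: 0.01, False: 0.4}             # P(S=T | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}  # P(G=T | S, R)

def joint(g, s, r):
    # Pr(G,S,R) = Pr(G|S,R) * Pr(S|R) * Pr(R), as in the factorization above.
    pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    ps = P_S_given_R[r] if s else 1 - P_S_given_R[r]
    return pg * ps * P_R[r]

numerator = sum(joint(True, s, True) for s in (True, False))
denominator = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
print(numerator / denominator)  # ~0.3577, matching the 35.77% above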

To answer an interventional question, such as "What is the probability that it would rain, given that we wet the grass?" the answer is governed by the post-intervention joint distribution function

$\Pr(S,R\mid {\text{do}}(G=T))=\Pr(S\mid R)\Pr(R)$

obtained by removing the factor $\Pr(G\mid S,R)$ from the pre-intervention distribution. The do operator forces the value of G to be true. The probability of rain is unaffected by the action:

$\Pr(R\mid {\text{do}}(G=T))=\Pr(R).$

To predict the impact of turning the sprinkler on:

$\Pr(R,G\mid {\text{do}}(S=T))=\Pr(R)\Pr(G\mid R,S=T)$

with the term $\Pr(S=T\mid R)$ removed, showing that the action affects the grass but not the rain.

These predictions may not be feasible given unobserved variables, as in most policy evaluation problems. The effect of the action ${\text{do}}(x)$ can still be predicted, however, whenever the back-door criterion is satisfied.[1][2] It states that, if a set Z of nodes can be observed that d-separates[3] (or blocks) all back-door paths from X to Y, then

$\Pr(Y,Z\mid {\text{do}}(x))={\frac {\Pr(Y,Z,X=x)}{\Pr(X=x\mid Z)}}.$

A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion are called "sufficient" or "admissible." For example, the set Z = R is admissible for predicting the effect of S = T on G, because R d-separates the (only) back-door path S ← R → G. However, if S is not observed, no other set d-separates this path and the effect of turning the sprinkler on (S = T) on the grass (G) cannot be predicted from passive observations. In that case P(G | do(S = T)) is not "identified". This reflects the fact that, lacking interventional data, one cannot tell whether the observed dependence between S and G is due to a causal connection or is spurious (apparent dependence arising from a common cause, R). (See Simpson's paradox.)

To determine whether a causal relation is identified from an arbitrary Bayesian network with unobserved variables, one can use the three rules of "do-calculus" and test whether all do terms can be removed from the expression of that relation, thus confirming that the desired quantity is estimable from frequency data.

Using a Bayesian network can save considerable amounts of memory over exhaustive probability tables, if the dependencies in the joint distribution are sparse. For example, a naive way of storing the conditional probabilities of 10 two-valued variables as a table requires storage space for $2^{10}=1024$ values. If no variable's local distribution depends on more than three parent variables, the Bayesian network representation stores at most $10\cdot 2^{3}=80$ values.

One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than complete joint distributions.

Lecture no: 12,13

Date: 13/4/2020

Topic:

Supervised vs Unsupervised Learning:

What is Supervised Machine Learning?

In Supervised learning, you train the machine using data which is well "labeled." It means some data is already tagged with the correct answer. It can be compared to learning which takes place in the presence of a supervisor or a teacher.

A supervised learning algorithm learns from labeled training data and helps you to predict outcomes for unforeseen data. Successfully building, scaling, and deploying an accurate supervised machine learning model takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights given remain true as the data changes.

Unsupervised Learning

Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you allow the model to work on its own to discover information. It mainly deals with unlabelled data.

Unsupervised learning algorithms allow you to perform more complex processing tasks compared to supervised learning, although unsupervised learning can be more unpredictable than supervised, deep learning, and reinforcement learning methods.

Supervised Learning

Supervised learning allows you to collect data or produce a data output from the previous experience.

Helps you to optimize performance criteria using experience.

Supervised machine learning helps you to solve various types of real-world computation problems.

Prime reasons for using Unsupervised Learning:

1. Unsupervised machine learning finds all kinds of unknown patterns in data.

2. Unsupervised methods help you to find features which can be useful for categorization.

3. It can take place in real time, so all the input data can be analyzed and labeled in the presence of learners.

4. It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.

How Supervised Learning works?

For example, you want to train a machine to help you predict how long it will take you to drive home from your workplace. Here, you start by creating a set of labeled data. This data includes:

Weather conditions

Time of the day

Holidays

All these details are your inputs. The output is the amount of time it took to drive back home on that specific day.

You instinctively know that if it's raining outside, then it will take you longer to drive home. But the machine needs data and statistics.

Let's see now how you can develop a supervised learning model of this example which helps the user to determine the commute time.

The first thing you need to create is a training data set. This training set will contain the total commute time and corresponding factors like weather, time, etc. Based on this training set, your machine might see there's a direct relationship between the amount of rain and the time you will take to get home.

So, it ascertains that the more it rains, the longer you will be driving to get back to your home. It might also see the connection between the time you leave work and the time you'll be on the road.

The closer you are to 6 p.m., the longer it takes for you to get home. Your machine may find some of the relationships with your labeled data.

This is the start of your data model. It begins to learn how rain impacts the way people drive. It also starts to see that more people travel during a particular time of day.
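As a minimal sketch of this commute-time idea, assuming scikit-learn is installed, a regression model can be fit on a made-up numeric encoding of these features:

from sklearn.linear_model import LinearRegression

# Features per trip: [rain in mm, minutes after 5 p.m. you left, holiday (0/1)]
X = [[0, 0, 0], [5, 30, 0], [0, 60, 0], [10, 45, 0], [0, 0, 1]]
y = [30, 55, 50, 70, 25]  # the labels: commute time in minutes

model = LinearRegression().fit(X, y)
print(model.predict([[5, 30, 0]]))  # predicted commute for a rainy 5:30 departure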

How Unsupervised Learning works?

Let's take the case of a baby and her family dog.

She knows and identifies this dog. A few weeks later a family friend brings along a dog and tries to play with the baby.

The baby has not seen this dog before. But she recognizes many features (two ears, eyes, walking on four legs) that are like her pet dog. She identifies the new animal as a dog. This is unsupervised learning, where you are not taught but you learn from the data (in this case, data about a dog). Had this been supervised learning, the family friend would have told the baby that it's a dog.

Types of Supervised Machine Learning Techniques

1. Regression:

Regression technique predicts a single output value using training data.

Example: You can use regression to predict the house price from training data. The input variables will be locality, size of a house, etc.

2. Classification: Classification means to group the output inside a class. If the algorithm tries to label input into two distinct classes, it is called binary classification. Selecting between more than two classes is referred to as multiclass classification.

Example: Determining whether or not someone will be a defaulter of the loan.

Strengths (of a classifier such as logistic regression): Outputs always have a probabilistic interpretation, and the algorithm can be regularized to avoid overfitting.

Weaknesses: Logistic regression may underperform when there are multiple or non-linear decision boundaries. This method is not flexible, so it does not capture more complex relationships.

Types of Unsupervised Machine Learning Techniques

Unsupervised learning problems are further grouped into clustering and association problems:

1. Clustering

Clustering is an important concept when it comes to unsupervised learning. It mainly deals with finding a structure or pattern in a collection of uncategorized data. Clustering algorithms will process your data and find natural clusters(groups) if they exist in the data. You can also modify how many clusters your algorithms should identify. It allows you to adjust the granularity of these groups.

2. Association

Association rules allow you to establish associations amongst data objects inside large databases. This unsupervised technique is about discovering interesting relationships between variables in large databases. For example, people who buy a new home are likely to also buy new furniture.

Other Examples:

Subgroups of cancer patients grouped by their gene expression measurements

Groups of shoppers based on their browsing and purchasing histories

Movies grouped by the ratings given by viewers
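As a minimal clustering sketch, assuming scikit-learn is installed, k-means can discover shopper groups like the ones above from unlabeled data (the two-feature data set is made up):

from sklearn.cluster import KMeans

# Features per shopper: [visits per month, average basket size]
X = [[2, 10], [3, 12], [2, 11], [20, 95], [22, 100], [19, 90]]

# Ask for two clusters; the algorithm finds the natural grouping itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two shopper groups, no labels needed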

Supervised vs. Unsupervised Learning

Parameters: Supervised machine learning technique vs. unsupervised machine learning technique

Process: In a supervised learning model, input and output variables are given. In an unsupervised learning model, only input data is given.

Input Data: Supervised algorithms are trained using labeled data. Unsupervised algorithms are used against data which is not labeled.

Algorithms Used: Supervised learning uses support vector machines, neural networks, linear and logistic regression, random forests, and classification trees. Unsupervised algorithms can be divided into different categories, like clustering algorithms, K-means, hierarchical clustering, etc.

Computational Complexity: Supervised learning is a simpler method. Unsupervised learning is computationally complex.

Use of Data: A supervised learning model uses training data to learn a link between the inputs and the outputs. Unsupervised learning does not use output data.

Accuracy of Results: Supervised learning is a highly accurate and trustworthy method. Unsupervised learning is a less accurate and trustworthy method.

Real-Time Learning: In supervised learning, the learning takes place offline. In unsupervised learning, the learning takes place in real time.

Number of Classes: In supervised learning, the number of classes is known. In unsupervised learning, the number of classes is not known.

Main Drawback: Classifying big data can be a real challenge in supervised learning. In unsupervised learning, you cannot get precise information regarding data sorting, since the data is unlabeled and the output classes are not known in advance.

Conclusion

In Supervised learning, you train the machine using data which is well "labeled."

Unsupervised learning is a machine learning technique, where you do not need to supervise the model.

Supervised learning allows you to collect data or produce a data output from the previous experience.

Unsupervised machine learning helps you to find all kinds of unknown patterns in data.

For example, you will be able to determine the time taken to reach home based on weather conditions, time of day, and holidays.

For example, a baby can identify other dogs it has never seen before, based on unsupervised learning.

Regression and Classification are two types of supervised machine learning techniques.

Clustering and Association are two types of Unsupervised learning.

In a supervised learning model, input and output variables will be given, while in an unsupervised learning model, only input data will be given.

Lecture no:14 , 15 & 16

Date:17/4/2020

Unit IV

TOPIC:

Decision trees

Decision tree learning is one of the predictive modeling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making). This page deals with decision trees in data mining.

Decision tree types

Decision trees used in data mining are of two main types:

Classification tree analysis is when the predicted outcome is the class (discrete) to which the data belongs.

Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).

The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al. in 1984.

Trees used for regression and trees used for classification have some similarities - but also some differences, such as the procedure used to determine where to split.

Some techniques, often called ensemble methods, construct more than one decision tree:

Boosted trees: incrementally building an ensemble by training each new instance to emphasize the training instances previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems.

Bootstrap aggregated (or bagged) decision trees: an early ensemble method that builds multiple decision trees by repeatedly resampling training data with replacement, and voting the trees for a consensus prediction.

A random forest classifier is a specific type of bootstrap aggregating

Rotation forest – in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.[8]

A special case of a decision tree is a decision list,[9] which is a one-sided decision tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity, permit non-greedy learning methods and monotonic constraints to be imposed.

Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.

There are many specific decision-tree algorithms. Notable ones include:

ID3 (Iterative Dichotomiser 3)

C4.5 (successor of ID3)

CART (Classification And Regression Tree)[4]

Chi-square automatic interaction detection (CHAID). Performs multi-level splits when computing classification trees.

MARS: extends decision trees to handle numerical data better.

Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.

ID3 and CART were invented independently at around the same time (between 1970 and 1980)[citation needed], yet follow a similar approach for learning a decision tree from training tuples.

Metrics

Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring "best". These generally measure the homogeneity of the target variable within the subsets. Some examples are given below. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.

Information gain in decision trees

Information gain, used by the ID3, C4.5 and C5.0 tree-generation algorithms, is based on the concept of entropy and information content from information theory. The entropy of a set with class proportions $p_{1},\ldots ,p_{k}$ is $H=-\sum _{i}p_{i}\log _{2}p_{i}$.

The expected information gain is the mutual information, meaning that on average, the reduction in the entropy of T is the mutual information.

Information gain is used to decide which feature to split on at each step in building the tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest daughter nodes. A commonly used measure of purity is called information, which is measured in bits. For each node of the tree, the information value "represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node".

Consider an example data set with four attributes: outlook (sunny, overcast, rainy), temperature (hot, mild, cool), humidity (high, normal), and windy (true, false), with a binary (yes or no) target variable, play, and 14 data points. To construct a decision tree on this data, we need to compare the information gain of each of four trees, each split on one of the four features. The split with the highest information gain will be taken as the first split and the process will continue until all children nodes are pure, or until the information gain is 0.

The split using the feature windy results in two children nodes, one for a windy value of true and one for a windy value of false. In this data set, there are six data points with a true windy value, three of which have a play (where play is the target variable) value of yes and three with a play value of no. The eight remaining data points with a windy value of false contain two no's and six yes's. The information of the windy=true node is calculated using the entropy equation above. Since there is an equal number of yes's and no's in this node, we have $-{\frac {3}{6}}\log _{2}{\frac {3}{6}}-{\frac {3}{6}}\log _{2}{\frac {3}{6}}=1$ bit; for the windy=false node, the information is $-{\frac {6}{8}}\log _{2}{\frac {6}{8}}-{\frac {2}{8}}\log _{2}{\frac {2}{8}}\approx 0.811$ bits.

To find the information of the split, we take the weighted average of these two numbers based on how many observations fell into each node: ${\frac {6}{14}}\times 1+{\frac {8}{14}}\times 0.811\approx 0.892$ bits.

To find the information gain of the split using windy, we must first calculate the information in the data before the split. The original data contained nine yes's and five no's, so its information is $-{\frac {9}{14}}\log _{2}{\frac {9}{14}}-{\frac {5}{14}}\log _{2}{\frac {5}{14}}\approx 0.940$ bits.

Now we can calculate the information gain achieved by splitting on the windy feature: $0.940-0.892=0.048$ bits.

To build the tree, the information gain of each possible first split would need to be calculated. The best first split is the one that provides the most information gain. This process is repeated for each impure node until the tree is complete. This example is adapted from the example appearing in Witten et al.
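The calculation above can be reproduced in a few lines of Python; the counts (9 yes / 5 no overall, 3/3 for windy=true, 6/2 for windy=false) are the ones from the weather data set discussed in the text.

from math import log2

def entropy(counts):
    # H = -sum(p_i * log2(p_i)) over non-empty classes.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_before = entropy([9, 5])                                          # ~0.940 bits
info_split = (6 / 14) * entropy([3, 3]) + (8 / 14) * entropy([6, 2])   # ~0.892 bits
print(info_before - info_split)                                        # gain ~0.048 bits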

Variance reduction

Introduced in CART, variance reduction is often employed in cases where the target variable is continuous (regression tree), meaning that use of many other metrics would first require discretization before being applied. The variance reduction of a node N is defined as the total reduction of the variance of the target variable x due to the split at this node:

$I_{V}(N)={\frac {1}{|S|^{2}}}\sum _{i\in S}\sum _{j\in S}{\frac {1}{2}}(x_{i}-x_{j})^{2}-\left({\frac {|S_{t}|^{2}}{|S|^{2}}}{\frac {1}{|S_{t}|^{2}}}\sum _{i\in S_{t}}\sum _{j\in S_{t}}{\frac {1}{2}}(x_{i}-x_{j})^{2}+{\frac {|S_{f}|^{2}}{|S|^{2}}}{\frac {1}{|S_{f}|^{2}}}\sum _{i\in S_{f}}\sum _{j\in S_{f}}{\frac {1}{2}}(x_{i}-x_{j})^{2}\right)$

where S, S_t, and S_f are the set of presplit sample indices, the set of sample indices for which the split test is true, and the set of sample indices for which the split test is false, respectively. Each of the above summands is indeed a variance estimate, though written in a form without directly referring to the mean.

Uses

Advantages

Amongst other data mining methods, decision trees have various advantages:

Simple to understand and interpret. People are able to understand decision tree models after a brief explanation. Trees can also be displayed graphically in a way that is easy for non-experts to interpret.

Able to handle both numerical and categorical data.

Other techniques are usually specialized in analyzing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables while neural networks can be used only with numerical variables or categoricals converted to 0-1 values.)

Requires little data preparation. Other techniques often require data normalization. Since trees can handle qualitative predictors, there is no need to create dummy variables.

Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model, the explanation for the results is typically difficult to understand, for example with an artificial neural network.

Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

Non-statistical approach that makes no assumptions of the training data or prediction residuals; e.g., no distributional, independence, or constant variance assumptions

Performs well with large datasets. Large amounts of data can be analyzed using standard computing resources in reasonable time.

Mirrors human decision making more closely than other approaches.This could be useful when modeling human decisions/behavior.

Robust against co-linearity, particularly boosting

In-built feature selection. Additional irrelevant features will be used less, so that they can be removed on subsequent runs.

Decision trees can approximate any Boolean function, e.g. XOR.

Limitations

Trees can be very non-robust. A small change in the training data can result in a large change in the tree and consequently the final predictions.

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts.

Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. To reduce the greedy effect of local optimality, some methods such as the dual information distance (DID) tree were proposed.

Decision-tree learners can create over-complex trees that do not generalize well from the training data. (This is known as overfitting.) Mechanisms such as pruning are necessary to avoid this problem (with the exception of some algorithms such as the Conditional Inference approach, that does not require pruning).

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels. However, the issue of biased predictor selection is avoided by the Conditional Inference approach, a two-stage approach, or adaptive leave-one-out feature selection.

Implementations

Many data mining software packages provide implementations of one or more decision tree algorithms.

Examples include Salford Systems CART (which licensed the proprietary code of the original CART authors), IBM SPSS Modeler, RapidMiner, SAS Enterprise Miner, Matlab, R (an open-source software environment for statistical computing, which includes several CART implementations such as the rpart, party and randomForest packages), Weka (a free and open-source data-mining suite that contains many decision tree algorithms), Orange, KNIME, Microsoft SQL Server, and scikit-learn (a free and open-source machine learning library for the Python programming language).

Date: 19/4/2020

LECTURE:17, 18, 19

UNIT IV

TOPIC 1: Statistical Learning Models

Uncertainty is a key element of many artificial intelligence (AI) environments in the real world. By uncertainty, we refer to the characteristics that prevent an AI agent from knowing the precise outcome of a specific state-action combination in a given scenario. Uncertainty is typically the result of nondeterministic and partially observable environments. Statistical learning has become a powerful weapon to overcome uncertainty in AI scenarios and, consequently, it has been widely implemented in many modern AI frameworks.

When we talk about statistical learning, there is a name that comes to mind: Bayes. Even though most of modern statistical theory was the result of the work of the French mathematician Pierre-Simon de Laplace, who lived five decades after Bayes, it is Bayes who got all the credit in a theorem that bears his name. Thomas Bayes was an eighteenth-century British clergyman who described new insights for thinking about chance and uncertainty. Laplace codified Bayes' ideas into a single theorem that helps us reason about almost anything in the world with an ounce of uncertainty:

P(cause|effect) = P(effect|cause) x P(cause) / P(effect).

By P(A|B) we denote the probability that A occurs given B. Replacing cause and effect in the previous equation with the probabilities of any state-action combination in an AI environment, we arrive at the fundamentals of Bayesian learning. Essentially, Bayesian or statistical learning focuses on calculating the probability of each hypothesis and making predictions accordingly.

Although Bayesian learning seems theoretically trivial, it runs into many roadblocks in real-world AI solutions. Specifically, Bayesian learning models frequently prove impractical in environments in which the number of hypotheses is very large or infinite. A very well-known AI algorithm that tries to address those limitations is the maximum a posteriori (MAP) model, which simply makes predictions based on the single most probable hypothesis.

Maybe the most notorious algorithm in statistical learning is the Naive Bayes model (also referred to as the Bayesian classifier), which uses Bayesian networks to model environments in which the effects are independent given the cause. The model is "naive" precisely because it assumes that attributes are independent of each other given the class.

AI models like Naive Bayes are only applicable in fully observable environments, which don't resemble many real-world AI environments. For instance, many AI environments contain hidden variables that are not available in the training data set. Let's take an example from the health care world: electronic medical records typically include observations about the symptoms of a disease rather than about the disease itself. In those scenarios, algorithms such as unsupervised clustering using mixtures of Gaussians, learning Bayesian networks, and learning hidden Markov models are typically a good choice.

As it turns out, statistical learning is not a solution to every AI problem. Laplace's original thought that any form of human knowledge can be codified in a statistical network failed to account for many aspects of human reasoning. A well-known limitation of statistical learning models is the absence of logic, which is key in many forms of knowledge. As a result, statistical learning techniques are not applicable in many real-world AI scenarios.

TOPIC 2

Learning With Complete Data:

Naive Bayes Algorithms:

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.

In machine learning we are often interested in selecting the best hypothesis (h) given data (d).

In a classification problem, our hypothesis (h) may be the class to assign for a new data instance (d).

One of the easiest ways of selecting the most probable hypothesis is to use the data that we have as our prior knowledge about the problem. Bayes' Theorem provides a way to calculate the probability of a hypothesis given our prior knowledge.

Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.

P(d|h) is the probability of data d given that the hypothesis h was true.

P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.

P(d) is the probability of the data (regardless of the hypothesis).

You can see that we are interested in calculating the posterior probability P(h|d) from the prior probability P(h) together with P(d) and P(d|h).

After calculating the posterior probability for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the maximum probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.

This can be written as:

MAP(h) = max(P(h|d)), or

MAP(h) = max((P(d|h) * P(h)) / P(d)), or

MAP(h) = max(P(d|h) * P(h))

P(d) is a normalizing term which allows us to calculate a proper probability. We can drop it when we are interested only in the most probable hypothesis, as it is constant across hypotheses and only used to normalize.

Back to classification: if we have an equal number of instances in each class in our training data, then the probability of each class (e.g. P(h)) will be equal. Again, this would be a constant term in our equation and we could drop it, so that we end up with:

MAP(h) = max(P(d|h))

Naive Bayes Classifier

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes (or idiot Bayes) because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value and calculated as P(d1|h) * P(d2|h) and so on.

This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.

Representation Used By Naive Bayes Models

The representation for naive Bayes is probabilities.

A list of probabilities is stored to file for a learned naive Bayes model. This includes:

Class Probabilities: The probabilities of each class in the training dataset.

Conditional Probabilities: The conditional probabilities of each input value given each class value.

Learn a Naive Bayes Model From Data

Learning a naive Bayes model from your training data is fast.

Training is fast because only the probability of each class and the probability of each class given different input (x) values need to be calculated. No coefficients need to be fitted by optimization procedures.

Calculating Class Probabilities

The class probabilities are simply the frequency of instances that belong to each class divided by the total number of instances.

For example in a binary classification the probability of an instance belonging to class 1 would be calculated as:

P(class=1) = count(class=1) / (count(class=0) + count(class=1))

In the simplest case each class would have the probability of 0.5 or 50% for a binary classification problem with the same number of instances in each class.
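As a minimal sketch, the class probabilities could be computed from raw counts like this (the ten labels below are assumed for illustration):

    from collections import Counter

    labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # assumed class labels of ten training instances
    counts = Counter(labels)
    total = len(labels)

    p_class_1 = counts[1] / total  # 5/10 = 0.5
    p_class_0 = counts[0] / total  # 5/10 = 0.5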

Calculating Conditional Probabilities

The conditional probabilities are the frequency of each attribute value for a given class value divided by the frequency of instances with that class value.

For example, if a "weather" attribute had the values "sunny" and "rainy" and the class attribute had the class values "go-out" and "stay-home", then the conditional probabilities of each weather value for each class value could be calculated as:

P(weather=sunny|class=go-out) = count(instances with weather=sunny and class=go-out) / count(instances with class=go-out)

P(weather=sunny|class=stay-home) = count(instances with weather=sunny and class=stay-home) / count(instances with class=stay-home)

P(weather=rainy|class=go-out) = count(instances with weather=rainy and class=go-out) / count(instances with class=go-out)

P(weather=rainy|class=stay-home) = count(instances with weather=rainy and class=stay-home) / count(instances with class=stay-home)

Make Predictions With a Naive Bayes Model

Given a naive Bayes model, you can make predictions for new data using Bayes theorem.

MAP(h) = max(P(d|h) * P(h))

Using our example above, if we had a new instance with the weather of sunny, we can calculate:

go-out = P(weather=sunny|class=go-out) * P(class=go-out)

stay-home = P(weather=sunny|class=stay-home) * P(class=stay-home)

We can choose the class that has the largest calculated value. We can turn these values into probabilities by normalizing them as follows:

P(go-out|weather=sunny) = go-out / (go-out + stay-home)

P(stay-home|weather=sunny) = stay-home / (go-out + stay-home)

If we had more input variables we could extend the above example. For example, pretend we have a "car" attribute with the values "working" and "broken". We can multiply the corresponding conditional probability into the equation.

For example below is the calculation for the “go-out” class label with the addition of the car input variable set to “working”:

go-out = P(weather=sunny|class=go-out) * P(car=working|class=go-out) * P(class=go-out)
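Putting these pieces together, the sketch below scores both classes for a new (sunny, working) instance; the six-row training table is an assumption invented for illustration, and zero counts are not smoothed here, as they would be in a production implementation:

    # Each instance: (weather, car, class) -- an assumed toy training set
    data = [
        ("sunny", "working", "go-out"),
        ("sunny", "working", "go-out"),
        ("sunny", "broken",  "go-out"),
        ("rainy", "working", "stay-home"),
        ("rainy", "broken",  "stay-home"),
        ("rainy", "broken",  "stay-home"),
    ]

    def p_class(c):
        # prior: fraction of instances with class c
        return sum(1 for row in data if row[2] == c) / len(data)

    def p_attr(index, value, c):
        # conditional: fraction of class-c instances with this attribute value
        in_class = [row for row in data if row[2] == c]
        return sum(1 for row in in_class if row[index] == value) / len(in_class)

    def score(weather, car, c):
        # naive Bayes: product of per-attribute conditionals and the prior
        return p_attr(0, weather, c) * p_attr(1, car, c) * p_class(c)

    s_go = score("sunny", "working", "go-out")
    s_stay = score("sunny", "working", "stay-home")
    print(s_go / (s_go + s_stay))  # normalized P(go-out | sunny, working) -> 1.0 here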

Gaussian Naive Bayes

Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian distribution.

This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because you only need to estimate the mean and the standard deviation from your training data.

Representation for Gaussian Naive Bayes

Above, we calculated the probabilities for input values for each class using a frequency. With real-valued inputs, we can calculate the mean and standard deviation of input values (x) for each class to summarize the distribution.

This means that in addition to the probabilities for each class, we must also store the mean and standard deviations for each input variable for each class.

Learn a Gaussian Naive Bayes Model From Data

This is as simple as calculating the mean and standard deviation values of each input variable (x) for each class value.

mean(x) = 1/n * sum(x)

Where n is the number of instances and x are the values for an input variable in your training data.

We can calculate the standard deviation using the following equation:

standard deviation(x) = sqrt(1/n * sum((xi - mean(x))^2))

This is the square root of the average squared difference of each value of x from the mean of x, where n is the number of instances, sqrt() is the square root function, sum() is the sum function, xi is the value of x for the i'th instance, mean(x) is as described above, and ^2 is the square.

Make Predictions With a Gaussian Naive Bayes Model

Probabilities of new x values are calculated using the Gaussian Probability Density Function (PDF).

When making predictions these parameters can be plugged into the Gaussian PDF with a new input for the variable, and in return the Gaussian PDF will provide an estimate of the probability of that new input value for that class.

pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x - mean)^2 / (2 * sd^2)))

Where pdf(x) is the Gaussian PDF, sqrt() is the square root, mean and sd are the mean and standard deviation calculated above, PI is the numerical constant pi, exp() is Euler's number e raised to a power, and x is the input value for the input variable.

We can then plug in the probabilities into the equation above to make predictions with real-valued inputs.

For example, adapting one of the above calculations with numerical values for weather and car:

go-out = pdf(weather|class=go-out) * pdf(car|class=go-out) * P(class=go-out)
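A minimal sketch of that calculation, with per-class means, standard deviations and priors that are assumed for illustration:

    import math

    def gaussian_pdf(x, mean, sd):
        # Gaussian probability density function, as defined above
        return (1.0 / (math.sqrt(2 * math.pi) * sd)) * math.exp(-((x - mean) ** 2) / (2 * sd ** 2))

    # Assumed per-class statistics for a numeric "weather" variable (e.g. temperature)
    stats = {"go-out": (25.0, 5.0), "stay-home": (10.0, 5.0)}  # (mean, sd)
    priors = {"go-out": 0.5, "stay-home": 0.5}                 # assumed equal priors

    x = 22.0  # new observation
    scores = {c: gaussian_pdf(x, m, s) * priors[c] for c, (m, s) in stats.items()}
    print(max(scores, key=scores.get))  # go-out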

Best Prepare Your Data For Naive Bayes

Categorical Inputs: Naive Bayes assumes label attributes such as binary, categorical or nominal.

Gaussian Inputs: If the input variables are real-valued, a Gaussian distribution is assumed. In which case the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g. values that are more than 3 or 4 standard deviations from the mean).

Classification Problems:

Naive Bayes is a classification algorithm, suitable for binary and multiclass classification.

Log Probabilities: The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow (see the sketch after this list).

Kernel Functions: Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used, such as a variety of kernel density functions.

Update Probabilities: When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
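Here is a brief sketch of the log-probabilities tip; the small likelihood terms are assumed values:

    import math

    probs = [0.001, 0.002, 0.0005, 0.003]  # assumed small likelihood terms

    # Multiplying many small probabilities drives the product toward zero
    product = 1.0
    for p in probs:
        product *= p

    # Summing logs computes the same quantity without underflow
    log_score = sum(math.log(p) for p in probs)
    print(product, math.exp(log_score))  # identical values; comparisons can stay in log space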

Summary

In this lecture you discovered the Naive Bayes algorithm for classification. You learned about:

The Bayes Theorem and how to calculate it in practice.

Naive Bayes algorithm including representation, making predictions and learning the model.

The adaptation of Naive Bayes for real-valued input data called Gaussian Naive Bayes.

Date: 19/4/2020

Question Bank: 1

ARTIFICIAL INTELLIGENCE

1.

(a) Explain the term Artificial Intelligence.

(b) Describe the role of Computer Vision in Artificial Intelligence.

(c) What do you mean by Agent Program? How do you assure that an agent program is an Intelligent Agent Program?

(d) Discuss the role of Machine Intelligence in game playing.

(e) What is the Modus Ponens Rule in Propositional Logic?

(f) What is Turing Test?

(g) Write short note on state of the art of Artificial Intelligence.

(h) What is Pattern Recognition?

(i) Discuss Supervised & Unsupervised learning.

(j) Describe the role of Artificial Intelligence in Natural Language Processing.

2.

(a) What is Production System? Explain the various types of production system.

(b) What is Probabilistic Reasoning? Also describe the role HMM in probabilistic reasoning.

(c) What is Clustering? Describe K-Means Clustering Algorithm.

(d) Explain Learning with complete data i.e. Naive Bayes Model and learning with hidden data i.e. EM algorithm.

(e) Describe A* Search Technique. Prove that A* is complete and optimal.

(f) Derive the expressions for time and space complexity of Breadth-First and Depth-First Search strategies.

(g) Determine whether the following argument is valid: "If I work the whole night on this problem, then I can solve it. If I solve the problem, then I will understand the topic. Therefore, if I work the whole night on this problem, then I will understand the topic."

(h) Describe the Bayesian Network technique of Knowledge Representation. How is it useful in representing uncertain knowledge?

3 Explain how PCA is used in Pattern Recognition. Describe Parameter Estimation methods in Pattern Recognition.

4 Translate the following sentences into formulas in Predicate Logic and Clausal Form:

(i) John likes all kind of food.

(ii) Apples are food.

(iii) Chicken is food.

(iv) Bill eats peanuts and is still alive.

(v) Sue eats everything Bill eats.

(vi) Anything any one eats and is not killed by is food.

5 Write short notes on the following:

i. Linear Discriminant Analysis

ii. Support Vector Machine

iii. Game Search

Date: 19/4/2020

Question Bank : 2

MCA (Semester - 4th)

ARTIFICIAL INTELLIGENCE (MCA - 413)

Q1) a) Define Artificial Intelligence.

b) What is an Expert System?

c) Define NLP (Natural Language Processing).

d) What is the relation between Resolution and Unification?

e) What are the rules and facts in PROLOG?

f) Name any two Expert Systems and their purpose.

g) What is the role of modal logic in discourse and pragmatic processing?

h) Give an example of a statement and its clausal form.

i) Define operators used in PROLOG.

j) Write a note on compilers of PROLOG.

k) Write the general structure of a PROLOG program.

l) Explain Backtracking.

m) Explain the Water-Jug problem.

n) What do you mean by inference mechanism?

o) What is the role of the LEXICON in syntactic processing?

Q2) Discuss the application area of Artificial Intelligence.

Q3) Write all known Data Structures to play the tic-tac-toe game.

Q4) Differentiate between DFS (Depth First Search) and BFS (Breadth First Search).

Q5) Discuss knowledge Acquisition Techniques.

Q6) Explain basic components of an Expert System with the help of a diagram.

Q7) Explain the tasks performed in the Syntactic Processing.

Q8) Explain Semantic Analysis with context to Natural Language Processing.

Q9) Write difference between Conventional programming and AI programming.

Q10) What is the role of recursion in PROLOG?

Q11) Discourse and Pragmatic processing is complex. Explain with reasoning.

Q12) Write a note on Pattern Matching.

Q13) What are the statements that control program flow in PROLOG?

Lecture:20 & 21

Date:21/4/2020

Unit IV

Topic:

Learning with hidden variables – the EM algorithm

The EM algorithm is a very general algorithm used to learn probabilistic models in which some variables are hidden, that is, not observed. Models with hidden variables are sometimes called latent variable models. The EM algorithm is a solution to this kind of problem and goes very well with probabilistic graphical models.

Most of the time, when we want to learn the parameters of a model, we write an objective function, such as the likelihood function, and we aim at finding the parameters that maximize this function. Generally speaking, one could simply use a black-box numerical optimizer and just compute the relevant parameters given this function. However, in many cases, this would be intractable and too prone to numerical errors (due to the inherent approximations done by CPUs). Therefore it is generally not a good solution.

Principles of the EM algorithm

Because the latent variables are not observed, the likelihood function of such a model is a marginal distribution where we have to sum out (or integrate out) the hidden variables. Marginalization will create dependencies between the variables and make the problem complex to solve.

The EM algorithm deals with this problem essentially by filling in missing data with their expected values, given a distribution. When we iterate this process over and over, it converges to the maximum likelihood solution. This filling-in is achieved by computing the posterior probability distribution of the hidden variables given a current set of parameters and the observed variables. This is what is done in the E-step (E for Expectation). In the M-step (M for Maximization), the parameters of the model are adjusted, and we iterate again with a new E-step. We continue until the parameters converge, or until the growth of the likelihood levels off.
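As a concrete, if simplified, sketch of these two steps, here is EM for a mixture of two 1-D Gaussians; the data and starting parameters are assumptions chosen for illustration:

    import math
    import random

    random.seed(0)
    # Assumed data drawn from two latent clusters, around 0 and around 5
    data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]

    def pdf(x, m, s):
        return math.exp(-((x - m) ** 2) / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

    m1, m2, s1, s2, w1 = -1.0, 6.0, 1.0, 1.0, 0.5  # assumed initial guesses

    for _ in range(50):
        # E-step: posterior responsibility of component 1 for each point
        r = [w1 * pdf(x, m1, s1) / (w1 * pdf(x, m1, s1) + (1 - w1) * pdf(x, m2, s2))
             for x in data]
        # M-step: re-estimate the parameters from the filled-in responsibilities
        n1 = sum(r)
        n2 = len(data) - n1
        m1 = sum(ri * x for ri, x in zip(r, data)) / n1
        m2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
        s1 = math.sqrt(sum(ri * (x - m1) ** 2 for ri, x in zip(r, data)) / n1)
        s2 = math.sqrt(sum((1 - ri) * (x - m2) ** 2 for ri, x in zip(r, data)) / n2)
        w1 = n1 / len(data)

    print(m1, m2)  # the estimates approach the true means 0 and 5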

Summary

Here we saw how to compute the parameters of a graphical model by using the maximum likelihood estimation.

We should note however that this approach is not Bayesian and could be improved by setting prior distributions over the parameters of the graphical models. This could be used to include more domain knowledge and help in obtaining better estimations.

When the data is not fully observed and variables are hidden, we learned how to use the very powerful EM algorithm. For the fully observed case, a complete learning algorithm can be implemented directly, for example in R.

The most important requirement when doing machine learning is to focus on what is not working. From a dataset, any algorithm will, at some point, extract some information.

Lecture no:22, 23. &24

Date: 27/4/2020

Unit IV

Topic:

Reinforcement learning(Last topic of unit IV)

Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation. The computer employs trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward.

Although the designer sets the reward policy, that is, the rules of the game, he gives the model no hints or suggestions for how to solve the game. It's up to the model to figure out how to perform the task to maximize the reward, starting from totally random trials and finishing with sophisticated tactics and superhuman skills. By leveraging the power of search and many trials, reinforcement learning is currently the most effective way to hint at machine creativity. In contrast to human beings, artificial intelligence can gather experience from thousands of parallel gameplays if a reinforcement learning algorithm is run on a sufficiently powerful computer infrastructure.

Examples: Applications of reinforcement learning were in the past limited by weak computer infrastructure. However, as Gerald Tesauro's backgammon AI superplayer, developed in the 1990s, shows, progress did happen. That early progress is now rapidly changing with powerful new computational technologies opening the way to completely new inspiring applications.

Training the models that control autonomous cars is an excellent example of a potential application of reinforcement learning. In an ideal situation, the computer should get no instructions on driving the car. The programmer would avoid hard-wiring anything connected with the task and allow the machine to learn from its own errors. In a perfect situation, the only hard-wired element would be the reward function.

For example, in usual circumstances we would require an autonomous vehicle to put safety first, minimize ride time, reduce pollution, offer passengers comfort and obey the rules of law. With an autonomous race car, on the other hand, we would emphasize speed much more than the driver’s comfort. The programmer cannot predict everything that could happen on the road. Instead of building lengthy “if-then” instructions, the programmer prepares the reinforcement learning agent to be capable of learning from the system of rewards and penalties. The agent (another name for reinforcement learning algorithms performing the task) gets rewards for reaching specific goals.
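As a toy sketch of this reward-driven loop, here is tabular Q-learning on an assumed five-state corridor where only reaching the rightmost state yields a reward (the environment, rewards and hyperparameters are all invented for illustration):

    import random

    random.seed(1)
    n_states = 5        # states 0..4; reaching state 4 yields reward 1 (assumed toy task)
    actions = [-1, +1]  # move left or right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    alpha, gamma, epsilon = 0.5, 0.9, 0.1

    for episode in range(500):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: mostly exploit the current estimates, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            s_next = min(max(s + a, 0), n_states - 1)
            reward = 1.0 if s_next == n_states - 1 else 0.0
            # Q-learning update: nudge the estimate toward reward + discounted best future value
            best_next = max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
            s = s_next

    print([max(actions, key=lambda act: q[(s, act)]) for s in range(n_states - 1)])  # [1, 1, 1, 1]

No one tells the agent to move right; the policy emerges purely from the rewards, which is the essence of reinforcement learning.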

Another example:

deepsense.ai took part in the "Learning to run" project, which aimed to train a virtual runner from scratch. The runner is an advanced and precise musculoskeletal model designed by the Stanford Neuromuscular Biomechanics Laboratory. Teaching the agent how to run is a first step in building a new generation of prosthetic legs, ones that automatically recognize people's walking patterns and tweak themselves to make moving easier and more effective. While it is possible and has been done in Stanford's labs, hard-wiring all the commands and predicting all possible patterns of walking requires a lot of work from highly skilled programmers.

Challenges with reinforcement learning

The main challenge in reinforcement learning lies in preparing the simulation environment, which is highly dependent on the task to be performed. When the model has to go superhuman in chess, Go or Atari games, preparing the simulation environment is relatively simple. When it comes to building a model capable of driving an autonomous car, building a realistic simulator is crucial before letting the car ride on the street. The model has to figure out how to brake or avoid a collision in a safe environment, where sacrificing even a thousand cars comes at a minimal cost. Transferring the model out of the training environment and into the real world is where things get tricky. Scaling and tweaking the neural network controlling the agent is another challenge. There is no way to communicate with the network other than through the system of rewards and penalties. This in particular may lead to catastrophic forgetting, where acquiring new knowledge causes some of the old to be erased.

Yet another challenge is reaching a local optimum; that is, the agent performs the task as it is, but not in the optimal or required way. A "jumper" jumping like a kangaroo instead of doing the thing that was expected of it (walking) is a great example, and is also one that can be found in our recent blog post.

Finally, there are agents that will optimize the prize without performing the task they were designed for.

What distinguishes reinforcement learning from deep learning and machine learning?

In fact, there should be no clear divide between machine learning, deep learning and reinforcement learning. It is like a parallelogram-rectangle-square relation, where machine learning is the broadest category and deep reinforcement learning the narrowest one.

In the same way, reinforcement learning is a specialized application of machine and deep learning techniques, designed to solve problems in a particular way.

Although the ideas seem to differ, there is no sharp divide between these subtypes. Moreover, they merge within projects, as the models are designed not to stick to a "pure type" but to perform the task in the most effective way possible. So "what precisely distinguishes machine learning, deep learning and reinforcement learning" is actually a tricky question to answer.

Machine learning is a form of AI in which computers are given the ability to progressively improve the performance of a specific task with data, without being directly programmed (this is Arthur Lee Samuel's definition; he coined the term "machine learning"). There are two types: supervised and unsupervised machine learning.

Supervised machine learning happens when a programmer can provide a label for every training input into the machine learning system.

Example – by analyzing the historical data taken from coal mines, deepsense.ai prepared an automated system for predicting dangerous seismic events up to 8 hours before they occur. The records of seismic events were taken from 24 coal mines that had collected data for several months. The model was able to recognize the likelihood of an explosion by analyzing the readings from the previous 24 hours.

Some of the mines can be exactly identified by their main working height values. To obstruct the identification, we added some Gaussian noise.

From the AI point of view, a single model was performing a single task on a clarified and normalized dataset. To get more details on the story, read our blog post.

Unsupervised learning takes place when the model is provided only with the input data, but no explicit labels. It has to dig through the data and find the hidden structure or relationships within. The designer might not know what the structure is or what the machine learning model is going to find.

An example we employed was for churn prediction. We analyzed customer data and designed an algorithm to group similar customers. However, we didn’t choose the groups ourselves. Later on, we could identify high-risk groups (those with a high churn rate) and our client knew which customers they should approach first.

Another example of unsupervised learning is anomaly detection, where the algorithm has to spot the element that doesn’t fit in with the group. It may be a flawed product, potentially fraudulent transaction or any other event associated with breaking the norm.

Deep learning consists of several layers of neural networks, designed to perform more sophisticated tasks. The construction of deep learning models was inspired by the design of the human brain, but simplified. Deep learning models consist of a few neural network layers which are in principle responsible for gradually learning more abstract features about particular data.

Although deep learning solutions are able to provide marvelous results, in terms of scale they are no match for the human brain. Each layer uses the outcome of a previous one as an input and the whole network is trained as a single whole. The core concept of creating an artificial neural network is not new, but only recently has modern hardware provided enough computational power to effectively train such networks by exposing a sufficient number of examples. Extended adoption has brought about frameworks like TensorFlow, Keras and PyTorch, all of which have made building machine learning models much more convenient.

Example: deepsense.ai designed a deep learning-based model for the National Oceanic and Atmospheric Administration (NOAA). It was designed to recognize Right whales from aerial photos taken by researchers. From a technical point of view, recognizing a particular specimen of whales from aerial photos is pure deep learning. The solution consists of a few machine learning models performing separate tasks. The first one was in charge of finding the head of the whale in the photograph while the second normalized the photo by cutting and turning it, which ultimately provided a unified view (a passport photo) of a single whale.

The third model was responsible for recognizing particular whales from photos that had been prepared and processed earlier. A network composed of 5 million neurons located the blowhead bonnet-tip. Over 941,000 neurons looked for the head and more than 3 million neurons were used to classify the particular whale. That’s over 9 million neurons performing the task, which may seem like a lot, but pales in comparison to the more than 100 billion neurons at work in the human brain. We later used a similar deep learning-based solution to diagnose diabetic retinopathy using images of patients’ retinas.

Reinforcement learning, as stated above, employs a system of rewards and penalties to compel the computer to solve a problem by itself. Human involvement is limited to changing the environment and tweaking the system of rewards and penalties. As the computer maximizes the reward, it is prone to seeking unexpected ways of doing it. Human involvement is focused on preventing it from exploiting the system and motivating the machine to perform the task in the way expected. Reinforcement learning is useful when there is no "proper way" to perform a task, yet there are rules the model has to follow to perform its duties correctly. Take the road code, for example.

Example: By tweaking and seeking the optimal policy for deep reinforcement learning, we built an agent that in just 20 minutes reached a superhuman level in playing Atari games. Similar algorithms in principle can be used to build AI for an autonomous car or a prosthetic leg. In fact, one of the best ways to evaluate the reinforcement learning approach is to give the model an Atari video game to play, such as Arkanoid or Space Invaders. According to Google Brain's Marc G. Bellemare, who introduced Atari video games as a reinforcement learning benchmark, "although challenging, these environments remain simple enough that we can hope to achieve measurable progress as we attempt to solve them".

[Figures: gameplay clips of Breakout and Assault, comparing the agent's initial performance with its performance after 15 and 30 minutes of training.]

In particular, if artificial intelligence is going to drive a car, learning to play some Atari classics can be considered a meaningful intermediate milestone. A potential application of reinforcement learning in autonomous vehicles is the following interesting case. A developer is unable to predict all future road situations, so letting the model train itself with a system of penalties and rewards in a varied environment is possibly the most effective way for the AI to broaden the experience it both has and collects.

Conclusion

The key distinguishing factor of reinforcement learning is how the agent is trained. Instead of inspecting the data provided, the model interacts with the environment, seeking ways to maximize the reward. In the case of deep reinforcement learning, a neural network is in charge of storing the experiences and thus improves the way the task is performed.

Is reinforcement learning the future of machine learning?

Although reinforcement learning, deep learning, and machine learning are interconnected, none of them in particular is going to replace the others. Yann LeCun, the renowned French scientist and head of research at Facebook, jokes that reinforcement learning is the cherry on a great AI cake, with machine learning the cake itself and deep learning the icing. Without the previous iterations, the cherry would top nothing.

In many use cases, using classical machine learning methods will suffice. Purely algorithmic methods not involving machine learning tend to be useful in business data processing or managing databases.

Sometimes machine learning is only supporting a process being performed in another way, for example by seeking a way to optimize speed or efficiency.

When a machine has to deal with unstructured and unsorted data, or with various types of data, neural networks can be very useful. How machine learning improved the quality of machine translation has been described by The New York Times.

Summary

Reinforcement learning is no doubt a cutting-edge technology that has the potential to transform our world. However, it need not be used in every case. Nevertheless, reinforcement learning seems to be the most likely way to make a machine creative – as seeking new, innovative ways to perform its tasks is in fact creativity. This is already happening: DeepMind’s now famous AlphaGo played moves that were first considered glitches by human experts, but in fact secured victory against one of the strongest human players, Lee Sedol.

Thus, reinforcement learning has the potential to be a groundbreaking technology and the next step in AI development.

Lecture: 25 & 26

Date:29/4/2020

Unit:V

Topic:

Pattern Recognition | Introduction

Patterns are everywhere in this digital world. A pattern can either be seen physically or it can be observed mathematically by applying algorithms.

Example: The colours on the clothes, speech pattern etc.

Pattern recognition is the process of recognizing patterns by using machine learning algorithms. Pattern recognition can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. One of the important aspects of pattern recognition is its application potential.

Examples: Speech recognition, speaker identification, multimedia document recognition (MDR), automatic medical diagnosis.

In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. Pattern recognition involves the classification and clustering of patterns.

In classification, an appropriate class label is assigned to a pattern based on an abstraction that is generated using a set of training patterns or domain knowledge. Classification is used in supervised learning.

Clustering generates a partition of the data which helps decision making, i.e. the specific decision-making activity of interest to us. Clustering is used in unsupervised learning.

Features may be represented as continuous, discrete or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristics of the object.

Example: consider our face then eyes, ears, nose etc are features of the face.

A set of features taken together forms a feature vector. Example: in the above example of a face, if all the features (eyes, ears, nose, etc.) are taken together, then the sequence is a feature vector ([eyes, ears, nose]). A feature vector is the sequence of features represented as a d-dimensional column vector. In the case of speech, MFCCs (Mel-frequency Cepstral Coefficients) are the spectral features of the speech, and the sequence of the first 13 features forms a feature vector.

Pattern recognition possesses the following features:

A pattern recognition system should recognise familiar patterns quickly and accurately

Recognize and classify unfamiliar objects

Accurately recognize shapes and objects from different angles

Identify patterns and objects even when partly hidden

Recognise patterns quickly with ease, and with automaticity.

Training and Learning in Pattern Recognition

Learning is a phenomenon through which a system gets trained and becomes adaptable to give results in an accurate manner. Learning is the most important phase, as how well the system performs on the data provided depends on which algorithms are used on the data. The entire dataset is divided into two categories: one used in training the model, i.e. the training set, and the other used in testing the model after training, i.e. the testing set.

Training set:

Training set is used to build a model. It consists of the set of images which are used to train the system. Training rules and algorithms used give relevant information on how to associate input data with output decision. The system is trained by applying these algorithms on the dataset, all the relevant information is extracted from the data and results are obtained. Generally, 80% of the data of the dataset is taken for training data.

Testing set:

Testing data is used to test the system. It is the set of data which is used to verify whether the system is producing the correct output after being trained. Generally, 20% of the data of the dataset is used for testing. Testing data is used to measure the accuracy of the system. Example: if a system which identifies the category a particular flower belongs to classifies seven out of ten flowers correctly and the remaining three wrongly, then its accuracy is 70%.
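A minimal sketch of this 80/20 split and accuracy measurement using scikit-learn; the built-in iris data set is used purely as an example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    # 80% of the data for training, 20% held out for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = GaussianNB().fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(accuracy_score(y_test, predictions))  # fraction of test instances classified correctly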

Real-time Examples and Explanations:

A pattern is a physical object or an abstract notion. While talking about the classes of animals, a description of an animal would be a pattern. While talking about various types of balls, a description of a ball is a pattern. In the case where balls are considered as patterns, the classes could be football, cricket ball, table tennis ball, etc. Given a new pattern, the class of the pattern is to be determined. The choice of attributes and representation of patterns is a very important step in pattern classification. A good representation is one which makes use of discriminating attributes and also reduces the computational burden of pattern classification.

An obvious representation of a pattern will be a vector. Each element of the vector can represent one attribute of the pattern. The first element of the vector will contain the value of the first attribute for the pattern being considered.

Example: While representing spherical objects, (25, 1) may represent a spherical object with 25 units of weight and 1 unit of diameter. The class label can form a part of the vector. If spherical objects belong to class 1, the vector would be (25, 1, 1), where the first element represents the weight of the object, the second element the diameter, and the third element the class of the object.
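As a small sketch, such pattern vectors could be stored in a NumPy array; the second row is an assumed example, not from the text:

    import numpy as np

    # Each row: (weight, diameter, class label)
    patterns = np.array([
        [25.0,  1.0, 1],   # the spherical object from the example, class 1
        [420.0, 22.0, 2],  # an assumed second object belonging to class 2
    ])
    features = patterns[:, :2]  # the feature vectors
    labels = patterns[:, 2]     # the class labels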

Advantages:

Pattern recognition solves classification problems

Pattern recognition solves the problem of fake biometric detection.

It is useful in cloth pattern recognition for visually impaired people.

It helps in speaker diarization.

We can recognise a particular object from different angles.

Disadvantages:

The syntactic pattern recognition approach is complex to implement and is a very slow process.

Sometimes, to get better accuracy, a larger dataset is required.

It cannot explain why a particular object is recognized.

Example: my face vs my friend’s face.

Applications:

Image processing, segmentation and analysis

Pattern recognition is used to give machines human-like recognition intelligence, which is required in image processing.

Computer vision

Pattern recognition is used to extract meaningful features from given image/video samples and is used in computer vision for various applications like biological and biomedical imaging.

Seismic analysis

Pattern recognition approach is used for the discovery, imaging and interpretation of temporal patterns in seismic array recordings. Statistical pattern recognition is implemented and used in different types of seismic analysis models.

Radar signal classification/analysis

Pattern recognition and Signal processing methods are used in various applications of radar signal classifications like AP mine detection and identification.

Speech recognition

The greatest success in speech recognition has been obtained using pattern recognition paradigms. It is used in various speech recognition algorithms which try to avoid the problems of using a phoneme-level description and treat larger units, such as words, as patterns.

Finger print identification

The fingerprint recognition technique is a dominant technology in the biometric market. A number of recognition methods have been used to perform fingerprint matching, out of which pattern recognition approaches are widely used.

Lectures: 27 & 28

Date: 5/5/2020

Unit : V

Topic 1:

Pattern Recognition: Design Principles


In Pattern Recognition, a pattern comprises the following two fundamental things:

1. A collection of observations

2. The concept behind the observations

Feature Vector:

The collection of observations is also known as a feature vector. A feature is a distinctive characteristic of an object that sets it apart from similar items. A feature vector is the combination of n features as an n-dimensional column vector. Different classes may have different feature values, but patterns of the same class are described by the same set of features.

Example: [figures in the source illustrate the difference between good and bad features, and desirable feature properties]

Classifier and Decision Boundaries:

In a statistical-classification problem, a decision boundary is a hypersurface that partitions the underlying vector space into two sets. A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. A classifier is a hypothesis or discrete-valued function that is used to assign (categorical) class labels to particular data points.

A classifier is used to partition the feature space into class-labeled decision regions, while decision boundaries are the borders between decision regions.

Components in Pattern Recognition System:

A pattern recognition system can be partitioned into components. There are five typical components for various pattern recognition systems. These are as follows:

A Sensor : A sensor is a device used to measure a property, such as pressure, position, temperature, or acceleration, and respond with feedback.

A Preprocessing Mechanism: Segmentation is used here; it is the process of partitioning data into multiple segments. It can also be defined as the technique of dividing data into parts called segments.

A Feature Extraction Mechanism: Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations. It can be manual or automated.

A Description Algorithm: Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to perform "most likely" matching of the inputs, taking into account their statistical variation.

A Training Set: Training data is a certain percentage of the overall dataset, with the rest forming the testing set. As a rule, the better the training data, the better the algorithm or classifier performs.

Design Principles of Pattern Recognition

In a pattern recognition system, two basic approaches are used for recognizing a pattern or structure, and each can be implemented with different techniques. These are:

1. Statistical Approach

2. Structural Approach

Statistical Approach:

Statistical methods are mathematical formulas, models, and techniques that are used in the statistical analysis of raw research data. The application of statistical methods extracts information from research data and provides different ways to assess the robustness of research outputs.

Two main statistical methods are used :

Descriptive Statistics: It summarizes data from a sample using indexes such as the mean or standard deviation.

Inferential Statistics: It draws conclusions from data that are subject to random variation.

Structural Approach:

The structural (or syntactic) approach describes a pattern in terms of simpler sub-patterns, called primitives, and the relationships among them. A complex pattern is treated much like a sentence built from words: it is composed from primitives according to a set of rules (a grammar), and recognition amounts to checking the pattern against those rules.

Typical structural representations include:

Strings of primitives

Trees

Graphs

Grammars

Difference Between Statistical Approach and Structural Approach:

Statistical Approach | Structural Approach

1. Based on statistical decision theory. | Based on human perception and cognition.

2. Quantitative features. | Morphological primitives.

3. Fixed number of features. | Variable number of primitives.

4. Ignores feature relationships. | Captures relationships among primitives.

5. Semantics from feature position. | Semantics from primitive encoding.

6. Statistical classifiers. | Syntactic grammars.

Topic:2

Statistical Pattern Recognition:

Statistical pattern recognition is now a mature discipline which has been successfully applied in several application domains. The primary goal in statistical pattern recognition is classification, where a pattern vector is assigned to one of a finite number of classes and each class is characterized by a probability density function on the measured features. A pattern vector is viewed as a point in the multidimensional space defined by the features. Design of a recognition system based on this paradigm requires careful attention to the following issues: type of classifier (single-stage vs. hierarchical), feature selection, estimation of classification error, parametric vs. nonparametric decision rules, and utilizing contextual information. Current research emphasis in pattern recognition is on designing efficient algorithms, studying small sample properties of various estimators and decision rules, implementing the algorithms on novel computer architecture, and incorporating context and domain-specific knowledge in decision making.

Lectures: 29&30

Date: 13/5/2020

Unit V

Parameter Estimation Methods: (i) Principal Component Analysis (PCA)

With the advancements in the field of artificial intelligence and machine learning, it has become essential to understand the fundamentals behind such technologies. This article on Principal Component Analysis will help you understand the concepts behind dimensionality reduction and how it can be used to deal with high dimensional data.


Need for Principal Component Analysis (PCA)

In general, machine learning works wonders when the dataset provided for training the machine is large and concise. Usually, having a good amount of data lets us build a better predictive model since we have more data to train the machine with. However, using a large data set has its own pitfalls. The biggest pitfall is the curse of dimensionality.

It turns out that in large dimensional datasets, there might be lots of inconsistencies in the features or lots of redundant features in the dataset, which will only increase the computation time and make data processing and EDA more convoluted.

To get rid of the curse of dimensionality, a process called dimensionality reduction was introduced. Dimensionality reduction techniques can be used to filter only a limited number of significant features needed for training and this is where PCA comes in.

What Is Principal Component Analysis?

Principal components analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without loss of any important information.

The main idea behind PCA is to figure out patterns and correlations among various features in the dataset. On finding a strong correlation between different variables, a final decision is made about reducing the dimensions of the data in such a way that the significant data is still retained.

Such a process is very essential in solving complex data-driven problems that involve the use of high- dimensional data sets. PCA can be achieved via a series of steps. Let’s discuss the whole end-to-end process.

Step by Step Computation of PCA

The below steps need to be followed to perform dimensionality reduction using PCA:

Standardization of the data

Computing the covariance matrix

Calculating the eigenvectors and eigenvalues

Computing the Principal Components

Reducing the dimensions of the data set

Let’s discuss each of the steps in detail:

Step 1: Standardization of the Data

If you’re familiar with data analysis and processing, you know that missing out on standardization will probably result in a biased outcome. Standardization is all about scaling your data in such a way that all the variables and their values lie within a similar range.

Consider an example, let’s say that we have 2 variables in our data set, one has values ranging between 10-100 and the other has values between 1000-5000. In such a scenario, the output calculated by using these predictor variables is going to be biased since the variable with a larger range will have a more obvious impact on the outcome.

Therefore, standardizing the data into a comparable range is very important. Standardization is carried out by subtracting each value in the data from the mean and dividing it by the overall deviation in the data set.

It can be calculated like so:

z = (value - mean) / standard deviation

After this step, all the variables in the data are scaled across a standard and comparable scale.

Step 2: Computing the Covariance Matrix

As mentioned earlier, PCA helps to identify the correlation and dependencies among the features in a data set. A covariance matrix expresses the correlation between the different variables in the data set. It is essential to identify heavily dependent variables because they contain biased and redundant information which reduces the overall performance of the model.

Mathematically, a covariance matrix is a p × p matrix, where p represents the dimensions of the data set. Each entry in the matrix represents the covariance of the corresponding variables.

Consider a case where we have a 2-dimensional data set with variables a and b; the covariance matrix is the 2x2 matrix shown below:

| Cov(a, a)  Cov(a, b) |
| Cov(b, a)  Cov(b, b) |

In the above matrix:

Cov(a, a) represents the covariance of a variable with itself, which is nothing but the variance of the variable ‘a’

Cov(a, b) represents the covariance of the variable ‘a’ with respect to the variable ‘b’. And since covariance is commutative, Cov(a, b) = Cov(b, a)

Here are the key takeaways from the covariance matrix:

The covariance value denotes how co-dependent two variables are with respect to each other

If the covariance value is negative, it denotes that the respective variables are inversely proportional to each other

A positive covariance denotes that the respective variables are directly proportional to each other

Simple math, isn’t it? Now let’s move on and look at the next step in PCA.

Step 3: Calculating the Eigenvectors and Eigenvalues

Eigenvectors and eigenvalues are the mathematical constructs that must be computed from the covariance matrix in order to determine the principal components of the data set.

But first, let’s understand more about principal components

What are Principal Components?

Simply put, principal components are the new set of variables that are obtained from the initial set of variables. The principal components are computed in such a manner that newly obtained variables are highly significant and independent of each other. The principal components compress and possess most of the useful information that was scattered among the initial variables.

If your data set is of 5 dimensions, then 5 principal components are computed, such that, the first principal component stores the maximum possible information and the second one stores the remaining maximum info and so on, you get the idea.

Now, where do Eigenvectors fall into this whole process?

Assuming that you all have a basic understanding of Eigenvectors and eigenvalues, we know that these two algebraic formulations are always computed as a pair, i.e, for every eigenvector there is an eigenvalue. The dimensions in the data determine the number of eigenvectors that you need to calculate.

Consider a 2-Dimensional data set, for which 2 eigenvectors (and their respective eigenvalues) are computed. The idea behind eigenvectors is to use the Covariance matrix to understand where in the data there is the most amount of variance. Since more variance in the data denotes more information about the data, eigenvectors are used to identify and compute Principal Components.

Eigenvalues, on the other hand, denote the amount of variance captured along the respective eigenvectors. Together, the eigenvectors and eigenvalues therefore determine the principal components of the data set.

Step 4: Computing the Principal Components

Once we have computed the eigenvectors and eigenvalues, all we have to do is order them in descending order of eigenvalue, where the eigenvector with the highest eigenvalue is the most significant and thus forms the first principal component. The principal components of lesser significance can then be removed in order to reduce the dimensions of the data.

The final step in computing the Principal Components is to form a matrix known as the feature matrix that contains all the significant data variables that possess maximum information about the data.

Step 5: Reducing the Dimensions of the Dataset

The last step in performing PCA is to re-arrange the original data with the final principal components which represent the maximum and the most significant information of the data set. In order to replace the original data axis with the newly formed Principal Components, you simply multiply the transpose of the original data set by the transpose of the obtained feature vector.
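A compact sketch of the five steps with NumPy; the random data set is an assumption standing in for real measurements:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))  # assumed data: 100 instances, 5 variables

    # Step 1: standardize each variable
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix (np.cov expects variables in rows, hence the transpose)
    cov = np.cov(Xs.T)

    # Step 3: eigenvectors and eigenvalues of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: order components by descending eigenvalue and keep the top two
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:2]]

    # Step 5: project the standardized data onto the principal components
    X_reduced = Xs @ components
    print(X_reduced.shape)  # (100, 2)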

Lecture: 31

Date:20/5/2020

Unit V

Topic:

Linear Discriminant Analysis(LDA)

Linear Discriminant Analysis, also called Normal Discriminant Analysis or Discriminant Function Analysis, is a dimensionality reduction technique commonly used for supervised classification problems. It is used for modeling differences between groups, i.e. separating two or more classes, and it projects features from a higher-dimensional space into a lower-dimensional space.

For example, suppose we have two classes and we need to separate them efficiently. Classes can have multiple features. Using only a single feature to classify them may result in some overlapping.

So, we will keep on increasing the number of features for proper classification.

Example:

Suppose we have two sets of data points belonging to two different classes that we want to classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane, there’s no straight line that can separate the two classes of the data points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D graph in order to maximize the separability between the two classes.

Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data onto a new axis in a way to maximize the separation of the two categories and hence, reducing the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:

1. Maximize the distance between means of the two classes.

2. Minimize the variation within each class.

In simple terms, this newly generated axis increases the separation between the data points of the two classes. After generating this new axis using the above-mentioned criteria, all the data points of the classes are plotted on this new axis.

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both the classes linearly separable. In such cases, we use non-linear discriminant analysis.
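A brief sketch using scikit-learn's implementation; the iris data set is used only as an illustration of projecting onto a single discriminant axis:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)

    # Project the 4-D features onto one discriminant axis, as in the 2D -> 1D example above
    lda = LinearDiscriminantAnalysis(n_components=1)
    X_1d = lda.fit_transform(X, y)
    print(X_1d.shape)  # (150, 1)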

Extensions to LDA:

Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).

Flexible Discriminant Analysis (FDA): Non-linear combinations of inputs are used, such as splines.

Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.

Applications:

1. Face Recognition: In the field of Computer Vision, face recognition is a very popular application in which each face is represented by a very large number of pixel values. Linear discriminant analysis (LDA) is used here to reduce the number of features to a more manageable number before the process of classification. Each of the new dimensions generated is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher's linear discriminant are called Fisher faces.

2. Medical: In this field, linear discriminant analysis (LDA) is used to classify a patient's disease state as mild, moderate or severe, based upon the patient's various parameters and the medical treatment he is going through. This helps the doctors to intensify or reduce the pace of their treatment.

3. Customer Identification:

Suppose we want to identify the type of customers which are most likely to buy a particular product in a shopping mall. By doing a simple question and answers survey, we can gather all the features of the customers. Here, Linear discriminant analysis will help us to identify and select the features which can describe the characteristics of the group of customers that are most likely to buy that particular product in the shopping mall.