MLA -- Naive Bayes theory
"Priori Probability + data = Posterior probability"
Conditional probability: P(X|Y) or P(Y|X)
Joint probability: P(X,Y), P(XY) or P(X∩Y)
Marginal probability: P(X) or P(Y)
Let's look at the independence formula first. If X and Y are independent of each other, then:

$ P(X,Y) = P(X) * P(Y) $, which is equivalent to $ P(X|Y) = P(X) $.
Usually P(X|Y) != P(X) unless X and Y are independent of each other.
And the standard conditional probability formula:

$ P(X|Y) = \frac {P(X,Y)} {P(Y)} $

or equivalently:

$ P(X,Y) = P(X|Y) * P(Y) = P(Y|X) * P(X) $
Then the total probability formula:

$ P(X) = \sum_{k} P(X|Y=Y_k) * P(Y=Y_k) $
Therefore, the Bayes formula is easily derived from the above formulas:

$ P(Y|X) = \frac {P(X|Y) * P(Y)} {P(X)} = \frac {P(X|Y) * P(Y)} {\sum_{k} P(X|Y=Y_k) * P(Y=Y_k)} $
Question:
Suppose there is a latent disease in the population.
- P(disease) = 0.01
- P(well) = 0.99
Doctors can detect this disease with a test kit. The test result is either positive (not healthy) or negative (healthy). However, the detection result may also be wrong. The conditional probabilities are:
- P(positive | well) = 0.01 (healthy people detected as positive)
- P(negative | well) = 0.99
- P(positive | disease) = 0.99 (unhealthy people detected as positive)
- P(negative | disease) = 0.01

So, what is P(disease | positive)? That is, if the detection is positive (bad), what is the probability of true disease?

(Answer: P(disease | positive) = 0.5. We can conclude that P(X|Y) and P(Y|X) are totally different: P(X|Y) = P(Y|X) * P(X) / P(Y).)
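As a quick check, here is a minimal Python sketch of this calculation (the variable names are just for illustration):

```python
# Bayes rule check for the disease-test example
p_disease = 0.01
p_well = 0.99
p_pos_given_disease = 0.99
p_pos_given_well = 0.01

# Total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_well * p_well

# Posterior: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 0.5
```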
Bayesian interpretation
In the Bayesian (or epistemological) interpretation, probability measures a “degree of belief.” Bayes’ theorem then links the degree of belief in a proposition before and after accounting for evidence. For example, suppose it is believed with 50% certainty that a coin is twice as likely to land heads than tails. If the coin is flipped a number of times and the outcomes observed, that degree of belief may rise, fall or remain the same depending on the results.
For proposition A and evidence B,
- P(A), the prior, is the initial degree of belief in A.
- P(A | B), the posterior, is the degree of belief after accounting for B.
- The quotient P(B | A)/P(B) represents the support B provides for A.
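To make the "degree of belief" idea concrete, here is a minimal sketch (with my own illustrative numbers, not taken from the text above) that updates the belief that a coin is biased 2:1 toward heads versus fair, after observing a few flips:

```python
# Two hypotheses about a coin: fair (P(heads) = 0.5) or biased (P(heads) = 2/3)
# Prior belief: 50% for each hypothesis
prior_biased = 0.5
prior_fair = 0.5

flips = ["H", "H", "T", "H"]  # observed evidence (illustrative)

# Likelihood of the observed flips under each hypothesis
def likelihood(p_heads, flips):
    result = 1.0
    for f in flips:
        result *= p_heads if f == "H" else (1 - p_heads)
    return result

lik_biased = likelihood(2 / 3, flips)
lik_fair = likelihood(0.5, flips)

# Posterior via Bayes' theorem (denominator is the total probability of the evidence)
evidence = lik_biased * prior_biased + lik_fair * prior_fair
posterior_biased = lik_biased * prior_biased / evidence
print(round(posterior_biased, 3))  # belief in the biased hypothesis after the flips
```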
In theory, the probabilistic classification model is a conditional probability model: P(C|F1,...,Fn)
Based on the Bayes formula:

$ P(C|F_1,...,F_n) = \frac {P(F_1,...,F_n|C) * P(C)} {P(F_1,...,F_n)} $
P(C) is the prior probability
P(C|F1,...,Fn) is the posterior probability
C represents any group (class) of the classifier.
(F1,...,Fn) represents the features of the data that we need to predict.
We can understand it this way: when we do not yet know any features of the sample to be predicted, we first take the probability that the sample belongs to a certain category to be P(C); then, after observing the features of the sample, we multiply by $ \frac {P(F1,...,Fn|C)} {P(F1,...,Fn)} $ to get the sample's classification prediction.
The denominator is P(F1,...,Fn), which we can view as a constant, since it does not depend on C and its value is the same for every group's probability prediction. Therefore, what we need to care about is the numerator P(F1,...,Fn|C) * P(C), which is equivalent to the joint distribution model P(C,F1,...,Fn).
The Naive Bayes model makes a bold assumption here that ALL FEATURES (F1,...,Fn) are conditionally independent of each other given the class, so that we can write:

$ P(F_1,...,F_n|C) = \prod_{i=1}^n P(F_i|C) $

Then the numerator P(F1,...,Fn|C) * P(C) is equivalent to:

$ P(C) * \prod_{i=1}^n P(F_i|C) $
Suppose there are K groups for classification: C1, C2, ..., CK. Then the prediction formula of the Naive Bayes classifier is:

$ C_{result} = \arg\max_{C_k} P(C=C_k) * \prod_{i=1}^n P(F_i=f_i|C=C_k) $
We calculate all the probabilities that the predicted data belong to each group, and then, by comparison, classify the data into the group with the highest probability (the most probable group).
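A minimal sketch of this decision rule in Python (assuming the prior and per-feature conditional probabilities are already available as plain dictionaries; all names and numbers are illustrative):

```python
# Naive Bayes decision rule: pick the class with the largest P(C) * prod_i P(F_i|C)
priors = {"spam": 0.4, "ham": 0.6}  # P(C=c), illustrative numbers
# cond_probs[c][i][value] = P(F_i = value | C = c), illustrative numbers
cond_probs = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.7, "no": 0.3}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.3, "no": 0.7}],
}

def predict(features):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(features):
            score *= cond_probs[c][i][value]
        scores[c] = score
    # The denominator P(F_1,...,F_n) is the same for every class, so comparing
    # the unnormalized scores is enough.
    return max(scores, key=scores.get), scores

print(predict(["yes", "no"]))
```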
Naive Bayes parameter estimation
In the previous section we derived the formula: we only need to compare which group has the largest $ P(C = c)* \prod_{i=1}^n P(F_i=f_i|C=c) $. In this section, we discuss how to estimate these two kinds of probability.
P(C=Ck) is easy: by maximum likelihood estimation it is simply the frequency of occurrence of Ck. If the total number of samples is M and the number of occurrences of Ck is Mk, then P(C=Ck) = Mk/M. Therefore, the prior probability P(C=Ck) depends heavily on the selection of the overall sample. If the dataset is not big enough, or the proportion of each category is uneven, it will greatly affect the prediction of the Naive Bayes classifier.
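A small sketch of this maximum likelihood estimate of the prior (the labels are illustrative):

```python
from collections import Counter

labels = ["A", "B", "A", "A", "B", "C"]  # class label of each training sample (illustrative)
M = len(labels)
counts = Counter(labels)  # M_k for each class

# P(C = C_k) = M_k / M
priors = {c: counts[c] / M for c in counts}
print(priors)  # {'A': 0.5, 'B': 0.333..., 'C': 0.166...}
```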
P(Fi=fi | C=Ck) is a bit more complicated; it depends on the data type of Fi.
- If Fi is a discrete value: multinomial distribution.

  $ P(F_i=f_i|C=C_k) = \frac {M_{kf_i}} {M_k} $

  where $ M_{kf_i} $ is the number of occurrences of fi among the Mk samples of class Ck. Sometimes $ M_{kf_i} $ may be 0, which will affect the estimation of the posterior probability. The solution is to apply Laplace smoothing:

  $ P(F_i=f_i|C=C_k) = \frac {M_{kf_i} + \lambda} {M_k + M_{f_i}\lambda} $

  where λ is a constant larger than 0, usually 1, and $ M_{f_i} $ is the number of possible values of fi.
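A small sketch of the smoothed estimate for one discrete feature (the data and value names are illustrative):

```python
from collections import Counter

# Values of one discrete feature F_i among the M_k training samples of class C_k (illustrative)
feature_values_in_class = ["red", "red", "blue", "red", "green"]
possible_values = ["red", "blue", "green", "yellow"]  # all values F_i can take (M_fi = 4)

lam = 1  # Laplace smoothing constant λ
M_k = len(feature_values_in_class)
M_fi = len(possible_values)
counts = Counter(feature_values_in_class)

# P(F_i = f_i | C = C_k) = (M_kfi + λ) / (M_k + M_fi * λ)
smoothed = {v: (counts[v] + lam) / (M_k + M_fi * lam) for v in possible_values}
print(smoothed)  # "yellow" never occurs but still gets a non-zero probability
```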
- If Fi is a very sparse discrete value: Bernoulli distribution.

  $ P(F_i=f_i|C=C_k) = P(F_i=1|C=C_k)f_i + (1 - P(F_i=1|C=C_k))(1 - f_i) $

  where fi is 0 or 1: if the feature appears in the sample, fi is 1, otherwise 0. If Fi is a very sparse discrete value, that is, the occurrence probability of each feature value is very low, then we can assume Fi follows a Bernoulli distribution. We only pay attention to whether fi appears, not the number of occurrences.
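A tiny sketch of the Bernoulli estimate for one binary feature (the data is illustrative):

```python
# Presence/absence of one sparse feature F_i in the M_k samples of class C_k (illustrative)
appearances = [1, 0, 0, 1, 0, 0, 0, 0]  # 1 if the feature appears in the sample, else 0

# P(F_i = 1 | C = C_k) estimated as the fraction of samples where the feature appears
p1 = sum(appearances) / len(appearances)

def bernoulli_prob(fi):
    # P(F_i = f_i | C = C_k) = p1 * f_i + (1 - p1) * (1 - f_i)
    return p1 * fi + (1 - p1) * (1 - fi)

print(bernoulli_prob(1), bernoulli_prob(0))  # 0.25 0.75
```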
- If Fi is a continuous value, we usually assume Fi's conditional probability is normally distributed:

  $ P(F_i=f_i|C=C_k) = \frac {1} {\sqrt{2\pi\sigma_k^2}} e^{-\frac {(f_i-\mu_k)^2} {2\sigma_k^2}} $

  where $ \mu_k $ and $ \sigma_k^2 $ are the mean and variance of Fi over the samples of class Ck.
```python
# Continuous Data Code
import math

pi = math.pi
e = math.e

def probabilityDensity(mu, sig2, v):
    # Gaussian probability density of value v, given mean mu and variance sig2
    res = 1 / math.sqrt(2 * pi * sig2) * e ** (-(v - mu) ** 2 / (2 * sig2))
    return format(res, '.4e')
```
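For example, using the probabilityDensity function above with the sample mean and variance of Fi within class Ck (the numbers are illustrative):

```python
# Sample mean and variance of feature F_i within class C_k (illustrative)
values = [5.1, 4.9, 5.4, 5.0, 5.2]
mu = sum(values) / len(values)
sig2 = sum((v - mu) ** 2 for v in values) / len(values)

# Density of a new observed value under the class-conditional Gaussian
print(probabilityDensity(mu, sig2, 5.3))
```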
Naive Bayes algorithm process
- Step one: Input the prior probability P(C=Ck). If P(C=Ck) is null, estimate it from the training data (with Laplace smoothing):

  $ P(C=C_k) = \frac {M_k + \lambda} {M + K\lambda} $

- Step two: Calculate the conditional probability of each feature for the k-th category.

  - Condition 1: Feature i is a discrete value:

    $ P(F_i=f_i|C=C_k) = \frac {M_{kf_i} + \lambda} {M_k + M_{f_i}\lambda} $

  - Condition 2: Feature i is a sparse discrete value:

    $ P(F_i=f_i|C=C_k) = P(F_i=1|C=C_k)f_i + (1 - P(F_i=1|C=C_k))(1 - f_i) $

  - Condition 3: Feature i is a continuous value:

    $ P(F_i=f_i|C=C_k) = \frac {1} {\sqrt{2\pi\sigma_k^2}} e^{-\frac {(f_i-\mu_k)^2} {2\sigma_k^2}} $

- Step three: For the instance prediction features (pf1,...,pfn), calculate the posterior probability of each category separately:

  $ P(C=C_k) * \prod_{i=1}^n P(F_i=pf_i|C=C_k), \quad k = 1,2,...,K $

- Determine the classification of the prediction data (pf1,...,pfn):

  $ C_{result} = \arg\max_{C_k} P(C=C_k) * \prod_{i=1}^n P(F_i=pf_i|C=C_k) $
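Putting the steps together, here is a minimal end-to-end sketch for discrete features with Laplace smoothing (the tiny dataset and all names are purely illustrative, not from the text above):

```python
from collections import Counter, defaultdict

# Tiny illustrative training set: each row is (features, class label)
train = [
    (["sunny", "hot"], "no"),
    (["sunny", "mild"], "no"),
    (["rainy", "mild"], "yes"),
    (["rainy", "cool"], "yes"),
    (["sunny", "cool"], "yes"),
]
lam = 1  # Laplace smoothing constant λ

M = len(train)
classes = Counter(c for _, c in train)                              # M_k per class
K = len(classes)
n_features = len(train[0][0])
values = [set(x[i] for x, _ in train) for i in range(n_features)]   # possible values per feature

# Step one: smoothed priors P(C=C_k) = (M_k + λ) / (M + Kλ)
priors = {c: (classes[c] + lam) / (M + K * lam) for c in classes}

# Step two: smoothed conditionals P(F_i=f_i | C=C_k) = (M_kfi + λ) / (M_k + M_fi * λ)
cond = defaultdict(dict)
for c in classes:
    rows = [x for x, label in train if label == c]
    for i in range(n_features):
        counts = Counter(x[i] for x in rows)
        cond[c][i] = {v: (counts[v] + lam) / (len(rows) + len(values[i]) * lam)
                      for v in values[i]}

# Steps three and four: score every class and take the argmax
def predict(pf):
    scores = {}
    for c in classes:
        score = priors[c]
        for i, v in enumerate(pf):
            # unseen values still get the smoothed (λ / (M_k + M_fi * λ)) probability
            score *= cond[c][i].get(v, lam / (classes[c] + len(values[i]) * lam))
        scores[c] = score
    return max(scores, key=scores.get), scores

print(predict(["rainy", "hot"]))
```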
Pros and cons Conclusion
Advantages of Naive Bayes:
- The naive Bayesian model originates from classical mathematical theory and has a stable classification efficiency.
- Performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training; in particular, when the amount of data exceeds memory, we can train incrementally batch by batch.
- Less sensitive to missing data, and the algorithm is relatively simple, often used for text classification.
Disadvantages of Naive Bayes:
- In theory, the Naive Bayes model has the smallest error rate compared with other classification methods. However, this is not always the case, because when producing an output category the model assumes that the attributes are mutually independent, and this assumption is often not true in practice. When the correlation between attributes is large, the classification effect is poor; when the attribute correlation is small, Naive Bayes performs best. For this reason, there are algorithms such as semi-naive Bayes that moderately improve on this by considering partial dependencies between attributes.
- We need to know the prior probability, and the prior often depends on assumptions; there can be many possible prior models, so in some cases the prediction effect is poor because of the assumed prior.
- Since the classification is decided through the posterior probability determined by the prior and the data, the classification decision has a certain error rate.
- Very sensitive to the representation of the input data: discrete, sparse discrete, or continuous data.