Naive Bayes: Helping AI Product Managers 'Move Fast and Iterate Quickly'

baoshi.rao wrote:
    Many people have encountered Bayes' theorem, but what makes this seemingly abstract mathematical concept so appealing to AI product managers?

    We often encounter scenarios like this: When chatting with a friend, you might not initially know what they're going to say, but after they say one sentence, you can guess what they'll say next. The more information the friend provides, the better we can infer their intended meaning. This is precisely the reasoning approach described by Bayes' theorem.

    Bayes' theorem is widely applicable because it aligns with the natural way humans understand things.

    We aren't born knowing the underlying principles of everything. Most of the time, we face situations with insufficient or uncertain information. In such cases, we can only make decisions based on limited resources and adjust them as events unfold.

    Bayesian classification is a general term for a class of classification algorithms, all of which are based on Bayes' theorem and assume 'feature independence.' Naive Bayes classification is the most common method under Bayesian classification and is also one of the most classic machine learning algorithms.

    It is direct and efficient in solving problems across many scenarios, making it widely applicable in fields like spam filtering, text classification, and spell-checking. For product managers, the Naive Bayes classifier is an excellent entry point for studying natural language processing problems.

    Naive Bayes classification is a very simple algorithm, primarily because its approach is straightforward: for a given item to be classified, it calculates the probability of the item belonging to each category, given the item's observed features, and assigns the item to the category with the highest probability.

    For example, if you see a dark-skinned foreigner on the street and are asked to guess their origin, you'd most likely say Africa, as the majority of dark-skinned people are from Africa. While they could also be from the Americas or Asia, in the absence of other information, we choose the category with the highest probability—this is the core idea of Naive Bayes.

    It's worth noting that Naive Bayes classification isn't random guessing or without theoretical basis. It is a classification algorithm grounded in Bayes' theorem and the assumption of feature independence.

    To understand the algorithm's principles, one must first grasp 'feature independence' and 'Bayes' theorem,' which in turn involve concepts like 'prior probability,' 'posterior probability,' and 'conditional probability.'

    There are quite a few concepts here, but they are relatively easy to understand. We'll explain them one by one.

    Feature independence is the foundation of Naive Bayes classification, meaning each feature in the sample is assumed to be unrelated to others.

    For instance, in predicting credit card defaults, we might consider features like monthly income, credit limit, and property ownership. While seemingly unrelated, these features often have underlying connections. Generally, banks grant higher credit limits to higher-income clients, who are also more likely to own property. Thus, these features are in fact interdependent.

    However, in Naive Bayes, we ignore these relationships and treat monthly income, property ownership, and credit limit as entirely independent features.
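
    To make this concrete, here is a minimal sketch in Python of the Naive Bayes decision rule. The categories ("default" / "no_default"), the feature names, and every probability below are made-up illustrations, not real data; in practice the prior and conditional probabilities would be estimated from a historical dataset.

    ```python
    # Minimal Naive Bayes sketch (illustrative numbers only).
    priors = {"default": 0.2, "no_default": 0.8}  # P(category)

    # P(feature | category), assumed independent of one another given the category.
    likelihoods = {
        "default":    {"income=low": 0.7, "owns_property=no": 0.6, "limit=high": 0.2},
        "no_default": {"income=low": 0.3, "owns_property=no": 0.4, "limit=high": 0.5},
    }

    def naive_bayes_score(features, category):
        """Score = P(category) * product of P(feature | category)."""
        score = priors[category]
        for f in features:
            score *= likelihoods[category][f]
        return score

    observed = ["income=low", "owns_property=no", "limit=high"]
    scores = {c: naive_bayes_score(observed, c) for c in priors}
    print(scores)                       # unnormalized score for each category
    print(max(scores, key=scores.get))  # pick the category with the highest score
    ```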

    Next, we’ll focus on explaining 'theoretical probability,' 'conditional probability,' and the difference between 'prior probability' and 'posterior probability.'

    Let’s start with a small experiment.

    Suppose we toss a fair coin. Theoretically, since the coin is fair, the probability of landing heads or tails is 50%. This probability doesn’t change with more tosses—even if heads come up 10 times in a row, the next toss still has a 50% chance of being heads.

    In practice, if we toss the coin 100 times, the counts of heads and tails won't always be exactly 50-50. It might be 40-60 or 35-65. Only after a very large number of tosses will the observed frequency of heads converge toward 50%.

    Thus, the '50% probability of heads or tails' refers to the theoretical probability, achievable only with infinite tosses. Even if the first five tosses are heads, the sixth toss still has a 50% chance of being tails.
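
    To see the gap between theoretical and observed probability, here is a short simulation sketch using Python's random module (the seed and toss counts are arbitrary choices): the observed frequency of heads only drifts toward 0.5 as the number of tosses grows, while the theoretical probability stays 0.5 throughout.

    ```python
    import random

    random.seed(42)  # fixed seed so the run is reproducible

    for n in (10, 100, 1000, 100000):
        heads = sum(random.random() < 0.5 for _ in range(n))
        print(f"{n:>6} tosses: observed frequency of heads = {heads / n:.3f}")
    # The observed frequency converges toward the theoretical 0.5 only for large n.
    ```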

    However, in reality, people who’ve tossed coins might feel that after five consecutive heads, the next toss is very likely to be tails. How likely? Is there a way to calculate this actual probability?

    To solve this, mathematician Thomas Bayes devised a method to calculate the probability of an event occurring under known conditions. This method requires an initial subjective 'prior probability,' which is adjusted based on subsequent observations. With more adjustments, the probability becomes more accurate.
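
    As a sketch of that adjustment process (the two hypotheses and all numbers are assumptions made up for illustration, not taken from this post): start with a subjective prior over "the coin is fair" versus "the coin is biased toward heads", then apply Bayes' theorem after each observed head. The posterior after one observation becomes the prior for the next, and the estimate sharpens as evidence accumulates.

    ```python
    # Illustrative sequential Bayesian updating with made-up hypotheses and numbers.
    prior = {"fair": 0.9, "biased": 0.1}    # initial subjective prior
    p_heads = {"fair": 0.5, "biased": 0.9}  # P(heads | hypothesis)

    for toss in range(1, 6):  # observe five heads in a row
        evidence = sum(prior[h] * p_heads[h] for h in prior)          # P(heads)
        prior = {h: prior[h] * p_heads[h] / evidence for h in prior}  # posterior -> new prior
        print(f"after {toss} heads: P(fair) = {prior['fair']:.3f}, "
              f"P(biased) = {prior['biased']:.3f}")
    ```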

    How does this work?

    Let’s use a subway example. Shenzhen Metro Line 1 has 18 stations from Chegongmiao to the terminal. Every morning, Xiao Lin travels five stops from Chegongmiao to Gaoxinyuan for work.

    One morning during rush hour, Xiao Lin couldn’t see or hear the station announcements due to the crowd and his headphones, so he didn’t know if the train had arrived at Gaoxinyuan.

    If he exited at the next stop, the theoretical probability of it being Gaoxinyuan was only 1/18—very low. But then he spotted a colleague exiting. Xiao Lin reasoned that, although he didn’t know the colleague’s destination, during rush hour, the colleague was more likely heading to work. Using this information, he followed and correctly exited at Gaoxinyuan—this is the reasoning behind Bayes' theorem.

    In probability and statistics, Bayes' theorem describes the likelihood of an event based on prior knowledge of related conditions.

    For example, if cancer incidence is age-related, Bayes' theorem allows us to more accurately estimate the probability of someone having cancer based on their age. In other words, Bayes' theorem calculates the probability of one event given the probability of another.

    Mathematically, Bayes' theorem can be expressed as:

    P(B|A) = P(A|B) × P(B) / P(A)

    This formula links two conditional probabilities through the joint probability: P(A and B) = P(A|B) × P(B) = P(B|A) × P(A), and dividing both sides by P(A) gives the expression above. So if we know P(A|B), P(B), and P(A), we can calculate P(B|A).

    Thus, Bayes' theorem essentially explains the following, as shown in the diagram:

    [Diagram]

    A Venn diagram can help visualize Bayes' theorem:

    [Diagram]

    In Xiao Lin’s case, seeing his colleague exit during rush hour provided new information. Like in the diagram, knowing the black dot is in region A (which overlaps heavily with region B) increases the probability that it’s also in B. The result we want is P(B|A)—the probability of the event occurring given existing factors.

    Using this probability, we can make targeted decisions. To calculate P(B|A), we need P(B), P(A|B), and P(A), but P(A) can be challenging to determine.

    Upon closer consideration, P(A) does not depend on which value of B we are considering: whichever value B takes, P(A) is the same fixed denominator. Calculating the exact value of P(A) therefore won't change the relative magnitudes of the results, so we can set its specific value aside.

    Assume P(A) takes the value m, and that B can take one of three values b1, b2, or b3, with P(b1) = x, P(b2) = y, P(b3) = z and P(A|b1) = o, P(A|b2) = p, P(A|b3) = q.

    When we calculate P(B|A) for each value of B, Bayes' theorem gives:

    P(b1|A) = ox/m, P(b2|A) = py/m, P(b3|A) = qz/m

    Since the sum of P(b1|A), P(b2|A), and P(b3|A) must equal 1, it follows that ox + py + qz = m. Even if we don't know the value of m in advance, it doesn't matter: ox, py, and qz can each be calculated, and their sum gives us m. The remaining task is to estimate P(B) and P(A|B), both of which come from the dataset we have.
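
    In practice this means we can compute the unnormalized scores ox, py, and qz and then normalize them so they sum to 1, without knowing P(A) up front. A small sketch with made-up values for x, y, z and o, p, q:

    ```python
    # Made-up values: P(b1), P(b2), P(b3) and P(A|b1), P(A|b2), P(A|b3).
    p_b = {"b1": 0.5, "b2": 0.3, "b3": 0.2}              # x, y, z
    p_a_given_b = {"b1": 0.10, "b2": 0.40, "b3": 0.25}   # o, p, q

    unnormalized = {b: p_a_given_b[b] * p_b[b] for b in p_b}  # ox, py, qz
    m = sum(unnormalized.values())                            # recovers P(A)
    posterior = {b: v / m for b, v in unnormalized.items()}   # P(b1|A), P(b2|A), P(b3|A)

    print(m)          # P(A) = ox + py + qz
    print(posterior)  # sums to 1; the ranking never depended on knowing m in advance
    ```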

    There's an interesting historical note about Bayesian methods: after Bayes' theorem was first published, these methods were largely ignored for nearly 200 years.

    Classical statistics at that time could adequately solve simple probability problems with objective explanations. Moreover, compared to Bayesian methods which required subjective judgment, people were more willing to accept classical statistics based on objective facts - they preferred the notion that a coin toss would always have a 50% chance of landing heads or tails regardless of how many times it was flipped.

    However, there are many complex problems in life where probabilities cannot be predetermined, such as typhoon occurrences or earthquake patterns. Classical statistics often fails with complex problems due to insufficient sample data, making it impossible to infer overall patterns. We can't simply say there's always a 50% chance of a typhoon occurring each day, with only two possible outcomes: it comes or it doesn't.

    Data sparsity meant Bayes' theorem repeatedly ran into practical obstacles. It wasn't until the rapid development of modern computing made large-scale data computation feasible that Bayesian methods regained attention.

    Some readers might wonder: while Bayes' theorem simulates human thought processes, what kind of problems can it actually help us solve? Let's examine a classic example that's almost always mentioned when discussing Bayes' theorem.

    In disease testing, suppose a certain disease has an infection rate of 0.1% in the general population. A hospital's current testing technology has 99% accuracy for this disease. This means that if someone is known to be infected, there's a 99% chance of testing positive; while a healthy person has a 99% chance of testing normal. If we randomly select someone from the population for testing and the hospital reports a positive result, what is the actual probability that this person has the disease?

    Many readers might immediately say "99%". But the actual probability is much lower because many confuse prior probability with posterior probability.

    If we let A represent the person having the disease and B represent a positive test result, then P(B|A)=99% means "the probability of testing positive given that the person has the disease". But what we're asking is "for this randomly selected person, given a positive test result, what's the probability they have the disease?" - that is, P(A|B). Through calculation, we find P(A|B)=9%. So even with a positive test result, the actual probability of having the disease is less than 10%, with a high chance of a false positive. Therefore, follow-up tests and additional information are needed for more confident diagnosis.
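
    A quick worked calculation (a Python sketch using only the numbers given above) shows where the roughly 9% figure comes from:

    ```python
    p_disease = 0.001           # prior: infection rate in the population
    p_pos_given_disease = 0.99  # P(positive | disease)
    p_pos_given_healthy = 0.01  # P(positive | healthy) = 1 - 0.99

    # Total probability of a positive result, P(B)
    p_positive = (p_pos_given_disease * p_disease
                  + p_pos_given_healthy * (1 - p_disease))

    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
    print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # about 0.090, i.e. roughly 9%
    ```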

    This example shows how we often confuse prior and posterior probabilities in daily life, leading to incorrect judgments. Bayes' theorem helps clarify the logical relationship between these conditional probabilities and provides more accurate probability estimates.

    In fact, the core idea of this theorem offers significant inspiration for product managers' thinking:

    First, we need to clearly identify what constitutes prior probability and posterior probability in our requirements scenarios, avoiding being misled by superficial data;

    Second, we can use Bayes' theorem to build a thinking framework - one that requires continuous adjustment of our understanding of things, forming relatively stable and correct views only after a series of new evidence emerges.

    When new ideas emerge, most of the time we can only roughly judge a product's viability based on experience, with no clear prediction of its market response.

    Therefore, we often need to experiment, creating simple versions to quickly validate our ideas in the market; then continuously seek "event B" to gradually increase the new product's success rate - this is how our products can potentially succeed.

    Thus, "small steps, fast iteration" is indeed the best approach to improve fault tolerance.
