From Algorithms to Products: The Evolution of NLP Technology Applications
This article reviews the development of NLP in recent years and traces the evolution of NLP technology applications through the two phases of a project's implementation.
The case shared here is an NLP project, covered in three parts: the development of NLP, the project narrative, and lessons learned.
Discussing the development of NLP helps in better understanding the technology and sets the stage for the project. The 'Lessons Learned' section summarizes the author's takeaways from the entire project.
The author does not have a formal computer science background, so their understanding of theoretical knowledge may not be deep and could contain inaccuracies. Feedback is welcome.
The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics (Wikipedia).
Using language, we can precisely describe thoughts and facts in our minds, express our emotions, and communicate with friends.
The underlying state of computers consists of only two possibilities: 0 and 1.
So, can machines understand human language?
The history of NLP has gone through two phases. The first phase was dominated by the 'bird-flying' approach, and the second phase by the 'statistical' approach.
Let's delve into the differences between these two phases:
Phase 1: The academic understanding of natural language processing was that for a machine to perform tasks like translation or speech recognition—tasks only humans could do—it first needed to understand natural language. Achieving this required the computer to possess intelligence similar to humans. This methodology was called the 'bird-flying' approach, akin to observing how birds fly to build an airplane.
Phase 2: Today, machine translation has improved significantly and is used by hundreds of millions of people. The achievements in NLP are largely based on mathematics, more precisely, statistics.
The transition from Phase 1 to Phase 2 occurred around 1970, driven by Frederick Jelinek and his team at IBM's Watson Laboratory (those interested in IBM's Watson Laboratory can read Wu Jun's book On Top of Tides, 《浪潮之巅》, for a detailed account).
Today's NLP applications are all based on statistics. So, what are the current applications of NLP?
NLP is widely used in areas such as knowledge graphs, intelligent Q&A, and machine translation.
Note: Specific details in the project description have been omitted.
The client is a tech company providing a financial investment database. Among its product lines, there is one called the 'People Database,' which includes investor and founder databases.
The service I provided was for these two product lines. Since the project primarily focused on the career information of relevant individuals, it was codenamed 'Career Information Extraction.'
The career information to be extracted consisted of five parts: education, work experience, investment cases, entrepreneurial experience, and honors received.
Project metrics included algorithm metrics and engineering metrics.
2.2.1 Algorithm Metrics
At the algorithm level, the metrics used were Recall and Precision. To ensure everyone is familiar with these metrics, let's review them.
First, let's understand the confusion matrix. A confusion matrix counts the number of observations a classification model correctly and incorrectly classifies, presenting the results in a table. Each row represents the predicted class, and each column represents the actual class.
The confusion matrix helps us visually see if the system is confusing two classes.
Here's an example of a confusion matrix:
0 represents Negative, and 1 represents Positive.
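The matrix itself appears as an image in the original, so as a stand-in, here is a minimal Python sketch that builds a 2x2 confusion matrix from hypothetical labels, following the article's convention (rows = Predicted class, columns = Actual class):

```python
# Hypothetical toy labels: 1 = Positive, 0 = Negative
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# matrix[predicted][actual]: rows = Predicted class, columns = Actual class
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[p][a] += 1

print("          actual=0  actual=1")
print(f"pred=0       {matrix[0][0]}         {matrix[0][1]}")   # TN and FN
print(f"pred=1       {matrix[1][0]}         {matrix[1][1]}")   # FP and TP
```

Reading the printout, the diagonal cells (pred=0/actual=0 and pred=1/actual=1) are the correct classifications; the off-diagonal cells are where the model confuses the two classes.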
In addition, we need to understand three other metrics: Recall, Precision, and F1.
Now that we've covered these basic concepts for evaluating algorithm models, let's look at the project's metrics.
The algorithm metrics for the model were: Recall 90%, Precision 60%.
A thought question: why Recall 90% and Precision 60%? Why not specify F1 instead, say F1 = 72%? (With Recall at 90% and Precision at 60%, F1 works out to 72%.)
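Since F1 is the harmonic mean of Precision and Recall, the arithmetic behind that claim is easy to verify with a quick sketch:

```python
def f1(recall, precision):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.90, 0.60), 2))  # 0.72
```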
To answer these questions, we must start from the business perspective. Remember this (recite it three times if you must): metrics must always be defined from the business perspective.
Let's take an extreme example: If a model achieves Recall 90 and Precision 90, can we say the metrics are excellent?
I believe in most scenarios, this model performance would be outstanding. Note that I said 'most.' So, in which scenarios would it not be?
For example, cancer detection.
Suppose you're preparing a 'cancer detection' project. For each test subject, there are two possible outcomes: malignant (Positive, 1) or benign (Negative, 0).
Your colleague tells you the good news: the model's accuracy on the test set is 99%. Sounds great, but as a meticulous AI PM, you decide to review the test set yourself.
The test set has been labeled by medical professionals. Here's the actual situation of your test set: 1,000,000 samples in total, of which 1,000 are actually malignant and 999,000 are benign.
With this data (our model's Ground Truth) in hand, let's look at the Predicted Result. Recall the confusion matrix's two conventions: rows represent the Predicted class, and columns represent the Actual class. Here's the matrix:
- Predicted 1, Actually 1 (True Positive, TP): 990
- Predicted 1, Actually 0 (False Positive, FP): 9,990
- Predicted 0, Actually 1 (False Negative, FN): 10
- Predicted 0, Actually 0 (True Negative, TN): 989,010
Applying what we've learned:
You might feel a bit dizzy here, so let me summarize: the model's correct judgments are the True Positives and True Negatives (cases 1 and 3), and its incorrect judgments are the False Positives and False Negatives (cases 2 and 4).
We want the model to effectively distinguish whether medical images are malignant or not. Effective distinction means TP and TN; everything else is a misjudgment.
Now, let's revisit the model's performance.
When the colleague says the model's accuracy is 99%, what exactly does she mean? Let's analyze carefully:
- Is she referring to Precision?
Precision answers the question: of the samples the model predicted as 1, what proportion are actually 1? The formula is Precision = TP / (TP + FP).
In our scenario, Precision = 990 / (990 + 9,990) = 990 / 10,980 ≈ 0.09 = 9%.
- Is she referring to Recall?
Recall answers the question: of the samples that are actually 1, what proportion did the model find? The formula is Recall = TP / (TP + FN).
In this scenario, Recall = 990 / (990 + 10) = 990 / 1,000 = 0.99 = 99%.
- Is she referring to Accuracy?
Accuracy answers the question: of all samples, what proportion did the model classify correctly? The formula is Accuracy = (TP + TN) / Total.
In this scenario, Accuracy = (990 + 989,010) / 1,000,000 = 0.99 = 99%.
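Plugging the counts from the cancer-detection example into these three formulas, a quick check in Python:

```python
# Counts from the cancer-detection example above
TP = 990        # malignant, predicted malignant
FP = 9_990      # benign, predicted malignant
FN = 10         # malignant, predicted benign
TN = 989_010    # benign, predicted benign

precision = TP / (TP + FP)                   # 990 / 10,980
recall    = TP / (TP + FN)                   # 990 / 1,000
accuracy  = (TP + TN) / (TP + FP + FN + TN)  # 990,000 / 1,000,000

print(f"precision = {precision:.2%}")  # about 9%
print(f"recall    = {recall:.2%}")     # 99.00%
print(f"accuracy  = {accuracy:.2%}")   # 99.00%
```

The imbalance is the whole story here: with only 1,000 positives in a million samples, a model can be 99% accurate while getting nine out of ten positive predictions wrong.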
From these metrics, we can see that our algorithm model has high Recall and high Accuracy but low Precision.
Our algorithm's Precision is only 9%. Does this mean our model is terrible?
Not necessarily. In this case, Recall is more important than Precision. So, even with a Precision of 9%, a Recall of 99% is actually an ideal performance. Because missing a cancer diagnosis is something no one wants.
When is Recall more important than Precision?
Recall becomes crucial when False Negatives (FN) incur significant costs, as in the cancer-detection example above: a missed malignant case can cost a life, so we accept more false alarms in order to catch nearly every true case.

When is Precision more important than Recall?
Precision becomes crucial when False Positives (FP) incur significant costs. For example, in email spam detection, where spam is 1 and normal mail is 0, a high FP rate means many normal emails end up in the spam folder, which is highly undesirable.
Let's revisit our business metrics: Recall 90% and Precision 60%. Why were these values chosen? It all comes down to the business context. When evaluating customer AI needs with my team, a crucial step is understanding how they currently perform the task without machines: their judgment criteria and their operational procedures.
Only with this domain knowledge can we design appropriate AI solutions. For extracting biographical information, the client's operations team manually reads articles, identifies relevant details, extracts them, and performs secondary processing.
Their key focus is secondary processing—integrating biographical information. Recall isn't their top priority because new articles constantly provide fresh material. However, efficiently locating relevant biographical details in lengthy articles (thousands of words) is critical for productivity.
Efficiency drives our algorithm metrics: higher recall directly improves operational efficiency. Why not simply specify F1 = 72%? Because while Recall 90% and Precision 60% do yield F1 = 72%, so do other trade-offs (for example, Recall 60% with Precision 90%), and those would not meet the actual business need.
2.2.2 Performance Metrics
Delivered via API:
- Handles 10 QPS for 1000-character texts
- 95% response time (RT) of 3 seconds
- 99% success rate
AI projects typically deliver via API or Docker, each suited for different scenarios. QPS measures server throughput, while RT reflects system responsiveness.
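As an illustration of how a "95% RT of 3 seconds" target can be checked against measured response times, here is a small sketch using the nearest-rank percentile (the sample times are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of samples."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# 100 simulated response times in seconds: mostly fast, a few slow
times = [0.8] * 90 + [2.5] * 5 + [4.0] * 5

p95 = percentile(times, 95)
print(p95, "meets 3 s target:", p95 <= 3.0)  # 2.5, True
```

Note how the p95 target tolerates a small tail of slow requests: the worst 5% of calls here take 4 seconds, yet the SLA is still met.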
Implementation occurred in two phases:
- Rule-based approach
- Model-based approach
The transition from rules to a model was driven by the project's growing complexity and by the dataset accumulated during the rules phase.
2.3.1 Phase One: Rules
We employed three text lists:
- Whitelist: Extract sentences containing these words (e.g., "graduated," "promoted")
- Blacklist: Discard sentences with these terms (e.g., "died," "attended")
- Scored terms: Add points when these words appear (weighted by TF-IDF)
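The example words above are from the project; everything else in this sketch (the score weights, the threshold, the helper name) is hypothetical, just to show how the three lists could combine into a sentence filter:

```python
# Hypothetical word lists; the real project's lists and TF-IDF weights differ
WHITELIST = {"graduated", "promoted"}
BLACKLIST = {"died", "attended"}
SCORES = {"university": 2.0, "ceo": 1.5}   # illustrative TF-IDF-style weights

def keep_sentence(sentence, threshold=1.0):
    words = set(sentence.lower().split())
    if words & BLACKLIST:
        return False                       # blacklist: discard outright
    if words & WHITELIST:
        return True                        # whitelist: extract directly
    score = sum(v for w, v in SCORES.items() if w in words)
    return score >= threshold              # otherwise score the sentence

print(keep_sentence("she graduated from a top school"))  # True (whitelist)
print(keep_sentence("he died in 1999"))                  # False (blacklist)
print(keep_sentence("joined as ceo last year"))          # True (score 1.5)
```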
While rules achieved good precision, recall remained inadequate. Bad case analysis revealed poor generalization, especially with ambiguous sentences like:
A: "Xiaohong invested in sportswear company Xilanhua."
B: "Xiaohong's investment in sportswear company Xilanhua."
Rules couldn't distinguish their semantic differences (in A, 'invested' is a verb describing an actual investment event; in B, 'investment' is merely a noun), prompting our next phase.
2.3.2 Phase Two: Model
Addressing these rule limitations, we implemented Part-of-Speech (POS) tagging with BERT. POS tags reveal how each word functions within a sentence, solving our verb/noun disambiguation problem. With limited data, we fine-tuned BERT for just one epoch.
BERT, Google's influential pretrained NLP model, excels across tasks when task-specific output layers are added on top of its encoder. Our final model achieved Recall 90% and Precision 90%.
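To illustrate why POS tags resolve the A/B ambiguity from the rules phase, here is a toy sketch with hand-written tags standing in for the fine-tuned tagger's output (the tag sequences and the helper function are hypothetical):

```python
# Hand-assigned POS tags simulating a tagger's output for sentences A and B
sent_a = [("Xiaohong", "NOUN"), ("invested", "VERB"), ("in", "ADP"),
          ("sportswear", "NOUN"), ("company", "NOUN"), ("Xilanhua", "PROPN")]
sent_b = [("Xiaohong's", "NOUN"), ("investment", "NOUN"), ("in", "ADP"),
          ("sportswear", "NOUN"), ("company", "NOUN"), ("Xilanhua", "PROPN")]

def describes_investment_event(tagged):
    """An investment *event* requires an 'invest'-stem word used as a verb."""
    return any(w.lower().startswith("invest") and t == "VERB"
               for w, t in tagged)

print(describes_investment_event(sent_a))  # True: 'invested' is a verb
print(describes_investment_event(sent_b))  # False: 'investment' is a noun
```

Surface-level word matching treats both sentences identically; only the part-of-speech information separates the actual investment event from the noun phrase.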
Actually, there's a lot more to discuss about project advancement, dataset management, and strategy analysis, but it feels like too much to write at once. I'll cover these topics separately in future articles.
- The Language Instinct
- The Beauty of Mathematics
- The Age of Intelligence
- What was the state of NLP research and applications in 2018?
- https://en.wikipedia.org/wiki/Confusion_matrix#cite_note-Powers2011-2
- https://lawtomated.com/accuracy-precision-recall-and-f1-scores-for-lawyers/
- Throughput (TPS), QPS, Concurrency, Response Time (RT) Concepts - Hu Lifeng - Blog Park
- Bad Case Analysis
- [Structured Machine Learning Projects] Lesson 2 - Machine Learning Strategy 2 - Wang Xiaocao's Blog - CSDN (https://blog.csdn.net/sinat_33761963/article/details/80559099)
- 5-Minute Introduction to Google's Most Powerful NLP Model: BERT - Jianshu
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/pdf/1810.04805.pdf)