AI in Medical Scenarios: How to Use AI Technology for Esophageal Cancer Identification and Auxiliary Diagnosis?
AI is applied in the medical field to assist diagnosis through machine learning. Here, the author shares their experience of using AI technology for esophageal cancer identification and auxiliary diagnosis, explaining the challenges at each stage.
Our medical imaging work mainly focuses on two tasks: identification and auxiliary diagnosis. Today, we will analyze how to use AI technology for esophageal cancer identification and auxiliary diagnosis.
Esophageal cancer is one of the top five malignant tumors worldwide, and China is a high-incidence region for esophageal cancer. The goal of this project is to determine whether a patient may have cancer through imaging.
The overall process of the project can be roughly divided into three stages. Below, I will briefly introduce the challenges of each stage.
Compared to typical image classification tasks, which often involve hundreds of thousands, millions, or even tens of millions of data points, medical imaging data is very limited. Additionally, due to variations in equipment parameters, doctors' imaging techniques, shooting angles, and lighting conditions, the appearance of the esophagus can vary significantly.
So, how can we obtain a reliable and stable model under such conditions?
We use feature maps. A feature map is generated by convolving the original image with a kernel; convolving the same image with different kernels yields different feature maps. You can think of this as analyzing the image from multiple perspectives: different kernels extract different features, and training the model amounts to solving an optimization problem that finds the set of kernels that best explains the observed data.
At the same layer, we aim to obtain descriptions of an image from multiple angles. Specifically, we use various convolution kernels to convolve the image, obtaining responses on different kernels (which can be understood as descriptions) as image features.
Their common thread is that, at the same level, they form descriptions of the image on different bases. Lower-level kernels are mainly simple edge detectors (analogous to simple cells in visual physiology).
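As a concrete illustration (the channel counts, kernel sizes, and image dimensions here are assumptions, not the project's actual settings), the sketch below shows how a single convolutional layer produces multiple feature maps, plus a hand-crafted Sobel kernel of the kind that low-level learned kernels often resemble:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One convolutional layer whose 16 kernels each produce a separate feature map,
# i.e. 16 different "descriptions" of the same image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 224, 224)            # a dummy RGB endoscopy frame
feature_maps = conv(image)                     # shape: (1, 16, 224, 224)

# A hand-crafted Sobel kernel: a simple edge detector responding to vertical edges.
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
gray = image.mean(dim=1, keepdim=True)         # collapse to a single gray channel
edge_map = F.conv2d(gray, sobel_x, padding=1)  # one feature map: vertical edges

print(feature_maps.shape, edge_map.shape)      # (1, 16, 224, 224) and (1, 1, 224, 224)
```

Each of the 16 output channels is one feature map, i.e. the image's response to one learned kernel.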
After obtaining esophageal data, how do we determine whether the esophagus is healthy or diseased?
This problem is similar to the previous one and can also be handled with a discriminative model.
What are the differences?
When determining whether an esophagus is abnormal, we only need to find one diseased area to conclude that the esophagus is abnormal.
However, in a normal image, we cannot conclude that the esophagus is normal just by finding one normal feature. We can only say that no abnormal features were found in this image, and it may be normal.
Therefore, between normal and abnormal features, we tend to extract diseased features and suppress normal features.
How do we achieve this?
Both diseased and normal cases pass through the neural network to obtain feature vectors. For these vectors, we aim to highlight abnormal features as much as possible and make normal features approach zero.
How do we model this information into the model?
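One way to express this asymmetry in training (a sketch under assumed names and weights, not necessarily the formulation used in the project) is to keep a standard classification loss while adding a sparsity penalty that applies only to the feature vectors of normal images:

```python
import torch
import torch.nn.functional as F

def asymmetric_loss(features, logits, labels, sparsity_weight=0.1):
    """Illustrative loss: labels are 1 for diseased and 0 for normal images.
    A cross-entropy term handles classification, while an L1 penalty pushes
    the feature vectors of normal images toward zero, so that strong
    activations are reserved for lesion-related features."""
    cls_loss = F.cross_entropy(logits, labels)

    normal_mask = (labels == 0).float().unsqueeze(1)        # (B, 1)
    # mean absolute activation, computed over normal samples only
    suppress = (features.abs() * normal_mask).sum() / (
        normal_mask.sum() * features.size(1) + 1e-6)

    return cls_loss + sparsity_weight * suppress
```

With this kind of term, normal images are encouraged to produce near-zero feature vectors, while diseased images remain free to produce strong responses.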
After reformulating the model along these lines, we achieved an accuracy of approximately 97%.
The previous models were relatively simple. The third model mainly distinguishes between inflammation and cancer, which is quite different from the first two problems.
Generally, images of diseased esophagi often include features of inflammation.
Our judgment of cancer is often based on very small texture areas, so we need to extract more refined features. A good approach is to have experts meticulously annotate the lesion areas, allowing us to focus on identifying these regions.
However, the annotation workload is enormous, leading to a severe lack of data. We do not have annotated data for cancer regions but still need very refined features. How do we resolve this contradiction?
Fortunately, although we cannot obtain very precise annotations of lesion areas, we can relatively easily determine whether an image contains cancer by correlating it with medical records. This allows us to obtain global labels for images more easily.
If an image contains cancer, there must be one or several areas with cancer features. In other words, if we divide the image into patches, some patches must contain cancer features. Based on this idea, we adopted a multiple-instance learning approach. The core idea is simple: divide the image into several patches, model each patch, and estimate the probability of cancer in each patch.
We then take the highest patch-level cancer probability as the image-level prediction of whether the image contains cancer, and supervise it with the image-level label.
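A minimal sketch of this patch-based idea (the encoder, feature dimension, and tensor shapes are illustrative assumptions, not the project's actual architecture):

```python
import torch
import torch.nn as nn

class PatchMIL(nn.Module):
    """Multiple-instance-learning sketch: score every patch for cancer,
    then take the highest patch probability as the image-level prediction."""

    def __init__(self, patch_encoder, feat_dim=512):
        super().__init__()
        self.patch_encoder = patch_encoder            # any CNN mapping a patch to feat_dim
        self.patch_classifier = nn.Linear(feat_dim, 1)

    def forward(self, patches):                       # patches: (B, N, C, H, W)
        B, N = patches.shape[:2]
        x = patches.flatten(0, 1)                     # (B*N, C, H, W)
        feats = self.patch_encoder(x)                 # (B*N, feat_dim)
        patch_probs = torch.sigmoid(self.patch_classifier(feats)).view(B, N)
        image_prob, _ = patch_probs.max(dim=1)        # max over patches
        return image_prob, patch_probs
```

During training, the image-level prediction would be compared against the global label with a binary cross-entropy loss, so the patches with the strongest cancer responses drive the gradient.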
During this process, we gradually accumulate precisely annotated data. Although this data is scarce and insufficient to train a model, the features in these images are the most accurate, having been manually verified and annotated.
How can we integrate this small amount of precise data into cancer identification?
This is a very interesting problem. If we can solve it, we can keep improving the model even with only a limited amount of precisely annotated data.
Here, we mainly use a multi-task learning approach, which involves completing two tasks: identifying cancer from image-level labels, and learning lesion features from the small set of precisely annotated images.
These two tasks share a feature extraction network, which must satisfy both of them; this is how the precisely annotated features are integrated into cancer identification.
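A hedged sketch of such a shared-backbone, two-head setup (the module names, shapes, and the choice of a segmentation-style lesion head are assumptions):

```python
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    """One shared feature extractor feeding two heads: an image-level
    cancer classifier, and a lesion-map head trained only on the small
    set of precisely annotated images."""

    def __init__(self, backbone, feat_channels=512, num_classes=2):
        super().__init__()
        self.backbone = backbone                                  # shared feature extractor
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels, num_classes))                # image-level task
        self.lesion_head = nn.Conv2d(feat_channels, 1, kernel_size=1)  # lesion-region task

    def forward(self, images):
        feats = self.backbone(images)        # (B, C, H', W')
        return self.cls_head(feats), self.lesion_head(feats)
```

During training, the classification loss would be computed on every image, while the lesion-head loss would be computed only on the precisely annotated images, so both tasks shape the shared features.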
The above is a brief introduction to our esophageal cancer project. Below, we will briefly introduce some of our work in auxiliary diagnosis.
We hope that machines can eventually diagnose diseases like clinical doctors.
Before introducing the auxiliary diagnosis project, let's look at how a doctor or a student grows into an expert: A student starts by learning numerous professional courses and reading extensive medical literature to accumulate medical knowledge.
Once they have sufficient medical knowledge, they can intern at hospitals, where clinical doctors guide them in learning diagnostic skills using real cases.
With these skills, they become general practitioners. Doctors see many patients and gain extensive experience, eventually becoming experts.
The growth process of machines is similar to that of humans.
We can divide it into three stages: accumulating medical knowledge (building a medical knowledge graph), learning to diagnose from real cases (the diagnostic model), and improving through practice and feedback from doctors.
In the process of building a medical knowledge graph, we first process text data. Text data is divided into two types: semi-structured and unstructured.
Here is an example of how we transform unstructured text into structured text, a form computers can understand.
We can divide medical histories into several parts: disease conditions, treatment history, admission basis, etc. After categorizing the history, we refine and extract each type of information. After extraction, unstructured text becomes structured text that computers can understand. We then convert this information into a medical knowledge graph stored in the computer, allowing the machine to learn this knowledge.
This is the process of building a medical knowledge graph.
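As a toy illustration of this structuring step (the field names, patterns, and example sentence are invented for illustration; a production pipeline would rely on trained entity- and relation-extraction models rather than regular expressions):

```python
import re

record = "Male, 62 years old, difficulty swallowing for 3 months, no history of surgery."

patterns = {
    "gender": r"\b(male|female)\b",
    "age": r"(\d+)\s*years old",
    "symptom_duration": r"for\s+(\d+\s*(?:days|weeks|months|years))",
}

# Extract each field; missing fields stay None.
structured = {field: (m.group(1) if (m := re.search(p, record, re.I)) else None)
              for field, p in patterns.items()}

print(structured)   # {'gender': 'Male', 'age': '62', 'symptom_duration': '3 months'}

# These fields can then be written into the knowledge graph as triples,
# e.g. ("patient_001", "has_symptom_duration", "3 months").
```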
The second step involves a diagnostic model.
The diagnostic process is as follows: First, convert a human-language description of a condition into structured knowledge that a computer can understand. With structured knowledge, the machine can understand the patient's situation and feed the knowledge into the disease diagnosis model, which outputs a list of diseases. This is roughly the diagnostic model's workflow.
Below is an example of condition understanding.
Using text-understanding techniques, we extract basic information about a patient's condition, including gender, age, the patient's self-described complaint, current medical history, and past medical history.
The self-described complaint may include symptoms, their duration, and even more complex details, such as the appearance of saliva or whether a cough produces phlegm. Each of these items is extracted and mapped into a structured record according to the schema described above, completing the understanding of the condition.
After understanding the condition, it is input into the diagnostic model.
The diagnostic demo includes several parts: a human-language description of the condition, a structured representation of the condition after understanding, and the machine's diagnostic results, listing the top five probable diagnoses in order of probability.
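A minimal sketch of that last step (the feature dimension, network shape, and disease-vocabulary size are assumptions): the structured condition is encoded as a feature vector, fed to a classifier, and the five diseases with the highest predicted probabilities are returned in order.

```python
import torch
import torch.nn as nn

NUM_DISEASES = 500                       # assumed size of the disease vocabulary
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, NUM_DISEASES))

condition_features = torch.randn(1, 128)              # encoded structured condition
probs = torch.softmax(model(condition_features), dim=1)
top5_prob, top5_idx = probs.topk(5, dim=1)             # ranked top-5 candidate diagnoses
```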
We have also provided an interface for doctors to score diagnostic results, feeding these evaluations back into the model.
Through the interaction between doctors and the machine, the model can be iteratively improved.
We tested approximately 100,000 real cases from laboratory data, achieving a 92% consistency rate for TOP1 results compared to doctors' diagnoses and 90% for TOP3. However, this model still requires more clinical cases for validation.