What is Machine Learning? Understand It After Reading This
-
When discussing artificial intelligence, machine learning is an essential concept. From traditional information systems to e-commerce, through the rapid growth of the internet, and on to cloud computing and big data, it permeates our lives and work. Driven by the internet, people have gained a clearer understanding of data and better ways to use it: not just for statistics and analysis, but also for data mining and prediction.
Machine learning involves training a computer on a portion of data and then using it to predict or judge other data.
The core of machine learning is "using algorithms to parse data, learn from it, and then make decisions or predictions about new data." In other words, it is a method where computers use acquired data to derive a model and then apply this model for predictions. This process is somewhat similar to human learning, where accumulated experience allows us to make predictions about new problems.
For example, consider Alipay's "Collect Five Blessings" activity during the Chinese New Year, where users scan photos of the character "福" (fortune) for recognition. This employs machine learning. By providing the computer with training data of "福" character images and using algorithmic models, the system continuously learns and updates. When a new "福" image is input, the machine automatically identifies whether the character is present.
Machine learning is an interdisciplinary field involving probability theory, statistics, computer science, and more. The concept revolves around inputting vast amounts of training data to train a model, enabling it to grasp the underlying patterns in the data and accurately classify or predict new input data. As shown below:
Now that we understand the concept of machine learning and how models self-learn, what are the learning methods?
(1) Supervised Learning
Supervised learning involves training data where each sample has a corresponding target value. It establishes connections between data features and known outcomes, extracting feature values and mapping relationships. Through continuous learning and training on labeled data, it predicts outcomes for new data.
Supervised learning is commonly used for classification and regression. For instance, identifying spam messages or emails involves training models on historically labeled data (e.g., marking messages as spam or not). When new messages arrive, the model matches them to predict their category.
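As a minimal sketch of how such a spam classifier might be trained, assuming scikit-learn and a handful of made-up example messages:

```python
# A minimal supervised-classification sketch with scikit-learn.
# The example messages and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Congratulations, you won a free prize, click now",
    "Limited offer, claim your cash reward today",
    "Are we still meeting for lunch tomorrow?",
    "Please review the attached project report",
]
labels = ["spam", "spam", "ham", "ham"]  # manually annotated training data

# Bag-of-words features + Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# Predict the category of a new, unseen message
print(model.predict(["Claim your free reward now"]))  # e.g. ['spam']
```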
Another example is regression, such as predicting a company's net profit. Using historical profit data (target value) and related indicators like revenue, liabilities, and administrative expenses, a regression equation can be derived to predict future profits based on input factors.
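A similarly minimal sketch of that regression idea, with invented financial figures and scikit-learn's LinearRegression standing in for whatever model a real team would choose:

```python
# A minimal regression sketch: predicting net profit from a few indicators.
# The numbers below are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [revenue, liabilities, administrative expenses] (in millions)
X = np.array([
    [120, 30, 15],
    [150, 35, 18],
    [170, 40, 20],
    [200, 45, 22],
])
y = np.array([25, 32, 38, 47])  # historical net profit (the target value)

model = LinearRegression()
model.fit(X, y)

# Predict next period's profit from forecast indicators
print(model.predict([[220, 50, 25]]))
```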
The challenge in supervised learning is the high cost of obtaining labeled training data, as it often requires manual annotation.
(2) Unsupervised Learning
Unlike supervised learning, unsupervised learning does not require labeled data with target values. Instead, it focuses on uncovering inherent patterns in the data without analyzing their impact on specific outcomes.
Unsupervised learning is often used for clustering and dimensionality reduction. For example, the RFM model clusters customers based on their purchasing behavior (recency, frequency, and monetary value).
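A rough sketch of RFM-style clustering with K-Means, assuming the recency/frequency/monetary values have already been computed per customer (the numbers are fabricated):

```python
# A minimal clustering sketch: grouping customers by RFM values with K-Means.
# The RFM table below is fabricated for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row: [recency (days since last purchase), frequency, monetary value]
rfm = np.array([
    [5, 20, 3000],
    [7, 18, 2800],
    [40, 3, 200],
    [45, 2, 150],
    [15, 10, 1200],
    [12, 11, 1100],
])

# Scale features so one dimension does not dominate the distance metric
rfm_scaled = StandardScaler().fit_transform(rfm)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(rfm_scaled)
print(labels)  # cluster assignment per customer, with no predefined meaning
```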
A key advantage of unsupervised learning is that it doesn’t require manual labeling, reducing data acquisition costs.
(3) Semi-Supervised Learning
Semi-supervised learning sits between supervised and unsupervised learning: it trains on a small amount of labeled data together with a large amount of unlabeled data, and can be applied to classification, regression, and clustering tasks. It has become a popular approach in recent years.
(4) Reinforcement Learning
Reinforcement learning is a more complex method that emphasizes continuous interaction and feedback between the system and its environment: the system takes actions, receives rewards or penalties from the environment, and gradually learns a strategy that maximizes long-term reward. It is suited to scenarios that require ongoing decision-making, such as autonomous driving, and is a hot topic in machine learning.
Deep learning, a subset of machine learning, has gained significant attention. Inspired by the human brain, it uses deep neural networks to solve feature representation problems.
The relationship between artificial intelligence, machine learning, and deep learning is illustrated below:
Deep learning is ultimately still a form of machine learning, but the distinction it draws is different from the supervised, unsupervised, semi-supervised, and reinforcement categories: those describe how a model learns, whereas shallow and deep learning are distinguished by the depth of the neural network.
Shallow learning algorithms are used for structured or semi-structured data predictions, while deep learning tackles complex scenarios like image, text, and speech recognition.
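To make the shallow-versus-deep idea concrete, here is a minimal sketch of a small feed-forward neural network using scikit-learn's MLPClassifier on its built-in handwritten-digits dataset; real deep learning systems use far deeper networks and dedicated frameworks such as TensorFlow or PyTorch:

```python
# A minimal neural-network sketch on scikit-learn's small digits dataset.
# Real deep learning models are far deeper and typically use dedicated frameworks.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # 8x8 grayscale digit images, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; adding layers/units increases the network's depth
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # classification accuracy on held-out images
```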
This section clarifies basic machine learning concepts and introduces application scenarios, emphasizing that machine learning is fundamentally a data processing method—extracting patterns to predict future outcomes.
When discussing machine learning categories, we briefly introduced different methods and their purposes. Here, we delve into common applications, focusing on usage rather than algorithms or principles.
Classification and clustering are the most common machine learning applications. Both involve grouping data but differ significantly.
Classification involves predefined groups, where data is assigned to known categories. For example, separating students into male and female groups during military training is classification—we know the groups beforehand and use an algorithm to assign data accordingly.
Mathematically, classification learns a target function f that maps attribute sets x to predefined class labels y. It uses known samples (with attributes and labels) to build a model, then applies this function to classify new data.
Classification is a supervised method, solving "yes or no" problems like image recognition (e.g., identifying cats or dogs in photos). It matches analyzed data to known categories.
In clustering, the classes are unknown, and the goal is to partition data based on parameter similarities. For example, the RFM model clusters customer sales data into groups with high similarity, later defining labels based on analyzed traits. It addresses similarity, grouping like data together.
To summarize classification and clustering:
- Classification: Given 1,000 photos, if we’ve predefined cat and dog categories and trained a model, separating the photos into these groups is classification.
- Clustering: If we don’t know the categories, grouping photos based on similarities (e.g., color, shape) is clustering.
Suppose we have never worked with photos of cats and dogs before. We simply group a dataset of 1,000 photos into clusters with high internal similarity. Only after the clustering is done do we examine each group's traits and label it as cats, dogs, or some other type of image.
From a statistical perspective, regression refers to a method for determining the quantitative relationship between two or more interdependent variables. Regression analysis can be categorized into simple and multiple regression based on the number of independent variables, univariate and multivariate regression based on the number of dependent variables, and linear and nonlinear regression based on the type of relationship between the independent and dependent variables.
In big data analysis, regression is a predictive modeling technique that examines the relationship between dependent (target) and independent (predictor) variables. This technique is commonly used for predictive analysis, time series modeling, and uncovering causal relationships between variables.
Mathematically, regression comes down to an equation: the model learns the relationship between input factors and the output as a function, in the simplest case a linear function such as y = ax + b.
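In the multivariate case, one common form is the multiple linear regression equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where the coefficients are learned from historical data and ε is the error term.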
From an algorithmic standpoint, regression predicts continuous outcomes in supervised learning. For instance, by analyzing past salary data and related influencing parameters, a regression model can predict future salary based on changes in these parameters.
By building regression models and mathematically analyzing the equations, we can also reverse-engineer the necessary parameter optimizations to achieve a predefined outcome. The ultimate goal of regression is to derive relevant parameters and their characteristic values, which is why correlation analysis of target parameters is often performed during regression analysis.
With sufficient data, regression analysis can aid in prediction and decision-making. For example, after launching new features, we can perform regression analysis on metrics like click-through rates, open rates, and sharing behavior to predict business outcomes. Similarly, regression analysis of historical data such as age, weight, blood pressure, cholesterol levels, smoking, and drinking habits can predict an individual's risk of certain diseases.
The primary purpose of regression is to predict potential outcomes for new data based on patterns observed in continuous data.
Dimensionality reduction involves eliminating redundant features and reducing the number of feature parameters, representing characteristics with fewer dimensions. For instance, in image recognition, converting an image into a high-dimensional dataset requires dimensionality reduction to simplify processing complexity, reduce redundancy-induced errors, and improve recognition accuracy.
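A minimal sketch of dimensionality reduction with principal component analysis (PCA), using scikit-learn's small built-in digits dataset as a stand-in for real image data:

```python
# A minimal dimensionality-reduction sketch with PCA.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # samples x 64 pixel features
print(X.shape)

# Keep enough components to retain roughly 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # far fewer columns than the original 64
```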
From a statistical perspective, the four major applications of machine learning can be understood as follows: If we have a set of samples and want to predict whether they belong to a certain attribute, we use classification for discrete values and regression for continuous values. If the samples lack predefined attributes and we aim to uncover correlations, clustering is employed. For high-dimensional data, dimensionality reduction helps identify more precise parameters, enhancing accuracy in classification, clustering, or regression.
Beyond these, applications like speech recognition, image recognition, text recognition, and semantic analysis combine these fundamental machine learning methods.
The diagram below provides examples of algorithms for different scenarios. Interested readers can explore the principles of each algorithm independently.
Understanding machine learning applications is highly valuable for product managers:
On one hand, product managers need to grasp what problems machine learning can solve and whether it can address their business needs. Understanding these applications also clarifies why AI platforms are so impactful.
For example, clustering can be used for audience or product categorization, classification for predicting app feature engagement, and regression for forecasting product purchases.
On the other hand, machine learning highlights the importance of data, urging product managers to leverage data effectively for predictive and decision-making purposes using algorithms.
While much effort in machine learning is devoted to selecting and optimizing algorithms, this is just one step in the process. Understanding the entire workflow is crucial, especially for product managers.
The machine learning workflow essentially involves data preparation, analysis, processing, and feedback. This can be broken down into: business scenario analysis, data processing, feature engineering, algorithm model training, and application services. Below is a detailed explanation of these steps.
Business scenario analysis translates business needs and usage scenarios into machine learning requirements, followed by data analysis and algorithm selection. This preparatory stage includes three key aspects: business abstraction, data preparation, and algorithm selection.
(1) Business Abstraction
Business abstraction involves framing business requirements as machine learning applications. For example, a product recommendation task—determining whether to recommend Product A to User A—becomes a binary classification problem (yes/no). This is the essence of business abstraction: translating business needs into machine learning scenarios.
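As a hedged illustration of that abstraction, here is a tiny "recommend or not" classifier; the feature columns and values are hypothetical, and logistic regression is only one of many algorithms that could be used:

```python
# A minimal sketch of "recommend or not" framed as binary classification.
# Feature columns and values are hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [user's past purchases in category, days since last visit, product price]
X = np.array([
    [5, 2, 30],
    [0, 60, 200],
    [3, 10, 50],
    [1, 45, 150],
])
y = np.array([1, 0, 1, 0])  # 1 = user bought after recommendation, 0 = did not

model = LogisticRegression()
model.fit(X, y)

# For a new user/product pair, predict whether to recommend (1) or not (0)
print(model.predict([[4, 5, 40]]))
```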
(2) Data Preparation
Data is the foundation of machine learning. Without data, models cannot be trained. Data preparation involves identifying, collecting, and processing data. This includes structured, semi-structured, and unstructured data, as discussed in knowledge graphs. Key considerations for product managers include:
- Data Fields: All data must be abstracted into a two-dimensional table, where headers represent field names. Considerations include:
  - Field scope: Which fields are needed as machine learning parameters?
  - Field types: For regression, numerical values are required (e.g., encoding gender as 0/1). For classification, strings (e.g., "male"/"female") may suffice.
- Data Quality: Ensuring data is clean, relevant, and representative for model training.
Beyond the field definitions, we also need to consider the actual data records we can obtain, that is, the rows of the two-dimensional table beneath the header. As product managers, we need to consider two main aspects:
- Data Volume: Machine learning requires a certain amount of data, and more data is generally better.
- Data Completeness: This is a data-quality concern. Collect data as comprehensively as possible, and exclude fields with significant missing or garbled values from model calculations, since they can distort the results (a quick completeness check is sketched after this list).
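A quick sketch of how such a completeness check might look with pandas; the column names and the 50% threshold are arbitrary illustrations:

```python
# A minimal data-completeness check: drop fields with too many missing values.
# The DataFrame and the 50% threshold are illustrative choices.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 38],
    "income": [3000, np.nan, np.nan, np.nan, 5200],  # mostly missing
    "gender": ["male", "female", "female", "male", "male"],
})

missing_ratio = df.isna().mean()         # share of missing values per field
print(missing_ratio)

usable = df.loc[:, missing_ratio < 0.5]  # keep fields that are mostly complete
print(usable.columns.tolist())           # e.g. ['age', 'gender']
```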
(3) Algorithm Selection
Algorithm selection answers the question of which algorithm or model to use, given the machine learning requirement and the available data items. This stage is led by algorithm engineers. Machine learning offers many algorithms, so the choice is rarely unique: the same problem can often be solved with several different algorithms. As computer science advances, new algorithms will keep emerging, and existing ones can be further improved through parameter tuning.
Data Processing
Data processing involves the selection and cleaning of data. Once the data is prepared, the algorithm is determined, and the requirements are set, the data must be processed to minimize interference with the algorithm. Common techniques in data processing include "denoising" and "normalization."
- Denoising: This involves removing noisy or abnormal data from the dataset. As product managers, we must ensure the data reflects real-world scenarios, and algorithms can help flag suspect records. For example, for data that follows a normal distribution, the 3-sigma rule can be used to remove outliers.
- Normalization: This simplifies the data, typically scaling it to a range of [0, 1]. Normalization helps algorithms find optimal solutions more effectively. It addresses issues such as multiple representations of the same data field (e.g., counting sheep by "herd" or "individual") and ensures consistency in units (e.g., hours, minutes, or seconds). Normalization also helps algorithms converge by standardizing data ranges (e.g., when analyzing variables with vastly different scales).
There are many techniques and algorithms for data processing, and the goal is to optimize the data to minimize interference with the model based on the business scenario.
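A compact sketch of the two techniques above, using the 3-sigma rule for denoising and min-max scaling to [0, 1] for normalization, on simulated data:

```python
# A minimal sketch of denoising (3-sigma rule) and min-max normalization.
import numpy as np

# Simulated metric: ~50 normal readings around 10, plus one injected outlier
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10, scale=1, size=50), 55.0)

# Denoising: keep only points within 3 standard deviations of the mean
mean, std = values.mean(), values.std()
clean = values[np.abs(values - mean) <= 3 * std]
print(len(values), "->", len(clean))  # the injected outlier is dropped

# Normalization: rescale the cleaned data to the [0, 1] range
normalized = (clean - clean.min()) / (clean.max() - clean.min())
print(normalized.min(), normalized.max())  # 0.0 1.0
```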
Feature Engineering
In machine learning, it is often said that data and features determine the upper limit of performance, while models and algorithms merely approach that limit. Data and features are the foundation of algorithm models. Feature engineering involves extracting features from processed data and converting them into a format usable by the algorithm.
Key objectives of feature engineering include:
- Feature abstraction: Transforming raw data into feature data, e.g., converting "yes/no" to "0/1" (see the sketch after this list).
- Feature evaluation and selection: Assessing and choosing the best features from multiple possibilities.
- Feature derivation: Combining features to create new ones.
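A small sketch of feature abstraction and derivation with pandas; the columns and values are hypothetical:

```python
# A minimal feature-engineering sketch: feature abstraction and derivation.
# The columns and values are hypothetical, for illustration only.
import pandas as pd

raw = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "vip": ["yes", "no", "yes", "no"],
    "orders": [12, 3, 8, 1],
    "total_spent": [2400, 300, 1200, 80],
})

features = pd.DataFrame()
# Feature abstraction: convert raw categorical values into numeric features
features["gender"] = raw["gender"].map({"male": 0, "female": 1})
features["vip"] = raw["vip"].map({"no": 0, "yes": 1})
# Feature derivation: combine existing columns into a new feature
features["avg_order_value"] = raw["total_spent"] / raw["orders"]
print(features)
```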
Feature engineering is a critical step in machine learning, as the quality of features directly impacts the results. Even with the same algorithm, different feature choices can lead to vastly different outcomes.
Model Training
After data preparation, processing, and feature engineering, the selected algorithm is trained and evaluated. The trained model is tested with new data to assess its quality. This stage is primarily handled by algorithm engineers, while product managers should ensure the model can be retrained as new data is introduced.
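A hedged sketch of that train-and-evaluate loop, holding out part of the data to test the trained model; scikit-learn's iris dataset stands in for real business data, and the random forest is an arbitrary choice:

```python
# A minimal train/evaluate sketch: hold out data to test the trained model.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)                 # training on historical (labeled) data

predictions = model.predict(X_test)         # evaluating on unseen data
print(accuracy_score(y_test, predictions))  # one possible quality metric
```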
Application Services
Once the model is trained, the focus shifts to how it is deployed and how parameters (e.g., confidence thresholds) are configured. Models can be exposed via APIs for application-layer calls, and configuration interfaces can allow parameter adjustments.
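A minimal sketch of exposing a trained model as an HTTP prediction service; Flask is just one option, and the model file name, route, and confidence threshold are hypothetical:

```python
# A minimal sketch of serving a trained model behind an HTTP API with Flask.
# The model file, route, and confidence threshold are hypothetical choices.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")   # a previously trained and saved classifier
THRESHOLD = 0.8                    # configurable confidence threshold

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. {"features": [[4, 5, 40]]}
    proba = model.predict_proba(features)[0].max()  # confidence of the top class
    label = model.predict(features)[0]
    return jsonify({
        "prediction": str(label),
        "confidence": float(proba),
        "accepted": bool(proba >= THRESHOLD),       # apply the threshold
    })

if __name__ == "__main__":
    app.run(port=5000)
```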
This straightforward overview of the machine learning workflow highlights its relevance for product managers.