1. Introduction
Current AI systems in production environments often face scenarios that are dynamic and complex and that require independent operation. The system must actively respond to its environment rather than merely follow preset instructions. This is the core value of an autonomous decision‑making agent: through the closed‑loop collaboration of four capabilities (perception, planning, decision‑making, and execution), it moves from “receiving instructions” to “acting autonomously.” This article elaborates on this framework, explaining the definition, key technologies, and engineering implementation points of each capability module, accompanied by practical code examples.
After reading, you will understand the agent capability framework, the rationale for mainstream algorithm selection, common pitfalls, and how to implement a closed‑loop design in real projects.
2. Core Concept: Perception‑Planning‑Decision‑Execution Closed‑Loop Model
The four core capabilities of an autonomous decision‑making agent are not independent modules; rather, they form a continuously running closed‑loop system. Their collaborative relationship can be summarized as:
- Perception: Acquires raw data from the environment (e.g., sensor readings, images, logs), processes it into structured information, and outputs an abstract representation of the current environmental state.
- Planning: Based on the perceived information and a given goal, generates a series of feasible action sequences (e.g., paths, steps) that take the agent from the current state to the target state.
- Decision‑making: From the candidate options generated by planning, selects the most appropriate action considering real‑time constraints and strategies (e.g., risk preference, resource consumption).
- Execution: Transforms the decision into actual output—affecting the environment through physical actuators (e.g., motors) or virtual interfaces (e.g., API calls), triggering state changes.
- Feedback Loop: The execution results are fed back through a new round of perception, forming a closed loop. The agent continuously optimizes its behavior—a fundamental difference from the traditional “set‑and‑run” mode.
This closed loop is the fundamental reason why an agent can autonomously adapt in uncertain environments. For example, in autonomous driving: the perception system detects a pedestrian ahead (perception), the planning unit generates a deceleration or lane‑change plan (planning), the decision‑making unit selects deceleration based on current speed and road conditions (decision‑making), the execution unit controls the braking system (execution), and then sensors re‑acquire the vehicle state (new perception) to ensure the deceleration meets expectations. Without any one of these links, the agent would degrade into a passive reactive tool.
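The loop described above can be sketched in a few lines. This is a minimal illustration rather than a reference design: the function `run_closed_loop`, the `Room` class, and the thermostat numbers are all hypothetical and chosen only to make the four stages visible.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def run_closed_loop(
    perceive: Callable[[], State],
    plan: Callable[[State], List[Action]],
    decide: Callable[[State, List[Action]], Action],
    execute: Callable[[Action], None],
    is_done: Callable[[State], bool],
    max_steps: int = 100,
) -> int:
    """Run perceive -> plan -> decide -> execute until is_done or max_steps.

    Returns the number of actions that were executed.
    """
    for step in range(max_steps):
        state = perceive()                  # perception: observe the current state
        if is_done(state):
            return step
        candidates = plan(state)            # planning: propose candidate actions
        action = decide(state, candidates)  # decision: pick one candidate
        execute(action)                     # execution: act on the environment
    return max_steps


class Room:
    """Toy environment (hypothetical): a room whose temperature we can nudge."""

    def __init__(self, temperature: float) -> None:
        self.temperature = temperature


room = Room(30.0)
TARGET = 26.0

steps = run_closed_loop(
    perceive=lambda: room.temperature,
    plan=lambda t: [-1.0, 0.0, 1.0],
    decide=lambda t, options: min(options, key=lambda d: abs(t + d - TARGET)),
    execute=lambda d: setattr(room, "temperature", room.temperature + d),
    is_done=lambda t: abs(t - TARGET) < 0.5,
)
```

Each pass re-reads the environment before acting, so the effect of the previous action is observed rather than assumed. That re-reading is the feedback link of the loop.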
3. Perception: Multi‑Sensor Fusion and SLAM Environmental Modeling
Perception is the starting point of the agent’s interaction with the external world; its quality directly affects the reliability of subsequent planning and decision‑making. The core challenge of this layer is that raw sensor data (waveforms, images, point clouds) often contains noise, redundancy, and local blind spots. It must be processed to transform into a unified, understandable environmental model.
Multi‑sensor fusion technology is a key means to improve robustness. A single sensor (e.g., camera) degrades significantly in low‑light or occlusion conditions, while LiDAR accuracy decreases in rain or fog; fusion compensates for their weaknesses. Common practical methods include: using LiDAR point clouds to generate 3D obstacle contours, combining camera images to recognize object classes (e.g., distinguishing pedestrians from trees), then supplementing close‑range blind spot data with ultrasonic sensors, and finally using Kalman filters or Extended Kalman Filters (EKF) for spatiotemporal alignment and state estimation of heterogeneous data.
Timestamp alignment (typically based on hardware clocks or the PTP protocol) and spatial calibration (rotation‑translation matrices between sensor coordinate systems) are two error‑prone steps—if ignored, fused data may be offset and cause misjudgments.
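To make the Kalman filter step more concrete, here is a minimal one‑dimensional sketch that fuses two range readings of the same obstacle. It assumes a static scalar state and already time‑aligned measurements; the class name `Kalman1D` and the variances are illustrative values, not calibrated sensor parameters.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    x: float  # state estimate (e.g., distance to an obstacle in meters)
    p: float  # variance of the estimate

    def predict(self, process_var: float) -> None:
        # Constant-state model: the mean is unchanged, the uncertainty grows.
        self.p += process_var

    def update(self, z: float, meas_var: float) -> None:
        # The Kalman gain weights the measurement by relative confidence.
        k = self.p / (self.p + meas_var)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)


kf = Kalman1D(x=0.0, p=1000.0)       # vague prior
kf.update(z=10.2, meas_var=0.04)     # assumed LiDAR reading (low noise)
kf.update(z=9.5, meas_var=0.25)      # assumed ultrasonic reading (higher noise)
```

The fused estimate ends up closer to the low‑variance reading, and its variance is smaller than that of either sensor alone. An EKF follows the same predict/update pattern with linearized nonlinear models.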
Feature extraction extracts semantic information from raw data. In the image domain, Convolutional Neural Networks (CNNs) are the mainstream approach, used to detect object positions, lane lines, traffic signs, etc. In point cloud processing, PointNet or its variants can directly extract features from 3D coordinates. Note: feature extraction models need fine‑tuning for specific scenarios (e.g., indoor vs. outdoor, daytime vs. nighttime), otherwise generalization ability is insufficient.
Environmental modeling organizes perception results into structured representations; typical outcomes include occupancy grid maps or topological maps. In unknown environments, SLAM (Simultaneous Localization and Mapping) technology is standard—the agent must simultaneously solve “where am I” and “what does the surrounding environment look like.” The front‑end (feature matching and inter‑frame pose estimation) and back‑end (graph optimization / nonlinear optimization) of SLAM each have mature solutions (e.g., ORB‑SLAM, Cartographer). Engineering must pay attention to accumulated drift: without loop closure detection, after long‑distance travel, coordinate drift of an agent can reach several meters. This must be corrected through loop closure detection (e.g., bag‑of‑words models) plus graph optimization.
Case: Autonomous driving perception system. The vehicle is equipped with LiDAR (64/128‑line), forward‑looking cameras (multi‑view), millimeter‑wave radar, and ultrasonic sensors. LiDAR generates a 360° point cloud at 10 Hz, cameras provide pixel‑level semantics at 30 fps. After fusion and temporal alignment, a 3D environmental model containing obstacle positions, velocities, lane lines, and traffic light status is constructed in the vehicle domain controller for use by subsequent planning modules.
This system has high demands on computational resources and real‑time performance (typically requiring a complete perception output within 50 ms), requiring a trade‑off between algorithm accuracy and computing budget.
4. Planning: Path Planning Algorithms and Model Predictive Control (MPC)
The core task of the planning layer is to answer “how to reach the goal.” It must generate a series of actions or a path from the current state to the target state based on the environmental model provided by perception and the agent’s physical constraints (e.g., maximum speed, steering radius, energy budget). Planning is usually divided into two layers: global planning and local planning.
Global planning searches for an optimal path on a known or prior map. The A* algorithm is a common industrial‑grade solution: it combines the completeness of Dijkstra’s algorithm with the goal‑directedness of greedy best‑first search, using a heuristic function (e.g., Euclidean or Manhattan distance) to focus the search on directed weighted graphs. Dijkstra’s algorithm also finds the shortest path but expands nodes in all directions, which is expensive on large‑scale graphs (e.g., millions of nodes in HD maps). The worst‑case complexity of A* is still O(b^d) (b = branching factor, d = path depth), but a good heuristic usually expands far fewer nodes in practice. For the returned path to be optimal, the heuristic must be admissible (it never overestimates the remaining cost); a consistent (monotonic) heuristic additionally ensures that no node needs to be re‑expanded.
Note: A* is not suitable for continuous space; the environment must first be rasterized (cell decomposition). Cell resolution directly affects path smoothness and computational cost—too fine leads to combinatorial explosion, too coarse ignores obstacle details.
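The effect of cell resolution can be illustrated with a small rasterization sketch. It assumes obstacles arrive as 2D points in meters with the map origin at (0, 0); the function name `build_occupancy_grid` is hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def build_occupancy_grid(
    points: Iterable[Tuple[float, float]],
    width_m: float,
    height_m: float,
    resolution_m: float,
) -> List[List[int]]:
    """Mark every cell that contains at least one obstacle point as occupied (1)."""
    cols = math.ceil(width_m / resolution_m)
    rows = math.ceil(height_m / resolution_m)
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        col = int(x // resolution_m)
        row = int(y // resolution_m)
        if 0 <= row < rows and 0 <= col < cols:  # ignore points outside the map
            grid[row][col] = 1
    return grid


obstacles = [(0.1, 0.1), (1.2, 0.7), (5.0, 5.0)]
fine = build_occupancy_grid(obstacles, width_m=2.0, height_m=1.0, resolution_m=0.5)
coarse = build_occupancy_grid(obstacles, width_m=2.0, height_m=1.0, resolution_m=1.0)
```

With 0.5 m cells the map keeps six free cells around the two obstacles; with 1.0 m cells both remaining cells are occupied and no path exists at all, even though the search space is four times smaller.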
Local planning performs real‑time smoothing and obstacle avoidance on top of the global path. Model Predictive Control (MPC) is a mainstream approach: it transforms the control problem into a sequential optimization problem over a prediction horizon (e.g., the next 2–3 seconds), solving for the minimum‑cost trajectory that satisfies kinematic equations, obstacle constraints, and control command bounds. The advantage of MPC is its ability to anticipate dynamic obstacles (e.g., a pedestrian about to cross) and adjust the trajectory to avoid hard braking or collisions. Its main bottleneck is real‑time performance: nonlinear MPC solvers must converge within tens of milliseconds. In engineering, the problem is often formulated with a framework such as CasADi (a symbolic modeling and automatic‑differentiation tool that interfaces with NLP and QP solvers), or it is approximated as linear MPC and handed to an efficient QP solver.
Tuning the prediction horizon requires a balance: too short and obstacles cannot be foreseen; too long increases the computational burden and the sensitivity to modeling errors. With a 0.1 s step, the 2–3 s horizon mentioned above corresponds to 20–30 steps; horizons of 50–100 steps (5–10 s) are also quoted for autonomous driving, at a correspondingly higher solve cost.
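The receding‑horizon idea can be shown without a numerical solver. The sketch below is a deliberately simplified stand‑in: it enumerates every control sequence over a short horizon instead of solving a QP or NLP, uses a 1D point‑mass model, and treats "must still be able to stop before the obstacle" as the only constraint. The constants `DT` and `ACCELS` and all function names are assumptions for illustration.

```python
from itertools import product
from typing import List, Tuple

DT = 0.5                     # step length in seconds (assumed)
ACCELS = (-2.0, 0.0, 1.0)    # discretized accelerations in m/s^2 (assumed)


def step(pos: float, vel: float, acc: float) -> Tuple[float, float]:
    """Advance a 1D point-mass model by one step (explicit Euler, no reversing)."""
    return pos + vel * DT, max(0.0, vel + acc * DT)


def stop_position(pos: float, vel: float) -> float:
    """Position at which the vehicle comes to rest under full braking."""
    while vel > 0.0:
        pos, vel = step(pos, vel, min(ACCELS))
    return pos


def mpc_first_action(pos: float, vel: float, v_ref: float,
                     obstacle_pos: float, horizon: int = 4) -> float:
    """Return the first control of the cheapest feasible sequence."""
    best_cost, best_acc = float("inf"), min(ACCELS)  # fallback: full braking
    for seq in product(ACCELS, repeat=horizon):
        p, v, cost = pos, vel, 0.0
        for acc in seq:
            p, v = step(p, v, acc)
            cost += (v - v_ref) ** 2 + 0.1 * acc ** 2  # track speed, penalize effort
        # Constraint: after the horizon the vehicle can still stop in time.
        if stop_position(p, v) < obstacle_pos and cost < best_cost:
            best_cost, best_acc = cost, seq[0]
    return best_acc


def simulate(steps: int = 60, v_ref: float = 4.0,
             obstacle_pos: float = 30.0) -> List[Tuple[float, float]]:
    pos, vel, trace = 0.0, 0.0, []
    for _ in range(steps):
        acc = mpc_first_action(pos, vel, v_ref, obstacle_pos)
        pos, vel = step(pos, vel, acc)  # apply only the first control, then re-plan
        trace.append((pos, vel))
    return trace


trace = simulate()
```

Only the first control of each optimized sequence is applied before the problem is solved again from the new state. Exhaustive enumeration grows exponentially with the horizon, which is exactly why real systems rely on structured QP or NLP solvers.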
Time windows and constraint handling: Planning is not just about finding a path; it also needs to consider execution time windows (e.g., synchronous arrival, resource lock conflicts) and avoid path interleaving in multi‑agent scenarios. Spatiotemporal trajectories (paths expanded along the time dimension) can be introduced to handle both spatial and temporal constraints simultaneously.
Case: In autonomous driving path planning, global planning uses lane topology on HD maps to run A* and generates a guided path with lane IDs; local planning computes a trajectory using MPC based on that path and avoids temporary obstacles (e.g., a broken‑down vehicle) in real time. The overall planning module outputs trajectory points to the lower‑level execution layer at 20 Hz.
5. Decision‑making: Reinforcement Learning and Rule‑Based Strategies
The decision‑making layer, built upon the planning results, selects specific actions based on real‑time conditions (e.g., sensor feedback, task priorities, risk thresholds). Unlike planning which addresses “how to do,” decision‑making focuses on “what to do”—it selects one of multiple feasible paths or action sequences generated by planning, or further refines the planning results.
Rule‑based decision‑making is suitable for scenarios with high certainty and limited state spaces. For example, the logic of a smart home agent: if the temperature > 28°C and someone is indoors, turn on the air conditioner. Rule cascading can be implemented via state machines or decision trees; advantages include high interpretability and simple engineering debugging. The disadvantage is inability to handle unseen state combinations; rule sets need manual maintenance when the scenario changes. In enterprise office automation agents, rule‑based decision‑making still dominates: for example, when processing approval workflows, the agent decides whether to auto‑approve based on historical authority tables.
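The smart home rule above can be written as a small pure function. This is a sketch under assumptions: the `HomeState` fields, the action strings, and the 26°C switch‑off threshold (added as hysteresis so the unit does not toggle around 28°C) are illustrative and not part of the original rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HomeState:
    temperature_c: float
    occupied: bool
    ac_on: bool


def decide_ac(state: HomeState, on_above: float = 28.0, off_below: float = 26.0) -> str:
    """Return one of 'turn_on', 'turn_off', 'no_op'."""
    if not state.occupied:
        # Nobody is home: make sure the air conditioner is off.
        return "turn_off" if state.ac_on else "no_op"
    if state.temperature_c > on_above and not state.ac_on:
        return "turn_on"
    if state.temperature_c < off_below and state.ac_on:
        return "turn_off"  # assumed hysteresis threshold
    return "no_op"
```

Because the function has no side effects, every rule can be unit‑tested against a table of states, which is where the interpretability and easy debugging of rule‑based decisions come from.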
Machine learning‑based decision‑making, especially Reinforcement Learning (RL), is suitable for long‑term reward optimization scenarios. RL enables the agent to learn a policy function (typically via Deep Q‑Networks or PPO algorithms) through trial‑and‑error interaction with the environment, directly outputting action probabilities or Q‑values from perception states (e.g., driving scene images). Typical flow: the perception layer outputs environment state s → the decision layer selects action a (e.g., accelerate/decelerate) → the execution layer implements it → receives new state s’ and immediate reward r → update policy.
RL excels in game AI (e.g., AlphaStar) and continuous‑control robot grasping, but two issues must be addressed for engineering deployment: first, sparse rewards—the agent may go long periods without effective feedback; this can be mitigated by reward shaping or curriculum learning. Second, the exploration‑exploitation balance—early training should explore more, later exploit learned policies.
It is recommended to validate the policy in simulation first before migrating to real hardware to avoid collision damage.
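The s → a → s’, r → update flow can be illustrated with tabular Q‑learning, a much simpler relative of DQN. The sketch assumes a toy five‑cell corridor with a reward only at the right end (a sparse reward) and a linearly decaying exploration rate; the environment, hyperparameters, and function names are all illustrative.

```python
import random
from typing import List, Tuple

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left / move right


def env_step(state: int, action_idx: int) -> Tuple[int, float, bool]:
    """Toy corridor: reward 1.0 only when the goal cell is reached."""
    nxt = min(max(state + ACTIONS[action_idx], 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def q_update(q: List[List[float]], s: int, a: int, r: float, s_next: int,
             done: bool, alpha: float = 0.5, gamma: float = 0.9) -> None:
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def epsilon_at(episode: int, n_episodes: int,
               start: float = 1.0, end: float = 0.05) -> float:
    """Explore more early in training, exploit more later (linear decay)."""
    frac = episode / max(1, n_episodes - 1)
    return start + frac * (end - start)


def train(n_episodes: int = 300, max_steps: int = 50, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for ep in range(n_episodes):
        eps, s = epsilon_at(ep, n_episodes), 0
        for _ in range(max_steps):
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))                    # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])  # exploit
            s_next, r, done = env_step(s, a)
            q_update(q, s, a, r, s_next, done)
            s = s_next
            if done:
                break
    return q
```

Calling `train()` returns the learned Q‑table; the greedy action in each state is the index of the larger entry. Deep RL replaces the table with a network, but the update target has the same shape.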
Game‑theoretic decision‑making is used in multi‑agent interaction scenarios, such as interactive games in autonomous driving: two vehicles at an intersection, each needs to predict the other’s strategy and choose its own optimal action. This approach is computationally complex and currently mostly used in simulation; mass production deployment is still exploratory.
Decision‑making must pay special attention to real‑time performance—the planning layer may output multiple candidate trajectories per second, and decision‑making must respond at the millisecond level (e.g., the final “brake” command in motion planning). That is why real systems often combine rules and RL: rules handle extreme cases as a safety net, RL optimizes normal operating conditions.
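A hybrid of rules and a learned policy can be as simple as a guard in front of the policy call. The sketch below assumes a state dictionary with `gap_m` and `closing_speed_mps` keys and a 2 s time‑to‑collision threshold; these names and values are hypothetical, and `learned_policy` stands for whatever trained model is in use.

```python
from typing import Callable, Dict

State = Dict[str, float]


def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Seconds until contact at the current closing speed (inf if not closing)."""
    if closing_speed_mps <= 0.0:
        return float("inf")
    return gap_m / closing_speed_mps


def decide(state: State, learned_policy: Callable[[State], str],
           min_ttc_s: float = 2.0) -> str:
    # Safety rule first: it is cheap to evaluate and overrides the policy.
    if time_to_collision(state["gap_m"], state["closing_speed_mps"]) < min_ttc_s:
        return "brake"
    # Normal operating conditions: defer to the learned policy.
    return learned_policy(state)
```

The rule path involves one division and one comparison, so it meets millisecond‑level deadlines regardless of how long the policy inference takes.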
6. Execution: Control Interfaces and Actuator Feedback
The execution layer converts decisions into actual operations on the environment. If the agent decides to “move backward 0.5 meters,” the execution layer must determine which motor rotates how many degrees for how long. The engineering challenges in this layer focus on two aspects: precision and delay compensation.
Controller interfaces: In physical robots, PID (Proportional‑Integral‑Derivative) controllers are the most mature solution, regulating actuator position/speed errors in real time through feedback. PID parameter tuning must be done based on mechanical characteristics (e.g., moment of inertia, friction): too high a proportional gain can cause oscillation, while too strong an integral term causes overshoot and integral windup (and too weak an integral term leaves steady‑state error slow to disappear). In virtual agents (e.g., software automation), execution is done via API calls to send commands (e.g., database operations, HTTP requests), requiring attention to idempotency and retry mechanisms to avoid duplicate operations.
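A discrete PID loop is short enough to show in full. The sketch below assumes a fixed sample time, output clamping, and a simple anti‑windup rule (the integral is frozen while the output is saturated); the first‑order plant in `simulate_speed_loop` and all gains are illustrative, not tuned for any real actuator.

```python
from typing import Optional


class PID:
    """Discrete PID controller with output clamping and simple anti-windup."""

    def __init__(self, kp: float, ki: float, kd: float,
                 out_min: float = -10.0, out_max: float = 10.0) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error: Optional[float] = None

    def update(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        candidate_integral = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate_integral + self.kd * derivative
        output = min(max(raw, self.out_min), self.out_max)
        if output == raw:
            # Anti-windup: only accumulate while the actuator is not saturated.
            self.integral = candidate_integral
        self.prev_error = error
        return output


def simulate_speed_loop(pid: PID, setpoint: float = 1.0, tau: float = 0.5,
                        dt: float = 0.01, steps: int = 2000) -> float:
    """Drive an assumed first-order plant (time constant tau) and return final speed."""
    speed = 0.0
    for _ in range(steps):
        command = pid.update(setpoint, speed, dt)
        speed += (command - speed) * dt / tau
    return speed


final_speed = simulate_speed_loop(PID(kp=2.0, ki=1.0, kd=0.0))
```

With proportional action alone this plant would settle below the setpoint; the integral term removes that residual error, at the price of the windup risk the clamp guards against.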
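For communication protocols, the choice depends on the workload: Zigbee suits low‑power, low‑bandwidth devices; Wi‑Fi suits higher‑bandwidth devices where latency jitter is tolerable; industrial Ethernet with TSN suits control loops that need deterministic, bounded latency.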
Actuator feedback: Execution results affect the environment, and environmental changes are captured by the next round of perception, forming the closed loop. For example, a smart home agent decides to “adjust the lights,” the execution unit sends a command to the smart bulb via Wi‑Fi; after the bulb state changes, the light sensor detects illumination changes and feeds back to the perception layer; the agent then determines whether the target illumination is achieved. If perception shows no change after execution (bulb failure or network disconnection), the agent should trigger exception handling—re‑send or report for maintenance.
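The "send, verify through perception, retry or escalate" pattern can be sketched as follows. `FlakyBulb` is a stand‑in for a real device and `execute_with_confirmation` is a hypothetical helper; a real implementation would also wait between attempts for the device and sensor to settle.

```python
from typing import Callable, Tuple


class FlakyBulb:
    """Stand-in for a smart bulb that silently drops its first few commands."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.brightness = 0.0

    def set_brightness(self, value: float) -> None:
        if self.failures > 0:
            self.failures -= 1   # simulate a lost command
            return
        self.brightness = value  # absolute set: safe to repeat (idempotent)


def execute_with_confirmation(
    send: Callable[[float], None],
    read_back: Callable[[], float],
    target: float,
    tolerance: float = 1.0,
    max_attempts: int = 3,
) -> Tuple[bool, int]:
    """Send a command and confirm it through perception; retry on mismatch.

    Returns (success, attempts_used). On failure the caller should escalate,
    for example by reporting the device for maintenance.
    """
    for attempt in range(1, max_attempts + 1):
        send(target)
        if abs(read_back() - target) <= tolerance:
            return True, attempt
    return False, max_attempts


bulb = FlakyBulb(failures=1)
ok, attempts = execute_with_confirmation(bulb.set_brightness, lambda: bulb.brightness, 80.0)
```

Retrying is only safe because the command sets an absolute value; a relative command such as "increase by 10" would be applied twice if the first attempt had in fact succeeded.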
Execution delay compensation: There is an inherent delay between issuing a decision and the actual start of actuator motion (e.g., communication, mechanical response). If the agent ignores this delay during decision‑making, it can cause oscillation or collision in high‑speed scenarios (e.g., robot grasping). A common compensation method is to introduce “prediction steps” at the decision moment—assuming the delay is fixed as T, the decision uses a state estimate at future time T (from a state estimator) to determine the current action, rather than the current measurement. This can be naturally implemented in MPC: the model prediction already implicitly includes future control sequences for several steps.
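A fixed delay T can be compensated by deciding on the predicted state rather than the measured one. The sketch assumes constant acceleration over the delay and a simple stopping‑distance check; `predict_ahead`, `should_brake`, and the numbers in the example are illustrative.

```python
from typing import Tuple


def predict_ahead(pos: float, vel: float, acc: float, delay_s: float) -> Tuple[float, float]:
    """State after delay_s seconds, assuming constant acceleration."""
    return pos + vel * delay_s + 0.5 * acc * delay_s ** 2, vel + acc * delay_s


def should_brake(pos: float, vel: float, acc: float, obstacle_pos: float,
                 max_decel: float, delay_s: float) -> bool:
    """Decide on the state the actuator will see once the command takes effect."""
    p, v = predict_ahead(pos, vel, acc, delay_s)
    stopping_distance = v * v / (2.0 * max_decel)
    return p + stopping_distance >= obstacle_pos


# At 10 m/s with 5 m/s^2 of braking, an obstacle 11 m ahead looks safe
# if the delay is ignored, but not once a 0.2 s delay is accounted for.
naive = should_brake(0.0, 10.0, 0.0, 11.0, 5.0, delay_s=0.0)
compensated = should_brake(0.0, 10.0, 0.0, 11.0, 5.0, delay_s=0.2)
```

In this example the vehicle covers 2 m during the delay, which is enough to turn a "no action needed" decision into "brake now".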
Case: Robot assembly in industrial automation—perception detects workpiece position (deviation < 1 mm), planning generates a movement trajectory, decision selects the optimal path from current position to assembly point (considering time and energy consumption), execution layer controls motors precisely via servo drives; after assembly, a force sensor (next perception) provides pressure feedback to ensure good contact.
7. Practical Code Example: Agent Planning‑Execution Loop Based on A*
The following code demonstrates a simplified scenario where the agent completes one iteration of the closed loop through perception (reading a grid map), planning (A* pathfinding), decision (selecting the first step), and execution (moving). The code retains core logic and can serve as an introductory reference.
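A minimal sketch of such a loop follows. It assumes a 4‑connected grid in which 0 is free and 1 is an obstacle, unit step costs, and re‑planning on every iteration; apart from `raw_map`, `came_from`, and `heappop`, which the explanation refers to, the names and the map contents are illustrative.

```python
import heapq
from itertools import count
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]

raw_map = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]


def perceive(grid: List[List[int]]) -> List[List[int]]:
    """Perception: return a snapshot of the environment (here, a copy of the map)."""
    return [row[:] for row in grid]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def a_star(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Planning: shortest 4-connected path, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    tie = count()  # secondary key so equal f-values pop in insertion order
    open_heap = [(manhattan(start, goal), next(tie), start)]
    came_from: Dict[Cell, Cell] = {}
    g_score: Dict[Cell, int] = {start: 0}
    while open_heap:
        _, _, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:  # walk parents back to the start
                current = came_from[current]
                path.append(current)
            return path[::-1]
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                tentative = g_score[current] + 1
                if tentative < g_score.get(nxt, float("inf")):
                    came_from[nxt] = current
                    g_score[nxt] = tentative
                    heapq.heappush(open_heap, (tentative + manhattan(nxt, goal), next(tie), nxt))
    return None  # open list exhausted without reaching the goal


def decide(path: Optional[List[Cell]]) -> Optional[Cell]:
    """Decision: take the first step of the plan, or stop if there is none."""
    return path[1] if path and len(path) > 1 else None


def execute(next_cell: Cell) -> Cell:
    """Execution: move the agent (no mechanical delay is simulated)."""
    return next_cell


def run(start: Cell, goal: Cell, max_steps: int = 50) -> List[Cell]:
    position, visited = start, [start]
    for _ in range(max_steps):
        grid = perceive(raw_map)
        if position == goal:
            break
        step = decide(a_star(grid, position, goal))
        if step is None:
            break
        position = execute(step)
        visited.append(position)
    return visited


visited = run((0, 0), (3, 3))
```

Re‑planning from scratch on every step keeps the sketch short; it is wasteful on large maps, where the previous path would normally be reused while it remains valid.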
Code Explanation:
- Perception uses the 2D list `raw_map` to simulate the environment state; in real projects, it should read from a multi‑sensor fusion interface to obtain a real‑time grid.
- Planning uses the A* algorithm; the key point is that the heuristic function (Manhattan distance) must be admissible and consistent. The code only searches among passable cells without considering dynamic obstacles.
- Decision selects the first step from the planned path, but more logic can be added in this module (e.g., waiting when idle, avoiding temporary obstacles); that is exactly the value of the decision layer.
- Execution updates the agent’s position. Note that no mechanical delay is simulated here; in real scenarios, waiting or state confirmation is required.
Common Mistakes:
- Not checking whether the `came_from` dictionary contains the goal node: if the start and goal are separated by a wall, path reconstruction fails with an error.
- When using `heappop` on the open list of A*, nodes with the same f‑value are popped in an order that depends on how the remaining tuple elements compare, which may be inconsistent; a secondary sorting key (e.g., an insertion counter or timestamp) can be introduced for stability.
- In dynamic environments, discarding the old path on every re‑planning instead of reusing it: this wastes computation and may cause the path to jump between successive frames.
8. Advanced Tips and Pitfall Records
The following are common practical issues and solutions, listed by module.
Timestamp alignment and spatial calibration in multi‑sensor fusion:
- Different sensors have different sampling frequencies (e.g., camera at 30 fps, LiDAR at 10 Hz). If the latest frames are directly fused, time misalignment occurs. A hardware‑level solution is to use PTP (Precision Time Protocol) to synchronize sensor clocks; a software‑level approach can use interpolation or Kalman filtering to extrapolate slower frames.
- Spatial calibration: the extrinsic parameters (rotation‑translation matrices) between sensors need to be calibrated with a target in a static scene; ChArUco boards can be used for cameras. Periodically check calibration results, as vibration can shift the sensor mounts and cause drift.
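Software‑level alignment by interpolation can be sketched as follows. The example assumes timestamps on a common clock, sorted in increasing order, and a scalar measurement; `interpolate_at` is a hypothetical helper, and extrapolation outside the sampled range is refused rather than guessed.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def interpolate_at(timestamps: Sequence[float], values: Sequence[float],
                   t_query: float) -> Optional[float]:
    """Linearly interpolate a slow sensor's value at another sensor's timestamp.

    Returns None when t_query lies outside the sampled range.
    """
    if not timestamps or not (timestamps[0] <= t_query <= timestamps[-1]):
        return None
    i = bisect_left(timestamps, t_query)
    if timestamps[i] == t_query:
        return values[i]
    t0, t1 = timestamps[i - 1], timestamps[i]
    w = (t_query - t0) / (t1 - t0)
    return values[i - 1] + w * (values[i] - values[i - 1])


# Assumed 10 Hz LiDAR range samples, queried at a camera frame time in between.
lidar_t = [0.0, 0.1, 0.2]
lidar_range = [10.0, 12.0, 14.0]
aligned = interpolate_at(lidar_t, lidar_range, 0.05)
```

Interpolation needs a sample on each side of the query time, so it adds up to one period of the slow sensor in latency; Kalman‑style extrapolation avoids that wait at the cost of relying on a motion model.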
Handling sparse rewards in reinforcement learning training:
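- Reward shaping is a common technique: for example, in autonomous navigation tasks, give a small positive reward when the agent approaches the goal, and a negative reward when it moves away. Note that shaping changes the objective the agent optimizes, so a badly designed shaping term can bias the policy toward a local optimum. Potential‑based shaping, where the bonus is the discounted change in a potential function of the state, avoids changing the optimal policy.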
- Curriculum learning: start training with easy tasks (e.g., goal within 2 meters, no obstacles) and gradually increase difficulty. This significantly improves convergence speed and generalization.
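Distance‑based reward shaping can be written in potential‑based form. The sketch assumes the potential is the negative distance to the goal; the function name, the `scale` parameter, and the default discount are illustrative.

```python
def shaped_reward(env_reward: float, dist_before: float, dist_after: float,
                  gamma: float = 0.99, scale: float = 1.0) -> float:
    """Add a potential-based bonus: gamma * phi(s') - phi(s), with phi = -scale * distance."""
    phi_before = -scale * dist_before
    phi_after = -scale * dist_after
    return env_reward + gamma * phi_after - phi_before


approach = shaped_reward(0.0, dist_before=5.0, dist_after=4.0, gamma=1.0)  # moving closer
retreat = shaped_reward(0.0, dist_before=4.0, dist_after=5.0, gamma=1.0)   # moving away
```

Moving toward the goal yields a positive bonus and moving away a negative one, so the agent gets feedback on every step even though the environment reward itself stays sparse.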
MPC real‑time bottleneck:
- Solver selection: a general‑purpose interior point NLP solver such as IPOPT handles nonlinear MPC directly but often has poor real‑time performance (>50 ms per solve). For tight real‑time budgets, a QP solver such as OSQP (written in C) is recommended for linear MPC, or as the QP subproblem solver when NMPC is linearized; single‑step solving typically takes 5–15 ms, depending on problem size and hardware. If the prediction horizon is too long (>200 steps), multi‑rate MPC (mixing fine and coarse step sizes) can reduce computational load.
- Parameter tuning: a short prediction horizon (<20 steps) fails to anticipate obstacles, leading to repeated detours; a horizon too long (>150 steps) makes the optimization problem stiffer, degrading solver speed. Simulation‑based parameter scanning is recommended.
Handling cumulative error in SLAM:
- Loop closure is the core method for correcting drift. Use bag‑of‑words models (e.g., DBoW3) for image matching between keyframes; after detecting a loop, correct all keyframe poses via pose‑graph optimization (g2o, GTSAM). Note: loop closure detection may produce false positives (similar appearance at different locations); geometric verification (e.g., RANSAC) should be used to eliminate incorrect matches.
- Engineering suggestion: periodically reset the SLAM system (e.g., every 1 km of travel), or, in high‑precision localization scenarios, fuse RTK GPS as a global position prior alongside the local estimator (e.g., VIO).
Execution delay compensation:
- If the delay between decision issuance and actual actuator motion is stable (e.g., fixed 20 ms), prediction steps can be used at the decision moment: in solving the MPC trajectory, force the first step to wait T time before execution, or advance the state estimator to the future time T. If the delay is not fixed (e.g., network jitter in IoT), it is recommended to add a timestamp‑based deferred queue in the software architecture; when the message arrives at the actuator, calculate the actual relative time based on the timestamp.
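A timestamp‑based deferred queue for the non‑fixed‑delay case might look like the sketch below. It assumes sender and actuator share a synchronized clock, and it drops commands that arrive too late to be useful; the class name `DeferredQueue` and the `max_lateness_s` policy are assumptions, not a prescribed design.

```python
import heapq
from itertools import count
from typing import Any, List


class DeferredQueue:
    """Hold commands until their scheduled execution time; drop stale ones."""

    def __init__(self, max_lateness_s: float = 0.05) -> None:
        self._heap: list = []
        self._tie = count()  # keeps ordering stable for equal execution times
        self.max_lateness_s = max_lateness_s

    def push(self, issued_at: float, delay_s: float, command: Any) -> None:
        # The execution time is derived from the timestamp in the message,
        # not from the moment the message happens to arrive.
        heapq.heappush(self._heap, (issued_at + delay_s, next(self._tie), command))

    def pop_due(self, now: float) -> List[Any]:
        """Return commands that are due at time `now`, oldest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            execute_at, _, command = heapq.heappop(self._heap)
            if now - execute_at <= self.max_lateness_s:
                due.append(command)
            # else: the command is too late to be meaningful and is discarded
        return due


queue = DeferredQueue(max_lateness_s=0.05)
queue.push(issued_at=0.0, delay_s=0.02, command="a")
queue.push(issued_at=0.0, delay_s=0.10, command="b")
first = queue.pop_due(now=0.03)   # "a" is due and only 0.01 s late
second = queue.pop_due(now=0.30)  # "b" is 0.20 s late and gets dropped
```

Whether a late command should be dropped, executed immediately, or re‑planned is application specific; dropping is shown here only because it is the simplest safe default for motion commands.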
9. Summary and Outlook
This article introduced the four core capabilities of an autonomous decision‑making agent: perception, planning, decision, and execution, and explained how they form a closed‑loop model. This closed loop is the foundation for the agent to operate stably in dynamic environments: if any one link is missing, the system degrades. Key takeaways are as follows:
- Loop integrity is more important than single‑module optimization: Each module is constrained by its preceding and succeeding links in the complete process. Optimization must consider upstream and downstream interfaces (e.g., perception output format matching planning input).
- Algorithm selection should combine scenario characteristics: Rules are simple and reliable for conservative scenarios (e.g., automated approval), RL is suitable for exploration and optimization (e.g., game AI), MPC is suitable for continuous control. A hybrid architecture (rules + learning) avoids the bias of “all learning or all rules.”
- Engineering pitfalls concentrate on latency, calibration, and real‑time performance: timestamp alignment in multi‑sensor fusion, SLAM cumulative error, MPC solving time, and actuator feedback delay are all high‑frequency pitfalls in practice.
Outlook:
- End‑to‑end learning: integrate perception, planning, decision, and even execution into a single neural network that generates control signals directly from sensor input, reducing inter‑module transmission errors. However, interpretability and generalization in complex scenarios are still lacking; for now the approach is better suited to restricted environments (e.g., warehouse logistics).
- Multi‑agent collaborative decision‑making: when multiple agents coexist, they need to address communication topology (broadcast vs. point‑to‑point), task allocation (game‑theoretic or auction‑based), and collision avoidance (velocity obstacle methods). Recent attempts use graph neural networks to encode inter‑agent relationships for joint decision‑making.
- Interpretable decision‑making: in regulated fields such as finance and healthcare, the decision reasons of an agent must be auditable. Consider recording decision chains in a rule engine, or adding a post‑hoc explanation module on top of RL policies.
- Memory and reflection capability: long‑term task scenarios require the agent to record and reuse previous success/failure experiences, currently implemented via dialogue memory or experience replay pools. This belongs to more advanced autonomous capabilities and is worth further exploration.
The above content can serve as a reference for your team when building or selecting an autonomous decision‑making agent. Follow‑up steps could include studying classic papers and open‑source projects for each module (e.g., ROS Navigation Stack, Autoware, Stable‑Baselines3) and accumulating practical experience through trial and error in real projects.