Building Autonomous Robot Navigation from Scratch with ROS2 and Isaac Sim
How we designed a 4-service microservice architecture for autonomous object tracking and navigation using ROS2, NanoOWL, NVSLAM, and Nav2.
Introduction
Imagine telling a robot "go find the red chair" and watching it navigate across a room, avoid obstacles, and stop right in front of the chair — all autonomously. That is exactly what the Carter project set out to achieve: a robot that can see, understand natural language descriptions of objects, and navigate toward them in real time.
Building this system from scratch was one of the most rewarding engineering challenges our team has taken on. It required combining computer vision, natural language processing, depth estimation, 3D mapping, and path planning into a cohesive system that runs reliably on both desktop workstations and NVIDIA Jetson edge devices.
This article walks through the architecture, the key technical decisions, and the lessons learned from building a fully autonomous object-tracking navigation system using ROS2 and NVIDIA's Isaac ecosystem.
The Architecture
Early in the project, we made the decision to build the system as a microservice architecture with four distinct Docker services. Each service handles one responsibility, communicates with the others through ROS2 topics and services, and can be developed, tested, and updated independently.
Service 1: Depth Generation
The Depth Generation service uses NVIDIA Isaac ROS ESS, a DNN-based stereo disparity model, to convert stereo camera images into dense depth maps. ESS runs efficiently on NVIDIA GPUs and produces accurate depth estimates even in challenging lighting conditions.
The service subscribes to left and right camera image topics, processes them through the ESS model, and publishes a depth map. This depth information is critical — without knowing how far away objects are, the robot has no way to generate meaningful 3D navigation goals.
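Stereo networks such as ESS predict disparity (the horizontal pixel shift between the left and right images), and metric depth follows from the relation depth = focal length × baseline / disparity. The sketch below illustrates that conversion for a rectified stereo pair. The function names, the minimum-disparity cutoff, and the list-of-rows map representation are assumptions made for illustration, not the project's actual code.

```python
import math


def disparity_to_depth(disparity_px, focal_length_px, baseline_m, min_disparity_px=1e-3):
    """Convert one disparity value (pixels) to metric depth (meters).

    Returns math.inf when the disparity is too small to be trusted,
    which corresponds to a point at or beyond the usable range.
    """
    if disparity_px < min_disparity_px:
        return math.inf
    return focal_length_px * baseline_m / disparity_px


def disparity_map_to_depth(disparity_rows, focal_length_px, baseline_m):
    """Apply the conversion to a 2D disparity map given as a list of rows."""
    return [
        [disparity_to_depth(d, focal_length_px, baseline_m) for d in row]
        for row in disparity_rows
    ]
```

Because depth is inversely proportional to disparity, a fixed disparity error grows into a larger depth error as objects get farther away, which is one reason calibration quality matters so much for distant targets.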
One early lesson was the importance of camera calibration. The ESS model is sensitive to the quality of the stereo calibration parameters. Poor calibration leads to noisy depth maps, which cascade into inaccurate navigation goals. We implemented an automated calibration validation step that runs at startup and warns if the calibration quality falls below a configured threshold.
Service 2: Object Detection with NanoOWL
The Object Detection service is where things get interesting. Instead of training a custom object detection model for every possible target object, we used NanoOWL — a TensorRT-optimized version of OWL-ViT that runs on NVIDIA GPUs. NanoOWL is a zero-shot, text-prompted object detector. You give it a text description like "red chair" or "cardboard box" and it detects matching objects in the image — no training required.
This was a game-changing decision for the project. Traditional object detection workflows require collecting training data, annotating images, training a model, optimizing it, and deploying it. With NanoOWL, the robot can be redirected to find any object simply by changing the text prompt. In practice, this means the same system can navigate toward a "fire extinguisher" in one scenario and a "delivery package" in another without any model changes.
The service processes camera frames, runs NanoOWL inference, and publishes detection results including bounding boxes, confidence scores, and the text labels. We added confidence thresholding and temporal smoothing to reduce false positives — a single spurious detection should not send the robot careening across the room.
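One simple way to combine confidence thresholding with temporal smoothing is to accept a target only after it has been seen confidently in several recent frames. The sketch below shows that idea. The class name, the 0.5 threshold, and the 3-of-5 voting window are illustrative assumptions rather than the values used in the project.

```python
from collections import deque


class DetectionGate:
    """Accept a target only after it is seen confidently in enough recent frames."""

    def __init__(self, min_confidence=0.5, window=5, min_hits=3):
        self.min_confidence = min_confidence
        self.min_hits = min_hits
        self._history = deque(maxlen=window)

    def update(self, confidence):
        """Record one frame; pass None when nothing was detected.

        Returns True when at least `min_hits` of the last `window` frames
        contained a detection at or above `min_confidence`.
        """
        hit = confidence is not None and confidence >= self.min_confidence
        self._history.append(hit)
        return sum(self._history) >= self.min_hits
```

Calling `update()` once per frame means a single spurious detection never reaches the goal generator, while a real object is accepted after a short, bounded delay.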
Service 3: Goal Pose Generation
The Goal Pose Generation service is the bridge between perception and action. It takes two inputs — object detections from NanoOWL and the depth map from ESS — and produces a 3D navigation goal in the robot's coordinate frame.
The pipeline works as follows:
- Receive a detection with a 2D bounding box
- Sample depth values within the bounding box region
- Filter outliers using median filtering to handle noisy depth estimates
- Back-project the 2D center point with the filtered depth into 3D space using the camera intrinsics
- Transform the 3D point from the camera frame into the robot's base frame
- Apply a configurable offset so the robot stops at a comfortable distance from the object rather than crashing into it
- Publish the goal as a ROS2 PoseStamped message compatible with Nav2
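The depth sampling, filtering, back-projection, and frame transform steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the depth map is a list of rows in meters, a plain median stands in for the outlier filter, the camera follows the pinhole model with intrinsics `fx`, `fy`, `cx`, `cy`, and the camera is mounted facing straight ahead so the transform reduces to an axis swap plus a fixed offset. In a real ROS2 system the transform would normally come from tf2, and all function names here are hypothetical.

```python
import math
from statistics import median


def median_depth(depth_map, box):
    """Median of the valid depth samples inside box = (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    samples = [
        depth_map[v][u]
        for v in range(v_min, v_max)
        for u in range(u_min, u_max)
        if math.isfinite(depth_map[v][u]) and depth_map[v][u] > 0.0
    ]
    if not samples:
        raise ValueError("no valid depth inside the bounding box")
    return median(samples)


def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) into the camera optical frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def optical_to_base(point, camera_offset=(0.0, 0.0, 0.0)):
    """Map optical axes (x right, y down, z forward) to base axes (x forward, y left, z up)."""
    x, y, z = point
    tx, ty, tz = camera_offset
    return (z + tx, -x + ty, -y + tz)
```

Taking the median rather than the mean keeps a few background pixels inside the bounding box from dragging the estimated distance toward the wall behind the object.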
The offset calculation was more nuanced than expected. Simply subtracting a fixed distance along the line from the robot to the object works in open spaces, but can place the goal inside a wall or obstacle if the object is near a boundary. We implemented a check against the occupancy map to validate goal poses before publishing them.
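A minimal sketch of the standoff offset and the occupancy check might look like the following. It assumes a 2D grid that follows the ROS `nav_msgs/OccupancyGrid` value convention (0 to 100 for occupancy probability, -1 for unknown) stored as a list of rows. The function names and the threshold of 50 are illustrative assumptions, and a production check would likely test the whole robot footprint rather than a single cell.

```python
import math


def offset_goal(robot_xy, object_xy, standoff_m):
    """Return a goal `standoff_m` short of the object, on the robot-to-object line."""
    dx = object_xy[0] - robot_xy[0]
    dy = object_xy[1] - robot_xy[1]
    distance = math.hypot(dx, dy)
    if distance <= standoff_m:
        return robot_xy  # already close enough; stay put
    scale = (distance - standoff_m) / distance
    return (robot_xy[0] + dx * scale, robot_xy[1] + dy * scale)


def is_goal_free(grid, origin_xy, resolution_m, goal_xy, occupied_threshold=50):
    """Check the grid cell under the goal; unknown (-1) cells count as blocked."""
    col = math.floor((goal_xy[0] - origin_xy[0]) / resolution_m)
    row = math.floor((goal_xy[1] - origin_xy[1]) / resolution_m)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        return False
    value = grid[row][col]
    return 0 <= value < occupied_threshold
```

Treating unknown and out-of-map cells as blocked is the conservative choice: the goal generator withholds a goal it cannot verify instead of handing Nav2 a pose that may sit inside a wall.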
Service 4: Perception and Path Planning
The final service handles SLAM (Simultaneous Localization and Mapping) and navigation. It combines two NVIDIA Isaac ROS packages with the open-source Nav2 navigation stack:
- NVSLAM for visual-inertial simultaneous localization and mapping, giving the robot a continuously updated estimate of its position in the world
- NVBLOX for 3D reconstruction and occupancy mapping, building a voxel-based representation of the environment that includes obstacles the robot needs to avoid
- Nav2 for path planning and execution, computing collision-free paths from the robot's current position to the goal and sending velocity commands to drive there
When a goal pose arrives from the Goal Pose Generation service, Nav2 queries the NVBLOX occupancy map, plans an optimal path around obstacles, and begins executing the trajectory. The robot continuously updates its position through NVSLAM, allowing Nav2 to adjust the path in real time as new obstacles appear or the environment changes.
Why Microservices for Robotics
The microservice architecture was not just an academic choice — it solved real engineering problems throughout the project.
Independent development and testing. Our engineers could work on the detection model while the navigation stack remained unchanged. Testing object detection did not require a physical robot — we could feed recorded images to the detection service and verify its output in isolation.
Hardware flexibility. The system needed to run on both x86 workstations (for development with Isaac Sim) and ARM64 Jetson devices (for deployment on the physical robot). Docker containers with platform-specific base images made this manageable: the same Docker Compose file works on both platforms with only the base image tag changing.
Graceful failure handling. If the object detection service crashes, the navigation stack continues operating with its last known goal. If depth estimation fails, the system pauses goal generation rather than sending the robot to invalid positions. Each service has its own restart policy and health monitoring.
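The "pause goal generation" behavior can be approximated with a freshness check on the inputs: if the depth or detection stream goes quiet, no new goals are published and Nav2 simply keeps its last one. The sketch below is one way to express that. The class name, the source labels, and the 0.5 second timeout are assumptions chosen for illustration.

```python
class FreshnessGuard:
    """Allow goal generation only while required inputs are recent."""

    def __init__(self, max_age_s=0.5):
        self.max_age_s = max_age_s
        self._last_seen = {}

    def mark(self, source, stamp_s):
        """Record the timestamp (seconds) of the latest message from `source`."""
        self._last_seen[source] = stamp_s

    def is_fresh(self, source, now_s):
        """True if `source` has reported within the last `max_age_s` seconds."""
        stamp = self._last_seen.get(source)
        return stamp is not None and (now_s - stamp) <= self.max_age_s

    def can_publish_goal(self, now_s):
        """Goals need both a recent depth map and a recent detection."""
        return self.is_fresh("depth", now_s) and self.is_fresh("detection", now_s)
```

In a ROS2 node, `mark()` would be called from each subscription callback and `can_publish_goal()` checked just before publishing.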
Resource optimization. On memory-constrained Jetson devices, we could tune the resource allocation for each container independently. The detection service gets more GPU memory while the navigation service gets more CPU allocation.
Simulation with Isaac Sim
Before deploying on physical hardware, every feature was first validated in NVIDIA Isaac Sim. Isaac Sim provides a photorealistic simulation environment with accurate physics, making it possible to test the entire navigation pipeline without risking damage to a real robot.
We used the Carter robot model in Isaac Sim, which closely matches the physical robot's sensor configuration and dynamics. The simulation environment included various objects placed at different locations and distances, allowing us to validate detection, depth estimation, goal generation, and navigation in a controlled setting.
One of the most valuable aspects of simulation was the ability to test edge cases systematically. What happens when the target object is behind an obstacle? What if two similar objects are in view? What if the object moves? These scenarios are difficult and time-consuming to set up with physical hardware but trivial to configure in simulation.
Cross-Platform Challenges
Supporting both x86 and ARM64 architectures introduced several challenges:
Docker base images. NVIDIA provides different base images for x86 (using standard CUDA containers) and ARM64 (using L4T-based containers for Jetson). We used Docker multi-stage builds with platform-specific base images, keeping the application code identical across platforms.
Model optimization. TensorRT engines are tied to the GPU architecture and TensorRT version they were built with, so an engine built on an x86 workstation GPU will not run on a Jetson. We implemented a first-run optimization step that automatically builds TensorRT engines for the current platform when the service starts. This adds time to the first startup but ensures optimal performance on every platform.
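The build-on-first-run pattern boils down to a cache lookup keyed by platform. The sketch below shows only that caching logic. The actual TensorRT conversion is passed in as a caller-supplied `build_engine` function, which is a hypothetical placeholder, and keying the cache on the CPU architecture alone is a simplifying assumption (a more robust key would also include the GPU model and TensorRT version).

```python
import platform
from pathlib import Path


def engine_path_for(model_path, cache_dir, arch=None):
    """Cache location for this model on this CPU architecture (e.g. x86_64, aarch64)."""
    arch = arch or platform.machine()
    return Path(cache_dir) / f"{Path(model_path).stem}.{arch}.engine"


def ensure_engine(model_path, cache_dir, build_engine, arch=None):
    """Return a cached engine path, building the engine on first run only.

    `build_engine(model_path, engine_path)` is a caller-supplied function
    that performs the actual TensorRT conversion.
    """
    engine_path = engine_path_for(model_path, cache_dir, arch)
    if not engine_path.exists():
        engine_path.parent.mkdir(parents=True, exist_ok=True)
        build_engine(Path(model_path), engine_path)
    return engine_path
```

Storing the cache directory on a mounted volume keeps the slow first-run build from repeating every time the container is recreated.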
Performance tuning. The x86 development workstation with a discrete GPU could run all four services at high frame rates. The Jetson required careful tuning of inference resolution, frame rates, and NVBLOX voxel sizes to maintain real-time performance. We parameterized these settings through environment variables so the same codebase adapts to different hardware capabilities.
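Reading tuning parameters from the environment can be as simple as the sketch below. The variable names (`INFERENCE_WIDTH`, `INFERENCE_HEIGHT`, `MAX_FPS`, `VOXEL_SIZE_M`) and the default values are hypothetical and chosen only to illustrate the pattern.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceptionSettings:
    inference_width: int
    inference_height: int
    max_fps: float
    voxel_size_m: float


def load_settings(env=None):
    """Read tuning parameters from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return PerceptionSettings(
        inference_width=int(env.get("INFERENCE_WIDTH", "960")),
        inference_height=int(env.get("INFERENCE_HEIGHT", "576")),
        max_fps=float(env.get("MAX_FPS", "30")),
        voxel_size_m=float(env.get("VOXEL_SIZE_M", "0.05")),
    )
```

A Jetson deployment could then override only the values that matter, for example a lower `MAX_FPS` and a larger `VOXEL_SIZE_M`, in its Compose environment section while the desktop setup relies on the defaults.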
The Complete Pipeline in Action
When the system is running, here is what happens in real time:
- The operator sends a text command: "navigate to the blue box"
- The text prompt is forwarded to the NanoOWL detection service
- NanoOWL begins detecting "blue box" in every incoming camera frame
- Meanwhile, ESS continuously produces depth maps from the stereo cameras
- When a "blue box" is detected with sufficient confidence, the Goal Pose service samples the corresponding depth, computes the 3D position, and publishes a navigation goal
- Nav2 receives the goal, plans a path through the NVBLOX occupancy map, and the robot begins moving
- As the robot moves, NVSLAM updates its position, NVBLOX updates the obstacle map, and Nav2 adjusts the path
- The robot arrives at a safe distance from the blue box and stops
The entire pipeline runs in real time, with detection and goal updates happening multiple times per second. If the target object moves, the robot adjusts its goal accordingly. If a new obstacle appears in the path, Nav2 replans automatically.
Lessons Learned
Start in simulation, always. Isaac Sim saved countless hours of debugging. Issues that would take days to diagnose on physical hardware — like subtle coordinate frame misalignments — become obvious in simulation where you can visualize every transform.
Docker Compose is essential for multi-service robotics. Managing four interconnected services with their dependencies, environment variables, and network configurations would be a nightmare without Docker Compose. It also makes deployment reproducible — the same compose file that works on a development machine works on the target robot.
Zero-shot detection changes the game. NanoOWL eliminated the traditional ML pipeline of collecting data, annotating, training, and deploying. The flexibility to change target objects at runtime without any model retraining fundamentally changed how we thought about the system's capabilities.
Depth estimation quality matters more than detection quality. A slightly inaccurate bounding box still produces a usable navigation goal. But a noisy or biased depth estimate sends the robot to the wrong location entirely. Investing time in camera calibration and depth filtering paid off enormously.
ROS2 lifecycle management is your friend. ROS2's managed lifecycle nodes allowed us to implement clean startup sequences, graceful shutdown procedures, and state transitions. When a service needs to be reconfigured, it can transition through deactivate, cleanup, configure, and activate without restarting its process or container.
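In the standard ROS2 managed-node state machine, a node moves between the primary states unconfigured, inactive, active, and finalized, and reconfiguring an active node means walking it back to unconfigured and forward again. The table-driven sketch below models those transitions in plain Python to make the sequence explicit. It is a teaching model, not the rclpy lifecycle API, and it omits the intermediate transition states and error handling.

```python
# (current state, transition) -> next primary state
TRANSITIONS = {
    ("unconfigured", "configure"): "inactive",
    ("inactive", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("inactive", "cleanup"): "unconfigured",
    ("unconfigured", "shutdown"): "finalized",
    ("inactive", "shutdown"): "finalized",
    ("active", "shutdown"): "finalized",
}

# Sequence that reconfigures a running node without restarting its process.
RECONFIGURE = ["deactivate", "cleanup", "configure", "activate"]


def apply_transitions(state, transitions):
    """Walk the state machine, rejecting any transition that is not allowed."""
    for name in transitions:
        key = (state, name)
        if key not in TRANSITIONS:
            raise ValueError(f"'{name}' is not allowed from state '{state}'")
        state = TRANSITIONS[key]
    return state
```

Running `apply_transitions("active", RECONFIGURE)` returns the node to the active state, whereas skipping the cleanup step raises an error because configure is only valid from unconfigured.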
Conclusion
The Carter autonomous navigation project demonstrated that building a sophisticated autonomous robot — one that understands natural language, perceives its environment in 3D, and navigates safely — is achievable with today's tools. The combination of ROS2 for distributed system architecture, NVIDIA Isaac for perception and simulation, and Docker for deployment creates a powerful foundation for autonomous robotics.
The microservice architecture proved its value repeatedly throughout the project, enabling rapid iteration, cross-platform deployment, and robust failure handling. And the choice of zero-shot detection with NanoOWL gave the system a flexibility that traditional trained models simply cannot match.
Autonomous robotics is no longer confined to research labs. With the right architecture and engineering discipline, production-grade autonomous navigation systems can be built, tested in simulation, and deployed on edge hardware — from the same codebase.