How We Built a Medical AI System That Runs on 7GB of RAM

A deep dive into building ShyldAI — a 6-node ROS2 system for hospital operating room monitoring, deployed on NVIDIA Jetson Orin Nano with just 7GB of memory.

Edge AI, ROS2, Medical AI, Jetson, Computer Vision

Introduction

Hospital operating rooms are among the most high-stakes environments on the planet. Every second counts, every decision matters, and the margin for error is razor-thin. Continuous monitoring of OR activities — from tracking safety risks to documenting equipment usage and patient status — is critical for both real-time safety and post-operative review.

The problem? Most hospitals cannot rely on cloud connectivity for real-time AI inference. Network latency, data privacy regulations like HIPAA, and the sheer unreliability of internet connections in clinical settings make cloud-based solutions impractical. The AI system needs to run locally, right there in the operating room, processing audio and video in real time.

That was the challenge our team took on with ShyldAI: build a complete OR activity monitoring system that runs entirely on an NVIDIA Jetson Orin Nano, a device with just 7GB of memory shared between the CPU and GPU. This is the story of how we made it work.

The Challenge

The requirements were deceptively simple on paper. We needed a system that could:

  • Listen to conversations in the operating room and transcribe them in real time
  • Watch the OR through a camera feed and understand what was happening visually
  • Analyze both audio and visual data to detect safety risks, equipment issues, and notable events
  • Run 24/7 without human intervention
  • Deploy inside a Docker container on Jetson Orin Nano hardware

The constraints, however, were severe. The Jetson Orin Nano provides 7GB of unified memory shared between the CPU and GPU. To put that in perspective, the FP16 weights of a single 8-billion-parameter language model such as Llama 3 8B take roughly 16GB, more than twice what this device has in total. We needed to run a speech recognition model (Whisper), a language model (Qwen 3-4B), and a vision-language model (Qwen3-VL-2B), plus all the supporting infrastructure for audio capture, image capture, and pipeline orchestration.

Running all three models simultaneously was simply impossible. We needed a fundamentally different approach.

Architecture Design

After extensive prototyping, we settled on a 6-node distributed ROS2 architecture. ROS2 (Robot Operating System 2) might seem like an unusual choice for a hospital monitoring system, but it turned out to be the perfect fit. ROS2 provides a battle-tested framework for building distributed systems with well-defined communication patterns, lifecycle management, and fault tolerance — exactly what we needed.

Here is how the six nodes break down:

Node 1: Audio Recording

The Audio Recording node continuously captures audio from a microphone in the operating room. It buffers audio into configurable segments and publishes them as ROS2 messages. The key design decision here was implementing a circular buffer that keeps the last N seconds of audio, allowing the system to capture context even when triggered mid-conversation.
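
The circular buffer idea can be sketched in a few lines of Python. This is a minimal illustration, not the node's actual code: the class name `AudioRingBuffer`, the fixed chunk duration, and the raw-bytes chunk format are all assumptions made for the example.

```python
from collections import deque


class AudioRingBuffer:
    """Keeps only the most recent `seconds` of fixed-size audio chunks.

    Hypothetical helper: chunk size and byte format are assumptions.
    """

    def __init__(self, seconds: float, chunk_seconds: float = 0.5):
        if seconds <= 0 or chunk_seconds <= 0:
            raise ValueError("durations must be positive")
        # Number of chunks that fit in the window; at least one.
        self.max_chunks = max(1, round(seconds / chunk_seconds))
        # A deque with maxlen silently drops the oldest chunk when full.
        self._chunks = deque(maxlen=self.max_chunks)

    def push(self, chunk: bytes) -> None:
        """Append the newest chunk, evicting the oldest if the window is full."""
        self._chunks.append(chunk)

    def snapshot(self) -> bytes:
        """Return the buffered audio, oldest chunk first."""
        return b"".join(self._chunks)
```

A trigger that arrives mid-conversation can call `snapshot()` to recover the preceding seconds of audio, because the buffer always holds the latest window regardless of when the trigger fires.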

Node 2: Image Recording

The Image Recording node captures frames from the OR camera at a configurable interval. Rather than streaming video, we capture individual frames at strategic moments — when the orchestrator requests a visual snapshot or at regular intervals. This approach drastically reduces memory and compute requirements compared to continuous video processing.

Node 3: Whisper Transcription

This node loads OpenAI's Whisper model (optimized with TensorRT for the Jetson) to transcribe audio segments into text. But transcription alone was not enough — we implemented a post-processing pipeline that extracts safety-relevant information from the transcribed text. Keywords and phrases related to safety risks, equipment malfunctions, and procedural concerns are flagged automatically.
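
The keyword-flagging step might look roughly like the sketch below. The categories mirror the ones named above, but the phrase list and the function name `flag_transcript` are purely illustrative assumptions; a real watch list would need clinical review.

```python
import re

# Hypothetical watch list, for illustration only.
SAFETY_PATTERNS = {
    "safety_risk": [r"\bbleeding\b", r"\bcount is off\b"],
    "equipment": [r"\bnot working\b", r"\bmalfunction\w*\b"],
    "procedure": [r"\bwrong site\b", r"\btime[- ]out\b"],
}

# Compile once so each transcript is scanned without re-parsing patterns.
COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in SAFETY_PATTERNS.items()
}


def flag_transcript(text: str) -> dict[str, list[str]]:
    """Return matched phrases grouped by category; empty dict if none match."""
    hits: dict[str, list[str]] = {}
    for category, patterns in COMPILED.items():
        found = [m.group(0).lower() for p in patterns for m in p.finditer(text)]
        if found:
            hits[category] = found
    return hits
```

A non-empty result is the kind of signal the orchestrator could use to decide whether the heavier LLM analysis is worth loading.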

Node 4: OR Analysis (Qwen 3-4B)

The LLM analysis node takes transcribed text and performs deeper reasoning. It can identify the phase of surgery, detect communication breakdowns, flag potential safety concerns that keyword matching would miss, and generate structured summaries. We chose Qwen 3-4B because it offered the best balance between capability and memory footprint for our use case.

Node 5: VLM Analysis (Qwen3-VL-2B)

The vision-language model node processes captured images to understand the visual scene. It can detect whether a patient is present on the bed, identify the number and positions of staff, recognize equipment placement, and flag anomalies. Qwen3-VL-2B was selected specifically for its compact size — at 2 billion parameters, it could fit within our memory budget while still providing meaningful visual understanding.

Node 6: Unified Pipeline Orchestrator

The orchestrator is the brain of the system. It coordinates all other nodes, manages the processing pipeline, and makes decisions about when to invoke each model. It supports two modes: continuous processing (regular interval-based monitoring) and on-demand processing (triggered analysis for specific events). The orchestrator also handles all model lifecycle management, which brings us to the most critical engineering challenge.

The Memory Problem

With 7GB total memory and three AI models to run, we faced a fundamental resource allocation problem. Here are the approximate memory requirements:

  • Whisper (TensorRT optimized): ~1.5GB GPU memory
  • Qwen 3-4B (quantized): ~3GB GPU memory
  • Qwen3-VL-2B (quantized): ~2GB GPU memory
  • System overhead + ROS2 + Docker: ~1.5GB

The math simply did not add up for simultaneous loading: the three models plus overhead come to roughly 8GB against a 7GB budget. Even with aggressive quantization (INT4/INT8), keeping two models resident at once left as little as 0.5GB of headroom (Qwen 3-4B plus Qwen3-VL-2B), dangerously close to OOM (out-of-memory) errors and system crashes, which is unacceptable in a medical environment.
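
The arithmetic behind that conclusion can be checked with a short script. It takes the approximate figures from the list above at face value; runtime buffers such as activations or KV caches are not modeled, so real headroom would be smaller. The dictionary keys are just labels chosen for this sketch.

```python
from itertools import combinations

BUDGET_GB = 7.0    # unified memory available on the device
OVERHEAD_GB = 1.5  # system + ROS2 + Docker
MODELS_GB = {"whisper": 1.5, "qwen3_4b": 3.0, "qwen3_vl_2b": 2.0}


def headroom_gb(loaded):
    """Memory left after overhead and the given resident models."""
    return BUDGET_GB - OVERHEAD_GB - sum(MODELS_GB[name] for name in loaded)


def all_combinations():
    """Yield (models, headroom) for every non-empty set of resident models."""
    names = sorted(MODELS_GB)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            yield combo, headroom_gb(combo)


if __name__ == "__main__":
    for combo, left in all_combinations():
        status = "OK" if left > 0 else "OVER BUDGET"
        print(f"{' + '.join(combo):40s} {left:+.1f} GB  {status}")
```

Under these figures, all three models resident at once overshoot the budget by about 1GB, and the tightest pair leaves only about 0.5GB.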

Dynamic Model Lifecycle Management

The solution was dynamic model lifecycle management — a system where models are loaded into memory only when needed and unloaded immediately after use. The processing pipeline works like this:

  1. Audio Recording captures a segment
  2. The orchestrator signals Whisper to load
  3. Whisper transcribes the audio
  4. Whisper is unloaded from GPU memory
  5. If the transcription contains safety-relevant content, the orchestrator loads the LLM
  6. The LLM performs deep analysis
  7. The LLM is unloaded
  8. At the next visual checkpoint, the VLM loads for scene understanding
  9. The VLM is unloaded after processing

Each transition includes a garbage collection step and a CUDA cache clear to ensure memory is fully reclaimed before the next model loads. We also implemented health monitoring that tracks memory usage and triggers emergency cleanup if available memory drops below a threshold.
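
One way to express the load, use, unload, and cleanup cycle is a context manager. The sketch below assumes the models' GPU allocations are managed by PyTorch (hence `torch.cuda.empty_cache()`); engines allocated outside PyTorch would need their own teardown. The names `loaded_model` and `release_gpu_memory` and the zero-argument `loader` callable are hypothetical.

```python
import gc
from contextlib import contextmanager

try:  # torch is optional so the sketch also runs on a machine without it
    import torch
except ImportError:
    torch = None


def release_gpu_memory() -> None:
    """Run Python garbage collection, then release PyTorch's cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


@contextmanager
def loaded_model(loader):
    """Load a model for the duration of a `with` block, then free it.

    `loader` is a hypothetical zero-argument callable returning the model.
    """
    model = loader()
    try:
        yield model
    finally:
        # Drop this reference before collecting. The caller must not keep
        # its own reference alive, or the memory cannot be reclaimed.
        del model
        release_gpu_memory()
```

With this shape, each pipeline stage becomes `with loaded_model(load_whisper) as model: ...`, and the cleanup runs even if inference raises an exception.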

TensorRT Optimization

Raw PyTorch models would have been far too large and slow for our needs. Every model went through an optimization pass before deployment:

  • Whisper was converted using the faster-whisper library, which runs on the CTranslate2 inference engine, reducing inference time by roughly 3x compared to the original PyTorch implementation
  • Qwen 3-4B was quantized to INT4 precision, cutting its memory footprint nearly in half while maintaining acceptable output quality for our structured analysis tasks
  • Qwen3-VL-2B was optimized with FP16 precision and TensorRT graph optimizations

The optimization process was not straightforward. Vision-language models with dynamic input shapes required careful handling of the TensorRT conversion pipeline. We spent considerable time ensuring that the quantized models still produced reliable outputs for safety-critical classifications.

Processing Modes

The system supports two distinct processing modes, each designed for different operational scenarios:

Continuous Mode

In continuous mode, the system runs on a timer-based loop. Every N seconds (configurable), it captures audio and an image, runs the full inference pipeline, and logs the results. This mode is designed for routine monitoring — building a continuous record of OR activity without any human intervention.

On-Demand Mode

On-demand mode allows clinical staff or external systems to trigger a specific analysis. For example, a nurse could trigger a safety assessment at a critical moment, or an external system could request a scene understanding snapshot when an alarm fires. On-demand requests are prioritized over continuous processing to ensure rapid response times.
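
Prioritizing on-demand requests over continuous ones can be sketched with a small priority queue. How the orchestrator actually schedules work is not described here, so the class `RequestQueue`, the two priority constants, and the request names are assumptions for illustration.

```python
import heapq
import itertools

ON_DEMAND, CONTINUOUS = 0, 1  # lower number is served first


class RequestQueue:
    """Serves on-demand requests before continuous ones, FIFO within each class."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker to preserve arrival order.
        self._order = itertools.count()

    def submit(self, priority: int, name: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_request(self):
        """Pop the highest-priority request, or return None if the queue is empty."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name
```

Because only one model fits in memory at a time, a queue like this decides what runs next rather than preempting a model that is already mid-inference.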

Deployment and Reliability

The entire system is deployed as a Docker container using NVIDIA Container Toolkit on the Jetson Orin Nano. Docker gives us reproducible deployments and the ability to update the system remotely without touching the host OS.

We used Supervisor inside the container to manage the ROS2 node processes, with automatic restart policies for crash recovery. Each node has its own health check endpoint, and the orchestrator monitors all nodes for responsiveness.
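
Responsiveness monitoring is often built on heartbeats, and a minimal version is sketched below. The mechanism behind the health check endpoints is not specified above, so this is one common pattern rather than the system's own; `HeartbeatMonitor` and the injectable `clock` parameter are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time each node reported in and lists those gone quiet."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen = {}

    def beat(self, node: str) -> None:
        """Record that `node` is alive right now."""
        self._last_seen[node] = self._clock()

    def unresponsive(self) -> list[str]:
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

The orchestrator could poll `unresponsive()` on a timer and leave the actual process restart to Supervisor.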

Reliability in a medical setting is non-negotiable. The system includes:

  • Automatic model reload if inference fails
  • Graceful degradation — if the VLM fails to load, audio-only monitoring continues
  • Persistent logging to external storage for post-operative review
  • Memory watchdog that prevents OOM situations before they cause crashes
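
A memory watchdog on Linux can be as simple as reading the `MemAvailable` field of `/proc/meminfo`. The sketch below parses text in that format and compares it to a threshold; the 512MB default and the function names are assumptions, and the cleanup action itself is left out.

```python
def available_mb(meminfo_text: str) -> float:
    """Parse the MemAvailable line (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024
    raise ValueError("MemAvailable not found")


def needs_cleanup(meminfo_text: str, threshold_mb: float = 512.0) -> bool:
    """True when available memory has dropped below the (assumed) threshold."""
    return available_mb(meminfo_text) < threshold_mb


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the kernel's memory report; Linux only."""
    with open(path, encoding="ascii") as f:
        return f.read()
```

Running `needs_cleanup(read_meminfo())` before each model load gives the orchestrator a chance to free memory or skip a stage instead of hitting an OOM kill.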

What We Learned

Building ShyldAI taught our team several lessons that apply broadly to edge AI deployment:

Memory is the constraint, not compute. On the Jetson Orin Nano, the GPU is surprisingly capable — inference speed was acceptable for all our models. The bottleneck was always memory. Every architectural decision revolved around memory management.

Dynamic model loading is viable for non-real-time tasks. Loading and unloading models takes a few seconds each time. For a system where processing happens every 30-60 seconds, this overhead is perfectly acceptable. For systems requiring sub-second latency, a different approach would be needed.

ROS2 is underrated for non-robotics applications. The lifecycle management, node discovery, and message passing infrastructure that ROS2 provides saved months of development time. It is not just for robots — it is a solid distributed systems framework.

Quantization quality varies wildly. INT4 quantization worked beautifully for structured text analysis (the LLM node) but was unacceptable for visual understanding tasks where detail matters. Each model needs its own quantization strategy based on the specific task requirements.

Docker on Jetson has quirks. NVIDIA Container Toolkit works well, but memory reporting inside containers does not always match reality. We learned to monitor memory from the host level rather than relying on container-internal metrics.

Conclusion

ShyldAI runs 24/7 in hospital operating rooms today. It processes audio and camera frames from the OR, detects safety risks in real time, and builds a comprehensive record of surgical activities, all on a device smaller than a paperback book with 7GB of RAM.

Edge AI is not about having unlimited compute or massive GPU clusters. It is about being smart with constraints. Dynamic model lifecycle management, aggressive optimization with TensorRT, and careful architectural decisions made it possible to run what would typically require a cloud server on a tiny embedded device.

The healthcare industry is only beginning to tap the potential of on-device AI. As edge hardware continues to improve and models become more efficient, systems like ShyldAI will become the norm rather than the exception. The key insight is that you do not need to wait for better hardware — with the right engineering, today's edge devices are already capable of running sophisticated AI pipelines in production.