Agentic ML Operations

Agents that observe, diagnose,
and act across your ML stack.

Four purpose-built agentic tools — for training, inference, edge deployment, and regulatory compliance. Each one doesn't just alert: it reasons, explains, and takes action. Use one, use all.

LensAI — unified agent suite, live

Observe. Diagnose. Act.

Every LensAI tool runs an autonomous loop: observe what's happening, diagnose why, and act before the problem costs you. Whether you're training on a GPU cluster, serving LLM traffic, deploying to edge hardware, or proving compliance to a regulator — there's an agent built for that. Every module is fully standalone. Start with one, grow into the suite.

TrainLens

Predict failing runs early — save compute before it's wasted.

Autonomous agent for PyTorch training jobs. Watches every rank and step, diagnoses failures as they form, and terminates bad runs on its own — before they burn hours of GPU compute.

  • Step-time breakdown: compute vs. dataloader vs. barrier
  • DDP rank straggler detection across all workers
  • Gradient health tracking and divergence prediction
  • Autonomous early termination — save compute
Explore TrainLens →
EdgeLens

Monitor, secure, and retrain AI models at the edge.

Free community library for agentic edge AI monitoring. Detects model and data drift in fixed memory, decides which samples are worth sending back, and triggers targeted retraining — autonomously, on-device.

  • Model & data drift with fixed memory footprint
  • Uncertainty-based sampling — reduce transfer costs
  • Model security: prevent injection, protect memory
  • EU AI Act and EU data privacy compliant
Explore EdgeLens →
ComplianceLens

Make your AI safe, compliant, and audit-ready.

Agentic compliance and safety layer for regulated AI. Continuously probes model robustness, tracks regulatory gaps in real time, and generates audit evidence — so you're always submission-ready, not scrambling.

  • Adversarial robustness and bias testing — automated
  • Real-time compliance gap analysis across 200+ requirements
  • Traceability matrix and risk register, always up to date
  • Submission-ready reports: FDA, EU AI Act, ISO 42001
Explore ComplianceLens →
Fully modular. Each agent deploys and operates independently — no platform lock-in, no forced bundling. Drop a single agent into your existing stack, or run the full suite. Integrate via Python SDK, REST API, or drop-in hooks.
TrainLens only + EdgeLens only + ComplianceLens only = Full suite

What does LensAI save you?

Agents that act autonomously save more than engineers who react manually. Estimate what each LensAI agent recovers across your ML stack.

[Interactive savings calculator: TrainLens]
Inputs: GPU count (64 / 256 / 512 / 1K / 2K / 4K / 8K; default 512), GPU type, cluster tier.
Outputs: annual GPU spend, wasted compute per year (goodput loss), CO₂ wasted per year (tonnes).
With TrainLens (detect · diagnose · terminate): compute recovered per year (~60% via early termination), engineering time saved per year, and total saved per year (shown annually and monthly).
Goodput loss: SemiAnalysis / Nebius TCO report, Feb 2026 (2.8–10.7%). Compute recovery: ~60% via early termination. Debug savings: failures/mo = GPU-hrs ÷ MTBF; 4 hrs/failure × $180/hr eng rate.
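The debug-savings arithmetic in the note above can be written out directly. The 4 hrs/failure and $180/hr figures come from the note; the 1,000-hour MTBF default is an illustrative assumption, since the calculator leaves MTBF as an input.

```python
def debug_savings_per_year(gpu_count, mtbf_hours=1000.0,
                           hours_per_failure=4.0, eng_rate=180.0):
    """Annual engineering cost of manually debugging failed runs.

    Per the note above: failures/month = fleet GPU-hours/month / MTBF,
    then 4 hrs/failure at $180/hr. mtbf_hours is an assumed default.
    """
    gpu_hours_per_month = gpu_count * 24 * 30            # fleet GPU-hours/month
    failures_per_month = gpu_hours_per_month / mtbf_hours
    return failures_per_month * hours_per_failure * eng_rate * 12

# 512 GPUs at an assumed 1,000 h MTBF:
# 512 * 720 = 368,640 GPU-h/mo -> 368.64 failures/mo
# 368.64 * 4 * 180 * 12 = $3,185,049.60/yr of engineer time
print(debug_savings_per_year(512))
```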
[Interactive savings calculator: EdgeLens]
Inputs: edge devices (10 / 50 / 200 / 500 / 2K / 10K; default 200), data per device per month (1 / 2 / 5 / 20 / 50 GB; default 2 GB), transfer cost, retraining schedule.
Outputs: annual transfer cost, non-informative transfers (samples with no new signal), scheduled retraining cost per year (cloud compute).
With EdgeLens (sample · detect · retrain only when needed): transfer savings per year (60% reduction via uncertainty sampling), retraining savings per year (70% fewer sessions, drift-triggered only), and total saved per year.
Transfer: uncertainty-based sampling sends only samples where the model is least confident — typically 40% of raw volume. Retraining: scheduled retraining fires regardless of whether the model has drifted. EdgeLens triggers retraining only when drift is statistically detected, reducing sessions by ~70% without sacrificing model freshness.
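As a concrete illustration of the uncertainty-based sampling described above, here is a minimal entropy-threshold selector. The `select_for_transfer` helper and the 0.5-nat cutoff are hypothetical, not the EdgeLens API; in practice the threshold would be tuned so that roughly the most uncertain ~40% of samples are transferred.

```python
import math

def entropy(probs):
    """Shannon entropy of a softmax output — higher means less confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_transfer(batch_probs, threshold=0.5):
    """Return indices of samples uncertain enough to send upstream."""
    return [i for i, probs in enumerate(batch_probs)
            if entropy(probs) >= threshold]

preds = [[0.98, 0.01, 0.01],   # confident — skip
         [0.40, 0.35, 0.25],   # uncertain — transfer
         [0.90, 0.05, 0.05]]   # confident — skip
print(select_for_transfer(preds))  # → [1]
```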
[Interactive value calculator: ComplianceLens]
Inputs: regulatory framework, team size, monthly revenue at risk ($100K / $500K / $1M / $5M / $10M; default $1M).
Outputs: manual audit-prep hours, engineering cost of compliance (gap analysis + evidence + reports), revenue delayed per month (average months to certification).
With ComplianceLens (automate · accelerate · certify): engineering time freed (70% automated: evidence + gap analysis), revenue unlocked (months saved × monthly revenue), and total value (engineering + revenue).
Audit hours: FDA SaMD submissions require 600–1,400 engineer-hours depending on team size (design docs, test evidence, traceability matrices). ComplianceLens automates evidence collection, gap analysis, and report formatting — 70% reduction. Time-to-cert: FDA avg 16 months, EU AI Act 10 months. ComplianceLens continuous monitoring cuts 2–4 months off submission cycles by keeping evidence current.

TrainLens

Find out why your PyTorch training got slow, live.

TrainLens runs alongside your training loop as a step-level autonomous agent. It reads every signal — timing, memory, gradients, FSDP metrics, MFU — diagnoses failures as they form, and acts before the run crashes or wastes hours of compute.

  • Step-time breakdown: compute, dataloader wait, barrier wait, and optimizer time per step, per rank. Tells you what nvidia-smi never will.
  • DDP rank straggler detection: finds the single slow rank causing your whole DDP job to wait at the barrier. Step-level granularity, not average throughput.
  • Gradient health and divergence prediction: tracks norm trends, flags gradient spikes early, and predicts runs heading toward NaN loss.
  • Autonomous early termination: predicts failure at step 800, terminates cleanly, and saves 6 hours of A100 time before the inevitable crash.
<1% training overhead on H100
2 min to first useful diagnosis
60% goodput loss recovered
Usage
import trainlens

# Wrap your training loop — no other changes needed
with trainlens.agent() as lens:
    for epoch in range(num_epochs):
        for batch in dataloader:
            loss = train_step(batch)
            lens.step(loss=loss)

# Watches every step. Diagnoses failures.
# Terminates the run early if needed.
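The divergence prediction inside that loop can be approximated with a simple spike detector on gradient norms. `DivergenceWatch`, its EMA smoothing, and the 10× spike factor are illustrative assumptions, not TrainLens internals.

```python
class DivergenceWatch:
    """Sketch of divergence prediction: flag a run trending toward NaN
    when the gradient norm jumps far above its exponential moving average."""

    def __init__(self, alpha=0.1, spike_factor=10.0):
        self.alpha = alpha                # EMA smoothing factor
        self.spike_factor = spike_factor  # how large a jump counts as a spike
        self.ema = None

    def update(self, grad_norm):
        """Return True if this step looks like the start of divergence."""
        if grad_norm != grad_norm:        # NaN check — already diverged
            return True
        if self.ema is None:
            self.ema = grad_norm
            return False
        diverging = grad_norm > self.spike_factor * self.ema
        self.ema = (1 - self.alpha) * self.ema + self.alpha * grad_norm
        return diverging

watch = DivergenceWatch()
norms = [1.0, 1.1, 0.9, 1.0, 25.0]        # sudden spike at the last step
print([watch.update(n) for n in norms])   # → [False, False, False, False, True]
```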
PyTorch 2.5+ · DDP · FSDP · Pipeline Parallel · HuggingFace · Lightning · DeepSpeed · A100 / H100 / H200

EdgeLens

Data & Model Observability for Edge Devices.

EdgeLens is a free, community library for monitoring, securing, and retraining AI models deployed on edge hardware. All key metrics are computed on-device with a fixed, bounded memory footprint — no cloud dependency, no surprise data-transfer costs.

  • Model & data drift detection: detects distribution shift on-device using memory-efficient sketches. Space complexity O((1/ε) log(εN)) vs. classical logging at O(N) — scales to long-running, memory-constrained processes.
  • Intelligent retraining sampling: a wide range of built-in techniques to sample data where the model is most uncertain. Reduces data-transfer costs significantly and keeps your model updated with the data that matters.
  • Model security scanning: prevents malicious code injection, ensures model consistency, and protects model memory from tampering — before deployment and at runtime.
  • EU AI Act & privacy compliant: monitoring and observability designed to meet EU AI Act requirements and EU data-privacy standards out of the box, from Berlin.
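A minimal sketch of fixed-memory drift detection, assuming a simple mean-shift test: Welford's algorithm maintains running statistics in O(1) memory, and a z-style threshold flags drift against a training-time baseline. `RunningStats`, `drift_detected`, and the 3-sigma rule are illustrative, not EdgeLens's actual sketch-based test.

```python
class RunningStats:
    """O(1)-memory running mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def var(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def drift_detected(baseline_mean, live, k=3.0):
    """Flag drift when the live mean strays k standard errors from baseline.
    The 3-sigma cutoff is an illustrative choice, not EdgeLens's test."""
    if live.n < 2:
        return False
    stderr = (live.var / live.n) ** 0.5
    return abs(live.mean - baseline_mean) > k * max(stderr, 1e-12)

live = RunningStats()
for x in [5.1, 5.0, 4.9, 5.2, 5.0]:       # stream shifted far from baseline 0.0
    live.update(x)
print(drift_detected(0.0, live))  # → True
```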
O(1) memory — fixed, predictable, always
Free community edition on GitHub
C++ & Python profiler APIs included
Python profiler — during training
from edgelens import Profiler

# Capture base profiles during training
profiler = Profiler()
profiler.record(model, dataloader)
profiler.export("baseline.elp")
C++ — edge device inference
// Integrate into your edge inference code
EdgeLens::Monitor monitor("baseline.elp");
EdgeLens::UncertaintySampler sampler;

auto result = model.infer(input);
monitor.record(input, result);

// Drift detected → collect uncertain samples for retrain
if (monitor.drift_detected()) {
    sampler.collect(input, result);
}
Python 3.8+ · C++17 · ONNX Runtime · TFLite · OpenVINO · NVIDIA Jetson · Raspberry Pi · STM32

ComplianceLens

AI safety, adversarial testing, and regulatory compliance — in one place.

ComplianceLens transforms your AI development process into a compliant, audit-ready pipeline. It covers adversarial robustness testing, real-time gap analysis, and traceability across global frameworks — for any regulated domain, with a specific depth for healthcare AI (FDA SaMD, EU AI Act, ISO 42001, GMLP).

  • Adversarial robustness & bias testing: automatically stress-test your AI models against adversarial scenarios, distribution shifts, and demographic bias. Surface failures before they reach production — or a regulator.
  • Real-time compliance gap analysis: mapped against 200+ requirements across FDA, EU AI Act, ISO 42001, and NIST AI RMF. See exactly what is missing — documentation, test coverage, traceability — so you are always audit-ready.
  • Model observability across the lifecycle: detect silent model drift, monitor post-deployment behavior, and maintain a continuous evidence trail. Built for EU AI Act post-market monitoring obligations and FDA predetermined change control plans.
  • Submission-ready reports, one click: export traceability matrices, risk logs, and validation summaries formatted for FDA pre-submissions, CE mark technical files, and ISO 42001 certification audits.
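The gap-analysis idea above reduces to mapping each requirement to its evidence and reporting coverage. The requirement IDs, file names, and `coverage` helper below are hypothetical illustrations, not ComplianceLens's data model.

```python
# Hypothetical requirement → evidence mapping; empty evidence means a gap.
requirements = {
    "FDA-SaMD-01": ["design_doc.pdf"],
    "FDA-SaMD-02": [],                    # no evidence yet → gap
    "EUAI-ART10-1": ["bias_report.html"],
    "ISO42001-6.1": [],
}

def coverage(reqs):
    """Percent of requirements with evidence, plus the list of gaps."""
    covered = [r for r, ev in reqs.items() if ev]
    gaps = [r for r, ev in reqs.items() if not ev]
    return 100 * len(covered) / len(reqs), gaps

pct, gaps = coverage(requirements)
print(f"{pct:.0f}% covered, gaps: {gaps}")
# → 50% covered, gaps: ['FDA-SaMD-02', 'ISO42001-6.1']
```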
44% of FDA medical-AI recalls involve undocumented model behavior
60% of hospital AI pilots fail due to compliance gaps, not model performance
200+ requirements across global AI regulations to track simultaneously
Compliance Dashboard
  FDA SaMD: 78% covered
  EU AI Act: 61% covered
  ISO 42001: 45% covered
  NIST AI RMF: 29% covered
  3 critical gaps detected: bias testing evidence missing • post-market plan not documented • risk register incomplete
FDA SaMD · EU AI Act · ISO 42001 · NIST AI RMF · GMLP · CHAI · Jira · GitHub · NVIDIA Inception · CMU CyLab
Who it's for
AI / ML teams — embed compliance directly in your workflow
Founders & product leaders — de-risk pilots with built-in governance
Quality & regulatory teams — replace spreadsheets with real-time traceability
Clinical innovation teams — pilot faster, document better, scale confidently

Pick the tool your team needs today.

Each tool is independent. Use one, two, or all three. No platform lock-in. No shared data between products unless you want it.

TrainLens → EdgeLens community → ComplianceLens demo