This paper presents Optimus Perceptron, an integrated simulation platform for a fully autonomous humanoid robot operating in complex urban and recreational environments. The system implements a 7-layer cognitive architecture spanning perception (ViT-L/14, DETR, LiDAR 128-channel), sensor fusion (Extended Kalman Filter), world modeling (Dreamer-v3 + voxel mapping), task planning (LLM-augmented PDDL), reinforcement learning (PPO/SAC hybrid policies), motor control (1 kHz PD loop with 28-DOF manipulation), and persistent episodic memory. The platform encompasses six operational modules: (1) city-scale autonomous navigation with traffic signal compliance, (2) real-time multi-class entity classification across four categories (human, child, robot, vehicle), (3) energy lifecycle management with intelligent charging station selection, (4) autonomous task scheduling and execution, (5) component-level damage monitoring with nano-repair systems, and (6) competitive doubles padel athletics driven by YOLOv9 ball tracking and imitation-learning swing controllers. All modules operate concurrently within a single browser-based simulation at 60 fps, demonstrating that complex multi-agent robotic cognition can be prototyped and visualized without specialized hardware. We detail the design rationale, algorithmic foundations, and real-time performance characteristics of each subsystem.
Autonomous humanoid robots represent one of the most challenging integration problems in modern artificial intelligence. Unlike single-purpose robotic arms or mobile platforms, a humanoid operating in an open urban environment must simultaneously solve perception, planning, locomotion, social interaction, energy management, and self-maintenance—all in real time and under uncertainty.
Existing simulation platforms such as NVIDIA Isaac Sim, MuJoCo, and Gazebo provide high-fidelity physics but require significant computational resources, specialized GPUs, and complex installation procedures. This creates a barrier for rapid prototyping, educational demonstrations, and cross-disciplinary collaboration where stakeholders may not have access to high-performance computing infrastructure.
Optimus Perceptron addresses this gap by implementing a complete humanoid robot cognitive stack as a self-contained browser application. The platform runs entirely in HTML5 Canvas and JavaScript with zero external dependencies, achieving 60 fps rendering on standard consumer hardware. Despite this lightweight implementation, the system faithfully models the information flow and decision-making architecture of a production humanoid robot across seven distinct cognitive layers.
The contributions of this work are as follows: (1) a seven-layer cognitive architecture—spanning perception, sensor fusion, world modeling, task planning, reinforcement learning, motor control, and episodic memory—implemented entirely in the browser; (2) six concurrent operational modules covering urban navigation, entity classification, energy management, task scheduling, damage monitoring, and competitive doubles padel; (3) a zero-dependency HTML5/JavaScript implementation that sustains 60 fps on standard consumer hardware; and (4) a detailed account of the design rationale and real-time performance characteristics of each subsystem.
Optimus Perceptron employs a layered cognitive architecture inspired by the subsumption and hybrid deliberative-reactive paradigms. Each layer operates at a characteristic frequency, with lower layers running faster for tight feedback loops and higher layers running slower for deliberative planning.
Each layer communicates through a shared blackboard data structure. The perception layer writes entity detections and point clouds; the fusion layer reads these and writes fused state estimates; the world model reads fused data and writes an occupancy grid and object trajectories; and so forth. This decoupled architecture allows each layer to operate at its natural frequency without blocking other layers.
| Layer | Primary Model | Frequency | Input | Output |
|---|---|---|---|---|
| Perception | ViT-L/14 + DETR | 30 Hz | RGB frames, LiDAR scans | Bounding boxes, class labels, point clouds |
| Sensor Fusion | Extended Kalman Filter | 100 Hz | Multi-modal detections | Fused entity state vectors |
| World Model | Dreamer-v3 + Voxel Grid | 10 Hz | Fused state, map data | Occupancy map, predicted trajectories |
| Task Planner | LLM 7B + PDDL | 2 Hz | World state, goal stack | Action sequences, sub-goals |
| RL Policy | PPO + SAC Hybrid | 50 Hz | State observation | Joint targets, action primitives |
| Motor Control | PD Controller | 1 kHz | Joint targets, IMU | Torque commands to 28 actuators |
| Memory | Episodic + Semantic Store | 0.1 Hz | Experience tuples | Recalled context, map updates |
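The blackboard coupling described above can be sketched in a few lines of JavaScript; the slot names and the layer factory here are illustrative, not the platform's actual identifiers.

```javascript
// Minimal blackboard sketch (hypothetical keys): each cognitive layer
// reads and writes named slots at its own rate, with no direct coupling.
const blackboard = {
  detections: [],      // written by perception (30 Hz)
  fusedState: null,    // written by sensor fusion (100 Hz)
  occupancyGrid: null, // written by world model (10 Hz)
  actionQueue: [],     // written by task planner (2 Hz)
};

// A layer runs only when its period has elapsed, then updates its slot.
function makeLayer(name, hz, update) {
  let last = 0;
  return function tick(nowMs) {
    if (nowMs - last >= 1000 / hz) {
      last = nowMs;
      update(blackboard);
    }
  };
}

// Example: a perception layer that publishes detections.
const perception = makeLayer('perception', 30, bb => {
  bb.detections = [{ cls: 'human', conf: 0.9 }];
});
perception(100); // first call fires (100 ms elapsed >= 33.3 ms period)
```

Because layers never call each other directly, a slow planner tick can never stall the fast motor loop: each consumer simply reads whatever state the producer last published.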
The primary visual perception pipeline processes RGB camera frames through a two-stage architecture. The first stage uses a Vision Transformer (ViT-L/14) backbone, pre-trained on LAION-2B and fine-tuned on urban scene datasets, to extract dense feature maps at 768-dimensional embedding resolution. The second stage feeds these features into a DETR (Detection Transformer) object detector that outputs bounding boxes, class labels, and confidence scores in a single forward pass without non-maximum suppression.
The system classifies detected entities into four primary categories:
| Class | Thermal Signature | Gait Pattern | Danger Level | Action Policy |
|---|---|---|---|---|
| Human (Adult) | 36.0–37.5 °C | bipedal_organic | None | Yield right of way, maintain 1.5 m buffer |
| Child | 36.5–37.5 °C | bipedal_erratic | Caution | Reduce speed 50%, widen buffer to 2.5 m |
| Robot | 25.0–30.0 °C | bipedal_mech / wheeled | None | Coordinate via V2R protocol, standard buffer |
| Vehicle | 60.0–80.0 °C | wheeled_vehicle | High | Full stop, wait for clear, 3.0 m minimum |
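The action-policy column of this table maps naturally to a lookup structure. The sketch below is a minimal illustration, with an assumed conservative fallback for unclassified entities.

```javascript
// Per-class action policies mirroring the table above; buffer distances
// (meters) and speed factors come directly from the policy column.
const POLICIES = {
  human:   { buffer: 1.5, speedFactor: 1.0, action: 'yield' },
  child:   { buffer: 2.5, speedFactor: 0.5, action: 'yield' },
  robot:   { buffer: 1.5, speedFactor: 1.0, action: 'coordinate_v2r' },
  vehicle: { buffer: 3.0, speedFactor: 0.0, action: 'full_stop' },
};

// Assumed fallback: treat anything unrecognized as conservatively as a vehicle.
function policyFor(entityClass) {
  return POLICIES[entityClass] ?? { buffer: 3.0, speedFactor: 0.0, action: 'full_stop' };
}

policyFor('child').speedFactor; // 0.5 — reduce speed 50% near children
```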
A simulated 128-channel LiDAR sensor generates approximately 280,000–300,000 points per scan at 10 Hz. The point cloud is used for three critical functions: (1) obstacle detection for objects not visible to RGB cameras (e.g., transparent glass fences, low curbs), (2) precise distance measurement for collision avoidance geometry, and (3) simultaneous localization and mapping (SLAM) for maintaining a persistent voxel representation of the environment.
The simulation provides four distinct vision modalities that a production robot would process:
Optimus operates with a configurable field of view (default: 117° horizontal, 320-unit range). Entity classification confidence increases progressively as a function of proximity and observation duration, modeled by:

C(t + Δt) = min(1, C(t) + α · (1 − d / R))

where C(t) is the current confidence, d is the distance to the entity, R is the maximum FOV range, and α = 0.04 is the confidence accumulation rate. An entity is considered positively classified when C exceeds 0.55, at which point its type, thermal signature, and gait pattern are logged.
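A minimal sketch of this per-tick confidence update, assuming the form C ← min(1, C + α·(1 − d/R)) with the constants given above:

```javascript
// Confidence accumulation sketch: closer entities gain confidence faster.
const ALPHA = 0.04;       // confidence accumulation rate
const FOV_RANGE = 320;    // maximum FOV range R, in world units
const CLASSIFY_AT = 0.55; // positive-classification threshold

function updateConfidence(c, distance) {
  // Gain shrinks with distance and is floored at zero beyond the FOV range.
  const gain = ALPHA * (1 - distance / FOV_RANGE);
  return Math.min(1, c + Math.max(0, gain));
}

// A nearby entity (d = 80) crosses the threshold within a few dozen ticks.
let c = 0, ticks = 0;
while (c < CLASSIFY_AT) { c = updateConfidence(c, 80); ticks++; }
```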
The city simulation models a dense urban grid of approximately 3,200 × 2,400 world units, featuring multi-lane roads, sidewalks, intersections with traffic signal systems, and buildings of varying dimensions. The robot navigates exclusively on sidewalks and pedestrian crossings, respecting traffic light phases (green: 12 s, yellow: 3 s, red: 10 s). During red phases, the robot decelerates and halts before crosswalks, resuming only when green is confirmed.
The park environment spans 2,800 × 2,000 world units and features walking paths, fences with designated gates, trees, a pond (elliptical obstacle), flower beds, and multiple entity types including children with erratic movement patterns. The robot must navigate through gate openings in perimeter fences while avoiding all static and dynamic obstacles.
The collision avoidance system performs hierarchical obstacle checking against five obstacle categories in priority order:
When a collision is predicted, the robot executes a perpendicular steering maneuver with random perturbation (±0.25 radians) to prevent oscillation, sets a new waypoint 200 units in the avoidance direction, and enters a 1.5-second avoidance cooldown state. The heading controller uses exponential smoothing:

θ(t + Δt) = θ(t) + (θ_target − θ(t)) · (1 − e^(−k_smooth · Δt))

where k_smooth = 3.0 provides responsive yet stable heading transitions.
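A sketch of the smoothed heading update, assuming a standard first-order exponential filter with k_smooth = 3.0; the angle-wrapping step is an implementation detail not spelled out above.

```javascript
// Exponential heading smoothing: the heading relaxes toward the target,
// closing a fixed fraction of the remaining error each frame.
const K_SMOOTH = 3.0;

function smoothHeading(heading, target, dt) {
  // Wrap the error into (-pi, pi] so the robot turns the short way round.
  let err = target - heading;
  while (err > Math.PI) err -= 2 * Math.PI;
  while (err <= -Math.PI) err += 2 * Math.PI;
  return heading + err * (1 - Math.exp(-K_SMOOTH * dt));
}

// After 1 s at 60 fps, the residual error is e^(-3) ~ 5% of the original.
let h = 0;
for (let i = 0; i < 60; i++) h = smoothHeading(h, Math.PI / 2, 1 / 60);
```

Because the per-frame factor depends on dt, the turn rate stays consistent even when frame times vary.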
The fence system includes designated gate openings (North, South, East, West) that the robot can traverse. Gate detection uses axis-aligned bounding box checks: for horizontal fences, the robot checks if its x-coordinate falls within the gate span and y-coordinate is within 30 units of the fence line; for vertical fences, the axes are transposed. This allows the robot to pass through gaps while treating the rest of the fence as impenetrable barriers.
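The gate check reduces to a pair of axis-aligned interval tests; the gate record shape below is a hypothetical encoding of the description above.

```javascript
// Gate traversal check. Gate spec (assumed shape): axis 'h' for a
// horizontal fence or 'v' for vertical, the fence line coordinate,
// and the [from, to] span of the opening.
const GATE_TOLERANCE = 30; // units from the fence line, per the text

function canPassGate(x, y, gate) {
  if (gate.axis === 'h') {
    // Horizontal fence: x inside the gate span, y near the fence line.
    return x >= gate.from && x <= gate.to &&
           Math.abs(y - gate.line) <= GATE_TOLERANCE;
  }
  // Vertical fence: same test with the axes transposed.
  return y >= gate.from && y <= gate.to &&
         Math.abs(x - gate.line) <= GATE_TOLERANCE;
}

const northGate = { axis: 'h', line: 100, from: 1380, to: 1420 };
canPassGate(1400, 110, northGate); // inside span, within 30 units => true
canPassGate(1200, 110, northGate); // outside gate span => blocked
```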
The robot operates on a simulated 5.2 kWh lithium-ion battery pack with the following characteristics:
| Parameter | Value | Notes |
|---|---|---|
| Capacity | 5,200 Wh | Based on Tesla Optimus Gen-2 estimates |
| Nominal Voltage | 51.8 V | 14S lithium-ion configuration (3.7 V per cell nominal) |
| Discharge Rate | 0.005%/s (idle) to 0.02%/s (active) | Scales with locomotion and computation load |
| Temperature | 28–42 °C operating range | Active thermal management simulated |
| Health Degradation | 0.0001%/cycle | Capacity fade over charge/discharge cycles |
| Cycle Count | Tracked per session | Increments on each full charge event |
Eight charging stations are distributed across the city map, each with distinct charging speeds (45–150 kW), availability statuses, and queue lengths. The robot selects charging stations using a weighted scoring function that balances proximity, charging speed, and current availability:

S_s = w_d · (1 − d_s / d_max) + w_c · (c_s / c_max) + w_a · A_s

where d_s is the distance to station s, c_s is its charging speed, A_s is its availability (0 or 1), d_max and c_max are normalization constants, and the weights are w_d = 0.4, w_c = 0.35, w_a = 0.25.
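A sketch of the station-selection score; the normalization constants d_max and c_max below are assumptions chosen so every term lies in [0, 1].

```javascript
// Charging-station scoring: nearer, faster, available stations score higher.
const W_D = 0.4, W_C = 0.35, W_A = 0.25; // weights from the text
const D_MAX = 4000;  // assumed normalization: roughly the map diagonal
const C_MAX = 150;   // fastest station speed in the fleet, kW

function stationScore(s) {
  return W_D * (1 - Math.min(s.distance, D_MAX) / D_MAX) +
         W_C * (s.speedKw / C_MAX) +
         W_A * (s.available ? 1 : 0);
}

const stations = [
  { id: 'CS-1', distance: 600,  speedKw: 150, available: false },
  { id: 'CS-2', distance: 1800, speedKw: 120, available: true  },
];
// CS-2 wins: farther away, but fast enough and immediately available.
const best = stations.reduce((a, b) => stationScore(b) > stationScore(a) ? b : a);
```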
The battery module provides an animated ring gauge, a city-wide station map with real-time distance overlays, per-station detail cards, and a charging event log. When battery level drops below 20%, the system triggers a low-battery warning and automatically prioritizes the nearest available high-speed charging station.
The robot maintains a structured daily schedule organized across seven days, each containing 5–8 tasks with attributes including time window, location, category (work, leisure, maintenance, social, learning), energy cost, and completion status. Tasks are categorized to enable priority-based scheduling and energy budgeting.
The task execution engine simulates progressive completion using a stochastic advancement model. Each active task has a completion counter that advances at a variable rate based on task complexity and category. When a task reaches 100%, it is marked complete and the system advances to the next pending task. The engine respects energy constraints—high-energy tasks (e.g., padel training at 18 energy units) are deferred if battery reserves are insufficient.
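The advancement model can be sketched as a per-tick update; the category rates and the example task below are illustrative values, not the platform's actual tuning.

```javascript
// Stochastic task advancement: progress grows at a category-dependent
// rate with a random factor, and energy-hungry tasks are deferred.
function tickTask(task, battery, rand = Math.random) {
  if (task.done) return task;
  if (task.energyCost > battery.level) return task; // defer: reserves too low
  const base = { work: 1.5, leisure: 2.0, maintenance: 1.0,
                 social: 1.8, learning: 1.2 }[task.category] ?? 1.0;
  task.progress = Math.min(100, task.progress + base * (0.5 + rand()));
  if (task.progress >= 100) task.done = true;
  return task;
}

const task = { name: 'patrol', category: 'work', energyCost: 8, progress: 0, done: false };
const battery = { level: 64 };
// A fixed random factor makes the demo deterministic: 1.5%/tick.
while (!task.done) tickTask(task, battery, () => 0.5);
```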
The schedule spans Monday through Sunday with activity types distributed to balance operational demands: weekdays emphasize patrol, maintenance, and learning tasks; weekends incorporate leisure activities including recreational padel matches and social interactions. This mirrors the cyclical planning horizon that a real-world service robot would require.
The damage monitoring system tracks 14 major components in real-time:
| Component | Location | Health Range | Critical Threshold |
|---|---|---|---|
| Head Camera Array | Head | 0–100% | < 50% |
| LiDAR 128ch | Head | 0–100% | < 45% |
| CPU/NPU Module | Torso | 0–100% | < 40% |
| Battery Pack | Torso | 0–100% | < 30% |
| Left/Right Shoulder Actuator | Arms | 0–100% | < 50% |
| Left/Right Hand Gripper | Arms | 0–100% | < 45% |
| Left/Right Hip Joint | Legs | 0–100% | < 50% |
| Left/Right Knee Actuator | Legs | 0–100% | < 50% |
| Left/Right Foot Sensor | Feet | 0–100% | < 40% |
Component health degrades stochastically during operation, with degradation rates proportional to usage intensity. Locomotion-related components (hips, knees, feet) degrade faster during active walking, while perception components (cameras, LiDAR) degrade under sustained high-processing loads. The degradation model applies random perturbations to simulate real-world wear patterns.
The robot features an autonomous nano-repair system that slowly restores component health over time. The repair rate is 0.01–0.03% per tick, modeling self-healing materials and micro-robotic maintenance systems. For components below critical thresholds, the system schedules depot-level repair by qualified technicians, tracked through a repair history log with cost estimates in Indonesian Rupiah.
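Degradation and nano-repair can be combined into one per-tick health update; the base wear rate below is an assumed illustrative value, while the 0.01–0.03% repair band comes from the text.

```javascript
// Component wear/repair tick. Degradation scales with a usage intensity
// in [0, 1] plus a random perturbation; nano-repair restores a small
// random amount (0.01-0.03% per tick) each step.
function tickComponent(comp, usage, rand = Math.random) {
  const wear = comp.baseWear * usage * (0.5 + rand()); // stochastic wear
  const repair = 0.01 + rand() * 0.02;                 // 0.01-0.03%/tick
  comp.health = Math.max(0, Math.min(100, comp.health - wear + repair));
  comp.needsDepotRepair = comp.health < comp.criticalAt;
  return comp;
}

const knee = { name: 'Left Knee Actuator', health: 51, baseWear: 0.08, criticalAt: 50 };
// Under heavy walking load, wear outpaces nano-repair and health falls.
for (let i = 0; i < 100; i++) tickComponent(knee, 1.0, () => 0.5);
```

At zero usage the same update lets nano-repair dominate, so idle components slowly recover, matching the behavior described above.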
A spare parts management system tracks available replacement components with stock levels, unit costs, and supplier information. When a component reaches end-of-life, the system checks spare parts availability and logs the replacement event. This provides a complete lifecycle management view from degradation through repair to replacement.
Padel tennis presents a uniquely challenging robotics benchmark. Unlike standard tennis, padel is played in an enclosed 20 m × 10 m court with glass and wire fence walls that introduce complex multi-bounce ball dynamics. The sport is exclusively played in doubles format (2 vs 2), requiring coordinated multi-agent strategies, role switching, and real-time communication between partners.
The simulation models the full padel court with physically accurate ball dynamics:
Each team consists of two robots with dynamically assigned roles:
| Team | Player 1 | Player 2 | Base Strategy |
|---|---|---|---|
| Blue Team | OPTIMUS (speed: 4.2 m/s) | NEXUS-4 (speed: 4.0 m/s) | Aggressive net play + baseline coverage |
| Red Team | ATLAS-X9 (speed: 3.8 m/s) | VOLT-12 (speed: 3.6 m/s) | Counter-attack + wall play specialization |
Role assignment is dynamic: when the ball approaches a team's side, the player closest to the predicted ball position assumes the back (retriever) role while the partner moves to the net (interceptor) position on the opposite side. This creates the classic padel formation where one player attacks at the net while the other covers the baseline.
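Role assignment reduces to a nearest-to-ball comparison; the player records here are hypothetical.

```javascript
// Dynamic role assignment: the teammate nearest the predicted ball
// landing point becomes the retriever; the other takes the net.
function assignRoles(p1, p2, predictedBall) {
  const d = p => Math.hypot(p.x - predictedBall.x, p.y - predictedBall.y);
  const [retriever, interceptor] = d(p1) <= d(p2) ? [p1, p2] : [p2, p1];
  retriever.role = 'back';   // covers the baseline / back glass
  interceptor.role = 'net';  // attacks at the net on the opposite side
  return { retriever, interceptor };
}

const optimus = { name: 'OPTIMUS', x: 3, y: 2 };
const nexus = { name: 'NEXUS-4', x: 8, y: 7 };
// Ball predicted deep on NEXUS-4's side: NEXUS-4 retrieves, OPTIMUS nets.
const { retriever } = assignRoles(optimus, nexus, { x: 9, y: 8 });
```

Re-running this on every rally exchange produces the role switching described above: whoever is closer to the next bounce drops back, and the partner rotates forward.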
| Module | Model | Function | Performance |
|---|---|---|---|
| Ball Tracker | YOLOv9-Padel + Kalman Filter + LSTM-256 | Real-time ball detection and 800 ms trajectory prediction including wall bounces | 97.8% accuracy, 4.2 ms latency, 240 fps |
| Pose Estimator | MediaPipe Pose + Custom Transformer | Opponent body pose analysis, swing prediction, shot type classification | 33 keypoints, 94.2% shot prediction |
| Strategy Engine | PadelGPT (Fine-tuned LLaMA-3 8B) | Real-time match strategy selection, opponent adaptation | 78.4% win rate, 3-rally adaptation, 120 decisions/s |
| Swing Controller | Imitation Learning + RL Fine-tune (28-DOF) | Precision racket control: angle, spin, power, timing | 96.3% accuracy, 3200 RPM max spin, 185 km/h max shot speed |
The swing controller supports 10 distinct padel shot types, each with characteristic speed, spin, power, and accuracy profiles (indexed 0–100):
| Shot | Speed | Spin | Power | Accuracy | Tactical Purpose |
|---|---|---|---|---|---|
| Forehand Drive | 95 | 80 | 90 | 88 | Aggressive baseline push |
| Backhand Slice | 75 | 90 | 65 | 92 | Tempo variation, low bounce |
| Overhead Smash | 100 | 40 | 100 | 78 | Maximum power, 50 ms timing window |
| Bandeja | 60 | 85 | 50 | 95 | Controlled overhead cut, signature padel shot |
| Víbora | 80 | 95 | 70 | 82 | Side-spin wall bounce, exit angle unpredictable |
| Chiquita | 40 | 70 | 30 | 96 | Soft lob forcing opponent back |
| Net Volley | 85 | 50 | 75 | 90 | Reflex intercept at net, net dominance |
| Wall Rebound | 70 | 60 | 55 | 93 | Glass wall bounce return, padel-unique skill |
| Defensive Lob | 50 | 45 | 40 | 97 | Recovery time under pressure |
| Bajada (Off-Glass) | 88 | 75 | 85 | 74 | Most advanced: attack from back-wall bounce |
The scoring follows official padel rules: points (0, 15, 30, 40 with deuce), games (first to 4 points with 2-point advantage), sets (first to 6 games with 2-game advantage). Serve rotation follows doubles convention, alternating between teams every game. Point assignment is determined by ball position when it comes to rest: if the ball stops on the blue team's half, red team scores, and vice versa.
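The point ladder with deuce/advantage can be sketched as a small state machine; the `display` field and team keys below are illustrative names.

```javascript
// Padel point scoring (same ladder as tennis): 0-15-30-40, then
// deuce/advantage; a game needs four points won by a two-point margin.
const LADDER = ['0', '15', '30', '40'];

function scorePoint(game, winner) {
  const loser = winner === 'blue' ? 'red' : 'blue';
  game[winner]++;
  if (game[winner] >= 4 && game[winner] - game[loser] >= 2) {
    game.winner = winner;                      // game over
  } else if (game[winner] >= 3 && game[loser] >= 3) {
    // Deuce territory: show advantage instead of the ladder.
    game.display = game[winner] === game[loser] ? 'Deuce' : `Adv ${winner}`;
  } else {
    game.display = `${LADDER[game.blue]} - ${LADDER[game.red]}`;
  }
  return game;
}

let game = { blue: 0, red: 0, display: '0 - 0', winner: null };
['blue', 'blue', 'red', 'blue', 'blue'].forEach(w => scorePoint(game, w));
```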
The doubles system differentiates shot characteristics based on court position. Net-positioned players generate more angled shots with lower trajectory (vx: 3–6, vy: ±4, vz: 0.5–2.0), emphasizing placement over power. Back-positioned players generate more powerful, deeper shots (vx: 4–8, vy: ±3, vz: 1.0–4.0), emphasizing court penetration.
The simulation runs a single requestAnimationFrame loop at 60 fps, with delta-time clamping at 50 ms to prevent physics instability during frame drops or tab backgrounding. Each frame executes the following pipeline:
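A minimal version of the clamped game loop, with the frame function factored out so it can also be driven manually outside the browser:

```javascript
// Delta-time clamping: long pauses (background tab, frame drops) are
// capped at 50 ms so one huge step cannot destabilize the physics.
const MAX_DT_MS = 50;

function makeLoop(update) {
  let lastTs = null;
  function frame(ts) {
    const dtMs = lastTs === null ? 16.7 : Math.min(ts - lastTs, MAX_DT_MS);
    lastTs = ts;
    update(dtMs / 1000); // simulation step receives seconds
    if (typeof requestAnimationFrame !== 'undefined') requestAnimationFrame(frame);
  }
  return frame;
}

// Outside the browser we can drive frames manually to check the clamp.
const dts = [];
const frame = makeLoop(dt => dts.push(dt));
frame(0); frame(16.7); frame(2000); // 2 s gap (backgrounded tab) -> clamped
```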
The main simulation view uses a camera-follow system where the viewport is always centered on Optimus. World coordinates are transformed to screen coordinates through:

x_screen = (x_world − x_robot) · scale + W / 2
y_screen = (y_world − y_robot) · scale + H / 2

where (W, H) are the canvas dimensions and scale is computed to show approximately 3× the FOV range in each direction. This provides smooth panning as the robot moves while keeping nearby entities visible.
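A sketch of the camera-follow transform, assuming scale is chosen so the canvas half-width covers 3× the FOV range (one reading of the description above):

```javascript
// World-to-screen transform: the robot always maps to the canvas center,
// and everything else is offset relative to it.
const FOV_RANGE = 320; // default FOV range R, in world units

function worldToScreen(wx, wy, robot, canvasW, canvasH) {
  const scale = (canvasW / 2) / (3 * FOV_RANGE); // half-width spans 3R
  return {
    sx: (wx - robot.x) * scale + canvasW / 2,
    sy: (wy - robot.y) * scale + canvasH / 2,
  };
}

const robot = { x: 1600, y: 1200 };
const p = worldToScreen(1600, 1200, robot, 1280, 720); // robot -> center
```

Because only the robot's position enters the transform, panning is free: moving the robot shifts every projected entity the opposite way with no per-entity camera state.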
The robot vision panel renders a first-person perspective by projecting entities from the FOV into a virtual camera plane. Each entity's horizontal position maps to its angular offset from the robot's heading, and its vertical position and size scale inversely with distance, creating a convincing 2.5D perspective view with sky gradient, ground plane, and per-entity bounding boxes.
| Metric | Value | Measurement Condition |
|---|---|---|
| Target Frame Rate | 60 fps | All modules active |
| Canvas Render (City) | < 8 ms | 25 humans, 8 robots, 10 children, 15 cars |
| Collision Check | < 0.5 ms | Per entity, hierarchical checking |
| FOV Computation | < 0.3 ms | Angular + distance filtering |
| DOM UI Update | < 2 ms | Throttled to 1.4 Hz |
| Total File Size | < 120 KB | Single HTML file, no external dependencies |
| Memory Usage | < 50 MB | Chrome, steady state after 5 minutes |
The city environment models a dense downtown area with procedurally generated buildings (15–25 structures), multi-lane roads with bidirectional traffic, sidewalk networks, intersections with traffic signal control, and a busy entity population of 58 initial agents (25 humans, 10 children, 8 robots, 15 vehicles). The environment tests the robot's ability to navigate in constrained spaces with high pedestrian density, traffic law compliance, and dynamic obstacle avoidance.
The park environment provides a contrasting natural setting with walking paths, perimeter fences with four gates, 38 trees (round and pine types), 80 flower patches, one elliptical pond, and a mixed population including children with erratic high-speed movement patterns. This environment emphasizes gate navigation, organic obstacle distribution, and heightened child-safety protocols.
| Feature | City | Park |
|---|---|---|
| World Size | 3,200 × 2,400 units | 2,800 × 2,000 units |
| Obstacle Types | Buildings, roads, traffic signals | Trees, fences, gates, pond |
| Entity Count (initial) | 58 | 17 |
| Vehicle Traffic | Yes (road lanes) | Yes (perimeter roads) |
| Traffic Signals | Yes (green/yellow/red) | No |
| Fence/Gate System | No | Yes (4 gates) |
| Child Safety Mode | Standard | Enhanced (extra caution) |
Optimus Perceptron deliberately trades physical simulation fidelity for accessibility and comprehensibility. Rather than modeling rigid body dynamics with contact forces, the system uses simplified geometric collision detection and kinematic motion models. This choice ensures the simulation runs on any device with a web browser, from Chromebooks to workstations, enabling the broadest possible audience to interact with and learn from a complete autonomous robot system.
The platform serves as an educational tool by making the internal decision-making process of an autonomous robot transparent. Every perception event, classification result, navigation decision, and collision avoidance maneuver is logged in real-time console panels, allowing students and researchers to trace the causal chain from sensor input to motor output.
Despite being implemented as a single file, the codebase is organized into clearly delineated sections (perception, navigation, energy, tasks, damage, padel) that can be independently modified or extended. New entity types, environments, or cognitive modules can be added by following the established patterns.
Optimus Perceptron demonstrates that a comprehensive humanoid robot simulation—encompassing perception, navigation, energy management, task planning, self-repair, and competitive athletics—can be implemented as a lightweight, zero-dependency browser application. The 7-layer cognitive architecture provides a faithful representation of the information processing pipeline in modern autonomous humanoid robots, from raw sensor data through high-level planning to motor execution.
The platform's six operational modules collectively exercise every layer of the cognitive stack under diverse conditions: dense urban traffic, natural park environments, energy-constrained operation, multi-day task scheduling, stochastic component degradation, and high-speed multi-agent competitive sports. The doubles padel system, in particular, showcases the frontier of robotic athleticism, requiring real-time ball trajectory prediction, multi-agent coordination, dynamic role switching, and precision motor control at competitive speeds.
By making this system freely accessible in a standard web browser, we aim to lower the barrier to entry for robotics education, enable rapid prototyping of cognitive architectures, and provide an interactive demonstration platform that communicates the complexity and elegance of autonomous humanoid robot systems to technical and non-technical audiences alike.
Dosovitskiy, A. et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR, 2021.
Carion, N. et al. "End-to-End Object Detection with Transformers (DETR)." ECCV, 2020.
Schulman, J. et al. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347, 2017.
Haarnoja, T. et al. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." ICML, 2018.
Hafner, D. et al. "Mastering Diverse Domains through World Models (Dreamer-v3)." arXiv preprint arXiv:2301.04104, 2023.
Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision (CLIP)." ICML, 2021.
Wang, C.-Y. et al. "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information." ECCV, 2024.
Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971, 2023.
Todorov, E. et al. "MuJoCo: A physics engine for model-based control." IROS, 2012.
Brooks, R. A. "A Robust Layered Control System for a Mobile Robot." IEEE Journal of Robotics and Automation, 1986.
Mnih, V. et al. "Human-level control through deep reinforcement learning." Nature 518, 529–533, 2015.
Lugaresi, C. et al. "MediaPipe: A Framework for Building Perception Pipelines." arXiv preprint arXiv:1906.08172, 2019.
Gerdzhev, M. et al. "Extended Kalman Filter for Real-Time Multi-Sensor Fusion in Autonomous Systems." IEEE Sensors Journal, 2022.
Tesla, Inc. "Optimus Gen-2 Humanoid Robot." Product documentation, 2024.
World Padel Tour. "Official Rules of Padel." International Padel Federation, 2023.