Deploying Reinforcement Learning Locomotion Policies on Real Actuator Hardware
The gap between simulated policy performance and real-world actuator behavior is the central engineering challenge in RL-based locomotion. I've watched teams demonstrate impressive MuJoCo gaits at 4x real-time, then spend three months trying to get a recognizable walk on hardware. The gap isn't inevitable — but closing it requires treating actuator fidelity as a first-class engineering problem, not something to paper over with enough domain randomization.
Where the Gap Comes From
Reinforcement learning locomotion policies are typically trained with PPO or SAC on simulation rollouts. The policy learns to exploit whatever dynamics the simulator exposes. When those dynamics differ from hardware, the policy attempts actions that produce different outcomes than it expects, and locomotion fails. The failure modes are consistent enough that we've categorized the main sources:
- Actuator bandwidth mismatch: Policies trained in sim with instantaneous torque response behave differently on hardware, where motor electrical dynamics limit the torque slew rate. A policy commanding a 50 Nm step at 500 Hz gets approximately 50 Nm immediately in sim but only about 20–35 Nm within the first millisecond on real hardware, depending on the motor's electrical time constant.
- Reflected inertia not modeled: The MuJoCo armature parameter captures rotor inertia reflected through the gear ratio. If it's set to zero (the default), the sim's dynamic response to control inputs is faster than hardware, and policies trained this way consistently over-command during rapid gait transitions.
- Observation latency: Real hardware has sensor pipeline latency: from physical sensor reading, to SPI transfer, to EtherCAT frame, to application layer. This adds 1–4 ms of observation delay. Policies trained without observation delay modeling will exhibit limit-cycle oscillations on hardware, where the control loop is effectively chasing a state it already responded to.
- Contact model divergence: Foot contact models in MuJoCo diverge from real ground contact stiffness and damping during the loading and unloading phases of a step. This is the hardest gap to close without extensive real-world data.
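To make the bandwidth mismatch concrete, here is a minimal sketch that treats the motor's torque response as a first-order lag with a single electrical time constant. That model is a simplifying assumption for illustration (real drives also have current-loop dynamics and voltage limits), and the function names are hypothetical.

```python
import math


def torque_after_step(command_nm: float, t_s: float, tau_s: float) -> float:
    """Torque reached t_s seconds after a step command, for a first-order lag."""
    if tau_s <= 0.0:
        # Ideal actuator: instant response, as in a sim with no actuator dynamics.
        return command_nm
    return command_nm * (1.0 - math.exp(-t_s / tau_s))


def tau_for_fraction(fraction: float, t_s: float) -> float:
    """Time constant at which the response reaches `fraction` of the command at t_s."""
    return -t_s / math.log(1.0 - fraction)


# 50 Nm step, evaluated 1 ms after the command, for a few assumed time constants.
for tau_ms in (0.0, 0.8, 1.0, 2.0):
    torque = torque_after_step(50.0, 1e-3, tau_ms * 1e-3)
    print(f"tau = {tau_ms:.1f} ms -> {torque:.1f} Nm after 1 ms")
```

Under this assumed model, reaching 20–35 Nm of a 50 Nm command within the first millisecond corresponds to time constants of roughly 0.8 to 2 ms, which `tau_for_fraction(0.7, 1e-3)` and `tau_for_fraction(0.4, 1e-3)` reproduce.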
Closing the Gap: Actuator Model First
Our standard workflow inverts the typical sequence. Most teams train first, then try to deploy. We identify the physical actuator model first, validate it against hardware, then train on the calibrated model. The calibration process — measuring motor electrical time constants, gearbox friction profiles, and reflected inertia from bench tests — takes 2–3 weeks per actuator size class. That upfront investment saves much more time than iterating blindly on domain randomization hyperparameters.
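As a small worked example of one calibration quantity, reflected inertia follows from rotor inertia and gear ratio as J_rotor times the gear ratio squared. The sketch below uses made-up bench values (not measurements from our hardware) and a single-joint rigid-body approximation to show how much leaving the armature at zero speeds up the simulated response.

```python
def reflected_inertia(rotor_inertia_kgm2: float, gear_ratio: float) -> float:
    """Rotor inertia as seen at the joint output: J_rotor * N^2."""
    return rotor_inertia_kgm2 * gear_ratio ** 2


def joint_acceleration(torque_nm: float, link_inertia_kgm2: float,
                       armature_kgm2: float) -> float:
    """Angular acceleration of a single isolated joint for a given output torque."""
    return torque_nm / (link_inertia_kgm2 + armature_kgm2)


# Hypothetical values, for illustration only.
rotor_inertia = 2.0e-5   # kg m^2, motor rotor
gear_ratio = 9.0         # 9:1 reduction
link_inertia = 5.0e-3    # kg m^2, link inertia about the joint axis

armature = reflected_inertia(rotor_inertia, gear_ratio)
accel_without = joint_acceleration(10.0, link_inertia, 0.0)
accel_with = joint_acceleration(10.0, link_inertia, armature)

print(f"armature value for the joint: {armature:.5f} kg m^2")
print(f"acceleration at 10 Nm, armature = 0: {accel_without:.0f} rad/s^2")
print(f"acceleration at 10 Nm, armature set: {accel_with:.0f} rad/s^2")
```

The computed value is what would go into the joint's armature setting in the MuJoCo model; the bench-measured figure should take precedence over a datasheet-derived one when the two disagree.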
Domain randomization is not a substitute for model accuracy. It's insurance against parameter uncertainty around a correct nominal model. In our experience, teams that skip the calibration step spend 3–6x longer on the sim-to-real gap than teams that invest in it upfront.
Structured Domain Randomization
Once the nominal actuator model is calibrated, domain randomization over the residual parameter uncertainty is genuinely effective. Our current ranges, based on unit-to-unit variation measured across our evaluation hardware:
- Joint damping: ±20% around nominal (temperature-dependent friction variation)
- Link mass: ±5% around CAD-derived values (assembly variation and cable mass estimates)
- Observation delay: 0–4 ms uniform (covers sensor pipeline variation across control computers)
- Ground friction coefficient: 0.4–0.9 uniform (covers smooth floor, textured rubber, tile)
- Push disturbances: 50–150 N horizontal, random direction, 0.1–0.5 s duration, at random phases in the gait cycle
We don't randomize gear ratios or motor inertia beyond ±3%. Those parameters are tightly manufactured and unit variation is small. Randomizing them heavily adds noise without improving real-world transfer, and can actually make policies more conservative than necessary.
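The ranges above can be collected into a per-episode sampler. This is a sketch rather than our training code: the dataclass and function names are hypothetical, uniform sampling is assumed wherever the text does not name a distribution, and a single mass scale is applied to all links for brevity.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class EpisodeParams:
    joint_damping_scale: float   # multiplier on nominal joint damping
    link_mass_scale: float       # multiplier on CAD-derived link mass
    obs_delay_ms: float          # observation delay in milliseconds
    ground_friction: float       # ground friction coefficient
    push_force_n: float          # horizontal push magnitude in newtons
    push_direction_rad: float    # push heading in the horizontal plane
    push_duration_s: float       # push duration in seconds
    push_gait_phase: float       # gait phase in [0, 1) at which the push starts
    gear_ratio_scale: float      # multiplier on nominal gear ratio
    motor_inertia_scale: float   # multiplier on nominal motor inertia


def sample_episode_params(rng: random.Random) -> EpisodeParams:
    """Draw one set of randomized parameters (uniform sampling is an assumption)."""
    return EpisodeParams(
        joint_damping_scale=rng.uniform(0.80, 1.20),
        link_mass_scale=rng.uniform(0.95, 1.05),
        obs_delay_ms=rng.uniform(0.0, 4.0),
        ground_friction=rng.uniform(0.4, 0.9),
        push_force_n=rng.uniform(50.0, 150.0),
        push_direction_rad=rng.uniform(0.0, 2.0 * math.pi),
        push_duration_s=rng.uniform(0.1, 0.5),
        push_gait_phase=rng.random(),
        # Tightly manufactured parameters stay within the +/-3% cap.
        gear_ratio_scale=rng.uniform(0.97, 1.03),
        motor_inertia_scale=rng.uniform(0.97, 1.03),
    )


params = sample_episode_params(random.Random(0))
print(params)
```

Passing an explicit `random.Random` instance keeps each training environment reproducible from its seed, which helps when a specific failing episode needs to be replayed.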
Policy Architecture Choices for Hardware Deployment
Several architectural decisions in the policy itself affect hardware deployment reliability significantly:
Whether the action space should be joint position targets or direct torque commands is a recurring debate. Position targets with a PD controller on hardware are more forgiving of model mismatch, because the PD controller absorbs some of the actuator dynamic error. Direct torque control gives the policy finer authority over contact forces, which matters for compliant interaction, but demands a much more accurate actuator model. For initial hardware deployment, position targets with well-tuned PD gains are lower risk.
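For reference, the position-target path amounts to a per-joint PD law that turns the policy's target into a torque command. The sketch below assumes a zero velocity target and a symmetric torque limit; the gains and limit are placeholder values, not tuned numbers for any particular actuator.

```python
def pd_torque(q_target: float, q: float, qd: float,
              kp: float, kd: float, torque_limit: float) -> float:
    """Joint torque from a position target, with a zero velocity target.

    q_target and q are in rad, qd in rad/s, kp in Nm/rad, kd in Nm s/rad.
    """
    torque = kp * (q_target - q) - kd * qd
    # Saturate at the actuator's torque limit.
    return max(-torque_limit, min(torque_limit, torque))


# Placeholder gains and limit, for illustration only.
KP, KD, LIMIT = 40.0, 1.0, 50.0

print(pd_torque(q_target=0.5, q=0.4, qd=0.0, kp=KP, kd=KD, torque_limit=LIMIT))
print(pd_torque(q_target=2.0, q=0.0, qd=0.0, kp=KP, kd=KD, torque_limit=LIMIT))
```

Because the torque depends on the measured position error rather than on the policy's internal expectation of the dynamics, a modest error in the actuator model shifts the tracking error instead of directly corrupting the commanded torque.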
Network size has a practical upper bound on hardware. A 3-layer MLP with 256 hidden units per layer is standard in sim. On an embedded ARM core running the locomotion controller alongside EtherCAT communication, inference latency becomes relevant at 1 kHz control rates, where each control cycle lasts only 1 ms. We've found 512 µs of inference time per step is a practical budget for policies running on current embedded hardware without needing a dedicated NPU.
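A rough way to size the arithmetic cost of such a network is to count multiply-accumulate operations per forward pass. The observation and action dimensions below (48 and 12) are hypothetical placeholders, and the count ignores biases and activation functions, so treat it as an order-of-magnitude estimate rather than a latency prediction.

```python
from typing import Sequence


def mlp_macs(layer_sizes: Sequence[int]) -> int:
    """Multiply-accumulates for one forward pass through a dense MLP.

    layer_sizes lists the width of every layer, input and output included.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical 48-dim observation, three hidden layers of 256, 12 joint actions.
sizes = [48, 256, 256, 256, 12]
macs_per_step = mlp_macs(sizes)
control_rate_hz = 1000

print(f"MACs per inference: {macs_per_step}")
print(f"MACs per second at {control_rate_hz} Hz: {macs_per_step * control_rate_hz}")
```

Measured latency on the target core still has to be profiled directly, since memory bandwidth and cache behavior often matter more than the raw operation count.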
History stacking — providing the policy with the last N observations rather than just the current state — helps with observations that arrive with variable latency and with gait phase estimation. 3–5 steps of history is generally sufficient and doesn't increase inference cost significantly.
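A minimal sketch of history stacking, assuming flat observation vectors and zero-padding before the first N observations arrive. The class name is hypothetical, and the padding choice is an assumption; repeating the first observation is an equally common alternative.

```python
from collections import deque
from typing import List, Sequence


class ObservationHistory:
    """Keeps the last `history_len` observations and returns them concatenated."""

    def __init__(self, obs_dim: int, history_len: int) -> None:
        self.obs_dim = obs_dim
        # Start with zero frames so the stacked vector always has a fixed size.
        self.frames = deque(
            ([0.0] * obs_dim for _ in range(history_len)), maxlen=history_len
        )

    def push(self, obs: Sequence[float]) -> List[float]:
        """Add the newest observation and return the stack, oldest frame first."""
        if len(obs) != self.obs_dim:
            raise ValueError("observation has the wrong dimension")
        self.frames.append(list(obs))  # the oldest frame is dropped automatically
        return [value for frame in self.frames for value in frame]


history = ObservationHistory(obs_dim=2, history_len=3)
for step_obs in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]):
    stacked = history.push(step_obs)
print(stacked)
```

The policy input grows to `obs_dim * history_len`, which only affects the first layer of the network, so the added inference cost stays small for short histories.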
First Hardware Deployment: What to Expect and Measure
The first deployment of a sim-trained policy on hardware almost never walks perfectly. What matters is systematic diagnosis, not frustration. The measurements to collect immediately:
- Log commanded vs. actual joint torques at 1 kHz. If actual torques consistently undershoot commands during fast transitions, you have an actuator bandwidth issue in your model.
- Compare joint trajectories in sim and hardware for the same policy rollout. Systematic divergence in a specific joint usually points to a localized model error — wrong damping, wrong mass, wrong reflected inertia.
- Check for oscillations at a specific frequency. A consistent limit-cycle frequency in the 5–15 Hz range on hardware (but not in sim) is almost always observation latency. Add delay randomization to training, or widen the delay range if it is already in place.
- Measure foot contact timing. If stance-phase duration on hardware is consistently shorter or longer than in sim, your contact model needs adjustment.
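The oscillation check can be run offline on a logged joint signal. The sketch below scans a small set of candidate frequencies with a direct Fourier sum and reports the strongest one; the function names, scan range, and synthetic test signal are assumptions for illustration, and a proper FFT would be the usual choice for long logs.

```python
import math
from typing import Sequence


def dominant_frequency_hz(samples: Sequence[float], sample_rate_hz: float,
                          f_min_hz: float = 1.0, f_max_hz: float = 50.0,
                          f_step_hz: float = 0.5) -> float:
    """Return the candidate frequency with the most spectral power."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]  # remove the DC offset
    best_f, best_power = f_min_hz, -1.0
    n_candidates = int(round((f_max_hz - f_min_hz) / f_step_hz)) + 1
    for k in range(n_candidates):
        f = f_min_hz + k * f_step_hz
        w = 2.0 * math.pi * f / sample_rate_hz
        re = sum(x * math.cos(w * i) for i, x in enumerate(centered))
        im = sum(x * math.sin(w * i) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f


def in_latency_band(freq_hz: float) -> bool:
    """True if the frequency falls in the 5-15 Hz range discussed above."""
    return 5.0 <= freq_hz <= 15.0


# Synthetic 2 s log at 1 kHz: an 8 Hz oscillation on top of a slow 1 Hz motion.
log = [0.05 * math.sin(2 * math.pi * 8 * i / 1000)
       + 0.01 * math.sin(2 * math.pi * 1 * i / 1000) for i in range(2000)]
peak = dominant_frequency_hz(log, 1000.0)
print(f"dominant frequency: {peak:.1f} Hz, latency suspect: {in_latency_band(peak)}")
```

Running the same analysis on the sim rollout of the same policy gives the comparison that matters: a peak that appears only in the hardware log is the signature to look for.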
The Policy Deployment API
The Tendonkindle motion SDK includes a policy deployment interface that accepts PyTorch or JAX policy checkpoints and handles the hardware-side observation assembly, delay buffering, and action command dispatch. The interface matches the observation and action space conventions used in our MuJoCo export, so a policy trained against our simulator export can be loaded directly without manual array reshaping or observation scaling changes. This isn't a technical breakthrough — it's just the consequence of keeping the simulation model and hardware interface synchronized from the start, which is the entire point of co-designing hardware and software together.
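To illustrate what delay buffering means in general terms, here is a generic fixed-delay buffer. It is not the SDK's interface or implementation; the class and helper names are hypothetical, and the delay is assumed to be a whole number of control ticks.

```python
from collections import deque
from typing import List, Sequence


def ms_to_steps(delay_ms: float, control_rate_hz: float) -> int:
    """Convert a delay in milliseconds to the nearest whole number of control ticks."""
    return int(round(delay_ms * control_rate_hz / 1000.0))


class DelayBuffer:
    """Returns the observation that was pushed `delay_steps` ticks ago."""

    def __init__(self, delay_steps: int, initial_obs: Sequence[float]) -> None:
        # Pre-fill with the initial observation so early reads are well defined.
        self.frames = deque(
            (list(initial_obs) for _ in range(delay_steps + 1)),
            maxlen=delay_steps + 1,
        )

    def step(self, obs: Sequence[float]) -> List[float]:
        """Push the newest observation and return the delayed one."""
        self.frames.append(list(obs))
        return self.frames[0]


# A 2 ms delay at a 1 kHz control rate is two ticks.
buffer = DelayBuffer(ms_to_steps(2.0, 1000.0), initial_obs=[0.0])
for tick in (1.0, 2.0, 3.0, 4.0):
    print(f"pushed {tick}, policy sees {buffer.step([tick])}")
```

The same structure works in simulation for training with observation delay, with `delay_steps` resampled at each episode reset.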
If you're at the stage of preparing your first policy for hardware deployment and want to compare your sim-to-real gap diagnostics against what we've seen, reach out to our engineering team. The patterns are consistent enough across platforms that comparative diagnosis is usually informative.