Humanoid robots are getting better at handling real-world messiness by using touch, force, and predictive contact awareness instead of relying mainly on vision. A new AI system called Humanoid Transformer with Touch Dreaming (HTD) raised average success rates across five demanding tasks, a 90.9% relative improvement over a strong baseline, by teaching robots to anticipate how contact will change before mistakes happen.
Humanoid robots can already do impressive things in controlled demos—pick up objects, walk forward, follow basic instructions. But in everyday environments, manipulation gets complicated fast. Objects slip. Fabric folds unpredictably. Containers tilt. Small contact changes turn into full task failures.
That’s where a new AI-driven robotics platform from Carnegie Mellon University and the Bosch Center for AI aims to close the gap. Their system focuses on something humans rely on constantly but robots often lack: continuous awareness of touch and force while moving their entire body.
Why humanoid robots still struggle with real-world manipulation
Many modern humanoid robots perform well on basic manual tasks, but more advanced actions—especially those involving complex contact—remain difficult. The challenge isn’t just about having better hands or sharper cameras. It’s about coordinating the entire body while reacting to subtle physical interactions.
According to Yaru Niu, first author of the work and a Ph.D. candidate at CMU’s Safe AI Lab, humans succeed at manipulation because they blend whole-body coordination, dexterous hands, and intuitive predictions about how contact will evolve. Folding cloth, inserting objects, scooping, or carrying fragile items all depend on more than sight.
The researchers emphasized that robots trained mostly on vision and proprioception often fail in contact-rich tasks because contact can change too quickly and is only partially observed.
The system behind HTD: making robots more “contact aware”
To address this, the team built a robotics platform designed for whole-body loco-manipulation, combining multiple components into one working pipeline.
At the center is their AI model, called Humanoid Transformer with Touch Dreaming (HTD), introduced in a paper posted on arXiv. The model was trained using imitation learning, but with a key difference: it doesn’t just learn which actions to take—it also learns to predict future touch-related outcomes.
That approach is what the team calls “touch dreaming.”
Instead of treating touch as passive input, HTD tries to anticipate how tactile feedback and force will evolve as the robot continues manipulating an object. This helps the system respond to physical interaction before things go wrong.
How “touch dreaming” works inside the model
HTD predicts future actions in chunks and, alongside them, future hand-joint forces and tactile representations. This design encourages the policy to become sensitive to physical interaction patterns, not just visual appearance or joint angles.
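A rough way to picture this joint objective is a loss with three terms, one per prediction target. The sketch below is illustrative only: the class and function names, the use of mean squared error, and the two weights are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class PolicyOutput:
    """One training sample's predictions or targets (hypothetical layout)."""
    action_chunk: Vector            # flattened sequence of future actions
    future_forces: Vector           # future hand-joint forces
    future_tactile_latents: Vector  # future tactile latent vectors, flattened


def touch_dreaming_loss(pred: PolicyOutput, target: PolicyOutput,
                        force_weight: float = 0.5,
                        latent_weight: float = 0.5) -> float:
    """Imitation loss plus two auxiliary 'dreaming' terms.

    The weights are placeholder values chosen for illustration.
    """
    action_term = mse(pred.action_chunk, target.action_chunk)
    force_term = mse(pred.future_forces, target.future_forces)
    latent_term = mse(pred.future_tactile_latents, target.future_tactile_latents)
    return action_term + force_weight * force_term + latent_weight * latent_term


if __name__ == "__main__":
    target = PolicyOutput([0.0, 0.0], [1.0, 1.0], [0.0])
    pred = PolicyOutput([1.0, 0.0], [1.0, 1.0], [2.0])
    print(touch_dreaming_loss(pred, target))  # 0.5 + 0.5*0.0 + 0.5*4.0 = 2.5
```

The point of the extra terms is that the policy cannot minimize the loss by copying actions alone; it is also penalized for misjudging how force and touch will evolve.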
Importantly, HTD does not attempt to reconstruct raw tactile sensor signals directly. Instead, it predicts compact tactile latent representations generated by a slowly updated target network.
This choice is meant to keep the system focused on meaningful contact patterns while avoiding noisy sensor fluctuations. It also allows touch-aware learning to remain integrated within a single imitation-learning framework, rather than requiring separate tactile modeling pipelines.
The researchers found that this latent approach was significantly more effective than predicting raw tactile data.
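A "slowly updated target network" is commonly implemented as an exponential moving average (EMA) of the encoder being trained. The sketch below assumes that scheme; the article does not specify the update rule. The linear encoder, the decay value, and all function names are stand-ins chosen for this example.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def encode(weights: Matrix, tactile: Vector) -> Vector:
    """Toy linear encoder: maps a raw tactile reading to a compact latent."""
    return [sum(w * x for w, x in zip(row, tactile)) for row in weights]


def ema_update(target: Matrix, online: Matrix, decay: float = 0.99) -> Matrix:
    """Move the target weights a small step toward the online weights."""
    return [
        [decay * t + (1.0 - decay) * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target, online)
    ]


def latent_dream_loss(predicted_latent: Vector, target_latent: Vector) -> float:
    """Squared error between the predicted and the target tactile latent."""
    return sum((p - t) ** 2 for p, t in zip(predicted_latent, target_latent)) / len(target_latent)


if __name__ == "__main__":
    online = [[1.0, 0.0], [0.0, 1.0]]   # encoder being trained
    target = [[0.0, 0.0], [0.0, 0.0]]   # slowly updated copy
    target = ema_update(target, online, decay=0.9)
    future_touch = [2.0, 4.0]           # made-up future tactile reading
    target_latent = encode(target, future_touch)
    print(target_latent)                # roughly [0.2, 0.4]
    print(latent_dream_loss([0.0, 0.0], target_latent))
```

Because the target weights lag behind the online weights, the regression target drifts slowly instead of shifting with every training step, which is one plausible reason a latent target could be steadier than raw, noisy sensor values.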
A controller design that separates balance from manipulation
A major feature of the platform is its whole-body controller, trained with reinforcement learning (RL) and designed to keep the humanoid stable while its arms and hands perform dexterous work.
Rather than forcing one controller to handle every part of the robot’s motion, the system decouples responsibilities.
The lower-body controller handles balance-critical behaviors like tracking base velocity, torso orientation, and body height, while staying stable under disturbances caused by arm movement. Meanwhile, the upper body is guided through inverse kinematics, and dexterous hand motion is achieved through hand retargeting.
This structure reduces interference between balancing and manipulation—one of the key issues that can cause humanoid robots to stumble or fail tasks mid-action.
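The split can be pictured as two separate interfaces: a small command that the lower-body controller tracks, and an inverse-kinematics step that turns a desired hand position into arm joint angles. The sketch below is a toy under stated assumptions: the field names are hypothetical, and the planar two-link arm with made-up link lengths stands in for the robot's real arm and its IK solver, which the article does not describe.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LowerBodyCommand:
    """Targets tracked by the balance controller (field names are hypothetical)."""
    base_velocity: Tuple[float, float, float]      # forward, lateral, yaw rate
    torso_orientation: Tuple[float, float, float]  # roll, pitch, yaw
    body_height: float


def two_link_ik(x: float, y: float, l1: float = 0.3, l2: float = 0.25) -> Tuple[float, float]:
    """Closed-form IK for a planar two-link arm; returns (shoulder, elbow) in radians."""
    cos_elbow = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow), l1 + l2 * math.cos(elbow))
    return shoulder, elbow


def two_link_fk(shoulder: float, elbow: float, l1: float = 0.3, l2: float = 0.25) -> Tuple[float, float]:
    """Forward kinematics, used here to check the IK solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


if __name__ == "__main__":
    # The legs receive only balance-related targets...
    legs = LowerBodyCommand((0.2, 0.0, 0.0), (0.0, 0.0, 0.0), 0.75)
    # ...while the arm target is resolved separately through IK.
    shoulder, elbow = two_link_ik(0.4, 0.1)
    print(legs)
    print(two_link_fk(shoulder, elbow))  # approximately (0.4, 0.1)
```

Keeping the two interfaces separate means a change in the arm target never alters the command sent to the legs; the lower-body controller only experiences arm motion as a disturbance it must stay stable under.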
Training the lower-body controller in simulation
The lower-body controller was trained in simulation using a teacher-student approach. During training, the teacher has access to privileged information, while the student must choose its actions from observations that are available on the real robot, such as base angular velocity, gravity measurements, lower-body joint positions, and joint velocities.
To make the simulation more realistic, the team replayed retargeted arm motions from the AMASS dataset, helping the controller learn how to remain stable while upper-body movements introduce realistic disturbances.
After training, the student controller is the one deployed on the real robot.
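In a teacher-student setup of this kind, the student is typically trained to reproduce the teacher's actions from its more limited observations. The sketch below shows one such distillation step with a toy linear student. The observation layout, the linear policy, the learning rate, and all names are assumptions made for illustration; the article does not give the actual network or loss.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def build_student_observation(base_ang_vel: Vector, gravity: Vector,
                              joint_pos: Vector, joint_vel: Vector) -> Vector:
    """Concatenate the signals the article lists as available to the student."""
    return [*base_ang_vel, *gravity, *joint_pos, *joint_vel]


def linear_policy(weights: Matrix, obs: Vector) -> Vector:
    """Toy student policy: one linear layer, no bias."""
    return [sum(w * o for w, o in zip(row, obs)) for row in weights]


def distill_step(weights: Matrix, obs: Vector, teacher_action: Vector,
                 lr: float = 0.1) -> Tuple[Matrix, float]:
    """One gradient step on 0.5 * ||student(obs) - teacher_action||^2."""
    student_action = linear_policy(weights, obs)
    errors = [s - t for s, t in zip(student_action, teacher_action)]
    new_weights = [
        [w - lr * e * o for w, o in zip(row, obs)]
        for row, e in zip(weights, errors)
    ]
    loss = 0.5 * sum(e * e for e in errors)
    return new_weights, loss


if __name__ == "__main__":
    obs = build_student_observation(
        base_ang_vel=[0.0, 0.0, 0.1],
        gravity=[0.0, 0.0, -1.0],
        joint_pos=[0.2, -0.1],   # two joints only, to keep the toy small
        joint_vel=[0.0, 0.0],
    )
    teacher_action = [0.3, -0.2]  # made-up action from the privileged teacher
    weights = [[0.0] * len(obs) for _ in teacher_action]
    for step in range(3):
        weights, loss = distill_step(weights, obs, teacher_action)
        print(step, loss)         # the loss shrinks on each step
```

Only the student's inputs need to exist on hardware, which is why the student, not the teacher, is the controller that gets deployed.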
Real-world task results: folding towels, serving tea, and more
In experiments, the researchers tested their system across five real-world tasks: insert-T, book organization, towel folding, cat litter scooping, and tea serving.
Across these tasks, HTD delivered a 90.9% relative improvement in average success rate over the stronger ACT baseline, according to Ding Zhao, senior author and director of CMU's Safe AI Lab.
The researchers also ran ablation tests to isolate what mattered most. One key finding was that simply adding touch as an extra sensory input did not solve the problem.
Instead, predicting tactile signals in latent space made a major difference. The team reported a 30% relative gain in success rate compared to predicting raw tactile signals directly.
This suggests that the system’s strength comes not just from having tactile sensors, but from learning a predictive internal model of contact that remains stable and meaningful.
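Both headline figures are relative gains, which are easy to misread as percentage-point gains. The short sketch below shows how a relative improvement is computed. The success rates in it are invented purely to illustrate the arithmetic and are not the per-task or average numbers from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical success rates, NOT taken from the paper.
    baseline_rate = 0.44
    new_rate = 0.84
    gain = relative_improvement(baseline_rate, new_rate)
    print(f"{gain:.1%}")  # 90.9%
    # The same change expressed in percentage points is much smaller:
    print(f"{(new_rate - baseline_rate) * 100:.0f} points")  # 40 points
```

In other words, a 90.9% relative improvement means the success rate nearly doubled compared with the baseline, not that it rose by 90.9 percentage points.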
Why this platform stands out from earlier humanoid systems
The researchers argue that previous systems have often offered pieces of the solution—vision-based imitation learning, whole-body motion control, tactile sensing, or teleoperation—but rarely combined them into a single practical platform.
Their system integrates an RL-based whole-body controller, upper-body inverse kinematics, dexterous hand retargeting, VR teleoperation, and distributed tactile sensing.
This combination also makes it easier to collect high-quality demonstrations for difficult contact-rich tasks, which are often expensive and time-consuming to gather.
Unlike many earlier tactile-learning approaches, HTD does not require separate tactile pre-training or an extra tactile world model during inference. Touch-aware learning is built directly into the imitation-learning policy.
What the researchers want to improve next
While HTD performed strongly, the team says major open questions remain—especially around how to make tactile latent representations more transferable and physically interpretable.
Zhao noted that the tactile latent space works well as a compact training target, but the researchers want to understand what structure makes it most useful for downstream behavior. Future work will explore how to shape the latent space so it captures contact dynamics more explicitly, generalizes better across tasks, and potentially supports stronger reasoning beyond action imitation.
The researchers also plan to scale the framework further and test it in human-robot collaboration experiments. They aim to incorporate more visual data and more human demonstrations, which they believe could provide coordination patterns difficult to replicate with robot data alone.
Another goal is to make the pipeline generalize across different robot embodiments, including robots with different hand designs and tactile sensor layouts.
Some of the system’s code has been released as open source on GitHub, allowing other researchers to build on the approach.
Why this matters
Getting robots to walk and grasp is no longer the main challenge; getting them to reliably handle physical contact in real environments is. Household chores, hospital support tasks, store assistance, and industrial work all demand more than visual recognition. They require stable balance, coordinated movement, and constant prediction of how objects will respond to touch.
By training humanoid robots to anticipate contact changes through touch dreaming, HTD moves robotic manipulation closer to the way humans naturally operate: adjusting in real time, using force feedback, and predicting what will happen next. If these systems continue to scale and generalize, humanoid robots could become significantly more capable in the unpredictable, contact-heavy settings where they’re expected to eventually work.
Study details
Yaru Niu et al, Learning Versatile Humanoid Manipulation with Touch Dreaming, arXiv (2026). DOI: 10.48550/arxiv.2604.13015






