Real-Time Robotic Emotional Expression from Mixed-Reality Demonstrations via Flow Matching

Chao Wang, Michael Gienger, Fan Zhang
Mixed-reality pipeline for learning robotic emotions.
Left: An expert wearing an HMD teleoperates a virtual robot; the system records facial blend-shapes together with head and hand/controller poses, forming an affect-rich demonstration dataset. Centre: These demonstrations train a flow-matching generative model that maps a desired mood label plus live perceptual cues (e.g. the pose of the mouse) to continuous joint targets. Right: At inference the trained model runs at 120 Hz, taking the operator’s high-level mood command and the pose of salient objects to drive the robot’s eyes, ears, neck and arms with recognisable emotions inside the MR scene. Red (pink) arrows denote signals used only during data collection/training; blue arrows denote signals present at runtime.

Abstract

Expressive behaviors in robots are critical for effectively conveying their emotional states during interactions with humans. In this work, we present a framework that autonomously generates realistic and diverse robotic emotional expressions based on expert human demonstrations captured in Mixed Reality (MR). Our system enables experts to teleoperate a virtual robot from a first-person perspective, capturing their facial expressions, head movements, and upper-body gestures, and mapping these behaviors onto corresponding robotic components including eyes, ears, neck, and arms. Leveraging a flow-matching-based generative process, our model learns to produce coherent and varied behaviors in real time in response to moving objects, conditioned explicitly on given emotional states.
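To make the runtime interface concrete, the minimal Python sketch below shows what a 120 Hz control loop of this kind could look like: at each tick the policy is queried with a short observation history, the commanded mood, and the pose of a salient object, and the first step of the predicted action sequence is sent to the robot. All interface names (model.predict, robot.set_joint_targets, perception.object_pose) are hypothetical placeholders, not the project’s actual API.

```python
# Minimal sketch of a 120 Hz runtime loop as described above.
# All object interfaces here are hypothetical placeholders.
import time
import numpy as np

CONTROL_HZ = 120          # inference rate quoted in the figure caption
HISTORY_LEN = 4           # example history-window length

def control_loop(model, robot, perception, mood="happy"):
    """Stream model-generated joint targets to the robot at a fixed rate."""
    period = 1.0 / CONTROL_HZ
    history = []
    while True:
        t0 = time.perf_counter()
        # Observation: current robot pose plus the pose of the salient object
        obs = np.concatenate([robot.joint_positions(), perception.object_pose()])
        history = (history + [obs])[-HISTORY_LEN:]
        # The model returns a short action sequence; execute only its first step
        action_seq = model.predict(np.stack(history), mood)
        robot.set_joint_targets(action_seq[0])
        # Sleep away the remainder of the control period
        time.sleep(max(0.0, period - (time.perf_counter() - t0)))
```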


Using XR glasses for training and for observing the results of real-time inference

The XR platform

  1. Seven facial-expression values detected by the XR headset drive the angle and movement of the robot’s ears and the shape of its eyes, while the operator’s gaze direction sets the position of the eyes on the plane of the robot’s face screen.
  2. The operator’s hand/controller position and orientation are mapped to the robot’s end effector, expressed relative to the operator’s head pose as the origin. The positional component is scaled by 1.5 to extend the operator’s reach.
  3. A virtual screen floats in front of the operator, allowing the environment to be observed from a first-person perspective.
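The sketch below illustrates these retargeting rules under stated assumptions: which of the seven blend-shape values drive the ears versus the eyes, and the exact pose representation, are illustrative choices rather than the project’s configuration; only the 1.5 positional scaling and the head-relative mapping come from the description above.

```python
# Rough sketch of the retargeting rules above (blend-shape indices and
# output names are illustrative, not the project's actual configuration).
import numpy as np

HAND_SCALE = 1.5  # positional scaling for reachability, as stated above

def retarget(blendshapes, gaze_dir, head_pos, hand_pos, hand_rot):
    """Map HMD signals to robot face/arm targets.

    blendshapes: (7,) facial-expression values in [0, 1]
    gaze_dir:    (2,) gaze direction projected onto the face-screen plane
    head_pos:    (3,) operator head position (origin of the arm mapping)
    hand_pos:    (3,) operator hand/controller position
    hand_rot:    (4,) operator hand/controller orientation (quaternion)
    """
    ear_angle = float(blendshapes[:2].mean())        # illustrative: two values drive the ears
    eye_shape = blendshapes[2:]                      # remaining values shape the eyes
    eye_position = np.clip(gaze_dir, -1.0, 1.0)      # gaze places the pupils on the screen

    # End-effector target: hand pose relative to the head, position scaled by 1.5
    ee_position = (hand_pos - head_pos) * HAND_SCALE
    return {
        "ear_angle": ear_angle,
        "eye_shape": eye_shape,
        "eye_position": eye_position,
        "ee_position": ee_position,
        "ee_orientation": hand_rot,                  # orientation passed through unscaled
    }
```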

Overview of flow matching for expression generation

A history window of robot and target poses plus an emotion label (pink) is fed through a FiLM-conditioned U-Net to predict the action sequence (blue) executed on the robot.
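For readers unfamiliar with the objective, the condensed PyTorch sketch below shows a standard conditional flow-matching loss, with a small FiLM-conditioned network standing in for the paper’s U-Net; the architecture and sizes are illustrative, while the loss follows the usual linear-interpolation recipe (noise sample, random flow time, straight-line velocity target).

```python
# Condensed sketch of conditional flow matching with FiLM conditioning.
# The network is a stand-in for the FiLM-conditioned U-Net described above.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale/shift features from a conditioning vector."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, x, cond):                      # x: (B, H, F), cond: (B, C)
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class VelocityNet(nn.Module):
    """Tiny stand-in for the U-Net: predicts the flow velocity for an action chunk."""
    def __init__(self, act_dim, cond_dim, hidden=256):
        super().__init__()
        self.inp = nn.Linear(act_dim, hidden)
        self.film = FiLM(cond_dim + 1, hidden)       # +1 for the flow time t
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(hidden, act_dim))

    def forward(self, x_t, t, cond):                 # x_t: (B, H, act_dim), t: (B,)
        c = torch.cat([cond, t[:, None]], dim=-1)
        return self.out(self.film(self.inp(x_t), c))

def flow_matching_loss(net, actions, cond):
    """actions: (B, H, act_dim) demonstrated chunk; cond: (B, C) history + emotion features."""
    noise = torch.randn_like(actions)                        # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], device=actions.device)  # flow time ~ U(0, 1)
    x_t = (1 - t[:, None, None]) * noise + t[:, None, None] * actions
    target_v = actions - noise                               # straight-line velocity target
    return ((net(x_t, t, cond) - target_v) ** 2).mean()
```

At inference, an action chunk would be generated by integrating the learned velocity field from Gaussian noise to data, e.g. with a few Euler steps.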

Preliminary Results

Training protocol. The flow model was trained for 3000 epochs with a batch size of 256 and a learning rate of 1 × 10⁻⁴. We explored four history-window lengths (1, 2, 4, and 16 frames) in combination with two prediction horizons (16 and 32 frames).
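A sketch of how such a sweep could be scripted is given below. The hyperparameters (3000 epochs, batch size 256, learning rate 1 × 10⁻⁴, the listed history lengths and horizons) come from the protocol above; the optimizer choice, data loading, the make_model/make_loader helpers, and iterating over the full history × horizon product are assumptions for illustration.

```python
# Sketch of the hyperparameter sweep implied by the protocol above.
# make_model, make_loader and loss_fn are hypothetical hooks supplied by the caller.
import itertools
import torch

EPOCHS, BATCH_SIZE, LR = 3000, 256, 1e-4
HISTORY_LENGTHS = [1, 2, 4, 16]
PRED_HORIZONS = [16, 32]

def run_sweep(make_model, make_loader, loss_fn):
    """Train one model per (history length, prediction horizon) combination."""
    for history, horizon in itertools.product(HISTORY_LENGTHS, PRED_HORIZONS):
        model = make_model(history=history, horizon=horizon)
        opt = torch.optim.Adam(model.parameters(), lr=LR)
        loader = make_loader(history=history, horizon=horizon,
                             batch_size=BATCH_SIZE)
        for _ in range(EPOCHS):
            for actions, cond in loader:
                loss = loss_fn(model, actions, cond)  # e.g. the flow-matching loss above
                opt.zero_grad()
                loss.backward()
                opt.step()
```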
Expert appraisal. A panel of HRI researchers informally inspected roll-outs and provided qualitative feedback:

  1. Temporal context. A 16-frame history performed noticeably worse than 2–4 frames, suggesting that our FiLM-conditioned U-Net does not fully exploit long temporal correlations. Replacing FiLM with a transformer-based temporal encoder may improve sequence understanding at the cost of heavier training.
  2. Prediction horizon. Longer horizons (32 frames) produced more complete, fluid gestures, whereas short horizons introduced occasional “jumps” when the policy re-planned. This points to a weak internal notion of phase; additional data or an explicit timing signal could reduce discontinuities.
  3. Emotion coverage. Six of the seven emotions transferred convincingly; the “curious” behaviour lacked the distinctive “poke” motion present in the demonstrations. We attribute this to data sparsity and will extend the dataset with targeted examples.