Human-Robot Collaboration (HRC) allows human operators to work side by side with robots in close proximity, where the complementary skills of robots and operators are combined to achieve higher overall productivity and better product quality. Under the human-centricity concept of Industry 5.0, robots should possess comprehensive perception capabilities so that mutual empathy can be realized in HRC assembly. KTH has therefore been working on the real-time perception of human operators in terms of their skeleton joints and ongoing motions.
For human skeleton detection, a Kinect V2 sensor is used, which consists of an RGB camera, a 3D depth sensor, and a multi-array microphone. A total of 25 body joints are detected from the combined RGB and depth data, and multiple persons in the field of view can be analysed simultaneously. For each detected body joint, the result includes the 3D Cartesian position and 3D orientation with respect to the Kinect V2 coordinate system, as well as the tracking state and restriction state. The Kinect V2 sensor is first calibrated in the HRC system and the obtained positions are normalized. To further enhance the smoothness and consistency of joint positions across successive snapshots, an Unscented Kalman Filter (UKF) is applied to the raw joint trajectories.
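The UKF smoothing step above can be sketched as follows. This is a minimal numpy illustration, not the KTH implementation: the constant-velocity motion model, the frame rate `dt`, and the noise levels `q` and `r` are all assumptions chosen for the example.

```python
import numpy as np

def sigma_points(x, P, alpha=1.0, beta=2.0):
    """Sigma points with the common heuristic kappa = 3 - n."""
    n = x.size
    lam = alpha**2 * 3.0 - n                      # alpha^2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)         # columns give the spread
    pts = np.vstack([x, x + S.T, x - S.T])
    Wm = np.full(2 * n + 1, 0.5 / (n + lam))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = Wm[0] + (1.0 - alpha**2 + beta)
    return pts, Wm, Wc

class JointUKF:
    """Smooths one joint's 3D position with a constant-velocity UKF.
    State: [x, y, z, vx, vy, vz]; dt, q, r are assumed values."""
    def __init__(self, dt=1.0 / 30.0, q=1e-2, r=1e-3):
        self.dt = dt
        self.x = np.zeros(6)
        self.P = np.eye(6)
        self.Q = q * np.eye(6)                    # process noise (assumed)
        self.R = r * np.eye(3)                    # measurement noise (assumed)

    def fx(self, s):                              # constant-velocity motion model
        s = s.copy()
        s[:3] += self.dt * s[3:]
        return s

    def step(self, z):
        # predict: propagate sigma points through the motion model
        pts, Wm, Wc = sigma_points(self.x, self.P)
        F = np.array([self.fx(p) for p in pts])
        x_pred = Wm @ F
        P_pred = self.Q + sum(w * np.outer(f - x_pred, f - x_pred)
                              for w, f in zip(Wc, F))
        # update: project predicted sigma points into measurement space
        pts, Wm, Wc = sigma_points(x_pred, P_pred)
        H = pts[:, :3]                            # measurement model: position only
        z_pred = Wm @ H
        S = self.R + sum(w * np.outer(h - z_pred, h - z_pred)
                         for w, h in zip(Wc, H))
        C = sum(w * np.outer(p - x_pred, h - z_pred)
                for w, p, h in zip(Wc, pts, H))
        K = C @ np.linalg.inv(S)
        self.x = x_pred + K @ (np.asarray(z) - z_pred)
        P_new = P_pred - K @ S @ K.T
        self.P = 0.5 * (P_new + P_new.T)          # keep covariance symmetric
        return self.x[:3]                         # smoothed 3D position
```

In practice one such filter runs per tracked joint; feeding it the noisy per-frame positions yields trajectories that stay consistent across successive snapshots.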
The human motion recognition module, in turn, combines multiple data modalities (human body skeletons, RGB video frames, and depth maps) with a deep learning approach. The skeletons directly reflect the topological movements of the human body and thereby the operator's intentions. Meanwhile, RGB video frames capture context information that is not encapsulated in the skeletons, such as the interacting objects, the tools in use, and the workbench. The depth maps further represent distances in space, avoiding the confusion of similar motions caused by the camera perspective. Spatial-Temporal Graph Convolutional Networks (ST-GCN) and the Inflated 3D ConvNet (I3D) are employed respectively, the former learning spatial-temporal patterns from the skeletons and the latter from the RGB and depth streams.
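The core operation of ST-GCN, a graph convolution over the skeleton joints followed by a convolution over time, can be illustrated with a single layer in numpy. This is a simplified sketch of the general technique, not the KTH network; the helper names and all shapes are chosen for the example, and a real ST-GCN additionally uses adjacency partitioning, residual connections, and batch normalization.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as used in graph convolutions."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A @ d_inv_sqrt

def st_gcn_layer(x, A_hat, Ws, Wt):
    """One spatial-temporal graph convolution over a skeleton sequence.
    x:     (T, V, C_in)      joint features per frame (e.g. 3D coordinates)
    A_hat: (V, V)            normalized skeleton adjacency
    Ws:    (C_in, C_out)     spatial (graph) weights
    Wt:    (k, C_out, C_out) temporal kernel over k consecutive frames
    """
    # spatial step: each joint aggregates its graph neighbours, then mixes channels
    y = np.einsum('uv,tvc,cd->tud', A_hat, x, Ws)
    y = np.maximum(y, 0.0)                        # ReLU
    # temporal step: convolve each joint's features over k consecutive frames
    k = Wt.shape[0]
    T_out = y.shape[0] - k + 1                    # valid (no padding) convolution
    out = np.stack([sum(y[t + i] @ Wt[i] for i in range(k))
                    for t in range(T_out)])
    return np.maximum(out, 0.0)
```

Stacking several such layers and pooling over time and joints produces the sequence-level features from which the ongoing motion class is predicted.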