Framework: We propose a skeleton-to-point network (Skeleton2Point) that consists of two trunk branches. In the first branch, referred to as the human skeleton branch, skeleton data is encoded with the given space-time information and then fed into a graph transformer neural network to obtain predictions. In the second branch, regarded as the point cloud branch, skeleton data is transformed into point cloud form with an information transform module. Next, FPS and kNN are used to sample the points and model the positional relations between them, and a point cloud information extractor is then leveraged to extract latent features. We also propose a Cluster-Dispatch-based interaction module to enhance the discrimination of local-global information.
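The sampling and grouping stage can be illustrated with a short, self-contained sketch. The functions farthest_point_sample and knn_group below are illustrative re-implementations of standard FPS and kNN grouping over a (batch, N, 3) point tensor, not the authors' released code.

```python
# Minimal sketch of FPS sampling and kNN grouping over skeleton joints viewed as points.
import torch


def farthest_point_sample(xyz, n_sample):
    """Pick n_sample indices that are maximally spread out (FPS)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_sample, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_sample):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))  # keep min distance to any centre
        farthest = dist.argmax(-1)                                   # next farthest point
    return idx                                                        # (B, n_sample)


def knn_group(xyz, center_idx, k):
    """Gather the k nearest neighbours of every sampled centre."""
    batch = torch.arange(xyz.size(0), device=xyz.device).unsqueeze(1)
    centers = xyz[batch, center_idx]                                  # (B, n_sample, 3)
    dists = torch.cdist(centers, xyz)                                 # (B, n_sample, N)
    return dists.topk(k, dim=-1, largest=False).indices               # (B, n_sample, k)


if __name__ == "__main__":
    pts = torch.randn(2, 25 * 64, 3)          # e.g. 25 joints x 64 frames treated as points
    centers = farthest_point_sample(pts, n_sample=256)
    neighbours = knn_group(pts, centers, k=16)
    print(centers.shape, neighbours.shape)     # (2, 256) and (2, 256, 16)
```

The grouped neighbourhoods would then be fed to the point cloud information extractor to obtain local latent features.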
Skeleton-based action recognition has achieved remarkable results through the development of graph convolutional networks (GCNs) and skeleton transformers. However, existing methods focus on encoding joints' positions with the given time and serial number information, neglecting to model the positional information contained in the 3D coordinate channel itself. To solve this problem, this paper proposes a skeleton-to-point network (Skeleton2Point) to model joints' positional relationships in three-dimensional space, which is the first work to introduce point cloud methods into skeleton-based action recognition through a dual-learner approach. The human skeleton learner feeds compact skeletal representations into the skeleton transformer network, which is composed of a spatial transformer block and a temporal transformer block. In the point cloud learner, skeleton data is transformed into point cloud form with a proposed Information Transform Module (ITM), which fills the channel information with the spatial and temporal serial numbers. Then, several point cloud learning levels are adopted to extract deep position features. Each point cloud learning level consists of three key layers: a Sampling layer, a Grouping layer, and a Point cloud extraction layer. We also propose a Cluster-Dispatch-based Interaction module (CDI) to enhance the discrimination of local-global information. In comparison with existing methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, Skeleton2Point achieves state-of-the-art performance on both the joint modality and stream fusion. In particular, on the challenging NTU-RGB+D 120 dataset under the X-Sub and X-Set settings, the accuracies reach 90.63% and 91.86%, respectively.
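One plausible reading of the Information Transform Module is sketched below: each joint's 3D coordinates at each frame become a point, and normalized temporal and spatial serial numbers are appended as extra channels. The (batch, 3, T, V) input layout, the normalization, and the function name information_transform are assumptions for illustration, not the paper's exact specification.

```python
# Sketch of converting skeleton data into point cloud form, with serial numbers as channels.
import torch


def information_transform(skeleton):
    """Convert skeleton data (B, 3, T, V) into point cloud form (B, T*V, 5):
    3D coordinates plus normalized temporal and spatial serial numbers."""
    B, C, T, V = skeleton.shape
    xyz = skeleton.permute(0, 2, 3, 1).reshape(B, T * V, C)          # (B, T*V, 3)
    t_idx = torch.arange(T).repeat_interleave(V) / max(T - 1, 1)     # temporal serial number
    v_idx = torch.arange(V).repeat(T) / max(V - 1, 1)                # spatial (joint) serial number
    extra = torch.stack([t_idx, v_idx], dim=-1).expand(B, -1, -1)    # (B, T*V, 2)
    return torch.cat([xyz, extra], dim=-1)                           # (B, T*V, 5)


if __name__ == "__main__":
    skel = torch.randn(2, 3, 64, 25)           # batch of 2, 64 frames, 25 joints
    points = information_transform(skel)
    print(points.shape)                         # torch.Size([2, 1600, 5])
```

The resulting point set can then be passed through the sampling, grouping, and extraction layers of each point cloud learning level.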
Motivation: Containing 3D coordinates, skeleton joints can be naturally viewed as point clouds distributed in three-dimensional space, as illustrated. However, when dealing with joint coordinates in skeleton data, existing methods focus on the topological relationships between the joints, encoding joints' positions with the given time and serial number information, while neglecting to model the positional information contained in the 3D coordinate channel itself.
Key insight: To the best of our knowledge, we are the first to regard skeleton joints as point clouds by incorporating the positional information of skeletons into point cloud methods, demonstrating the validity of modeling position relationships with 3D coordinates.
Through extensive benchmarking, we find that our proposed framework significantly enhances the ability of skeleton-based recognition models to recognize different types of actions.
We adopt PointConT as the lite point cloud information extractor and PointMLP as the heavy point cloud information extractor, and compare with state-of-the-art methods on two datasets, NTU-RGB+D 60 and NTU-RGB+D 120, to verify the competitive performance. Notably, for a fair comparison, we also report results using the joint stream only, because joints contain 3D coordinates in the same way as point clouds; if the bone or motion streams were fed to the point cloud learner, they would not be well compatible with the 3D-coordinate representation.
BibTex Code Here