How to Write Related Work
1 BEHAVE: Dataset and Method for Tracking Human Object Interactions (CVPR 2022)
In this section, we first briefly review work on object and human reconstruction in isolation from the environmental context. Such methods focus on modelling appearance and do not consider interactions. Next, we cover methods focused on humans in static scenes, and finally we discuss the work most closely related to ours, on modelling dynamic human-object interactions.
1.1. Appearance modelling: Humans and objects without scene context
Human reconstruction and performance capture
Perceiving humans from monocular RGB data [12, 29, 31, 41, 43, 44, 58, 59, 64, 87] and in multi-view settings [37–40, 62] has been widely explored. Recent work tends to focus on reconstructing fine details such as hand gestures and facial expressions [20, 25, 85, 91], self-contacts [27, 54], interactions between humans [26], and even clothing [6, 11].
These methods benefit from representing humans with parametric body models [52, 58, 81], which motivates our use of recent implicit diffused representations [8, 10] as the backbone of our tracker.
Following the success of pixel-aligned implicit function learning [64, 65], recent methods can capture human performance from sparse RGB cameras [38, 80] or even a single one [47, 48]. However, capturing 3D humans from RGB data involves a fundamental ambiguity between depth and scale.
Therefore, recent methods use RGBD [56, 69, 73, 76, 84] or volumetric data [9, 10, 19] for reliable human capture. These insights motivate us to build novel trackers based on multi-view RGBD data.
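The depth-scale ambiguity mentioned above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera at the origin; the focal length, the two-point "person", and the helper names `project` and `scale_about_camera` are hypothetical values chosen for illustration and do not come from the paper. Scaling a shape by a factor s about the camera centre multiplies both its size and its depth by s, so its projection is unchanged and an RGB image alone cannot tell the two apart.

```python
def project(point, focal_length=1000.0):
    """Pinhole projection of a camera-space point (X, Y, Z) to image offsets (u, v)."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def scale_about_camera(points, s):
    """Scale every point by s about the camera centre (the origin)."""
    return [(s * x, s * y, s * z) for x, y, z in points]


# Hypothetical "person": two points 1.7 m apart, 3 m in front of the camera.
person = [(0.0, -0.85, 3.0), (0.0, 0.85, 3.0)]

# The same shape made twice as large and placed twice as far away.
giant = scale_about_camera(person, 2.0)

pixels_person = [project(p) for p in person]
pixels_giant = [project(p) for p in giant]

# Both lists hold the same image coordinates, although the 3D shapes differ.
print(pixels_person)
print(pixels_giant)
```

A depth measurement removes this ambiguity because it fixes Z directly, which is consistent with the move to RGBD input described above.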
Object reconstruction
Most existing work on reconstructing 3D objects from RGB [21, 46, 53, 75, 78] and RGBD [45, 55, 82] data does so in isolation, without human involvement or interaction. While challenging, it is arguably more interesting to reconstruct objects in a dynamic setting where they are severely occluded by the human.
1.2. Interaction modelling: Humans and objects with scene context
Humans in static scenes
Modelling how humans act in a scene is both important and challenging. Tasks such as placing humans into static scenes [34, 49, 90], motion prediction [15, 32] or human pose reconstruction [16, 33, 77, 86, 89] under scene constraints, and learning priors for human-object interactions [66] have been investigated extensively in recent years. These methods are relevant but restricted to modelling humans interacting with static objects. We address the more challenging problem of jointly tracking human-object interactions in dynamic environments where objects are manipulated.
Dynamic human object interactions
Recently, there has been a strong push toward modelling hand-object interactions based on 3D [42, 72], 2.5D [13, 14] and 2D [22, 24, 28, 35, 83] data. Although powerful, these methods are currently restricted to hand-object interactions. In contrast, we are interested in full-body capture. Methods for dynamic full-body human-object interaction approach the problem via 2D action recognition [36, 51] or reconstruct 3D object trajectories during interactions [23]. Despite being impressive, such methods either lack full 3D reasoning [36, 51] or are limited to specific objects [23].
More recent work reconstructs and tracks human-object interactions from RGB [71] or RGBD streams [70], but does not consider contact prediction, thus missing a component necessary for accurate interaction estimates.
Very relevant to our work, PHOSA [88] reconstructs humans and objects from a single image. PHOSA uses hand-crafted heuristics, instance-specific optimization for fitting, and pre-defined contact regions, which limits generalization to diverse human-object interactions. Our method, on the other hand, learns to predict the necessary information from data, making our models more scalable. As shown in the experiments, the accuracy of our method is significantly higher than that of PHOSA.