Learning from Data as the Basis for Modeling the Pose and Appearance of Humans and Virtual Avatars: Candidate of Sciences dissertation and author's abstract, VAK RF specialty 00.00.00, Egor Burkov (Бурков Егор Андреевич)

  • Author: Egor Burkov (Бурков Егор Андреевич)
  • Degree: Candidate of Sciences
  • Year and institution: 2024, HSE University (National Research University Higher School of Economics)
  • VAK RF specialty: 00.00.00 (Other Specialties)
  • Number of pages: 161
Citation: Burkov, Egor Andreevich. Learning from Data as the Basis for Modeling the Pose and Appearance of Humans and Virtual Avatars: Candidate of Sciences dissertation: 00.00.00 (Other Specialties). National Research University Higher School of Economics, 2024. 161 pages.

Table of contents of the dissertation (Candidate of Sciences, Egor Burkov)

Contents

Introduction

Context: Human Capture

Illustrative Example of Human Capture: Telepresence

Pipeline Overview

Pose and Identity

Challenges

Rationale and Objectives

3D body pose from single-view RGB video

3D body pose from multi-view RGB

Latent pose for face and head

Single-view 3D head reconstruction

Contribution

1 Related Work

1.1 Appearance Modeling

1.1.1 Traditional Methods

1.1.2 Neural Implicit 3D Reconstruction

1.1.3 Few- or Single-View Reconstruction

1.2 Pose Estimation

1.2.1 Keypoints

1.2.2 Other Representations

1.3 Controllable Avatars

1.3.1 Explicit Models

1.3.2 Non-Interpretable/Latent Models

2 3D-Rotatable Pose for Human Body

2.1 Target Application

2.1.1 Limitation of Interest

2.2 Temporal 3D Lifting of 2D Keypoints

2.2.1 Method

2.2.2 Results

2.3 Learnable Triangulation of 3D Keypoints for the Multi-View Case

2.3.1 Method

2.3.2 Results

2.4 Discussion

3 Person-Agnostic Pose for Head and Face

3.1 Target Application

3.1.1 Limitation of Interest

3.2 Self-Supervised Learning of Latent Pose Descriptors

3.2.1 Method

3.2.2 Results

3.3 Discussion

4 Head Geometry Capture from Single RGB Image

4.1 Target Application

4.2 3D Surface Reconstruction via Meta-Learning Neural Implicit Functions

4.2.1 Method

4.2.2 Results

4.3 Discussion

Conclusion

Bibliography

Appendix: Перевод диссертации на русский язык (Russian Translation of the Dissertation)

Введение (Introduction)

Глава 1. Обзор литературы (Chapter 1. Literature Review)

Глава 2. Трёхмерная поза тела человека (Chapter 2. 3D Human Body Pose)

Глава 3. Представление позы головы и лица, не зависящее от личности (Chapter 3. Person-Agnostic Head and Face Pose Representation)

Глава 4. Оценка формы головы по одному RGB-изображению (Chapter 4. Head Shape Estimation from a Single RGB Image)

Заключение (Conclusion)


Introduction of the dissertation (excerpt from the author's abstract) on the topic «Learning from Data as the Basis for Modeling the Pose and Appearance of Humans and Virtual Avatars»

Introduction

Context: Human Capture

Human modeling and tracking, collectively termed human capture, are a set of problems with the end goal of automatically understanding the appearance and the pose of a person. The actual form of "understanding" here is very flexible and depends on the technologies involved, the sensors used, the desired level of detail, and, importantly, the target applications. Example applications of human capture algorithms include VR/AR telepresence, action recognition for surveillance, automated sports highlight generation, recognition of clothing on a real person for re-dressing their virtual avatar (e.g. in a so-called virtual mirror), or automatically turning oneself into a 3D video game character.

Since the range of applications is extremely broad, human capture has become a strong driver of modern computer vision research, spawning numerous research areas. Problem settings in these areas, again, differ in the applications they serve, in the setups and kinds of input sensors [115, 69, 100], in the types of available training datasets [1, 145], in the desired level of detail [46, 22, 33, 56], etc.

Often, when high-quality capture is required, e.g. in the production of movies and games, creators still resort to complicated advanced sensors and methods of capture that use little or no prior knowledge. For instance, a commercial volumetric capture (i.e. digitization of a moving person into a dynamic 3D mesh) system [118] required a custom studio with 16 stereo cameras. Other examples are state-of-the-art commercial motion capture (i.e. estimation of the 3D locations of several specified points on the body of a moving person) systems [47, 138], which require setting up dozens of cameras and require the actor to wear so-called markers. However, for a human brain, visual cues alone are usually sufficient to perform similar tasks. That is, just by looking at someone, we can almost always imagine, for example, how they might look from different views, and fully understand their body pose. We argue that this mental inference is possible because we have "learned from data", having seen many people before. In contrast, the example systems above seem to never use any prior knowledge about human bodies.

In this work, we try to push the boundaries of some existing human capture algorithms by introducing and emphasizing the trainability mentioned above. Automatically learning prior knowledge about human appearance and motion from data should allow methods of capture to use less complicated sensors and setups (in this work, we restrict ourselves to RGB cameras) and/or to improve capture quality.

Namely, choosing from the broad spectrum of problem settings, we address some of those (Section ) where only commodity RGB sensors (cameras) are used and, importantly, where learning additional human priors from data would be especially beneficial, i.e. where such learned priors could improve the currently available solutions.

Illustrative Example of Human Capture: Telepresence

Out of all human capture applications, we were mostly inspired by telepresence, in particular next-generation telepresence such as remote communication via avatars in VR/AR. It normally involves heavy use of both human appearance capture and generalized human pose estimation, so we believe it is a complete, illustrative example of a human capture application. Therefore, we discuss it in more detail in this section to bring the reader deeper into the context. Below, we formalize telepresence, define several important terms, point out some challenges and research problems, and draw connections to our work.

Pipeline Overview

Telepresence is a set of technologies that, among other things, allow people to communicate as if they were present at the same location. The current de facto standard of telepresence is remote video chat apps.

Suppose person A is communicating something to person B via a telepresence system. We call person A the driver and person B the observer. In telepresence, one device captures the driver and then processes and transmits the captured information to another device that presents (renders) this information to the observer. In the case of conventional video chat apps, the capturing and the rendering devices can be the driver's smartphone camera and the observer's PC screen respectively.

Video conferencing has changed the world and has become an extremely important means of communication. Modern technologies can revolutionize telepresence again. For example, if the capturing device was a smartphone camera equipped with advanced computer vision and the rendering device was a virtual reality (VR) headset, users could interact with each other's realistic avatars in VR or alter their avatar's identity to preserve privacy.

Specifically, here are some vision-related features that are all optional but distinguish next-generation telepresence considered in this section from conventional video conferencing:

1. free viewpoint, i.e. the ability to look around other people's avatars from arbitrary positions, e.g. through an augmented reality (AR) or a VR device;

2. the ability to alter the pose or facial expression of one's avatar, for instance to artificially shift the eye gaze of the avatar to maintain eye contact;

3. the ability to easily alter the appearance or identity of one's avatar, for instance to pretend being a celebrity.

Pose and Identity

[Figure 1 schematic: the driver is captured by sensor(s) into a raw capture x; pose estimation produces P(x); together with the fixed avatar identity I and the observer's viewpoint θ (orientation, position), the avatar renderer produces the render R(I, P(x), θ) from the observer's viewpoint, shown on the observer's rendering device.]

Figure 1: An abstract "perfect" telepresence pipeline where pose and identity representations are separate and disentangled.

For this example, we consider next-generation telepresence systems that are human-centric. This means that, unlike common video conferencing apps, they have the notion of a person and maintain a dedicated representation of a person. Further, we distinguish two representations in the case of a perfect system: one for characteristics of a person that are fixed during a telepresence session (identity), e.g. skin color, and one for the changing ones (pose), e.g. facial expression.

Such an abstract telepresence system is depicted in Figure 1 and works as follows. The observer has a rendering device R (e.g. a VR headset) that lets them watch a virtual avatar from an arbitrary viewpoint θ. This avatar's static appearance is defined by fixed identity information I, while its dynamics are controlled by the (possibly altered) pose P of the driving person, which is initially captured by some sensors, e.g. an RGB camera of a smartphone.
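To make the notation concrete, here is a minimal pseudocode sketch of one frame of this abstract pipeline; the function names and interfaces are hypothetical placeholders rather than those of any particular system:

    # Minimal sketch of the abstract pipeline in Figure 1; all interfaces here are
    # hypothetical placeholders, not an actual implementation.
    def telepresence_frame(x, I, theta, pose_estimator, renderer):
        """x: raw driver capture (e.g. an RGB frame); I: fixed identity representation;
        theta: observer's viewpoint (orientation and position)."""
        P = pose_estimator(x)         # pose P(x): dynamic properties, ideally identity-free
        return renderer(I, P, theta)  # the render R(I, P(x), theta) shown to the observer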

Here, by identity we mean information about static properties of an avatar that should not change during a telepresence session. In life-like visual telepresence, these may include skin texture, facial features, clothing, eye color, body shape, or adornments. Depending on the implementation, I can be pre-computed by a special algorithm from a real person or defined by an artist. For example, it could be modeled from the driver with or without enhancements, from a celebrity, or from an imaginary character.

On the other hand, by pose we mean all other properties of an avatar, i.e. the dynamic ones. These may include the person's posture, facial expression, eye gaze, skin wrinkles, and hair layout. In the context of telepresence, pose estimation P(x) refers to a family of algorithms that try to estimate as many of the above properties as possible from sensor data x. In a perfect telepresence scenario, the pose (pose representation) P(x) should contain as little information about the identity as possible, both to allow arbitrary drivers and for better security.

Such a dual representation permits more control over the system, in particular paving the way to the functions listed in Section . It can help keep the avatar rendering algorithm from overfitting to the identity of a fixed person (from whom the pose is obtained) and instead allow an arbitrary person to seamlessly manipulate (drive) an avatar with their pose. Besides, it improves privacy because a telepresence session may only require transmitting the pose, keeping the identity secure.

Challenges

In practice, the above disentanglement rarely holds. Specifically, it is difficult to devise a pose estimation algorithm that outputs a pose representation which is completely free from identity information. For example, one of the most popular representations - locations of prominent keypoints [95] - gives away the person's identity via gait pattern and bone lengths. Naturally disentangled representations such as 3D morphable models [7] or action units [21] are limited by datasets that are difficult to collect.

Even if we ignore the problem of disentanglement, pose estimation is still a challenging task. For instance, 2D keypoint detectors need complex tracking to yield temporally smooth predictions while staying robust to occluded body parts. 3D keypoints, even if detected from multiple calibrated cameras, are susceptible to noisy estimates, again because of (self-)occlusions.

Capturing the identity alone is no easier. The geometry of a person (usually in a neutral pose) would often be represented by a textured mesh or an implicit function with a color field. One problem is that these representations need to be equipped with a rigging algorithm to modify the pose. More importantly, for higher quality of reconstruction, these representations will either require advanced inputs (depth, video, multi-view RGB) [69] or parametrization by pre-training on complex datasets [155]. To avoid representing the geometry directly, fully neural models can be used. These encode identity into some representation that is not directly interpretable. However, with this approach, it is highly non-trivial to achieve disentanglement.

Rationale and Objectives

In this work, we set out to improve pose and identity estimation for various human modeling and tracking applications. We do this by introducing novel algorithms which address some of the challenges discussed in Section . We are particularly interested in the advantages of learning nontrivial patterns from data, so we pick four challenges where we feel that such a data-driven paradigm (e.g. large datasets, end-to-end differentiable neural models, self-supervised learning, meta-learning) is underused in the existing research but could be a good fit.

3D body pose from single-view RGB video

First, in Section 2.2 we consider a telepresence system where the driving person is captured with a single RGB video camera and drives a full-body avatar that can be viewed from any direction (e.g. through a VR headset or an augmented reality smartphone app). The issue here is that the pose representation consumed by the avatar rendering algorithm is the 2D locations of landmarks (keypoints) on the body and face; thus, to achieve the free viewpoint capability, there has to be an algorithm that "reprojects" the driver's captured pose into novel views. We propose to solve this by training a novel neural network that "lifts" the 2D landmarks detected by an off-the-shelf algorithm into 3D in a temporally smooth way, thanks to a large dataset of temporal sequences of 3D human poses. These predicted 3D landmarks are then simply projected into the observer's view to obtain the 2D landmarks for the avatar renderer.
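As an illustration of what such a lifting step could look like (a simplified sketch with assumed joint counts and layer sizes, not the exact architecture of Section 2.2), a small PyTorch model can consume a temporal window of 2D keypoints, predict 3D keypoints, and have them projected into a novel view:

    import torch
    import torch.nn as nn

    class TemporalLifter(nn.Module):
        """Maps a window of 2D keypoint frames to 3D keypoints for one frame."""
        def __init__(self, num_joints=25, window=9, hidden=1024):
            super().__init__()
            self.num_joints = num_joints
            self.net = nn.Sequential(
                nn.Linear(window * num_joints * 2, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_joints * 3),
            )

        def forward(self, keypoints_2d):              # (batch, window, num_joints, 2)
            out = self.net(keypoints_2d.flatten(1))   # flatten the temporal window
            return out.view(-1, self.num_joints, 3)   # (batch, num_joints, 3)

    def project(points_3d, camera):                   # camera: (3, 4) projection matrix
        """Projects predicted 3D keypoints into the observer's (novel) view."""
        ones = torch.ones_like(points_3d[..., :1])
        uvw = torch.cat([points_3d, ones], dim=-1) @ camera.T
        return uvw[..., :2] / uvw[..., 2:3]           # 2D keypoints for the avatar renderer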

3D body pose from multi-view RGB

The 3D pose predicted above appears realistic but not always precise due to single-view projection ambiguity. To address this, in Section 2.3 we explore a multi-camera setup to obtain precise 3D keypoint coordinates. Since existing algorithms for pose triangulation (i.e. multi-camera 3D pose estimation) still often output unrealistic and noisy poses, we propose an algorithm that, for the first time, triangulates the 3D body pose according to pose priors learned from a multi-view dataset. We show that it is much more precise than existing algorithms and outputs realistic poses.
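For intuition, a confidence-weighted algebraic triangulation of a single joint can be written in a few lines. The sketch below is only a simplified illustration of this idea (the actual method also involves learned components and a volumetric variant): per-camera confidence weights could be produced by a network and, since the SVD is differentiable, receive gradients end-to-end.

    import torch

    def weighted_triangulate(proj_matrices, points_2d, confidences):
        """proj_matrices: (C, 3, 4) camera matrices; points_2d: (C, 2) detections;
        confidences: (C,) per-camera weights. Returns one 3D joint position."""
        rows = []
        for P, (u, v), w in zip(proj_matrices, points_2d, confidences):
            # Standard DLT equations, scaled by the camera's confidence weight.
            rows.append(w * (u * P[2] - P[0]))
            rows.append(w * (v * P[2] - P[1]))
        A = torch.stack(rows)            # (2C, 4) homogeneous system A y = 0
        _, _, Vh = torch.linalg.svd(A)
        y = Vh[-1]                       # singular vector of the smallest singular value
        return y[:3] / y[3]              # dehomogenized 3D joint position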

Latent pose for face and head

Next, we consider a dedicated head-and-shoulders-only telepresence scenario known as "head reenactment" (Chapter 3). In this scenario, the driver is similarly captured with a single RGB camera, but only their head pose and facial expression are transferred to the avatar. In addition, the head avatar is rendered from a fixed viewpoint, so a 3D representation, although desirable, is not needed. The current state-of-the-art algorithm for this task represents the pose with 68 2D landmarks and thus inherits numerous disadvantages of this representation (Section 1.2.1). Particularly undesirable is the identity information contained in landmarks, to which the renderer tends to overfit, causing the rendering of a wrong identity when the identities of the driver and the avatar differ. To fix this and some other issues, we propose to replace landmarks with a novel latent pose representation that is person-agnostic and is learned automatically from a large dataset in a self-supervised manner. We show that it indeed works much better for cross-person reenactment than landmarks, and is more precise than the person-independent pose representation obtained from a 3DMM.
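The core self-supervised signal can be summarized roughly as follows (a condensed sketch with placeholder module names; the actual pipeline in Section 3.2.1 involves additional losses, augmentations, and architectural details): the pose encoder sees one frame of a video, the identity encoder sees other frames of the same person, and the generator must reconstruct the pose frame from the two codes, with the pose bottleneck kept small so that it carries as little identity information as possible.

    import torch

    def reenactment_training_step(frames, pose_encoder, identity_encoder, generator, loss_fn):
        """frames: (K, 3, H, W) frames sampled from one video of one person."""
        pose_frame = frames[:1]                          # frame whose pose must be reproduced
        identity_frames = frames[1:]                     # remaining frames describe the identity
        pose_code = pose_encoder(pose_frame)             # e.g. a 256-D latent pose descriptor
        id_code = identity_encoder(identity_frames).mean(dim=0, keepdim=True)
        reconstruction = generator(pose_code, id_code)   # should match the pose frame
        return loss_fn(reconstruction, pose_frame)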

Single-view 3D head reconstruction

Finally, we focus completely on the estimation of identity I. We would like to obtain a textured 3D mesh of a human head given just one or few RGB images. Traditionally, this would involve complex or hand-crafted models, such as facial mesh templates computed from 3D scans, or hair styles pre-created by artists. We again rely on the power of learning from data and build a model automatically from a simple dataset of videos (Chapter 4). This is possible thanks to neural implicit functions, a novel 3D representation based on neural networks.
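As a reminder of what a neural implicit function is, the sketch below defines a minimal coordinate MLP that maps a 3D point to a signed distance to the head surface (a simplified stand-in; informally, the role of meta-learning here is to provide an initialization of such a network that can be specialized to a new person from one or a few photos):

    import torch
    import torch.nn as nn

    class ImplicitHeadSurface(nn.Module):
        """A coordinate MLP f(x, y, z) -> signed distance to the surface."""
        def __init__(self, hidden=256, depth=4):
            super().__init__()
            layers, dim = [], 3
            for _ in range(depth):
                layers += [nn.Linear(dim, hidden), nn.ReLU()]
                dim = hidden
            layers.append(nn.Linear(dim, 1))
            self.mlp = nn.Sequential(*layers)

        def forward(self, xyz):          # (N, 3) query points
            return self.mlp(xyz)         # (N, 1) signed distances

    # The head surface is the zero level set {x : f(x) = 0}; a mesh can be extracted
    # from it, e.g., by running marching cubes on a dense grid of query points.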

Contribution

To summarize, our contributions are the following:

• We propose an algorithm to estimate 3D full body keypoints from a temporal sequence of 2D keypoint estimates. The algorithm is able to predict realistic poses even if some 2D keypoints are missing or erroneously detected. We demonstrate how the findings can enable free viewpoint rendering in an existing telepresence system.

• We introduce a novel approach for multi-view triangulation of 3D body keypoints. The model learns from data and produces smoother and more realistic poses than previous approaches.

• We devise a head reenactment pipeline in which a latent representation for head pose and expression is learned. We show that this representation is a better fit, e.g. for telepresence with an arbitrary driver, than some other representations, i.e. it provides a better tradeoff between disentanglement and precision.

• We suggest a method for estimation of full 3D head shape from just one or few photos of a head. We apply meta-learning to neural implicit functions, a recently emerged family of 3D representations. Our method produces reasonably good 3D portraits even from in-the-wild photos. Compared to previous methods, ours does not require 3D scanned datasets and uses less compute.

The rest of this thesis is organized as follows. Chapter 1 reviews prior work on the problems of human capture considered here. Chapter 2 describes a full-body telepresence system with 2D keypoint locations as the pose representation, and explores ways to add free viewpoint capability to it, namely the temporal prediction of 3D keypoint positions by a shallow neural network (Section 2.2) and the more advanced learnable triangulation of 3D keypoints in the multi-camera setup (Section 2.3). Chapter 3, on the other hand, stresses the disadvantages of such hand-crafted keypoints on the additional example of a head-only telepresence system (Section 3.1), and therefore departs from keypoints to develop a latent person-agnostic pose representation that is learned automatically from data (Section 3.2.1). Chapter 4 focuses entirely on identity capture and does not consider pose estimation. It describes an algorithm for single-view 3D head mesh reconstruction.


Conclusion of the dissertation on the topic «Other Specialties», Egor Burkov (Бурков Егор Андреевич)

3.2.2 Results

Our quantitative evaluation assesses both the relative performance of the pose descriptors on proxy tasks and the direct quality of cross-person head reenactment. In qualitative studies, we show examples of reenactment both with the same and with a different driver, as well as the results of interpolation in the learned pose space. An ablation study shows how the various components of our method affect the head reenactment metrics.

Compared methods

Below, we compare our results against those of the following methods and systems. We consider the following pose descriptors, which rely on different degrees of supervision during training:

• Ours. The 256-dimensional latent pose descriptors learned within our system.

• X2Face. The 128-dimensional driving vectors obtained within the X2Face reenactment system [144].

• FAb-Net. We also evaluate the 256-dimensional FAb-Net descriptors [143] as a pose representation. They are similar to ours in that, although not person-agnostic, they are likewise learned without supervision from the VoxCeleb2 video collection.

• 3DMM. We consider a state-of-the-art 3DMM system [12]. This system decomposes the head representation into head orientation, facial expression, and a head shape descriptor using a deep network. The pose descriptor is obtained by concatenating the head orientation (represented as a quaternion) and the facial expression parameters (29 coefficients).

Our descriptor is trained on the VoxCeleb2 dataset. The X2Face descriptor is trained on the smaller VoxCeleb1 dataset [96], and FAb-Net on both. The 3DMM descriptors are the most heavily supervised, since the 3DMM is trained on 3D scans and requires a keypoint detector (which is, in turn, also trained with supervision).

In addition, we consider the following head reenactment systems based on the above pose descriptors:

• Ours. Our full system described in Section 3.2.1.

• X2Face. The X2Face system [144], based on its own descriptors and warping-based reenactment.

• X2Face+. In this variant, we use the frozen pre-trained X2Face driving network (up to the driving vector) in place of our pose encoder, and leave the rest of the architecture unchanged compared to ours. We train the identity encoder, the generator conditioned on the X2Face latent pose vector and our identity descriptor, and the projection discriminator.

• FAb-Net+. Same as X2Face+, but with a frozen FAb-Net instead of our pose encoder.

• 3DMM+. Same as X2Face+, but with a frozen ExpNet [12] instead of our pose encoder and with pose warping disabled. The pose descriptor is constructed from the ExpNet outputs as described above. We additionally normalize these 35-dimensional descriptors by the element-wise means and standard deviations computed over the VoxCeleb2 training set.

• FSTH. The original head reenactment system of [157], driven by rasterized keypoints.

• FSTH+. We retrain the system of [157] with a number of changes that make it more comparable to ours and the other systems. The raw keypoint coordinates are fed into the generator via the AdaIN mechanism (as in our system). The generator predicts a segmentation mask along with the image. We also use our own training image cropping scheme, which differs from that of [157].

Descriptor evaluation

To understand how well the learned pose descriptors match different people in the same pose, we use the Multi-PIE dataset [31], which is not used to train any of the descriptors but provides labels for six emotion classes for people in various poses. We restrict the dataset to near-frontal and half-profile camera orientations (namely 08_0, 13_0, 14_0, 05_1, 05_0, 04_1, 19_0), leaving 177,280 images. In each camera orientation group, we randomly pick a query image and retrieve its N nearest images from the same group using the cosine distance between descriptors. We count a match as correct if the retrieved image shows a person with the same emotion label. We repeat this procedure 100 times for each group. In Table 3.1, we report the overall rate of correct matches in the top-10, top-20, top-50, and top-100 lists. For the 3DMM descriptor, we take into account only the 29 facial expression coefficients and ignore the rigid pose information as irrelevant to emotions.
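For clarity, the retrieval protocol described above can be sketched as follows (variable names are illustrative): within one camera orientation group, the query's nearest neighbors by cosine distance are checked for the same emotion label.

    import numpy as np

    def top_n_match_rate(descriptors, emotion_labels, query_idx, n):
        """descriptors: (num_images, D) pose descriptors of one camera-orientation group;
        emotion_labels: (num_images,) emotion class of each image."""
        d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
        similarity = d @ d[query_idx]          # cosine similarity to the query image
        similarity[query_idx] = -np.inf        # exclude the query itself
        nearest = np.argsort(-similarity)[:n]  # indices of the N closest images
        return np.mean(emotion_labels[nearest] == emotion_labels[query_idx])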
