Computer vision is one of the most remarkable developments to emerge from the field of computer science. It’s among the most rapidly growing areas in the technology landscape and has the potential to significantly impact the way people live and work. Advances at the intersection of machine learning (ML) and computer vision have been accelerating in recent years, leading to significant progress in numerous fields, including healthcare, robotics, the automotive industry, and augmented reality (AR). Microsoft to be a prominent contributor to computer vision research.
Microsoft researchers have long been collaborating with academics and experts in the field on numerous computer vision projects with the goal of expanding what’s possible and helping people achieve more. One example is PeopleLens, a head-worn device that helps children who are blind or have low vision more easily interact in social situations by identifying people around them through spatialised audio. Another example is Swin Transformer. This computer vision architecture attains high accuracy in object detection and provides an opportunity to unify computer vision and natural language processing (NLP) architectures increasing the capacity and adaptability of computer vision models.
Microsoft Research to share some of its newest work in this space at the European Conference on Computer Vision (ECCV) 2022, with 45 accepted papers that will be presented through live presentations, tutorials, and poster sessions. This post highlights two of these papers, which showcase the latest research from Microsoft and its collaborators. One involves increasing the number of facial landmarks for more accurate 3D face reconstruction, achieving state-of-the-art results while decreasing the required compute power. The other introduces a dataset that takes advantage of the capabilities of AR devices for visual localisation and mapping driven by real-world AR scenarios.
3D face reconstruction with dense landmarks
Facial landmarks are points that correspond across all faces, and they often play a key role in face analysis. Researchers frequently rely on them when performing basic computer vision tasks, such as estimating head position and identifying gaze direction and more generally the position in space of all the details of the face. Facial landmarks include such areas as the tip of the nose, corners of the eyes, and points along the jawline. Typically, public datasets that practitioners use to train ML models contain annotations for 68 facial landmarks. However, numerous aspects of human faces are not precisely represented by 68 landmarks alone, and additional methods are often needed to supplement landmark detection, adding complexity to the training workflow and increasing the required compute power.
With the goal of achieving accurate 3D face reconstruction, we propose increasing the number of facial landmarks. In our paper “3D face reconstruction with dense landmarks,” we introduce a method to accurately predict 703 facial landmarks, more than 10 times as many as are commonly used, covering the entire face in great detail, including the eyes, ears, and teeth. We show that the increased number of landmarks are very precise when visible, and when they are occluded, for example, when someone lifts a coffee mug to their lips, we can estimate the location of these landmarks and what the part of the face looks like behind the object blocking it. We can use these landmarks to constrain a model-fitting problem to efficiently and precisely estimate all aspects of a face model. This includes the head pose, eye gaze, as well as the identity of the person whose face is being reconstructed, for example, the thickness of the lips and the shape of the nose.
This simple pipeline is comprised only of dense landmarks and continuous mathematical optimisation, allowing for extreme compute efficiency and enabling the entire system to run at over 150 frames per second on a single core of a laptop.
Increasing privacy, fairness, and efficiency with synthetic data
In computer vision, and particularly the area of face reconstruction, there are understandable concerns about anonymity when training ML models because training data often comes from real people. Our proposed method significantly reduces privacy concerns, as it uses only synthetic data to train ML models, compared with methods that use images of real people as part of their training datasets. That said, when we built the synthetic data pipeline, we needed to preserve the privacy of the people whose data we used, and we took care to acquire the consent of those several hundred subjects. This contrasts with the feasibility of acquiring consent from thousands (or even tens of thousands) of subjects, which would have been necessary if we were using real data.
It’s especially challenging, if not impossible, to preserve the privacy of people appearing in “found images” online, where the subject is often unknown. Using synthetic data helps us protect the privacy of data subjects and the rights of photographers and content creators. It’s another tool we can use in our mission to build technology in an ethical and responsible manner. Additionally, because people’s private information is not included in our dataset, if the ML model were to be attacked, only synthetic data would be subject to compromise.
Synthetic data also provides an opportunity to address inclusivity and fairness. Largely because the distribution of the data is fully controlled, ML practitioners can manage the fairness of representation by including diverse samples in their datasets, and all the data needed to do this would be perfectly labeled. For further details on how we build the synthetics model and training data and our approach to capturing the diversity of the human population, please see our face analysis paper.
There are other advantages to using synthetic data to train ML models, as well. For example, these models require a lot of data, giving rise to numerous difficulties that practitioners must navigate to obtain this data, such as the logistics of finding the number of people required, scheduling time in a lab, and situating multiple cameras to capture the various angles of a person’s face. These concerns are greatly reduced with synthetic data.
In addition, because data doesn’t need to be sourced from a real person, the iteration speed to improve the quality of the 3D face reconstruction is remarkably high, creating a robust workflow. And it isn’t necessary to apply quality assurance (QA) processes on each labeled image when using synthetic data – another cost – and time-saving benefit. Another advantage is the increase in accuracy, speed, and cost-effectiveness in labeling data. It would be nearly impossible to ask someone to consistently label 703 landmarks in a set of images.
Face analysis is a foundational piece for many ML systems, such as facial recognition and controlling avatars, and using a method that provides both accuracy and efficiency while also addressing privacy and fairness concerns pushes the boundaries of the state of the art. Up until now, there hasn’t been much work, if any, on methods that can yield this level of quality with only synthetic data. The ability to achieve 3D face reconstruction using dense landmarks and synthetic data has the potential to truly transform what’s possible with ML.
Follow us and Comment on Twitter @TheEE_io