Keywords
facial expression recognition - self-normalizing neural networks - facial geometry
- disease recognition
Introduction
Facial expressions constitute a natural, powerful, and universal means for humans to communicate intentions, physical sensations, and emotional states, which underscores their importance [1] [2].
Significance of facial expressions in general and in diseases
Whereas the meaning of a certain facial expression is greatly influenced by the underlying situation [3], facial expressions generally deliver a vast amount of information about a person: not only do a person’s current emotions show in their facial expression [4], but so do physical sensations such as pain and even abstract concepts like intent [3] [5]. Even if an attempt is made to suppress certain emotional states, or rather their expression, this usually does not succeed completely [6] [7] [8] [9]. Furthermore, facial expressions can serve as an important aid in diagnosing a multitude of different diseases: mask-like facial expressions can be seen in Parkinson’s disease [10], sad, expressionless, anxious faces in depression [11], and diminished facial expression constitutes a distinct domain of negative symptoms in schizophrenia [12] [13].
Currently, on the other hand, medical face masks limit the ability to express emotions through facial expressions, as a substantial part of the face is hidden underneath them [14]. This can be especially limiting in the context of direct patient contact, for both physician and patient: because physicians receive limited information from their patients with regard to facial expressions, the assessment of the patient’s current mental and emotional state may be impaired [14]. Conversely, conveying empathic expressions toward patients may be limited as well [14]. In summary, medical face masks complicate social interaction because they disturb the reading of emotions from facial expressions [14]. Of course, we do not want to discourage the use of medical face masks in today’s challenging circumstances, but rather stress the need for safe and reliable surrogates that allow people to express and read emotions.
Facial geometry
However, not only facial expressions but also facial geometry can provide valuable information about patients and their diagnoses: for example, hypertelorism and other facial dysmorphisms are noticeable symptoms in a variety of syndromes, including LEOPARD syndrome [15], cri du chat syndrome [16], and Gorlin-Goltz syndrome [17].
Automated facial expression recognition
Currently, as the field of artificial intelligence and machine learning is evolving rapidly, attempts are being made to classify images of persons and their facial expressions into distinct categories of the respective underlying emotions [18] [19] [20] [21] [22] [23] [24], or to objectively measure the severity of experienced pain [25] [26] [27] [28] or affect [29]. Until now, most research has relied on simple two-dimensional images as input data for training the artificial neural networks and for their subsequent evaluation [22]. More advanced approaches use two-dimensional videos, which offer advantages because videos additionally contain temporal information as well as further information on the shape of the participants’ faces, since the angle of view typically changes subtly due to minimal movements of the head [22]. To train artificial neural networks and to evaluate their performance after training, databases have been established [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44]. These databases contain two-dimensional images, image sequences, videos, or a combination thereof, showing persons with different facial expressions, and all of the data are categorized according to the respective expression [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44].
To further enhance automated facial expression recognition accuracy through improved data availability, databases have been created containing multiple images and/or videos from different viewpoints [37] [44] or even true three-dimensional data [45] [46] [47] [48] [49] of human faces and their respective facial expressions.
Facial expressions in current learning environments
Since the emergence of the COVID-19-causing virus SARS-CoV-2 in 2019 [50], many new challenges have arisen all over the world [51] [52], including in the fields of medical research [53] and medical education [54]. Many educators have been confronted with challenging new requirements and the introduction of virtual classes [55]. Ultrasound programs have discontinued offering hands-on experiences to their students [56] [57], medical schools have limited or entirely suspended patient contact for students [58], and some countries have even forbidden face-to-face university lectures completely [59] [60]. With the ongoing pandemic, new tools such as apps and services for virtual online meetings have surged and become a new standard in current medical teaching environments [51] [61] [62] [63] [64] [65] and diagnostic settings [66]. During the pandemic, these tools and solutions provided a viable surrogate for the first time, because they allowed the teaching process to continue and had proven to be at least equivalent to traditional approaches in earlier studies [51] [59] [67].
Furthermore, it has been shown that the introduction of Zoom virtual tutorials resulted in higher student satisfaction and a reduction in instructor workload of approximately 25%, whereas students’ engagement levels and the grade distribution stayed the same [68]. To seize these new opportunities, it becomes clear that automated recognition of the learner’s status and the corresponding adaptation of teaching and learning requirements must be taken into consideration. Since facial expressions can reflect both excessive and insufficient demands on learners as well as their comprehension, automated facial expression recognition can also play a major role here, detecting the respective state of each student and enabling subsequent adaptation of the learning material by the teaching software [69].
Taking into account the multitude of applications that automated facial expression recognition allows, and the possibilities concerning the detection of pathologies through automated facial scans, the need for a technology that permits not only facial expression recognition but also three-dimensional facial scanning becomes evident. As real-time three-dimensional facial scans can be utilized both as a means to achieve automated facial expression recognition and as an aid in detecting pathologies, a highly portable, flexible, and cost-efficient technology to swiftly perform three-dimensional facial scans would be desirable.
In this study, we present a highly portable and flexible solution for automated facial expression recognition and the acquisition of three-dimensional facial geometry data, using only pre-existing smartphones or tablets to cut costs. In our setup, the only requirement is that the devices are equipped with a TrueDepth camera.
In the study presented in this work, this technology was implemented – to our knowledge – for the first time using only widely available technology.
Since the devices used could even be the patients’ own, large databases could be created quickly, which would enable seamless data acquisition during treatment periods, for example.
Material and Methods
A collective of 226 medical students from Ulm University in the 8th–9th study semesters who were present during the block internship in gynecology and obstetrics between April and December 2019 was examined. The study examination was integrated into the ultrasound seminar, with an average of 10 participants per appointment. All students who regularly attended the seminar or block internship were included, provided they had given their consent to (voluntary) participation.
Before the study was carried out, the participants were informed about the course of the study, its voluntary nature, and the absence of any (negative) consequences in the event of non-participation. Following an inquiry to the Ethics Commission in Ulm, this study fell under § 15 of the professional code for physicians in Baden-Württemberg and therefore did not require formal ethics approval.
Setup
At the beginning of the study, the participants were seated at tables with one third-generation
Apple iPad PRO per student placed on the table’s surface. The iPads had a screen size
of
12.9″ and were running our newly developed app to perform the facial scans. Each of
the iPads was equipped with a SmartFolio cover, which was used to set it up on the
table. The angle
between the table’s surface and the back of the iPad was approximately 60° so that
the device’s front and the TrueDepth camera of the device pointed directly at the
participant’s face.
Description
In total, the list of the facial expressions to be shown by the participants consisted
of “disappointed”, “stressed”, “happy”, “sad”, and “surprised”. On the iPads, the
screen shown in
[Fig. 1] was displayed, containing the current expression to be shown by the participants
in its center.
Fig. 1
The screen that the participants were shown during the examination: it contained instructions on the facial expression to adopt (“Bitte schauen Sie nun traurig” – “Please look sad now”) and an “X” indicating the preferred position on the screen to look at, together with the corresponding instruction (“Bitte die ganze Zeit auf das X schauen” – “Please look at the X all the time”).
In the background, the TrueDepth camera of the iPad was now activated, and the recording
process started.
The TrueDepth camera is a hybrid camera system that consists of both an ordinary camera and a projector, which projects around 30,000 points of light onto the subject’s face and subsequently records their reflection and the time-of-flight difference using an infrared camera [70]. The functionality is thus similar to that of a lidar (light detection and ranging) system [70]. This means that an exact three-dimensional scan of the human face can be recorded in real time by the system [70]. The recorded raw data were automatically analyzed using the ARKit framework, and an ARFaceAnchor object was subsequently created [71] [72]. Utilizing the newly created ARFaceAnchor object, the current facial geometry and the current facial expression could be extracted. The facial geometry was provided in the form of an ARFaceGeometry object as “a coarse triangle mesh representing the topology of the detected face” [73]. The facial geometry, however, was not processed further in this study. The facial expression was represented in the form of a dictionary containing keys of the type “blendShapeLocation”, each matched with a floating-point value [74] [75]. In general, a dictionary is a programmatic type that stores key-value pairs, i.e., each key is matched with a value [76]. In our case, the dictionary contained parameters relating to components of the facial expression and their relative strength. For each recognized facial expression, 52 of these parameters (“blendShapeLocation”) were contained along with their respective strengths. The relative strength was described by floating-point values that could adopt values between 0 and 1 [75]. The values for the relative strength were subsequently saved, accompanied by the type of facial expression to be shown as well as the current time.
In the following, the totality of all 52 parameters recorded at a specific moment shall be referred to as a “snapshot”.
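For illustration only, the structure of such a snapshot could be represented as follows (the study’s app itself was written in Swift and used ARKit; this Python sketch is an assumed, simplified representation, and all blendshape keys other than mouthSmile_L/mouthSmile_R are placeholders):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict


@dataclass
class Snapshot:
    """One 'snapshot': 52 blendshape coefficients plus label and timestamp."""
    label: str                      # expression the participant was asked to show
    timestamp: datetime             # time of recording
    blendshapes: Dict[str, float] = field(default_factory=dict)  # 52 key-value pairs


example = Snapshot(
    label="happy",
    timestamp=datetime.now(),
    blendshapes={
        "mouthSmile_L": 0.82,       # relative elevation of the left corner of the mouth
        "mouthSmile_R": 0.79,       # relative elevation of the right corner of the mouth
        # ... 50 further coefficients, all in the range [0, 1]
    },
)
```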
Our software checked once per second whether the minimum number of 500 such individual snapshots (each with 52 specific features of the face) had been reached for the current type of facial expression. If the minimum number had been reached, the next type of facial expression was displayed on the iPad’s screen and the participants were asked to adopt it. The current progress was indicated by a bar in the middle of the iPad’s screen. During the whole examination, a point marked with an “X” was displayed in the screen’s central area, and the participants were asked to preferably look at this point. This was integrated to prevent the students from looking away from the iPad, which could lead to a deterioration of the recognition quality or even a complete transient loss of recognition by the system. If detection through the camera failed (e.g. because the student looked away from the camera or due to temporarily poor lighting conditions), the acquisition was automatically stopped. This prevented invalid values from being recorded.
Programming
The snapshots obtained were saved successively to a file named “EmotionTrain.txt”, together with the respective facial expression to be displayed and the time of recording.
AI
The data obtained were used to train an artificial neural network (ANN) with a new type of architecture.
The study was designed in such a way that a minimum number of around 500 such snapshots per type of facial expression had to be recorded before continuing to the next type of expression, because the resulting data set was supposed to be balanced in terms of the number of snapshots per facial expression. As this minimum number was checked by a program function at regular intervals and, once exceeded, triggered the switch to the next facial expression or, finally, the end of the examination, the total number of registered snapshots during the training part is not an integer multiple of the product of the number of study participants and the number of facial expression categories. The equal distribution of the snapshots across categories was checked statistically.
The recorded snapshots were separated into two sub-sets: a training-set and a test-set. The training-set was used to train the network and consisted of 85% of the total number of snapshots in the data set, whereas the test-set consisted of the remaining 15%. The test-set was presented to the network only after the training was fully completed; it was used to assess the network’s accuracy and further performance metrics.
Allocation of the snapshots to the respective set took place using a random generator after shuffling the entire data set [77].
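A minimal sketch of such a shuffled 85/15 split (an assumed implementation, not the study’s original code; X and y are filled with synthetic placeholder data here) could look as follows:

```python
import numpy as np

# X holds the snapshots (n x 52 blendshape values), y the expression labels.
rng = np.random.default_rng(seed=42)
X = rng.random((563226, 52), dtype=np.float32)   # placeholder for the recorded snapshots
y = rng.integers(0, 5, size=563226)              # 5 expression categories

idx = rng.permutation(len(X))                    # shuffle the entire data set
split = int(0.85 * len(X))                       # 85% training-set, 15% test-set
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]
```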
The ANN was then created using the TensorFlow environment [78]. It contained a total of 22 different layers: an input layer, an output layer, and 20 hidden layers in between (see [Fig. 2]). The hidden layers are made up of dense layers and dropout layers; the dropout layers were used to minimize overfitting of the network [79].
Fig. 2
Schematic representation of the network architecture, based on the function plot_model of the TensorFlow/Keras environment [78] [80]. The input layer is shown in the first box, followed by several “Dense” and “AlphaDropout” layers; finally, the output layer is shown. A curly bracket next to a box indicates that the corresponding section is repeated the given number of times.
In addition, an L2 regularization of 0.25 was used for some of the dense layers to further minimize overfitting [81]. This penalizes the weights of the corresponding layer in proportion to the sum of their squares, thereby discouraging extreme weight values [81]. Since the input values of the ANN were floating-point values in the range from 0 to 1 and these should be processed accordingly in the network, scaled exponential linear units (SELU) were used as the activation function of the hidden layers [82]. The resulting ANN was subsequently trained using Adam optimization. This optimizer was chosen because it requires little memory, can be computed efficiently, is well suited for problems with a large amount of data and parameters, and is invariant to diagonal rescaling of the gradients [83].
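A Keras sketch of the described building blocks is shown below. The layer widths, the application of the L2 penalty to every dense layer, and the strict alternation of Dense and AlphaDropout layers are assumptions, since these details are not specified in the text; only the total layer count (22), the SELU activation, the LeCun normal initialization, the L2 factor of 0.25, the dropout rate of 0.1 described further below, and the Adam optimizer are taken from the description.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_model(n_features: int = 52, n_classes: int = 5) -> tf.keras.Model:
    """Assumed sketch: input + 10 x (Dense + AlphaDropout) + output = 22 layers."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(10):
        x = layers.Dense(
            64,                                      # assumed width, not given in the text
            activation="selu",                       # scaled exponential linear unit
            kernel_initializer="lecun_normal",       # required initialization for SNNs
            kernel_regularizer=regularizers.l2(0.25),  # L2 penalty as described
        )(x)
        x = layers.AlphaDropout(0.1)(x)              # SNN-compatible dropout
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()
# Training as reported in the Results section (400 epochs, batch size 128):
# model.fit(X_train, y_train, epochs=400, batch_size=128)
```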
Special network architecture
Primarily, an ANN of the Self-Normalizing Network (SNN) subtype was used. This type of ANN is characterized by the fact that no a priori normalization – i.e., conversion of data on different scales to a common, defined scale from 0 to 1 – needs to take place before training [82] [84]. For the ANN to achieve self-normalizing behavior, a SELU was used as the activation function, which was first published in 2017 by Klambauer et al. [82]. Since the data used were already normalized, no further normalization would have been necessary, so a conventional, non-self-normalizing network could have been used. The rationale for nevertheless using an SNN is that SNNs offer further advantages compared to conventional, non-self-normalizing network architectures: they allow shortened training times, faster convergence, and a high degree of robustness in training [82].
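For reference, the SELU activation of Klambauer et al. [82] is defined as

$$
\operatorname{selu}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}
\qquad \lambda \approx 1.0507,\; \alpha \approx 1.6733,
$$

with the negative saturation value $\lim_{x \to -\infty} \operatorname{selu}(x) = -\lambda\alpha \approx -1.758$, which is the value that the alpha dropout layers described below assign to dropped activations.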
When using SNNs, however, it is mandatory to use special initialization functions [85]. In our case, we used LeCun normal initialization, which was published by LeCun et al. [86].
Furthermore, when using an SNN it is mandatory to use a special subform of the dropout layers: alpha dropout layers [82] [85]. With these, dropped activations are not simply deactivated (set to zero) but are instead assigned the negative saturation value of the SELU activation function [82] [87]. This means that the self-normalization property of the SNN is retained even after dropout. Klambauer et al. were able to show in 2017 that this method is superior to the usual use of regular dropout layers when using an SNN [82].
A relatively low dropout rate of 0.1 was chosen, since tests showed that such lower rates still reliably limit overfitting while concurrently allowing a high level of network accuracy.
Inference test
After completion of the training, a test to determine the inference time was conducted. Inference denotes the application of a trained artificial neural network: no training takes place; instead, data are supplied to the neural network to obtain the desired predictions.
For this, a standalone app was written in Swift, runnable on both iOS and macOS. The fully trained network was converted into a TFLite network and embedded in the app, as were the test- and training-sets. To measure the inference time, the current system time was captured both before and after invoking the interpreter (interpreter.invoke()). To capture the system time, the command CACurrentMediaTime() was executed. Inference time was measured for the data of the test-set, the training-set, and a total of 500,000 randomly generated single-precision floating-point (Float32) values between 0 and 1. These were generated by the command Float32.random(in: 0 … 1). The inference test app was executed on a 2021 MacBook PRO with an M1 Max processor, an iPhone 12 PRO with an Apple A14 Bionic processor, and an iPad PRO 12.9″ 3rd generation with an A12X processor. During the test, only the test app was actively running on the devices. The devices were connected to a power supply, and the battery charge levels were at least 50% in each case.
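The timing logic can be illustrated with a functionally analogous sketch using the TensorFlow Lite Python interpreter (the study’s app itself was written in Swift and timed interpreter.invoke() with CACurrentMediaTime(); the model file name and the number of timed cycles here are assumptions):

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")  # assumed file name
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

timings_us = []
for _ in range(1000):
    x = np.random.rand(1, 52).astype(np.float32)   # random Float32 values between 0 and 1
    interpreter.set_tensor(inp["index"], x)
    t0 = time.perf_counter()                       # system time before invoking the interpreter
    interpreter.invoke()                           # single inference cycle
    t1 = time.perf_counter()                       # system time after invoking the interpreter
    _ = interpreter.get_tensor(out["index"])       # predicted class probabilities
    timings_us.append((t1 - t0) * 1e6)

print(f"mean inference time: {np.mean(timings_us):.2f} µs")
```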
Statistics
The descriptive statistics were carried out by specifying absolute and relative frequencies for categorical data. The metrics used to describe the performance of the artificial neural network in training and testing included recall, precision, F1-score, and specificity for each category. The equal distribution of snapshots per category was checked using a χ2 test.
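In terms of the per-category counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), these metrics are defined as

$$
\text{Recall} = \frac{TP}{TP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Specificity} = \frac{TN}{TN + FP}, \quad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
$$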
The statistics program IBM SPSS Statistics for Windows, Version 25 (Armonk, NY: IBM Corp.), as well as Microsoft Excel, was used for all statistical analyses.
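The χ2 check of the equal distribution can be reproduced, for example, with scipy (the study itself used SPSS and Excel; the counts below are the training-set values reported in Table 1):

```python
from scipy.stats import chisquare

# Category counts: disappointed, stressed, happy, sad, surprised (Table 1).
counts = [95402, 95585, 95734, 96270, 95750]
stat, p_value = chisquare(counts)               # expected frequencies are equal by default
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")  # p ≈ 0.358: no significant deviation
```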
Results
Results from the data acquisition
In total, over half a million snapshots were recorded in the study (n total data set = 563226).
These snapshots were separated into two sub-sets: a training-set and a test-set. The training-set was used to train the network and consisted of 85% of the total number of snapshots (n training-set = 478742), whereas the test-set consisted of the remaining 15% (n test-set = 84484).
The training-set consisted of snapshots of five different categories, which are shown in [Table 1].
Table 1
Composition of the training-set from the various categories, as well as χ2 test to check the equal distribution.
| – | Disappointed | Stressed | Happy | Sad | Surprised | Total number |
| Number of snapshots in training | 95402 | 95585 | 95734 | 96270 | 95750 | 478741 |
| Frequency | 19.93% | 19.97% | 20.00% | 20.11% | 20.00% | – |
| χ2 test (p value) | 0.357537338 | | | | | |
It can be seen here that the individual categories are each composed of an average of 95748.2 individual snapshots. The χ2 test carried out showed no significant deviation from an equal distribution. The quantitative differences between the individual categories per set have their origin in the random programmatic selection when creating the training- and test-sets. The homogeneous composition of a training-set is of fundamental importance for the type of ANN presented here.
As already described above, the test-set consisted of 15% of the total data set.
In [Table 2] the distribution of the number of snapshots per category is shown; they are equally distributed as well.
Table 2
Composition of the test set from the various categories, as well as χ2 test to check the equal distribution.
| – | Disappointed | Stressed | Happy | Sad | Surprised | Total number |
| Number of snapshots in test | 17086 | 16876 | 16781 | 16972 | 16768 | 84483 |
| Frequency | 20.22% | 19.98% | 19.86% | 20.09% | 19.85% | – |
| χ2 test (p value) | 0.372682717 | | | | | |
The test-set was used to assess the network’s accuracy and further performance metrics.
Allocation of the snapshots to the respective set took place using a random generator after shuffling the entire data set [77].
Results from the training of the ANN
The training of the network was accomplished within a day using a high-performance system (Google Tensor Processing Unit v2). The overall accuracy of the ANN’s predictions was 43.58% at the start of the training and rose to 80.54% over the course of the training. The exact course of the accuracy and the loss function during training can be seen in [Fig. 3] and [Fig. 4]. The training was stopped after 400 epochs. The batch size used was 128.
Fig. 3
Representation of the accuracy achieved during training depending on the number of
passes (epochs).
Fig. 4
Representation of the values of the loss function achieved during training depending
on the number of passes (epochs).
Results after the training
Results with the training-set
When the training-set was used once again in a subsequent evaluation run, an overall
accuracy of 82.095% could be determined for all snapshots of this set. The ANN was
not trained in this
evaluation run – the training-set was only used as the input data set. Precision,
recall, the F1 score, and the specificity of the individual categories, as well as
the overall accuracy
across all categories, are shown in [Table 3].
Table 3
Listing of the specifics of the individual categories during a run with all data of
the training set.
| By category | Disappointed | Stressed | Happy | Sad | Surprised |
| Recall/Sensitivity | 80.917% | 76.911% | 88.245% | 83.006% | 81.381% |
| Precision/Positive predictive value | 81.774% | 79.520% | 83.582% | 78.487% | 87.549% |
| F1-Score | 0.81342859 | 0.78193721 | 0.85850169 | 0.8068335 | 0.84352166 |
| Specificity | 95.512% | 95.059% | 95.667% | 94.273% | 97.106% |
| Negative cases | 383339 | 383156 | 383007 | 382471 | 382991 |
| Total | | | | | |
| Recall/Sensitivity | 82.095% | | | | |
In this evaluation run, the overall accuracy exceeded that determined during training, although the data used were identical.
Results with the test-set
Using the test-set, an overall accuracy of 81.152% was determined for all snapshots
of this set. The performance data and sensitivities of the individual categories are
shown in [Table 4].
Table 4
Listing of the specifics of the individual categories during a run with all data of
the test set.
| By category | Disappointed | Stressed | Happy | Sad | Surprised |
| Recall/Sensitivity | 80.118% | 74.793% | 87.605% | 82.842% | 80.439% |
| Precision/Positive predictive value | 80.986% | 78.252% | 82.697% | 77.483% | 86.868% |
| F1-Score | 0.805495896 | 0.764830637 | 0.850801551 | 0.800728971 | 0.835299582 |
| Specificity | 95.231% | 94.811% | 95.457% | 93.948% | 96.989% |
| Negative cases | 67397 | 67607 | 67702 | 67511 | 67715 |
| Total | | | | | |
| Recall/Sensitivity | 81.152% | | | | |
Inference results
Inference times resulting from the application of the network for a single inference cycle are given below. They ranged between 9.71 microseconds (µs) on the fastest tested device and 1338.67 µs on the slowest device. For each device and input data set, the longest inference duration differed from the average duration by a factor of 10 or more. During the test, inference of the artificial neural network utilized one processor core out of six on the iPhone 12 PRO and one out of ten on the MacBook PRO. The resulting achievable mean facial expression recognition rates ranged between 42283.29 per second on the iPhone 12 PRO and 90090.09 per second on the MacBook PRO on a single core.
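These recognition rates follow directly from the mean inference times listed in [Table 5]:

$$
\text{rate} = \frac{1}{t_{\text{mean}}}, \qquad
\frac{1}{23.65\,\mu\text{s}} \approx 42\,283\ \text{s}^{-1} \ (\text{iPhone 12 PRO}), \qquad
\frac{1}{11.10\,\mu\text{s}} \approx 90\,090\ \text{s}^{-1} \ (\text{MacBook PRO}).
$$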
Table 5
Inference times needed for a single prediction cycle using the final trained artificial neural network on different devices. Input data were supplied in the form of the test- and training-sets as well as random numbers. The listed values denote inference times in microseconds (µs).
| Device and input data | Number N of supplied snapshots | Range of values in µs | Minimum inference time in µs | Maximum inference time in µs | Mean inference time in µs | Std. deviation in µs |
| MacBook PRO M1 Max random | 500000 | 128.46 | 9.71 | 138.17 | 11.16 | 0.75 |
| MacBook PRO M1 Max test-set | 84483 | 106.42 | 10.08 | 116.50 | 11.11 | 0.71 |
| MacBook PRO M1 Max training-set | 478741 | 114.12 | 10.00 | 124.12 | 11.10 | 0.76 |
| iPhone 12 PRO test-set | 84483 | 618.62 | 11.50 | 630.12 | 20.20 | 12.25 |
| iPhone 12 PRO training-set | 478741 | 408.12 | 11.54 | 419.67 | 22.93 | 14.43 |
| iPhone 12 PRO random | 500000 | 744.54 | 13.67 | 758.21 | 23.65 | 15.36 |
| iPad PRO 12.9″ 3rd gen test-set | 84483 | 1323.63 | 15.04 | 1338.67 | 18.41 | 5.36 |
| iPad PRO 12.9″ 3rd gen training-set | 478741 | 1225.54 | 14.63 | 1240.17 | 18.42 | 3.34 |
| iPad PRO 12.9″ 3rd gen random | 500000 | 762.33 | 14.83 | 777.17 | 18.92 | 3.19 |
Discussion
Training and evaluation of the ANN
Looking at the previously presented results, both problematic and positive aspects can be found. The various aspects are discussed below.
Discussion of negative aspects
-
Inhomogeneity:
-
Since the study participants were asked to show the facial expression displayed on the screen, but it was impossible to check whether they followed this instruction over the entire period (i.e. until the minimum threshold of 500 snapshots per facial expression was reached), the data may show a certain inhomogeneity.
-
[Fig. 5] shows an example of the recording of the facial expression “happy” of one participant over the entire period. It can be seen that the factors “mouthSmile_R” and “mouthSmile_L” reach the highest values. Such a result is to be expected, since these two factors represent the relative strength of the elevation of the left and right corners of the mouth, which is typical for smiling. However, it can also be seen that the relative strength is subject to certain fluctuations and even temporarily drops to values near zero.
-
Especially at the start of the measurement, it should be noted that the rise of these values requires a certain amount of time, namely the time necessary for the participant to react to the new instruction on the tablet display. Since the snapshots were recorded at an approximate rate of 30 per second, the time needed to react to the changed instruction was about 1–2 seconds for this participant.
-
The values of these curves also decrease towards the end, which can be explained, for example, by a lack of or declining motivation of the participants. It must also be noted that facial expressions with emotional connotations can, of course, be expressed in a variety of ways, differing both intra- and interculturally as well as individually [2].
-
These characteristics make it difficult for an ANN to develop generally applicable algorithms, especially for previously unknown participants. In addition, they reduce the predictive quality of the ANN for the data of known participants, since the facial expressions may not have been recorded completely homogeneously because of the one- to two-second overlap time between the individual categories.
-
Disparity between emotions and facial expression:
-
In addition, it must be noted that the emotions that a participant felt during the
study do not necessarily result in a change of the facial expression or a facial expression
per
se [4].
-
Conversely, it must be noted that the participants were asked to adopt certain facial expressions without necessarily being in the emotional state that would cause the same expression outside laboratory conditions. Therefore, it must be questioned whether the facial expressions collected under a certain category really belong to this category and would look the same if a participant truly felt happy, sad, and so on, or whether the shown facial expression only represented the participant’s idea of how a facial expression of the respective category should look.
-
This leads to a reduced informative value of the ANN’s results.
-
To help solve this issue in future studies, media could be shown that have an increased probability of triggering corresponding facial expressions due to their emotional connotation. For validation purposes, each participant could subsequently be asked which emotion (if any) he or she felt during the playback of, for example, a short movie clip. Afterward, the ANN should be retrained using the resulting data.
-
Furthermore, this ANN was primarily intended to demonstrate a fundamental principle using the latest and at the same time widely available technologies (AI and smartphones/tablets with a TrueDepth camera).
-
Course over time:
-
Baseline:
-
No “baseline” facial expression (neutral facial expression) was recorded during the
study because during planning it was assumed that such neutral expressions would be
reflected
in homogeneously low values for the individual factors.
-
After careful consideration, and given the above observation that facial expressions can show significant intra- and intercultural as well as individual variability, the inclusion of a baseline category should be contemplated for future studies to offer even better possibilities for assessing differences between participants.
-
Snapshots
-
The factors used, which are automatically generated by ARKit and in their entirety form a snapshot, remain undocumented in terms of their generation, reproducibility, and significance.
-
The documentation of the ARKit framework used merely notes that these reflect the
particular specifics of the facial expression currently shown.
-
How this dimensionality reduction from the raw data, which is based on the values registered by the TrueDepth camera, to just about 50 floating-point values takes place remains undisclosed.
-
How precisely the specifics of the facial expression are recognized from the raw data, and how exactly – in reverse – a facial expression can be defined at all based on these roughly 50 floating-point values, is not explained either.
-
Whether the same or similar facial expressions produce the same or similar snapshots
also remains undefined.
-
However, two points must be considered: First, according to the documentation, these factors were designed to animate virtual characters based on faces recognized by the camera. This leads to the conclusion that reproducibility must be given, as otherwise an animated face would show, e.g., clear twitching while a constant facial expression is being held. Secondly, the high quality of the recognition and the sufficiently large information content of the approximately 50 factors can be seen in the fact that an overall network accuracy of approximately 80% could be achieved (in the test). If there had been poor recognition of the factors, low reproducibility, or insufficient information content, no accuracy to this extent would have been achievable, as the supplied data would have been at least partly random.
-
Dimensionality reduction:
-
The goal of the AI used here is essentially a dimensionality reduction of the acquired data: from many thousands of measured values from the TrueDepth camera – or, subsequently, snapshots with an almost infinite number of possible value combinations – statements with only one dimension and a few finite possibilities are to be distilled.
-
By reducing the information from many thousands of light points projected onto the face by the TrueDepth system to just about 50 factors, the network is relieved of a step it would inevitably have had to perform itself (dimensionality reduction). However, using the raw data for training might have yielded even more accurate results, since the 50 factors ultimately serve at least partly arbitrary goals:
-
It is not proven that the factors used describe each facial expression precisely enough for its category to be extracted from them; put differently, 100 factors might have been possible as well – or even necessary.
-
The network could have been trained differently, yielding different target factors.
-
It is therefore questionable whether these roughly 50 factors truly represent the most efficient factors for our purpose.
-
In conclusion, approaches should be evaluated that take the entire facial geometry as input.
-
Selection of training data:
-
For the training, all snapshots of all participants were considered together and divided
into two sets (training-set and test-set).
-
Since the factors were recorded at a sufficiently high frequency, it is to be expected that the measurement conditions (lighting conditions, position of the participant’s head, and facial expression) sometimes changed only slightly between two snapshots. Therefore, some of the snapshots used in training may have similarities to those used in the test.
-
Thus, it is to be feared that the results reflect falsely high values that could not have been achieved using completely independent data, as would have been the case, for example, if the data had been assigned to the respective sets based on participants.
-
If, however, the allocation of data into training- and test-sets had not been based on individual snapshots but on participants (a participant and all of his or her snapshots are allocated to either one set or the other), the network would have had the same amount of data available for training, but a significantly greater heterogeneity between the data of the test- and training-sets would have to be expected. This is especially critical given the small number of participants (226), since in essence only 192 (226 × 85%) different examples of human behavior would have been available to the network for training.
-
Since the entire spectrum of human facial expressions can hardly be captured from only 192 persons during a half-hour examination, this procedure would certainly not have resulted in a robust network capable of making general statements.
-
On the contrary, it would have to be feared that some of the participants assigned
to the test-set could not have been categorized sufficiently correctly based on the
snapshots of
the 192 persons available for training.
-
It must also be considered that there are intercultural differences in both the perception of a facial expression and the composition of a facial expression as a reaction to a certain situation. Similarly, interindividual differences must be taken into account when questioning whether all participants could have been analyzed accurately with a participant-based assignment of snapshots.
-
Considering that the goal of the network in this pilot study was to categorize the facial expressions of the participants present in this study with sufficient accuracy, a snapshot-based assignment seems preferable, since it circumvents the aforementioned limitations.
-
Conclusion:
Fig. 5
Example representation of the course of the raw data of a participant for the category
“Happy” during the recording of the data for AI training. The names of the
individual factors listed in the legend are in accordance with [74].
Advantages
The analysis shows that the training of the ANN, despite the use of data with the difficulties described above, delivered usable results. After all, an overall accuracy of 80.54% was observed at the end of the training, and an even higher accuracy when using the test-set, which was previously completely unknown to the ANN.
Of course, the question arises as to how it can be explained that the accuracy when using the test-set exceeded that achieved during training. To do this, the three-stage process must first be reiterated: initially, the network was trained, at the end of which the overall prediction accuracy was 80.54%. This was followed by an evaluation run in which a higher accuracy of 82.095% was achieved using the same data that had been used for the training (the training-set). Finally, there was an evaluation run with data that was completely new to the network: the data from the test-set. Here, a slightly lower accuracy of 81.152% was achieved.
Because the same data were used in steps one and two, one would expect the same results in terms of accuracy. There are, however, several factors that explain why different results arise from the same data: First of all, several alpha dropout layers were used, which lead to the deactivation of randomly selected neurons during training. This is done by setting the output of the respective neuron to zero in regular dropout layers and by assigning a negative value in alpha dropout layers, a subtype of the dropout layer [79] [87]. This of course limits the capacity and complexity of the ANN. After the training – as soon as the ANN is used to make predictions (as in the subsequent evaluation run) – this behavior (zeroing or negative value assignment) is switched off, whereby the ANN, in abstract terms, receives previously unavailable computational capacity.
In summary, the alpha dropout layers are active during training and thus reduce the complexity and hence the computational capability of the ANN. They are deactivated in the evaluation process, which means that the ANN has more computational units available and its complexity increases. Due to this increased complexity, the ANN was able to make more precise predictions in our case.
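This difference between training and evaluation mode can be illustrated with a small sketch (not the study’s code): the same AlphaDropout layer modifies activations only when called in training mode.

```python
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 52).astype("float32"))   # one snapshot, 52 factors
dropout = tf.keras.layers.AlphaDropout(rate=0.1)

y_train = dropout(x, training=True)    # dropped units set to the negative SELU saturation
                                       # value, remaining output rescaled (training behavior)
y_infer = dropout(x, training=False)   # identity: activations pass through unchanged

print(bool(tf.reduce_all(tf.equal(x, y_infer))))   # True
print(bool(tf.reduce_all(tf.equal(x, y_train))))   # False (almost surely)
```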
If one now considers that the overall accuracy on the training-set in the evaluation run is superior to that achieved during training with the same data, the influence of the dropout layers can be seen. The slightly worse result when using the data from the test-set in the third run, compared to the result in the second run using the data from the training-set, amounts to around one percentage point in absolute terms. Poorer performance on new data unknown to the network (in our case the data from the test-set) indicates overfitting. With only around one percentage point of absolute difference, however, only minimal overfitting of the ANN has taken place. Therefore, our strategy of using dropout was beneficial.
Another factor that contributed to the observed behavior is the L2 regularization we used. This limits the choice of weights, since extreme weight values are penalized [79] [88]. Since this is intended to minimize overfitting, it also minimizes the decrease in test accuracy compared to training accuracy.
To ultimately rule out that the observed behavior resulted from an “advantageous” distribution of the snapshots in the test-set, k-fold cross-validation should be carried out in future investigations [89]. In the current study, however, this was not implemented because the assignment of snapshots to either set was randomized. Additionally, if the standard value of k = 10 had been used for the k-fold cross-validation, the total computing time on the utilized system would have surpassed our available capacities.
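A sketch of the proposed k-fold cross-validation with k = 10 is shown below (assumed implementation; X, y, and build_model() are placeholders, the latter referring to the Keras sketch given in the Methods section above, and the epoch count is reduced for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(1000, 52).astype(np.float32)   # placeholder snapshots
y = np.random.randint(0, 5, size=1000)            # placeholder labels

accuracies = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=42).split(X):
    model = build_model()                          # re-initialize the network for each fold
    model.fit(X[train_idx], y[train_idx], epochs=5, batch_size=128, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    accuracies.append(acc)

print(f"mean cross-validated accuracy: {np.mean(accuracies):.3f}")
```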
Selectivity of the categories
Finally, when looking at the individual categories set out in [Table 4], it can be seen that the values achieved for the sensitivity differed considerably in some cases. The difference between “stressed” and “happy” is almost 13 percentage points when using the test-set. In the training-set, too, these two categories had the highest and lowest values for sensitivity and were therefore the furthest apart. Ultimately, the different values can be explained by the fact that the category “stressed” has significantly fewer distinguishing features in relation to the other categories, whereas the category “happy” has more. This means that the ANN has a lower probability of correctly classifying a “stressed” snapshot as such. It also explains the lower positive predictive value compared to the other categories: if a snapshot is assigned to “stressed”, the probability that it truly belongs to this category is lower, since other categories sometimes have similar characteristics (there is a low degree of selectivity).
Real-world application – inference
As shown in [Table 5], the inference times ranged between 9.71 µs on the fastest tested device and 1338.67 µs on the slowest device. In some cases, large differences in duration occurred between inference cycles, with individual cycles differing from the average duration by a factor of 10 or more. These can be explained by a power-saving state of the processor. When inspecting the data, it was noticeable that the first processed snapshot in particular always took much longer than the average. Before the processor-intensive process of inference begins, it can be assumed that the processor load is significantly lower; it is therefore likely that the processor had been throttled at this point to save energy (e.g. by a reduced clock frequency). Furthermore, some components such as variables are initialized in the first pass, which costs additional time and is omitted in subsequent passes.
Considered on its own, average processing rates of several tens of thousands of snapshots per second were possible in the conducted tests – a multiple of what is needed. However, it must be considered that in real operation other essential processes run concurrently: for example, the three-dimensional image must be calculated and generated, and likewise the 52 factors must be determined. Accordingly, fewer resources are available for inference in real operation than in the test carried out to determine the inference times. Nevertheless, the low measured inference times show that only about 0.75 ms of processor time is required to process 30 snapshots, based on a mean inference cycle duration of 25 µs. This corresponds to a theoretical mean temporal processor occupancy of 0.075%, which provides a reasonably large buffer to allow problem-free execution in real time under real conditions. Consequently, this new technology allows even older devices to perform real-time evaluation of 3D facial expression recognition with ease. Additionally, it can be assumed that the inference rates under real conditions are even higher, since the test app was executed and controlled via the Xcode development environment and debug information was collected in the process.
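As a worked check of this estimate:

$$
30\,\text{s}^{-1} \times 25\,\mu\text{s} = 750\,\mu\text{s} = 0.75\,\text{ms per second of wall-clock time},
\qquad \frac{0.75\,\text{ms}}{1000\,\text{ms}} = 0.075\,\%.
$$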
Conclusion
With the present work, it could be demonstrated that respectable results can be achieved even when using data sets with certain challenges. The ANN used turned out to be very robust in the evaluation, especially with regard to the challenges described above.
Furthermore, the use of the latest and at the same time widely available technologies is very interesting for future projects. All that is required for the setup presented is an iPhone or iPad with a built-in TrueDepth camera. These system requirements are already met by devices that are now more than 4 years old (e.g. the iPhone X, presented in September 2017) [90].
Outlook
Use in other areas would therefore be quite cost-effective, simple, and flexible. A binary, ordinal, or even cardinally scaled classification for automated pain recognition in (hospitalized) patients based on their facial expressions, automatic adaptation of learning content on tablets depending on the facial expressions shown by students, or observation of the effects of interventions on patients during treatment periods would all be possible. Depending on the reaction to the content shown on the tablet or smartphone, an automatic adjustment could take place in real time.
Implications for further examinations
As described, significantly more and more complex data should primarily be recorded in a subsequent study: not only should the 52 different factors relating to the facial expression be determined and recorded, but also the raw three-dimensional data of a facial scan and the corresponding photographic data. Through these measures, it is to be expected that the training results can be significantly improved and made more precise. Furthermore, cluster analysis and the introduction of more categories of facial expressions should be considered.
Currently, we are conducting the follow-up study with an optimized version of our app that encompasses the suggested alterations and adaptations. As we aim to build a large database of facial scans not only for facial expression recognition but also to monitor treatment progress in disease and to perform disease recognition, we encourage fellow investigators to use our technology and contribute to our common database.
Funding
This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.