Zhu 2021

DOI: 10.3389/fcomp.2021.624683

One-Liner

Late fusion of multimodal signals on the CTP task using transformers, MobileNet, YAMNet, and Mockingjay

Novelty

Similar to Martinc 2021 and Shah 2021, but actually used the current neural-network state of the art
Used late fusion again after the base model training
Proposed that inconsistency in the diagnoses of MMSE scores could be a major factor hindering multi-task learning performance

Notable Methods

Proposed base models for transfer learning: MobileNet (image), YAMNet (audio), Mockingjay (speech), and BERT (text)
All data sourced from recording, transcribing, and recognizing the CTP task

Key Figs

Figures 3 and 4

These figures show the late fusion architecture used

Table 2

Pre-training on an existing dataset improved on a randomly seeded model, though the improvement was not statistically quantified.

Table 3

Concat/Add fusion of audio and text gave even better results, confirming Martinc 2021 on newer data
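
As a reminder of what the two fusion methods do, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the embedding sizes, and the values are all hypothetical, and it assumes each modality has already been reduced to a fixed-length embedding by its base model.

```python
from typing import List, Sequence


def concat_fusion(audio: Sequence[float], text: Sequence[float]) -> List[float]:
    """Concat fusion: join the two embeddings end to end.

    The fused vector has length len(audio) + len(text).
    """
    return list(audio) + list(text)


def add_fusion(audio: Sequence[float], text: Sequence[float]) -> List[float]:
    """Add fusion: element-wise sum of the two embeddings.

    Both embeddings must have the same length.
    """
    if len(audio) != len(text):
        raise ValueError("add fusion needs embeddings of equal length")
    return [a + t for a, t in zip(audio, text)]


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Toy classification head applied to the fused vector (assumed, not from the paper)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical per-modality embeddings (made-up values for illustration only).
audio_emb = [0.5, -1.0]
text_emb = [1.5, 2.0]

fused_concat = concat_fusion(audio_emb, text_emb)  # 4-dimensional
fused_add = add_fusion(audio_emb, text_emb)        # 2-dimensional
```

Concat keeps each modality's features separate at the cost of a wider head, while add keeps the dimension fixed but requires the embeddings to share a size.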