DOI: 10.3389/fcomp.2021.624683
One-Liner
late fusion of multimodal signal on the CTP task using transformers, mobilnet, yamnet, and mockingjay
Novelty
- Similar to Martinc 2021 and Shah 2021 but actually used the the current Neural-Network state of the art
- Used late fusion again after the base model training
- Proposed that inconsistency in the diagnoses of MMSE scores could be a great contributing factor to multi-task learning performance hindrance
Notable Methods
- Proposed base model for transfer learning from text based on MobileNet (image), YAMNet (audio), Mockingjay (speech) and BERT (text)
- Data all sourced from recording/transcribing/recognizing CTP task
Key Figs
Figure 3 and 4
This figure tells us the late fusion architecture used
Table 2
Pre-training with an existing dataset had (not statistically quantified) improvement against a randomly seeded model.
Table 3
Concat/Add fusion methods between audio and text provided even better results; confirms Martinc 2021 on newer data