Jenkin, Michael R.
Chandola, Deeksha
2024-03-18
2024-03-18
2024-03-16
https://hdl.handle.net/10315/41852

Speech emotion recognition (SER) is the task of automatically recognizing emotions expressed in spoken language. Current approaches focus on analyzing isolated speech segments to identify a speaker’s emotional state. In contrast, text-based emotion recognition models increasingly take conversational context into account and are moving towards emotion recognition in conversation (ERC). With the availability of multimodal datasets, ERC can be extended to non-text modalities as well. Building on these advances, in this thesis we propose SERC-GCN, a method for speech emotion recognition in conversation (SERC) that predicts a speaker’s emotional state by incorporating conversational context, specifically speaker interactions and temporal dependencies between utterances. SERC-GCN is a two-stage method. In the first stage, emotional features are extracted from utterance-level speech signals using a graph-based neural network: each speech utterance is transformed into a cyclic graph, and these graphs are processed by a two-layer graph convolutional network (GCN) followed by a pooling layer to extract utterance-specific emotional features. In the second stage, these features are used to form conversation graphs, which are used to train a graph convolutional network to perform SERC. We empirically evaluate the effectiveness of SERC-GCN on two benchmark datasets, IEMOCAP and MELD. Results show that SERC-GCN outperforms existing baseline approaches on these datasets.

Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
Computer science
Computer engineering
Psychology
Speech Emotion Recognition in Conversations Using Graph Convolutional Networks
Electronic Thesis or Dissertation
2024-03-16
Speech emotion recognition in conversation
Human-computer interaction
Graph convolutional network
Emotion recognition in conversation (ERC)
Multimodal analysis
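
As an illustration of the first stage described in the abstract, the following is a minimal sketch of turning one utterance into a cyclic graph and passing it through a two-layer GCN followed by pooling. It is not the thesis implementation. The function names, the use of per-frame feature vectors as nodes, the symmetric normalization, ReLU activations, mean pooling, the layer sizes, and the random (untrained) weights are all assumptions made for the example; the record does not specify them.

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of a cycle graph: node i is linked to node i+1, and the last node to the first."""
    a = np.zeros((n, n))
    idx = np.arange(n)
    a[idx, (idx + 1) % n] = 1.0
    a[(idx + 1) % n, idx] = 1.0
    return a

def normalize(a):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as commonly used in GCN layers (assumed here)."""
    a_hat = a + np.eye(a.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def utterance_embedding(x, w1, w2):
    """Two GCN layers with ReLU, then mean pooling over nodes (the pooling choice is an assumption)."""
    a_norm = normalize(cycle_adjacency(x.shape[0]))
    h = np.maximum(a_norm @ x @ w1, 0.0)    # first graph convolution
    h = np.maximum(a_norm @ h @ w2, 0.0)    # second graph convolution
    return h.mean(axis=0)                   # one fixed-size vector per utterance

# Hypothetical input: 50 frames with 40 features each; weights are random, not trained.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 40))
w1 = rng.normal(size=(40, 64)) * 0.1
w2 = rng.normal(size=(64, 32)) * 0.1
embedding = utterance_embedding(x, w1, w2)
print(embedding.shape)
```

In a setup like this, the pooled vector would serve as the node feature for an utterance in the second-stage conversation graph; how the thesis builds and trains that graph is not detailed in this record.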