Introduction
- Full title: Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer's Dementia recognition from spontaneous speech
- This is one of the few AD-detection papers that comes with publicly released code, so it is worth studying carefully. The main points of interest are:
- Feature engineering: how the audio and lexical features are extracted
- Feature fusion: the paper processes and fuses the features with gating-based bidirectional LSTMs
- Multimodal word-level tagging: the paper specifically detects repair pairs in the transcribed descriptions, which needs to be checked against the code
- The code cannot actually be run end to end: the authors released only fragments, so it serves purely as a reference for how the feature fusion and the related feature processing are implemented.
Main Text
Feature Engineering
Audio Features
- The paper uses the COVAREP audio analysis framework to automatically extract 79 acoustic features from the recordings, sampled at 100 Hz. High-level statistics are then computed over these frame-level features, including the mean, maximum, minimum, median, standard deviation, skewness and kurtosis. The feature set covers prosodic features (fundamental frequency and voicing), voice quality features (normalized amplitude quotient, quasi-open quotient, the amplitude difference between the first two harmonics of the differentiated glottal source spectrum, maxima dispersion quotient, parabolic spectral parameter, spectral tilt/slope of wavelet responses, and the shape parameter of the Liljencrants-Fant model of glottal pulse dynamics) and spectral features (Mel cepstral coefficients and harmonic model phase distortion parameters). Unvoiced segments are set to zero.
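To make the statistics step concrete, here is a minimal sketch, assuming a (num_frames, 79) matrix of frame-level COVAREP features (the helper name and shapes are my own, not from the released code):
import numpy as np
from scipy.stats import skew, kurtosis

def summarize_covarep(frames):
    """Collapse frame-level COVAREP features (num_frames, 79) into
    per-feature summary statistics, as described in the paper."""
    stats = [np.mean(frames, axis=0),
             np.max(frames, axis=0),
             np.min(frames, axis=0),
             np.median(frames, axis=0),
             np.std(frames, axis=0),
             skew(frames, axis=0),
             kurtosis(frames, axis=0)]
    return np.concatenate(stats)   # shape: (7 * 79,)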
- These features are then normalized to zero mean and unit variance, and all features without a statistically significant univariate correlation with the outcome on the training set are discarded (that is, a univariate feature-outcome significance test is used as a filter).
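The normalization and filtering could look roughly like this (a sketch only; I assume a point-biserial correlation against the binary label at p < 0.05, since the note does not name the exact test):
import numpy as np
from scipy.stats import pointbiserialr

def normalize_and_filter(X_train, y_train, X_dev, alpha=0.05):
    """Z-normalize features, then keep only those with a statistically
    significant univariate correlation with the training labels."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    X_train = (X_train - mu) / sigma
    X_dev = (X_dev - mu) / sigma          # reuse training statistics
    keep = []
    for j in range(X_train.shape[1]):
        r, p = pointbiserialr(y_train, X_train[:, j])
        keep.append(p < alpha)
    keep = np.array(keep)
    return X_train[:, keep], X_dev[:, keep]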
- An LSTM sequence network is then applied to the audio modality on its own. After exploring different hyperparameters, the audio model was configured with a time step of 20, a stride of 1, and 4 bidirectional LSTM layers with 256 hidden nodes per layer.
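The time step / stride setting means each recording's feature sequence is sliced into overlapping windows before entering the Bi-LSTM; a minimal sketch (hypothetical helper, not part of the repo):
import numpy as np

def make_windows(seq, timesteps=20, stride=1):
    """Slice a (T, dim) feature sequence into overlapping windows of
    shape (timesteps, dim), advancing by `stride` frames each time."""
    return np.stack([seq[i:i + timesteps]
                     for i in range(0, len(seq) - timesteps + 1, stride)])

# e.g. a 100-frame, 79-dim recording -> LSTM input of shape (81, 20, 79)
X = make_windows(np.random.randn(100, 79))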
- Below is the part of the released code that deals specifically with the audio features; all it contains is reloading a trained model and extracting its activations:
import numpy as np
import keras.backend as K
from keras import optimizers
from keras.models import model_from_json

hyperparams = {'exp': 20, 'timesteps': 30, 'stride': 1, 'lr': 1e-06,
               'nlayers': 3, 'hsize': 128, 'batchsize': 128, 'epochs': 300,
               'momentum': 0.8, 'decay': 0.99, 'dropout': 0.2,
               'dropout_rec': 0.2, 'loss': 'binary_crossentropy',
               'dim': 100, 'min_count': 3, 'window': 3, 'wepochs': 25,
               'layertype': 'bi-lstm', 'merge_mode': 'mul',
               'dirpath': 'data/LSTM_10-audio/',
               'exppath': 'data/LSTM_10-audio/20/',
               'text': 'data/Step10/alltext.txt', 'balClass': False}
exppath = hyperparams['exppath']

# load the serialized model architecture
with open(exppath + "/model.json", "r") as json_file:
    model_json = json_file.read()
try:
    model = model_from_json(model_json)
except Exception:
    # models trained with the custom clipped ReLU need it passed in explicitly
    model = model_from_json(model_json, custom_objects={'myrelu': myrelu})

lr = hyperparams['lr']
loss = hyperparams['loss']
momentum = hyperparams['momentum']
nlayers = hyperparams['nlayers']

# load the best checkpoint and recompile
filepath_best = exppath + "/weights-best.hdf5"
model.load_weights(filepath=filepath_best)
print('--- load weights')
sgd = optimizers.SGD(lr=lr, momentum=momentum, decay=0, nesterov=True)
model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
print('--- compile model')

# load data (loadAudio is a helper from the authors' repo, not shown)
X_train, Y_train, X_dev, Y_dev, R_train, R_dev = loadAudio()
print('--- load data')

# extract the activations of the final LSTM layer as audio features
layer = model.layers[nlayers - 1]
inputs = [K.learning_phase()] + model.inputs
_layer2 = K.function(inputs, [layer.output])
acts_train = np.squeeze(_layer2([0] + [X_train]))
acts_dev = np.squeeze(_layer2([0] + [X_dev]))
print('--- got activations')
- Frankly this snippet is of little use: apart from revealing the exact training hyperparameters of the audio Bi-LSTM, it says nothing about how the features are extracted or how they are saved. Note also that the shipped hyperparameters (timesteps=30, 3 layers, 128 hidden units) do not even match the settings quoted from the paper above.
- In short, there is nothing of substance to see here.
Lexical Features from Text
- The paper uses pre-trained GloVe embeddings to extract lexical features from the picture-description transcripts, turning each utterance sequence into a sequence of word vectors. The hyperparameter values were chosen to optimize model output on the training set; the best embedding dimension was found to be 100.
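A minimal sketch of the GloVe lookup, assuming the standard glove.6B.100d.txt text format (the file path and tokenization are placeholders, not from the repo):
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    """Parse the plain-text GloVe file into a word -> vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove()
transcript = "the boy is taking the cookie".split()
# unknown words fall back to a zero vector
seq = np.stack([glove.get(w, np.zeros(100, dtype=np.float32))
                for w in transcript])   # shape: (num_words, 100)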
- For the text modality, the LSTM was configured with a time step of 10, a stride of 2, and 2 LSTM layers with 16 hidden nodes per layer.
- The relevant code:
# PROCESSING DOCS
# ===============================
hyperparams = {'exp': 330, 'timesteps': 7, 'stride': 3, 'lr': 0.1,
               'nlayers': 2, 'hsize': 4, 'batchsize': 64, 'epochs': 300,
               'momentum': 0.85, 'decay': 1.0, 'dropout': 0.1,
               'dropout_rec': 0.8, 'loss': 'binary_crossentropy',
               'dim': 100, 'min_count': 3, 'window': 3, 'wepochs': 25,
               'layertype': 'bi-lstm', 'merge_mode': 'concat',
               'dirpath': 'data/LSTM_10/', 'exppath': 'data/LSTM_10/330/',
               'text': 'data/Step10/alltext.txt', 'balClass': False}
exppath = hyperparams['exppath']

# load the serialized model architecture (imports as in the audio snippet)
with open(exppath + "/model.json", "r") as json_file:
    model_json = json_file.read()
try:
    model = model_from_json(model_json)
except Exception:
    model = model_from_json(model_json, custom_objects={'myrelu': myrelu})

lr = hyperparams['lr']
loss = hyperparams['loss']
momentum = hyperparams['momentum']
nlayers = hyperparams['nlayers']

# load the best checkpoint and recompile
filepath_best = exppath + "/weights-best.hdf5"
model.load_weights(filepath=filepath_best)
print('--- load weights')
sgd = optimizers.SGD(lr=lr, momentum=momentum, decay=0, nesterov=True)
model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
print('--- compile model')

# load data (loadDoc is a helper from the authors' repo, not shown)
X_train_doc, Y_train, X_dev_doc, Y_dev, R_train_doc, R_dev_doc = loadDoc()
print('--- load data')

# extract the activations of the final LSTM layer as lexical features
layer = model.layers[nlayers - 1]
inputs = [K.learning_phase()] + model.inputs
_layer2 = K.function(inputs, [layer.output])
acts_train_doc = np.squeeze(_layer2([0] + [X_train_doc]))
acts_dev_doc = np.squeeze(_layer2([0] + [X_dev_doc]))
print('--- got activations')
- Same story as the audio features: the lexical pipeline simply loads a pre-trained model and saves its activations as the corresponding .npy features; the actual feature extraction procedure is never shown.
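Neither snippet shows how the two activation matrices are combined downstream either; presumably (an assumption on my part, not confirmed by the repo) they are simply concatenated per sample before fusion:
import numpy as np

# assumption: the late-fusion input is the feature-wise concatenation of
# the audio and lexical LSTM activations extracted above
X_train_fuse = np.concatenate([acts_train, acts_train_doc], axis=-1)
X_dev_fuse = np.concatenate([acts_dev, acts_dev_doc], axis=-1)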
The LSTM Model Used to Train on the Audio and Lexical Features
- Below is the LSTM model shared by the audio and lexical branches, defined in Keras; it is a useful reference for how to build such models.
import os
import numpy as np
import keras.backend as K
from keras import optimizers
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense
from keras.callbacks import ModelCheckpoint, CSVLogger, EarlyStopping, TensorBoard
from keras.utils import to_categorical
from sklearn.utils import class_weight

def LSTM_train(X_train, Y_train, X_dev, Y_dev, R_train, R_dev, hyperparams):
    '''Train the (bi-)LSTM shared by the audio and lexical modalities.'''
    np.random.seed(1337)
    exp = hyperparams['exp']
    batch_size = hyperparams['batchsize']
    epochs = hyperparams['epochs']
    lr = hyperparams['lr']
    hsize = hyperparams['hsize']
    nlayers = hyperparams['nlayers']
    loss = hyperparams['loss']
    dirpath = hyperparams['dirpath']
    momentum = hyperparams['momentum']
    decay = hyperparams['decay']
    dropout = hyperparams['dropout']
    dropout_rec = hyperparams['dropout_rec']
    merge_mode = hyperparams['merge_mode']
    layertype = hyperparams['layertype']
    balClass = hyperparams['balClass']
    act_output = hyperparams['act_output']
    dim = X_train.shape[2]
    timesteps = X_train.shape[1]

    # optionally re-weight the classes to compensate for label imbalance
    if balClass:
        cweight = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train)
    else:
        cweight = np.array([1, 1])

    # stack 1-4 recurrent layers; only the last layer collapses the sequence
    model = Sequential()
    if layertype == 'lstm':
        if nlayers == 1:
            model.add(LSTM(hsize, return_sequences=False, input_shape=(timesteps, dim),
                           recurrent_dropout=dropout_rec, dropout=dropout))
        if nlayers == 2:
            model.add(LSTM(hsize, return_sequences=True, input_shape=(timesteps, dim),
                           recurrent_dropout=dropout_rec, dropout=dropout))
            model.add(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec))
        if nlayers == 3:
            model.add(LSTM(hsize, return_sequences=True, input_shape=(timesteps, dim),
                           recurrent_dropout=dropout_rec, dropout=dropout))
            model.add(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec))
            model.add(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec))
        if nlayers == 4:
            model.add(LSTM(hsize, return_sequences=True, input_shape=(timesteps, dim),
                           recurrent_dropout=dropout_rec, dropout=dropout))
            model.add(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec))
            model.add(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec))
            model.add(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec))
    elif layertype == 'bi-lstm':
        if nlayers == 1:
            model.add(Bidirectional(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec,
                                         dropout=dropout), input_shape=(timesteps, dim), merge_mode=merge_mode))
        if nlayers == 2:
            model.add(Bidirectional(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec,
                                         dropout=dropout), input_shape=(timesteps, dim), merge_mode=merge_mode))
            model.add(Bidirectional(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec),
                                    merge_mode=merge_mode))
        if nlayers == 3:
            model.add(Bidirectional(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec,
                                         dropout=dropout), input_shape=(timesteps, dim), merge_mode=merge_mode))
            model.add(Bidirectional(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec),
                                    merge_mode=merge_mode))
            model.add(Bidirectional(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec),
                                    merge_mode=merge_mode))
        if nlayers == 4:
            model.add(Bidirectional(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec,
                                         dropout=dropout), input_shape=(timesteps, dim), merge_mode=merge_mode))
            model.add(Bidirectional(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec),
                                    merge_mode=merge_mode))
            model.add(Bidirectional(LSTM(hsize, return_sequences=True, recurrent_dropout=dropout_rec),
                                    merge_mode=merge_mode))
            model.add(Bidirectional(LSTM(hsize, return_sequences=False, recurrent_dropout=dropout_rec),
                                    merge_mode=merge_mode))

    # output head: binary sigmoid, 27-way softmax, or clipped-ReLU regression
    if act_output == 'sigmoid':
        dsize = 1
        model.add(Dense(dsize, activation=act_output))
    elif act_output == 'softmax':
        dsize = 27
        model.add(Dense(dsize, activation=act_output))
        Y_train = to_categorical(R_train, num_classes=27)
        Y_dev = to_categorical(R_dev, num_classes=27)
    elif act_output == 'relu':
        dsize = 1
        def myrelu(x):
            # ReLU clipped at 27, used when regressing the raw score
            return K.relu(x, alpha=0.0, max_value=27)
        model.add(Dense(dsize, activation=myrelu))
        Y_train = R_train
        Y_dev = R_dev

    print(model.summary())
    print('--- network has layers:', nlayers, ' hsize:', hsize, ' bsize:', batch_size,
          ' lr:', lr, ' epochs:', epochs, ' loss:', loss, ' act_o:', act_output)

    sgd = optimizers.SGD(lr=lr, momentum=momentum, decay=0, nesterov=True)
    model.compile(loss=loss, optimizer=sgd, metrics=['accuracy', 'mae', 'mse'])

    # serialize the architecture and set up checkpointing / logging
    dirpath = dirpath + str(exp)
    os.system('mkdir ' + dirpath)
    model_json = model.to_json()
    with open(dirpath + "/model.json", "w") as json_file:
        json_file.write(model_json)
    filepath_best = dirpath + "/weights-best.hdf5"
    filepath_epochs = dirpath + "/weights-{epoch:02d}-{loss:.2f}.hdf5"
    checkpoint_best = ModelCheckpoint(filepath_best, monitor='loss', verbose=0,
                                      save_best_only=True, mode='auto')
    checkpoint_epochs = ModelCheckpoint(filepath_epochs, monitor='loss', verbose=0,
                                        save_best_only=True, mode='auto')
    csv_logger = CSVLogger(dirpath + '/training.log')
    # lr_decay_callback and Metrics are helpers from the authors' repo (not shown)
    lr_decay = lr_decay_callback(lr, decay)
    early_stop = EarlyStopping(monitor='loss', min_delta=1e-04, patience=25, verbose=0, mode='auto')
    tensorboard = TensorBoard(log_dir=dirpath + '/logs', histogram_freq=0,
                              write_graph=True, write_images=False)
    perf = Metrics()
    callbacks_list = [checkpoint_best, checkpoint_epochs, early_stop, lr_decay, tensorboard, csv_logger]

    model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_dev, Y_dev),
              class_weight=cweight,
              callbacks=callbacks_list)

    # reload the best checkpoint and return its predictions
    model.load_weights(filepath=filepath_best)
    model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
    pred = model.predict(X_dev, batch_size=None, verbose=0, steps=None)
    pred_train = model.predict(X_train, batch_size=None, verbose=0, steps=None)
    return pred, pred_train
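For reference, calling it with the audio hyperparameters quoted earlier would look like this; note that act_output has to be added by hand, since the shipped dicts omit it even though LSTM_train reads it:
hyperparams['act_output'] = 'sigmoid'   # missing from the shipped dict
X_train, Y_train, X_dev, Y_dev, R_train, R_dev = loadAudio()
pred_dev, pred_train = LSTM_train(X_train, Y_train, X_dev, Y_dev,
                                  R_train, R_dev, hyperparams)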
Feature Fusion
- The paper emphasizes a model with a gated neural network, but what the released code actually uses is a plain stack of alternating fully connected and dropout layers, nothing more.
from keras.layers import Input, Dense, Dropout
from keras.models import Model

def train_all(X_train_fuse, Y_train, X_dev_fuse, Y_dev, R_train, R_dev, hyperparams):
    np.random.seed(1337)
    dim = X_train_fuse.shape[1]
    # hyperparameters
    loss = hyperparams['loss']
    lr = hyperparams['lr']
    momentum = hyperparams['momentum']
    batch_size = hyperparams['batchsize']
    dsize = hyperparams['dsize']
    epochs = hyperparams['epochs']
    decay = hyperparams['decay']
    act = hyperparams['act']
    nlayers = hyperparams['nlayers']
    dropout = hyperparams['dropout']
    exppath = hyperparams['exppath']
    act_output = hyperparams['act_output']

    # the "fusion" network: 1-4 Dense+Dropout blocks over the fused features
    input = Input(shape=(dim,))
    if nlayers == 1:
        final = Dense(dsize, activation=act)(input)
        final = Dropout(dropout)(final)
    if nlayers == 2:
        final = Dense(dsize, activation=act)(input)
        final = Dropout(dropout)(final)
        final = Dense(dsize, activation=act)(final)
        final = Dropout(dropout)(final)
    if nlayers == 3:
        final = Dense(dsize, activation=act)(input)
        final = Dropout(dropout)(final)
        final = Dense(dsize, activation=act)(final)
        final = Dropout(dropout)(final)
        final = Dense(dsize, activation=act)(final)
        final = Dropout(dropout)(final)
    if nlayers == 4:
        final = Dense(dsize, activation=act)(input)
        final = Dropout(dropout)(final)
        final = Dense(dsize, activation=act)(final)
        final = Dropout(dropout)(final)
        final = Dense(dsize, activation=act)(final)
        final = Dropout(dropout)(final)
        final = Dense(dsize, activation=act)(final)   # fourth block; the released
        final = Dropout(dropout)(final)               # code stacked only three here

    # single sigmoid output node for the binary AD / non-AD decision
    final = Dense(1, activation='sigmoid')(final)
    model = Model(inputs=input, outputs=final)

    print(model.summary())
    print('--- network has layers:', nlayers, 'dsize:', dsize, 'bsize:', batch_size,
          'lr:', lr, 'epochs:', epochs)

    # serialize the architecture and set up checkpointing / logging
    os.system('mkdir ' + exppath)
    model_json = model.to_json()
    with open(exppath + "/model.json", "w") as json_file:
        json_file.write(model_json)
    sgd = optimizers.SGD(lr=lr, momentum=momentum, decay=0, nesterov=True)
    model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
    filepath_best = exppath + "/weights-best.hdf5"
    filepath_epochs = exppath + "/weights-{epoch:02d}-{loss:.2f}.hdf5"
    checkpoint_best = ModelCheckpoint(filepath_best, monitor='loss', verbose=0,
                                      save_best_only=True, mode='auto')
    checkpoint_epochs = ModelCheckpoint(filepath_epochs, monitor='loss', verbose=0,
                                        save_best_only=True, mode='auto')
    csv_logger = CSVLogger(exppath + '/training.log')
    lr_decay = lr_decay_callback(lr, decay)   # helper from the authors' repo
    early_stop = EarlyStopping(monitor='loss', min_delta=1e-04, patience=25, verbose=0, mode='auto')
    tensorboard = TensorBoard(log_dir=exppath + '/logs', histogram_freq=0,
                              write_graph=True, write_images=False)
    perf = Metrics()                          # helper from the authors' repo
    callbacks_list = [checkpoint_best, checkpoint_epochs, early_stop, lr_decay,
                      perf, tensorboard, csv_logger]

    model.fit(X_train_fuse, Y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_dev_fuse, Y_dev),
              callbacks=callbacks_list)

    # reload the best checkpoint and return its predictions
    model.load_weights(filepath=filepath_best)
    model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
    pred_train = model.predict(X_train_fuse, batch_size=None, verbose=0, steps=None)
    pred = model.predict(X_dev_fuse, batch_size=None, verbose=0, steps=None)
    return pred, pred_train
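For contrast, a fusion layer with an actual gate, in the spirit of what the paper describes, might look like the sketch below (my own Gated-Multimodal-Unit-style reconstruction, not the authors' code): a learned sigmoid gate z decides, per dimension, how much of each modality to pass on.
from keras.layers import Input, Dense, Concatenate, Lambda
from keras.models import Model

def gated_fusion(dim_a, dim_t, hsize=64):
    """GMU-style gate: z in [0,1] mixes the audio and lexical projections."""
    a_in = Input(shape=(dim_a,))            # audio activations
    t_in = Input(shape=(dim_t,))            # lexical activations
    h_a = Dense(hsize, activation='tanh')(a_in)
    h_t = Dense(hsize, activation='tanh')(t_in)
    z = Dense(hsize, activation='sigmoid')(Concatenate()([a_in, t_in]))
    h = Lambda(lambda x: x[0] * x[2] + x[1] * (1 - x[2]))([h_a, h_t, z])
    out = Dense(1, activation='sigmoid')(h)
    return Model(inputs=[a_in, t_in], outputs=out)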
Summary
- The method proposed in this paper should be taken with some skepticism: the released code does not match the method described, and no gating mechanism is actually used to modify the LSTMs.
- Nor is there any dedicated handling of the repair pairs in the picture descriptions (see the toy sketch below). The audio features it lists are still worth trying, though.
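Since the repair detection is also missing from the repo, here is only a toy illustration of marking the simplest repetition-type repairs ("the the boy"); real disfluency detection is considerably more involved than this:
def mark_repetition_repairs(tokens):
    """Flag adjacent repeated unigrams as (reparandum, repair) index pairs,
    a crude stand-in for proper disfluency detection."""
    pairs = []
    for i in range(len(tokens) - 1):
        if tokens[i] == tokens[i + 1]:
            pairs.append((i, i + 1))
    return pairs

print(mark_repetition_repairs("the the boy is is taking".split()))
# -> [(0, 1), (3, 4)]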