Problem Description
I am working with the HTK toolkit on a word-spotting task and have a classic training/testing data mismatch. The training data consisted only of "clean" data (recorded over a microphone). The data was converted to MFCC_E_D_A parameters, which were then modelled by phone-level HMMs. My test data has been recorded over landline and mobile phone channels (introducing distortions and the like). Using the MFCC_E_D_A parameters with HVite results in incorrect output. I would like to use cepstral mean normalization with MFCC_E_D_A_Z parameters, but that alone would not be of much use, since the HMMs were not trained on such data. My questions are as follows:
- Is there any way to convert MFCC_E_D_A_Z into MFCC_E_D_A? That way the pipeline would be: input -> MFCC_E_D_A_Z -> MFCC_E_D_A -> HMM log-likelihood computation.
- Is there any way to convert the existing HMMs, which model MFCC_E_D_A parameters, to MFCC_E_D_A_Z?
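As background on question (1): the _Z qualifier applies cepstral mean normalization, i.e. it subtracts the per-utterance mean from each cepstral coefficient. A minimal NumPy sketch (using a hypothetical random feature matrix, not real MFCC output) shows why this subtraction cannot be undone unless the subtracted mean has been kept around:

```python
import numpy as np

# Hypothetical cepstral feature matrix: 5 frames x 13 coefficients,
# shifted by a constant offset to mimic a channel bias.
np.random.seed(0)
feats = np.random.randn(5, 13) + 3.0

# Cepstral mean normalization (what the _Z qualifier does):
# subtract the per-utterance mean of each coefficient.
mean = feats.mean(axis=0)
feats_z = feats - mean

# The normalized features now have (numerically) zero mean...
print(np.allclose(feats_z.mean(axis=0), 0.0))  # True

# ...so the original features can only be recovered if the
# subtracted mean vector itself is still available.
feats_back = feats_z + mean
print(np.allclose(feats_back, feats))  # True
```

In other words, _Z features carry strictly less information than the un-normalized ones, so a generic _Z -> non-Z conversion is not possible; regenerating MFCC_E_D_A directly from the original waveforms sidesteps the issue.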
If there is a way to do (1) above, what would the HCopy config file look like? I wrote the following HCopy config file for the conversion:

SOURCEFORMAT = MFCC_E_D_A_Z
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
This does not work. How can I improve this?
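One likely problem with the config above: in HTK, SOURCEFORMAT names the file format (e.g. HTK, WAV, NIST), while the parameter kind of the input is given by SOURCEKIND. A sketch of how such a config would be written, assuming the source features are stored in HTK file format:

```
# Hypothetical HCopy config sketch: SOURCEKIND gives the parameter
# kind of the input features, SOURCEFORMAT gives the file format.
SOURCEFORMAT = HTK
SOURCEKIND   = MFCC_E_D_A_Z
TARGETKIND   = MFCC_E_D_A
SAVECOMPRESSED = T
SAVEWITHCRC  = T
```

Note, however, that even with the correct keys this direction of conversion cannot restore the original MFCC_E_D_A values, because cepstral mean normalization has already discarded the per-utterance mean (see the sketch above the questions). The reliable route is to run HCopy on the original waveforms with TARGETKIND = MFCC_E_D_A.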
Answer
You need to understand that telephone recordings cover a different frequency range, because the signal is band-limited by the channel: typically only frequencies from about 200 to 3500 Hz are present. A wideband acoustic model is trained on the range from roughly 100 to 6800 Hz. It will not decode telephone speech reliably, because telephone speech is missing the required frequencies between 3500 and 6800 Hz. This has nothing to do with feature type, mean normalization, or distortion; you simply cannot do that.
You need to retrain your original model on audio downsampled to 8 kHz, or at least modify the filterbank parameters to match the telephone frequency range.
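For the filterbank adjustment, HTK exposes LOFREQ and HIFREQ config parameters that set the lower and upper cut-offs of the mel filterbank. A sketch of a narrowband front-end config (the band-edge values are illustrative choices for the telephone band, not prescribed ones):

```
# Sketch of a narrowband (telephone) front-end config.
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_E_D_A
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26
CEPLIFTER    = 22
NUMCEPS      = 12
LOFREQ       = 200     # restrict the filterbank to the telephone band
HIFREQ       = 3500
```

The 8 kHz training audio itself can be produced with an external resampler, e.g. `sox input.wav -r 8000 output.wav`; retraining the HMMs on features from this matched front-end is what removes the mismatch.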