Question
My main goal is to feed MFCC features to an ANN.
However, I am stuck at the data pre-processing step, and my question has two parts.
BACKGROUND:
I have an audio file, and I have a txt file with annotations and timestamps like this:
0.0 2.5 Music
2.5 6.05 silence
6.05 8.34 notmusic
8.34 12.0 silence
12.0 15.5 music
I know that for a single audio file I can calculate the MFCCs using librosa like this:
import librosa

y, sr = librosa.load('abcd.wav')         # y: waveform samples, sr: sample rate (22050 Hz by default)
mfcc = librosa.feature.mfcc(y=y, sr=sr)  # shape: (20, n_frames)
Part 1: I'm unable to wrap my head around two things:
How to calculate MFCCs based on the segments from the annotations.
Part 2: How best to store these MFCCs for passing to a Keras DNN? That is, should all MFCCs calculated per audio segment be saved to a single list/dictionary, or is it better to save them to separate dictionaries so that all MFCCs belonging to one label are in one place?
I'm new to audio processing and Python, so I'm open to recommendations regarding best practices.
More than happy to provide additional details. Thanks.
Answer
Part 1: Mapping MFCCs to tags
It's not obvious from the librosa documentation, but I believe the MFCCs are being calculated at about a 23 ms frame rate. With your code above, mfcc.shape will return (20, x), where 20 is the number of features and x is the number of frames. The default hop_length for mfcc is 512 samples, which means each MFCC frame spans about 23 ms (512/sr, with sr = 22050 Hz by default).
Using this, you can compute which frames go with which tag in your text file. For example, the tag Music goes from 0.0 to 2.5 seconds, so that will be MFCC frames 0 to 2.5*sr/512 ≈ 108. The boundaries will not come out as exact integers, so you need to round the values.
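As a minimal sketch of that arithmetic (the filename annotations.txt is an assumption; the file is whitespace-separated as shown above), you could slice the MFCC array per segment like this:

import librosa

HOP_LENGTH = 512  # librosa's default hop_length for mfcc

y, sr = librosa.load('abcd.wav')
mfcc = librosa.feature.mfcc(y=y, sr=sr)  # shape: (20, n_frames)

segments = []  # list of (tag, mfcc_slice) pairs
with open('annotations.txt') as f:
    for line in f:
        start, end, tag = line.split()
        # seconds -> frame indices; round because boundaries rarely align exactly
        start_frame = int(round(float(start) * sr / HOP_LENGTH))
        end_frame = int(round(float(end) * sr / HOP_LENGTH))
        segments.append((tag, mfcc[:, start_frame:end_frame]))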
Part 2A: DNN data format
For the input (MFCC data), you'll need to figure out what the input looks like. You'll have 20 features, but do you want to input a single frame to your net, or are you going to submit a time series? Your MFCC data is already a numpy array, but it's formatted as (feature, sample). You probably want to reverse that for input to Keras. Note that this is a transpose (e.g. mfcc.T or numpy.transpose), not a reshape, which would scramble the values.
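For instance, continuing the sketch above (the segments list is an assumption from that snippet):

import numpy as np

# stack all segment frames into one array of shape (n_frames_total, 20),
# i.e. the (samples, features) layout Keras expects
X = np.concatenate([frames.T for _, frames in segments], axis=0)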
For the output, you need to assign a numeric value to each tag in your text file. Typically you would store the tag-to-integer mapping in a dictionary. This will then be used to create the training output for the network. There should be one output integer for each input sample.
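Continuing the sketch (the mapping values are arbitrary, and segments comes from the earlier snippet; lower() normalizes the mixed-case 'Music' vs 'music' in the annotation file):

import numpy as np

# hypothetical tag-to-integer mapping
tag_to_int = {'music': 0, 'silence': 1, 'notmusic': 2}

# one integer label per MFCC frame, aligned with the rows of X
y_train = np.concatenate(
    [np.full(frames.shape[1], tag_to_int[tag.lower()]) for tag, frames in segments]
)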
Part 2B: Saving the data
The simplest way to do this is to use pickle to save the data and then reload it later. I like to use a class to encapsulate the input, output, and dictionary data, but you can choose whatever works for you.
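A minimal sketch using a plain dictionary instead of a class (X, y_train, and tag_to_int are the assumed names from the snippets above):

import pickle

dataset = {'X': X, 'y': y_train, 'tag_to_int': tag_to_int}

# save to disk
with open('dataset.pkl', 'wb') as f:
    pickle.dump(dataset, f)

# reload later for training
with open('dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)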