Cepstral analysis for pitch detection

Question

I'm looking to extract pitches from a sound signal.

Someone on IRC just explained to me how taking a double FFT achieves this. Specifically:

  1. Take the FFT
  2. Take the log of the square of the absolute value (can be done with a lookup table)
  3. Take another FFT
  4. Take the absolute value
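
The four steps above can be sketched in plain Python as follows. This is only an illustration, not vDSP code: the small recursive `fft` helper, the `eps` guard against `log(0)`, the synthetic pulse-train test signal, and the 50-400 Hz search range are all assumptions made for the example.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def cepstrum(frame, eps=1e-12):
    """Steps 1-4: FFT, log of squared magnitude, FFT again, magnitude."""
    spectrum = fft(frame)
    log_power = [math.log(abs(c) ** 2 + eps) for c in spectrum]
    return [abs(c) for c in fft(log_power)]

# Synthetic test signal (an assumption): a Hann-windowed pulse train with
# one pulse every 64 samples, i.e. 125 Hz at an 8000 Hz sample rate.
fs, n, period = 8000, 1024, 64
frame = [(1.0 if i % period == 0 else 0.0)
         * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
         for i in range(n)]
ceps = cepstrum(frame)

# Look for the strongest quefrency inside a plausible pitch range.
lo, hi = fs // 400, fs // 50
peak = max(range(lo, hi), key=lambda q: ceps[q])
estimate = fs / peak  # pitch estimate in Hz
```

The index of the peak is a lag in samples (a "quefrency"), so the pitch estimate is the sample rate divided by that index.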

I am attempting this using vDSP

I can't understand how I didn't come across this technique earlier. I did a lot of hunting and asking questions; several weeks' worth. More to the point, I can't understand why I didn't think of it.

I am attempting to achieve this with the vDSP library. It looks as though it has functions to handle all of these tasks.

However, I'm wondering about the accuracy of the final result.

I have previously used a technique which scours the frequency bins of a single FFT for local maxima. When it encounters one, it uses a cunning technique (the change in phase since the last FFT) to more accurately place the actual peak within the bin.

I am worried that this precision will be lost with this technique I'm presenting here.

I guess the technique could be used after the second FFT to get the fundamental accurately. But it kind of looks like the information is lost in step 2.

As this is a potentially tricky process, could someone with some experience just look over what I'm doing and check it for sanity?

Also, I've heard there is an alternative technique involving fitting a quadratic over neighbouring bins. Is this of comparable accuracy? If so, I would favour it, as it doesn't involve remembering bin phases.

So, questions:

  • Does this approach make sense? Can it be improved?
  • I'm a bit worried about the "log square" component. There seems to be a vDSP function to do exactly that: vDSP_vdbcon. However, there is no indication that it precalculates a log table. I assume it doesn't, since the FFT function requires an explicit precalculation function to be called and its result passed in, and this function has no such requirement.
  • Is there some danger of harmonics being picked up?
  • Is there any cunning way of making vDSP pull out the maxima, biggest first?
  • Can anyone point me towards some research or literature on this technique?

The main question: is it accurate enough? Can the accuracy be improved? I have just been told by an expert that the accuracy is indeed not sufficient. Is this the end of the line?

Pi

PS I get SO annoyed when I want to create tags, but cannot. :| I have suggested to the maintainers that SO keep track of attempted tags, but I'm sure I was ignored. We need tags for vDSP, accelerate framework, cepstral analysis

Recommended answer

Okay, let's go through one by one:

Although I am not an expert and have had minimal formal training, I think I know the best answer to this problem. I've done a lot of searching, reading, and experimenting over the past few years. My conclusion is that the autocorrelation method is by far the best pitch detector in terms of the tradeoff between accuracy, complexity, noise robustness, and speed. Unless you have some very specific circumstances, I would almost always recommend using autocorrelation. More on this later; first let me answer your other questions.

What you describe is "cepstral analysis", which is a method mainly used for extracting pitch from speech. Cepstral analysis relies entirely on the number and strength of the overtones in your signal. If, for example, you were to pass a pure sine wave through cepstral analysis, you would get terrible results. Speech, however, is a complex signal with a large number of overtones. (Overtones, by the way, are components of the signal that oscillate at multiples of the fundamental frequency, i.e. the pitch we perceive.) Cepstral analysis can be robust in detecting the pitch of speech with a missing fundamental frequency. That is, suppose you plotted the function sin(4x)+sin(6x)+sin(8x)+sin(10x). If you look at that, it is clear that it has the same frequency as the function sin(2x). However, if you apply Fourier analysis to this function, the bin corresponding to sin(2x) will have zero magnitude. Thus this signal is considered to have a "missing fundamental frequency", because it does not contain a sinusoid at the frequency we perceive it to have. Simply picking the biggest peak of the Fourier transform will therefore not work on this signal.
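
The missing-fundamental example can be checked numerically. The sketch below samples one full cycle of x from 0 to 2π (so that DFT bin k lines up with sin(kx)) and evaluates a few bins directly from the DFT definition. The sample count of 64 and the helper name `dft_magnitude` are choices made for this illustration.

```python
import cmath
import math

def dft_magnitude(x, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n = len(x)
    return abs(sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n)))

n = 64
xs = [2 * math.pi * i / n for i in range(n)]
signal = [math.sin(4 * x) + math.sin(6 * x) + math.sin(8 * x) + math.sin(10 * x)
          for x in xs]

# Bin 2 (the sin(2x) component) is empty; bins 4, 6, 8 and 10 are not.
mags = {k: dft_magnitude(signal, k) for k in (2, 4, 6, 8, 10)}

# Yet the waveform repeats every pi, exactly like sin(2x) does.
half = n // 2
repeats = all(abs(signal[i] - signal[i + half]) < 1e-9 for i in range(half))
```

Here `mags[2]` comes out as zero up to rounding error while `repeats` is true, which is exactly the situation in which picking the largest FFT bin reports the wrong pitch.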

What you are describing is the phase vocoder technique to more accurately measure the frequency of a given partial. However, the basic technique of picking out the biggest bin is going to cause you problems if you use a signal with a missing or weak fundamental frequency component.

First of all, remember that the phase vocoder technique only more accurately measures the frequency of a single partial. It ignores the information contained in the higher partials about the fundamental frequency. Second of all, given a decent FFT size, you can get very good accuracy using peak interpolation. Someone else here has pointed you towards parabolic interpolation. I also would suggest this.

If you parabolically interpolate the FFT of a 4096-sample block of data at 44100 Hz, with a pitch of about 440 Hz, the peak will lie between the 40th (430.66 Hz) and 41st (441.43 Hz) bins. Assuming this paper is approximately correct in the general case, parabolic interpolation increases resolution by more than one order of magnitude. That leaves the resolution at roughly 1 Hz or better, which is about the limit of what human hearing can distinguish. In fact, if you use an ideal Gaussian window, parabolic interpolation is exact at the peaks. (That's right, exact. Remember, however, that you can never use a true Gaussian window, because it extends forever in both directions.) If you are still worried about getting higher accuracy, you can always pad the FFT. This means adding zeros to the end of the data block before transforming it. It works out that this is equivalent to "sinc interpolation", which is the ideal interpolation function for band-limited signals.
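
As a rough illustration of the numbers above, the sketch below computes the bin spacing and applies three-point parabolic interpolation. The Gaussian-shaped peak is synthetic, and its centre (bin 40.87) and width are made-up values. The interpolation is applied to the log magnitude, which is where a Gaussian peak becomes an exact parabola.

```python
def parabolic_offset(left, center, right):
    """Offset, in bins, of the vertex of the parabola through three
    neighbouring values, measured from the centre bin."""
    denom = left - 2.0 * center + right
    return 0.0 if denom == 0 else 0.5 * (left - right) / denom

fs, n = 44100, 4096
bin_hz = fs / n                        # about 10.77 Hz per bin
bin40, bin41 = 40 * bin_hz, 41 * bin_hz

# Hypothetical Gaussian-shaped spectral peak centred near 440 Hz.
true_bin, sigma = 40.87, 1.5

def log_mag(k):
    """Log magnitude of the synthetic peak at integer bin k."""
    return -((k - true_bin) ** 2) / (2 * sigma ** 2)

k = 41                                  # the largest integer bin
est_bin = k + parabolic_offset(log_mag(k - 1), log_mag(k), log_mag(k + 1))
est_hz = est_bin * bin_hz               # close to 440 Hz
```

With real windows such as Hann the fit is no longer exact, but the same three-point formula is commonly used as an approximation.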

On the information being lost in step 2: that is correct. The phase vocoder technique relies on the fact that sequential frames are connected and have a specific phase relationship. However, the log magnitudes of the FFTs of sequential frames do not show the same phase relationship, so it would be useless to apply the technique to the output of the second FFT.

Does the approach make sense, and can it be improved? Yes and yes. I will elaborate on the improvement in my bit on autocorrelation at the end.

I don't know the specifics of the vDSP library, sorry.

Is there a danger of harmonics being picked up? In your original phase-vocoder peak-picking technique, yes. With the cepstral method, no, not really; the whole point is that it considers all the harmonics to get its frequency estimate. For example, let's say our frequency is 1. Our overtones are 2, 3, 4, 5, 6, 7, 8, 9, etc. We would have to take out all of the odd harmonics (i.e. leave 2, 4, 6, 8, etc.) and remove the fundamental frequency before the pitch would start to be confused with one of its overtones.

Don't know vDSP, but in the general case, you usually just iterate over all of them and keep track of the biggest.
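
In plain Python (not vDSP), that general approach might look like the sketch below. The function name `peaks_biggest_first` and the sample data are invented for the example. A point counts as a peak here if it is higher than its left neighbour and at least as high as its right neighbour.

```python
def peaks_biggest_first(values):
    """Indices of local maxima, ordered from the tallest peak down."""
    peaks = [i for i in range(1, len(values) - 1)
             if values[i - 1] < values[i] >= values[i + 1]]
    return sorted(peaks, key=lambda i: values[i], reverse=True)

magnitudes = [0.0, 3.0, 1.0, 5.0, 2.0, 4.0, 0.0]  # made-up spectrum
order = peaks_biggest_first(magnitudes)
```

If only the single largest value is needed, one pass that tracks the running maximum is enough and avoids the sort.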

The link P. i gave you in a comment seemed like a good one.

Also, this website offers an incredibly in-depth and wonderfully broad explanation of DSP topics, including all sorts of pitch extraction, manipulation, etc, in both a theoretical and practical way. (this is a more general link to an index on the site). I always find myself coming back to it. Sometimes it can be a bit overwhelming if you jump into the middle of it, but you can always follow every explanation back to the basic building blocks.

Now for autocorrelation. Basically the technique is this: you take your (windowed) signal and delay it in time by different amounts. Find the delay at which the delayed copy matches up best with your original signal. That delay is the fundamental period. It makes a lot of theoretical sense: you are hunting for the repetitive parts of your signal.

In practice, taking the correlation with all these time delayed copies of the signal is slow. It is usually implemented in this way instead (which is mathematically equivalent):

Zero-pad the signal to double its original length. Take the FFT. Then replace all the coefficients with their squared magnitudes, except for the first, which you set to 0. Now take the IFFT. Divide every element by the first one. This gives you the autocorrelation. Mathematically, you are using the circular convolution theorem (look it up), and using zero-padding to convert a linear convolution problem into a circular convolution one, which can be solved efficiently.
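
A minimal pure-Python sketch of that recipe follows. The recursive `fft`, the conjugate-based `ifft`, and the sine test frame are assumptions made to keep the example self-contained; a real implementation would call an optimised FFT library instead.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT via the conjugation identity."""
    n = len(x)
    return [c.conjugate() / n for c in fft([complex(v).conjugate() for v in x])]

def autocorrelation(frame):
    """Zero-pad, FFT, squared magnitude (first bin zeroed), IFFT,
    then divide by the lag-0 value."""
    n = len(frame)
    padded = list(frame) + [0.0] * n
    power = [abs(c) ** 2 for c in fft(padded)]
    power[0] = 0.0
    acf = [c.real for c in ifft(power)][:n]
    return [v / acf[0] for v in acf]

# Test frame (an assumption): 8 full cycles of a sine with period 32.
n = 256
frame = [math.sin(2 * math.pi * i / 32) for i in range(n)]
acf = autocorrelation(frame)

# Direct (slow) computation of the same quantity, for comparison.
direct = [sum(frame[i] * frame[i + lag] for i in range(n - lag)) for lag in range(n)]
direct = [v / direct[0] for v in direct]
max_diff = max(abs(a - b) for a, b in zip(acf, direct))
```

For this zero-mean frame the FFT route and the direct sum agree to within rounding error. Zeroing the first coefficient only matters when the frame has a non-zero mean.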

However, be careful about picking the peak. For very small delays, the signal will match up with itself very well, simply because it is continuous. (I mean, if you delay it by zero, it correlates perfectly with itself.) Instead, pick the largest peak after the first zero-crossing. As with the other techniques, you can parabolically interpolate the autocorrelation function to get much more accurate values.
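
A sketch of that peak-picking rule, using a slow direct autocorrelation so that the example stands alone. The 210 Hz two-harmonic test tone, the 8000 Hz sample rate, and the function names are assumptions for illustration.

```python
import math

def autocorr_direct(x):
    """Autocorrelation by direct summation, divided by the lag-0 value."""
    n = len(x)
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]
    return [v / r[0] for v in r]

def pitch_period(acf):
    """Lag of the largest value after the first zero crossing,
    refined with three-point parabolic interpolation."""
    start = next((i for i, v in enumerate(acf) if v < 0), None)
    if start is None:
        return None                      # never crosses zero: no estimate
    lag = max(range(start, len(acf) - 1), key=lambda i: acf[i])
    a, b, c = acf[lag - 1], acf[lag], acf[lag + 1]
    denom = a - 2 * b + c
    return lag + (0.5 * (a - c) / denom if denom else 0.0)

fs, f0 = 8000, 210.0
frame = [math.sin(2 * math.pi * f0 * i / fs)
         + 0.5 * math.sin(2 * math.pi * 2 * f0 * i / fs)
         for i in range(512)]
period = pitch_period(autocorr_direct(frame))
estimate = fs / period                   # close to 210 Hz
```

The true period here is about 38.1 samples, so without the interpolation step the estimate would be stuck at 8000/38, roughly 210.5 Hz.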

This by itself will give you very good pitch detection by all criteria. However, you might sometimes encounter a problem with pitch halving and pitch doubling. Basically the problem is that if a signal repeats every 1 second, it also repeats every 2 seconds, so you may detect a period twice as long as the true one (pitch halving). Similarly, if the signal has a very strong overtone, you might lock onto the period of that overtone instead (pitch doubling). So the biggest peak might not always be the one you want. A solution to this problem is the MPM algorithm by Phillip McLeod. The idea is this:

Instead of picking the biggest peak, you want to pick the first peak that is large enough to be considered. How do you determine whether a peak is large enough to be considered? It qualifies if it is at least as high as A times the largest peak, where A is some constant. Phillip suggests a value of A around 0.9, I think. The program he wrote, Tartini, allows you to compare several different pitch detection algorithms in real time. I would strongly suggest downloading it and trying it out (it implements cepstrum, straight autocorrelation, and MPM). If you have trouble building it, try the instructions here.
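
The thresholding idea alone can be sketched as below. This is not a full MPM implementation; as far as I recall, the published algorithm works on a normalised square difference function rather than the plain autocorrelation. The function name, the default `a=0.9`, and the hand-made autocorrelation values are assumptions for illustration.

```python
def first_strong_peak(acf, a=0.9):
    """Index of the first local maximum, after the first zero crossing,
    whose height is at least a times the tallest such maximum."""
    start = next((i for i, v in enumerate(acf) if v < 0), None)
    if start is None:
        return None
    peaks = [i for i in range(max(start, 1), len(acf) - 1)
             if acf[i - 1] < acf[i] >= acf[i + 1]]
    if not peaks:
        return None
    threshold = a * max(acf[i] for i in peaks)
    return next(i for i in peaks if acf[i] >= threshold)

# Made-up autocorrelation: the peak at lag 12 is slightly taller than the
# one at lag 6, so plain "largest peak" picking would halve the pitch.
acf = [1.0, 0.6, 0.1, -0.4, -0.2, 0.3, 0.92, 0.4,
       -0.3, -0.5, 0.1, 0.5, 0.95, 0.5, 0.0]
chosen = first_strong_peak(acf)
largest = max(range(3, len(acf)), key=lambda i: acf[i])
```

Here `chosen` is lag 6, the shorter period, while `largest` is lag 12. Lowering `a` makes the picker more willing to accept an early peak, at the risk of locking onto an overtone.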

One last thing I should note is about windowing. In general, any smooth window will do. Hanning window, Hamming window, etc. Hopefully you should know how to window. I would also suggest doing overlapped windows if you want more accurate temporal measurements.
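
For completeness, windowing with overlap might be sketched like this. The symmetric Hann formula is standard; the frame size of 256, the hop of 128 (50% overlap), and the constant test signal are arbitrary choices for the example.

```python
import math

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def overlapped_frames(signal, size, hop):
    """Split signal into windowed frames of length size, one every hop samples."""
    w = hann(size)
    return [[s * wi for s, wi in zip(signal[start:start + size], w)]
            for start in range(0, len(signal) - size + 1, hop)]

signal = [1.0] * 1024                    # placeholder input
frames = overlapped_frames(signal, size=256, hop=128)
```

Each frame would then go through the pitch detector separately, giving one estimate per hop rather than one per window length.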

By the way, a cool property of the autocorrelation is that if the frequency is changing linearly through the windowed section you are measuring, it will give you the correct frequency at the center of the window.

One more thing: what I described is called the biased autocorrelation function. This is because for higher time lags, the overlap between the original signal and the time-lagged version becomes less and less. For example, if you look at a window of size N which has been delayed by N-1 samples, you see that only one sample overlaps. So the correlation at this delay is clearly going to be very close to zero. You can compensate for this by dividing each value of the autocorrelation function by the number of overlapping samples used to compute it. This is called the unbiased autocorrelation. In general, however, you will get worse results with this, because the values of the autocorrelation at higher delays are very noisy, being based on only a few samples, so it makes sense to weight them less.
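
The difference between the two variants can be seen in a small sketch. The function names and the sine test frame (4 full cycles of period 16) are assumptions for the example.

```python
import math

def biased_autocorr(x):
    """Sum of products over the overlapping samples at each lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]

def unbiased_autocorr(x):
    """Biased values divided by the number of overlapping samples."""
    n = len(x)
    return [r / (n - lag) for lag, r in enumerate(biased_autocorr(x))]

n = 64
frame = [math.sin(2 * math.pi * i / 16) for i in range(n)]
biased = biased_autocorr(frame)
unbiased = unbiased_autocorr(frame)
```

For this clean sine the biased peaks shrink with lag (32 at lag 0, down to 8 at lag 48), while the unbiased peaks all stay at 0.5. With noisy input the last unbiased values rest on only a handful of products, which is the noise problem described above.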

If you're looking for more information, as always, Google is your friend. Good search terms: autocorrelation, pitch detection, pitch tracking, pitch extraction, pitch estimation, cepstrum, etc.
