In a slide within the introductory machine learning lecture by Stanford's Andrew Ng at Coursera, he gives the following one-line Octave solution to the cocktail party problem, given that the audio sources are recorded by two spatially separated microphones:
[W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
At the bottom of the slide is "source: Sam Roweis, Yair Weiss, Eero Simoncelli" and at the bottom of an earlier slide is "Audio clips courtesy of Te-Won Lee". In the video, Professor Ng says,

"So you might look at unsupervised learning like this and ask, 'How complicated is it to implement this?' It seems like in order to build this application, to do this audio processing, you would write a ton of code, or maybe link into a bunch of C++ or Java libraries that process audio. It seems like a really complicated program to do this audio: separating out the audio and so on. It turns out the algorithm to do what you just heard can be done with one line of code... shown right here. It did take researchers a long time to come up with this line of code. So I'm not saying this is an easy problem. But it turns out that when you use the right programming environment, many learning algorithms will be really short programs."
The separated audio results played in the video lecture are not perfect but, in my opinion, amazing. Does anyone have any insight on how that one line of code performs so well? In particular, does anyone know of a reference that explains the work of Te-Won Lee, Sam Roweis, Yair Weiss, and Eero Simoncelli with respect to that one line of code?
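For readability, the one-liner can be unpacked into equivalent steps (a mechanical rewrite with made-up intermediate names, not an explanation of why it works; the shapes shown assume x holds one recording per row, d-by-N; with x arranged N-by-2, as in the update below, M is N-by-N instead):

E = sum(x.*x, 1);                   % 1-by-N: squared norm (energy) of each sample
Ex = repmat(E, size(x,1), 1) .* x;  % each sample weighted by its energy
M = Ex * x';                        % d-by-d fourth-order moment matrix
[W, s, v] = svd(M);                 % orthogonal diagonalization of the symmetric M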
UPDATE
To demonstrate the algorithm's sensitivity to microphone separation distance, the following simulation (in Octave) separates the tones from two spatially separated tone generators.
% define model
f1 = 1100; % frequency of tone generator 1; unit: Hz
f2 = 2900; % frequency of tone generator 2; unit: Hz
Ts = 1/(40*max(f1,f2)); % sampling period; unit: s
dMic = 1; % distance between microphones centered about origin; unit: m
dSrc = 10; % distance between tone generators centered about origin; unit: m
c = 340.29; % speed of sound; unit: m / s
% generate tones
figure(1);
t = [0:Ts:0.025];
tone1 = sin(2*pi*f1*t);
tone2 = sin(2*pi*f2*t);
plot(t,tone1);
hold on;
plot(t,tone2,'r'); xlabel('time'); ylabel('amplitude'); axis([0 0.005 -1 1]); legend('tone 1', 'tone 2');
hold off;
% mix tones at microphones
% assume inverse square attenuation of sound intensity (i.e., inverse linear attenuation of sound amplitude)
figure(2);
dNear = (dSrc - dMic)/2;
dFar = (dSrc + dMic)/2;
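% geometry note (added for clarity): with the microphones at +/- dMic/2 and the
% tone generators at +/- dSrc/2 on the same line, each microphone sits dNear
% from one source and dFar from the other, which is why f1 and f2 swap roles
% between mic1 and mic2 below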
mic1 = 1/dNear*sin(2*pi*f1*(t-dNear/c)) + 1/dFar*sin(2*pi*f2*(t-dFar/c));
mic2 = 1/dNear*sin(2*pi*f2*(t-dNear/c)) + 1/dFar*sin(2*pi*f1*(t-dFar/c));
plot(t,mic1);
hold on;
plot(t,mic2,'r'); xlabel('time'); ylabel('amplitude'); axis([0 0.005 -1 1]); legend('mic 1', 'mic 2');
hold off;
% use svd to isolate sound sources
figure(3);
x = [mic1' mic2'];
[W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
plot(t,v(:,1));
hold on;
maxAmp = max(v(:,1));
plot(t,v(:,2),'r'); xlabel('time'); ylabel('amplitude'); axis([0 0.005 -maxAmp maxAmp]); legend('isolated tone 1', 'isolated tone 2');
hold off;
After about 10 minutes of execution on my laptop computer (with x arranged N-by-2, the argument to svd is an N-by-N matrix, which likely accounts for the long runtime), the simulation generates the following three figures, illustrating that the two isolated tones have the correct frequencies.
However, setting the microphone separation distance to zero (i.e., dMic = 0) causes the simulation to instead generate the following three figures illustrating the simulation could not isolate a second tone (confirmed by the single significant diagonal term returned in svd's s matrix).
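A quick way to make that check explicit (my addition; the 1% threshold is arbitrary) is to inspect the diagonal of s:

sv = diag(s);                  % singular values, largest first
nSig = sum(sv > 0.01*sv(1));   % count components above 1% of the largest
printf('%d significant component(s)\n', nSig);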
I was hoping the microphone separation distance on a smartphone would be large enough to produce good results but setting the microphone separation distance to 5.25 inches (i.e., dMic = 0.1333 meters) causes the simulation to generate the following, less than encouraging, figures illustrating higher frequency components in the first isolated tone.
I was trying to figure this out as well, two years later. But I got my answers; hopefully this will help someone.
You need 2 audio recordings. You can get audio examples from http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi.
The reference for the implementation is http://www.cs.nyu.edu/~roweis/kica.html.
OK, here's the code:
[x1, Fs1] = audioread('mix1.wav');
[x2, Fs2] = audioread('mix2.wav'); % assumes both files share the same sample rate
xx = [x1, x2]'; % stack the recordings: 2-by-N, one row per microphone
% whiten: subtract the mean and decorrelate to unit covariance
yy = sqrtm(inv(cov(xx')))*(xx-repmat(mean(xx,2),1,size(xx,2)));
% the one-liner from the lecture, applied to the whitened data
[W,s,v] = svd((repmat(sum(yy.*yy,1),size(yy,1),1).*yy)*yy');
a = W*xx; % W is the unmixing matrix (note: some kICA implementations apply W to the whitened data yy instead)
subplot(2,2,1); plot(x1); title('mixed audio - mic 1');
subplot(2,2,2); plot(x2); title('mixed audio - mic 2');
subplot(2,2,3); plot(a(1,:), 'g'); title('unmixed wave 1');
subplot(2,2,4); plot(a(2,:),'r'); title('unmixed wave 2');
audiowrite('unmixed1.wav', a(1,:), Fs1);
audiowrite('unmixed2.wav', a(2,:), Fs1);
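One caveat: audiowrite clips samples outside [-1, 1], and the unmixed rows are not guaranteed to stay in that range. An optional guard (my addition, not part of the original answer) is to normalize before writing:

a1 = a(1,:) / max(abs(a(1,:)));          % rescale each signal to [-1, 1]
a2 = a(2,:) / max(abs(a(2,:)));
audiowrite('unmixed1.wav', a1(:), Fs1);  % (:) forces a single-channel column
audiowrite('unmixed2.wav', a2(:), Fs1);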