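Mel-frequency cepstrum coefficients (MFCC)
This is the 19th installment of Audio signal processing in Python (2011/05/14).
This time I compute the Mel-Frequency Cepstrum Coefficients, a feature that shows up often in speech recognition. They are usually just called MFCCs.
Like the cepstrum (2012/2/11), MFCCs are features that represent vocal tract characteristics. The difference between the cepstrum and MFCCs is that MFCCs take the characteristics of human auditory perception into account; that is what the word "mel" refers to.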
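The MFCC extraction procedure can be summarized as follows:
- Emphasize the high-frequency components of the waveform with a pre-emphasis filter
- Apply a window function, then take the FFT to get the amplitude spectrum
- Apply a mel filter bank to the amplitude spectrum to compress it
- Treat the compressed sequence of values as a signal and apply the discrete cosine transform
- The low-order components of the resulting cepstrum are the MFCCs
The code I used as a reference works with the amplitude spectrum, but the Wikipedia article says "power", so maybe the power spectrum would be the better choice?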
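To make the amplitude-versus-power question concrete, here is a tiny sketch on made-up numbers. The toy spectrum `spec` and the single triangular filter `tri` are assumptions for illustration only; they are not taken from the audio used in this post. The power spectrum is just the squared amplitude spectrum, but once a filter sums several bins the log of the power-based output is no longer simply twice the log of the amplitude-based one, so the two choices give different features.

```python
import numpy as np

# Toy amplitude spectrum and one triangular filter (both made up for illustration)
spec = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
tri = np.array([0.0, 0.5, 1.0, 0.5, 0.0])

# Filter bank output from the amplitude spectrum, then log
log_amp = np.log10(np.dot(spec, tri))

# Filter bank output from the power spectrum (amplitude squared), then log
log_pow = np.log10(np.dot(spec ** 2, tri))

# If only one bin were involved, log_pow would equal 2 * log_amp.
# With several bins summed, the two differ.
print(log_amp, log_pow, 2 * log_amp)
```

This does not say which variant is better; it only shows that the choice changes the values that go into the later steps.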
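It differs from cepstrum analysis in two points:
- A mel filter bank is applied to the amplitude spectrum to compress it
- The discrete cosine transform, not the Fourier transform, is used to move into the cepstral domain
Otherwise it appears to be much the same. So I will implement it in Python by following these steps.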
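Before going step by step, it may help to see the whole procedure in one place. The following is only a rough sketch under my own assumptions: the function names `mel_filterbank` and `mfcc` are made up for this illustration, the filter bank is a compact variant of the one built later in this post, and a synthetic signal (two sine waves plus a little noise at an assumed 16 kHz sampling rate) stands in for the real recording.

```python
import numpy as np
import scipy.signal
import scipy.fftpack


def hz2mel(f):
    """Convert Hz to mel."""
    return 1127.01048 * np.log(f / 700.0 + 1.0)


def mel2hz(m):
    """Convert mel to Hz."""
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)


def mel_filterbank(fs, nfft, num_channels):
    """Triangular filters equally spaced on the mel scale (illustrative helper)."""
    nmax = nfft // 2
    # num_channels + 2 points: start of first filter, the centers, end of last filter
    mel_points = np.linspace(0.0, hz2mel(fs / 2.0), num_channels + 2)
    bins = np.round(mel2hz(mel_points) / (fs / nfft)).astype(int)
    fb = np.zeros((num_channels, nmax))
    for c in range(num_channels):
        lo, mid, hi = bins[c], bins[c + 1], bins[c + 2]
        fb[c, lo:mid] = (np.arange(lo, mid) - lo) / float(mid - lo)        # rising edge
        fb[c, mid:hi] = 1.0 - (np.arange(mid, hi) - mid) / float(hi - mid)  # falling edge
    return fb


def mfcc(signal, fs, nfft=2048, num_channels=20, nceps=12, p=0.97):
    """Sketch of the five steps listed above (not a reference implementation)."""
    # 1. Pre-emphasis
    x = scipy.signal.lfilter([1.0, -p], 1, signal)
    # 2. Window, FFT, amplitude spectrum
    x = x * np.hamming(len(x))
    spec = np.abs(np.fft.fft(x, nfft))[:nfft // 2]
    # 3. Mel filter bank, then log
    mspec = np.log10(np.dot(spec, mel_filterbank(fs, nfft, num_channels).T))
    # 4. Discrete cosine transform
    ceps = scipy.fftpack.dct(mspec, type=2, norm="ortho")
    # 5. Keep the low-order coefficients
    return ceps[:nceps]


# Synthetic 0.04 s test signal (an assumption; the post itself uses a recorded vowel)
fs = 16000.0
t = np.arange(640) / fs
rng = np.random.RandomState(0)
signal = 0.5 * np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)
signal = signal + 0.01 * rng.randn(len(t))

coeffs = mfcc(signal, fs)
print(coeffs.shape)
```

The default values (2048-point FFT, 20 channels, 12 coefficients, p = 0.97) are the ones used in the rest of this post.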
Audio signal
First, prepare an audio signal. I load the steady-state portion in the middle of the "a" vowel sound used in the cepstrum analysis post.
    # coding: utf-8
    import wave
    import numpy as np
    from pylab import *

    def wavread(filename):
        """Read a WAV file (16-bit samples assumed); return samples and sampling rate."""
        wf = wave.open(filename, "r")
        fs = wf.getframerate()
        x = wf.readframes(wf.getnframes())
        x = np.frombuffer(x, dtype="int16") / 32768.0  # normalize to (-1, 1)
        wf.close()
        return x, float(fs)

    if __name__ == "__main__":
        # Load the audio
        wav, fs = wavread("a.wav")
        t = np.arange(len(wav)) / fs

        # Cut out the central part of the waveform
        center = len(wav) // 2  # index of the center sample
        cuttime = 0.04          # length to cut out [s]
        start = int(center - cuttime / 2 * fs)  # slice indices must be integers
        stop = int(center + cuttime / 2 * fs)
        wavdata = wav[start:stop]
        time = t[start:stop]

        # Plot the waveform
        plot(time * 1000, wavdata)
        xlabel("time [ms]")
        ylabel("amplitude")
        savefig("waveform.png")
        show()
Running it plots the extracted waveform.
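Pre-emphasis filter
Next, apply a pre-emphasis filter, which boosts the high-frequency components of the spectrum. Boosting the high frequencies reportedly makes the vocal tract characteristics stand out more clearly, so it is better to use it. The pre-emphasis filter is defined as

    y(n) = x(n) - p x(n-1)

where x(n) is the speech waveform data and p is the pre-emphasis coefficient. A value of 0.97 is reportedly a common choice.
The expression above is the same as an FIR filter (2011/10/23) with filter coefficients [1.0, -p]. Substituting these coefficients into the defining equation of an FIR filter confirms that it reduces to the pre-emphasis equation. In Python it can be implemented with lfilter() as shown below.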
    import scipy.signal

    def preEmphasis(signal, p):
        """Pre-emphasis filter"""
        # Build an FIR filter with coefficients (1.0, -p)
        return scipy.signal.lfilter([1.0, -p], 1, signal)

    # Apply the pre-emphasis filter to the excerpt cut out earlier (wavdata)
    p = 0.97  # pre-emphasis coefficient
    signal = preEmphasis(wavdata, p)
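As a quick sanity check of the FIR equivalence, the sketch below evaluates y(n) = x(n) - p x(n-1) directly and compares it with the lfilter() result. The four sample values are made up, and I assume x(-1) = 0, which matches lfilter()'s default zero initial state.

```python
import numpy as np
import scipy.signal

p = 0.97
x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up samples for illustration

# Direct evaluation of y(n) = x(n) - p * x(n-1), taking x(-1) = 0
x_delayed = np.concatenate(([0.0], x[:-1]))
y_direct = x - p * x_delayed

# The same thing as an FIR filter with coefficients [1.0, -p]
y_lfilter = scipy.signal.lfilter([1.0, -p], 1, x)

print(y_direct)
print(y_lfilter)
```

Both come out as roughly 1.0, 1.03, 1.06, 1.09: the slowly rising ramp is flattened, which is the low-frequency attenuation (relative high-frequency boost) the filter is meant to give.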
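I plotted the waveform after pre-emphasis together with its spectrum. Top left is the original waveform, top right is the filtered waveform, bottom left is the spectrum of the original waveform, and bottom right is the spectrum of the filtered waveform. It is easy to see that the pre-emphasis filter boosts the high-frequency components.
Amplitude spectrum
Next, apply a Hamming window to the pre-emphasized waveform and then take the FFT to get the amplitude spectrum. (The spectra in the previous figure were computed without a Hamming window.)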
    # Apply a Hamming window
    hammingWindow = np.hamming(len(signal))
    signal = signal * hammingWindow

    # Compute the amplitude spectrum
    nfft = 2048  # number of FFT points
    spec = np.abs(np.fft.fft(signal, nfft))[:nfft // 2]
    fscale = np.fft.fftfreq(nfft, d=1.0 / fs)[:nfft // 2]

    # Plot
    plot(fscale, spec)
    xlabel("frequency [Hz]")
    ylabel("amplitude spectrum")
    savefig("spectrum.png")
    show()
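To check that the frequency axis is built correctly, here is a small sketch that runs the same window-and-FFT step on a pure 1000 Hz tone. The 16 kHz sampling rate and the tone itself are assumptions for illustration; the recording used in this post may have a different sampling rate. With nfft = 2048 each frequency index covers fs / nfft Hz, so the peak should land on the index whose frequency is 1000 Hz.

```python
import numpy as np

fs = 16000.0        # assumed sampling rate [Hz]
nfft = 2048         # number of FFT points, as in the post
n = np.arange(640)  # 0.04 s worth of samples at 16 kHz
tone = np.sin(2 * np.pi * 1000.0 * n / fs)

# Same processing as above: Hamming window, zero-padded FFT, keep the lower half
windowed = tone * np.hamming(len(tone))
spec = np.abs(np.fft.fft(windowed, nfft))[:nfft // 2]
fscale = np.fft.fftfreq(nfft, d=1.0 / fs)[:nfft // 2]

df = fs / nfft                   # Hz per frequency index
peak_bin = int(np.argmax(spec))  # index of the largest amplitude
peak_hz = fscale[peak_bin]
print(df, peak_bin, peak_hz)
```

Because the 640 samples are zero-padded to 2048 points, the spectrum is sampled on a finer grid than the excerpt length alone would give, but the actual resolution is still limited by the 0.04 s window.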
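Building the mel filter bank
Next, build the mel filter bank. A filter bank is a set of band-pass filters (2011/10/30) lined up side by side. Here, triangular band-pass filters are arranged so that they overlap. The number of triangular band-pass filters is called the number of channels.
What makes this more than a plain filter bank is the "mel" part. The mel scale is a frequency axis that reflects human auditory perception, and its unit is the mel. The spacing is narrow at low frequencies and wider at high frequencies. Apparently humans can tell fine differences in pitch at low frequencies but become less able to do so as the frequency gets higher. I will leave the detailed explanation to the references and just implement functions that convert between Hz and mel.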
    def hz2mel(f):
        """Convert Hz to mel"""
        return 1127.01048 * np.log(f / 700.0 + 1.0)

    def mel2hz(m):
        """Convert mel to Hz"""
        return 700.0 * (np.exp(m / 1127.01048) - 1.0)
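A short worked example may make the scale more tangible. With this formula, 1000 Hz maps to almost exactly 1000 mel, and points spaced equally in mel spread further and further apart once converted back to Hz. The 8000 Hz upper limit below is an assumption (the Nyquist frequency for a 16 kHz sampling rate), chosen only for illustration.

```python
import numpy as np


def hz2mel(f):
    """Convert Hz to mel"""
    return 1127.01048 * np.log(f / 700.0 + 1.0)


def mel2hz(m):
    """Convert mel to Hz"""
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)


# 1000 Hz is very close to 1000 mel with these constants
mel_1k = hz2mel(1000.0)
print(mel_1k)

# Six points equally spaced on the mel axis, from 0 up to an assumed 8000 Hz
mels = np.linspace(0.0, hz2mel(8000.0), 6)
hz = mel2hz(mels)

# Gaps between neighbouring points, measured in Hz: they grow with frequency
widths = np.diff(hz)
print(hz)
print(widths)
```

The growing gaps are exactly why the triangular filters get wider toward high frequencies when drawn on a Hz axis.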
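In a mel filter bank, the triangular windows of the band-pass filters are placed at equal intervals on the mel scale. When filters that are equally spaced on the mel scale are mapped back to the Hz scale, the triangles get wider as the frequency gets higher. The plot of the filter bank produced by the code below shows this shape.
Here is the function that builds the mel filter bank.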
    def melFilterBank(fs, nfft, numChannels):
        """Build a mel filter bank"""
        # Nyquist frequency (Hz)
        fmax = fs / 2
        # Nyquist frequency (mel)
        melmax = hz2mel(fmax)
        # Number of frequency indices (integer, used as an array size)
        nmax = nfft // 2
        # Frequency resolution (width in Hz of one frequency index)
        df = fs / nfft
        # Center frequency of each filter on the mel scale
        dmel = melmax / (numChannels + 1)
        melcenters = np.arange(1, numChannels + 1) * dmel
        # Convert the center frequencies to Hz
        fcenters = mel2hz(melcenters)
        # Convert the center frequencies to frequency indices (integers)
        indexcenter = np.round(fcenters / df).astype(int)
        # Index where each filter starts
        indexstart = np.hstack(([0], indexcenter[0:numChannels - 1]))
        # Index where each filter ends
        indexstop = np.hstack((indexcenter[1:numChannels], [nmax]))

        filterbank = np.zeros((numChannels, nmax))
        for c in np.arange(0, numChannels):
            # Points on the rising (left) edge of the triangle
            increment = 1.0 / (indexcenter[c] - indexstart[c])
            for i in np.arange(indexstart[c], indexcenter[c]):
                filterbank[c, i] = (i - indexstart[c]) * increment
            # Points on the falling (right) edge of the triangle
            decrement = 1.0 / (indexstop[c] - indexcenter[c])
            for i in np.arange(indexcenter[c], indexstop[c]):
                filterbank[c, i] = 1.0 - ((i - indexcenter[c]) * decrement)

        return filterbank, fcenters

    # Build the mel filter bank
    numChannels = 20  # number of channels in the mel filter bank
    df = fs / nfft    # frequency resolution (width in Hz of one frequency index)
    filterbank, fcenters = melFilterBank(fs, nfft, numChannels)

    # Plot the mel filter bank
    for c in np.arange(0, numChannels):
        plot(np.arange(0, nfft // 2) * df, filterbank[c])
    savefig("melfilterbank.png")
    show()
The return value filterbank is a matrix. Each row corresponds to one (triangular) band-pass filter. In speech recognition, 20 band-pass filters are apparently a common choice, which makes this a 20-channel filter bank. The number of columns is 1024, because the FFT size (nfft) is 2048 and only the half up to the Nyquist frequency is kept.
    print(filterbank.shape)

prints (20, 1024).
Applying the mel filter bank
Next, apply the mel filter bank to the original amplitude spectrum. Each filter in the bank is applied to the amplitude spectrum, the filtered amplitudes are summed, and the logarithm is taken. It is hard to follow in words, but the code makes it clear.
    # Apply each filter of the filter bank to the amplitude spectrum
    # and take the log of the summed amplitudes
    mspec = []
    for c in np.arange(0, numChannels):
        mspec.append(np.log10(sum(spec * filterbank[c])))
    mspec = np.array(mspec)
In fact, there is no need for a for loop: a matrix product, as below, is more concise. Applying a filter and summing the amplitudes is just an inner product.
    # Apply the mel filter bank to the amplitude spectrum
    mspec = np.log10(np.dot(spec, filterbank.T))
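Here is a quick check, on made-up data, that the loop and the matrix product agree. The random arrays below are stand-ins (3 channels, 8 frequency indices) and have nothing to do with the real spectrum or filter bank; only the shapes and the arithmetic matter.

```python
import numpy as np

rng = np.random.RandomState(0)
num_channels, num_bins = 3, 8

spec = rng.rand(num_bins) + 0.1                # stand-in amplitude spectrum (positive)
filterbank = rng.rand(num_channels, num_bins)  # stand-in filter bank, one filter per row

# Loop version: weight the spectrum by each filter, sum, take the log
mspec_loop = np.array([np.log10(np.sum(spec * filterbank[c]))
                       for c in range(num_channels)])

# Matrix version: (num_bins,) dot (num_bins, num_channels) -> (num_channels,)
mspec_dot = np.log10(np.dot(spec, filterbank.T))

print(mspec_loop)
print(mspec_dot)
```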
Either way gives the same result. This compresses the amplitude spectrum down to as many dimensions as the mel filter bank has channels. With 20 channels here, the result is 20-dimensional data. Plotting the original log amplitude spectrum and the 20-dimensional data from the mel filter bank gives the following.
    # Show the original amplitude spectrum and the spectrum compressed by the filter bank
    subplot(211)
    plot(fscale, np.log10(spec))
    xlabel("frequency")
    xlim(0, 25000)

    subplot(212)
    plot(fcenters, mspec, "o-")
    xlabel("frequency")
    xlim(0, 25000)

    show()
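Discrete cosine transform
Finally, apply the discrete cosine transform to this 20-dimensional data to move into the cepstral domain. In cepstrum analysis a Fourier transform was used to go back, but for MFCC the discrete cosine transform is reportedly used instead. I do not yet understand why. In fact, I am still studying what the discrete cosine transform even is...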
    import scipy.fftpack

    # Discrete cosine transform
    ceps = scipy.fftpack.dct(mspec, type=2, norm="ortho", axis=-1)

    # Keep the nceps lowest-order coefficients
    nceps = 12
    mfcc = ceps[:nceps]
    print(mfcc)
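For anyone else still getting used to the discrete cosine transform, here is a sketch of what the type-2, norm="ortho" variant computes, written out by hand and compared against scipy.fftpack.dct. The function name `dct2_ortho` and the input values are made up for illustration. Each output coefficient is the inner product of the input with a cosine of a given frequency, so the low-order coefficients describe the slowly varying shape of the input.

```python
import numpy as np
import scipy.fftpack


def dct2_ortho(x):
    """Type-2 DCT with orthonormal scaling, written out explicitly (illustrative)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)        # sample index
    k = n.reshape(-1, 1)    # coefficient index, as a column for broadcasting
    # basis[k, n] = cos(pi * k * (2n + 1) / (2N))
    basis = np.cos(np.pi * k * (2 * n + 1) / (2.0 * N))
    y = np.sqrt(2.0 / N) * np.dot(basis, x)
    y[0] /= np.sqrt(2.0)    # the k = 0 term uses sqrt(1/N) instead of sqrt(2/N)
    return y


x = np.array([0.5, 1.0, 2.0, 1.5, 0.0, -1.0])  # made-up input
by_hand = dct2_ortho(x)
by_scipy = scipy.fftpack.dct(x, type=2, norm="ortho")
print(by_hand)
print(by_scipy)
```

With this scaling, the k = 0 coefficient is the sum of the input divided by sqrt(N), so it carries the overall level of the log filter bank outputs.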
The MFCCs are the low-order components taken from the result of the discrete cosine transform. Reportedly it is common to keep roughly the first 12, so nceps=12. For the audio data above, I got
[ 2.51895741 -0.39441998 0.16150014 0.17564364 -0.72552876 -0.73787793 -0.16415795 0.07149698 0.24680304 0.02212086 -0.34275272 -0.29347927]
as the result. Hmm, the values look plausible, but I wonder whether they are actually correct.
This time I implemented it myself as a learning exercise, but in practice existing tools such as HTK or SPTK (2012/8/5) make extracting MFCCs very easy. I probably will not use this code in real experiments (lol).
If you spot any mistakes, please let me know in the comments! Thank you.
References
- 音声認識を紹介するページ (a page introducing speech recognition) - An intuitive, easy-to-follow explanation of the principles of speech recognition.
- Mel scale - Wikipedia
- Mel-frequency cepstrum - Wikipedia
- Miyazawa's Pukiwiki 公開版 - Also referenced in the cepstrum analysis post; this is its continuation.
- Talkbox - MFCC code implemented in Python. I referred to only part of it.
- Auditory Toolbox - MFCC code implemented in Matlab
- Matlab Central - I referred to this code for how to build the mel filter bank