Introduction
This time I'll try speaker diarization with the classic combination of pyannote and whisper.
The sample repository for this article is published below.
I've also tried other libraries in the past, so if you're curious about what other options are out there, please take a look at those articles.
Development environment
- Windows 11
- Python 3.9
- uv
Setup
Create a Python 3.9 environment with uv. (numba, which pyannote depends on, did not support Python 3.10 or later.)
uv venv -p 3.9
source .venv/bin/activate
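Since the environment above is Windows 11, note that in PowerShell or cmd the activation script lives under Scripts rather than bin. The path below is an assumption based on the standard venv layout that uv creates; the bin path above applies to Git Bash or WSL.
.venv\Scripts\activate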
Install the required libraries.
uv pip install pyannote.audio
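Note that the pyannote/speaker-diarization-3.1 pipeline is gated on Hugging Face: you have to accept its user conditions and authenticate with an access token when loading it. A minimal sketch, assuming the token is exported in an environment variable named HF_TOKEN (the variable name is my choice, not something from the article):

import os
from pyannote.audio import Pipeline

# Pass a Hugging Face access token when loading the gated pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # assumes the token is exported as HF_TOKEN
)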
Install the GPU build of torch.
uv pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
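To confirm that the CUDA 12.1 build was actually installed and that the GPU is visible, a quick check like the following can be run (a minimal sketch, not part of the article's script):

import torch

# The version string should report a +cu121 build if the GPU wheel was installed
print(torch.__version__)
# True if a CUDA-capable GPU is available to PyTorch
print(torch.cuda.is_available())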
Install an additional library to handle MP3 files.
uv pip install pydub
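pydub only wraps ffmpeg for MP3 decoding, so ffmpeg needs to be installed separately and available on PATH. A minimal sketch to check that an MP3 can be decoded (sample.mp3 is a placeholder file name):

from pydub import AudioSegment

# pydub delegates MP3 decoding to ffmpeg, so this fails if ffmpeg is not on PATH.
audio = AudioSegment.from_file("sample.mp3", format="mp3")
# Print sample rate, channel count, and length in milliseconds
print(audio.frame_rate, audio.channels, len(audio))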
Install whisper for transcription.
uv pip install -U openai-whisper
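As a quick sanity check that the package imports and a model can be downloaded, something like this can be run first (the model size here is arbitrary; the script later in the article uses large-v3):

import whisper

# List the model sizes that whisper can load
print(whisper.available_models())
# Download and load a small model just to confirm the install works
model = whisper.load_model("base")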
Execution
Running the script below performs speaker diarization and transcription.
# Import the required libraries
from pyannote.audio import Pipeline
import whisper
import numpy as np
from pydub import AudioSegment

# Initialize the speaker diarization model
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

# Load the Whisper model
model = whisper.load_model("large-v3")

# Specify the audio file (an MP3 file)
audio_file = "JA_B00000_S00529_W000007.mp3"

# Run speaker diarization
diarization = pipeline(audio_file)

# Read the MP3 file with AudioSegment
audio_segment = AudioSegment.from_file(audio_file, format="mp3")

# Convert the audio to 16 kHz mono
audio_segment = audio_segment.set_frame_rate(16000).set_channels(1)

# Loop over the diarization results
for segment, _, speaker in diarization.itertracks(yield_label=True):
    # Cut out the audio for each speaker turn (in milliseconds)
    start_ms = int(segment.start * 1000)
    end_ms = int(segment.end * 1000)
    segment_audio = audio_segment[start_ms:end_ms]

    # Convert the audio data to a numpy array
    waveform = np.array(segment_audio.get_array_of_samples()).astype(np.float32)

    # Normalize the audio data to the range [-1.0, 1.0]
    waveform = waveform / np.iinfo(segment_audio.array_type).max

    # Transcribe with Whisper (the 16 kHz float32 array is passed directly)
    result = model.transcribe(waveform, fp16=False)

    # Format and print the results with the speaker label
    for data in result["segments"]:
        start_time = segment.start + data["start"]
        end_time = segment.start + data["end"]
        print(f"{start_time:.2f},{end_time:.2f},{speaker},{data['text']}")
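As written, the pyannote pipeline runs on the CPU; since the CUDA build of torch was installed above, it can be moved to the GPU explicitly. A minimal sketch (my addition, not part of the article's script):

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
# Move the diarization pipeline to the GPU before calling it on an audio file
pipeline.to(torch.device("cuda"))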
The output looks like this:
0.03,4.15,SPEAKER_00,物事に対しても、真っ直ぐに取り組むような姿勢とか