Introduction
One preprocessing step in TTS training is to convert all text to hiragana. This post tries Sudachi, which makes that conversion easy.
A sample repository is available here:
github.com
Development environment
Setup
uv pip install sudachipy sudachidict-core
Converting to hiragana
This time I run the conversion on the text of the Tsukuyomi-chan Corpus (つくよみちゃんコーパス│声優統計コーパス, JVS corpus compliant).
The code below reads the file and converts it with Sudachi:
from sudachipy import dictionary, tokenizer


def katakana_to_hiragana(text):
    """Convert katakana characters in text to hiragana."""
    chars = []
    for char in text:
        code = ord(char)
        # Katakana U+30A1 (ァ) to U+30F6 (ヶ) sit exactly 0x60 above their
        # hiragana counterparts. U+30F7 to U+30FA have no hiragana form,
        # so they are left unchanged.
        if 0x30A1 <= code <= 0x30F6:
            chars.append(chr(code - 0x60))
        else:
            chars.append(char)
    return ''.join(chars)


tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

with open('sudachi-result.txt', 'w', encoding='utf-8') as f_output:
    with open('sample.txt', 'r', encoding='utf-8') as f_input:
        for line in f_input:
            line = line.strip()
            if not line:
                continue
            if ':' in line:
                # Each line is "<identifier>:<script>"; split on the first colon only.
                identifier, script = line.split(':', 1)
                readings = []
                for token in tokenizer_obj.tokenize(script, mode):
                    # reading_form() gives the reading of the token in katakana.
                    yomi = token.reading_form()
                    readings.append(yomi)
                katakana_text = ''.join(readings)
                hiragana_text = katakana_to_hiragana(katakana_text)
                output_line = f"{identifier}:{hiragana_text}\n"
                f_output.write(output_line)
            else:
                # Lines without an identifier are copied through unchanged.
                f_output.write(line + '\n')
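The katakana-to-hiragana step relies on the two Unicode blocks being laid out in parallel, 0x60 apart. As a minimal sketch of the same idea, the mapping can also be built once as a translation table and applied with `str.translate`. The names `KATAKANA_TO_HIRAGANA` and `to_hiragana` are hypothetical and not part of the original script.

```python
# Map katakana U+30A1 (ァ) .. U+30F6 (ヶ) to the hiragana code point 0x60 below.
# Code points outside this range (for example the long vowel mark ー, U+30FC)
# are not in the table, so str.translate leaves them untouched.
KATAKANA_TO_HIRAGANA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}


def to_hiragana(text):
    """Return text with katakana replaced by hiragana (hypothetical helper)."""
    return text.translate(KATAKANA_TO_HIRAGANA)


print(to_hiragana("トウジ"))              # とうじ
print(to_hiragana("ニューイングランド"))  # にゅーいんぐらんど
```

Building the table once avoids the per-character branch, which may matter on a large corpus; for a few hundred lines either form should be fine.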
The script expects a sample.txt in the following format, placed in the project root:
VOICEACTRESS100_001:また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。
VOICEACTRESS100_002:ニューイングランド風は、牛乳をベースとした、白いクリームスープであり、ボストンクラムチャウダーとも呼ばれる。
VOICEACTRESS100_003:コンピュータゲームのメーカーや、業界団体などに関連する人物のカテゴリ。
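Each line of this format is an identifier, an ASCII colon, and the script. A minimal sketch of parsing one line follows; `parse_line` is a hypothetical helper, and it assumes the separator is always the ASCII colon `:` and never the full-width `：`.

```python
def parse_line(line):
    """Split 'identifier:script' on the first colon; return None for other lines."""
    line = line.strip()
    if not line or ':' not in line:
        return None
    # maxsplit=1 keeps any later colons inside the script text.
    identifier, script = line.split(':', 1)
    return identifier, script


print(parse_line("VOICEACTRESS100_001:また、東寺のように\n"))
print(parse_line("ID:a:b"))
```

Splitting only on the first colon means a script that itself contains a colon is not truncated.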
Processing sample.txt gives the following result:
VOICEACTRESS100_001:また、とうじのように、ごだいみょうおうとよばれる、しゅようなみょうおうのちゅうおうにはいされることもおおい。
VOICEACTRESS100_002:にゅーいんぐらんどふうは、ぎゅうにゅうをべーすとした、しろいくりーむすーぷであり、ぼすとんくらむちゃうだーともよばれる。
VOICEACTRESS100_003:こんぴゅーたげーむのめーかーや、ぎょうかいだんたいなどにかんれんするじんぶつのかてごり。
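Since the goal is text that is entirely hiragana, it may help to check the converted lines for anything else, such as digits, Latin letters, or kanji that passed through unconverted. The sketch below assumes that hiragana, the long vowel mark ー, and the punctuation 、 and 。 are the only characters expected, which is an assumption based on the sample output above. `non_hiragana_chars` is a hypothetical helper.

```python
# Characters other than hiragana that the sample output contains.
# This set is an assumption; extend it if your corpus uses other punctuation.
ALLOWED_EXTRA = "ー、。"


def non_hiragana_chars(text):
    """Return the sorted set of characters that are neither hiragana nor allowed extras."""
    unexpected = set()
    for ch in text:
        is_hiragana = '\u3041' <= ch <= '\u3096'  # ぁ .. ゖ
        if not is_hiragana and ch not in ALLOWED_EXTRA:
            unexpected.add(ch)
    return sorted(unexpected)


print(non_hiragana_chars("また、とうじのように。"))  # []
print(non_hiragana_chars("ABCとうじ東"))            # ['A', 'B', 'C', '東']
```

An empty list means the line is safe to use as is; anything else points at a token whose reading needs a manual look.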
Changing the dictionary
Sudachi has three dictionaries: small, core, and full.
Here I use full, which has the largest vocabulary.
First, install it:
uv pip install sudachidict_full
Then pass the dictionary name when creating the tokenizer, in the same place where the split mode is set:
tokenizer_obj = dictionary.Dictionary(dict="full").create()
mode = tokenizer.Tokenizer.SplitMode.C
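To produce results for more than one dictionary without editing the script each time, the dictionary name can be made a parameter. This is only a sketch: `result_path` and `reading_with` are hypothetical helpers, and the file naming simply follows the names sudachi-result.txt and sudachi-full-result.txt used in this post.

```python
def result_path(dict_name):
    """Output file name for a dictionary, following the names used in this post."""
    if dict_name == 'core':
        return 'sudachi-result.txt'
    return f'sudachi-{dict_name}-result.txt'


def reading_with(dict_name, text):
    """Return the katakana reading of text using the named Sudachi dictionary."""
    # Imported inside the function so this file can be loaded even when
    # sudachipy or the requested dictionary package is not installed.
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary(dict=dict_name).create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return ''.join(t.reading_form() for t in tokenizer_obj.tokenize(text, mode))
```

Calling `reading_with('full', script)` requires the sudachidict_full package to be installed; `reading_with('core', script)` requires sudachidict-core.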
github.com
Next, compare the output of core and full to see how their accuracy differs. For this I wrote and ran the following diff script:
import difflib


def compare_results(core_file, full_file, diff_output_file):
    """Write every line whose reading differs between the two result files."""
    with open(core_file, 'r', encoding='utf-8') as f_core:
        core_lines = f_core.readlines()
    with open(full_file, 'r', encoding='utf-8') as f_full:
        full_lines = f_full.readlines()
    with open(diff_output_file, 'w', encoding='utf-8') as f_output:
        max_lines = max(len(core_lines), len(full_lines))
        for i in range(max_lines):
            # Treat a missing line as empty if the files differ in length.
            core_line = core_lines[i].strip() if i < len(core_lines) else ''
            full_line = full_lines[i].strip() if i < len(full_lines) else ''
            # Pad with '' so a line without a colon still unpacks into two values.
            core_identifier, core_script = (core_line.split(':', 1) + [''])[:2]
            full_identifier, full_script = (full_line.split(':', 1) + [''])[:2]
            if core_script != full_script:
                f_output.write(f"--- 差分が見つかりました（行 {i+1}） ---\n")
                f_output.write(f"識別子: {core_identifier}\n")
                f_output.write("【Core辞書の結果】\n")
                f_output.write(core_script + '\n')
                f_output.write("【Full辞書の結果】\n")
                f_output.write(full_script + '\n')
                f_output.write("【差分】\n")
                # ndiff on two strings compares them character by character.
                diff = difflib.ndiff(core_script, full_script)
                f_output.write(''.join(diff) + '\n\n')


core_file = 'sudachi-result.txt'
full_file = 'sudachi-full-result.txt'
diff_output_file = 'diff_results.txt'
compare_results(core_file, full_file, diff_output_file)
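Character-by-character `ndiff` output joined into one string can be hard to read. As an alternative sketch, `difflib.SequenceMatcher` can report only the regions that differ, as pairs of substrings. `changed_spans` is a hypothetical helper, and the example strings are made up for illustration; they are not Sudachi output.

```python
import difflib


def changed_spans(core_text, full_text):
    """Return (core_part, full_part) pairs for each region where the strings differ."""
    matcher = difflib.SequenceMatcher(None, core_text, full_text, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' regions are identical in both strings, so skip them.
        if tag != 'equal':
            spans.append((core_text[i1:i2], full_text[j1:j2]))
    return spans


print(changed_spans("きょうはあめ", "こんにちはあめ"))  # [('きょう', 'こんにち')]
print(changed_spans("abc", "abXc"))                    # [('', 'X')]
```

Each pair shows what one dictionary produced against what the other produced for the same stretch of text, which is usually the part worth inspecting.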
The differences are as follows:
--- 差分が見つかりました（行 9） ---
識別子: VOICEACTRESS100_009
【Core辞書の結果】
ã¾ãããããã¯ããªããã¾ããã®ããããããã
ããã
ãããããã¨ã®ããããã«ããããããã
【Full辞書の結果】
ã¾ãããããã¯ãã¡ã
ãããããã®ããããããã
ããã
ãããããã¨ã®ããããã«ããããããã
【差分】
ã¾ ã ã ã ã ã 㯠ã- ãª- ã- ã- ã¾+ ã¡+ ã
+ ã+ ã+ ã ã ã ã® ã ã ã ã ã ã ã
ã ã ã
ã ã ã ã ã 㨠㮠ã ã ã ã ã« ã ã ã ã ã ã ã
--- 差分が見つかりました（行 27） ---
識別子: VOICEACTRESS100_027
【Core辞書の結果】
ã¡ããããã«ãã£ããã¿ããã¾ã¯ãã¯ããªãããã¯ãã«ãã¹ãã²ããã²ãããããã£ããã¨ã§ããããã
【Full辞書の結果】
ã¡ããããã«ãã£ããã¿ããã¾ã¯ãã¯ããªãããã¯ãã«ãããã²ãã£ã´ãããããã£ããã¨ã§ããããã
【差分】
ã¡ ã ã ã ã ã« ã 㣠ã ã ã¿ ã ã 㾠㯠ã 㯠ã 㪠ã ã ã 㯠ã ã« ã- ã¹- ã+ ã+ ã ã² ã- ã- ã²+ ã£+ ã´ ã ã ã ã ã 㣠ã ã 㨠㧠ã ã ã ã ã