Speech Recognition / Speech Synthesis == Work in Progress ==

　Testing Japanese speech recognition and speech synthesis.

※ Last updated: 2024/11/12

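Speech Recognition

Speech recognition with SpeechRecognition

Installing the package

(py38_learn_test2) pip install SpeechRecognition

Example command runs (the text printed after each command is the recognized Japanese transcript)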

(py38_learn_test2) cd /anaconda_win/workspace_2/speech_recognition
(py38_learn_test2) python speech2text.py ./sample/output.wav
準備ができたので実際に 合成音声ファイルを作成してみます

(py38_learn_test2) python speech2text.py ./sample/yoshimura3.wav
それでは ただいまより 知事記者会見を始めさせていただきます はじめに知事より ご説明がございます え 知事 よろしくお願いい たします はい 私からは3点です まず1点目についてです 先島庁舎におけるサウンディング型の市場調査を行いますえこれ あの先島 調査に入ってるですねえ ホテル 作島コスモスタワーホテル の明け渡し後におけるテナントの公募 条件を検討するにあたって サウ ンディングをします 市場動向 事業者の活用意向と事前に把握するための調査をいたします あの先島はコスモタワーホテルとの訴訟 等々につきましても すでに報告してる通りです あの予定通りですね 10月31日までに 明け渡しを受ける予定にしています そして このマーケットスタンディングですけども その後の活用ということでまず 対処 フロアですけれども 低層会 7回から17回までの計11フロアについて マーケットスタンディングをいたします 原則として 7回から17回のあの全てのフロアを一括して借り受ける活用案の提案を募集をいたします この中でですね7階から9回についてですけれども これは ホテル仕様の回収工事がまだ未完了になっています で 今 10階から17回までについては あの ホテル仕様は完了してそして実際に 10階から17階までが先島 コスモタワーホテル として 使う あの これまで使われてきました そして7階から9回もあの貸していたところではあるんですけれども ここについては まだ ホテル仕様の回収工事が途中の段階で止まってるというものですま 今回のマクド 3人については 7階から17回 えすべてを対象とすることを原則としてますべてを買い受け 場合にどんな活用案がありますか ということの募集をいたします もちろんですね あの現況のまま使用する場合 これ一番多いと思います 今もうすでに ホテル なわけですから え もちろん そのホテルはあると思いますが それ以外にもあの募集をいたします ホテル以外で 用途変更して こういった 活用があるんじゃないかという提案があれば それも含めて 募集 致しますので あの必ずしもホテルに限ったものではないというものになりますま 現状 ホテルで今使われてるものであります そし てスケジュールですけれども 本日から調査を開始をいたしましてえ 現地見学会を 9月中旬に行います そして 9月下旬には質問の受 付をいたします ですので あの 興味のある事業者の皆さんには です ね 実際 このホテル仕様の じゃあ あ の改修工事 が未完了の 音が どうなってるのかとかそういうことも含めて えこの現地の見学会等をさせていただくということになります そして

Sample audio files
　　output.wav, yoshimura3.wav

Source code

speech2text.py source code

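# -*- coding: utf-8 -*-
##------------------------------------------
##  Speech to text           Ver 0.01
##       with speech_recognition
##
##               2024.11.07 Masahiro Izutsu
##------------------------------------------
## speech2text.py
import sys

import speech_recognition as sr

# Require the path to an audio file as the first argument
if len(sys.argv) < 2:
    sys.exit("usage: python speech2text.py <audio.wav>")
filepath = sys.argv[1]

r = sr.Recognizer()
# Read the whole audio file into memory
with sr.AudioFile(filepath) as source:
    audio = r.record(source)

# Send the audio to the Google Web Speech API and transcribe it as Japanese
text = r.recognize_google(audio, language='ja-JP')

print(text)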
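
speech2text.py sends the whole file in a single recognition request. If a long recording turns out to be too much for one request, one option is to cut it into shorter pieces first and transcribe them one by one. The following is a minimal sketch of that idea using only the standard library `wave` module. The function name `split_wav`, the 30 second default and the output file naming are assumptions for illustration and are not part of the original scripts.

```python
import wave
from pathlib import Path


def split_wav(src_path, out_dir, chunk_seconds=30):
    """Split a PCM WAV file into chunks of at most chunk_seconds each.

    Returns the list of chunk paths in playback order.
    """
    src_path = Path(src_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            chunk_path = out_dir / f"{src_path.stem}_{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                # Copy channels, sample width and rate; the frame count
                # in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each resulting chunk could then be passed to speech2text.py in turn, for example `python speech2text.py ./chunks/yoshimura3_000.wav` (the `./chunks` directory is hypothetical).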

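Creating the audio samples

Download a suitable sample from YouTube (example: 吉村大阪府知事　定例記者会見（令和6年9月4日), a regular press conference by Osaka Governor Yoshimura held on September 4, 2024)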
```
(py38_learn_test2) cd /anaconda_win/workspace_2/mylib2
(py38_learn_test2) python ytb_down.py 'https://youtu.be/UGoYMe7qcBY'
```

Trim the downloaded file to its first 4 minutes / first 3 minutes

(py38_learn_test2) ffmpeg -i '吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].webm' -ss 00:00:00 -t 00:04:00 -async 1 yoshimura.mp4
(py38_learn_test2) ffmpeg -i '吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].webm' -ss 00:00:00 -t 00:03:00 -async 1 yoshimura3.mp4

Create a .wav audio file from the .mp4

(py38_learn_test2) python my_videotool.py 60 yoshimura3.mp4 yoshimura3.wav
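
The source of my_videotool.py is not listed on this page. As one possible alternative, the audio track can be extracted by calling ffmpeg directly. The sketch below builds and runs such a command from Python. The function names `build_extract_command` and `extract_wav`, the 16 kHz sample rate and the mono downmix are assumptions chosen for illustration, not settings taken from the original tool.

```python
import subprocess


def build_extract_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes the audio of src to a mono PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input video file
        "-vn",                    # ignore the video stream
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


def extract_wav(src, dst, sample_rate=16000):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg fails."""
    subprocess.run(build_extract_command(src, dst, sample_rate), check=True)
```

With ffmpeg on the PATH, `extract_wav("yoshimura3.mp4", "yoshimura3.wav")` would produce a file that `sr.AudioFile` can open.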

Execution log

(py38_learn_test2) python ytb_down.py 'https://youtu.be/UGoYMe7qcBY'
0 ytb_down.py
1 https://youtu.be/UGoYMe7qcBY
['https://youtu.be/UGoYMe7qcBY']
Deprecated Feature: Support for Python version 3.8 has been deprecated. Please update to Python 3.9 or above
[youtube] Extracting URL: https://youtu.be/UGoYMe7qcBY
[youtube] UGoYMe7qcBY: Downloading webpage
[youtube] UGoYMe7qcBY: Downloading ios player API JSON
[youtube] UGoYMe7qcBY: Downloading mweb player API JSON
[youtube] UGoYMe7qcBY: Downloading player dad5a960
[youtube] UGoYMe7qcBY: Downloading m3u8 information
[info] UGoYMe7qcBY: Downloading 1 format(s): 244+251
[download] Destination: 吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].f244.webm
[download] 100% of  130.29MiB in 00:00:03 at 36.16MiB/s
[download] Destination: 吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].f251.webm
[download] 100% of   53.41MiB in 00:00:13 at 4.08MiB/s
[Merger] Merging formats into "吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].webm"
Deleting original file 吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].f251.webm (pass -k to keep)
Deleting original file 吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].f244.webm (pass -k to keep)

(py38_learn_test2) ffmpeg -i '吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].webm' -ss 00:00:00 -t 00:04:00 -async 1 yoshimura.mp4
ffmpeg version 7.0.2 Copyright (c) 2000-2024 the FFmpeg developers
  built with clang version 18.1.8
  configuration: --prefix=/d/bld/ffmpeg_1726581472730/_h_env/Library --cc=clang.exe --cxx=clang++.exe --nm=llvm-nm --ar=llvm-ar --disable-doc --enable-openssl --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libopenh264 --enable-libdav1d --ld=lld-link --target-os=win64 --enable-cross-compile --toolchain=msvc --host-cc=clang.exe --extra-libs=ucrt.lib --extra-libs=vcruntime.lib --extra-libs=oldnames.lib --strip=llvm-strip --disable-stripping --host-extralibs= --disable-libopenvino --enable-gpl --enable-libx264 --enable-libx265 --enable-libaom --enable-libsvtav1 --enable-libxml2 --enable-pic --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libopus --pkg-config=/d/bld/ffmpeg_1726581472730/_build_env/Library/bin/pkg-config
  libavutil      59.  8.100 / 59.  8.100
  libavcodec     61.  3.100 / 61.  3.100
  libavformat    61.  1.100 / 61.  1.100
  libavdevice    61.  1.100 / 61.  1.100
  libavfilter    10.  1.100 / 10.  1.100
  libswscale      8.  1.100 /  8.  1.100
  libswresample   5.  1.100 /  5.  1.100
  libpostproc    58.  1.100 / 58.  1.100
Input #0, matroska,webm, from '吉村大阪府知事　定例記者会見（令和6年9月4日) [UGoYMe7qcBY].webm':
  Metadata:
    ENCODER         : Lavf61.1.100
  Duration: 01:00:46.45, start: 0.000000, bitrate: 422 kb/s
  Stream #0:0(eng): Video: vp9 (Profile 0), yuv420p(tv, bt709), 854x480, SAR 1:1 DAR 427:240, 29.97 fps, 29.97 tbr, 1k tbn (default)
      Metadata:
        DURATION        : 01:00:46.409000000
  Stream #0:1(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
      Metadata:
        DURATION        : 01:00:46.448000000
Stream mapping:
  Stream #0:0 -> #0:0 (vp9 (native) -> h264 (libx264))
  Stream #0:1 -> #0:1 (opus (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 000002CB27296FC0] using SAR=1/1
[libx264 @ 000002CB27296FC0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
[libx264 @ 000002CB27296FC0] profile High, level 3.1, 4:2:0, 8-bit
[libx264 @ 000002CB27296FC0] 264 - core 164 r3095 baee400 - H.264/MPEG-4 AVC codec - Copyleft 2003-2022 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=15 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'yoshimura.mp4':
  Metadata:
    encoder         : Lavf61.1.100
  Stream #0:0(eng): Video: h264 (avc1 / 0x31637661), yuv420p(tv, bt709, progressive), 854x480 [SAR 1:1 DAR 427:240], q=2-31, 29.97 fps, 30k tbn (default)
      Metadata:
        DURATION        : 01:00:46.409000000
        encoder         : Lavc61.3.100 libx264
      Side data:
        cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
  Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
      Metadata:
        DURATION        : 01:00:46.448000000
        encoder         : Lavc61.3.100 aac
[out#0/mp4 @ 000002CB271E4C00] video:10854KiB audio:3773KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 1.771115%
frame= 7193 fps=900 q=-1.0 Lsize=   14886KiB time=00:03:59.93 bitrate= 508.2kbits/s speed=  30x
[libx264 @ 000002CB27296FC0] frame I:29    Avg QP:17.73  size: 54772
[libx264 @ 000002CB27296FC0] frame P:1859  Avg QP:20.73  size:  3739
[libx264 @ 000002CB27296FC0] frame B:5305  Avg QP:25.89  size:   485
[libx264 @ 000002CB27296FC0] consecutive B-frames:  1.0%  1.5%  1.4% 96.1%
[libx264 @ 000002CB27296FC0] mb I  I16..4: 15.2% 36.6% 48.2%
[libx264 @ 000002CB27296FC0] mb P  I16..4:  1.2%  1.5%  0.3%  P16..4: 26.0%  6.9%  4.3%  0.0%  0.0%    skip:59.7%
[libx264 @ 000002CB27296FC0] mb B  I16..4:  0.0%  0.0%  0.0%  B16..8: 13.9%  1.0%  0.3%  direct: 0.3%  skip:84.4%  L0:41.6% L1:54.5% BI: 3.9%
[libx264 @ 000002CB27296FC0] 8x8 transform intra:45.3% inter:67.7%
[libx264 @ 000002CB27296FC0] coded y,uvDC,uvAC intra: 44.4% 41.1% 20.2% inter: 3.5% 3.3% 0.3%
[libx264 @ 000002CB27296FC0] i16 v,h,dc,p: 35% 22%  7% 37%
[libx264 @ 000002CB27296FC0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 33% 20% 26%  3%  3%  4%  4%  4%  4%
[libx264 @ 000002CB27296FC0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 33% 26% 12%  4%  5%  5%  5%  5%  5%
[libx264 @ 000002CB27296FC0] i8c dc,h,v,p: 54% 20% 23%  3%
[libx264 @ 000002CB27296FC0] Weighted P-Frames: Y:11.3% UV:1.0%
[libx264 @ 000002CB27296FC0] ref P L0: 58.3% 21.8% 14.8%  4.5%  0.5%
[libx264 @ 000002CB27296FC0] ref B L0: 89.1%  9.1%  1.9%
[libx264 @ 000002CB27296FC0] ref B L1: 96.2%  3.8%
[libx264 @ 000002CB27296FC0] kb/s:370.43
[aac @ 000002CB2727BA40] Qavg: 1211.732

(py38_learn_test2) python my_videotool.py 60 yoshimura3.mp4 yoshimura3.wav
 Source file → 'yoshimura3.mp4'
 Saving file → 'yoshimura3.wav'
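
Before handing the generated file to speech2text.py, it can help to confirm that it really is a PCM WAV of the expected length. A small sketch using the standard library `wave` module follows. The helper name `describe_wav` and the keys of the returned dictionary are hypothetical.

```python
import wave


def describe_wav(path):
    """Return channel count, sample width, sample rate and duration of a PCM WAV."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }
```

For a clip trimmed to 3 minutes, `describe_wav("yoshimura3.wav")["duration_seconds"]` should come out close to 180.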

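Speech recognition application

Running the command

The script prints its prompts in Japanese: 「なにか話してください ...」 (please say something), 「認識中...」 (recognizing), 「認識できなかった。。。。。」 (could not recognize) and 「プログラムを終了します」 (exiting the program). Saying 「ストップ」 (stop) ends the loop.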

(py38_learn_test2) python speech.py
なにか話してください ...
認識中...
テストで
なにか話してください ...
認識中...
認識できなかった。。。。。
なにか話してください ...
認識中...
ストップ
プログラムを終了します

Source code

speech.py source code

# -*- coding: utf-8 -*-
##------------------------------------------
##  Speech to text           Ver 0.01
##       with speech_recognition
##
##               2024.11.07 Masahiro Izutsu
##------------------------------------------
## speech.py    https://knt60345blog.com/speechrecognition1/

import speech_recognition as sr

# Color Escape Code ---------------------------
GREEN = '\033[1;32m'
RED = '\033[1;31m'
NOCOLOR = '\033[0m'
YELLOW = '\033[1;33m'
CYAN = '\033[1;36m'
BLUE = '\033[1;34m'

if __name__ == "__main__":

    r = sr.Recognizer()
    mic = sr.Microphone()

    while True:
        # Prompt: "Please say something ..."
        print(CYAN + "なにか話してください ..." + NOCOLOR)

        with mic as source:
            r.adjust_for_ambient_noise(source)  # calibrate the energy threshold to the ambient noise
            audio = r.listen(source)

        # Status: "Recognizing..."
        print(GREEN + "認識中..." + NOCOLOR)

        try:
            # Call the recognizer once and reuse the result
            text = r.recognize_google(audio, language='ja-JP')
            print(text)

            # Saying "ストップ" (stop) ends the program
            if text == "ストップ":
                # Message: "Exiting the program"
                print(CYAN + "プログラムを終了します" + NOCOLOR)
                break

        except sr.UnknownValueError:
            # Message: "Could not recognize"
            print(YELLOW + "認識できなかった。。。。。" + NOCOLOR)
        except sr.RequestError as e:
            print("Could not request results from Google Speech Recognition service; {0}".format(e))
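
speech.py stops only when the recognized text is exactly "ストップ". Recognition results can contain spaces between phrases, as the press conference transcript earlier on this page shows, so an exact comparison might occasionally miss the stop word. The sketch below normalizes the text before comparing. The names `normalize` and `is_stop_command`, and the extra English stop word, are assumptions for illustration.

```python
import unicodedata

# Hypothetical set of stop words; the original script only checks "ストップ".
STOP_WORDS = {"ストップ", "stop"}


def normalize(text):
    """Apply NFKC normalization, remove all whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(text.split()).lower()


def is_stop_command(text):
    """Return True when the normalized text is one of the stop words."""
    return normalize(text) in STOP_WORDS
```

In the loop, the check would then read `if is_stop_command(text):` in place of the direct string comparison.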

Speech Synthesis

Speech synthesis with pyttsx3

Installing the package

(py38_learn_test2) pip install pyttsx3

Example command runs

(py38_learn_test2) python text2speak1.py

・Adjust the speaking rate

(py38_learn_test2) python text2speak2.py

・Adjust the speaking rate / English

(py38_learn_test2) python text2speak3.py

・Read a text file aloud

(py38_learn_test2) python text2speak.py ./sample/soseki.txt
吾輩は猫である。名前はまだ無い。
　どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。吾輩はここで始めて人間というものを見た。しかもあとで聞くとそれは書生という人間中で一番獰悪な種族であったそうだ。この書生というのは時々我々を捕えて煮て食うという話である。しかしその当時は何という考もなかったから別段恐しいとも思わなかった。ただ彼の掌に載せられてスーと持ち上げられた時何だかフワフワした感じがあったばかりである。掌の上で少し落ちついて書生の顔を見たのがいわゆる人間というものの見始であろう。

(py38_learn_test2) python text2speak.py ./sample/jyugemu.txt
和尚の書いてくれた紙を熊さんは読み上げる。「ええー、寿限無寿限無、五劫のすり切れ、海砂利水魚の水行末、雲来末、風来末、食う寝るところに住むところ、やぶらこうじのぶらこうじ、パイポパイポ、パイポのシューリンガン、シューリンガンのグーリンダイ、グーリンダイのポンポコピーのポンポコナの長久命の長助、うーん、こう並べてみるとみんなつけてえ名前ばかりですねえ」

　　「soseki.wav」

　「jyugemu.wav」

Source code

text2speak.py source code

# -*- coding: utf-8 -*-
##------------------------------------------
##  Text to Speak with pyttsx3   Ver 0.01
##
##               2024.11.07 Masahiro Izutsu
##------------------------------------------
## text2speak.py

import sys

import pyttsx3

def read_text(text):
    # Initialize the engine
    engine = pyttsx3.init()

    # Set the speaking rate
    engine.setProperty('rate', 150)

    # Set the volume (0.0 to 1.0)
    engine.setProperty('volume', 0.9)

    # Select the voice (the first voice in the list)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[0].id)

    # Queue the text to be spoken
    engine.say(text)
#    engine.save_to_file(text, 'voice.wav')

    # Run the queued commands and block until speech finishes
    engine.runAndWait()

if __name__ == "__main__":
    # Require the path to a UTF-8 text file as the first argument
    if len(sys.argv) < 2:
        sys.exit("usage: python text2speak.py <text file>")
    filepath = sys.argv[1]

    # Read the whole file; the with block closes it automatically
    with open(filepath, 'r', encoding='UTF-8') as f:
        text = f.read()
    print(text)

    read_text(text)
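
text2speak.py always selects `voices[0]`, and which voice that is depends on the speech engines installed on the machine. A minimal sketch of choosing a voice by keyword instead is shown below. It relies only on the `id` and `name` attributes of pyttsx3 voice objects. The function name `pick_voice_id` and the stand-in voices are hypothetical and exist only so the example runs without a speech engine.

```python
from types import SimpleNamespace


def pick_voice_id(voices, keyword, default_index=0):
    """Return the id of the first voice whose name or id contains keyword.

    Falls back to voices[default_index] when nothing matches.
    """
    needle = keyword.lower()
    for voice in voices:
        haystack = f"{voice.name} {voice.id}".lower()
        if needle in haystack:
            return voice.id
    return voices[default_index].id


# Stand-in voice objects for illustration; real ones would come from
# engine.getProperty('voices') in pyttsx3.
sample_voices = [
    SimpleNamespace(id="voice-en", name="Example English Voice"),
    SimpleNamespace(id="voice-ja", name="Example Japanese Voice"),
]
print(pick_voice_id(sample_voices, "japanese"))
```

Inside `read_text`, this could be used as `engine.setProperty('voice', pick_voice_id(voices, 'Japanese'))`, assuming the installed Japanese voice has "Japanese" in its name or id.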

Issues addressed and error details

Change history

2024/11/12 First version
