
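Animating a face image with audio: One Shot Talking Face (Part 2) == work in progress ==

This page runs "One Shot Talking Face" on a local machine. The technique takes an audio clip and a single face image and produces a video in which the face appears to be speaking.

* Last updated: 2024/11/06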

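Demo program for "One Shot Talking Face"

Building the generation environment on a Windows machine is difficult, so this page presents a demo program that works with results generated on Linux.
On a Linux generation environment, the program can also generate videos for image and audio pairs that have not been processed yet.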

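Overview

An Audio-visual Correlation Transformer is trained to learn the correlation between audio and the motion of facial keypoints.
After training, one-shot generation produces a video from audio and a single face image.
The audio and the face image are fed to the Audio-visual Correlation Transformer, which predicts the motion of the facial keypoints.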

One Shot Talking Face conceptual diagram (from the paper listed below)
<paper>
・One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
<framework>
・https://github.com/camenduru/one-shot-talking-face-colab

Previous experiment
・Animating a face image with audio: One Shot Talking Face

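Steps for generating an audio-driven video from a face image
Step 1: Use the speech recognition engine "pocketsphinx" to produce a text file (the recognition result) from the audio
Step 2: Use the JSON processor "jq" to reshape that text file into a .json file
Step 3: Generate a video from the .json file, matched to the audio and the facial keypoints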

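Setting up the demo environment

Run everything in the virtual environment "py38_learn".
If it does not exist yet, create it by following the "Virtual environment (py38_learn)" procedure.
Download the project package project_talking-face.
・Folders created by extracting the archive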
```
project_talking-face
└─workspace_2
    └─one-shot-talking-face
        ├─results
        │  ├─phone
        │  └─text
        ├─select
        │  ├─audios
        │  └─images
        └─train
```
・Copy the contents of the extracted "project_talking-face/" folder, overwriting existing files, into the following folder:
　Windows → "anaconda_win/"　Linux → "~/"
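After copying, it can help to confirm that the folder tree shown above is complete before starting the demo. The following is a minimal sketch, not part of the project: the helper name `missing_dirs` is hypothetical, and the root path in the usage note assumes the Linux layout `~/workspace_2/one-shot-talking-face`.

```python
from pathlib import Path

# Sub-folders from the tree above, relative to the one-shot-talking-face folder.
EXPECTED_DIRS = [
    "results/phone",
    "results/text",
    "select/audios",
    "select/images",
    "train",
]

def missing_dirs(project_root):
    """Return the expected sub-folders that do not exist under project_root."""
    root = Path(project_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

For example, `missing_dirs(Path.home() / "workspace_2" / "one-shot-talking-face")` returns an empty list when every folder is in place.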

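One Shot Talking Face (GUI) "talk_face.py"

Main features
・Select the audio file from the file list on the left
・Choose the target still image from the thumbnail view on the right
・If a video has already been generated for the pair, its file name appears in the "Talking Video" field
・Selecting a still image whose "Talking Video" field is blank and pressing the "Video" button generates the video *
　* Works only on a Linux environment with a GPU
・Where generation is unavailable, the program runs in "view mode" and can play back results already produced on a GPU machine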
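Whether a video counts as "already generated" comes down to a file-name convention: the source listing on this page builds the name from the image stem and the audio stem. The sketch below restates that rule in a compact form; the function name `talk_video_path` is hypothetical, while the naming pattern follows `get_talk_file` in the listing.

```python
import os

def talk_video_path(result_dir, image_file, audio_file):
    """Build '<result_dir>/<image stem>_<audio stem>.mp4' and report whether it exists."""
    image_stem = os.path.splitext(os.path.basename(image_file))[0]
    audio_stem = os.path.splitext(os.path.basename(audio_file))[0]
    path = f"{result_dir}/{image_stem}_{audio_stem}.mp4"
    return path, os.path.exists(path)

path, exists = talk_video_path(
    "./results", "./select/images/d5.jpg", "./select/audios/obama2.wav")
print(path)  # ./results/d5_obama2.mp4
```

With the default image and audio files, the GUI therefore looks for `./results/d5_obama2.mp4`.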

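Command-line options

Option	Type	Default	Meaning
--audio_file	str	'./select/audios/obama2.wav'	Path to the audio file
--source_image	str	'./select/images/d5.jpg'	Path to the still image file
--result_path	str	'./results'	Output directory
--cpu	flag	False	CPU mode (generation disabled; runs in view mode)
--log	int	3	Log level (-1/0/1/2/3/4/5)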
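The source listing on this page decides between generation mode and view mode from three inputs: whether CUDA is available, whether `--cpu` was passed, and whether the OS is Windows. The sketch below isolates that decision as a pure function so it can be checked without a GPU; the function names are hypothetical, and the rule is read off the listing's entry-point code.

```python
def can_generate(cuda_available, cpu_flag, system_name):
    """Generation is enabled only with CUDA, without --cpu, and not on Windows."""
    return bool(cuda_available) and not cpu_flag and system_name != "Windows"

def window_subtitle(generate_enabled):
    # The listing appends '<view mode>' to the window title when generation is off.
    return "" if generate_enabled else "<view mode>"

# In the real program the inputs would come from
# torch.cuda.is_available(), opt.cpu and platform.system().
print(window_subtitle(can_generate(True, False, "Windows")))  # <view mode>
```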

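Operation

① Audio file selection (clicking an entry plays the audio for 5 seconds)
② Still image selection area (the selected image's file name is shown under "image file")
③ Shows the file name when a video already exists for the chosen audio file and still image
④ Shows the file name of the still image under the mouse cursor
⑤ Shows the generated video file for the hovered image and the current audio file, if one exists

⑥ Generates the talking video for the current selection, or plays it if it already exists
⑦ Exits the application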
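The 5-second preview in ① is implemented in the listing by starting wav_play.py as a child process and ending it once a popup auto-closes. The same idea can be sketched without the GUI: start the child, wait up to a time limit, and stop it if it is still running. This is an illustrative variant, not the author's code; `run_for` is a hypothetical helper, and it passes an argument list instead of a command string so that paths containing spaces are handled.

```python
import subprocess
import sys

def run_for(args, seconds):
    """Start a child process and stop it if it is still running after `seconds`."""
    proc = subprocess.Popen(args)       # argument list, no shell involved
    try:
        proc.wait(timeout=seconds)      # returns early if the child ends by itself
    except subprocess.TimeoutExpired:
        proc.terminate()                # ask the child to stop
        proc.wait()                     # reap it so no zombie process remains
    return proc.returncode
```

A preview call would then look like `run_for([sys.executable, "wav_play.py", "./select/audios/obama2.wav"], 5)`.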

Example run

(py38_learn) python talk_face.py

One Shot Talking Face (GUI) Ver. 0.01: Starting application...
   - audio_file              :  ./select/audios/obama2.wav
   - source_dir              :  ./select/images/d5.jpg
   - result_path             :  ./results
   - cpu                     :  False
   - log                     :  3


Finished.

Before exiting, the program lists the generated files (the ./results folder).

Module source code

▼ "talk_face.py"

# -*- coding: utf-8 -*-
##------------------------------------------
##  One Shot Talking Face (GUI)   Ver 0.01
##
##               2024.10.16 Masahiro Izutsu
##------------------------------------------
## talk_face.py

import warnings
warnings.simplefilter('ignore')

# Color Escape Code ---------------------------
GREEN = '\033[1;32m'
RED = '\033[1;31m'
NOCOLOR = '\033[0m'
YELLOW = '\033[1;33m'
CYAN = '\033[1;36m'
BLUE = '\033[1;34m'

# Imports and initial setup
import os
import argparse
import subprocess
import platform


import cv2
import PySimpleGUI as sg
import tkinter as tk
from PIL import Image, ImageTk

import my_logging
import my_movieplay
import my_thumbnail

from torch.cuda import is_available
gpu_d = is_available()                                          # Check whether a CUDA GPU is available

# Constants
DEF_AUDIO = './select/audios/obama2.wav'
DEF_IMAGE = './select/images/d5.jpg'
RESULT_PATH = './results'
RESULT_PHONE = './results/phone'
DEF_THEME = 'BlueMono'

KEY_IMGFILE = '-ImgFile-'
KEY_WAVFILE = '-WavFile-'
KEY_TALKFILE = '-TalkFile-'
KEY_TXTIMG = '-Video-'
KEY_TALKFILE2 = '-TalkFile2-'
KEY_VIDEO = '-Image-'
KEY_EXIT = '-Exit-'
KEY_PAGE = '-Page-'
KEY_PAGEUP = '-PageUp-'
KEY_PAGEDOWN = '-PageDown-'

THUMB_SIZE = 64
GAP_SIZE = 4
CANVAS_XN = 8
CANVAS_YN = 5

# Titles
title = 'One Shot Talking Face (GUI) Ver. 0.01'
sub_title = ''

# Parses arguments for the application
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--audio_file", default=DEF_AUDIO, help="path to audio file")
    parser.add_argument("--source_image", default=DEF_IMAGE, help="path to source image")
    parser.add_argument("--result_path", default=RESULT_PATH, help="path to output")
    parser.add_argument("--cpu", dest="cpu", action="store_true", help="cpu mode.")
    parser.add_argument('--log', metavar = 'LOG', default = '3', help = 'Log level(-1/0/1/2/3/4/5) Default value is \'3\'')

    return parser

# Print basic information
def display_info(args, title):
    print('\n' + GREEN + title + ': Starting application...' + NOCOLOR)
    print('   - ' + YELLOW + 'audio_file              : ' + NOCOLOR, args.audio_file)
    print('   - ' + YELLOW + 'source_dir              : ' + NOCOLOR, args.source_image)
    print('   - ' + YELLOW + 'result_path             : ' + NOCOLOR, args.result_path)
    print('   - ' + YELLOW + 'cpu                     : ' + NOCOLOR, args.cpu)
    print('   - ' + YELLOW + 'log                     : ' + NOCOLOR, args.log)
    print(' ')

# Build the talking-video file name and check whether the file exists
def get_talk_file(out_path, image_file, wave_file):
    base_dir_pair = os.path.split(image_file)
    i_name, _ = os.path.splitext(base_dir_pair[1])
    base_dir_pair = os.path.split(wave_file)
    a_name, _ = os.path.splitext(base_dir_pair[1])
    path = out_path + '/' + i_name + '_' + a_name + '.mp4'
    flg = os.path.exists(path)
    return path, flg

# ** Create the phoneme .json file from a wav file **
def wav2json(wave_file, logger):
    base_dir_pair = os.path.split(wave_file)
    s_name, _ = os.path.splitext(base_dir_pair[1])
    path = RESULT_PHONE + '/' + s_name + '.json'
    if not os.path.exists(path):
        # Raw string: jq must receive "\\(" etc. exactly as typed on the shell command line
        jq_param = r'[.w[]|{word: (.t | ascii_upcase | sub("<S>"; "sil") | sub("<SIL>"; "sil") | sub("\\(2\\)"; "") | sub("\\(3\\)"; "") | sub("\\(4\\)"; "") | sub("\\[SPEECH\\]"; "SIL") | sub("\\[NOISE\\]"; "SIL")), phones: [.w[]|{ph: .t | sub("\\+SPN\\+"; "SIL") | sub("\\+NSN\\+"; "SIL"), bg: (.b*100)|floor, ed: (.b*100+.d*100)|floor}]}]'

        # "$text" expands to an empty string unless the shell defines a variable named text
        command = f"pocketsphinx -phone_align yes single {wave_file} $text | jq '{jq_param}' > {path}"
        logger.debug(command)
        os.system(command)
        logger.info(f'save json to:  {path}')

    return path

# ** Create the talking video from a still image **
def image2talk(image_file, wave_file, phone_path, out_path, logger):
    import config
    from test_script2 import test_with_input_audio_and_image2

    base_dir_pair = os.path.split(image_file)
    i_name, _ = os.path.splitext(base_dir_pair[1])
    base_dir_pair = os.path.split(wave_file)
    a_name, _ = os.path.splitext(base_dir_pair[1])
    path = out_path + '/' + i_name + '_' + a_name + '.mp4'
    if not os.path.exists(path):
        test_with_input_audio_and_image2(image_file, wave_file, phone_path, config.GENERATOR_CKPT, config.AUDIO2POSE_CKPT, out_path, False)

    return path

# Get the file name from a file path
def path2filename(path):
    base_dir_pair = os.path.split(path)
    filename = base_dir_pair[1]
    return filename

# Main process
def main_process(opt, logger):
    title_a = f'One Shot Talking Face {sub_title}'

    # Create the Thumbnail object
    Thumb = my_thumbnail.Thumbnail(CANVAS_XN, CANVAS_YN, THUMB_SIZE, GAP_SIZE, ['.jpg', '.png', '.bmp'])
    image_file = opt.source_image
    frame = Thumb.initialize(image_file)
    image_file = Thumb.get_sel_file()

    result_path = opt.result_path
    audio_file = opt.audio_file
    audio_files, audio_dir, sel_audio = Thumb.get_file_list(audio_file, ['.wav'])
    logger.debug(f'audio_files = {audio_files}')

    # Window theme
    sg.theme(DEF_THEME)

    canvas = sg.Image(size = Thumb.get_canvas_size(), key='CANVAS')

    # Window layout
    col_left = [
            [sg.Text("Audio File select:", size=(20, 1))],
            [sg.Listbox(audio_files, key='-AudioList-', size = (20, 10), default_values = sel_audio, enable_events=True)],
            [sg.Text("audio file:", size=(20, 1))],
            [sg.Text(path2filename(audio_file), background_color='White', size=(20, 1), key = KEY_WAVFILE)],
            [sg.Text("image file:", size=(20, 1))],
            [sg.Text(path2filename(image_file), background_color='White', size=(20, 1), key = KEY_IMGFILE)],
            [sg.Text("Talking Video:", size=(20, 1))],
            [sg.Text("", background_color='White', size=(20, 1), key = KEY_TALKFILE)],
    ]
    
    col_right = [
            [sg.Text("Image file select:", size=(12, 1)), sg.Text(image_file, background_color='LightSteelBlue1', size=(40, 1), key = KEY_TXTIMG)],
            [canvas],
    ]

    col_btn = [
            [
             sg.Text('', background_color='LightSteelBlue1', size=(28, 1), key = KEY_TALKFILE2),
             sg.Text("", size=(4, 1)),
             sg.Text("Page: 1/1", size=(12, 1), key=KEY_PAGE),
             sg.Button('▼', size=(2, 1), key=KEY_PAGEUP),
             sg.Button('▲', size=(2, 1), key=KEY_PAGEDOWN),
             sg.Text("", size=(4, 1)),
             sg.Button('Video', size=(8, 1), key=KEY_VIDEO),
             sg.Button('Exit', size=(8, 1), key=KEY_EXIT),
             sg.Text("", size=(1, 1))
            ]
    ]

    layout = [[sg.Text("", size=(1, 1)), sg.Text(title_a, size=(34, 1), justification='left', font='Helvetica 16')],
            [sg.Column(col_left, vertical_alignment='top'), sg.Column(col_right, vertical_alignment='top')],
            [sg.Column(col_btn, justification='r') ],
    ]

    # Create the window object
    window = sg.Window(title, layout, finalize=True, return_keyboard_events=True, use_default_focus=False)

    img = cv2.imencode('.png', frame)[1].tobytes()
    window['CANVAS'].update(img)

    # Bind user events to the canvas
    canvas.bind('<Motion>', '_motion')
    canvas.bind('<ButtonPress>', '_click_on')
    canvas.bind('<ButtonRelease>', '_click_off')
    canvas.bind('<Double-Button>', '_double_click')

    page_offset, page_max = Thumb.get_page_max()
    window[KEY_PAGE].update(f'Page: {page_offset + 1}/{page_max}')
    window[KEY_PAGEUP].update(disabled = not Thumb.check_page_up())
    window[KEY_PAGEDOWN].update(disabled = not Thumb.check_page_down())

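    new_make_f = False
    video_play_f = False
    sel_video = ''
    video_file = ''                     # video path for the hovered image; set by the events below
    video_f = False                     # True when video_file exists on disk

    # Event loop
    while True:
        event, values = window.read(timeout=30)

        if new_make_f:
            logger.info('New Talking Video making...')
            phone_file = wav2json(audio_file, logger)
            logger.debug(f'phone_file → {phone_file}')
            video_file = image2talk(image_file, audio_file, phone_file, result_path, logger)
            video_f = os.path.exists(video_file)
            sel_video = video_file      # play the video that was just generated
            window[KEY_TALKFILE].update(path2filename(sel_video))
            logger.info(f'New Talking Video → {sel_video}')
            video_play_f = True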

        if video_play_f:
            my_movieplay.movie_play(sel_video, sel_video)
            window[KEY_EXIT].update(disabled = False)
            video_play_f = False
            new_make_f = False

        if event == KEY_EXIT or event == sg.WIN_CLOSED:
            break

        if event == KEY_PAGEUP:
            logger.debug(f'{event}')
            frame = Thumb.page_up()
            if frame is not None:
                window[KEY_TXTIMG].update('')
                img = cv2.imencode('.png', frame)[1].tobytes()
                window['CANVAS'].update(img)

                page_offset, page_max = Thumb.get_page_max()
                window[KEY_PAGE].update(f'Page: {page_offset + 1}/{page_max}')
                window[KEY_PAGEUP].update(disabled = not Thumb.check_page_up())
                window[KEY_PAGEDOWN].update(disabled = not Thumb.check_page_down())

        if event == KEY_PAGEDOWN:
            logger.debug(f'{event}')
            frame = Thumb.page_down()
            if frame is not None:
                window[KEY_TXTIMG].update('')
                img = cv2.imencode('.png', frame)[1].tobytes()
                window['CANVAS'].update(img)

                page_offset, page_max = Thumb.get_page_max()
                window[KEY_PAGE].update(f'Page: {page_offset + 1}/{page_max}')
                window[KEY_PAGEUP].update(disabled = not Thumb.check_page_up())
                window[KEY_PAGEDOWN].update(disabled = not Thumb.check_page_down())

        if event == 'CANVAS_motion':
            x = canvas.user_bind_event.x
            y = canvas.user_bind_event.y
            filename, frame = Thumb.pixel2file(x, y)
            image_file0 = Thumb.filename2path(filename)
            window[KEY_TXTIMG].update(filename)
            img = cv2.imencode('.png', frame)[1].tobytes()
            window['CANVAS'].update(img)

            video_file, video_f = get_talk_file(result_path, image_file0, audio_file)
            window[KEY_VIDEO].update(disabled = sel_video == '' and not gpu_d)

            if video_f:
                base_dir_pair = os.path.split(video_file)
                name = base_dir_pair[1]
            else:
                name = ''
                video_file = ''

            window[KEY_TALKFILE2].update(name)

        if event == 'CANVAS_click_on':
            image_file, frame = Thumb.select_file()
            window[KEY_IMGFILE].update(path2filename(image_file))
            img = cv2.imencode('.png', frame)[1].tobytes()
            window['CANVAS'].update(img)
            logger.debug(f'{event}  {image_file}')


            if video_f:
                base_dir_pair = os.path.split(video_file)
                name = base_dir_pair[1]
            else:
                name = ''
                video_file = ''

            sel_video = video_file
            window[KEY_TALKFILE].update(name)
            window[KEY_VIDEO].update(disabled = sel_video == '' and not gpu_d)

        if event == KEY_VIDEO:
            if os.path.exists(sel_video):
                video_play_f = True
                logger.debug(video_file)
                window[KEY_VIDEO].update(disabled = True)
                window[KEY_EXIT].update(disabled = True)

            elif gpu_d:
                window[KEY_EXIT].update(disabled = True)
                window[KEY_TALKFILE].update('Talk Video making...')
                new_make_f = True

        if event == 'CANVAS_double_click':
            if os.path.exists(sel_video):
                video_play_f = True
                logger.debug(video_file)
                window[KEY_VIDEO].update(disabled = True)
                window[KEY_EXIT].update(disabled = True)

            elif gpu_d:
                window[KEY_EXIT].update(disabled = True)
                window[KEY_TALKFILE].update('Talk Video making...')
                new_make_f = True

        if event == 'CANVAS_click_off':
            pass

        if event == '-AudioList-':
            name = values['-AudioList-'][0]
            window[KEY_WAVFILE].update(name)
            audio_file = os.path.join(audio_dir, name)

            video_file, video_f = get_talk_file(result_path, image_file, audio_file)
            if video_f:
                base_dir_pair = os.path.split(video_file)
                name = base_dir_pair[1]
            else:
                name = ''
                video_file = ''

            sel_video = video_file
            window[KEY_TALKFILE].update(name)
            window[KEY_TALKFILE2].update('')
            window[KEY_VIDEO].update(disabled = True)
            window[KEY_EXIT].update(disabled = True)

            if platform.system()=='Windows':
                cmd = "python wav_play.py " + audio_file
                pro = subprocess.Popen(cmd)
                sg.popup_no_buttons(audio_file, background_color='#ffffff', auto_close=True, auto_close_duration=5, no_titlebar=True)
                pro.terminate()
            else:
                cmd = "python wav_play.py " + audio_file
                pro = subprocess.Popen('exec ' + cmd, shell = True)
                sg.popup_no_buttons(audio_file, background_color='#ffffff', auto_close=True, auto_close_duration=5, no_titlebar=True)
                pro.kill()

            window[KEY_VIDEO].update(disabled = not video_f and not gpu_d)
            window[KEY_EXIT].update(disabled = False)

            logger.debug(audio_file)

    # Close the window
    window.close()


# Entry point (execution starts here)
if __name__ == "__main__":
    parser = parse_args()
    opt = parser.parse_args()

    # Application log setup
    module = os.path.basename(__file__)
    module_name = os.path.splitext(module)[0]
    logger = my_logging.get_module_logger_sel(module_name, int(opt.log))

    if opt.cpu or platform.system()=='Windows':
        gpu_d = False

    sub_title = '' if gpu_d else '<view mode>'
    display_info(opt, title)

    main_process(opt, logger)

    msg = 'Results: ' + os.getcwd() + opt.result_path[1:]
    my_thumbnail.file_dialog(file_path=opt.result_path, title=msg, theme=DEF_THEME, xn=10, yn=4, thumb_size=128, gap=4, ret='Exit', audio_f=True, logger=logger)

    logger.info('\nFinished.\n')

▼ "wav_play.py"

# -*- coding: utf-8 -*-
##------------------------------------------
##  Wave File Play audio   Ver 0.01
##
##               2024.10.22 Masahiro Izutsu
##------------------------------------------
## wav_play.py

import sys
import wave
import pyaudio

args = sys.argv

wf = wave.open(str(args[1]), "r")
p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True)

# Write the audio to the output stream chunk by chunk
chunk = 1024
data = wf.readframes(chunk)
while data != b'':
    stream.write(data)
    data = wf.readframes(chunk)
stream.close()
p.terminate()
wf.close()

Testing the speech recognition engine pocketsphinx: "talk_text.py"

Install the required package
```
(py38_learn) pip install pocketsphinx
```

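Command-line options

Option	Type	Default	Meaning
--audio_file	str	'./select/audios/obama2.wav'	Path to the audio file
--result_path	str	'./results'	Output directory
--log	int	3	Log level (-1/0/1/2/3/4/5)

Operation

① Audio file selection
② Display area for the recognized text
③ Plays the selected audio file (or stops it if it is already playing)
④ Converts the selected audio file to text
⑤ Message display area
⑥ Exits the application

Example run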

(py38_learn) python talk_text.py

Talk to text (GUI) Ver. 0.01: Starting application...
   - audio_file              :  ./select/audios/obama2.wav
   - result_path             :  ./results
   - log                     :  3


Finished.

Example run of the command-line application "speak2text.py"

(py38_learn) python speak2text.py ./select/audios/obama2.wav

hi everybody but i am thank you for that you've won too much like could not be prouder of everything you got your time with the obama foundation
and of course i couldn't be prouder of all of you in the graduating class of twenty twenty
four teachers and coaches
most of all parents and family who guided you won't work
op graduating is a big achievement on or any circumstances
some of you had overcome serious obstacles long way
were there was no lose work or losing a job
living in a neighborhood where people to watch

Module source code

▼ "talk_text.py"

# -*- coding: utf-8 -*-
##------------------------------------------
##  Talk to text (GUI)      Ver 0.01
##       (PocketSphinx test)
##
##               2024.11.05 Masahiro Izutsu
##------------------------------------------
## talk_text.py

import warnings
warnings.simplefilter('ignore')

# Color Escape Code ---------------------------
GREEN = '\033[1;32m'
RED = '\033[1;31m'
NOCOLOR = '\033[0m'
YELLOW = '\033[1;33m'
CYAN = '\033[1;36m'
BLUE = '\033[1;34m'

# Imports and initial setup
import os
import argparse
import subprocess
import platform

import cv2
import PySimpleGUI as sg
import tkinter as tk
from PIL import Image, ImageTk

import my_logging
import my_thumbnail

# Constants
DEF_AUDIO = './select/audios/obama2.wav'
RESULT_PATH = './results'
TEXT_PATH = './results/text'
DEF_THEME = 'BlueMono'

KEY_TALK = '-Talk-'
KEY_OK = '-Ok-'
KEY_EXIT = '-Exit-'
KEY_TEXT = '-Text-'
KEY_MSGTXT = '-Messege-'

# Titles
title = 'Talk to text (GUI) Ver. 0.01'
sub_title = ''

# Parses arguments for the application
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--audio_file", default=DEF_AUDIO, help="path to audio file")
    parser.add_argument("--result_path", default=RESULT_PATH, help="path to output")
    parser.add_argument('--log', metavar = 'LOG', default = '3', help = 'Log level(-1/0/1/2/3/4/5) Default value is \'3\'')

    return parser

# Print basic information
def display_info(args, title):
    print('\n' + GREEN + title + ': Starting application...' + NOCOLOR)
    print('   - ' + YELLOW + 'audio_file              : ' + NOCOLOR, args.audio_file)
    print('   - ' + YELLOW + 'result_path             : ' + NOCOLOR, args.result_path)
    print('   - ' + YELLOW + 'log                     : ' + NOCOLOR, args.log)
    print(' ')

# Get the file name from a file path
def path2filename(path):
    base_dir_pair = os.path.split(path)
    filename = base_dir_pair[1]
    return filename

# Main process
def main_process(opt, logger):
    title_a = title

    # Create the Thumbnail object (used here only for its file-list helper)
    Thumb = my_thumbnail.Thumbnail(0, 0, 0, 0, [])

    result_path = opt.result_path
    audio_file = opt.audio_file
    audio_files, audio_dir, sel_audio = Thumb.get_file_list(audio_file, ['.wav'])
    logger.debug(f'audio_files = {audio_files}')

    # Window theme
    sg.theme(DEF_THEME)

    # Window layout
    col_left = [
            [sg.Text("Audio File select:", size=(20, 1))],
            [sg.Listbox(audio_files, key='-AudioList-', size = (20, 15), default_values = sel_audio, enable_events=True)],
    ]
    
    col_right = [
            [sg.Multiline(size=(40,18), key=KEY_TEXT)]
    ]

    col_btn = [
            [
             sg.Text('', size=(18, 1), text_color='#008800', background_color='LightSteelBlue1', key=KEY_MSGTXT),
             sg.Text("", size=(2, 1)),
             sg.Button('Talk', size=(8, 1), key=KEY_TALK),
             sg.Button('Text', size=(8, 1), key=KEY_OK),
             sg.Text("", size=(1, 1)),
             sg.Button('Exit', size=(8, 1), key=KEY_EXIT),
             sg.Text("", size=(1, 1))
            ]
    ]

    layout = [[sg.Text("", size=(1, 1)), sg.Text(title_a, size=(34, 1), justification='left', font='Helvetica 16')],
            [sg.Column(col_left, vertical_alignment='top'), sg.Column(col_right, vertical_alignment='top')],
            [sg.Column(col_btn, justification='r') ],
    ]

    # Wave Open
    def open_wav(filepath):
        cmd = "python wav_play.py " + filepath
        logger.debug(cmd)
        if platform.system()=='Windows':
            pros = subprocess.Popen(cmd)
        else:
            pros = subprocess.Popen('exec ' + cmd, shell = True)
        return pros

    # Wave Close
    def close_wav(pros):
        # Use the argument, not the enclosing variable, so the helper works for any process
        if pros is not None:
            if platform.system()=='Windows':
                pros.terminate()
            else:
                pros.kill()
            pros = None
        return pros


    # Create the window object
    window = sg.Window(title, layout, finalize=True, return_keyboard_events=True, use_default_focus=False)

    new_make_f = False
    process = None

    # Event loop
    while True:
        event, values = window.read(timeout=30)

        if new_make_f:
            # Strip the extension with splitext instead of assuming it is 4 characters long
            stem = os.path.splitext(os.path.basename(audio_file))[0]
            txt_path = TEXT_PATH + '/' + stem + '.txt'
            cmd = f'python speak2text.py {audio_file} > {txt_path}'
            logger.debug(cmd)
            os.system(cmd)
            window[KEY_MSGTXT].update('')

            with open(txt_path) as f:
                s = f.read()
            window[KEY_TEXT].update(s)
            close_wav(process)
            process = open_wav(audio_file)

            window[KEY_OK].update(disabled = False)
            window[KEY_EXIT].update(disabled = False)
            new_make_f = False

        if event == KEY_EXIT or event == sg.WIN_CLOSED:
            close_wav(process)
            break

        if event == KEY_OK:
            logger.debug(event)
            process = close_wav(process)
            window[KEY_TEXT].update('')
            window[KEY_OK].update(disabled = True)
            window[KEY_EXIT].update(disabled = True)
            window[KEY_MSGTXT].update('Wav file to text ...')
            new_make_f = True

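        if event == KEY_TALK:
            # Start playback if nothing is playing (or the previous playback has already ended)
            if process is None or process.poll() is not None:
                process = open_wav(audio_file)
            else:
                process = close_wav(process)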

        if event == '-AudioList-':
            process = close_wav(process)
            name = values['-AudioList-'][0]
            audio_file = os.path.join(audio_dir, name)
            logger.debug(audio_file)

    # Close the window
    window.close()


# Entry point (execution starts here)
if __name__ == "__main__":
    parser = parse_args()
    opt = parser.parse_args()

    # Application log setup
    module = os.path.basename(__file__)
    module_name = os.path.splitext(module)[0]
    logger = my_logging.get_module_logger_sel(module_name, int(opt.log))

    display_info(opt, title)

    main_process(opt, logger)

    logger.info('\nFinished.\n')

▼ "speak2text.py"

# -*- coding: utf-8 -*-
##------------------------------------------
##  Speak to text           Ver 0.01
##       (PocketSphinx test)
##
##               2024.09.05 Masahiro Izutsu
##------------------------------------------
## speak2text.py

# Color Escape Code
RED = '\033[1;31m'
NOCOLOR = '\033[0m'

import os
import sys

args = sys.argv
filepath = args[1]
if not os.path.exists(filepath):
    print(f'{RED}Error: file not found !!{NOCOLOR}')
    sys.exit(1)                     # non-zero exit status signals the failure

from pocketsphinx import AudioFile
for phrase in AudioFile(filepath):
    print(phrase)

JSON processor "jq"

Installation on Windows
1. Use winget, which is available as a built-in command

(py38_learn) winget search jq
名前           ID           バージョン ソース
-----------------------------------------------
JQuery参考手册 9NBLGGH4P48H Unknown    msstore
jq             jqlang.jq    1.7.1      winget

(py38_learn) winget install jqlang.jq
見つかりました jq [jqlang.jq] バージョン 1.7.1
このアプリケーションは所有者からライセンス供与されます。
Microsoft はサードパーティのパッケージに対して責任を負わず、ライセンスも付与しません。
ダウンロード中 https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-windows-amd64.exe
  ██████████████████████████████   962 KB /  962 KB
インストーラーハッシュが正常に検証されました
パッケージのインストールを開始しています...
コマンド ライン エイリアスが追加されました: "jq"
パス環境変数が変更されました; 新しい値を使用するにはシェルを再起動してください。
インストールが完了しました

(py38_learn) jq --help
jq - commandline JSON processor [version 1.7.1]
    :

(Reference) Uninstalling

(py38_learn) winget uninstall jqlang.jq
見つかりました jq [jqlang.jq]
パッケージのアンインストールを開始しています...
正常にアンインストールされました

2. Download the Windows build from the official site and use it directly
Rename jq-windows-amd64.exe to jq-win.exe and place it in a directory that is on the PATH

(py38_learn) jq-win -V
jq-1.7.1

Installation on Linux

(py38_learn) sudo apt install jq
[sudo] XXXX のパスワード: 
パッケージリストを読み込んでいます... 完了
    :
(py38_learn) jq -V
jq-1.6

Running the filter on the "One Shot Talking Face" test data

(py38_learn) cat ./train/audio.txt | jq-win '[.w[]|{word: (.t | ascii_upcase | sub(\"<S>\"; \"sil\") | sub(\"<SIL>\"; \"sil\") | sub(\"\\(2\\)\"; \"\") | sub(\"\\(3\\)\"; \"\") | sub(\"\\(4\\)\"; \"\") | sub(\"\\[SPEECH\\]\"; \"SIL\") | sub(\"\\[NOISE\\]\"; \"SIL\")), phones: [.w[]|{ph: .t | sub(\"\\+SPN\\+\"; \"SIL\") | sub(\"\\+NSN\\+\"; \"SIL\"), bg: (.b*100)|floor, ed: (.b*100+.d*100)|floor}]}]'
[
  {
    "word": "sil",
    "phones": [
      {
        "ph": "SIL",
        "bg": 0,
        "ed": 13
      }
    :

(py38_learn_test2) cat ./train/audio.txt | jq-win '[.w[]|{word: (.t | ascii_upcase | sub(\"<S>\"; \"sil\") | sub(\"<SIL>\"; \"sil\") | sub(\"\\(2\\)\"; \"\") | sub(\"\\(3\\)\"; \"\") | sub(\"\\(4\\)\"; \"\") | sub(\"\\[SPEECH\\]\"; \"SIL\") | sub(\"\\[NOISE\\]\"; \"SIL\")), phones: [.w[]|{ph: .t | sub(\"\\+SPN\\+\"; \"SIL\") | sub(\"\\+NSN\\+\"; \"SIL\"), bg: (.b*100)|floor, ed: (.b*100+.d*100)|floor}]}]' > ./train/audio.json
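The jq filter is dense, so it may help to see the same transformation written out step by step. The sketch below is a Python reading of the filter, not a replacement used by the project: it assumes the input file holds a single pocketsphinx result object with the `w`, `t`, `b` and `d` fields shown on this page, and the function names are hypothetical. Like jq's `sub`, each rule replaces only the first match. Python's `upper()` stands in for `ascii_upcase`, which behaves the same for ASCII text.

```python
import json
import math
import re

# (pattern, replacement) pairs applied in order, first match only.
WORD_RULES = [
    (r"<S>", "sil"), (r"<SIL>", "sil"),
    (r"\(2\)", ""), (r"\(3\)", ""), (r"\(4\)", ""),
    (r"\[SPEECH\]", "SIL"), (r"\[NOISE\]", "SIL"),
]
PHONE_RULES = [(r"\+SPN\+", "SIL"), (r"\+NSN\+", "SIL")]

def _apply(rules, text):
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text, count=1)
    return text

def to_phoneme_list(result):
    """Convert one pocketsphinx result object into the word/phones list."""
    words = []
    for w in result["w"]:
        phones = [
            {
                "ph": _apply(PHONE_RULES, p["t"]),
                "bg": math.floor(p["b"] * 100),               # start, in 1/100 s
                "ed": math.floor(p["b"] * 100 + p["d"] * 100),  # end, in 1/100 s
            }
            for p in w["w"]
        ]
        words.append({"word": _apply(WORD_RULES, w["t"].upper()), "phones": phones})
    return words

def convert_file(src, dst):
    """Read a pocketsphinx result file and write the reshaped .json file."""
    with open(src, encoding="utf-8") as f:
        result = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(to_phoneme_list(result), f, indent=2)
```

With the file names used above, the call would be `convert_file("./train/audio.txt", "./train/audio.json")`.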

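Setting up "One Shot Talking Face" on a local machine

This section ports "Animating a face image with audio using One Shot Talking Face" to a local machine.

Environment setup on Linux

Run everything in the virtual environment "py38_learn".
If it does not exist yet, create it by following the "Virtual environment (py38_learn)" procedure.
Install the package needed to clone large files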
```
(py38_learn) sudo apt install git-lfs
```

Installing pocketsphinx

(py38_learn) git clone https://github.com/cmusphinx/pocketsphinx.git
(py38_learn) cd pocketsphinx/
(py38_learn) cmake -S . -B build
(py38_learn) cmake --build build
(py38_learn) sudo cmake --build build --target install

(py38_learn) pocketsphinx
Usage: pocketsphinx [PARAMS] [soxflags | config | help | help-config | live | single | align] INPUTS...
Examples:
    sox input.mp3 $(pocketsphinx soxflags) | pocketsphinx single -
    sox -qd $(pocketsphinx soxflags) | pocketsphinx live -
    pocketsphinx single INPUT
    pocketsphinx align INPUT WORDS...

For detailed PARAMS values, run pocketsphinx help-config

Download the project from Hugging Face

(py38_learn) git clone https://huggingface.co/camenduru/pocketsphinx-20.04-t4 pocketsphinx

Download the project package project_talking-face
・Folders created by extracting the archive
```
project_talking-face
└─workspace_2
    └─one-shot-talking-face
        ├─results
        │  ├─phone
        │  └─text
        ├─select
        │  ├─audios
        │  └─images
        └─train
```
・Copy the contents of the extracted "project_talking-face/" folder, overwriting existing files, under "~/"

Install the missing packages

(py38_learn) sudo apt install jq
(py38_learn) pip install python_speech_features
(py38_learn) pip install pyworld

Checking each step up to video generation

Step 1: Use the speech recognition engine "pocketsphinx" to produce the recognition text from the audio

(py38_learn) pocketsphinx -phone_align yes single ./train/audio.wav
{"b":0.000,"d":40.000,"p":0.000,"t":"hi everybody and i thank you for that you've won too much like could not be prouder of everything you got your time with the obama foundation then of course i couldn't be prouder of all of you in the graduating class of twenty twenty those walls the teachers and coaches the most of all parents and family who guided you won't why not graduating is a big achievement under any circumstances someone to get over com serious obstacles long wet weather was no loose worker whose good job now we're living in a neighborhood where people to walk","w":[{"b":0.000,"d":0.130,"p":0.981,"t":"<s>","w":[{"b":0.000,"d":0.130,"p":0.981,"t":"SIL"}]},{"b":0.130,"d":0.280,"p":0.985,"t":"<sil>","w":[{"b":0.130,"d":0.280,"p":0.985,"t":"SIL"}]},{"b":0.410,"d":0.170,"p":0.954,"t":"hi","w":[{"b":0.410,"d":0.070,"p":0.981,"t":"HH"},{"b":0.480,"d":0.100,"p":0.972,"t":"AY"}]},{"b":0.580,"d":0.470,"p":0.876,"t":"everybody","w":[{"b":0.580,"d":0.050,"p":0.989,"t":"EH"},{"b":0.630,"d":0.080,"p":0.971,"t":"V"},{"b":0.710,"d":0.050,"p":0.990,"t":"R"},{"b":0.760,"d":0.030,"p":0.990,"t":"IY"},{"b":0.790,"d":0.060,"p":0.995,"t":"B"},{"b":0.850,"d":0.060,"p":0.991,"t":"AA"},{"b":0.910,"d":
    :

Step 2: Step 1 plus reshaping the recognition output into a .json file with the JSON processor "jq"

(py38_learn) mizutu@ubuntu-lat:~/workspace_2/one-shot-talking-face$ pocketsphinx -phone_align yes single ./train/audio.wav $text | jq '[.w[]|{word: (.t | ascii_upcase | sub("<S>"; "sil") | sub("<SIL>"; "sil") | sub("\\(2\\)"; "") | sub("\\(3\\)"; "") | sub("\\(4\\)"; "") | sub("\\[SPEECH\\]"; "SIL") | sub("\\[NOISE\\]"; "SIL")), phones: [.w[]|{ph: .t | sub("\\+SPN\\+"; "SIL") | sub("\\+NSN\\+"; "SIL"), bg: (.b*100)|floor, ed: (.b*100+.d*100)|floor}]}]' > test.json

Step 3: Generate a video from the .json file, matched to the audio and the facial keypoints

(py38_learn) python -B test_script.py --img_path ./train/image.png --audio_path ./train/audio.wav --phoneme_path ./test.json --save_dir ./train
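When the three steps are driven from a script, building the commands as argument lists avoids shell quoting problems with file names. The sketch below only assembles the Step 1 and Step 3 commands exactly as typed above; `build_commands` is a hypothetical helper, and Step 2 (the jq filter) is left out because it is a pipe stage rather than a separate command.

```python
def build_commands(audio_path, image_path, phoneme_path, save_dir):
    """Return the argument lists for Step 1 (recognition) and Step 3 (generation)."""
    recognize = ["pocketsphinx", "-phone_align", "yes", "single", audio_path]
    generate = [
        "python", "-B", "test_script.py",
        "--img_path", image_path,
        "--audio_path", audio_path,
        "--phoneme_path", phoneme_path,
        "--save_dir", save_dir,
    ]
    return recognize, generate

recognize, generate = build_commands(
    "./train/audio.wav", "./train/image.png", "./test.json", "./train")
print(" ".join(recognize))  # pocketsphinx -phone_align yes single ./train/audio.wav
```

Each list can be handed to `subprocess.run(..., check=True)` from the project folder.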

Problems encountered and error details

ValueError: numpy.ndarray size changed

Error details

(py38_learn) python -B test_script.py --img_path ./train/image.png --audio_path ./train/audio.wav --phoneme_path ./test.json --save_dir ./train
Traceback (most recent call last):
  File "test_script.py", line 12, in <module>
    from tools.interface import read_img,get_img_pose,get_pose_from_audio,get_audio_feature_from_audio,\
  File "/home/mizutu/workspace_2/one-shot-talking-face/tools/interface.py", line 12, in <module>
    import pyworld
  File "/home/mizutu/anaconda3/envs/py38_learn/lib/python3.8/site-packages/pyworld/__init__.py", line 17, in <module>
    from .pyworld import *
  File "pyworld/pyworld.pyx", line 1, in init pyworld.pyworld
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject

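Cause and fix
・The message points to a binary incompatibility: the installed pyworld extension appears to have been built against a newer numpy than the numpy 1.19.5 in the environment, so numpy is upgraded to 1.23.0
・pip reports a dependency conflict (openvino requires numpy<1.20), which is ignored for this experiment
・Incidentally, the same step is currently needed on Google Colab as well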

(py38_learn_test) mizutu@ubuntu-HP-ENVY:~/workspace_2$ pip install numpy==1.23.0
Collecting numpy==1.23.0
  Using cached numpy-1.23.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)
Using cached numpy-1.23.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openvino 2022.1.0 requires numpy<1.20,>=1.16.6, but you have numpy 1.23.0 which is incompatible.
Successfully installed numpy-1.23.0
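Before running the generation script, the installed numpy version can be compared with the one that worked here. The sketch below is an illustration only: 1.23.0 is simply the version this page installed, not a verified minimum, and the helper names are hypothetical. `importlib.metadata` is part of the standard library from Python 3.8.

```python
import re

def version_tuple(version):
    """Turn a version string such as '1.19.5' into a tuple of ints: (1, 19, 5)."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)

def is_older(installed, required="1.23.0"):
    """True when the installed version is lower than the required one."""
    return version_tuple(installed) < version_tuple(required)

def installed_numpy_version():
    """Return the installed numpy version string, or None if numpy is absent."""
    from importlib import metadata
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None

print(is_older("1.19.5"))  # True
```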

OSError: [Errno 8] Exec format error

Error details

(py38_learn_test) python -B test_script.py --img_path ./train/image.png --audio_path ./train/audio.wav --phoneme_path ./train/test.json --save_dir ./train
Traceback (most recent call last):
  File "test_script.py", line 180, in <module>
    test_with_input_audio_and_image(args.img_path,args.audio_path,phoneme,config.GENERATOR_CKPT,config.AUDIO2POSE_CKPT,args.save_dir)
        :
  File "/home/mizutu/anaconda3/envs/py38_learn_test/lib/python3.8/subprocess.py", line 1720, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: './OpenFace/FeatureExtraction'

Cause and fix
・"git lfs" was not installed, so the large files were not actually downloaded when the project was cloned
・After installing "git lfs", clone the project again

(py38_learn_test) sudo apt update
    :
(py38_learn_test) sudo apt-get install git-lfs
    :
以下のパッケージが新たにインストールされます:
  git-lfs
    :
(py38_learn_test) git lfs install
Updated git hooks.
Git LFS initialized.

(py38_learn_test) git clone https://huggingface.co/camenduru/one-shot-talking-face-20.04-t4 one-shot-talking-face
    :
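A clone made without Git LFS leaves small text "pointer" files where the large binaries should be, which is why the executable could not be run. Such a pointer starts with the line `version https://git-lfs.github.com/spec/v1`, so it can be detected directly. The helper below is a hypothetical sketch based on that format.

```python
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """True if the file is a Git LFS pointer stub rather than the real content."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_HEADER))
    return head == LFS_HEADER
```

If `is_lfs_pointer("./OpenFace/FeatureExtraction")` returns True, the binary was never downloaded and the project needs to be cloned again with Git LFS installed.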
