Image Generation AI "ComfyUI" 9 (Video Part 4)　== Work in progress ==

　Verifying local AI image generation with "ComfyUI"


※ Last updated: 2026/05/15

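LTX-2.3: video generation with audio

　A video generation model with audio support, announced in March 2026.
　It is reported to be a major improvement over "LTX-2", announced in January. ComfyUI supports it natively, so it is tested here.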

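Overview

What is "LTX-2.3"?
- A high-performance open-source video generation AI model developed by Lightricks (Israel) and released in March 2026
- Video quality and prompt comprehension are greatly improved over the previous model (LTX-2)

Main features
- Fast, high-quality video generation: generates video and audio together and is designed to run quickly even in a local environment
- High resolution and long clips: supports 4K output and long-duration video generation
- Audio integration: feeding an image and audio together enables lip sync and motion that follows a song
- Better prompt comprehension: follows the prompt more faithfully than the previous-generation LTX-2
- Suited to local use: natively supported in ComfyUI and runnable on a personal PC (GPU)

Differences from the previous generation and evaluation
- Compared with WAN 2.2: the rival local model "WAN 2.2" is strong in cinematic motion and image quality, while LTX-2.3 is strong in generation speed and output stability
- Use cases: suited to storyboarding and to quickly trying out prompts (iteration)

How to use
- It is typically run on a local PC through ComfyUI
- Download the model and additional weights (ID-LoRA, etc.), then set up a workflow that generates video and audio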

Prerequisites (from the official documentation)
- ComfyUI installed
- CUDA-compatible GPU with 32GB+ VRAM
- 100GB+ free disk space for models and cache
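
The disk-space prerequisite can be checked before downloading anything. The following is a minimal sketch using only the Python standard library; the function names and the default path are assumptions for illustration, and the VRAM requirement is not checked here.

```python
import shutil

REQUIRED_FREE_GB = 100  # from the prerequisites quoted above


def free_disk_gb(path: str = ".") -> float:
    """Return the free space on the drive that holds `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True when the drive holding `path` has at least `required_gb` GB free."""
    return free_disk_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"free: {free_disk_gb():.1f} GB, enough: {has_enough_disk()}")
```

Pass the drive that holds the model folders as `path`, since that is where the space is needed.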

Official site

(Reference) Distilled version for low-VRAM environments
- LTX-2.3 also includes a distilled version that runs in 8 steps
- It can run at a Classifier-Free Guidance (CFG) value of 1, which makes it much faster than the full model

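Workflows created in this project

The workflows and related data created in this project are uploaded at the link below (download again if they have been updated)

Download ComfyUI_ex_proj.zip (updated as needed) ※ Updated 2026/05/08
・Folders produced by extracting the archive

📂ComfyUI
  ├─📂input　　　　　　　　　　　　　　← input images used by the workflows
  └─📂user
        └─📂default
              └─📂workflows　　　　　　　　← where workflows are saved
                    :
                    ├─📂_video
                    ├─📂_video2
                    ├─📂LTX 　　　　　　　 ← workflows created in this chapter
                    :

・Copy the extracted "ComfyUI/" folder over "StabilityMatrix/Data/Packages/ComfyUI", overwriting existing files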
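
The overwrite copy can also be scripted. This is a minimal sketch, assuming Python 3.8 or later (for `dirs_exist_ok`); the function name `merge_into_comfyui` and the paths in the comment are illustrative, not part of the project files.

```python
import shutil
from pathlib import Path


def merge_into_comfyui(extracted: Path, target: Path) -> None:
    """Copy `extracted` over `target`.

    Files with the same relative path are overwritten; files that exist
    only in `target` are left untouched.
    """
    shutil.copytree(extracted, target, dirs_exist_ok=True)


# Example call (paths are assumptions, adjust to the actual install location):
# merge_into_comfyui(Path("ComfyUI_ex_proj/ComfyUI"),
#                    Path("StabilityMatrix/Data/Packages/ComfyUI"))
```

Because existing files are overwritten silently, back up any workflows you edited under the same names first.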

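Workflow categories

　The baseline for testing is an NVIDIA GPU with 8GB of VRAM or less and a generation time of roughly 10 minutes or less

Category	Content	Description
A	Text to Video	Generates video from text (a prompt), with a prompt enhancer
B	Image to Video	Generates video from a still image (photo or illustration)
C	Text to Video (expanded)	A with its subgraph expanded
D	Image to Video (expanded)	B with its subgraph expanded
E	Text/Image to Video Single stage	Generates video from text or a still image in 1 stage
E1	Text/Image to Video Two stage	Generates video from text or a still image in 2 stages ※
E2	Text to Video Three stage	Generates video from text in 3 stages ※
E3	Image to Video Three stage	Generates video from a still image in 3 stages ※
E4	Image/Audio to Video Three stage	Generates video from audio and a still image in 3 stages ※
F	Image Audio to Video	Generates video from a still image (photo or illustration) and an audio file
G	Image Audio to Video (expanded)	F with its subgraph expanded
H	FLF2V	Generates video from two images, the first and last frames
I	FLF2V (expanded)	H with its subgraph expanded
J	Style Transition	Generates a scene-to-scene transition video from two images
K	Style Transition (expanded)	J with its subgraph expanded
L	ID LoRA	Generates video from one image and a short audio clip
M	ID LoRA (expanded)	L with its subgraph expanded
Z	Text/Image to Video Auto	Generates video from text or a still image with automatic prompt input ※

　　※ No usable result with the full model because the environment falls short (out of memory); barely usable with the distilled and GGUF versions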

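・Model types used
　　→ Main differences between the "dev" and "distilled" models
　　→ Main differences between the "fp8" and "GGUF" models

Model	Content	Description
dev	Regular version (standard)	Open-weight base model for video and audio generation, with 22B (22 billion) parameters
distilled	Distilled version	Distilled for low-VRAM environments; runs in 8 steps and is much faster
fp8	8-bit quantized model	Model reduced in size to speed up inference → What is a quantized model
GGUF	GGUF quantized model	Model whose size is reduced by mixed-precision quantization to fit low-VRAM environments → About GGUF models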

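Generation time by workflow and environment (min:sec)　　Recommended lightweight workflows　　　Recommended lightweight GGUF workflows　

Category	Workflow	Function	Model	GPU					CPU
Category	Workflow	Function	Model	RTX 4070	RTX 4060	RTX 4060L	RTX 3050	GTX 1050	i7-1260P	i7-1185G7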
A	5300_LTX-2.3_t2v_dev	Text to Video basic workflow	fp8 dev	06:07.31	22:16.79	24:22.67	72:49.88	Not supported
B	5301_LTX-2.3_i2v_dev	Image to Video basic workflow	fp8 dev	04:11.95	22:10.73	22:28.35	63:04.92
C	5302_LTX-2.3_t2v_dev_simple	Text to Video basic (simple)	fp8 dev	04:28.46	28:22.08	27:20.67	70:07.86
D	5303_LTX-2.3_i2v_dev_simple	Image to Video basic (simple)	fp8 dev	04:15.03	22:55.24	20:50.82	70:09.77
E	5304_LTX-2.3_T2V_I2V_1st_dev	Text/Image to Video (dev)	fp8 dev	30:18.01	222:42.62	208:45.76	535:52.72
F	5310_LTX-2.3_ia2v_dev	Image Audio to Video basic workflow	fp8 dev	05:38.84	18:18.31	20:56.54	31:51.12
G	5311_LTX-2.3_ia2v_dev_simple	Image Audio to Video basic (simple)	fp8 dev	03:15.90	28:57.83	22:01.73	31:00.36
A	5340_LTX-2.3_t2v_dev_GGUF	Text to Video (GGUF)	GGUF dev	04:32.22	14:21.13	09:56.69	17:00.79
B	5341_LTX-2.3_i2v_dev_GGUF	Image to Video (GGUF)	GGUF dev	02:49.95	06:14.43	08:17.38	11:40.02
C	5342_LTX-2.3_T2V_dev_GGUF	Text to Video (GGUF, expanded)	GGUF dev	02:56.61	05:57.39	08:12.25	10:34.54
D	5343_LTX-2.3_I2V_dev_GGUF	Image to Video (GGUF, expanded)	GGUF dev	03:23.88	05:35.69	08:32.48	10:54.41
E	5344_LTX-2.3_T2V_I2V_1st_dev_GF	Text/Image to Video (GGUF dev)	GGUF dev	36:53.21	65:50.52	108:54.75	123:03.22
F	5350_ltx-2.3_ia2v_dev_GGUF	Image Audio to Video (GGUF)	GGUF dev	07:04.70	11:52.07	15:16.69	18:40.09
G	5351_ltx-2.3_ia2v_dev_GGUF	Image Audio to Video (GGUF, expanded)	GGUF dev	05:25.49	10:16.69	13:29.41	16:41.47
A	5400_LTX-2.3_t2v_distilled	Text to Video basic workflow	fp8 distilled	01:25.56	06:05.99	06:43.14	19:13.23
B	5401_LTX-2.3_i2v_distilled	Image to Video basic workflow	fp8 distilled	01:58.72	06:16.17	06:56.02	19:07.36
C	5402_LTX-2.3_t2v_distil_simple	Text to Video basic (simple)	fp8 distilled	01:43.82	06:20.09	06:36.58	17:33.27
C	5402v2_LTX-2.3_t2v_distil_simple	Text to Video basic (simple)	fp8 distilled	01:43.82	06:20.09	06:36.58	17:33.27
D	5403_LTX-2.3_i2v_distil_simple	Image to Video basic (simple)	fp8 distilled	01:32.83	05:54.25	06:45.56	15:37.77
D	5403v2_LTX-2.3_t2v_distil_simple	Image to Video basic (simple)	fp8 distilled	01:32.83	05:54.25	06:45.56	15:37.77
E	5404_LTX-2.3_T2V_I2V_1st_distilled	Text/Image to Video 1stage (distilled)	fp8 distilled	05:12.67	30:34.31	32:02.62	96:23.36
E1	5405_LTX-2.3_T2V_I2V_2st_distilled	Text/Image to Video 2stage (distilled)	fp8 distilled	09:54.06	58:42.72	62:40.42	×
E2	5406_LTX-2.3_t2v_3st_distilled	Text To Video 3stage (distilled)	fp8 distilled	02:43.63	26:31.65	18:23.37	54:45.53
E3	5407_LTX-2.3_i2v_3st_distilledt	Image To Video 3stage (distilled)	fp8 distilled	03:02.44	22:48.83	18:16.20	56:59.33
E4	5408_LTX-2.3_a2v_3st_distilled	Audio To Video 3stage (distilled)	fp8 distilled	02:44.38	21:15.00	16:45.13	51:16.87
F	5410_LTX-2.3_ia2v_distilled	Image Audio to Video (distilled)	fp8 distilled	03:36.76	09:00.67	06:39.46	10:44.13
G	5411_LTX-2.3_ia2v_distilled_simple	Image Audio to Video (distilled, simple)	fp8 distilled	01:56.61	08:10.36	05:54.76	09:52.81
H	5412_LTX-2.3_flf2v_distilled	FLF2V (distilled)	fp8 distilled	06:09.99		11:35.09
I	5413_LTX-2.3_flf2v_distilled_simple	FLF2V (distilled, simple)	fp8 distilled	06:09.99		11:35.09
J	5414_LTX-2,3_trans_distilled	Style Transition (distilled)	fp8 distilled	02:36.31		08:00.97
K	5415_LTX-2,3_trans_distilled_simple	Style Transition (distilled, simple)	fp8 distilled	02:36.31		08:00.97
L	5416_LTX-2.3_id_lora_distilled	ID LoRA (distilled)	fp8 distilled	05:35.63		14:39.30
M	5417_LTX-2.3_id_lora_distilled_simple	ID LoRA (distilled, simple)	fp8 distilled	05:10.87		14:44.56
A	5440_LTX-2.3_t2v_distilled_GGUF	Text to Video (GGUF)	GGUF distill	02:51.28	06:57.75	09:24.07	17:15.15
B	5441_LTX-2.3_i2v_distilled_GGUF	Image to Video (GGUF)	GGUF distill	03:28.56	06:23.98	07:39.56	11:43.00
C	5442_LTX-2.3_T2V_distilled_GGUF	Text to Video (GGUF, expanded)	GGUF distill	03:43.24	18:40.72	07:59.01	40:02.24
D	5443_LTX-2.3_I2V_distilled_GGUF	Image to Video (GGUF, expanded)	GGUF distill	04:42.55	14:09.71	08:28.67	14:11.10
E	5444_LTX-2.3_T2V_I2V_1st_distil_GF	Text/Image to Video 1stage (GGUF distilled)	GGUF distill	08:02.39	10:50.23	16:39.30	21:45.20
E1	5445_LTX-2.3_T2V_I2V_2st_distil_GF	Text/Image to Video 2stage (GGUF distilled)	GGUF distill	12:50.99	26:20.50	40:06.29	×
E2	5446_LTX-2.3_t2v_3st_distiled_GF	Text To Video 3stage (GGUF distilled)	GGUF distill	04:47.94	07:03.50	09:46.58	11:13.88
E3	5447_LTX-2.3_i2v_3st_distiled_GF	Image To Video 3stage (GGUF distilled)	GGUF distill	04:02.69	07:46.81	11:05.99	11:03.46
E4	5448_LTX-2.3_a2v_3st_distiled_GF	Audio To Video 3stage (GGUF distilled)	GGUF distill	03:25.44	07:53.89	10:14.84	10:01.91
F	5450_ltx-2.3_ia2v_distilled_GGUF	Image Audio to Video (GGUF)	GGUF distill	05:39.86	09:28.37	11:23.68	13:46.97
G	5451_ltx-2.3_ia2v_distilled_GGUF	Image Audio to Video (GGUF, expanded)	GGUF distill	04:05.78	07:20.17	10:09.44	11:55.80
H	5452_LTX-2.3_flf2v_distilled_GGUF	FLF2V (GGUF distilled)	GGUF distill	07:05.22		17:28.66
I	5453_LTX-2.3_flf2v_distilled_GGUF	FLF2V (GGUF distilled, simple)	GGUF distill	07:05.22		17:28.66
J	5454_LTX-2,3_trans_distilled_GGUF	Style Transition (GGUF distilled)	GGUF distill	03:31.59		10:20.05
K	5455_LTX-2,3_trans_distilled_GGUF	Style Transition (GGUF distilled, simple)	GGUF distill	03:31.59		10:20.05
L	5456_LTX-2.3_id_lora_distilled_GGUF	ID LoRA (GGUF distilled)	GGUF distill	08:08.15		22:50.58
M	5457_LTX-2.3_id_lora_distilled_GGUF	ID LoRA (GGUF distilled, simple)	GGUF distill	07:34.61		22:25.47
Z	5491_LTX-2.3_i2v_distilled_auto	LTX-2.3 automatic prompt	fp8 distilled	10:16.24	22:34.79	33:01.01	×
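
To compare rows of the table numerically, the `MM:SS.ss` stamps can be converted to seconds. The sketch below is illustrative only; the helper names `to_seconds` and `speedup` are assumptions, and the sample values are the RTX 4060 timings of workflows 5300 (fp8 dev) and 5400 (fp8 distilled) from the table.

```python
def to_seconds(stamp: str) -> float:
    """Convert 'MM:SS.ss' into seconds. Minutes may exceed 59 (e.g. '222:42.62')."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return to_seconds(baseline) / to_seconds(candidate)


if __name__ == "__main__":
    # 5300 (fp8 dev) vs 5400 (fp8 distilled), both on the RTX 4060
    ratio = speedup("22:16.79", "06:05.99")
    print(f"distilled is about {ratio:.2f}x faster")
```

Cells marked `×` or left empty have no timing and need to be filtered out before conversion.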

Setting up the environment for video generation

Downloading and placing the required models

Note that "ComfyUI" running under "Stability Matrix" uses a different location for the model folders → Model folder layout

Model type	File name (.safetensors)	Location		Download URL
checkpoints	ltx-2.3-22b-dev-fp8	/StabilityMatrix/Data/Models/	StableDiffusion/	ltx-2.3-22b-dev-fp8.safetensors
	ltx-2.3-22b-distilled-fp8		StableDiffusion/	ltx-2.3-22b-distilled-fp8.safetensors
	ltx-2.3-22b-dev-Q4_K_M.gguf		diffusion_models/	ltx-2.3-22b-dev-Q4_K_M.gguf
	ltx-2.3-22b-distilled-Q4_K_M.gguf		diffusion_models/	ltx-2.3-22b-distilled-Q4_K_M.gguf
LoRA	ltx-2.3-22b-distilled-lora-384		Lora/	ltx-2.3-22b-distilled-lora-384.safetensors
LoRA	ltx-2.3-22b-distilled-lora-dynamic_fro09_avg_rank_105_bf16		Lora/	ltx-2.3-22b-distilled-lora-dynamic_fro09_avg_rank_105_bf16.safetensors
text_encoders	gemma_3_12B_it_fp4_mixed		text_encoders	gemma_3_12B_it_fp4_mixed.safetensors
	gemma_3_12B_it_fp8_scaled			gemma_3_12B_it_fp8_scaled.safetensors
	ltx-2.3_text_projection_bf16			ltx-2.3_text_projection_bf16
VAE	LTX23_audio_vae_bf16		VAE/	LTX23_audio_vae_bf16.safetensors
VAE	LTX23_video_vae_bf16		VAE/	LTX23_video_vae_bf16.safetensors
UP Scale	ltx-2.3-spatial-upscaler-x2-1.1	/StabilityMatrix/Data/Packages/ComfyUI/models/	latent_upscale_models/	ltx-2.3-spatial-upscaler-x2-1.1.safetensors
UP Scale	~~ltx-2.3-spatial-upscaler-x2-1.0~~ ※	/StabilityMatrix/Data/Packages/ComfyUI/models/	latent_upscale_models/	ltx-2.3-spatial-upscaler-x2-1.0

　・ LTX-2.3 FP8 Model Card

Name	Notes
ltx-2.3-22b-dev-fp8	The full model, flexible and trainable, in fp8
ltx-2.3-22b-distilled-fp8	The distilled version of the full model, 8 steps, CFG=1, in fp8

　・ ※ Used by the GGUF workflows; use ltx-2.3-spatial-upscaler-x2-1.1.safetensors

On Windows, reconfigure the paging file → Set the paging file size to 128GB
To use GGUF models, install the custom node 'ComfyUI-GGUF' (following the common procedure)
・GitHub: ComfyUI-GGUF
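
Since most workflow errors come from misplaced model files, a quick existence check can save time. The following is a minimal sketch that covers only a subset of the placement table; the function name `missing_models` and the `EXPECTED_MODELS` mapping are assumptions for illustration, and the root folder must be supplied by the caller.

```python
from pathlib import Path

# Folder -> files, taken from the placement table above (subset for illustration).
EXPECTED_MODELS = {
    "StableDiffusion": [
        "ltx-2.3-22b-dev-fp8.safetensors",
        "ltx-2.3-22b-distilled-fp8.safetensors",
    ],
    "Lora": ["ltx-2.3-22b-distilled-lora-384.safetensors"],
    "VAE": [
        "LTX23_audio_vae_bf16.safetensors",
        "LTX23_video_vae_bf16.safetensors",
    ],
}


def missing_models(models_root: Path, expected: dict = EXPECTED_MODELS) -> list:
    """Return 'folder/file' entries from `expected` that are absent under `models_root`."""
    missing = []
    for folder, files in expected.items():
        for name in files:
            if not (models_root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


# Example call (path is an assumption, adjust to the actual install location):
# print(missing_models(Path("StabilityMatrix/Data/Models")))
```

An empty list means every listed file was found; extend the mapping with the text encoders and upscalers as needed.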

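Step 1: Creating workflows from the standard templates on the official site

　Uses the standard (dev) fp8 model "ltx-2.3-22b-dev-fp8.safetensors"

Choosing a workflow

① Select "Template" from the menu at the left edge
② Click "Video"
③ Type "ltx2.3" into the search box

・Choose a workflow from the list that appears
④ "LTX-2.3 Text to Video": generates video from text
⑤ "LTX-2.3 Image to Video": generates video from a still image
⑥ "LTX-2.3 Image Audio to Video": generates video from a still image and audio data
⑦ "LTX-2.3 FLF2V": generates video from two images, the first and last frames
⑧ "LTX-2.3 Style Transition": transitions between scenes from two images
⑨ "LTX-2.3 ID LoRA": generates video from one image and a short audio clip

・If a workflow raises an error, check the model placement described in the previous section

・Downloading the image data used in the workflows
　　 GitHub: ComfyUI-Org workflow_templates

Confirm that each workflow runs, then save it

	Workflow	Download URL	Saved workflow name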
④	LTX-2.3 Text to Video	video_ltx2_3_t2v.json	video_ltx2_3_t2v_org.json
⑤	LTX-2.3 Image to Video	video_ltx2_3_i2v.json	video_ltx2_3_i2v_org.json
⑥	LTX-2.3 Image Audio to Video	video_ltx2_3_ia2v.json	video_ltx2_3_ia2v_org.json
⑦	LTX-2.3 FLF2V	video_ltx2_3_flf2v.json	video_ltx2_3_flf2v_org.json
⑧	LTX-2.3 Style Transition	template_ltx2_3_style_transition.json	template_ltx2_3_style_transition_org.json
⑨	LTX-2.3 ID LoRA	video_ltx2_3_id_lora.json	video_ltx2_3_id_lora_org.json

・Original workflows

④ Text to Video: generates video from text
入力画像ダミー画像	*Prompt:* Dynamic cinematic close-up of high-tech modular machinery self-assembling in midair, precision robotic parts, magnetic connectors, and glowing circuits clicking together, subtle smoke and light flares, extremely detailed titanium textures. The final product displays a clean, clear surface with large glowing engraved text “LTX-2.3” centered and unobstructed, dramatic lighting, photorealism, 8K, sharp focus.
入力画像ダミー画像	空中で自己組み立てされるハイテクモジュール式機械のダイナミックなシネマティッククローズアップ。精密なロボット部品、磁気コネクタ、光る回路がカチッと音を立てて組み合わさり、かすかな煙と光のフレア、極めて精緻なチタンの質感。最終製品は、中央に大きく光る刻印文字「LTX-2.3」が遮るものなく配置された、清潔でクリアな表面を呈し、ドラマチックな照明、フォトリアリズム、8K、シャープなフォーカスを実現しています。
↑ video_ltx2_3_t2v_org.json 　　　　　　　　SubGraph expanded →
⑤ Image to Video: generates video from a still image
入力画像 egyptian_queen.png	*Prompt:* Egyptian royal in blue-and-gold headdress and high collar, white dress with golden embroidery and armbands, desert, robot soldiers in formation left and right. She walks steadily forward, head held level and gaze fixed ahead—no dipping or lowering of the head. The camera performs a single, smooth push-in only: starting in a wider shot of her, the robots, and the desert, it moves steadily forward until she is in a medium or medium-close frame, then holds. She stops, posture and head still upright, and says: “The old gods are silent. I am not.” Robot soldiers shift or march in place; sand and fabric move with the wind. No pull-back; the only camera move is the continuous push-in.
入力画像 egyptian_queen.png	青と金の頭飾りとハイカラー、金の刺繍と腕輪のついた白いドレスを着たエジプトの王族。砂漠、左右に整列したロボット兵士たち。彼女は頭を水平に保ち、視線をまっすぐ前に向けたまま、頭を下げたり下げたりすることなく、着実に前進する。カメラは、彼女とロボット、砂漠を捉えたワイドショットから始まり、彼女がミディアムまたはミディアムクローズのフレームに入るまで着実に前進し、そこで静止する。彼女は立ち止まり、姿勢と頭は依然としてまっすぐで、「古い神々は沈黙している。私は沈黙しない」と言う。ロボット兵士たちはその場で移動したり行進したりし、砂と布は風に揺れる。プルバックはなく、カメラの動きは連続的なプッシュインのみである。
↑ video_ltx2_3_i2v_org.json 　　　　　　　　SubGraph expanded →
⑥ Image Audio to Video: generates video from a still image and audio data
入力画像 cactus_man.png 入力音声 ltx_23_audio.mp3	*Prompt:* The fuzzy cactus creature is talking to the viewer as it grips the steering wheel with one hand, the other hand gestures naturally as it speaks. The car is moving, revealing the sunlit coastal background, static camera fixed on character, smooth side-tracking shot matching the car speed scene: Sunlit coastal road trip, clear coastal background character: Fuzzy cactus creature with big square sunglasses and a Hawaiian shirt action: One hand grips the steering wheel, the other gestures naturally while talking to the camera camera: Fixed on character, smooth side-tracking shot matching car speed
入力画像 cactus_man.png 入力音声 ltx_23_audio.mp3	毛むくじゃらのサボテンのような生き物が、片手でハンドルを握りながら、もう片方の手で自然なジェスチャーを交え、カメラに向かって話しかけています。車は動き、陽光に照らされた海岸線を背景に映し出しています。カメラはキャラクターに固定され、車の速度に合わせて滑らかな横移動ショットが用いられています。シーン：陽光に照らされた海岸沿いのドライブ、澄んだ海岸線の背景キャラクター：大きな四角いサングラスとハワイアンシャツを着た毛むくじゃらのサボテンのような生き物動作：片手でハンドルを握り、もう片方の手で自然なジェスチャーを交えながらカメラに向かって話すカメラ：キャラクターに固定され、車の速度に合わせて滑らかな横移動ショット
↑ video_ltx2_3_ia2v_org.json 　　　　　　　　SubGraph expanded →

⑦ FLF2V: generates video from the first and last frame images
最初のフレーム画像 high_view_classic_car.png	最後のフレーム画像 low_view_classic_car.png	*Prompt:* The camera move from a high position to a low position, keeping the character in the frame centered. Music: Synthwave cyberpunk music with calm ambient synths and driving 80s beats..
		カメラは高い位置から低い位置へと移動し、常に画面中央に人物を捉える。音楽：シンセウェーブ・サイバーパンク調の音楽。穏やかなアンビエントシンセと、力強い80年代風のビートが特徴。
↑ video_ltx2_3_flf2v_org.json 　　　　　　　　SubGraph expanded →
⑧ Style Transition: transitions between scenes from two images
最初のフレーム画像 ltx23_flf2v_first_frame.png	最後のフレーム画像 ltx23_flf2v_last_frame.png	*Prompt:* The red hair wizard girl looks up as the magical flame burns in her palm. Camera dollys out. The scene turns from a professional photography to a wet pastel watercolor painting. zhuanchang
		赤毛の魔法使いの少女が、手のひらで燃える魔法の炎を見上げる。カメラがドリーアウトする。場面はプロの写真撮影から、濡れたパステル水彩画へと変化する。zhuanchang
↑ template_ltx2_3_style_transition_org.json 　　　　　　　　SubGraph expanded →
⑨ ID LoRA: generates video from one image and a short audio clip
入力画像 vintage_thinker.png	入力音声 ltx23_reference_audio.mp3	*Prompt:* [VISUAL]: Opens with a medium shot, camera slowly pushes in toward the character. A man with short dark hair and round glasses, wearing a retro orange floral shirt, looks directly at the camera, his mouth opens and closes naturally as he speaks, tilts his head playfully. [SPEECH]: Hey, what do you think of this vibe? Feels like we’re back in the 90s. [SOUNDS]: Playful and upbeat tone, moderate volume, close to the microphone.
		【映像】：ミディアムショットで始まり、カメラがゆっくりとキャラクターに近づいていく。黒髪で丸眼鏡をかけた男性が、レトロなオレンジ色の花柄シャツを着て、カメラをまっすぐ見つめている。口は自然に開閉し、楽しそうに首を傾げながら話す。【セリフ】：なあ、この雰囲気どう思う？まるで90年代に戻ったみたいだろ？【音声】：楽しげで明るいトーン、適度な音量、マイクに近い。
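↑ video_ltx2_3_id_lora_org.json 　　　　　　　　SubGraph expanded →

・Notes on the original workflow "video_ltx2_3_t2v_org.json"
　1. The workflow contains a "switch to Text to Video?" setting (true/false), which defaults to true
　　It switches the behavior: True = Text to Video, False = Image to Video
　2. When this workflow runs, it generates a more detailed prompt from the input prompt and uses that prompt for generation
　3. The wording of the generated detailed prompt differs in nuance on every run

< Example of an internally generated prompt >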
Style: realistic with cinematic lighting. In a close-up, high-tech modular machinery self-assembling dynamically in midair—precision robotic parts clicking together, magnetic connectors connecting, and glowing circuits connecting subtly. Subtle smoke and light flares drift through the air. The final product displays a clean, clear surface with large, glowing engraved text “LTX-2.3” centered and unobstructed. Dramatic lighting highlights the titanium textures. Extremely detailed titanium textures are visible everywhere, catching the light. Sharp focus creates a sense of precision. Ambient sounds include faint clicks and whirs as the machinery assembles itself. Behind the machinery, other patrons move subtly in and out of focus.

スタイル：映画のような照明を用いたリアルな表現。クローズアップでは、ハイテクなモジュール式機械が空中でダイナミックに自己組み立てされる様子が映し出される。精密なロボット部品がカチッと音を立てて組み合わさり、磁気コネクタが接続され、光る回路が微妙に接続される。かすかな煙と光のフレアが空中を漂う。完成品は、中央に大きく光る「LTX-2.3」の刻印文字が遮られることなく、すっきりとした表面を呈する。ドラマチックな照明がチタンの質感を際立たせる。至る所に極めて精緻なチタンの質感が見られ、光を捉えている。シャープなフォーカスが精密さを感じさせる。機械が組み立てられる際の微かなクリック音や唸り音が環境音として聞こえる。機械の背後では、他の客が微妙にピントが合ったり外れたりする。

　4. This step (the prompt enhancer) takes time and sometimes adds inappropriate wording
　　Bypass it when you use an external LLM or can supply a detailed prompt yourself

Organizing the workflows: Text to Video / Image to Video

Text to Video (video from text)	Image to Video (video from a still image)
5300 Text to Video basic workflow	5300 SubGraph

「LTX/」5300_LTX-2.3_t2v_dev.json
5301 Image to Video basic workflow	5301 SubGraph

「LTX/」5301_LTX-2.3_i2v_dev.json
5302 Text to Video basic workflow (simple)	5303 Image to Video basic workflow (simple)

「LTX/」5302_LTX-2.3_T2V_dev_simple.json	「LTX/」5303_LTX-2.3_I2V_dev_simple.json

・Generated videos (with audio)

5302_LTX-2.3_T2V_simple.json	5303_LTX-2.3_I2V_simple.json

Organizing the workflows: Image Audio to Video

Image Audio to Video (video from a still image and audio data)
5310 Image Audio to Video basic workflow	5310 SubGraph

「LTX/」5310_LTX-2.3_ia2v_dev.json
5311 Image Audio to Video basic workflow (simple)

「LTX/」5311_LTX-2.3_ia2v_dev_simple.json

・Generated videos (with audio)

5310_LTX-2.3_ia2v_dev.json

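Step 2: Creating GGUF (dev) workflows

　The standard (dev) fp8 model "ltx-2.3-22b-dev-fp8.safetensors" appears to run out of memory with 8GB of VRAM or less, so a GGUF quantized model is tried instead

About GGUF quantized models
・In general, more bits give higher precision but also consume more VRAM
・GGUF saves VRAM rather than time; technically it is slower because the weights are compressed
・Where the whole model does not fit in VRAM, GGUF can end up being faster

LTX-2.3-dev GGUF models
Type	Bits	Model size	Description
Q2_K	2	8.28 GB	2-bit quantization. Super-blocks of 16 blocks, each block with 16 weights. 2.5625 bits per weight
Q3_K_M	3	18.8 GB	3-bit quantization. Super-blocks of 16 blocks, each block with 16 weights. 3.4375 bits per weight
Q3_K_S	3	9.95 GB	3-bit quantization. Super-blocks of 16 blocks, each block with 16 weights. 3.4375 bits per weight
Q4_K_M	4	14.3 GB	4-bit quantization. Super-blocks of 8 blocks, each block with 32 weights. 4.5 bits per weight
Q4_K_S	4	13.1 GB	4-bit quantization. Super-blocks of 8 blocks, each block with 32 weights. 4.5 bits per weight
Q5_K_M	5	16.1 GB	5-bit quantization. Super-blocks of 8 blocks, each block with 32 weights. 5.5 bits per weight
Q5_K_S	5	16.2 GB	5-bit quantization. Super-blocks of 8 blocks, each block with 32 weights. 5.5 bits per weight
Q6_K	6	17.8 GB	6-bit quantization. Super-blocks of 16 blocks, each block with 16 weights. 6.5625 bits per weight
Q8_0	8	22.8 GB	Quantized to an 8-bit approximation. Each block has 32 weights
F16	16	42.0 GB	16-bit standard IEEE 754 half-precision floating point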

　※ https://huggingface.co/unsloth/LTX-2.3-GGUF/tree/main
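
As a rough cross-check of the table, a file size can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch, assuming all 22B parameters are stored at the stated bit width; the listed sizes differ from these estimates (for example, Q4_K_M is listed as 14.3 GB), likely because real files mix precisions across tensors and carry extra data. The names `PARAMS` and `estimated_size_gb` are illustrative.

```python
PARAMS = 22e9  # 22B parameters, as stated for the dev model

# Bits per weight as given in the table above.
BITS_PER_WEIGHT = {
    "Q2_K": 2.5625,
    "Q3_K": 3.4375,
    "Q4_K": 4.5,
    "Q5_K": 5.5,
    "Q6_K": 6.5625,
    "F16": 16.0,
}


def estimated_size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Estimate file size in GB (10**9 bytes): params * bits / 8 bits per byte."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: about {estimated_size_gb(bits):.1f} GB")
```

Treat the result as an order-of-magnitude guide when choosing a quantization level for a given amount of VRAM, not as the actual download size.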

Downloading workflows that support GGUF quantized models
・LTX-2.3 22B GGUF WORKFLOWS 12GB VRAM
・Rebels LTX-2.3 Dev (GGUF)

Organized GGUF workflows: Text to Video / Image to Video

Text to Video (video from text)	Image to Video (video from a still image)
5340 Text to Video basic workflow (GGUF)	5340 SubGraph

「LTX/」5340_LTX-2.3_t2v_dev_GGUF.json
5341 Image to Video basic workflow (GGUF)	5341 SubGraph

「LTX/」5341_LTX-2.3_i2v_dev_GGUF.json
5342 Text to Video basic workflow (GGUF, expanded)	5343 Image to Video basic workflow (GGUF, expanded)

「LTX/」5342_LTX-2.3_T2V_GGUF.json	「LTX/」5343_LTX-2.3_I2V_dev_GGUF.json

・Generated videos (with audio)

5340_LTX-2.3_t2v_dev_GGUF.json	5341_LTX-2.3_i2v_dev_GGUF.json

Organized GGUF workflows: Image Audio to Video

Image Audio to Video (video from a still image and audio data)
5350 Image Audio to Video basic workflow (GGUF)	5350 SubGraph

「LTX/」5350_LTX-2.3_ia2v_dev_GGUF.json
5351 Image Audio to Video basic workflow (GGUF, expanded)

「LTX/」5351_LTX-2.3_ia2v_dev_GGUF.json

・Generated videos (with audio)

5350_LTX-2.3_ia2v_dev.json

Step 3: Converting the standard template workflows to the distilled version

　Basically, the standard template (dev) workflows work once the LoRA (ltx-2.3-22b-distilled-lora-384) node is bypassed and the model is switched
　The "Text to Video basic workflow" gets some minor modifications (described below)

Organizing the workflows: Text to Video / Image to Video

Text to Video (video from text)	Image to Video (video from a still image)
5400 Text to Video basic workflow (distilled)	5400 SubGraph (distilled)

「LTX/」5400_LTX-2.3_t2v_distilled.json
5401 Image to Video basic workflow (distilled)	5401 SubGraph (distilled)

「LTX/」5401_LTX-2.3_i2v_distilled.json
5402 Text to Video basic workflow (distilled/simple)	5403 Image to Video basic workflow (distilled/simple)

「LTX/」5402_LTX-2.3_T2V_distilled_simple.json	「LTX/」5403_LTX-2.3_I2V_distilled_simple.json

・About the Text to Video basic workflow
　- When this workflow runs, it generates a more detailed prompt from the input prompt and uses that prompt for generation
　- To shorten generation time, this node group is bypassed so that the input prompt itself drives generation
　- The input prompt used is the one that was generated when the original workflow was run
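
Bypassing is normally done in the ComfyUI interface, but the same change can be sketched on a saved workflow file. The following is a minimal sketch under stated assumptions: it assumes the saved workflow JSON has a top-level `nodes` list whose entries carry `id` and `mode` fields, and that mode value 4 marks a bypassed node. The node ids are hypothetical, and nodes nested inside subgraphs are not handled.

```python
import json

BYPASS = 4  # assumption: mode value stored for a bypassed node


def bypass_nodes(workflow: dict, node_ids: set) -> int:
    """Mark top-level nodes whose id is in `node_ids` as bypassed.

    Returns the number of nodes changed. Nodes nested elsewhere in the
    file (for example inside subgraph definitions) are not touched.
    """
    changed = 0
    for node in workflow.get("nodes", []):
        if node.get("id") in node_ids:
            node["mode"] = BYPASS
            changed += 1
    return changed


if __name__ == "__main__":
    # Tiny stand-in for a workflow file; real files contain many more fields.
    workflow = json.loads('{"nodes": [{"id": 1, "mode": 0}, {"id": 2, "mode": 0}]}')
    print(bypass_nodes(workflow, {2}), workflow["nodes"])
```

Look up the actual node ids in your own workflow file before applying anything like this, and keep a copy of the original.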

・Generated videos (with audio)

Original workflow	With prompt generation bypassed

*Prompt:* Dynamic cinematic close-up of high-tech modular machinery self-assembling in midair, precision robotic parts, magnetic connectors, and glowing circuits clicking together, subtle smoke and light flares, extremely detailed titanium textures. The final product displays a clean, clear surface with large glowing engraved text “LTX-2.3” centered and unobstructed, dramatic lighting, photorealism, 8K, sharp focus.	*Prompt:* realistic with cinematic lighting. In a close-up, high-tech modular machinery self-assembling in midair, precision robotic parts and magnetic connectors click together with glowing circuits. Subtle smoke and light flares create dramatic effects as the titanium textures display extreme detail. The final product displays a clean, clear surface with large glowing engraved text “LTX-2.3” centered and unobstructed. The scene’s sharp focus highlights 8K photorealism.
空中で自己組み立てされるハイテクモジュール式機械のダイナミックなシネマティッククローズアップ。精密なロボット部品、磁気コネクタ、光る回路がカチッと音を立てて組み合わさり、かすかな煙と光のフレア、極めて精緻なチタンの質感。最終製品は、中央に大きく光る刻印文字「LTX-2.3」が遮るものなく配置された、清潔でクリアな表面を呈し、ドラマチックな照明、フォトリアリズム、8K、シャープなフォーカスを実現しています。	映画のようなライティングによるリアルな描写。クローズアップでは、ハイテクなモジュール式機械が空中で自己組み立てされ、精密なロボット部品と磁気コネクタが光る回路と共にカチッと嵌合する様子が描かれています。かすかな煙と光のフレアがドラマチックな効果を生み出し、チタンの質感は極めて精緻なディテールを際立たせています。完成品は、中央に大きく光る「LTX-2.3」の刻印文字が遮るものなく配置された、すっきりとしたクリアな表面を呈しています。シーンのシャープなフォーカスが8Kフォトリアリズムを際立たせています。

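・Make the Image to Video workflow able to save the last frame of the generated video

Image to Video V2 (saves the last frame): generates video from a still image

Added part	「LTX/」5403v2_LTX-2.3_I2V_distilled_simple.json

Organizing the workflows: Image Audio to Video

Image Audio to Video (video from a still image and audio data)
Input image	Input audio	Prompt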
woman4.png	seikai.mp3	カメラを見て、真ん中に'LTX-2.3'のロゴマークの入った白いTシャツを着て明るく話す表情豊かな女性のミディアムショット、上半身が映り、胸と肩が画面内に収まっている。「セイカイ。いいカンジだよ。」とほほ笑んで話します。


5410 Image Audio to Video basic workflow (distilled)	5410 SubGraph

「LTX/」5410_LTX-2.3_ia2v_distilled.json
5411 Image Audio to Video basic workflow (distilled/simple)

「LTX/」5411_LTX-2.3_ia2v_distilled_simple.json

・Generated videos (with audio)

5410_LTX-2.3_ia2v_distilled.json

Organizing the workflows: FLF2V

FLF2V (video from two images, the first and last frames)
5412 FLF2V workflow (distilled)	5412 SubGraph

「LTX/」5412_LTX-2.3_flf2v_distilled.json
5413 FLF2V workflow (distilled/simple)

「LTX/」5413_LTX-2.3_flf2v_distilled_simple.json

・Generated videos (with audio)

5412_LTX-2.3_flf2v_distilled.json

Organizing the workflows: Style Transition

Style Transition (scene-to-scene transition video from two images)
5414 Style Transition workflow (distilled)	5414 SubGraph

「LTX/」5414_LTX-2,3_trans_distilled.json
5415 Style Transition workflow (distilled/simple)

「LTX/」5415_LTX-2,3_trans_distilled_simple.json

・Generated videos (with audio)

5414_LTX-2,3_trans_distilled.json

Organizing the workflows: ID LoRA

ID LoRA (video from one image and a short audio clip)
5416 ID LoRA workflow (distilled)	5416 SubGraph

「LTX/」5416_LTX-2.3_id_lora_distilled.json
5417 ID LoRA workflow (distilled/simple)

「LTX/」5417_LTX-2.3_id_lora_distilled_simple.json

・Generated videos (with audio)

5416_LTX-2.3_id_lora_distilled.json

Step 4: Creating GGUF (distilled) workflows

About the distilled GGUF quantized model
・Uses "ltx-2.3-22b-distilled-Q4_K_M.gguf"
・Model sizes are almost the same for dev and distilled → GGUF (dev)

GGUF quantized model workflows
・Change the model in the dev workflows created in Step 3

Organized GGUF workflows: Text to Video / Image to Video

Text to Video (video from text)	Image to Video (video from a still image)
5440 Text to Video basic workflow, distilled (GGUF)	5440 SubGraph

「LTX/」5440_LTX-2.3_t2v_distilled_GGUF.json
5441 Image to Video basic workflow, distilled (GGUF)	5441 SubGraph

「LTX/」5441_LTX-2.3_i2v_distilled_GGUF.json
5442 Text to Video basic workflow, distilled (GGUF, expanded)	5443 Image to Video basic workflow, distilled (GGUF, expanded)

「LTX/」5442_LTX-2.3_T2V_distilled_GGUF.json	「LTX/」5443_LTX-2.3_I2V_distilled_GGUF.json

Organized GGUF workflows: Image Audio to Video

Image Audio to Video (video from a still image and audio data)
5450 Image Audio to Video basic workflow, distilled (GGUF)	5450 SubGraph

「LTX/」5450_LTX-2.3_ia2v_distilled_GGUF.json
5451 Image Audio to Video basic workflow, distilled (GGUF, expanded)

「LTX/」5451_LTX-2.3_ia2v_distilled_GGUF.json

Organized GGUF workflows: FLF2V

FLF2V (video from two images, the first and last frames)
5452 FLF2V workflow, distilled (GGUF)	5452 SubGraph

「LTX/」5452_LTX-2.3_flf2v_distilled_GGUF.json
5453 FLF2V workflow, distilled (GGUF, expanded)

「LTX/」5453_LTX-2.3_flf2v_distilled_GGUF.json

Organized GGUF workflows: Style Transition

Style Transition (scene-to-scene transition video from two images)
5454 Style Transition workflow, distilled (GGUF)	5454 SubGraph

「LTX/」5454_LTX-2,3_trans_distilled_GGUF.json
5455 Style Transition workflow, distilled (GGUF, expanded)

「LTX/」5455_LTX-2,3_trans_distilled_GGUF.json

Organized GGUF workflows: ID LoRA

ID LoRA (video from one image and a short audio clip)
5456 ID LoRA workflow, distilled (GGUF)	5456 SubGraph

「LTX/」5456_LTX-2.3_id_lora_distilled_GGUF.json
5457 ID LoRA workflow, distilled (GGUF, expanded)

「LTX/」5457_LTX-2.3_id_lora_distilled_GGUF.json

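Step 5: Workflows from the official Lightricks site

　Apart from the ComfyUI site, Lightricks, the developer of LTX-2.3, also provides sample workflows on its official site, so these are tested as well
　→ PSA: Use the official LTX 2.3 workflows. They are considerably better than the ones bundled with ComfyUI.

Preparation
1. Install the extension nodes 'ComfyMath' and 'RES4LYF' (following the common procedure)
　・https://github.com/evanspearman/ComfyMath
　・https://github.com/ClownsharkBatwing/RES4LYF

2. Update the extension nodes (if the workflow errors do not go away)

① Click the "Manager" button
② Select "Update All"
③ Click the "Restart" button when it appears
④ Wait until a new screen appears, then close it
⑤ Close the previous screen and exit the web page
⑥ Quit "StabilityMatrix" and start it again
※ If this does not work, update the nodes individually from "Check Update"

Single Stage version
1. Download the workflow "LTX-2.3_T2V_I2V_Single_Stage_Distilled_Full.json"

2. Change the models

Old	New	Places to change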
ltx-2.3-22b-dev.safetensors	ltx-2.3-22b-dev-fp8	5
ltx-2.3-22b-distilled-lora-384.safetensors	ltx-2.3-22b-distilled-lora-384	2
confy_gemma_3.12B_it.safetensors	gemma_3_12B_it_fp4_mixed	1

Standard (dev) / distilled Text to Video, Image to Video

「~beta」LTX-2.3_T2V_I2V_Single_Stage_Distilled_Full_org.json

3. Organize the workflow

Text/Image to Video Single Stage: generates video from text or a still image in 1 stage
Regular (dev) Text to Video / Image to Video	Distilled Text to Video / Image to Video

「LTX/」5304_LTX-2.3_T2V_I2V_1st_dev.json	「LTX/」5404_LTX-2.3_T2V_I2V_1st_distilled.json

*Prompt:* A traditional Japanese tea ceremony takes place in a tatami room as a host carefully prepares matcha. Soft traditional koto music plays in the background, adding to the serene atmosphere. The bamboo whisk taps rhythmically against the ceramic bowl while water simmers in an iron kettle. Guests kneel in formal seiza position, watching in respectful silence. The host bows and presents the tea bowl, turning it precisely before offering it to the first guest with soft-spoken words.
畳の部屋で、亭主が丁寧に抹茶を点てる伝統的な日本の茶道が繰り広げられる。静かな琴の音色が背景に流れ、穏やかな雰囲気を醸し出す。鉄のやかんで湯が沸く中、竹製の茶筅が陶器の茶碗をリズミカルに叩く。客は正座の姿勢で跪き、静かにその様子を見守る。亭主は一礼し、茶碗を丁寧に回してから、最初の客に静かに言葉を添えて差し出す。

Two Stage version
1. Download the workflow "LTX-2.3_T2V_I2V_Two_Stage_Distilled.json"

2. Change the models

Old	New	Places to change
ltx-2.3-22b-dev.safetensors	ltx-2.3-22b-dev-fp8	5
ltx-2.3-22b-distilled-lora-384.safetensors	ltx-2.3-22b-distilled-lora-384	1
confy_gemma_3.12B_it.safetensors	gemma_3_12B_it_fp4_mixed	1

Distilled Text to Video, Image to Video

「~beta」LTX-2.3_T2V_I2V_Two_Stage_Distilled_org_org.json

3. Organize the workflow

Text/Image to Video Two Stage: generates video from text or a still image in 2 stages
	Distilled Text to Video / Image to Video

	「LTX/」5405_LTX-2.3_T2V_I2V_2st_distilled.json

GGUF version

Text/Image to Video Single Stage and Two Stage (GGUF)
Regular (dev) Text to Video / Image to Video 1stage	Distilled Text to Video / Image to Video 1stage

「LTX/」5344_LTX-2.3_T2V_I2V_1st_dev_GF.json	「LTX/」5444_LTX-2.3_T2V_I2V_1st_distil_GF.json
	Distilled Text to Video, Image to Video 2stage

	「LTX/」5445_LTX-2.3_T2V_I2V_2st_distil_GF.json

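Step 6: 3-stage video generation

　Ordinary 2-stage generation first generates at low resolution and then applies Hires.fix. This is extended here to 3 stages:
generate at a very small resolution, apply a 2x Hires.fix, then apply another 2x Hires.fix. The community reports clearly better results with this approach
　→ Comfy with ComfyUI: LTX-2.3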
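
The resolution at each stage follows from the final size by halving. The sketch below only illustrates that arithmetic; the function name `stage_sizes` and the sample resolution are assumptions, and any extra constraints the model places on width and height are not checked.

```python
def stage_sizes(final_w: int, final_h: int, stages: int = 3) -> list:
    """Return (width, height) per stage, where each stage doubles the previous one.

    The last entry is the final resolution; the first is the tiny base render.
    """
    top_factor = 2 ** (stages - 1)
    if final_w % top_factor or final_h % top_factor:
        raise ValueError(f"final size must be divisible by {top_factor}")
    return [
        (final_w // 2 ** (stages - 1 - i), final_h // 2 ** (stages - 1 - i))
        for i in range(stages)
    ]


if __name__ == "__main__":
    # Hypothetical final resolution, chosen only to show the halving.
    print(stage_sizes(1280, 704))
```

With three stages the first pass renders one sixteenth of the final pixel count, and the two later passes add detail.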

text2video

Text to Video Three Stage: generates video from text in 3 stages
Distilled Text to Video 3 Stage	Distilled Text to Video 3 Stage (GGUF)

「LTX/」5406_LTX-2.3_t2v_3st_distilled.json	「LTX/」5446_LTX-2.3_t2v_3st_distilled_GF.json

*Prompt:* A cinematic rainy London street at blue hour. Wet cobblestones reflect deep blue and amber light, and a glowing red British telephone box stands beside the sidewalk. A young woman in a dark trench coat steps out of the phone booth onto the pavement, pauses for a beat, then begins walking calmly along the sidewalk. She is filmed clearly from the side in a graceful profile tracking shot, with the camera moving parallel to her from the building side of the street. Her body remains safely framed on the pavement, while the road and traffic stay separated behind her. A classic black London taxi passes through the background on the street at a comfortable distance, its headlights and reflections sliding across the wet road without crossing into her path. Warm pub windows, rain-speckled glass, soft streetlamp glow, puddle reflections, and faint mist create a refined melancholic London atmosphere. Her coat hem and hair move lightly in the damp evening breeze. The sound is purely natural location sound: soft rainwater dripping, distant tires on wet pavement, a passing taxi, faint city ambience, and the subtle echo of the street at night. The scene feels elegant, restrained, lonely, and cinematic.
映画のような雨のロンドンの街並み、夕暮れ時。濡れた石畳は深い青と琥珀色の光を反射し、歩道脇には赤く光るイギリスの電話ボックスが立っている。暗いトレンチコートを着た若い女性が電話ボックスから歩道に出て、一瞬立ち止まり、それから静かに歩道を歩き始める。彼女は優雅な横顔のトラッキングショットで横から鮮明に捉えられ、カメラは通りの建物側から彼女と平行に移動する。彼女の体は歩道にしっかりと収まり、道路と交通は彼女の背後で分離されている。背景の通りでは、クラシックな黒いロンドンタクシーが適度な距離を保って通り過ぎ、ヘッドライトと反射光は濡れた路面を滑るように進み、彼女の進路を遮ることはない。温かいパブの窓、雨粒のついたガラス、柔らかな街灯の光、水たまりの反射、そしてかすかな霧が、洗練された物憂げなロンドンの雰囲気を醸し出している。彼女のコートの裾と髪は、湿った夕方のそよ風に軽く揺れる。音は純粋に自然の音だ。雨粒が静かに滴る音、濡れた舗装路を走る遠くのタイヤの音、通り過ぎるタクシーの音、かすかな街のざわめき、そして夜の街路に響く繊細なこだま。その情景は、優雅で抑制され、孤独感があり、まるで映画のワンシーンのようだ。

image2video

Image to Video Three Stage: generates video from a still image in 3 stages
Distilled Image to Video 3 Stage	Distilled Image to Video 3 Stage (GGUF)

「LTX/」5407_LTX-2.3_i2v_3st_distilled.json	「LTX/」5447_LTX-2.3_i2v_3st_distilled_GF.json

*Prompt:* A child outdoors on a sunny day brings the bubble wand to their lips and gently blows, creating a stream of floating soap bubbles. The child smiles naturally, blinks once, and slightly moves their head and hand while watching the bubbles drift upward and across the frame. Soft sunlight glows through the bubbles and hair, and the background stays bright and softly blurred. The camera remains close and mostly steady with only a slight natural handheld drift, keeping the child’s face as the center of the shot. Gentle, natural motion, realistic expression, clear bubble movement, warm cinematic atmosphere.
晴れた日に、屋外で子供がシャボン玉の棒を口元に当て、そっと息を吹きかけると、水面に浮かぶシャボン玉が次々と現れる。子供は自然な笑顔を見せ、一度まばたきをし、頭と手を少し動かしながら、シャボン玉が上へ、そして画面を横切って漂っていく様子を見つめる。柔らかな陽光がシャボン玉と髪の毛を通して輝き、背景は明るく、柔らかくぼかされている。カメラは子供の顔を中心に据え、ほとんど手持ちで自然なわずかな揺れがあるものの、子供の顔に寄り添うように撮影されている。優しく自然な動き、リアルな表情、はっきりとしたシャボン玉の動き、温かみのある映画のような雰囲気。

image_audio2video

Audio to Video 3 Stage: generates video from a still image and an audio file
Distilled Audio to Video 3 Stage	Distilled Audio to Video 3 Stage (GGUF)

「LTX/」5408_LTX-2.3_a2v_3st_distilled.json	「LTX/」5448_LTX-2.3_a2v_3st_distilled_GF.json

*Prompt:* A quiet mountain lodge on a cold overcast day. A person in a jacket walks slowly across a damp gravel path near the lodge entrance. Small stones and wet gravel shift under each step. Fog hangs lightly in the air, pine trees stand in the background, and the ground is dark from recent moisture. The camera follows from a low side angle, focusing on the footsteps, the texture of the gravel, and the calm mountain atmosphere. Natural ambience, no music.
曇り空の寒い日、静かな山小屋。ジャケットを着た人が、小屋の入り口近くの湿った砂利道をゆっくりと歩いている。小石や濡れた砂利が、一歩ごとに動く。霧が薄く立ち込め、背景には松の木が立ち、地面は最近の湿気で黒ずんでいる。カメラは低い横からのアングルで、足跡、砂利の質感、そして穏やかな山の雰囲気に焦点を当てて追う。音楽なしの自然な雰囲気。

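GGUF models

What is GGUF (GPT-Generated Unified Format)? → What is a quantized model?
- GGUF is a file format designed for running large language models (LLMs) quickly and efficiently on ordinary consumer PCs (CPU or GPU)
- It extends the older GGML format, is widely used to distribute quantized (size-reduced) models, and stores the model weights and metadata in a single file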

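Main features and benefits of GGUF
1. Optimized for local environments: fast inference is possible on CPUs and Apple Silicon (M1/M2/M3) as well.
   Even when GPU memory (VRAM) is insufficient, the model can run using main memory (RAM)
2. Self-contained single file: model parameters, tokenizer settings, and all other required data are bundled into one .gguf file, which makes management easy
3. Quantization support (K-quants): mixed-precision quantization such as "Q4_K_M" greatly reduces model size (to roughly 1/2 to 1/4) while keeping accuracy high
4. Broad compatibility: natively supported by many local LLM tools such as llama.cpp, Ollama, and LM Studio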
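
The size reduction in item 3 can be checked with simple arithmetic. The sketch below is only an illustration, not part of any tool. It assumes a 22B-parameter model (the size named in the GGUF workflow listed under the references) and uses the llama.cpp block layouts for Q8_0 and Q4_0 (32 weights plus one 16-bit scale per block). K-quants such as Q4_K_M use a different layout and typically land a little above Q4_0. The name estimate_size_gb is hypothetical.

```python
from typing import Dict

# Bits stored per weight. FP16 and FP8 follow from their definitions;
# Q8_0 and Q4_0 follow from the llama.cpp block layout
# (32 quantized weights plus one 16-bit scale per block).
BITS_PER_WEIGHT: Dict[str, float] = {
    "FP16": 16.0,
    "FP8": 8.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
    "Q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per weight
}


def estimate_size_gb(num_params: float, fmt: str) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring metadata."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    params = 22e9  # assumption: a 22B-parameter model
    base = estimate_size_gb(params, "FP16")
    for fmt in BITS_PER_WEIGHT:
        size = estimate_size_gb(params, fmt)
        print(f"{fmt:5s} {size:6.1f} GB  ({size / base:.2f} of FP16)")
```

Under these assumptions FP16 comes to about 44 GB, Q8_0 to about 23.4 GB, and Q4_0 to about 12.4 GB, which matches the "roughly 1/2 to 1/4" range quoted above.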

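GGUF and speed
- GGUF saves VRAM rather than time: the weights are stored quantized and must be dequantized during inference, so GGUF is generally slower than an unquantized model that fits in VRAM
- In environments where the whole model does not fit in VRAM, GGUF can end up being faster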

Workflow changes needed for GGUF models

Node	Standard model	GGUF
Checkpoint	Load Checkpoint	Unet Loader (GGUF) / VAE Loader KJ
VAE	Load Audio VAE	VAE Encoder KJ
Text Encoder	LTXV Audio Text Encoder Loader	Dual CLIP Loader

　* Reference URL → LTX-2.3 GGUF Image-to-Video & Text-to-Video in ComfyUI
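
Before swapping loaders it helps to see which node types a workflow actually contains. The following is a minimal sketch, assuming a workflow saved from the ComfyUI interface, where the JSON has a top-level "nodes" list and each node carries a "type" field. Nodes packed inside subgraphs are not counted here, and the type names in the sample are placeholders, not real ComfyUI class names.

```python
import json
from collections import Counter
from typing import Any, Dict


def count_node_types(workflow: Dict[str, Any]) -> Counter:
    """Count nodes by their "type" field in a UI-format workflow dict."""
    return Counter(node.get("type", "?") for node in workflow.get("nodes", []))


if __name__ == "__main__":
    # Hypothetical miniature workflow. A real file would be read with
    # json.load(open(path, encoding="utf-8")).
    sample = json.loads(
        '{"nodes": [{"id": 1, "type": "LoaderA"},'
        ' {"id": 2, "type": "LoaderB"},'
        ' {"id": 3, "type": "LoaderA"}]}'
    )
    for node_type, count in sorted(count_node_types(sample).items()):
        print(f"{count:3d}  {node_type}")
```

Running it on the standard workflow and on the GGUF workflow and comparing the two listings shows which loader nodes were replaced.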

Generated video examples

Generated videos (with audio)　* Workflow used for generation → 5402v2_LTX-2.3_t2v_distil_simple.json
　Prompts quoted from → Short videos 1: "LTX-2 official prompting guide"

①

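An action-packed, cinematic shot of a monster truck. The truck passes in front of the camera, which pans left to follow its reckless driving. Dust and motion blur surround the truck, and the camera feels handheld as it tries to follow the truck into the distance. The truck then drifts, turns around, and heads back toward the camera until it is seen at very close range.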

②

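A warm, intimate cinematic performance takes place in a cozy wood-paneled bar, where soft amber practical lighting and a shallow depth of field create glowing bokeh in the background. The shot opens on a medium close-up of a young female singer in her twenties with short black hair and bangs. She sings into a microphone while strumming an acoustic guitar, eyes closed and posture relaxed. The camera slowly arcs left around her, keeping sharp focus on her face and the microphone, while the two male band members playing guitar behind her stay softly blurred. Warm light wraps her face and hair as framed photos and wooden walls drift past in the background. Ambient live music fills the space, led by her clear vocals over gentle acoustic guitar strumming.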

③

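An animated-film-style shot. A robot walks slowly while the camera dollies back, holding a medium shot of its slow walk. The robot then begins to run, slowly and heavily. After that the robot stops, the camera keeps dollying back, and an over-the-shoulder shot reveals a similar blue robot.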

Automatic prompt input

Install the custom node → Create prompts with the text encoder
1. Download ComfyUI-TextGenerateGemma3Prompt (this custom node must be installed manually)
2. Place it in "ComfyUI/custom_nodes/"
3. Start ComfyUI
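
Step 2 is a plain folder copy. The sketch below shows one way to script it, assuming the download has already been extracted to a local folder. The function name install_custom_node and the example paths are hypothetical. Nothing here is specific to this custom node.

```python
import shutil
from pathlib import Path


def install_custom_node(node_src: Path, comfy_root: Path) -> Path:
    """Copy an extracted custom-node folder into <comfy_root>/custom_nodes/."""
    if not node_src.is_dir():
        raise FileNotFoundError(f"custom node folder not found: {node_src}")
    dest = comfy_root / "custom_nodes" / node_src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True (Python 3.8+) lets a newer copy overwrite an older one.
    shutil.copytree(node_src, dest, dirs_exist_ok=True)
    return dest


# Example call with hypothetical paths:
# install_custom_node(
#     Path("D:/download/ComfyUI-TextGenerateGemma3Prompt"),
#     Path("D:/StabilityMatrix/Data/Packages/ComfyUI"),
# )
```

ComfyUI scans custom_nodes/ at startup, so the node appears only after the restart in step 3.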

LTX-2.3 workflow

5491 Image to Video basic auto-prompt workflow	5491 SubGraph

LTX/5491_LTX-2.3_i2v_distilled_auto.json

Video generated from the input image with no prompt

Notes

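Main differences between the "dev" and "distilled" models

　"dev" and "distilled" differ mainly in intended use, speed, and output quality.
In image and video generation models such as LTX 2.3 and FLUX, "dev" is the standard high-quality model and "distilled" is a lighter model tuned for speed.

Comparison

Feature	Dev (Development)	Distilled
Positioning	Base / full model	Speed-optimized model
Generation speed	Slow (baseline)	Very fast (4 - 8 steps)
Quality / detail	Very high (detailed)	High (close to dev)
Step count	20 - 50+ steps	4 - 8 steps
Best suited for	LoRA training, high-quality stills and video	Fast generation, previews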
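
Step count is only part of the speed gap. The distilled model runs at CFG 1 (see the overview), and samplers generally need just one model pass per step at CFG 1 instead of a conditional pass plus an unconditional one. The rough count below rests on that assumption. The CFG of 4.0 for dev is a hypothetical setting, and 20 steps is the low end of the range in the table.

```python
def model_evaluations(steps: int, cfg: float) -> int:
    """Number of model forward passes in one sampling run.

    Assumption: with CFG != 1 every step runs a conditional and an
    unconditional pass; with CFG == 1 the unconditional pass is skipped.
    """
    passes_per_step = 1 if cfg == 1.0 else 2
    return steps * passes_per_step


if __name__ == "__main__":
    dev = model_evaluations(steps=20, cfg=4.0)       # hypothetical dev settings
    distilled = model_evaluations(steps=8, cfg=1.0)  # 8 steps at CFG 1
    print(f"dev: {dev} passes, distilled: {distilled} passes, "
          f"ratio: {dev / distilled:.1f}x")
```

With these numbers dev needs 40 passes and distilled needs 8, a fivefold difference before any per-pass cost is considered. Multi-stage workflows add their own passes on top of this.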

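Main differences between "fp8" and "GGUF" models

　In generative AI models (such as Stable Diffusion and Flux for images), FP8 and GGUF both aim to shrink the model through quantization, but they differ in mechanism and purpose. FP8 is recommended when speed matters and VRAM capacity allows; GGUF (Q8) is recommended for flexibility across environments and for higher precision

Comparison

	FP8 (Floating Point 8-bit)	GGUF (GPT-Generated Unified Format)
Characteristics	Computes with 8-bit floating-point numbers	A llama.cpp-family format designed so that CPU and GPU can work together; its quantization preserves accuracy very well
Advantages	Keeps accuracy loss relatively small while running faster than FP16 (16-bit) and cutting VRAM use roughly in half	With quantization such as Q8 (8-bit), output quality is almost unchanged from the original FP16 model. Easier to run when VRAM is limited
Disadvantages	Tends to use slightly more VRAM than GGUF	Generation can be somewhat slower than with FP8
Best suited for	High-performance GPUs such as NVIDIA cards (for example 12 GB VRAM or more) when fast generation is the goal	Limited VRAM (for example 8 GB to under 12 GB), CPU-centered environments, or cases where image quality comes first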
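
The last row of the comparison can be read as a simple rule of thumb. The sketch below encodes it directly. The 12 GB threshold comes from the table, while the function name and the prefer_quality flag are hypothetical, and real results depend on the model and workflow.

```python
def suggest_weight_format(vram_gb: float, prefer_quality: bool = False) -> str:
    """Pick a weight format following the rule of thumb in the table."""
    if prefer_quality:
        # Q8 stays closest to the original FP16 output.
        return "GGUF (Q8)"
    if vram_gb >= 12:
        # Enough VRAM: take the faster option.
        return "FP8"
    # Limited VRAM or CPU-centered setups.
    return "GGUF"


if __name__ == "__main__":
    for vram in (8, 12, 24):
        print(f"{vram:2d} GB VRAM -> {suggest_weight_format(vram)}")
```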

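ID-LoRA

What is ID-LoRA?
- LTX-2.3 ID-LoRA (Identity-Driven In-Context LoRA) is a new technique released for the LTX-2.3 video generation model
- From a single image and a short audio clip, it generates a lip-synced video that reproduces a consistent appearance and voice
- It is integrated into ComfyUI and well suited to creating avatars and AI influencers
- With this technique, high-quality videos in which face and voice match can be made without specialist knowledge

Main features and details
- Technique: ID-LoRA combines a text prompt, a reference image, and an audio clip to generate a video
- Main models: two ID-LoRA models are available, "CelebV-HQ 3K" and "TalkVID 3K"
- ComfyUI integration: runs in recent versions of ComfyUI through a dedicated native node (LTX Reference Audio ID-LoRA)
- Benefits: keeps characters consistent, which suits narrative content and automated generation
- Latest features: supports automated workflows through integration with Wan 2.2, audio input, and improved lip sync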

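Style Transition

What is Style Transition?
- The LTX 2.3 transition LoRA is a specialized model extension designed for ComfyUI
- It creates seamless, guided transformations between different scenes or images
- It is part of the updated LTX 2.3 ecosystem and enables advanced video generation with better motion continuity than the base model alone

Main features and functions
- Scene-to-scene transitions: changes smoothly from one visual state to another, such as switching scenes, styles (e.g. cartoon, pencil sketch), or characters
- First/last-frame generation: excels in workflows that take a start image and an end image (keyframes to video) and connect them with logical motion
- Improved consistency: reduces the "time jump" effect that can occur between the 8-frame chunks that LTX-2.3 processes
- Dramatic transformations: especially effective when the transformation is extreme (for example, a person turning into a creature)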

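FLF2V

What is FLF2V?
- FLF2V (First Frame Last Frame to Video) is a feature of "Wan2.1/2.2", the open-source AI video generation model developed by Alibaba (Tongyi Wanxiang team)
- You specify two images, the first and last frames of the video, and it automatically generates the motion in between (the transition) to produce a single smooth video

Main features of FLF2V
- Strong control over start and end: in conventional image-to-video the AI predicts the motion, whereas FLF2V pins the start and end images exactly, which makes it easier to get the intended story or animation
- Smooth transitions: interpolates between the two keyframes logically and naturally, producing high-quality 720p video
- Uses: well suited to loop videos (where the end connects back to the start), character transformation scenes, and videos that smoothly link two different scenes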

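720p

What is 720p?
- An HD (high-definition) video format with a resolution of 1280 x 720 pixels and a 16:9 aspect ratio
- It uses progressive scanning, so even fast-moving footage is displayed smoothly
- Widely used for video streaming, video conferencing, and similar purposes
- On large-screen TVs that need fine detail, 1080p or 4K is preferred, but 720p remains the most general-purpose "HD quality" baseline for online video

Main features
- Resolution: 1280 x 720 (about 0.92 million pixels)
- Meaning of "p": progressive. Every frame is drawn in full (smoother than interlaced "i")
- Also known as: "HD Ready" or "HD"
- Uses: standard high-quality setting on YouTube and streaming services, video conferencing, webcams, etc.

Differences from 1080p (Full HD) and 4K

Standard	Resolution	Pixels
720p (HD)	1280 x 720	approx. 0.92 million
1080p (Full HD)	1920 x 1080	approx. 2.07 million
2160p (4K)	3840 x 2160	approx. 8.29 million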
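
The pixel counts in the table are simply width times height. A short check, with the dictionary and function names chosen for this illustration only:

```python
from typing import Dict, Tuple

RESOLUTIONS: Dict[str, Tuple[int, int]] = {
    "720p (HD)": (1280, 720),
    "1080p (Full HD)": (1920, 1080),
    "2160p (4K)": (3840, 2160),
}


def pixel_count(width: int, height: int) -> int:
    """Total pixels in one frame."""
    return width * height


if __name__ == "__main__":
    base = pixel_count(*RESOLUTIONS["720p (HD)"])
    for name, (w, h) in RESOLUTIONS.items():
        pixels = pixel_count(w, h)
        print(f"{name:16s} {w} x {h} = {pixels:,} pixels ({pixels / base:.2f}x 720p)")
```

1080p has 2.25 times and 4K has 9 times the pixels of 720p, which is a useful first guide to how much more memory and time a higher-resolution generation needs per frame.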

Video editing tips

In-site reference URLs
- Environment setup → "Virtual environment (py38_learn)"
- Python personal general-purpose library 2 → Video editing tips
- Image generation AI "ComfyUI" 9 (Video 3) → Video editing tips

Editing videos with "ffmpeg"

conda activate py38_learn

cd D:\video_edit

Get the frame count (source video: input.mp4)

ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of csv=p=0 input.mp4
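
The same ffprobe call can be made from Python in the py38_learn environment. This is a minimal sketch, assuming ffprobe is on PATH. The helper names are hypothetical, and the parser accepts a trailing comma in case ffprobe emits one.

```python
import subprocess
from typing import List


def ffprobe_frame_count_cmd(video: str) -> List[str]:
    """Argument list equivalent to the ffprobe command line above."""
    return [
        "ffprobe", "-v", "error", "-count_frames",
        "-select_streams", "v:0",
        "-show_entries", "stream=nb_read_frames",
        "-of", "csv=p=0", video,
    ]


def parse_frame_count(output: str) -> int:
    """csv=p=0 prints just the number; ignore whitespace and a trailing comma."""
    return int(output.strip().rstrip(",").split(",")[0])


def count_frames(video: str) -> int:
    """Run ffprobe and return the number of decoded video frames."""
    result = subprocess.run(
        ffprobe_frame_count_cmd(video),
        capture_output=True, text=True, check=True,
    )
    return parse_frame_count(result.stdout)
```

Note that -count_frames decodes the whole stream, so count_frames("input.mp4") takes longer on long clips.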

Save frame N as an image (source video: input.mp4, saved image: output.png). Replace 100 in the command with the zero-based frame number you want

ffmpeg -i input.mp4 -vf "select=gte(n\,100)" -vframes 1 output.png
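
Building this command in Python avoids editing the frame number by hand. The sketch below only assembles the argument list and does not run ffmpeg. The function names are hypothetical, and the frame index is zero-based, so the last frame of a clip is the frame count minus one.

```python
from typing import List


def extract_frame_cmd(video: str, frame_index: int, out_png: str) -> List[str]:
    """Build the ffmpeg command that saves one frame (zero-based index)."""
    if frame_index < 0:
        raise ValueError("frame_index must be 0 or greater")
    # The comma inside gte() is escaped so the filter parser does not
    # treat it as a separator between filters.
    select = f"select=gte(n\\,{frame_index})"
    return ["ffmpeg", "-i", video, "-vf", select, "-vframes", "1", out_png]


def last_frame_index(frame_count: int) -> int:
    """Index of the final frame when counting from 0."""
    return frame_count - 1


if __name__ == "__main__":
    # 121 is a hypothetical frame count, e.g. a value reported by ffprobe.
    cmd = extract_frame_cmd("input.mp4", last_frame_index(121), "last.png")
    print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) as is. No shell quoting is needed because each argument is a separate list element.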

Concatenate video files (list file: mylist.txt, finished video: output.mp4)

ffmpeg -f concat -safe 0 -i mylist.txt -c copy output.mp4

- "mylist.txt"

file D:/video_edit/5403_2026-05-13_00006_.mp4
file D:/video_edit/5403_2026-05-13_00008_.mp4
file D:/video_edit/5403_2026-05-13_00010_.mp4
file D:/video_edit/5403_2026-05-13_00012_.mp4
file D:/video_edit/5413_2026-05-13_00001_.mp4
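
When there are many clips, the list file can be generated instead of typed. This is a minimal sketch with a hypothetical function name. It writes each path in single quotes, which the concat list format accepts and which keeps paths containing spaces intact, and it converts backslashes to forward slashes as in the listing above.

```python
from typing import Iterable


def concat_list_text(videos: Iterable[str]) -> str:
    """Return mylist.txt content: one "file '<path>'" line per clip, in order."""
    lines = []
    for video in videos:
        path = video.replace("\\", "/")
        # Inside single quotes, a literal quote is written as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    clips = [
        "D:/video_edit/5403_2026-05-13_00006_.mp4",
        "D:/video_edit/5403_2026-05-13_00008_.mp4",
    ]
    print(concat_list_text(clips), end="")
```

Writing the returned text to mylist.txt and running the concat command joins the clips in list order. Because -c copy does not re-encode, the clips need matching codec, resolution, and frame rate.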

Revision history

2026/04/07 First version

References

ComfyUI LTX-2.3

ComfyUI LTX-2.3 workflow
- LTX-2.3 22B GGUF WORKFLOWS 12GB VRAM
- Rebels LTX-2.3 Dev (GGUF)

ComfyUI LTX-2.3 GGUF
