Image Generation AI "ComfyUI" 9 (Video, Part 2)　== Work in progress ==

　Testing local AI image generation with "ComfyUI"

※ Last updated: 2026/01/30

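Generating video with audio using LTX-2

　"LTX-2", a video generation model with audio support announced in January 2026, is attracting a lot of attention.
It is natively supported by "ComfyUI", which makes it easy to use, and it can reportedly run even in low-VRAM environments with some tuning, so this page puts that to the test.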

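Overview

What is "LTX-2"?
- A high-performance open-source video generation AI model developed by Lightricks (Israel) and unveiled at CES in January 2026
- A multimodal video generation AI that generates video and audio at the same time

Main features
- Generates dialogue, background music, and sound effects together with the video, producing long clips (up to about 20 seconds) in which picture and sound are synchronized
- The model weights are openly released, so it can run locally, and native support is progressing in ComfyUI and other tools
- Lip sync between mouth movement and speech is very accurate; it has even been called a "lip-sync machine"
- Supports multiple input modes: Text-to-Video, Image-to-Video, and generation guided by control signals (Canny, depth, pose)

Requirements (from ComfyUI-LTXVideo)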
- ComfyUI installed
- CUDA-compatible GPU with 32GB+ VRAM
- 100GB+ free disk space for models and cache
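
The disk-space requirement can be checked before downloading any models. The following is a minimal sketch using only the Python standard library; the function name `has_free_space` and the path "." are illustrative assumptions, and VRAM is not checked here.

```python
import shutil

REQUIRED_FREE_GB = 100  # "100GB+ free disk space" from the requirements list


def has_free_space(free_bytes: int, required_gb: int = REQUIRED_FREE_GB) -> bool:
    """Return True when free_bytes covers required_gb (1 GB = 1024**3 bytes here)."""
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    # "." is a placeholder; point it at the drive that will hold the models.
    free = shutil.disk_usage(".").free
    print(f"free: {free / 1024**3:.1f} GB, enough: {has_free_space(free)}")
```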

Official site

Workflows created in this project

Download ComfyUI_proj.zip (updated as needed) ※ Updated 2026/01/29
・Folders created by extracting the archive

📂ComfyUI
  ├─📂input　　　　　　　　　　　　　　← Input images used by the workflows
  └─📂user
        └─📂default
              └─📂workflows　　　　　　　　← Where workflows are saved
                    ├─📂_audio
                    ├─📂_base
                    ├─📂_base_i2i
                    ├─📂_base_t2i
                    ├─📂_prompt
                    ├─📂_utility
                    ├─📂_video
                    └─📂test

・Copy the extracted "ComfyUI/" folder over "StabilityMatrix/Data/Packages/ComfyUI", overwriting existing files
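
The overwrite copy can also be scripted. This is a minimal sketch, assuming Python 3.8 or later (for `dirs_exist_ok`); the helper name `overlay_copy` is hypothetical and the paths passed to it are up to the reader.

```python
import shutil
from pathlib import Path


def overlay_copy(src: Path, dst: Path) -> None:
    """Copy src into dst, overwriting same-named files and keeping everything else."""
    # dirs_exist_ok=True lets copytree merge into an existing destination folder.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```

For example, `overlay_copy(Path("ComfyUI"), Path("StabilityMatrix/Data/Packages/ComfyUI"))` would merge the extracted folder into the installed package folder. Files that exist only in the destination are left untouched.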

Generation time by workflow and environment (min:sec)

Workflow	Function	Model	GPU					CPU	
			RTX 4070	RTX 4060	RTX 4060L	RTX 3050	GTX 1050	i7-1260P	i7-1185G7
video_ltx2_i2v_001.json	Image-to-video	Distilled fp8	01:16.64	03:26.27	05:42.91	13:38.81	×	×	×
video_ltx2_t2v_001.json	Text-to-video	Distilled fp8	01:56.49	03:56.71	05:06.04	17:28.08	×	×	×
video_ltx2_i2v_dev_003.json	Image-to-video	Standard fp4	03:14.28	07:22.89	09:11.15	19:17.54	×	×	×
video_ltx2_t2v_dev_003.json	Text-to-video	Standard fp4	02:54.28	07:31.23	09:44.48	17:50.15	×	×	×
LTX-2_text2video_V2_004.json	text2video			11:44.47	14:18.25	26:41.94	×	×	×
LTX-2_text2video_distilled_005.json	text2video, 8 steps			13:36.94	23:55.01	16:45.69	×	×	×
LTX-2_image2video_distilled_V2_007.json	image2video			31:50.14	23:10.95	29:10.14	×	×	×

Setting up the environment for video generation

Downloading and placing the required models

Note that with "ComfyUI" running under "Stability Matrix" the model folders are in different locations → Model folder layout

Model type	File name (.safetensors)	Destination (under /StabilityMatrix/Data/)	Download URL
checkpoints	ltx-2-19b-distilled-fp8 (fp8, distilled)	Models/StableDiffusion/	ltx-2-19b-distilled-fp8.safetensors
checkpoints	ltx-2-19b-dev-fp4 (fp4, standard)	Models/StableDiffusion/	ltx-2-19b-dev-fp4.safetensors
loras	ltx-2-19b-distilled-lora-384	Models/Lora/	ltx-2-19b-distilled-lora-384.safetensors
loras	ltx-2-19b-lora-camera-control-dolly-left	Models/Lora/	ltx-2-19b-lora-camera-control-dolly-left
vae	LTX2_audio_vae_bf16	Models/VAE/	LTX-2 VAE Files
vae	LTX2_video_vae_bf16	Models/VAE/	LTX-2 VAE Files
vae	LTX2_video_vae_old_bf16	Models/VAE/	LTX-2 VAE Files
latent_upscale_models	ltx-2-spatial-upscaler-x2-1.0	Packages/ComfyUI/models/latent_upscale_models/	LTX-2 Spatial Upscaler ×2 (v1.0)
text_encoders	gemma_3_12B_it_fp4_mixed	Packages/ComfyUI/models/text_encoders/	gemma_3_12B_it_fp4_mixed.safetensors
text_encoders	ltx-2-19b-embeddings_connector_dev_bf16	Packages/ComfyUI/models/text_encoders/	LTX-2 Embeddings Connector (bf16)
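
The placement rules in the table can be written down as a small lookup, which helps avoid putting a file in the wrong folder. This is a sketch only: the dictionary mirrors the table, while the function name `destination` and the use of POSIX-style paths are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Destination folders relative to StabilityMatrix/Data/, taken from the table.
DESTINATIONS = {
    "checkpoints": "Models/StableDiffusion",
    "loras": "Models/Lora",
    "vae": "Models/VAE",
    "latent_upscale_models": "Packages/ComfyUI/models/latent_upscale_models",
    "text_encoders": "Packages/ComfyUI/models/text_encoders",
}


def destination(data_root: str, model_type: str, file_name: str) -> PurePosixPath:
    """Build the full path where a model file of the given type should be placed."""
    return PurePosixPath(data_root) / DESTINATIONS[model_type] / file_name
```

An unknown model type raises `KeyError`, which is intentional: it is better to stop than to guess a folder.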

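On Windows, reconfigure the paging file

① Press "Windows + R" to open the "Run" dialog
② Type "sysdm.cpl" and press "OK"
③ On the "Advanced" tab, press "Settings" under "Performance", then open the "Advanced" tab of "Performance Options" and press "Change" under "Virtual memory"
④ Uncheck "Automatically manage paging file size for all drives"
⑤ Select "Custom size"
⑥ Enter 131072 (MB, i.e. 128 GB) for both "Initial size" and "Maximum size"
⑦ Press the "Set" button
⑧ Confirm that the paging file size is shown as 131072-131072
⑨ Press "OK", then restart the system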
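
The paging file size is entered in megabytes, so the value 131072 comes from a simple conversion. A minimal sketch of that arithmetic (the helper name `gb_to_pagefile_mb` is hypothetical):

```python
def gb_to_pagefile_mb(size_gb: int) -> int:
    """Convert a size in GB to the MB value entered in the dialog (1 GB = 1024 MB)."""
    return size_gb * 1024


print(gb_to_pagefile_mb(128))  # 131072
```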

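Step 1: Generation with the standard templates (Distilled)

Choosing a workflow

① Select "Template" from the menu on the left edge
② Press "Video"
③ Type "ltx2" in the search box

・Choose a workflow from the list that appears
④ "LTX-2 Text to Video (Distilled)": text-to-video (distilled)
⑤ "LTX-2 Image to Video (Distilled)": image-to-video (distilled)

⑥ Both workflows show an alert dialog; ignore it and close it
⑦ In the workflow node, change the "ckpt_name" field to
　"ltx-2-19b-distilled-fp8.safetensors"
　※ This is the lightweight variant of "ltx-2-19b-distilled.safetensors" that runs with less VRAM

・If the workflow raises an error, check the model placement described in the previous section
・Verified versions are placed in the "workflows/_video/" folder
　(if you followed the project file download procedure at the top of this page)
　④ "LTX-2 Text to Video (Distilled)" → "video_ltx2_t2v_001.json"
　⑤ "LTX-2 Image to Video (Distilled)" → "video_ltx2_i2v_001.json"

"LTX-2 Text to Video (Distilled)": text-to-video (distilled)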

　Prompt
A man in a black tuxedo stands motionless in a small, red-tiled bathroom, facing a mirror. The camera sits just behind his right shoulder, framing both his back and his solemn reflection. Suddenly, he opens his mouth and begins to sing opera in Italian: "La donna e mobile, qual piuma al vento." Rich, resonant notes echo through the space. As his voice climbs in pitch, his brows lift, and his expression becomes more passionate, almost vulnerable. The overhead lighting casts a sharp glow on his face and tuxedo, reflecting in the glossy red tiles around him. The camera is static

　※ Workflow: "_video/"

video_ltx2_t2v_001.json

・Generated video (with audio)

"LTX-2 Image to Video (Distilled)": image-to-video (distilled)

Prompt
A wide, dynamic tracking shot follows a group of mountain bikers as they race across a pristine snow-covered landscape on a brilliant winter morning. The camera moves at speed, keeping pace with the lead biker in a vibrant yellow jacket and orange helmet, who launches into the air over a snow mound, their bike suspended against the clear blue sky. Snow particles explode around them, catching the golden light of the low sun that creates dramatic backlighting and long shadows across the terrain. Several other bikers follow closely behind, their dark silhouettes kicking up plumes of powdery snow as they navigate the undulating mounds. The only sounds are the crunch of tires biting into packed snow, the whoosh of air as they fly through jumps, and the distant sound of heavy breathing and excited shouts. One biker calls out to another with exhilaration: "This is incredible! The light is perfect!" Another responds with a breathless laugh: "Keep pushing! We're almost at the ridge!" The camera glides smoothly alongside the group, occasionally pulling back to reveal the full expanse of the snowy landscape with bare birch trees and dark evergreens lining the edge. The mood is exhilarating, fast-paced, and full of the raw energy of winter mountain biking, with every jump and turn captured in crisp detail against the brilliant white snow and vibrant sky.

・Input image "mountain_bikers.jpg"

　※ Workflow: "_video/"

video_ltx2_i2v_001.json

・Generated video (with audio)

Results so far

"Text to Video" and "Image to Video" both ran without problems using the lightweight model
From the start to the end of a generation, about 100 GB of system memory is committed and VRAM is fully used

Generation times (min:sec)

Run	Workflow	RTX 4070	RTX 4060	RTX 4060L	RTX 3050
First	video_ltx2_i2v_001.json	03:33.83	05:08.03	07:30.43	09:12.62
Second	video_ltx2_i2v_001.json	01:16.64	03:56.71	05:42.91	17:28.08
Second	video_ltx2_t2v_001.json	01:56.49	03:26.27	05:06.04	13:38.81
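・Compared with "FramePack", it is considerably faster even though it also generates audio
　(roughly several times to several tens of times faster)

Prompting is quite difficult (it takes study and experience)
"GGUF quantized models" are said to promise even lighter and faster operation
・They require the dedicated custom node that is provided (to be investigated)
It also appears possible to generate video driven by audio; details are still to be investigated

Models not yet tested

Model type	File name	Description	Download URL
Diffusion Model	ltx-2-19b-distilled.safetensors	Distilled model (fp16)	ltx-2-19b-distilled.safetensors
Diffusion Model	ltx-2-19b-distilled_Q4_K_M.gguf	GGUF quantized model	ltx-2-19b-distilled_Q4_K_M
Text Encoder	gemma_3_12B_it_fp8_e4m3fn.safetensors	Gemma 3 12B (FP8 e4m3fn)	gemma_3_12B_it_fp8_e4m3fn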

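Step 2: Generation with the standard templates (standard version)

Choosing a workflow

① Select "Template" from the menu on the left edge
② Press "Video"
③ Type "ltx2" in the search box

・Choose a workflow from the list that appears
④ "LTX-2 Text to Video": text-to-video (standard)
⑤ "LTX-2 Image to Video": image-to-video (standard)

⑥ Both workflows show an alert dialog; ignore it and close it
⑦ In the workflow node, change the "ckpt_name" field to
　"ltx-2-19b-dev-fp4.safetensors"
　※ This is the lightweight variant of "ltx-2-19b-dev.safetensors" that runs with less VRAM

・If the workflow raises an error, check the model placement described in the previous section
・Verified versions are placed in the "workflows/_video/" folder
　(if you followed the project file download procedure at the top of this page)
　④ "LTX-2 Text to Video" → "video_ltx2_t2v_dev_003.json"
　⑤ "LTX-2 Image to Video" → "video_ltx2_i2v_dev_003.json"

"LTX-2 Text to Video": text-to-video (standard)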

　Prompt
A close-up of a cheerful girl puppet with curly auburn yarn hair and wide button eyes, holding a small red umbrella above her head. Rain falls gently around her. She looks upward and begins to sing with joy in English: "It's raining, it's raining, I love it when its raining." Her fabric mouth opening and closing to a melodic tune. Her hands grip the umbrella handle as she sways slightly from side to side in rhythm. The camera holds steady as the rain sparkles against the soft lighting. Her eyes blink occasionally as she sings.

　Negative Prompt
blurry, low quality, still frame, frames, watermark, overlay, titles, has blurbox, has subtitles

　※ Workflow: "_video/"

video_ltx2_t2v_dev_003.json

・Generated video (with audio) ※ The video on the right was generated with the distilled workflow (video_ltx2_t2v_001.json)

"LTX-2 Image to Video": image-to-video (standard)

Prompt
A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red "Diner" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: "Welcome to Rosie's. What can I get for you today?" The mood is inviting, timeless, and full of classic American diner charm.

・Input image "cute_girl.jpg"

　※ Workflow: "_video/"

video_ltx2_i2v_dev_003.json

・Generated video (with audio) ※ The video on the right was generated with the distilled workflow (video_ltx2_i2v_001.json)

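LTX-2 models

　For the LTX-2 (Lightricks) video generation AI, the main difference between the Distilled model and the FP4 (4-bit quantized) model is a trade-off between generation speed (number of steps) and fidelity (image quality and fine detail). The following summarizes tendencies seen in local environments (ComfyUI etc.) as of January 2026

LTX-2 model comparison

Feature	Distilled	FP4 (4-bit quantized)	Dev (base model)
Summary	Uses knowledge distillation to keep the capabilities of the original (Dev) model while being lighter and faster, producing high-quality video in few steps	Model weights compressed down to 4 bits, using recent NVIDIA technology among others	Standard base model
Main use	Fast generation (8-16 steps)	Low VRAM and speed (aimed at RTX 50xx)	High quality, high detail (30+ steps)
Steps	8-16 steps (very fast)	30+ steps (heavy)	30-50+ steps (slow)
VRAM usage	Medium (can be reduced further with FP8 etc.)	Very small (runs on 8-12 GB)	Large (24 GB or more recommended)
Image quality / detail	Good, but with an "AI-like" texture	Somewhat coarse, more artifacts	Best (realistic)
Suitable GPUs	RTX 3090/4090/50xx	RTX 50xx (NVFP4), 4060	High-end environments
Advantages	Very fast: generation time is greatly reduced. Easy: follows prompts well and needs no fine parameter tuning	Low VRAM: consumes very little VRAM and runs on 8-12 GB GPUs. Fastest on the RTX 50 series: dedicated acceleration can be expected on the RTX 50 series (5080/5090 etc.)	
Disadvantages	Compared with FP16 etc., fine details (such as facial expressions) tend to take on a slightly "plastic" look	Lower image quality: quantization can make fine detail coarser. GPU-dependent: an RTX 50 series GPU is needed to get the full speed benefit	
Download URL	ltx-2-19b-distilled.safetensors (fp16) ltx-2-19b-distilled-fp8.safetensors (fp8)	ltx-2-19b-dev-fp4.safetensors (fp4)	ltx-2-19b-dev.safetensors (fp16) ltx-2-19b-dev-fp8.safetensors (fp8)
Size	43.3 GB (fp16) / 27.1 GB (fp8)	20 GB (fp4)	43.3 GB (fp16) / 27.1 GB (fp8)

Which one should you choose?
・With an RTX 5060 / 5080 / 5090:
　The FP4 version + Distilled LoRA, or Distilled FP8, gives the best balance of quality and speed
　※ If the "plastic" look of the Distilled model is a concern, combining it with the LoRA (strength 0.6-1.0) can improve it

・With little VRAM (12 GB or less):
　The FP4 model
　※ On an RTX 50 series (Blackwell) GPU, the "distilled nvfp4" model is the most efficient choice　Download → (nvfp4) ltx-2-19b distilled
　　 It does not run on the RTX 40 series because of a data error

・When high-quality video is the goal:
　Use the Dev (FP8 or FP16) model and take the time to generate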

Step 3: Summary of the standard templates

　The standard workflows on the official ComfyUI site use the "subgraph" feature to hide everything except the input and output elements. Here the subgraphs are unpacked and the workflows reorganized into ordinary flat workflows

Results of the revised standard templates

	text to video		image to video	
	distilled fp8	dev (standard) fp4	distilled fp8	dev (standard) fp4
Workflow	101_T2V_LTX2_base_distilled.json	103_T2V_LTX2_base_dev.json	101_I2V_LTX2_base_distilled.json	103_I2V_LTX2_base_dev.json
Generation time (min:sec)
RTX 4070				
RTX 4060	04:59.94	03:58.41	08:21.71	07:17.85
RTX 4060L	05:35.99	04:37.25	10:09.20	10:26.00
RTX 3050	13:54.88	17:32.56	19:12.19	19:36.78

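Generating video with audio using LTX-2, part 2 <Comfy (comfortable) use of ComfyUI>

　LTX-2 workflows are complex and hard to understand. While researching, I found an easy-to-follow site written by a Japanese author.
The LTX-2 workflows are examined below using that site as a guide

Introduction

Recommended settings

Resolution	640×640 (1:1)	Output	1280x1280
	768×512 (3:2)		1536x1024
	704×512 (approx. 4:3)		1408x1024
	※ Post-processing upscales 2x, so the actual output is twice the size
	※ Must be a multiple of 32
FPS	24 / 25 / 30
Frames	Maximum: 257 frames (about 10 seconds at 25 fps)
	Recommended: 121-161 (balance of quality and memory)
	※ Must be of the form 8n+1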
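
The constraints in the recommended settings (multiples of 32, 8n+1 frames, at most 257 frames, 2x output) are easy to check before queuing a long generation. The following is a minimal sketch; the function names are illustrative and the rules are exactly those listed in the table, nothing more.

```python
MAX_FRAMES = 257  # maximum frame count from the recommended settings


def check_settings(width: int, height: int, frames: int) -> list[str]:
    """Return a list of rule violations; an empty list means the settings pass."""
    problems = []
    if width % 32 != 0 or height % 32 != 0:
        problems.append("width and height must be multiples of 32")
    if frames % 8 != 1:
        problems.append("frame count must be of the form 8n+1")
    if frames > MAX_FRAMES:
        problems.append(f"frame count must not exceed {MAX_FRAMES}")
    return problems


def clip_seconds(frames: int, fps: int) -> float:
    """Length of the clip in seconds."""
    return frames / fps


def upscaled_size(width: int, height: int) -> tuple[int, int]:
    """Final output size after the 2x upscale in post-processing."""
    return width * 2, height * 2


print(check_settings(704, 512, 121))  # []
print(upscaled_size(704, 512))        # (1408, 1024)
print(clip_seconds(257, 25))          # 10.28
```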

Model placement

📂StabilityMatrix/Data/
 └ 📂Models/
    ├ 📂StableDiffusion/
    │   └── ltx-2-19b-dev-fp4.safetensors
    └ 📂Lora/
         └── ltx-2-19b-distilled-lora-384.safetensors

📂ComfyUI/
 └ 📂models/
    ├ 📂latent_upscale_models/
    │   └── ltx-2-spatial-upscaler-x2-1.0.safetensors
    └ 📂text_encoders/
         └── gemma_3_12B_it_fp4_mixed.safetensors

　※ Switched to 4-bit quantized models to reduce the footprint further
　　・ltx-2-19b-dev-fp8.safetensors → ltx-2-19b-dev-fp4.safetensors
　　・gemma_3_12B_it_fp8_scaled.safetensors → gemma_3_12B_it_fp4_mixed.safetensors
　※ The shared model locations are changed because this is a StabilityMatrix environment

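text2video

Loading the base workflow
・Download the site's workflow LTX-2_text2video_V2.json
・Change the "checkpoints" and "text_encoders" models to the 4-bit lightweight versions
・Tighten the layout horizontally so the whole graph stays visible on screen while it runs

Basic processing flow

Step1: text2video + audio
　・Generate the base video (+ audio)

Step2: Upscale (Hires.fix)
　・Upscale the generated video 2x and
　　refine it once more with video2video
　・The Upscale stage can also be bypassed, at the cost of quality

Step3: Decode
　・Decode the video and audio separately and output them

Revised "text2video basic" workflow (right: with Upscale bypassed)

　※ Workflow: "_video/" LTX-2_text2video_V2_004.json

Processing steps
1. Set the video resolution, length, and FPS

Decide the parameters of the video and audio to generate

・Enter the resolution, frame count, and FPS in "EmptyLTXVLatentVideo" and "LTXV Empty Latent Audio"
　(following the recommended settings)

・When the Upscale stage is used, set the resolution to half that of the output video

2. Enter the prompt

・A characteristic of the LTX series is that the prompt needs some care; otherwise the resulting video is unimpressive

・There is no strictly defined format

・Try describing the video you want as if writing a novel
　Reference → Prompting Guide for LTX-2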
　Prompt
A stylized 3D cartoon shot at the entrance of an upscale restaurant at night, warm lantern light spilling onto a polished stone doorway as soft city ambience hums in the background. The camera starts low and close to the front steps and slowly pushes forward toward the door, emphasizing glossy reflections and the cozy golden glow inside. A panda waiter in a crisp red vest and black bow tie steps into frame, grips the handle with a gentle motion, and opens the door wide with a welcoming flourish. The panda’s round face, bright expressive eyes, and friendly smile read clearly as it leans forward in a small polite bow and speaks in a warm, inviting voice: “Welcome to Restaurant Shanghai.” The camera continues its smooth push-in to a medium close-up on the panda’s face and upper body, with the softly lit interior behind it. Ambient audio includes a subtle door chime, quiet restaurant chatter, and the panda’s clear line delivery.

　Negative Prompt
blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, unreadable text on shirt or hat, incorrect lettering on cap (“PNTR”), incorrect t-shirt slogan (“JUST DO IT”), missing microphone, misplaced microphone, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, smiling, laughing, exaggerated sadness, wrong gaze direction, eyes looking at camera, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, missing sniff sounds, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, missing door or shelves, missing shallow depth of field, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts.

3. Sampling (first stage)

・The basics are "choose the step count and CFG, then sample"
　In this workflow the first stage runs at 20 steps / CFG 4.0

・A dedicated scheduler, LTXVScheduler, is used
　Its behavior is similar to linear_quadratic

・LTX-2 handles video and audio together
　🟫LTXVConcatAVLatent joins the video latent and the audio latent into a single latent

4. Upscaling the latent (x2)

・Upscale the video latent to 2x resolution
　using the dedicated model (ltx-2-spatial-upscaler-x2)

5. Sampling (second stage / video2video)

Refine the upscaled latent with a small number of steps
・The distilled-lora, which allows generation in 4-8 steps, is used here
　It is comparable to Lightning / Turbo for other models
　This workflow runs it at 3 steps
　CFG is changed to 1.0 to match

・Because Manual Sigma is used this is a little hard to see, but expressed in terms of the Simple scheduler
　the behavior is close to denoise = 0.47

6. Decode

Finally, decode the video and audio separately and write them out
・Split the latent into video and audio parts and decode each with the appropriate VAE
・Tiled VAE is used because VRAM is tight

Generated video (with audio)

2X Upscale (1408x1024 pixel)	No Upscale (704x512 pixel)

Generation time (min:sec) 11:38.87	Generation time (min:sec) 08:43.00

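text2video, 8 steps

Loading the base workflow
・Download the site's workflow LTX-2_text2video_distilled
・Change the "checkpoints" and "text_encoders" models to the 4-bit lightweight versions
・Tighten the layout horizontally so the whole graph stays visible on screen while it runs

Revised "text2video 8 steps" workflow

　※ Workflow: "_video/" LTX-2_text2video_distilled_005.json

Generated video (with audio)

ltx-2-19b-dev-fp4.safetensors	ltx-2-19b-distilled-fp8.safetensors

Generation time (min:sec) 24:46.75	Generation time (min:sec) 08:46.19
Even though the step count was reduced from 20 to 8, generation took more than twice as long. The reduction may not pay off because of insufficient VRAM. Switching the model from fp8 to fp4 appears to cause no problems in the same workflow.	Using the distilled model in the same workflow as the standard model seems to be a problem: the output quality is poor. How to handle this needs investigation.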

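image2video

Loading the base workflow
・Download the site's workflow LTX-2_image2video_distilled_V2.json
・Change the "checkpoints" and "text_encoders" models to the 4-bit lightweight versions
・Tighten the layout horizontally so the whole graph stays visible on screen while it runs

Basic processing flow
・The basic idea is "pin frame 1 to the input image and generate the rest"
　For example, the flow for a 121-frame video is as follows

　1. Create a container of 121 frames (8n+1)
　　🌫️ 🌫️ 🌫️ 🌫️ 🌫️ ... 🌫️

　2. Overwrite only frame 1 with the input image
　　🖼️ 🌫️ 🌫️ 🌫️ 🌫️ ... 🌫️

　3. Generate the remaining 120 frames
　　🖼️ ✨ ✨ ✨ ✨ ... ✨

　Starting from 🖼️, the frames that follow (✨) are filled in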
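
The three steps above can be pictured with a toy model in which each frame is just a label. This is purely conceptual: the real workflow operates on latents inside ComfyUI nodes, and the function name `build_frame_plan` and the labels are invented for illustration.

```python
def build_frame_plan(total_frames: int) -> list[str]:
    """Mimic the image2video setup: an empty container with frame 1 pinned."""
    if total_frames % 8 != 1:
        raise ValueError("frame count must be of the form 8n+1")
    frames = ["noise"] * total_frames  # 1. create the empty container
    frames[0] = "input_image"          # 2. overwrite only the first frame
    # 3. every frame still labelled "noise" is what the sampler has to generate
    return frames


plan = build_frame_plan(121)
print(plan[0], plan.count("noise"))  # input_image 120
```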

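Revised "image2video basic" workflow

　※ Workflow: "_video/" LTX-2_image2video_distilled_V2_007.json

Processing steps
1. Resize the input image (create two versions)

・Create a full-resolution version matching the desired final output resolution
　Resize to any size (here 1 MP, one megapixel)
　Make the width and height multiples of 64
　The first stage runs at 1/2 resolution, so multiples of 64 guarantee that the halved size is still a multiple of 32

・Also create a version with width and height halved, for the first (half-resolution) stage
　Enter this half-resolution width/height in EmptyLTXVLatentVideo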
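
The sizing rule (a full-resolution size in multiples of 64, plus a half-resolution size that is then automatically a multiple of 32) can be sketched as follows. This is an assumption-laden illustration: the rounding policy (round to the nearest multiple) and the definition of 1 MP as 1,000,000 pixels are choices made here, so the resize node used in the workflow may produce different sizes.

```python
import math


def stage_sizes(src_w: int, src_h: int, megapixels: float = 1.0,
                multiple: int = 64) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return ((full_w, full_h), (half_w, half_h)) for the two sampling stages."""
    # Scale factor that brings the image to roughly the requested pixel count.
    scale = math.sqrt(megapixels * 1_000_000 / (src_w * src_h))
    # Snap each side to the nearest multiple of 64 (assumed rounding policy).
    full_w = max(multiple, round(src_w * scale / multiple) * multiple)
    full_h = max(multiple, round(src_h * scale / multiple) * multiple)
    # Halving a multiple of 64 always yields a multiple of 32.
    return (full_w, full_h), (full_w // 2, full_h // 2)


full, half = stage_sizes(1280, 720, megapixels=1.0)
print(full, half)
```

The second tuple is the size to enter in EmptyLTXVLatentVideo for the first stage.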

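2. Preprocess the image

A trait carried over from LTX-Video: unlike still images, video is slightly degraded by compression, so
an image that is too clean can produce a video that does not move at all

・To avoid this, LTXVPreprocess deliberately degrades the image so that it resembles compressed video

3. LTXVImgToVideoInplace (insertion in the first stage)

This is the core of image2video

・Insert the image as frame 1 of the first-stage (half-resolution) video latent

4. Do the same on the upscale side (second stage)

Insert the image in the second stage as well

・Always connect this node after the spatial node
・Set strength to 1.0
　Lowering it makes the inserted image itself behave as if it were processed by image2image
　Use 1.0 when frame 1 must match the input image exactly

5. Enter the prompt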

　Prompt
The woman is briskly walking from deep in the street toward the foreground with sharp, rhythmic boot steps clicking on the stone pavement, while the camera is smoothly backing up to keep her centered as she advances. The steady footfalls and light coat-fabric rustle sit under a quiet city bed of distant traffic hiss and occasional muffled voices bouncing off the stone walls. As she comes closer, a British public telephone box appears along the sidewalk; the camera continues retreating as she angles toward it, the footsteps tightening in pace and echo. She pulls the door open and steps inside, the door creaking and closing with a hollow thud that slightly muffles the outside ambience. She lifts the receiver, a soft plastic scrape and cord shift audible, then inserts coins with distinct metallic clinks and dials the number with crisp clicks, followed by a steady dial tone turning into a faint ringback as she holds the handset to her ear and waits, breathing quietly.

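Generated video (with audio)

Input still image	Generated video

Generated image size

Input image size (pixel)	Size setting (MP: megapixels)	Generated image size (pixel)
1280x72	0.4	832 x 448
	0.6	1024 x 576
	0.8	1152 x 704
	1.0	1280 x 768

Notes

torch.OutOfMemoryError: out-of-memory error

Situation when it occurs

・Generation stops partway through and an error dialog is displayed
・Pressing "Run" again sometimes lets the generation finish without incident

When ComfyUI raises an out-of-memory (OOM) error but a rerun (pressing the Queue button again) succeeds, or roughly every other run succeeds, this is a symptom of unstable memory management that is common when VRAM (video memory) is at its limit

▼　Error log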

    :
!!! Exception during processing !!! Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 5.62 GiB
Requested               : 54.00 MiB
Device limit            : 8.00 GiB
Free (according to CUDA): 0 bytes
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB
Traceback (most recent call last):
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 518, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 329, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 303, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\execution.py", line 291, in process_inputs
    result = f(**inputs)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\comfy_extras\nodes_lt_upsampler.py", line 53, in upsample_latent
    upscale_model.to(device)  # TODO: use the comfy model management system.
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1381, in to
    return self._apply(convert)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 933, in _apply
    module._apply(fn)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 933, in _apply
    module._apply(fn)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 933, in _apply
    module._apply(fn)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 964, in _apply
    param_applied = fn(param)
  File "D:\StabilityMatrix\Data\Packages\ComfyUI\venv\lib\site-packages\torch\nn\modules\module.py", line 1367, in convert
    return t.to(
torch.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated     : 5.62 GiB
Requested               : 54.00 MiB
Device limit            : 8.00 GiB
Free (according to CUDA): 0 bytes
PyTorch limit (set by user-supplied memory fraction)
                        : 17179869184.00 GiB

Memory summary: 
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   5475 MiB |   5752 MiB |      0 B   |      0 B   |
|       from large pool |      0 MiB |      0 MiB |      0 B   |      0 B   |
|       from small pool |      0 MiB |      0 MiB |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |   5475 MiB |   5752 MiB |      0 B   |      0 B   |
|       from large pool |      0 MiB |      0 MiB |      0 B   |      0 B   |
|       from small pool |      0 MiB |      0 MiB |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Requested memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   8000 MiB |   8064 MiB |      0 B   |      0 B   |
|       from large pool |      0 MiB |      0 MiB |      0 B   |      0 B   |
|       from small pool |      0 MiB |      0 MiB |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

Got an OOM, unloading all loaded models.
Prompt executed in 329.08 seconds
got prompt
Requested to load VideoVAE
loaded completely; 5625.68 MB usable, 2331.69 MB loaded, full load: True
Requested to load LTXAV
loaded partially; 5264.00 MB usable, 5103.96 MB loaded, 8675.30 MB offloaded, 160.04 MB buffer reserved, lowvram patches: 1264
100%|██████████| 3/3 [00:44<00:00, 14.76s/it]
Requested to load AudioVAE
loaded completely; 519.71 MB usable, 415.20 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 2331.69 MB offloaded, 648.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 95.35 seconds
    :
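
The "loaded partially; ..." lines in the log show how much of a model fit in VRAM and how much was offloaded, which is a quick indicator of how tight memory is. The following sketch extracts those numbers with a regular expression; it assumes the log line format shown above and is not part of ComfyUI, and the function names are illustrative.

```python
import re

# Matches fragments such as "5103.96 MB loaded" or "160.04 MB buffer reserved".
_FIELD = re.compile(r"([\d.]+) MB (usable|loaded|offloaded|buffer reserved)")


def parse_load_line(line: str) -> dict[str, float]:
    """Extract the MB figures from a 'loaded partially/completely' log line."""
    return {name: float(value) for value, name in _FIELD.findall(line)}


def offloaded_fraction(stats: dict[str, float]) -> float:
    """Share of the model weights that did not fit in VRAM."""
    loaded = stats.get("loaded", 0.0)
    offloaded = stats.get("offloaded", 0.0)
    total = loaded + offloaded
    return offloaded / total if total else 0.0


line = ("loaded partially; 5264.00 MB usable, 5103.96 MB loaded, "
        "8675.30 MB offloaded, 160.04 MB buffer reserved, lowvram patches: 1264")
stats = parse_load_line(line)
print(stats["loaded"], stats["offloaded"], round(offloaded_fraction(stats), 2))
```

For the LTXAV line in the log, roughly 63% of the weights were offloaded, which is consistent with the 8 GB device limit reported in the error.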

Change history

2026/01/22 First version

References

ComfyUI LTX-2

ComfyUI LTX-2, low memory
- LTX-2 環境構築トレンド1/21版 (LTX-2 environment setup trends, 1/21 edition)
