The TextEncodeHunyuanVideo_ImageToVideo node creates conditioning data for video generation by combining a text prompt with image embeddings. It uses a CLIP model to process the text input together with the visual information from a CLIP vision output, then generates tokens that blend the two sources according to the `image_interleave` setting.
Inputs
| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| clip | CLIP | Yes | - | The CLIP model used for tokenization and encoding |
| clip_vision_output | CLIP_VISION_OUTPUT | Yes | - | The visual embeddings from a CLIP vision model that provide image context |
| prompt | STRING | Yes | - | The text description that guides the video generation. Supports multiline input and dynamic prompts. The prompt is formatted using a template that asks the model to describe the video based on the reference image, covering aspects such as main content, object details, actions, background, and camera angles. |
| image_interleave | INT | Yes | 1-512 | Controls the balance between the image and the text prompt. Higher values give the text prompt more influence relative to the image. (default: 2) |
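
As a rough illustration of how these inputs fit together, the sketch below builds an entry for this node in ComfyUI's API-format workflow, assuming the usual layout in which each node is a dict with `class_type` and `inputs`, and links to upstream nodes are `[node_id, output_index]` pairs. The helper name `build_i2v_text_encode_node`, the node ids, and the example prompt are hypothetical and only serve the example; the range check mirrors the 1-512 range listed in the table.

```python
# Hypothetical helper (not part of ComfyUI): builds one API-format node entry.
def build_i2v_text_encode_node(clip_ref, clip_vision_ref, prompt, image_interleave=2):
    """Return a workflow entry for TextEncodeHunyuanVideo_ImageToVideo.

    clip_ref and clip_vision_ref are [node_id, output_index] links
    to the upstream nodes that provide CLIP and CLIP_VISION_OUTPUT.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(image_interleave, bool) or not isinstance(image_interleave, int):
        raise TypeError("image_interleave must be an int")
    # Range taken from the inputs table above.
    if not 1 <= image_interleave <= 512:
        raise ValueError("image_interleave must be in the range 1-512")
    return {
        "class_type": "TextEncodeHunyuanVideo_ImageToVideo",
        "inputs": {
            "clip": list(clip_ref),
            "clip_vision_output": list(clip_vision_ref),
            "prompt": prompt,
            "image_interleave": image_interleave,
        },
    }


# Node ids "1" and "2" are placeholders for a CLIP loader node and a
# CLIP vision encode node defined elsewhere in the workflow.
workflow_fragment = {
    "3": build_i2v_text_encode_node(
        ["1", 0], ["2", 0], "A cat walks across a sunny garden"
    ),
}
```

Leaving `image_interleave` at its default of 2 matches the node's documented default; raising it shifts influence toward the text prompt, as described in the table.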
Outputs
| Output Name | Data Type | Description |
|---|---|---|
| CONDITIONING | CONDITIONING | The conditioning data that combines text and image information for video generation |