TextEncodeHunyuanVideo_ImageToVideo - ComfyUI Built-in Node Documentation

The TextEncodeHunyuanVideo_ImageToVideo node creates conditioning data for video generation by combining text prompts with image embeddings. It uses a CLIP model to process both the text input and visual information from a CLIP vision output, then generates tokens that blend these two sources according to the specified image interleave setting.

Inputs

Parameter	Data Type	Required	Range	Description
`clip`	CLIP	Yes	-	The CLIP model used for tokenization and encoding
`clip_vision_output`	CLIP_VISION_OUTPUT	Yes	-	The visual embeddings from a CLIP vision model that provide image context
`prompt`	STRING	Yes	-	The text description to guide the video generation. Supports multiline input and dynamic prompts. The prompt is formatted using a template that asks the model to describe the video based on the reference image, covering aspects like main content, object details, actions, background, and camera angles.
`image_interleave`	INT	Yes	1-512	Controls how strongly the image influences the result relative to the text prompt. Higher values give the text prompt more influence. (default: 2)

Outputs

Output Name	Data Type	Description
`CONDITIONING`	CONDITIONING	The conditioning data that combines text and image information for video generation
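
As an illustration of how the inputs above fit together, here is a minimal sketch of this node as an entry in an API-format workflow, written as a Python dictionary. The helper name `build_text_encode_node`, the node ids `"10"`, `"11"`, and `"12"`, and the example prompt are hypothetical and chosen only for this example; the upstream nodes that supply `clip` and `clip_vision_output` are assumed to exist elsewhere in the workflow. The range check mirrors the 1-512 range listed in the Inputs table.

```python
import json

NODE_CLASS = "TextEncodeHunyuanVideo_ImageToVideo"


def build_text_encode_node(clip_link, clip_vision_output_link, prompt, image_interleave=2):
    """Return an API-format workflow entry for this node (illustrative helper).

    Each link is a (node_id, output_index) pair pointing at an upstream node.
    """
    # Reject values outside the documented range (bool is excluded explicitly
    # because Python treats it as a subclass of int).
    if isinstance(image_interleave, bool) or not isinstance(image_interleave, int):
        raise ValueError("image_interleave must be an integer")
    if not 1 <= image_interleave <= 512:
        raise ValueError("image_interleave must be between 1 and 512")
    return {
        "class_type": NODE_CLASS,
        "inputs": {
            "clip": list(clip_link),
            "clip_vision_output": list(clip_vision_output_link),
            "prompt": prompt,
            "image_interleave": image_interleave,
        },
    }


# Hypothetical ids: "10" provides CLIP, "11" provides CLIP_VISION_OUTPUT.
workflow_fragment = {
    "12": build_text_encode_node(
        ("10", 0), ("11", 0), "A cat walks across a sunlit room"
    ),
}
print(json.dumps(workflow_fragment, indent=2))
```

The node's `CONDITIONING` output would then be referenced by downstream nodes as `["12", 0]`, using the same link convention.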
