Lyric Videos
A creative fusion of AI visuals and music. Each video is generated using Stable Diffusion based on lyrics.
Project Overview
This project was initially conceived as a creative pipeline for producing TikTok content using Stable Diffusion. Over the course of a year, I generated 102 short videos, which can be broadly categorized into two types: ControlNet Lyric Videos and Sound Reactive Animations.
Examples of each are embedded on the right. You can pause them or scroll below to explore a detailed breakdown of the process used to create each format.
ControlNet Lyric Videos
The first video example demonstrates the use of ControlNet in combination with Stable Diffusion to render text that appears visually embedded within generated imagery. This effect was achieved by overlaying masked text as part of the image input to the model, allowing the content of the lyrics to be aesthetically integrated with the surrounding visuals.
To prepare these text inputs, I used the Python library Pillow to programmatically generate PNG slides containing single lyric phrases. These slides served as foundational input for the image generation workflow. The core function used for this task is shown below:
from PIL import Image, ImageDraw, ImageFont

def generate_image(word, font_path):
    # Blank white portrait canvas sized to match the Stable Diffusion input.
    width, height = 512, 910
    image = Image.new('RGB', (width, height), color='white')

    # Start with a large font and shrink it until the lyric fits the canvas.
    font_size = 100
    font = ImageFont.truetype(font_path, font_size)
    while font.getbbox(word)[2] > width or font.getbbox(word)[3] > height:
        font_size -= 1
        font = ImageFont.truetype(font_path, font_size)

    # Measure the rendered text and draw it centered in black.
    draw = ImageDraw.Draw(image)
    left, top, right, bottom = draw.textbbox((0, 0), word, font=font)
    text_width, text_height = right - left, bottom - top
    x = (width - text_width) / 2
    y = (height - text_height) / 2
    draw.text((x, y), word, fill='black', font=font)
    return image
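For completeness, the function can be driven by a simple loop that writes one slide per lyric phrase. This is only an illustrative sketch; the phrase list, font path, and output directory below are placeholders rather than the actual project data:

import os

# Illustrative driver: one PNG slide per lyric phrase.
# Phrases, font path, and output directory are placeholders.
os.makedirs("slides", exist_ok=True)
phrases = ["first lyric phrase", "second lyric phrase"]
for i, phrase in enumerate(phrases):
    slide = generate_image(phrase, "fonts/SomeFont.ttf")
    slide.save(f"slides/{i:03d}.png")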
Each generated text slide was then used within the Automatic1111 Stable Diffusion GUI to create ControlNet canny masks. Using the batch processing functionality, I produced six unique image variations per word, allowing for subtle visual diversity while maintaining a consistent structure.
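For reference, the kind of edge map the canny preprocessor derives from a lyric slide can be reproduced outside the GUI with OpenCV. This is only a sketch of that preprocessing step; the thresholds and file paths are illustrative, and the project itself relied on the ControlNet extension's built-in preprocessor:

import cv2

# Approximate the ControlNet canny preprocessing on one lyric slide.
# Threshold values and paths are illustrative.
slide = cv2.imread("slides/000.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(slide, 100, 200)
cv2.imwrite("slides/000_canny.png", edges)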
Finally, the output sequences were imported into DaVinci Resolve, where each lyric slide was aligned with the corresponding segment of the audio track to form the completed video.
Sound Reactive Videos
The second video, shown to the right, is a sound-reactive animation generated using Parseq in conjunction with the Deforum extension for the Automatic1111 Stable Diffusion Web UI.
Deforum generates animations frame by frame, feeding each rendered image back into the model so that every new frame blends smoothly with the one before it. While this approach is less technically advanced than recent AI video models, it offers a uniquely stylized and expressive aesthetic that I find visually compelling.
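Conceptually, this is an img2img feedback loop: each generated frame becomes the init image for the next. The sketch below uses the diffusers library to illustrate the idea; it is not Deforum's actual implementation, and the model ID, prompt, strength value, and paths are placeholder assumptions:

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Minimal img2img feedback loop: every output frame is fed back in as the
# next frame's starting image, which keeps consecutive frames coherent.
# Model ID, prompt, strength, and paths are placeholders.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("seed_frame.png").convert("RGB")
for i in range(120):
    frame = pipe(
        prompt="dreamlike watercolor landscape",
        image=frame,
        strength=0.45,  # in diffusers, higher strength = more change per frame
    ).images[0]
    frame.save(f"frames/{i:04d}.png")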
Parseq is a web-based keyframing tool that integrates with Deforum to provide precise control over animation parameters. It allows for camera movement, prompt scheduling, generation strength adjustment (i.e., how closely a frame adheres to the prior image), and much more. Additionally, Parseq supports audio integration: users can upload music tracks, isolate specific events (such as drum hits, vocal peaks, or instrumental cues), and map those events to changes in visual parameters across the animation timeline.
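Parseq performs this analysis in the browser, but the underlying idea can be sketched in Python: detect audio events, then emit a Deforum-style keyframe schedule for a parameter such as strength. The onset detection, frame rate, and strength values below are illustrative assumptions, not the project's actual settings:

import librosa

# Detect onsets (e.g. drum hits) and map them to a Deforum-style keyframe
# schedule such as "0:(0.65), 42:(0.3), ...". Values are placeholders; in
# Deforum, higher strength means a frame adheres more closely to the last one.
y, sr = librosa.load("track.mp3")
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

fps = 15
keyframes = {0: 0.65}                      # baseline strength
for t in onset_times:
    keyframes[int(round(t * fps))] = 0.3   # loosen adherence on each hit

strength_schedule = ", ".join(f"{f}:({s})" for f, s in sorted(keyframes.items()))
print(strength_schedule)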
Embedded TikTok (@haystormjuno): "Messing around with ControlNet," set to "Twin Size Mattress" by The Front Bottoms.
Embedded TikTok (@haystormjuno): "A Lack of Color" by Death Cab for Cutie.