Really? Yup. This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text. Audio processing happens downstream of the model. It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques like img2img, inpainting, negative prompts, and interpolation work out of the box. Code: https://git
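
To make "audio processing happens downstream of the model" a bit more concrete, here is a minimal sketch of turning a spectrogram-like grid of magnitudes into audio samples. It is not the project's code: it uses plain additive synthesis (one sine wave per frequency row) as a stand-in for whatever reconstruction the project actually performs, and the function name, the row layout (row 0 is the lowest frequency), and all parameters are hypothetical choices made for illustration.

```python
import math


def spectrogram_to_samples(spec, sample_rate=8000, frame_len=400, max_freq=2000.0):
    """Turn a magnitude spectrogram into audio samples by additive synthesis.

    spec[row][col] is the magnitude of frequency bin `row` at time frame `col`.
    Row 0 is the lowest frequency (an assumption of this sketch).
    """
    n_bins = len(spec)
    n_frames = len(spec[0])
    # Assumed layout: bin frequencies spaced linearly up to max_freq.
    freqs = [max_freq * (b + 1) / n_bins for b in range(n_bins)]

    samples = []
    for i in range(n_frames * frame_len):
        frame = i // frame_len  # which spectrogram column this sample falls in
        t = i / sample_rate     # time of this sample in seconds
        # Sum one sine wave per bin, weighted by that bin's magnitude.
        value = sum(
            spec[b][frame] * math.sin(2 * math.pi * freqs[b] * t)
            for b in range(n_bins)
        )
        samples.append(value)

    # Normalise to [-1, 1]; a silent spectrogram stays all zeros.
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


# A toy 2x2 "image": a low tone in the first frame, a higher tone in the second.
toy_spec = [
    [1.0, 0.0],  # low-frequency bin
    [0.0, 1.0],  # high-frequency bin
]
audio = spectrogram_to_samples(toy_spec)
```

The resulting list of floats could then be scaled to 16-bit integers and written out with Python's standard `wave` module. A real pipeline would also have to estimate phase, which this sketch ignores.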