- Stony Brook University
- NY
- www3.cs.stonybrook.edu/~kkahatapitiy/
- @kkahatapitiy
Stars
FORA introduces a simple yet effective caching mechanism in the Diffusion Transformer architecture for faster inference sampling.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
Clarity: A Minimalist Website Template for AI Research
Language Repository for Long Video Understanding
SGLang is a fast serving framework for large language models and vision language models.
Unofficial Implementation of "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs"
Official inference repo for FLUX.1 models
Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"
[CVPR2024 Highlight] VBench - We Evaluate Video Generation
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, with strong general video understanding capability.
Latte: Latent Diffusion Transformer for Video Generation.
[ECCV 2024] Official Implementation of CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
Official repo for AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI
VideoSys: An easy and efficient system for video generation
GIF encoder based on libimagequant (pngquant). Squeezes maximum possible quality from the awful GIF format.
[NeurIPS 2021 Spotlight] Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.
Open-Sora: Democratizing Efficient Video Production for All
Lumina-T2X is a unified framework for Text to Any Modality Generation
Stable Diffusion web UI
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts