The AI Acceleration AccelerationCloud

Train, fine-tune-and run inference on AI models blazing fast, at low cost, and at production scale.

Trusted by

200+ generative AI models

Build with open-source and specialized multimodal models for chat, images, code, and more. Migrate from closed models with OpenAI-compatible APIs.

together.ai

Chat

The Llama 3.2-Vision collection features multimodal LLMs (11B and 90B) optimized for visual recognition, image reasoning, captioning, and answering image-related questions.

Image

Premium image generation model by Black Forest Labs.

Chat

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Chat

Powerful decoder-only models available in 7B and 72B variants, developed by Alibaba Cloud's Qwen team for advanced language processing.

Chat

This LLM is customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.

Language

The Mixtral-8x22B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.

Chat

DBRX Instruct is a mixture-of-experts (MoE) large language model trained from scratch by Databricks. DBRX Instruct specializes in few-turn.

Chat

Arctic is a dense-MoE Hybrid transformer architecture pre-trained from scratch by the Snowflake AI Research Team.

Chat

A hybrid architecture composed of multi-head, grouped-query attention and gated convolutions arranged in Hyena blocks, different from traditional decoder-only Transformers

Chat

Free endpoint to try Llama 3.2 11B.

Free

Chat

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction tuned generative models in 8B, 70B and 405B sizes, that outperform many available open source and closed chat models on common industry benchmarks.

Chat

Trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese.

Chat

The Llama 3.2-Vision collection features multimodal LLMs (11B and 90B) optimized for visual recognition, image reasoning, captioning, and answering image-related questions.

Image

Free endpoint for the SOTA open-source image generation model by Black Forest Labs.

Free

Image

Fastest available endpoint for the SOTA open-source image generation model by Black Forest Labs.

Chat

Experimental research model by Alibaba's Qwen team focused on enhancing AI reasoning capabilities.

Embeddings

An universal English sentence embedding model by WhereIsAI. Its embedding dimension is 1024, it takes up to 512 context length.

Chat

A transformer-based decoder-only language model pre-trained on a large amount of data. In comparison with the previously released Qwen

Language

The Yi series models are large language models trained from scratch by developers at 01.AI

Code

SOTA code LLM with advanced code generation, reasoning, fixing, and support for up to 128K tokens.

Chat

The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.

Embeddings

An 80M checkpoint of M2-BERT, pretrained with sequence length 32768, and it has been fine-tuned for long-context retrieval.

Image

A text-to-image generative AI model that excels at creating 1024x1024 images.

Chat

instruct fine-tuned version of Mistral-7B-v0.1

Language

NexusRaven is an open-source and commercially viable function calling LLM that surpasses the state-of-the-art in function calling capabilities.

Chat

First Nous collection of dataset and models made by fine-tuning mostly on data created by Nous in-house

Chat

Code Llama is a family of large language models for code based on Llama 2 providing infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks.

Language

Designed for few-shot prompts, fine-tuned over the RedPajama-INCITE-Base-7B-v1 base model.

Chat

Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.

Chat

Nous-Hermes-Llama2-13b is a state-of-the-art language model fine-tuned on over 300,000 instructions.

Chat

This model achieves a substantial and comprehensive improvement on coding, mathematical reasoning and open-domain conversation capacities

Embeddings

BAAI general embedding - large, english - model v1.5. FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search. And it also can be used in vector databases for LLMs.

Embeddings

An 80M checkpoint of M2-BERT, pretrained with sequence length 2048, and it has been fine-tuned for long-context retrieval.

Embeddings

An 80M checkpoint of M2-BERT, pretrained with sequence length 8192, and it has been fine-tuned for long-context retrieval.

Code

This model empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code.

Embeddings

BAAI general embedding - base, english - model v1.5. FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search. And it also can be used in vector databases for LLMs.

Chat

Extending LLaMA-2 to 32K context, built with Meta's Position Interpolation and Together AI's data recipe and system optimizations, instruction tuned by Together

Language

7.3B parameter model that outperforms Llama 2 13B on all benchmarks, approaches CodeLlama 7B performance on code, Uses Grouped-query attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at smaller cost

Code

Chat

Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.

Language

Phind-CodeLlama-34B-v1 trained on additional 1.5B tokens high-quality programming-related data proficient in Python, C/C++, TypeScript, Java, and more.

Language

Language model trained on 2 trillion tokens with double the context length of Llama 1. Available in three sizes: 7B, 13B and 70B parameters

Code

Language

Extending LLaMA-2 to 32K context, built with Meta's Position Interpolation and Together AI's data recipe and system optimizations.

Chat

This model is a 75/25 merge of Chronos (13B) and Nous Hermes (13B) models resulting in having a great ability to produce evocative storywriting and follow a narrative.

Rerank

Salesforce Research's proprietary fine-tuned rerank model with 8K context, outperforming Cohere Rerank for superior document retrieval.

Chat

An instruction fine-tuned LLaMA-2 (70B) model by merging Platypus2 (70B) by garage-bAInd and LLaMA-2 Instruct v2 (70B) by upstage.

Language

This model achieves a substantial and comprehensive improvement on coding, mathematical reasoning and open-domain conversation capacities.

Chat

MythoLogic-L2 and Huginn merge using a highly experimental tensor type merge technique. The main difference with MythoMix is that I allowed more of Huginn to intermingle with the single tensors located at the front and end of a model

Chat

7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B-Chat is a large-model-based AI assistant, which is trained with alignment techniques.

Language

7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc.

Image

First generation premium image generation model by Black Forest Labs.

Chat

Chat model fine-tuned using data from Dolly 2.0 and Open Assistant over the RedPajama-INCITE-Base-7B-v1 base model.

Embeddings

Pretrained model on English language using a masked language modeling (MLM) objective. The embedding dimension is 768, and the number of model parameters is 110M. This model is uncased: it does not make a difference between english and English.

Language

Base model that aims to replicate the LLaMA recipe as closely as possible (blog post).

Embeddings

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO dataset. Its embedding dimension is 768 with 512 max context length.

Language

This model can be used to moderate other chatbot models. Built using GPT-JT model fine-tuned on Ontocord.ai's OIG-moderation dataset v0.1.

Language

Fork of GPT-J instruction tuned to excel at few-shot prompts (blog post).

Chat

Chat model fine-tuned from EleutherAI’s GPT-NeoX with over 40 million instructions on carbon reduced compute.

Language

An auto-regressive language model, based on the transformer architecture. The model comes in different sizes: 7B, 13B, 33B and 65B parameters.

Chat

Falcon-40B-Instruct is a causal decoder-only model built by TII based on Falcon-40B and finetuned on a mixture of Baize.

Language

Falcon-40B is a causal decoder-only model built by TII and trained on 1,000B tokens of RefinedWeb enhanced with curated corpora.

Chat

Chat model based on EleutherAI’s Pythia-7B model, and is fine-tuned with data focusing on dialog-style interactions.

End-to-end platform for the full generative AI lifecycle

Leverage pre-trained models, fine-tune them for your needs, or build custom models from scratch. Whatever your generative AI needs, Together AI offers a seamless continuum of AI compute solutions to support your entire journey.

Inference
The fastest way to launch AI models:
- ✔ Serverless or dedicated endpoints
- ✔ Deploy in enterprise VPC or on-prem
- ✔ SOC 2 and HIPAA compliant
Fine-Tuning
Tailored customization for your tasks
- ✔ Complete model ownership
- ✔ Fully tune or adapt models
- ✔ Easy-to-use APIs
- Full Fine-Tuning
- LoRA Fine-Tuning
GPU Clusters
Full control for massive AI workloads
- ✔ Accelerate large model training
- ✔ GB200, H200, and H100 GPUs
- ✔ Pricing from $1.75 / hour

Run
models

Train 
Models

Speed, cost, and accuracy. Pick all three.

SPEED RELATIVE TO VLLM

4x FASTER

LLAMA-3 8B AT FULL PRECISION

400 TOKENS/SEC

COST RELATIVE TO GPT-4o

11x lower cost

Why Together Inference

accelerated by cutting edge research

Transformer-optimized kernels: our researchers' custom FP8 inference kernels, 75%+ faster than base PyTorch

‍
Quality-preserving quantization: accelerating inference while maintaining accuracy with advances such as QTIP
‍
Speculative decoding: faster throughput, powered by novel algorithms and draft models trained on RedPajama dataset
Flexibility to choose a model that fits your needs
Turbo: Best performance without losing accuracy
‍
Reference: Full precision, available for 100% accuracy
‍
Lite: Optimized for fast performance at the lowest cost
Available via Dedicated instances and serverless API
Dedicated instances: fast, consistent performance, without rate limits, on your own single-tenant NVIDIA GPUs
‍
Serverless API: quickly switch from closed LLMs to models like Llama, using our OpenAI compatible APIs

Control your IP.
‍Own your AI.

Fine-tune open-source models like Llama on your data and run them on Together Cloud, in a hyperscaler VPC, or on-prem. With no vendor lock-in, your AI remains fully under your control.

together files upload acme_corp_customer_support.jsonl
  
{
  "filename" : "acme_corp_customer_support.json",
  "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
  "object": "file"
}
  
  
together finetune create --training-file file-aab9997-bca8-4b7e-a720-e820e682a10a
--model together compute/RedPajama-INCITE-7B-Chat

together finetune create --training-file $FILE_ID 
--model $MODEL_NAME 
--wandb-api-key $WANDB_API_KEY 
--n-epochs 10 
--n-checkpoints 5 
--batch-size 8 
--learning-rate 0.0003
{
    "training_file": "file-aab9997-bca8-4b7e-a720-e820e682a10a",
    "model_output_name": "username/togethercomputer/llama-2-13b-chat",
    "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
    "Suffix": "Llama-2-13b 1",
    "model": "togethercomputer/llama-2-13b-chat",
    "n_epochs": 4,
    "batch_size": 128,
    "learning_rate": 1e-06,
    "checkpoint_steps": 2,
    "created_at": 1687982945,
    "updated_at": 1687982945,
    "status": "pending",
    "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
    "epochs_completed": 3,
    "events": [
        {
            "object": "fine-tune-event",
            "created_at": 1687982945,
            "message": "Fine tune request created",
            "type": "JOB_PENDING",
        }
    ],
    "queue_depth": 0,
    "wandb_project_name": "Llama-2-13b Fine-tuned 1"
}

Fine-tuning API

Forge the AI frontier. Train on expert-built clusters.

Built by AI researchers for AI innovators, Together GPU Clusters are powered by NVIDIA GB200, H200, and H100 GPUs, along with the Together Kernel Collection — delivering up to 24% faster training operations.

Top-Tier NVIDIA GPUs
NVIDIA's latest GPUs, like GB200, H200, and H100, for peak AI performance, supporting both training and inference.
Accelerated Software Stack
The Together Kernel Collection includes custom CUDA kernels, reducing training times and costs with superior throughput.
High-Speed Interconnects
InfiniBand and NVLink ensure fast communication between GPUs, eliminating bottlenecks and enabling rapid processing of large datasets.
Highly Scalable & Reliable
Deploy 16 to 1000+ GPUs across global locations, with 99.9% uptime SLA.
Expert AI Advisory Services
Together AI’s expert team offers consulting for custom model development and scalable training best practices.
Robust Management Tools
Slurm and Kubernetes orchestrate dynamic AI workloads, optimizing training and inference seamlessly.

Together GPU Clusters

Training-ready clusters – H100, H200, or A100

Reserve your cluster today

THE AI ACCELERATION CLOUD

BUILT ON LEADING AI RESEARCH.

Innovations

Our research team is behind breakthrough AI models, datasets, and optimizations.

See all research

Cocktail SGD

With Cocktail SGD, we’ve addressed a key hindrance to training generative AI models in a distributed environment: networking overhead. Cocktail SGD is a set of optimizations that reduces network overhead by up to 117x.

FlashAttention-3

FlashAttention-3 achieves up to 75% GPU utilization on H100s, making AI models up to 2x faster and enabling efficient processing of longer text inputs. It allows for faster training and inference of LLMs, supports lower precision operations for improved efficiency.

RedPajama

Our RedPajama project enables leading generative AI models to be available as fully open-source. The RedPajama models have been downloaded millions of times, and the RedPajama dataset has been used to create over 500 leading models.

Sub-quadratic model architectures

In close collaboration with Hazy Research, we’re working on the next core architecture for generative AI models that will provide even faster performance with longer context. Our research published in this area includes Striped Hyena, Monarch Mixer, and FlashConv.

Customer Stories

See how we support leading teams around the world. Our customers are creating innovative generative AI applications, faster.

Pika creates the next gen text-to-video models on Together GPU Clusters

Nexusflow uses Together GPU Clusters to build cybersecurity models

Arcee builds domain adaptive language models with Together Custom Models

Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and Open AI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and Open AI. Mosaic ML comparison based on published numbers in Mosaic ML blog. Detailed results and methodology published here.
Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
Based on published pricing November 8th, 2023, comparing Open AI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes equal number of input and output tokens.
Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
Testing methodology and results published in this research paper.
Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps Infiniband networking configuration.

The AI Acceleration AccelerationCloud

200+ generative AI models

Llama 3.2 90B

FLUX1.1 [pro]

Gemma 2 Instruct

Qwen 2.5 72B

Llama 3.1 Nemotron 70B Instruct

Mixtral-8x22B

DBRX-Instruct

Arctic-Instruct

Striped Hyena Nous

Llama 3.2 11B Free

Llama 3.1 8B

Deepseek-67B

Llama 3.2 11B

FLUX.1 [schnell] Free

FLUX.1 [schnell]

Llama 3.1 405B

Qwen QwQ 32B Preview

UAE-Large v1

Qwen 2

01-AI Yi

Qwen 2.5 Coder 32B Instruct

Mixtral 8x7B

M2-BERT 80M 32K Retrieval

Stable Diffusion XL 1.0

Mistral Instruct

Nexus Raven

Nous Capybara

Code Llama Instruct

RedPajama-INCITE Instruct

Vicuna v1.5 16K

Nous Hermes Llama-2

Wizard LM

BGE-Large-EN v1.5

M2-BERT 80M 2K Retrieval

M2-BERT 80M 8K Retrieval

WizardCoder Python v1.0

BGE-Base-EN v1.5

LLaMA-2-7B-32K-Instruct

Mistral

Code Llama Python

Vicuna v1.5

Phind Code LLaMA v2

LLaMA-2

Code Llama

LLaMA-2-32K

Chronos Hermes

Salesforce LlamaRank

Platypus2 Instruct

WizardLM v1.0 (70B)

MythoMax-L2

Qwen-Chat

Qwen

FLUX.1 [pro]

RedPajama-INCITE Chat

BERT

RedPajama-INCITE

Sentence-BERT

GPT-JT-Moderation

GPT-JT

GPT-NeoXT-Chat-Base

LLaMA

Falcon Instruct

Falcon

Pythia-Chat-Base

End-to-end platform for the full generative AI lifecycle

Inference

Fine-Tuning

GPU Clusters

Speed, cost, and accuracy. Pick all three.

SPEED RELATIVE TO VLLM

LLAMA-3 8B AT FULL PRECISION

COST RELATIVE TO GPT-4o

Why Together Inference

accelerated by cutting edge research

Flexibility to choose a model that fits your needs

Available via Dedicated instances and serverless API

Control your IP. ‍Own your AI.

Forge the AI frontier. Train on expert-built clusters.

Control your IP.
‍Own your AI.

THE AI ACCELERATION CLOUD