Data Sources¶
When you get data into Tinybird, it's stored in a Data Source. You then write SQL queries to explore the data from the Data Source.
For example, if your event data lives in a Kafka topic, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then create a Pipe to query fresh event data.
A Data Source can also be the result of materializing a SQL query through a Pipe.
Create Data Sources¶
You can use Tinybird's UI, CLI, and API to create Data Sources.
Using the UI¶
Follow these steps to create a new Data Source:
- In your Workspace, go to Data Sources.
- Select + to add a new Data Source.
Using the CLI¶
You can create a Data Source using the tb datasource command. See tb datasource in the CLI reference.
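For example, you can define a Data Source in a .datasource file and push it with the CLI. The following is a minimal sketch, assuming an authenticated CLI session; the file name events.datasource and its columns are hypothetical:

```shell
# Create a minimal .datasource file (hypothetical schema)
cat > events.datasource <<'EOF'
SCHEMA >
    `timestamp` DateTime,
    `event` String

ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"
EOF

# Push the new Data Source to your Workspace
tb push events.datasource
```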
Set the Data Source TTL¶
You can apply a TTL (Time To Live) to a Data Source in Tinybird. Use a TTL to define how long you want to store data.
For example, you can define a TTL of 7 days, which means that any data older than 7 days is deleted automatically.
You must define the TTL at the time of creating the Data Source, and your data must have a column with a type that represents a date. Valid types are any of the Date or Int types.
Using the UI¶
If you are using the Tinybird Events API and want to use a TTL, create the Data Source with a TTL first before sending data.
Follow these steps to set a TTL using the Tinybird UI:
- Select Advanced Settings.
- Open the TTL menu.
- Select a column that represents a date.
- Define the TTL period in days.
If you need to apply transformations to the date column, or want to use more complex logic, select the Code editor tab and enter SQL code to define your TTL.
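For example, a TTL expression entered in the code editor might transform a raw timestamp column before computing the expiry. This is a sketch; the column name created_at is hypothetical:

```sql
-- Keep only the last 30 days, computed from a transformed date column
toDate(created_at) + toIntervalDay(30)
```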
Using the CLI¶
Follow these steps to set a TTL using the Tinybird CLI:
- Create a new Data Source and .datasource file using the tb datasource command.
- Edit the .datasource file you've created.
- Go to the Engine settings.
- Add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes.
- Save the file.
The following example shows a .datasource file with TTL defined:
SCHEMA >
    `date` DateTime,
    `product_id` String,
    `user_id` Int64,
    `event` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, user_id, event, extra_data"
ENGINE_TTL "date + toIntervalDay(90)"
Change Data Source TTL¶
You can modify the TTL of an existing Data Source, either by adding a new TTL or by updating an existing TTL.
Using the UI¶
Follow these steps to modify a TTL using the Tinybird UI:
- Go to the Data Source details page by clicking on the Data Source with the TTL you wish to change.
- Select the Schema tab.
- Select the TTL text.
- In the dialog that opens, select the TTL menu.
- Select the field to use for the TTL.
- Change the TTL interval.
- Select Save.
The updated TTL value appears in the Data Source's schema page.
Using the CLI¶
Follow these steps to modify a TTL using the Tinybird CLI:
- Open the .datasource file.
- Go to the Engine settings.
- If ENGINE_TTL doesn't exist, add it and enter your TTL enclosed in double quotes.
- If a TTL is already defined, modify the existing setting.
The following is an example TTL setting:
ENGINE_TTL "date + toIntervalDay(90)"
When ready, save the .datasource file and push the changes to Tinybird using the CLI:
tb push DATA_SOURCE_FILE -f
Share a Data Source¶
Workspace administrators can share a Data Source with another Workspace they have access to in the same region and cluster.
To share a Data Source, follow these steps:
- Find the Data Source you want to share in your Data Project.
- Select the More actions (⋯) icon next to the Data Source.
- Select Share.
- Type the Workspace name or ID.
- Select Share.
You can use the shared Data Source to create Pipes and Materialized Views in the target Workspace. Users that have access to a shared Data Source can access the tinybird.datasources_ops_log and tinybird.kafka_ops_log Service Data Sources.
Limitations¶
The following limitations apply to shared Data Sources:
- Shared Data Sources are read-only.
- You can't share a shared Data Source, only the original.
- You can't check the quarantine of a shared Data Source.
Supported engines¶
Tinybird features different strategies to store data, which define where and how the data is stored and also what kind of data access, queries, and availability your data has. A Tinybird Data Source uses a table engine that determines those factors.
See Engines.
Supported data types¶
Data types define how values are stored and processed in a database. They determine what kind of data can be stored in a column (like numbers, text, dates, etc.), how much storage space the data will use, and what operations can be performed on the values. Choosing the right data type is important for both data integrity and query performance.
See Data types.
Set a different codec¶
Tinybird applies compression codecs to data types to optimize performance. You can override the default compression codecs by adding the CODEC(<codec>) statement after the type declarations in your .datasource schema. For example:
SCHEMA >
    `product_id` Int32 `json:$.product_id`,
    `timestamp` DateTime64(3) `json:$.timestamp` CODEC(DoubleDelta, ZSTD(1)),
Supported file types and compression formats for ingest¶
The following file types and compression formats are supported at ingest time:
File type | Method | Accepted extensions | Compression formats supported |
---|---|---|---|
CSV | File upload, URL | .csv, .csv.gz | gzip |
NDJSON | File upload, URL, Events API | .ndjson, .ndjson.gz | gzip |
Parquet | File upload, URL | .parquet, .parquet.gz | gzip |
Avro | Kafka | | gzip |
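As an illustration of URL-based ingestion, you can append a remote gzipped CSV to an existing Data Source through the Data Sources API. This is a sketch, not a complete reference: the Data Source name events and the source URL are hypothetical, and $TINYBIRD_TOKEN stands for a token with the required scope:

```shell
# Append a gzipped CSV from a URL using the Data Sources API (sketch)
curl \
  -H "Authorization: Bearer $TINYBIRD_TOKEN" \
  -X POST "https://api.tinybird.co/v0/datasources?name=events&mode=append" \
  -d "url=https://example.com/events.csv.gz"
```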
Quarantine Data Sources¶
Every Data Source you create in your Workspace has an associated quarantine Data Source that stores data that doesn't fit the schema. If you send rows that don't fit the Data Source schema, they're automatically sent to the quarantine table so that the ingest process doesn't fail.
By convention, a quarantine Data Source is named {datasource_name}_quarantine
. You can review quarantined rows at any time or perform operations on them using Pipes. This is a useful source of information when fixing issues in the origin source or applying changes during ingest.
The quarantine Data Source schema contains the columns of the original row plus the following columns with information about the issues that caused the quarantine:
- c__error_column (Array(String)): the columns that contain an invalid value.
- c__error (Array(String)): the errors that caused the ingestion to fail and sent the values to quarantine. Together with c__error_column, this column lets you identify which column has a problem and what the error is.
- c__import_id (Nullable(String)): the job's identifier, if the rows were imported through a job.
- insertion_date (DateTime): the timestamp at which the ingestion was done.
See the Quarantine guide for practical examples on using the quarantine Data Source.
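For example, you can inspect recent quarantined rows and the reasons they failed with a query like the following sketch, where the Data Source name events is hypothetical:

```sql
-- Show the most recent quarantined rows and why they failed
SELECT insertion_date, c__error_column, c__error
FROM events_quarantine
ORDER BY insertion_date DESC
LIMIT 10
```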
Partitioning¶
Use partitions for data manipulation. Partitioning isn't intended to speed up SELECT queries: experiment with more efficient sorting keys (ENGINE_SORTING_KEY) for that. A bad partition key, or creating too many partitions, can negatively impact query performance.
Partitioning is configured using the ENGINE_PARTITION_KEY setting.
When choosing a partition key:
- Leave the ENGINE_PARTITION_KEY key empty. If the table is small or you aren't sure what the best partition key should be, leave it empty: the data is placed in a single partition.
- Use a date column. Depending on the filter, you can opt for more or less granularity based on your needs. toYYYYMM(date_column) or toYear(date_column) are valid default choices.
If you have questions about choosing a partition key, contact Tinybird at [email protected] or in the Community Slack.
Examples¶
The following examples show how to define partitions.
Using an empty tuple to create a single partition
ENGINE_PARTITION_KEY "tuple()"
Using a Date column to create monthly partitions
ENGINE_PARTITION_KEY "toYYYYMM(date_column)"
Using a column to partition by event types
ENGINE_PARTITION_KEY "event_type % 8"
Upserts and deletes¶
See the upserts and deletes guide. Depending on the frequency needed, you might want to convert upserts and deletes into append operations that you can resolve through deduplication.
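As an illustration of the deduplication approach, if each appended row carries a key and a version timestamp, you can resolve upserts at query time by keeping only the latest value per key. This is a sketch; the table and column names (events, user_id, status, updated_at) are hypothetical:

```sql
-- Resolve appended upserts at query time: keep the latest value per key
SELECT
    user_id,
    argMax(status, updated_at) AS status,
    max(updated_at) AS updated_at
FROM events
GROUP BY user_id
```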
Limits¶
There is a limit of 100 Data Sources per Workspace.