
[Feature] [VDP] [Pipeline] Data movement tools #1023

Open
@chuang8511

Description

Is There an Existing Issue for This?

  • I have searched the existing issues

Where do you intend to apply this feature?

Instill Core, Instill Cloud

Is your Proposal Related to a Problem?

Background

When a company has multiple data sources, its data engineers need to migrate data from one source to another.

Because the data is scattered across applications such as Gmail / Slack / …, it is time-consuming for a company to write separate tools to collect it from each application.

Describe Your Proposed Solution

User stories

Story 1

  • As a data engineer, he/she wants to transform raw data into analysable data and migrate it to another data source.

Possible pipelines
[image: possible pipelines]

Concrete examples
[image: concrete examples]

e.g. transaction data is not analysable, but weekly transaction amount & transaction count are.
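To make the example concrete, here is a minimal sketch of the kind of transformation Story 1 describes: aggregating raw transactions into weekly amount and count. The record shapes and the `weekly_summary` helper are hypothetical illustrations, not part of any existing VDP component.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transaction rows, as they might arrive from an RDBMS source.
transactions = [
    {"date": date(2024, 1, 1), "amount": 120.0},
    {"date": date(2024, 1, 3), "amount": 80.0},
    {"date": date(2024, 1, 8), "amount": 50.0},
]

def weekly_summary(rows):
    """Aggregate raw transactions into analysable weekly amount and count."""
    summary = defaultdict(lambda: {"amount": 0.0, "count": 0})
    for row in rows:
        # Group by ISO year and ISO week number.
        iso_year, iso_week, _ = row["date"].isocalendar()
        key = (iso_year, iso_week)
        summary[key]["amount"] += row["amount"]
        summary[key]["count"] += 1
    return dict(summary)

print(weekly_summary(transactions))
# {(2024, 1): {'amount': 200.0, 'count': 2}, (2024, 2): {'amount': 50.0, 'count': 1}}
```

In a real pipeline this aggregation step would sit between a data-source component and a data-destination component.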

Story 2

As a data engineer, he/she wants to transform unstructured data into analysable data and load it into another data source.

Possible pipelines
[image: possible pipelines]

Concrete example
[image: concrete example]
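As an illustration of Story 2, the sketch below extracts structured rows from free-text messages so they can be loaded into a destination table. The message format, regex, and `extract_records` helper are all invented for this example and do not reflect any existing component.

```python
import re

# Hypothetical unstructured messages, e.g. collected from an email or chat source.
raw_messages = [
    "Order #1042 shipped to Alice on 2024-05-01",
    "Order #1043 shipped to Bob on 2024-05-02",
]

# Pattern assumed for this illustration only.
PATTERN = re.compile(r"Order #(\d+) shipped to (\w+) on (\d{4}-\d{2}-\d{2})")

def extract_records(messages):
    """Turn free-text messages into structured rows ready to load elsewhere."""
    rows = []
    for msg in messages:
        match = PATTERN.search(msg)
        if match:
            rows.append({
                "order_id": int(match.group(1)),
                "customer": match.group(2),
                "shipped_on": match.group(3),
            })
    return rows

print(extract_records(raw_messages))
```

In practice this parsing step could be one TASK in a pipeline whose output feeds a data-destination component.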

Highlight the Benefits

It solves a real-world problem: data engineers can collect, transform, and move scattered company data without writing one-off tools for each application.

Anything Else?

Possible components

  • Note: the order of the lists below reflects priority.

Data components

RDBMS

  • AWS
    • RDS
  • GCP
    • Cloud SQL / BigQuery
  • Postgres
  • MySQL
  • MSSQL
  • Oracle DB

NoSQL

  • AWS
    • NoSQL (DynamoDB / MongoDB)
  • GCP
    • Datastore
  • MongoDB
  • Elasticsearch
  • Cassandra

Vector DB

  • Weaviate
  • Qdrant
  • Chroma
  • Zilliz
  • Milvus

Others

  • AWS
    • S3
  • GCP
    • Google Cloud Storage
  • AWS Datalake
  • Google Sheet

Application components

  • Components for Discord / X / Slack / … are expected to be built from other tools. However, you may need to build a specific TASK for an application component according to your usage.
  • Please notify us in Slack if you have a concrete idea for a specific application component you want to build; we can discuss the details there.

Reference tools

  • Airbyte
    • Data source -> Data destination
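The "data source -> data destination" pattern referenced above can be sketched as a pair of small interfaces plus a generic move function. This is a hypothetical illustration of the pattern, not the API of Airbyte or of any VDP component.

```python
from typing import Iterable, Protocol

class Source(Protocol):
    """Anything that can yield records, e.g. an RDBMS or application component."""
    def read(self) -> Iterable[dict]: ...

class Destination(Protocol):
    """Anything that can accept records, e.g. a warehouse or vector DB component."""
    def write(self, records: Iterable[dict]) -> None: ...

class ListSource:
    """Toy in-memory source for demonstration."""
    def __init__(self, rows):
        self.rows = rows
    def read(self):
        return iter(self.rows)

class ListDestination:
    """Toy in-memory destination for demonstration."""
    def __init__(self):
        self.rows = []
    def write(self, records):
        self.rows.extend(records)

def move(source: Source, destination: Destination) -> None:
    """Move every record from a source to a destination."""
    destination.write(source.read())

src = ListSource([{"id": 1}, {"id": 2}])
dst = ListDestination()
move(src, dst)
print(dst.rows)
```

Each concrete data component in the lists above would then be one more `Source`/`Destination` implementation, and transformation steps can be composed between the two.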

Milestones

  1. Read the current pipelines.
  2. Design the pipeline according to the user stories.
  • Please draw the concrete pipelines first and ask us for review before delving into development.
  • Timeline: 5 working days
  3. Check which components are missing according to the designed pipeline.
  • Please create the skeleton PR first for the incoming components.
  • Timeline: 2~3 working days
  4. Connect those components.
  • Timeline: 10 working days
  5. Build the designed pipeline after you connect those components.
  • Timeline: 1 working day

Note

  • Regarding the timeline, let's adjust it dynamically if issues turn out to be much more complicated than we think.
  • Milestones 2~5 form a cycle. Let's finish one whole, complete user story first and then iterate.


Metadata

Assignees: no one assigned

Labels: feature (New feature or request), need-triage (Need to be investigated further)
