Turn files into a SQL-queryable data source.
Allows cloud storage (Google Storage, S3, etc.) files (CSV, JSON, custom protobuf) to be queried with traditional SQL. Also supports custom serialization, compression, and encryption for these files.
Design Goals
- Hackable Stores: easy to add new stores (Google Storage, local files, S3, etc.).
- Hackable File Formats: Protobuf files, WAL files, MySQL binlogs, etc.
Developing new stores or file formats
- FileStore defines file storage (S3, Google Storage, local files, SFTP, etc.).
- StoreReader defines a reader (and writer) for a file store: finding lists of files and opening them.
- FileHandler defines a registry for creating handlers that convert a file from a StoreReader into a FileScanner that iterates the rows of that file. It also extracts information from the file path; folders often serve as "columns" or "tables" in the virtual table.
- FileScanner handles file row reading: how to transform the contents of a file into a qlbridge.Message for use in the query engine. CSV and JSON types are currently implemented.
Similar To
Similar to the federated BigQuery engine, which can query Avro and CSV files (or BigTable), or to prestodb, but with a focus on hacking in your own sources, file formats, encryption, etc.
We are going to create a CSV database of baseball data from http://seanlahman.com/baseball-archive/statistics/
# download files to local /tmp
mkdir -p /tmp/baseball
cd /tmp/baseball
curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
unzip bball.zip
mv baseball*/core/*.csv .
rm bball.zip
rm -rf baseballdatabank-*
# run a docker container locally
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
-v /tmp/baseball:/tmp/baseball \
gcr.io/dataux-io/dataux:latest
In another console, open MySQL:
# connect to the docker container you just started
mysql -h 127.0.0.1 -P4000
-- Now create a new Source
CREATE source baseball WITH {
"type":"cloudstore",
"schema":"baseball",
"settings" : {
"type": "localfs",
"format": "csv",
"path": "baseball/",
"localpath": "/tmp"
}
};
show databases;
use baseball;
show tables;
describe appearances;
select count(*) from appearances;
select * from appearances limit 10;
This is similar to the example above, but uses Google Cloud Storage instead of local files. If you are running inside Google Cloud, read performance from Cloud Storage is excellent, far better than you might expect.
# create a google-cloud-storage bucket
gsutil mb gs://my-baseball-bucket
# sync files to cloud
gsutil rsync -d -r /tmp/baseball gs://my-baseball-bucket/
# run a docker container locally, using your local google cloud credentials
docker run -e "GOOGLE_APPLICATION_CREDENTIALS=/.config/gcloud/application_default_credentials.json" \
-e "LOGGING=debug" \
--rm -it \
-p 4000:4000 \
-v ~/.config/gcloud:/.config/gcloud \
gcr.io/dataux-io/dataux:latest
# connect to the docker container you just started
mysql -h 127.0.0.1 -P4000
-- Now create a new Source
CREATE source gcsbball2 WITH {
"type":"cloudstore",
"schema":"gcsbball",
"settings" : {
"type": "gcs",
"project": "your-google-project",
"bucket": "my-baseball-bucket",
"format": "csv"
}
};
show databases;
use gcsbball;
show tables;
describe appearances;
select count(*) from appearances;
select * from appearances limit 10;