Commit 2281191: Update Doc

Aaron Raddon committed Nov 11, 2017 (1 parent: 2e78073)

Showing 3 changed files with 137 additions and 35 deletions.
80 changes: 61 additions & 19 deletions README.md


## Try it Out
These examples will:
1. Create a CSV `database` of Baseball data from http://seanlahman.com/baseball-archive/statistics/
2. Connect to Google BigQuery public datasets (you will need a Google Cloud project, but the free quota will likely keep it free).



```sh
# download files to local /tmp
mkdir -p /tmp/baseball
cd /tmp/baseball
curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
unzip bball.zip

mv baseball*/core/*.csv .
rm bball.zip
rm -rf baseballdatabank-*

# run a docker container locally
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
-v /tmp/baseball:/tmp/baseball \
gcr.io/dataux-io/dataux:latest


```
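The container only sees what you mount: the host path on the left of `-v` can live anywhere, while the container path on the right must match the `localpath` + `path` settings in the `CREATE source` statement. A sketch with a placeholder host directory (`/data/baseball` is hypothetical):

```sh
# keep the data somewhere other than /tmp on the host if you like;
# /data/baseball is a placeholder for your own directory, and the
# right-hand side is the path dataux will see inside the container
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
   -v /data/baseball:/tmp/baseball \
   gcr.io/dataux-io/dataux:latest
```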
In another console, connect with a MySQL client:

```sh
# connect to the docker container you just started
mysql -h 127.0.0.1 -P4000
```

Then, at the MySQL prompt:

```sql
-- Now create a new Source
CREATE source baseball WITH {
"type":"cloudstore",
"schema":"baseball",
"settings" : {
"type": "localfs",
"format": "csv",
"path": "baseball/",
"localpath": "/tmp"
}
};

show databases;

use baseball;

show tables;

describe appearances;

select count(*) from appearances;

select * from appearances limit 10;


```
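Where does the schema come from? With a `localfs` CSV source, each file becomes a table named after the file, and the header row supplies the columns (an assumption consistent with the walkthrough above, where `appearances.csv` is queried as the `appearances` table). A self-contained sketch using a tiny stand-in file, so it runs without the full download:

```shell
#!/bin/sh
# create a miniature stand-in for one of the baseball CSVs
mkdir -p /tmp/baseball-demo
printf 'playerID,yearID,teamID\naardsda01,2004,SFN\n' \
  > /tmp/baseball-demo/appearances.csv

# the header row is what becomes the table's columns
echo "columns:"
head -1 /tmp/baseball-demo/appearances.csv

# the remaining rows are what `select count(*)` would report
echo "data rows:"
tail -n +2 /tmp/baseball-demo/appearances.csv | wc -l
```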

Big Query Example
------------------------------

```sh
# assuming you are running locally; if you are instead in Google Cloud,
# or Google Container Engine ...
```

```sql
select * from film_locations limit 10;
```

Roadmap(ish)
------------------------------
* Writes
  * write pub/sub: inbound inserts/updates are published as pub-sub messages.
  * write propagation: inbound inserts/updates are written to multiple destinations.
* Data Sources: improve the existing sources.




**Hacking**

For now, the goal is to allow this to be used as a library, so the
`vendor` directory is not checked in; use Docker containers or `dep` for now.

* see **tools/importgithub** for a tool that imports 2 days of github archive data.

```sh
# run dep ensure
dep ensure -update -v


cd $GOPATH/src/github.com/dataux/dataux/tools/importgithub
go build
./importgithub  # imports ~200k docs from the github archive

```

Related Projects, Database Proxies & Multi-Data QL
-------------------------------------------------------
* ***Data Accessibility*** Making it easier to query, access, share, and use data. Protocol shifting (for accessibility). Sharing/replication between db types.
* ***Scalability/Sharding*** Implement sharding and connection sharing.
90 changes: 75 additions & 15 deletions backends/files/README.md



Query CSV Files
--------------------

We are going to create a CSV `database` of Baseball data from
http://seanlahman.com/baseball-archive/statistics/

```sh
# download files to local /tmp
mkdir -p /tmp/baseball
cd /tmp/baseball
curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
unzip bball.zip

mv baseball*/core/*.csv .

rm bball.zip
rm -rf baseballdatabank-*

# run a docker container locally
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
-v /tmp/baseball:/tmp/baseball \
gcr.io/dataux-io/dataux:latest


```
In another console, connect with a MySQL client:

```sh
# connect to the docker container you just started
mysql -h 127.0.0.1 -P4000
```

Then, at the MySQL prompt:

```sql
-- Now create a new Source
CREATE source baseball WITH {
"type":"cloudstore",
"schema":"baseball",
"settings" : {
"type": "localfs",
"format": "csv",
"path": "baseball/",
"localpath": "/tmp"
}
};

show databases;

use baseball;

show tables;

describe appearances;

select count(*) from appearances;

select * from appearances limit 10;


```
Query CSV Files on Google Cloud Storage
----------------------------------------------

This is similar to the example above, but uses Google Cloud Storage instead
of local files. If you are running inside Google Cloud, read performance
from Cloud Storage is much faster than you might expect.
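Before creating the bucket and paying for a sync, a quick pre-flight check that the CSVs from the earlier download steps are still staged locally (a minimal sketch; the directory matches the walkthrough above):

```shell
#!/bin/sh
# pre-flight: confirm there is something to rsync to the bucket
mkdir -p /tmp/baseball
csv_count=$(ls /tmp/baseball/*.csv 2>/dev/null | wc -l)
echo "csv files staged: $csv_count"
# warn (but do not fail) when the directory is empty
[ "$csv_count" -gt 0 ] || echo "nothing to sync; re-run the download steps" >&2
```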

```sh
# create a google-cloud-storage bucket
gsutil mb gs://my-baseball-bucket
# sync files to cloud
gsutil rsync -d -r /tmp/baseball gs://my-baseball-bucket/
# run a docker container locally, using your local google cloud credentials
docker run -e "GOOGLE_APPLICATION_CREDENTIALS=/.config/gcloud/application_default_credentials.json" \
-e "LOGGING=debug" \
--rm -it \
  ...
```

```sql
CREATE source gcsbball2 WITH {
"schema":"gcsbball",
"settings" : {
"type": "gcs",
"bucket": "my-dataux-bucket",
"format": "csv",
"jwt": "/.config/gcloud/application_default_credentials.json"
"project": "your-google-project",
"bucket": "my-baseball-bucket",
"format": "csv"
}
};

show databases;

use baseball;

show tables;

describe appearances;

select count(*) from appearances;

select * from appearances limit 10;

```
2 changes: 1 addition & 1 deletion backends/files/filesource_test.go
In `TestSelectFilesList`, the expected row count changes from 3 to 2:

```go
	validateQuerySpec(t, tu.QuerySpec{
		Sql:         "select file, `table`, size, partition from localfiles_files",
		ExpectRowCt: 2, // was 3
		ValidateRowData: func() {
			u.Infof("%v", data)
			// assert.True(t, data.Deleted == false, "Not deleted? %v", data)
		},
		// ...
```
