Sadayuki Furuhashi
Scripting Embulk
Plugins
Embulk & Digdag Online Meetup 2020
Sadayuki Furuhashi
An open-source hacker.
A founder of Treasure Data, Inc., located in Silicon Valley.
GitHub: @frsyuki
OSS I designed:
Data integration is key for data-driven business.
AI/ML Analytics
Databases
Data is moving to the cloud & SaaS
SaaS
SaaS data integration is becoming more common
SaaS
Who develops new SaaS integrations?
Java developers
Low code
Scripting with SDKs
Scripting
Embulk plugin API
SaaS users
Dev vs. user gap!
Who develops new SaaS integrations?
Java developers
Low code
Scripting with SDKs
Scripting
Embulk plugin API
Embulk scripting
SaaS users
Dev = user
Scripting on the powerful framework
Embulk scripting plugin
Embulk core framework
Your script
SDK / library
✓ High performance
✓ Choice of output plugins
Embulk plugins
How does it work?
1. Run a script
2. Load rows
3. Write rows as a CSV file
4. Read the CSV file
named pipe
Embulk scripting plugin
Your script
SDK / library
A named pipe is like a file, but it is not a file.
• It doesn’t consume disk space.
• It doesn’t cause disk I/O (= fast).
• It transfers data as your script writes rows (= fast).
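The named-pipe behavior described above can be tried out with a short Ruby sketch. This is only an illustration under stated assumptions, not the plugin's actual implementation: the writer thread stands in for "your script", the reader stands in for the Embulk scripting plugin, and the helper name `stream_rows_through_fifo` and the sample rows are made up for the example. It relies on `File.mkfifo`, so it runs on Unix-like systems only.

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: pushes rows through a named pipe (FIFO) and
# returns what the reading side received.
def stream_rows_through_fifo(rows)
  Dir.mktmpdir do |dir|
    fifo_path = File.join(dir, "output.csv")
    File.mkfifo(fifo_path) # creates a FIFO, not a regular file

    # Writer side ("your script"): writes CSV rows into the pipe.
    writer = Thread.new do
      CSV.open(fifo_path, "w") do |csv|
        rows.each { |row| csv << row }
      end
    end

    # Reader side (the plugin's role): reads rows until the writer closes.
    received = CSV.read(fifo_path)
    writer.join
    received
  end
end

# Made-up sample rows.
rows = [["web-1", "running"], ["web-2", "stopped"]]
puts stream_rows_through_fifo(rows).inspect
```

The reader gets the rows as the writer produces them, and nothing is stored in a regular file on disk along the way.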
How does it work?
1. Run a script
2. Load rows
3. Write rows as a CSV file
4. Read the CSV file
5. Pass rows to an output plugin
named pipe
Embulk scripting plugin
Your script
SDK / library
output plugin
How to use embulk-input-script
1. Install
  $ embulk gem install embulk-input-script
2. Create a config
  in:
    type: script
    run: ruby your_script.rb  # any executable
  out:
    type: …
3. Run
  $ embulk run config.yaml
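How to develop a script: your script runs 3 times
require "csv"

if ARGV[0] == "setup"
  # Write the setup file (column names, column types, parallelism).
  File.write(ARGV[2], "…")
elsif ARGV[0] == "run"
  # Write the loaded rows to the CSV path given as the third argument.
  CSV.open(ARGV[2], "w") do |file|
    file << row
    …
  end
elsif ARGV[0] == "finish"
  # Clean up if necessary.
  puts "Done!"
end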
$ script.rb setup <config.yaml> <setup.yaml>
$ script.rb run <setup.yaml> <N> <output.csv>
$ script.rb finish <setup.yaml>
First, write a setup file. It should include the column names, column types, and parallelism.
Second, load rows and write them to a CSV file.
If the setup file sets parallelism greater than 1, this step runs multiple times with N = 0, 1, 2, 3, …
Finally, clean up if necessary.
$ script.rb setup <config.yaml> <setup.yaml>
$ script.rb run <setup.yaml> <output.csv> <N>
$ script.rb finish <config.yaml> <setup.yaml>
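The three phases can be put together in a slightly fuller sketch. Treat it as an illustration under assumptions, not the documented contract of embulk-input-script: the setup-file key names (`columns`, `parallelism`), the sample rows, and the `handle` helper are all hypothetical, and the argument order for `run` follows the one the slide's code uses (output path third, N last). Splitting rows by index modulo parallelism is just one possible way to divide work between tasks.

```ruby
require "csv"

# Hypothetical setup-file contents: these key names are assumptions,
# not the documented embulk-input-script format.
SETUP_YAML = <<~YAML
  columns:
    - {name: host, type: string}
    - {name: cpu, type: double}
  parallelism: 2
YAML
PARALLELISM = 2

# Made-up rows standing in for data loaded through an SDK.
ROWS = [["web-1", 0.5], ["web-2", 0.7], ["db-1", 0.9]].freeze

def handle(args)
  case args[0]
  when "setup"
    # setup <config.yaml> <setup.yaml>: write the setup file.
    File.write(args[2], SETUP_YAML)
  when "run"
    # run <setup.yaml> <output.csv> <N>: write this task's share of rows.
    task_index = Integer(args[3] || "0")
    CSV.open(args[2], "w") do |csv|
      ROWS.each_with_index do |row, i|
        csv << row if i % PARALLELISM == task_index
      end
    end
  when "finish"
    # finish: nothing to clean up in this sketch.
    puts "Done!"
  end
end

handle(ARGV) unless ARGV.empty?
```

With parallelism set to 2, the run phase would be invoked once with N=0 and once with N=1, and each invocation writes only its own subset of rows.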
Examples
• Importing server status from DataDog

https://github.com/embulk/embulk-input-script/tree/master/examples/datadog_hosts
• Importing AWS EC2 server list

https://github.com/embulk/embulk-input-script/tree/master/examples/aws_ec2_instances
Wanted
• Output support

embulk-output-script is not available.
• Converter from a script to an Embulk plugin gem

When you create a script, you want to release it so that other people can reuse it.

To do that, we need a tool that packages the script together with embulk-input-script as a gem.
