Powered by Muchos and Ansible
This project is intended to be used in tandem with Muchos to automate the deployment of DataWave for development and testing purposes on a cluster of arbitrary size.
The project consists primarily of Ansible scripts, which are intended to be run against your cluster after Muchos setup has been completed. Thus, users will first employ Muchos independently to establish DataWave's base dependencies (Hadoop, Accumulo, and ZooKeeper) and to establish the base Ansible inventory required to automate configuration and deployment of DataWave.
Compatibility Notes
Testing/verification has been performed on AWS using the following:

| Muchos Commit | Configuration | DataWave Commit |
|---|---|---|
| 4f1a4ae | muchos.props.example | 2.6.41 |
Prerequisites / Assumptions
- Familiarity with the basics of Ansible is recommended but not required
- Familiarity with the following is assumed:
- Hadoop HDFS and MapReduce
- Accumulo and ZooKeeper
- DataWave
- Muchos (see Muchos documentation for prerequisites)
1. Use Muchos to set up your cluster
When configuring Muchos, keep in mind that you'll be installing DataWave after Muchos setup is complete. So, you'll want to consider the future home for DataWave's ingest and web components when defining the [nodes] section of muchos.props.
If desired, you can have Muchos set up dedicated hosts for these by adding nodes of type client in muchos.props. For example:
...
[nodes]
...
ingest1 = client
webserver1 = client
Muchos will install and configure base dependencies on client nodes, but no service daemons will be activated.
It is not a requirement to have distinct hosts for DataWave's ingest and web services. They may coexist on the same host and/or alongside other cluster services, provided that sufficient resources exist on the target host(s).
In Step 3 below, you'll define the target host(s) for DataWave and integrate them into your existing Ansible inventory.
2. When Muchos setup is complete, ssh to your proxy host and clone this repository. For example:
<me@localhost>$ cd /path/to/fluo-muchos
<me@localhost>$ bin/muchos ssh
...
<cluster_user@leader1>$ git clone https://github.com/NationalSecurityAgency/datawave-muchos.git
Remaining tasks below should be performed on the proxy host as the user denoted by your cluster_user variable.
3. Symlink your Muchos inventory and assign your DataWave-specific hosts in the dw-hosts file
$ cd datawave-muchos/ansible/inventory
# 3.1 - Create symlink to your Muchos hosts file
$ ln -s /home/cluster_user/ansible/conf/hosts muchos-hosts
# 3.2 - Edit the DataWave inventory file as needed
$ vi dw-hosts
...
This allows us to pass the inventory directory itself as an argument to Ansible, e.g., ansible-playbook -i inventory/ ..., which tells Ansible to merge all files present into a single inventory automatically.
At this point, you should have only two files in the directory, muchos-hosts and dw-hosts.
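For illustration, a hypothetical dw-hosts might assign the client nodes from Step 1 to DataWave-specific groups (the group names below are placeholders; use whatever groups your dw-hosts actually defines), and ansible-inventory can confirm that the two files merge cleanly:
$ cat dw-hosts
[ingestmaster]
ingest1
[webservers]
webserver1
# Sanity-check the merged inventory (run from within the inventory directory)
$ ansible-inventory -i . --graph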
4. Configure your all group and datawave group variables
$ cd datawave-muchos/ansible/group_vars
# 4.1 (Required) - Symlink the Muchos 'all' vars file
$ ln -s /home/cluster_user/ansible/group_vars/all all
# 4.2 (Optional) - Set DataWave-specific overrides in the 'datawave' vars file
$ vi datawave
...
- Generally, you'll find variables and their default values defined in ansible/roles/{{ role name }}/defaults/main.yml, so that they can be easily overridden (values assigned there receive the lowest possible precedence in Ansible)
- Most of the variables you'll care about are here: ansible/roles/common/defaults/main.yml
- You may find it convenient to override variables from the command line via Ansible's -e / --extra-vars option, as demonstrated below in post-deployment/force redeploy. (In Ansible, command-line overrides receive the highest possible precedence)
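As a sketch, overrides in the datawave vars file are plain YAML assignments. The variable names below appear elsewhere in this README, but the values shown are purely illustrative:
$ cat datawave
---
# Build/deploy a specific DataWave version (illustrative value)
dw_checkout_version: "2.6.41"
# Don't start ingest services automatically after deployment
dw_start_ingest: False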
5. Lastly, build/deploy DataWave with the datawave.yml playbook
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml
# Or equivalently...
$ scripts/dw-play.sh
- Note: The dw-build role will first git-clone a remote DataWave repository on your proxy host, as configured by the following variables: dw_repo, dw_clone_dir, dw_checkout_version
- Note: To build DataWave's ingest and web tarballs, the proxy host will need a few GB free on the volume containing the local git repo. Additionally, you'll need a few GB free for the local Maven repo. For EC2 clusters, depending on the source AMI and storage configuration, you may need to attach and mount a volume large enough to accommodate these directories, configured via dw_clone_dir and dw_m2_repo_dir respectively
- Note: By default, ingest services should be started up automatically on your ingestmaster host upon successful completion of the datawave.yml playbook. See this issue for instructions to verify that services started successfully
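For instance, a hedged sketch of building from a fork or an alternate branch by overriding those build variables on the command line (the repository URL and branch name shown are illustrative):
$ ansible-playbook -i inventory datawave.yml \
    -e '{ "dw_repo": "https://github.com/my-fork/datawave.git", "dw_checkout_version": "my-branch" }'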
Additional playbooks are provided as a convenience to simplify common post-deployment tasks on your cluster. These are described below. Also note that the datawave.yml playbook imports post-deployment.yml to allow you to run many of these tasks automatically after DataWave has been installed. In general, tasks in post-deployment.yml will be conditionally activated based on the value of one or more boolean variables, which you may override as needed.
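For example, a sketch of activating one such conditional post-deployment task at deploy time, using a boolean referenced later in this README:
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_ingest_tvmaze": true }'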
If dw_install_web_client was set to True (default), then a simple, curl-based query client for DataWave will have been installed and configured on your proxy host. The client will simplify your interaction with the DataWave Query API by...
- automatically configuring test PKI materials and associated curl parameters
- setting reasonable defaults for DataWave-specific parameters
- automatically pretty-printing web service responses based on their content type
- automatically closing queries when response code 204 is returned (no results found)
- etc
For example:
$ which datawave || source ~/.bashrc
...
$ datawave query --expression "PAGE_TITLE:AccessibleComputing" --show-meta
{
"Events": [
{
...
}
],
...
"ReturnedEvents": 1
}
Query ID: 51082ed4-b579-45b8-879f-3afdb10e6ec3
Time: 0.271 Response Code: 200 Response Type: application/json
$ datawave query --next 51082ed4-b579-45b8-879f-3afdb10e6ec3 --show-meta
Time: 0.093 Response Code: 204 Response Type: N/A
[DW-INFO] - End of result set, as indicated by 204 response. Closing query automatically
...
- Query Client options:
$ datawave query --help
- Other options:
$ datawave --help
- More info: view the client script
Generally speaking, all Ansible tasks here are designed to be idempotent operations on your cluster. Thus, executing the datawave.yml playbook multiple times should result in the same cluster state. However, you may want to change that behavior at times by overriding certain default variables.
For example, you may want to rebuild DataWave and redeploy updated versions of ingest and query services:
# Force rebuild/redeploy
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }'
# Or equivalently...
$ scripts/dw-play-redeploy.sh
Upon redeploy...
- Previously ingested data in Accumulo is always preserved.
- Any manual, in-place modifications made to deployed services will likely be lost.
- Prior to redeploy, graceful shutdown of DataWave services is attempted.
For additional flexibility, the datawave.yml playbook makes use of Ansible tags, so specific tasks can be whitelisted/blacklisted via the --tags and --skip-tags options, respectively.
For example:
# Force a redeploy of DataWave without rebuilding the source code
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }' --skip-tags build
# Or equivalently...
$ scripts/dw-play-redeploy.sh --skip-tags build
# View all tasks and their associated tags for the entire playbook
$ ansible-playbook datawave.yml --list-tasks
# Or equivalently...
$ scripts/dw-play.sh --list-tasks
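Conversely, tags can be whitelisted. Since build appears above as a known tag, the following sketch should run only the build-related tasks (use --list-tasks first to confirm which tags exist in your checkout):
# Run only tasks tagged 'build'
$ ansible-playbook -i inventory datawave.yml --tags build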
To start or stop DataWave's ingest services:
$ cd datawave-muchos/ansible
# Start (this is already a post-deployment task, as dw_start_ingest is set to True by default)
$ ansible-playbook -i inventory start-ingest.yml
# Stop
$ ansible-playbook -i inventory stop-ingest.yml
- See also scripts/dw-services-start.sh and scripts/dw-services-stop.sh
- Note: See this issue for instructions to verify that ingest services were started successfully by the start-ingest.yml playbook; a rough ad-hoc check is also sketched below
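As a rough sanity check (not a substitute for the verification steps referenced above), you can run an Ansible ad-hoc command against your ingestmaster host; the grep pattern below is only an assumption about the ingest process names:
$ ansible -i inventory ingestmaster -m shell -a 'ps -ef | grep -i [i]ngest'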
To start or stop DataWave's web services:
$ cd datawave-muchos/ansible
# Start (can be automated as a post-deployment task, if dw_start_web == True)
$ ansible-playbook -i inventory start-web.yml
# Stop
$ ansible-playbook -i inventory stop-web.yml
See also scripts/dw-services-start.sh and scripts/dw-services-stop.sh
TVMAZE Dataset (http://www.tvmaze.com/api)
To download/ingest a small subset of TVMAZE show and cast member data:
# Note: this can also be automated as a post-deployment task, if dw_ingest_tvmaze == True
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory tvmaze-ingest.yml
To download and ingest all TV shows and associated cast info:
$ cd scripts
$ ./tvmaze-ingest.sh
- Script options:
./tvmaze-ingest.sh -h
- More info: ansible/roles/tvmaze/README
Wikipedia Dataset (https://dumps.wikimedia.org/enwiki/)
To download a Wikipedia XML data dump and ingest a small subset (~100,000 pages) of its entries:
# Note: this can also be automated as a post-deployment task, if dw_ingest_wikipedia == True
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory wikipedia-ingest.yml
# Or equivalently...
$ scripts/wikipedia-ingest.sh
- If desired, the entire XML dump may be ingested by tweaking the Ansible variable wiki_max_streams_to_extract, subject to the storage limitations of your cluster; a sketch follows below
- More info: ansible/roles/wikipedia/README
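For example, a sketch of raising that limit at runtime (the value shown is illustrative; larger values require proportionally more storage):
$ ansible-playbook -i inventory wikipedia-ingest.yml -e '{ "wiki_max_streams_to_extract": 50 }'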