Powered by Muchos and Ansible
This project is intended to be used in tandem with Muchos to automate the deployment of DataWave for development and testing purposes on a cluster of arbitrary size.
The project consists primarily of Ansible scripts, which are intended to be run against your cluster after Muchos setup has been completed. Thus, users will first employ Muchos independently to establish DataWave's base dependencies (Hadoop, Accumulo, and ZooKeeper) and to establish the base Ansible inventory required to automate configuration and deployment of DataWave.
Compatibility Notes
Testing/verification has been performed on AWS using the following:

| Muchos Commit | Configuration | DataWave Commit |
|---|---|---|
| 4f1a4ae | muchos.props.example | 2.6.41 |
Prerequisites / Assumptions
- Familiarity with the basics of Ansible is recommended but not required
- Familiarity with the following is assumed:
- Hadoop HDFS and MapReduce
- Accumulo and ZooKeeper
- DataWave
- Muchos (see Muchos documentation for prerequisites)
1. Use Muchos to set up your cluster
When configuring Muchos, keep in mind that you'll be installing DataWave after Muchos setup is complete. So, you'll want to consider the future home for DataWave's ingest and web components when defining the [nodes] section of muchos.props.
If desired, you can have Muchos set up dedicated hosts for these by adding nodes of type client in muchos.props. For example:
...
[nodes]
...
ingest1 = client
webserver1 = client
Muchos will install and configure base dependencies on client nodes, but no service daemons will be activated.
It is not a requirement to have distinct hosts for DataWave's ingest and web services. They may coexist on the same host and/or alongside other cluster services, provided that sufficient resources exist on the target host(s).
In Step 3 below, you'll define the target host(s) for DataWave and integrate them into your existing Ansible inventory.
2. When Muchos setup is complete, ssh to your proxy host and clone this repository. For example:
<me@localhost>$ cd /path/to/fluo-muchos
<me@localhost>$ bin/muchos ssh
...
<cluster_user@leader1>$ git clone https://github.com/NationalSecurityAgency/datawave-muchos.git
Remaining tasks below should be performed on the proxy host as the user denoted by your cluster_user variable.
3. Symlink your Muchos inventory and assign your DataWave-specific hosts in the dw-hosts file
$ cd datawave-muchos/ansible/inventory
# 3.1 - Create symlink to your Muchos hosts file
$ ln -s /home/cluster_user/ansible/conf/hosts muchos-hosts
# 3.2 - Edit the DataWave inventory file as needed
$ vi dw-hosts
...
This allows us to pass the inventory directory itself as an argument to Ansible, e.g., ansible-playbook -i inventory/ ..., which tells Ansible to merge all files present into a single inventory automatically.
At this point, you should have only two files in the directory, muchos-hosts and dw-hosts.
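For illustration, a hypothetical dw-hosts might assign the client nodes from Step 1 to DataWave-specific groups (the group names below are placeholders; use whatever groups your dw-hosts actually defines), and ansible-inventory can confirm that the two files merge cleanly:
$ cat dw-hosts
[ingestmaster]
ingest1
[webservers]
webserver1
# Sanity-check the merged inventory (run from within the inventory directory)
$ ansible-inventory -i . --graph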
4. Configure your all group and datawave group variables
$ cd datawave-muchos/ansible/group_vars
# 4.1 (Required) - Symlink the Muchos 'all' vars file
$ ln -s /home/cluster_user/ansible/group_vars/all all
# 4.2 (Optional) - Set DataWave-specific overrides in the 'datawave' vars file
$ vi datawave
...
- Generally, you'll find variables and their default values defined in ansible/roles/{{ role name }}/defaults/main.yml, so that they can be easily overridden (values assigned there receive the lowest possible precedence in Ansible)
- Most of the variables you'll care about are here: ansible/roles/common/defaults/main.yml
- You may find it convenient to override variables from the command line via Ansible's -e / --extra-vars option, as demonstrated below in post-deployment/force redeploy. (In Ansible, command-line overrides receive the highest possible precedence)
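As a sketch, overrides in the datawave vars file are plain YAML assignments. The variable names below appear elsewhere in this README, but the values shown are purely illustrative:
$ cat datawave
---
# Build/deploy a specific DataWave version (illustrative value)
dw_checkout_version: "2.6.41"
# Don't start ingest services automatically after deployment
dw_start_ingest: False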
5. Lastly, build/deploy DataWave with the datawave.yml playbook
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml
# Or equivalently...
$ scripts/dw-play.sh
- Note: The dw-build role will first git-clone a remote DataWave repository on your proxy host, as configured by the following variables: dw_repo, dw_clone_dir, dw_checkout_version
- Note: To build DataWave's ingest and web tarballs, the proxy host will need a few GB free on the volume containing the local git repo. Additionally, you'll need a few GB free for the local Maven repo. For EC2 clusters, depending on the source AMI and storage configuration, you may need to attach and mount a volume large enough to accommodate these directories, configured via dw_clone_dir and dw_m2_repo_dir respectively
- Note: By default, ingest services should be started up automatically on your ingestmaster host upon successful completion of the datawave.yml playbook. See this issue for instructions to verify that services started successfully
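For instance, a hedged sketch of building from a fork or an alternate branch by overriding those build variables on the command line (the repository URL and branch name shown are illustrative):
$ ansible-playbook -i inventory datawave.yml \
    -e '{ "dw_repo": "https://github.com/my-fork/datawave.git", "dw_checkout_version": "my-branch" }'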
Additional playbooks are provided as a convenience to simplify common post-deployment tasks on your cluster. These are described below. Also note that the datawave.yml playbook imports post-deployment.yml to allow you to run many of these tasks automatically after DataWave has been installed. In general, tasks in post-deployment.yml will be conditionally activated based on the value of one or more boolean variables, which you may override as needed.
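For example, a sketch of activating one such conditional post-deployment task at deploy time, using a boolean referenced later in this README:
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_ingest_tvmaze": true }'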
If dw_install_web_client was set to True (default), then a simple, curl-based query client for DataWave will have been installed and configured on your proxy host. The client will simplify your interaction with the DataWave Query API by...
- automatically configuring test PKI materials and associated curl parameters
- setting reasonable defaults for DataWave-specific parameters
- automatically pretty-printing web service responses based on their content type
- automatically closing queries when response code 204 is returned (no results found)
- etc
For example:
$ which datawave || source ~/.bashrc
...
$ datawave query --expression "PAGE_TITLE:AccessibleComputing" --show-meta
{
"Events": [
{
...
}
],
...
"ReturnedEvents": 1
}
Query ID: 51082ed4-b579-45b8-879f-3afdb10e6ec3
Time: 0.271 Response Code: 200 Response Type: application/json
$ datawave query --next 51082ed4-b579-45b8-879f-3afdb10e6ec3 --show-meta
Time: 0.093 Response Code: 204 Response Type: N/A
[DW-INFO] - End of result set, as indicated by 204 response. Closing query automatically
...
- Query Client options:
$ datawave query --help
- Other options:
$ datawave --help
- More info: view the client script
Generally speaking, all Ansible tasks here are designed to be idempotent operations on your cluster. Thus, executing the datawave.yml playbook multiple times should result in the same cluster state. However, you may want to change that behavior at times by overriding certain default variables.
For example, you may want to rebuild DataWave and redeploy updated versions of ingest and query services:
# Force rebuild/redeploy
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }'
# Or equivalently...
$ scripts/dw-play-redeploy.sh
Upon redeploy...
- Previously ingested data in Accumulo is always preserved.
- Any manual, in-place modifications made to deployed services will likely be lost.
- Prior to redeploy, graceful shutdown of DataWave services is attempted.
For additional flexibility, the datawave.yml playbook makes use of Ansible tags, so specific tasks can be whitelisted/blacklisted via the --tags and --skip-tags options, respectively.
For example:
# Force a redeploy of DataWave without rebuilding the source code
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }' --skip-tags build
# Or equivalently...
$ scripts/dw-play-redeploy.sh --skip-tags build
# View all tasks and their associated tags for the entire playbook
$ ansible-playbook datawave.yml --list-tasks
# Or equivalently...
$ scripts/dw-play.sh --list-tasks
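Conversely, tags can be whitelisted. Since build appears above as a known tag, the following sketch should run only the build-related tasks (use --list-tasks first to confirm which tags exist in your checkout):
# Run only tasks tagged 'build'
$ ansible-playbook -i inventory datawave.yml --tags build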
To start or stop DataWave's ingest services:
$ cd datawave-muchos/ansible
# Start (this is already a post-deployment task, as dw_start_ingest is set to True by default)
$ ansible-playbook -i inventory start-ingest.yml
# Stop
$ ansible-playbook -i inventory stop-ingest.yml
- See also scripts/dw-services-start.sh and scripts/dw-services-stop.sh
- Note: See this issue for instructions to verify that ingest services were started successfully by the start-ingest.yml playbook; a rough ad-hoc check is also sketched below
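As a rough sanity check (not a substitute for the verification steps referenced above), you can run an Ansible ad-hoc command against your ingestmaster host; the grep pattern below is only an assumption about the ingest process names:
$ ansible -i inventory ingestmaster -m shell -a 'ps -ef | grep -i [i]ngest'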
To start or stop DataWave's web services:
$ cd datawave-muchos/ansible
# Start (can be automated as a post-deployment task, if dw_start_web == True)
$ ansible-playbook -i inventory start-web.yml
# Stop
$ ansible-playbook -i inventory stop-web.yml
See also scripts/dw-services-start.sh and scripts/dw-services-stop.sh
TVMAZE Dataset (http://www.tvmaze.com/api)
To download/ingest a small subset of TVMAZE show and cast member data:
# Note: this can also be automated as a post-deployment task, if dw_ingest_tvmaze == True
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory tvmaze-ingest.yml
To download and ingest all TV shows and associated cast info:
$ cd scripts
$ ./tvmaze-ingest.sh
- Script options:
./tvmaze-ingest.sh -h
- More info: ansible/roles/tvmaze/README
Wikipedia Dataset (https://dumps.wikimedia.org/enwiki/)
To download a Wikipedia XML data dump and ingest a small subset (~100,000 pages) of its entries:
# Note: this can also be automated as a post-deployment task, if dw_ingest_wikipedia == True
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory wikipedia-ingest.yml
# Or equivalently...
$ scripts/wikipedia-ingest.sh
- If desired, the entire XML dump may be ingested by tweaking the Ansible variable wiki_max_streams_to_extract, subject to the storage limitations of your cluster; a sketch follows below
- More info: ansible/roles/wikipedia/README
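For example, a sketch of raising that limit at runtime (the value shown is illustrative; larger values require proportionally more storage):
$ ansible-playbook -i inventory wikipedia-ingest.yml -e '{ "wiki_max_streams_to_extract": 50 }'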