Skip to content

Commit

Permalink
[SPARK-9570] [DOCS] Consistent recommendation for submitting spark ap…
Browse files Browse the repository at this point in the history
…ps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.

Recommend `--master yarn --deploy-mode {cluster,client}` consistently in docs.
Follow-on to apache#8385
CC nssalian

Author: Sean Owen <[email protected]>

Closes apache#8968 from srowen/SPARK-9570.
  • Loading branch information
srowen committed Oct 4, 2015
1 parent 721e8b5 commit 82bbc2a
Show file tree
Hide file tree
Showing 4 changed files with 34 additions and 27 deletions.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn-cluster" or "yarn-client" to run on YARN, and "local" to run
"yarn" to run on YARN, and "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:
Expand Down
32 changes: 17 additions & 15 deletions docs/running-on-yarn.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,18 +16,19 @@ containers used by the application use the same configuration. If the configurat
Java system properties or environment variables not managed by YARN, they should also be set in the
Spark application's configuration (driver, executors, and the AM when running in client mode).

There are two deploy modes that can be used to launch Spark applications on YARN. In `yarn-cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `yarn-client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike [Spark standalone](spark-standalone.html) and [Mesos](running-on-mesos.html) modes, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn-client` or `yarn-cluster`.
Unlike [Spark standalone](spark-standalone.html) and [Mesos](running-on-mesos.html) modes, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`.

To launch a Spark application in `yarn-cluster` mode:
To launch a Spark application in `cluster` mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

For example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
Expand All @@ -37,16 +38,17 @@ For example:

The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Debugging your Application" section below for how to see driver and executor logs.

To launch a Spark application in `yarn-client` mode, do the same, but replace `yarn-cluster` with `yarn-client`. The following shows how you can run `spark-shell` in `yarn-client` mode:
To launch a Spark application in `client` mode, do the same, but replace `cluster` with `client`. The following shows how you can run `spark-shell` in `client` mode:

$ ./bin/spark-shell --master yarn-client
$ ./bin/spark-shell --master yarn --deploy-mode client

## Adding Other JARs

In `yarn-cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command.
In `cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
--master yarn-cluster \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar
my-main-jar.jar
app_arg1 app_arg2
Expand Down Expand Up @@ -129,8 +131,8 @@ If you need a reference to the proper location to put log files in the YARN so t
<td><code>spark.yarn.am.waitTime</code></td>
<td><code>100s</code></td>
<td>
In <code>yarn-cluster</code> mode, time for the YARN Application Master to wait for the
SparkContext to be initialized. In <code>yarn-client</code> mode, time for the YARN Application Master to wait
In <code>cluster</code> mode, time for the YARN Application Master to wait for the
SparkContext to be initialized. In <code>client</code> mode, time for the YARN Application Master to wait
for the driver to connect to it.
</td>
</tr>
Expand Down Expand Up @@ -268,8 +270,8 @@ If you need a reference to the proper location to put log files in the YARN so t
<td>
Add the environment variable specified by <code>EnvironmentVariableName</code> to the
Application Master process launched on YARN. The user can specify multiple of
these and to set multiple environment variables. In <code>yarn-cluster</code> mode this controls
the environment of the Spark driver and in <code>yarn-client</code> mode it only controls
these and to set multiple environment variables. In <code>cluster</code> mode this controls
the environment of the Spark driver and in <code>client</code> mode it only controls
the environment of the executor launcher.
</td>
</tr>
Expand Down Expand Up @@ -388,6 +390,6 @@ If you need a reference to the proper location to put log files in the YARN so t
# Important notes

- Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
- In `yarn-cluster` mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do.
- In `cluster` mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `client` mode, only the Spark executors do.
- The `--files` and `--archives` options support specifying file names with the # similar to Hadoop. For example you can specify: `--files localtest.txt#appSees.txt` and this will upload the file you have locally named `localtest.txt` into HDFS but this will be linked to by the name `appSees.txt`, and your application should use the name as `appSees.txt` to reference it when running on YARN.
- The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `yarn-cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
- The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
2 changes: 1 addition & 1 deletion docs/sql-programming-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -1585,7 +1585,7 @@ on all of the worker nodes, as they will need access to the Hive serialization a
(SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
the query on a YARN cluster (`cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the
YARN cluster. The convenient way to do this is adding them through the `--jars` option and `--file` option of the
`spark-submit` command.
Expand Down
25 changes: 15 additions & 10 deletions docs/submitting-applications.md
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,8 @@ run it with `--help`. Here are a few examples of common options:
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be yarn-client for client mode
--master yarn \
--deploy-mode cluster \ # can be client for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
Expand All @@ -122,21 +123,25 @@ The master URL passed to Spark can be in one of the following formats:

<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
<tr><td> <code>local</code> </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> <code>local[K]</code> </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> <code>local[*]</code> </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
<tr><td> <code>spark://HOST:PORT</code> </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
<tr><td> <code>mesos://HOST:PORT</code> </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
The port must be whichever one your is configured to use, which is 5050 by default.
Or, for a Mesos cluster using ZooKeeper, use <code>mesos://zk://...</code>.
</td></tr>
<tr><td> yarn-client </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
<tr><td> <code>yarn</code> </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
<code>client</code> or <code>cluster</code> mode depending on the value of <code>--deploy-mode</code>.
The cluster location will be found based on the <code>HADOOP_CONF_DIR</code> or <code>YARN_CONF_DIR</code> variable.
</td></tr>
<tr><td> yarn-cluster </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
<tr><td> <code>yarn-client</code> </td><td> Equivalent to <code>yarn</code> with <code>--deploy-mode client</code>,
which is preferred to `yarn-client`
</td></tr>
<tr><td> <code>yarn-cluster</code> </td><td> Equivalent to <code>yarn</code> with <code>--deploy-mode cluster</code>,
which is preferred to `yarn-cluster`
</td></tr>
</table>

Expand Down

0 comments on commit 82bbc2a

Please sign in to comment.