[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

### What changes were proposed in this pull request?

This adds a section to the tuning guide on increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so the knowledge is better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit bf221de)
Signed-off-by: HyukjinKwon <[email protected]>
sunchao authored and HyukjinKwon committed Aug 21, 2020
1 parent d846476 commit 4344a69
Showing 2 changed files with 33 additions and 0 deletions.
22 changes: 22 additions & 0 deletions docs/sql-performance-tuning.md
@@ -90,6 +90,28 @@ that these options will be deprecated in future release as more optimizations ar
Configures the number of partitions to use when shuffling data for joins or aggregations.
</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
<td>32</td>
<td>
Configures the threshold to enable parallel listing for job input paths. If the number of
input paths is larger than this threshold, Spark will list the files by using a distributed job.
Otherwise, it will fall back to sequential listing. This configuration is only effective when
using file-based data sources such as Parquet, ORC and JSON.
</td>
<td>1.5.0</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
<td>10000</td>
<td>
Configures the maximum listing parallelism for job input paths. If the number of input
paths is larger than this value, the parallelism is throttled down to this value. As above,
this configuration is only effective when using file-based data sources such as Parquet, ORC
and JSON.
</td>
<td>2.1.1</td>
</tr>
</table>
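
As an illustration, here is a minimal Scala sketch (not part of this patch) that raises both options on a session before reading a heavily partitioned source; the application name, path, and values below are hypothetical, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParallelListingExample")  // hypothetical application name
  .getOrCreate()

// List input paths with a distributed Spark job once there are more than 8 of
// them, using at most 64 parallel listing tasks (illustrative values).
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "8")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "64")

// Read a directory tree with many partition directories (hypothetical path).
val df = spark.read.parquet("s3a://bucket/events/")
```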

## Broadcast Hint for SQL Queries
11 changes: 11 additions & 0 deletions docs/tuning.md
@@ -249,6 +249,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.

## Parallel Listing on Input Paths

Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
otherwise the process could take a very long time, especially when running against an object store like S3.
If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
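
A minimal Scala sketch of this (not part of this patch), assuming a hypothetical application name and input path, with an illustrative thread count:

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ParallelInputListing")  // hypothetical
  // List input directories with 20 threads instead of the default 1.
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "20")

val sc = new SparkContext(conf)

// The input directories matched by this (hypothetical) glob are listed using
// the configured number of threads.
val rdd = sc.sequenceFile("s3a://bucket/input/*", classOf[Text], classOf[Text])
```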

For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
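
A short sketch of the same settings applied through SQL, assuming an existing `SparkSession` named `spark` (as in `spark-shell`); the values are illustrative only:

```scala
spark.sql("SET spark.sql.sources.parallelPartitionDiscovery.threshold=8")
spark.sql("SET spark.sql.sources.parallelPartitionDiscovery.parallelism=64")
```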

## Memory Usage of Reduce Tasks

Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
