[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

### What changes were proposed in this pull request?

This adds a section to the tuning guide on increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so the knowledge is better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit bf221de)
Signed-off-by: HyukjinKwon <[email protected]>
sunchao authored and HyukjinKwon committed Aug 21, 2020
1 parent d846476 commit 4344a69
Showing 2 changed files with 33 additions and 0 deletions.
22 changes: 22 additions & 0 deletions docs/sql-performance-tuning.md
@@ -90,6 +90,28 @@ that these options will be deprecated in future release as more optimizations ar
Configures the number of partitions to use when shuffling data for joins or aggregations.
</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
<td>32</td>
<td>
Configures the threshold to enable parallel listing for job input paths. If the number of
input paths is larger than this threshold, Spark will list the files by using a distributed job.
Otherwise, it will fall back to sequential listing. This configuration is only effective when
using file-based data sources such as Parquet, ORC and JSON.
</td>
<td>1.5.0</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
<td>10000</td>
<td>
Configures the maximum listing parallelism for job input paths. If the number of input
paths is larger than this value, the parallelism is throttled down to this value. As above,
this configuration is only effective when using file-based data sources such as Parquet, ORC
and JSON.
</td>
<td>2.1.1</td>
</tr>
</table>
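
As an illustration, here is a minimal Scala sketch (not part of this patch) that raises both options on a session before reading a heavily partitioned source; the application name, path, and values below are hypothetical, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParallelListingExample")  // hypothetical application name
  .getOrCreate()

// List input paths with a distributed Spark job once there are more than 8 of
// them, using at most 64 parallel listing tasks (illustrative values).
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "8")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "64")

// Read a directory tree with many partition directories (hypothetical path).
val df = spark.read.parquet("s3a://bucket/events/")
```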

## Broadcast Hint for SQL Queries
11 changes: 11 additions & 0 deletions docs/tuning.md
@@ -249,6 +249,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.

## Parallel Listing on Input Paths

Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
otherwise the process could take a very long time, especially when running against an object store like S3.
If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
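
A minimal Scala sketch of this (not part of this patch), assuming a hypothetical application name and input path, with an illustrative thread count:

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ParallelInputListing")  // hypothetical
  // List input directories with 20 threads instead of the default 1.
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "20")

val sc = new SparkContext(conf)

// The input directories matched by this (hypothetical) glob are listed using
// the configured number of threads.
val rdd = sc.sequenceFile("s3a://bucket/input/*", classOf[Text], classOf[Text])
```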

For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
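
A short sketch of the same settings applied through SQL, assuming an existing `SparkSession` named `spark` (as in `spark-shell`); the values are illustrative only:

```scala
spark.sql("SET spark.sql.sources.parallelPartitionDiscovery.threshold=8")
spark.sql("SET spark.sql.sources.parallelPartitionDiscovery.parallelism=64")
```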

## Memory Usage of Reduce Tasks

Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
