Commit 60472db

HyukjinKwon authored and rxin committed
[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions
## What changes were proposed in this pull request?

This generates documentation for Spark SQL built-in functions. One drawback is that it requires a proper build to generate the built-in function list. Once Spark is built, generation itself only takes a few seconds via `sql/create-docs.sh`. See https://spark-test.github.io/sparksqldoc/, which I hosted to show the output documentation.

There is some more work to be done to make the documentation pretty, for example separating `Arguments:` and `Examples:`, but I guess this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing them. I will fix these in a follow-up.

This requires `pip install mkdocs` to generate HTML from the markdown files.

## How was this patch tested?

Manually tested:

```
cd docs
jekyll build
```

```
cd docs
jekyll serve
```

and

```
cd sql
create-docs.sh
```

Author: hyukjinkwon <[email protected]>

Closes apache#18702 from HyukjinKwon/SPARK-21485.
1 parent cf29828, commit 60472db

11 files changed: 203 additions & 3 deletions

.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -47,6 +47,8 @@ dev/pr-deps/
 dist/
 docs/_site
 docs/api
+sql/docs
+sql/site
 lib_managed/
 lint-r-report.log
 log/
```

docs/README.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -68,6 +68,6 @@ jekyll plugin to run `build/sbt unidoc` before building the site so if you haven
 may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
 PySpark docs using [Sphinx](http://sphinx-doc.org/).

-NOTE: To skip the step of building and copying over the Scala, Python, R API docs, run `SKIP_API=1
-jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, and `SKIP_RDOC=1` can be used to skip a single
-step of the corresponding language.
+NOTE: To skip the step of building and copying over the Scala, Python, R and SQL API docs, run `SKIP_API=1
+jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
+to skip a single step of the corresponding language.
```

docs/_layouts/global.html

Lines changed: 1 addition & 0 deletions

```diff
@@ -86,6 +86,7 @@
 <li><a href="api/java/index.html">Java</a></li>
 <li><a href="api/python/index.html">Python</a></li>
 <li><a href="api/R/index.html">R</a></li>
+<li><a href="api/sql/index.html">SQL, Built-in Functions</a></li>
 </ul>
 </li>
```

docs/_plugins/copy_api_dirs.rb

Lines changed: 27 additions & 0 deletions

```diff
@@ -150,4 +150,31 @@
   cp("../R/pkg/DESCRIPTION", "api")
 end

+if not (ENV['SKIP_SQLDOC'] == '1')
+  # Build SQL API docs
+
+  puts "Moving to project root and building API docs."
+  curr_dir = pwd
+  cd("..")
+
+  puts "Running 'build/sbt clean package' from " + pwd + "; this may take a few minutes..."
+  system("build/sbt clean package") || raise("SQL doc generation failed")
+
+  puts "Moving back into docs dir."
+  cd("docs")
+
+  puts "Moving to SQL directory and building docs."
+  cd("../sql")
+  system("./create-docs.sh") || raise("SQL doc generation failed")
+
+  puts "Moving back into docs dir."
+  cd("../docs")
+
+  puts "Making directory api/sql"
+  mkdir_p "api/sql"
+
+  puts "cp -r ../sql/site/. api/sql"
+  cp_r("../sql/site/.", "api/sql")
+end
+
 end
```

docs/api.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -9,3 +9,4 @@ Here you can read API docs for Spark and its submodules.
 - [Spark Java API (Javadoc)](api/java/index.html)
 - [Spark Python API (Sphinx)](api/python/index.html)
 - [Spark R API (Roxygen2)](api/R/index.html)
+- [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)
```

docs/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -100,6 +100,7 @@ options for deployment:
 * [Spark Java API (Javadoc)](api/java/index.html)
 * [Spark Python API (Sphinx)](api/python/index.html)
 * [Spark R API (Roxygen2)](api/R/index.html)
+* [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)

 **Deployment Guides:**
```

sql/README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -8,3 +8,5 @@ Spark SQL is broken up into four subprojects:
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
 - Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
+
+Running `sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`.
```

sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala

Lines changed: 7 additions & 0 deletions

```diff
@@ -17,9 +17,16 @@

 package org.apache.spark.sql.api.python

+import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
+import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.types.DataType

 private[sql] object PythonSQLUtils {
   def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)
+
+  // This is needed when generating SQL documentation for built-in functions.
+  def listBuiltinFunctionInfos(): Array[ExpressionInfo] = {
+    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
+  }
 }
```
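Each `ExpressionInfo` returned by this helper carries a usage string in which the `_FUNC_` placeholder stands for the function's registered name; the doc generator substitutes it before rendering. A minimal sketch of that substitution (the sample usage string below is hypothetical, not taken from Spark's registry):

```python
# Sketch of `_FUNC_` placeholder substitution as done by the doc generator;
# the sample usage string is hypothetical.

def substitute_func(text, name):
    """Replace the _FUNC_ placeholder with the function's actual name."""
    return text.replace("_FUNC_", name) if text is not None else None

usage = "_FUNC_(expr) - Returns the absolute value of `expr`."
print(substitute_func(usage, "abs"))
# -> abs(expr) - Returns the absolute value of `expr`.
```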

sql/create-docs.sh

Lines changed: 49 additions & 0 deletions

```bash
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Script to create SQL API docs. This requires `mkdocs` and to build
# Spark first. After running this script the html docs can be found in
# $SPARK_HOME/sql/site

set -o pipefail
set -e

FWDIR="$(cd "`dirname "${BASH_SOURCE[0]}"`"; pwd)"
SPARK_HOME="$(cd "`dirname "${BASH_SOURCE[0]}"`"/..; pwd)"

if ! hash python 2>/dev/null; then
  echo "Missing python in your path, skipping SQL documentation generation."
  exit 0
fi

if ! hash mkdocs 2>/dev/null; then
  echo "Missing mkdocs in your path, skipping SQL documentation generation."
  exit 0
fi

# Now create the markdown file
rm -fr docs
mkdir docs
echo "Generating markdown files for SQL documentation."
"$SPARK_HOME/bin/spark-submit" gen-sql-markdown.py

# Now create the HTML files
echo "Generating HTML files for SQL documentation."
mkdocs build --clean
rm -fr docs
```
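The script degrades gracefully when `python` or `mkdocs` is absent (`exit 0` rather than failing the whole docs build). The same PATH-based guard can be sketched in Python with `shutil.which`; this is an illustrative analogue, not part of the commit:

```python
import shutil

def tool_available(tool):
    """Mirror the script's `hash tool 2>/dev/null`: True if `tool` is on PATH."""
    return shutil.which(tool) is not None

# Skip SQL doc generation gracefully when an optional dependency is missing.
for tool in ("python", "mkdocs"):
    if not tool_available(tool):
        print("Missing %s in your path, skipping SQL documentation generation." % tool)
```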

sql/gen-sql-markdown.py

Lines changed: 91 additions & 0 deletions

```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys
import os
from collections import namedtuple

ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")


def _list_function_infos(jvm):
    """
    Returns a list of function information via JVM. Sorts wrapped expression infos by name
    and returns them.
    """

    jinfos = jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctionInfos()
    infos = []
    for jinfo in jinfos:
        name = jinfo.getName()
        usage = jinfo.getUsage()
        usage = usage.replace("_FUNC_", name) if usage is not None else usage
        extended = jinfo.getExtended()
        extended = extended.replace("_FUNC_", name) if extended is not None else extended
        infos.append(ExpressionInfo(
            className=jinfo.getClassName(),
            usage=usage,
            name=name,
            extended=extended))
    return sorted(infos, key=lambda i: i.name)


def _make_pretty_usage(usage):
    """
    Makes the usage description pretty and returns a formatted string.
    Otherwise, returns None.
    """

    if usage is not None and usage.strip() != "":
        usage = "\n".join(map(lambda u: u.strip(), usage.split("\n")))
        return "%s\n\n" % usage


def _make_pretty_extended(extended):
    """
    Makes the extended description pretty and returns a formatted string.
    Otherwise, returns None.
    """

    if extended is not None and extended.strip() != "":
        extended = "\n".join(map(lambda u: u.strip(), extended.split("\n")))
        return "```%s```\n\n" % extended


def generate_sql_markdown(jvm, path):
    """
    Generates a markdown file after listing the function information. The output file
    is created in `path`.
    """

    with open(path, 'w') as mdfile:
        for info in _list_function_infos(jvm):
            mdfile.write("### %s\n\n" % info.name)
            usage = _make_pretty_usage(info.usage)
            extended = _make_pretty_extended(info.extended)
            if usage is not None:
                mdfile.write(usage)
            if extended is not None:
                mdfile.write(extended)


if __name__ == "__main__":
    from pyspark.java_gateway import launch_gateway

    jvm = launch_gateway().jvm
    markdown_file_path = "%s/docs/index.md" % os.path.dirname(sys.argv[0])
    generate_sql_markdown(jvm, markdown_file_path)
```
