[SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins

## What changes were proposed in this pull request?

### Background

As for the current status, the test script that generates coverage information has already been merged
into Spark (apache#20204).

So, we can now generate the coverage report and site, for example, by:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

just like the `run-tests` script in `./python`.
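
As a quick local check, here is a minimal sketch of generating and opening the report; the `test_coverage/htmlcov` output path is taken from the copy step shown later in this description:

```bash
# Run the SQL tests with coverage from ./python, like run-tests.
cd python
./run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql

# The HTML report lands under python/test_coverage/htmlcov/; open it in a browser.
xdg-open test_coverage/htmlcov/index.html   # or `open` on macOS
```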

### Proposed change

The next step is to have Jenkins automatically host this coverage report
via `github.io` (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

In short, this PR targets running the coverage job in
[spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/).

In that specific job, it clones the site and replaces its contents with the up-to-date PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to gh-pages.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So, a single, always up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.

### TODOs

- [x] Write a draft - HyukjinKwon
- [x] `pip install coverage` for all Python implementations (PyPy, Python 2, Python 3) on the Jenkins workers - shaneknapp
- [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via a built-in Jenkins feature.
  This should be set in both the PR builder and `spark-master-test-sbt-hadoop-2.7` so that other PRs can later test and fix bugs - shaneknapp
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that only that specific build reports and updates the coverage site (see the sketch after this list) - shaneknapp
- [x] Make the PR builder's tests pass - HyukjinKwon
- [x] Fix flaky tests related to coverage - HyukjinKwon
  -  6 consecutive passes out of 7 runs
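
For reference, a rough sketch of how the two Jenkins-side settings above would come together in the coverage job; the variable names are taken from the `dev/run-tests.py` change in this commit, while the exact Jenkins job configuration is assumed:

```bash
# Only the spark-master-test-sbt-hadoop-2.7 job sets both variables; dev/run-tests.py
# only checks that SPARK_MASTER_SBT_HADOOP_2_7 is present in the environment.
export SPARK_MASTER_SBT_HADOOP_2_7=1
export SPARK_TEST_KEY=...   # hidden spark-test credential injected by Jenkins
./dev/run-tests.py
```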

This PR will be co-authored by me and shaneknapp.

## How was this patch tested?

It will be tested via Jenkins.

Closes apache#23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: shane knapp <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon and shaneknapp committed Feb 1, 2019
1 parent e44f308 commit cdd694c
Showing 3 changed files with 71 additions and 3 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -2,6 +2,7 @@

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
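
The new README badge is a shields.io "dynamic XML" badge: it fetches the published coverage site and extracts the total percentage with the XPath query `/html/body/div[1]/div/h1/span` (URL-encoded in the badge link above). A rough, hedged equivalent of what the badge service does, assuming `curl` and `xmllint` are available:

```bash
# Fetch the coverage site and pull the overall percentage out of the report
# header, using the same XPath the badge query encodes.
curl -s https://spark-test.github.io/pyspark-coverage-site/ \
  | xmllint --html --xpath '/html/body/div[1]/div/h1/span/text()' - 2>/dev/null
```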
63 changes: 60 additions & 3 deletions dev/run-tests.py
@@ -25,6 +25,8 @@
import re
import sys
import subprocess
import glob
import shutil
from collections import namedtuple

from sparktestsupport import SPARK_HOME, USER_HOME, ERROR_CODES
@@ -400,15 +402,66 @@ def run_scala_tests(build_tool, hadoop_version, test_modules, excluded_tags):
run_scala_tests_sbt(test_modules, test_profiles)


def run_python_tests(test_modules, parallelism):
def run_python_tests(test_modules, parallelism, with_coverage=False):
set_title_and_block("Running PySpark tests", "BLOCK_PYSPARK_UNIT_TESTS")

command = [os.path.join(SPARK_HOME, "python", "run-tests")]
if with_coverage:
# Coverage makes the PySpark tests flaky due to heavy parallelism.
# When we run PySpark tests with coverage, use a parallelism of 4 for now
# as a workaround.
parallelism = 4
script = "run-tests-with-coverage"
else:
script = "run-tests"
command = [os.path.join(SPARK_HOME, "python", script)]
if test_modules != [modules.root]:
command.append("--modules=%s" % ','.join(m.name for m in test_modules))
command.append("--parallelism=%i" % parallelism)
run_cmd(command)

if with_coverage:
post_python_tests_results()


def post_python_tests_results():
if "SPARK_TEST_KEY" not in os.environ:
print("[error] 'SPARK_TEST_KEY' environment variable was not set. Unable to post "
"PySpark coverage results.")
sys.exit(1)
spark_test_key = os.environ.get("SPARK_TEST_KEY")
# The steps below upload HTMLs to 'github.com/spark-test/pyspark-coverage-site'.
# 1. Clone PySpark coverage site.
run_cmd([
"git",
"clone",
"https://spark-test:%[email protected]/spark-test/pyspark-coverage-site.git" % spark_test_key])
# 2. Remove existing HTMLs.
run_cmd(["rm", "-fr"] + glob.glob("pyspark-coverage-site/*"))
# 3. Copy generated coverage HTMLs.
for f in glob.glob("%s/python/test_coverage/htmlcov/*" % SPARK_HOME):
shutil.copy(f, "pyspark-coverage-site/")
os.chdir("pyspark-coverage-site")
try:
# 4. Check out to a temporary branch.
run_cmd(["git", "symbolic-ref", "HEAD", "refs/heads/latest_branch"])
# 5. Add all the files.
run_cmd(["git", "add", "-A"])
# 6. Commit current HTMLs.
run_cmd([
"git",
"commit",
"-am",
"Coverage report at latest commit in Apache Spark",
'--author="Apache Spark Test Account <[email protected]>"'])
# 7. Delete the old branch.
run_cmd(["git", "branch", "-D", "gh-pages"])
# 8. Rename the temporary branch to gh-pages.
run_cmd(["git", "branch", "-m", "gh-pages"])
# 9. Finally, force update to our repository.
run_cmd(["git", "push", "-f", "origin", "gh-pages"])
finally:
os.chdir("..")


def run_python_packaging_tests():
set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS")
@@ -567,7 +620,11 @@ def main():

modules_with_python_tests = [m for m in test_modules if m.python_test_goals]
if modules_with_python_tests:
run_python_tests(modules_with_python_tests, opts.parallelism)
# We only run PySpark tests with coverage report in one specific job with
# Spark master with SBT in Jenkins.
is_sbt_master_job = "SPARK_MASTER_SBT_HADOOP_2_7" in os.environ
run_python_tests(
modules_with_python_tests, opts.parallelism, with_coverage=is_sbt_master_job)
run_python_packaging_tests()
if any(m.should_run_r_tests for m in test_modules):
run_sparkr_tests()
10 changes: 10 additions & 0 deletions python/pyspark/streaming/tests/test_dstream.py
@@ -22,12 +22,16 @@
import unittest
from functools import reduce
from itertools import chain
import platform

from pyspark import SparkConf, SparkContext, RDD
from pyspark.streaming import StreamingContext
from pyspark.testing.streamingutils import PySparkStreamingTestCase


@unittest.skipIf(
"pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
"PyPy implementation causes to hang DStream tests forever when Coverage report is used.")
class BasicOperationTests(PySparkStreamingTestCase):

def test_map(self):
@@ -389,6 +393,9 @@ def failed_func(i):
self.fail("a failed func should throw an error")


@unittest.skipIf(
"pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
"PyPy implementation causes to hang DStream tests forever when Coverage report is used.")
class WindowFunctionTests(PySparkStreamingTestCase):

timeout = 15
@@ -466,6 +473,9 @@ def func(dstream):
self._test_func(input, func, expected)


@unittest.skipIf(
"pypy" in platform.python_implementation().lower() and "COVERAGE_PROCESS_START" in os.environ,
"PyPy implementation causes to hang DStream tests forever when Coverage report is used.")
class CheckpointTests(unittest.TestCase):

setupCalled = False
