build.sh: Add option to log nvcc compile times #1262

Merged

Conversation

ahendriksen
Contributor

@ahendriksen ahendriksen commented Feb 9, 2023

Add a --time option to build.sh that enables nvcc compile-time logging.

Also add a script, cpp/scripts/analyze_nvcc_log.py, that finds the translation units that take the longest to compile.

Output looks like:

$ cpp/scripts/analyze_nvcc_log.py cpp/build/nvcc_compile_log.csv
-- loading data
-- analyzing data
-- Ten longest translation units:
phase  index                                               file        cicc   cudafe++  fatbinary  gcc (compiling)  gcc (preprocessing 1)  gcc (preprocessing 4)        ptxas   total time
0         10  ions/detail/canberra_double_double_double_int.cu    42.431063  10.601856   0.020979         6.747153               3.721194               2.093567  1618.390375  1684.006186
1         11  zations/detail/canberra_float_float_float_int.cu    36.928960   9.804138   0.011537         6.796088               3.481156               1.790703  1584.262875  1643.075457
2         85  ors/specializations/refine_d_uint64_t_uint8_t.cu   602.935531  14.980877   0.529673        36.300566               6.270717               2.889723   933.622969  1597.530056
3         84  bors/specializations/refine_d_uint64_t_int8_t.cu   606.513281  16.243960   0.729282        39.981113               5.608029               3.028493   897.241469  1569.345628
4         53  stance/neighbors/ivfpq_search_int8_t_uint64_t.cu   841.049750   8.233967   1.025554        24.248578               4.069022               1.747108   631.193734  1511.567713
5         52  istance/neighbors/ivfpq_search_float_uint64_t.cu   837.241437   8.145278   1.042313        24.400606               3.433528               1.882623   627.786672  1503.932457
6         54  tance/neighbors/ivfpq_search_uint8_t_uint64_t.cu   846.706656   8.371286   1.025517        24.094691               3.432749               1.645345   618.319234  1503.595479
7         76  izations/detail/ivfpq_search_uint8_t_uint64_t.cu   698.726266   7.086368   1.050021        39.727723               3.259101               1.333935   406.509937  1157.693351
8         74  alizations/detail/ivfpq_search_float_uint64_t.cu   706.702516   6.905794   1.049731        39.923895               2.814361               2.057154   395.604000  1155.057450
9         75  lizations/detail/ivfpq_search_int8_t_uint64_t.cu   689.390281   6.483386   1.025864        39.865668               3.121696               1.297788   409.099562  1150.284245
10        83  hbors/specializations/refine_d_uint64_t_float.cu   334.705594  15.466444   0.680270        36.551977               5.405133               2.947568   715.708781  1111.465767
-- Plotting absolute compile times
-- Wrote absolute compile time plot to cpp/build/nvcc_compile_log.csv.absolute.compile_times.png
-- Plotting relative compile times
-- Wrote relative compile time plot to cpp/build/nvcc_compile_log.csv.relative.compile_times.png
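
The core of that analysis can be sketched in a few lines of pandas. The following is a simplified illustration, not the actual analyze_nvcc_log.py; the column names ("source file name", "phase name", "metric") are assumed to match the ones used by the analysis code quoted later in this thread.

import pandas as pd

# Load the nvcc --time log and strip the padded column names.
df = pd.read_csv("cpp/build/nvcc_compile_log.csv").rename(columns=str.strip)

# nvcc reports phase times in milliseconds; convert to seconds.
df["seconds"] = df["metric"] / 1000

# Sum all phases per translation unit (excluding the driver wrapper phase)
# and print the ten slowest ones.
per_tu = (
    df[df["phase name"].str.strip() != "nvcc (driver)"]
    .groupby("source file name")["seconds"]
    .sum()
    .sort_values(ascending=False)
)
print(per_tu.head(10))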

@ahendriksen ahendriksen requested a review from a team as a code owner February 9, 2023 13:53
@ahendriksen
Contributor Author

ahendriksen commented Feb 9, 2023

An example of what the output looks like on PR #1254 :

[two screenshots of the generated compile-time plots]

@ahendriksen ahendriksen requested a review from cjnolet February 9, 2023 13:54
@ahendriksen ahendriksen added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Build Time Improvement, and 3 - Ready for Review labels on Feb 9, 2023
@ahendriksen
Contributor Author

Okay... This is causing segfaults in CI. That is a disappointment.

@cjnolet
Member

cjnolet commented Feb 9, 2023

In addition to the segfault, we have other challenges right now trying to extract the ninja_logs from the build for each package in CI (conda build wants to create its own working directory and then copy to a final directory when done, which overwrites the existing files each time).

Maybe for now we could add the --time flag as an option in build.sh (I'm thinking maybe build.sh -t to enable it? What do you think?).

I'd also like to start capturing these scripts in the codebase. Maybe we could create a new directory inside cpp/scripts specifically for these build analysis scripts? Something like cpp/scripts/stats? (I'm terrible at naming sometimes)

@robertmaynard
Contributor

Since we are specifying the same file for each compiler invocation we are hitting parallel clobber issues due to multiple compiler instances writing to the file at the same time.

If you need a high-level overview of which compile jobs take the longest, rather than where inside nvcc the time is spent, you can scrape the ninja build database for that information. libcudf is doing this already: https://github.com/rapidsai/cudf/blob/branch-23.04/build.sh#L303
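
For reference, scraping the ninja build database could look roughly like the sketch below. This is not the libcudf script linked above; it assumes the tab-separated .ninja_log format (version 5), in which each entry records start and end times in milliseconds plus the output path, and the cpp/build location is illustrative.

from pathlib import Path

# Parse .ninja_log (format v5): tab-separated start_ms, end_ms, mtime, output, cmd_hash.
steps = []
for line in Path("cpp/build/.ninja_log").read_text().splitlines():
    if not line or line.startswith("#"):
        continue  # skip the "# ninja log vN" header
    start_ms, end_ms, _mtime, output, _cmd_hash = line.split("\t")
    steps.append((int(end_ms) - int(start_ms), output))

# Print the ten longest build steps.
for duration_ms, output in sorted(steps, reverse=True)[:10]:
    print(f"{duration_ms / 1000:8.1f}s  {output}")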

@ahendriksen
Contributor Author

Since we are specifying the same file for each compiler invocation we are hitting parallel clobber issues due to multiple compiler instances writing to the file at the same time.

I thought that could be an issue as well, but locally (12 cores) this has not been a problem. To rule it out, is there a way to instruct CMake to make the flag dependent on the output file, e.g., to get --time=CMakeFiles/raft_distance_lib.dir/src/distance/neighbors/ivfpq_search_uint8_t_uint64_t.cu.o.nvcc_log.csv?

@robertmaynard
Contributor

Since we are specifying the same file for each compiler invocation we are hitting parallel clobber issues due to multiple compiler instances writing to the file at the same time.

I thought that could be an issue as well, but locally (12 cores) this has not been a problem. To rule it out, is there a way to instruct CMake to make the flag dependent on the output file, e.g., to get --time=CMakeFiles/raft_distance_lib.dir/src/distance/neighbors/ivfpq_search_uint8_t_uint64_t.cu.o.nvcc_log.csv?

There is no easy way to do this. CMake doesn't offer any way to get the object location on a per-source basis.

You would need to do something like this:

get_target_property(sources <target_name> SOURCES)
foreach(source IN LISTS sources)
  cmake_path(GET source FILENAME source_name)
  set_source_files_properties(${source} PROPERTIES COMPILE_FLAGS "--time=${source_name}.csv")
endforeach()

@ahendriksen ahendriksen requested a review from a team as a code owner February 10, 2023 10:12
@ahendriksen ahendriksen changed the title from "Log nvcc compile time by default" to "build.sh: Add option to log nvcc compile times" on Feb 10, 2023
@ahendriksen
Contributor Author

@robertmaynard Thanks for the snippet! It works locally.

Since the --time flag is unlikely to work in general (it is now failing in CI, and it may interfere with compile_commands.json), I have made it an optional flag to build.sh, as @cjnolet suggested. I have also added a script to analyze the results.

@ahendriksen ahendriksen force-pushed the enh-log-nvcc-compile-times branch from d2becd2 to 97d5f9e on February 10, 2023 10:39
@cjnolet
Member

cjnolet commented Feb 11, 2023

@ahendriksen while it may seem cumbersome, I’ve been thinking about @robertmaynard’s previous suggestion and I actually think it could be useful to dump individual csv files for each of the source files that get compiled.

Though, I wonder: should we stick them in a common directory (maybe something like cpp/build/cuda_time_logs)? I still like the idea of keeping it as an option in build.sh (which ultimately turns on an option in cmake), and I can't imagine this would add all that much overhead to the build. What we can do is have the script read through all the csv files in whatever directory we decide on.

What do you think?
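
If the per-source csv route is taken, the aggregation side could be as simple as globbing that directory and concatenating the files. A minimal sketch, assuming the cpp/build/cuda_time_logs name suggested above and the column names produced by nvcc --time (as used by the analysis code elsewhere in this thread):

import glob
import pandas as pd

# Read every per-source nvcc --time csv and stack them into one frame.
frames = [
    pd.read_csv(path).rename(columns=str.strip)
    for path in glob.glob("cpp/build/cuda_time_logs/*.csv")
]
df = pd.concat(frames, ignore_index=True)

# Total time (in seconds) per translation unit, slowest first.
print(
    df.groupby("source file name")["metric"].sum().div(1000).sort_values(ascending=False).head(10)
)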

@ahendriksen
Contributor Author

@cjnolet, a couple of things that may not be immediately obvious about @robertmaynard's suggestion:

  • it creates file names that are equal to the source file name (and thus discards the directory structure);
  • we have a dozen files that share the same name.

In summary, with the change you are suggesting, there would still be some translation units that write to the same csv file, and we would have to collect all the csv files in a directory (as you suggest) to reduce clutter. As such, I think the current design is a local optimum.

@robertmaynard
Contributor

@cjnolet, a couple of things that may not be immediately obvious about @robertmaynard's suggestion:

  • it creates file names that are equal to the source file name (and thus discards the directory structure);
  • we have a dozen files that share the same name.

In summary, with the change you are suggesting, there would still be some translation units that write to the same csv file, and we would have to collect all the csv files in a directory (as you suggest) to reduce clutter. As such, I think the current design is a local optimum.

Those are both solvable by computing the relative path of the source file from the root of the project and encoding it into the file name / path.

@ahendriksen
Contributor Author

I thought this was going to be an easy change, but the scope is slowly getting out of hand. Maybe we can close the PR and leave it as-is for documentation purposes. What do you think, @cjnolet?

@cjnolet
Member

cjnolet commented Mar 9, 2023

@ahendriksen I'll leave that up to you. I do think it's a very useful change and it would really help with debugging compile times. I agree with @robertmaynard, though, that we should probably capture these timings in individual files. It doesn't seem too hard to accomplish this by taking the path for each file and replacing the / with _. Maybe I'm oversimplifying this, though?

@robertmaynard is there an easy way to get a hold of the relative path (from the repo root) for each source file? @ahendriksen I would honestly be happy even if we just dumped the renamed files to a single directory for now.

@robertmaynard
Contributor

@ahendriksen I'll leave that up to you. I do think it's a very useful change and it would really help with debugging compile times. I agree with @robertmaynard, though, that we should probably capture these timings in individual files. It doesn't seem too hard to accomplish this by taking the path for each file and replacing the / with _. Maybe I'm oversimplifying this, though?

@robertmaynard is there an easy way to get a hold of the relative path (from the repo root) for each source file? @ahendriksen I would honestly be happy even if we just dumped the renamed files to a single directory for now.

@cjnolet Here is how you can get the relative path for each source file and convert them into allowable file names:

get_target_property(sources <TARGET> SOURCES)
foreach(source IN LISTS sources)
  cmake_path(IS_ABSOLUTE source is_abs)
  if(is_abs)
    cmake_path(RELATIVE_PATH source BASE_DIRECTORY ${PROJECT_SOURCE_DIR}) # convert to relative path if not already one
  endif()
  string(MAKE_C_IDENTIFIER "${source}" filename) # convert to a valid filename
  set_source_files_properties(${source} PROPERTIES COMPILE_FLAGS "--time=${filename}.csv")
endforeach()
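
For intuition, string(MAKE_C_IDENTIFIER) turns every character that is not alphanumeric or an underscore into an underscore (and prepends an underscore if the result would start with a digit), so the relative source path becomes a collision-free file name. A small Python illustration of that mangling (not part of the PR):

import re

def make_c_identifier(path: str) -> str:
    # Mirror CMake's string(MAKE_C_IDENTIFIER): non-alphanumeric characters -> "_",
    # and prepend "_" if the result would start with a digit.
    ident = re.sub(r"[^0-9A-Za-z_]", "_", path)
    return "_" + ident if ident[:1].isdigit() else ident

print(make_c_identifier("src/neighbors/refine_d_int64_t_int8_t.cu") + ".csv")
# -> src_neighbors_refine_d_int64_t_int8_t_cu.csv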

This will create a csv file in the cpp/build directory that records, for each
translation unit, how long each phase of the nvcc compilation took.

There does not seem to be a downside to enabling this, and it will be
very helpful for diagnosing build issues.

To analyze the file, the following Python code will help. It requires
pandas, matplotlib, and seaborn:

------------------------------------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from matplotlib import colors

df = pd.read_csv("./nvcc_compile_log.csv")
df = df.rename(columns=str.strip)
df["seconds"] = df["metric"] / 1000
df["file"] = df["source file name"]
df["phase"] = df["phase name"].str.strip()

def categorize_time(s):
    if s < 60:
        return "less than a minute"
    else:
        return "more than a minute"

dfp = df.query("phase!='nvcc (driver)'").pivot(index="file", values="seconds", columns="phase")
dfp_sum = dfp.sum(axis="columns")

df_fraction = dfp.divide(dfp_sum, axis="index")
df_fraction["total time"] = dfp_sum
df_fraction = df_fraction.melt(ignore_index=False, id_vars="total time", var_name="phase", value_name="fraction")

dfp["total time"] = dfp_sum
df_absolute = dfp.melt(ignore_index=False, id_vars="total time", var_name="phase", value_name="seconds")

df_fraction["time category"] = dfp["total time"].apply(categorize_time)
df_absolute["time category"] = dfp["total time"].apply(categorize_time)

palette = {
    "gcc (preprocessing 4)": colors.hsv_to_rgb((0, 1, 1)),
    'cudafe++': colors.hsv_to_rgb((0, 1, .75)),
    'gcc (compiling)': colors.hsv_to_rgb((0, 1, .4)),
    "gcc (preprocessing 1)": colors.hsv_to_rgb((.33, 1, 1)),
    'cicc': colors.hsv_to_rgb((.33, 1, 0.75)),
    'ptxas': colors.hsv_to_rgb((.33, 1, 0.4)),
    'fatbinary': "grey",
}

sns.displot(
    df_absolute.sort_values("total time"),
    y="file",
    hue="phase",
    hue_order=reversed(["gcc (preprocessing 4)", 'cudafe++', 'gcc (compiling)',
               "gcc (preprocessing 1)", 'cicc', 'ptxas',
               'fatbinary',
              ]),
    palette=palette,
    weights="seconds",
    multiple="stack",
    kind="hist",
    height=20,
)
plt.xlabel("seconds");
plt.savefig('absolute_compile_times.png')

sns.displot(
    df_fraction.sort_values('total time'),
    y="file",
    hue="phase",
    hue_order=reversed(["gcc (preprocessing 4)", 'cudafe++', 'gcc (compiling)',
               "gcc (preprocessing 1)", 'cicc', 'ptxas',
               'fatbinary',
              ]),
    palette=palette,
    weights="fraction",
    multiple="stack",
    kind="hist",
    height=15,
)
plt.xlabel("fraction");
plt.savefig("relative_compile_times.png")

Hopefully this prevents segfaults in CI.

The time option is disabled by default. When enabled, it writes a log of
compilation times to cpp/build/nvcc_compile_log.csv.

This is not supported in CI, as it leads to segfaults.
@ahendriksen ahendriksen force-pushed the enh-log-nvcc-compile-times branch from f0d31b0 to 2aec639 on March 21, 2023 17:12
Perhaps the file system outside the CMakeFiles/ directory is not
writable.
@ahendriksen
Contributor Author

I think it might be more difficult than we thought to get this working in CI. I have tried the suggested approach of writing to separate csv files, but the compilation continues to segfault in CI. A formatted example from the logs:

sccache /opt/conda/conda-bld/_build_env/bin/nvcc -forward-unknown-to-host-compiler [.. snip flags .. ]  \
--time=CMakeFiles/nvcc_log_src_neighbors_refine_d_int64_t_int8_t_cu.csv  \
-MD -MT CMakeFiles/raft_lib.dir/src/neighbors/refine_d_int64_t_int8_t.cu.o \
 -MF CMakeFiles/raft_lib.dir/src/neighbors/refine_d_int64_t_int8_t.cu.o.d -x cu \
-c /opt/conda/conda-bld/work/cpp/src/neighbors/refine_d_int64_t_int8_t.cu \
-o CMakeFiles/raft_lib.dir/src/neighbors/refine_d_int64_t_int8_t.cu.o
nvcc: Segmentation fault

I thought it could have to do with long file names that commonly occur in conda builds, but the paths passed to nvcc are all reasonably short.

If this cannot be made to work in CI, then maybe we should just document that -DCMAKE_CUDA_FLAGS='--time=nvcc_log.csv' will output the log file in one location, and that the script provided in the scripts directory can be used to analyze it (with all the bitrot risk that entails).

@cjnolet
Member

cjnolet commented Mar 25, 2023

@robertmaynard do you have any other ideas as to why this might still be causing a segfault in CI?

I'm actually a little surprised myself that it's still failing.

@cjnolet
Member

cjnolet commented Mar 28, 2023

@ahendriksen I'm actually on board with keeping this feature as-is for now (outputting to individual files that we can analyze separately) and just not having it run in CI. At least the feature exists so that we can run it locally to profile the build, and we can turn it back on in CI once we figure out what's causing the segfault. What do you think? I'd love to have this in 23.04.

@ahendriksen
Contributor Author

ahendriksen commented Mar 28, 2023 via email

@cjnolet
Member

cjnolet commented Mar 29, 2023

@ahendriksen are you sure all the data is being written to the file in that case? I wouldn't normally expect multiple threads / processes to be able to independently write to the same file without some sort of locking mechanism. @robertmaynard do you know if this is the case with nvcc?

@ahendriksen
Contributor Author

@cjnolet: I haven't read the nvcc source code, if that is what you mean. I just tested with make -j 200 and 300 translation units, each containing a single empty kernel. The resulting csv file still parses and seems to be correct.
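
A sketch of that kind of stress test is below; the file names and parallelism level are illustrative, not the exact setup used. It generates many translation units that each contain a single empty kernel, compiles them in parallel with --time pointing at one shared csv, and then checks that the csv still parses.

import pathlib
import subprocess
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

workdir = pathlib.Path("nvcc_time_stress")
workdir.mkdir(exist_ok=True)

# Generate 300 translation units, each with a single empty kernel.
for i in range(300):
    (workdir / f"tu_{i}.cu").write_text(f"__global__ void kernel_{i}() {{}}\n")

def compile_tu(i: int) -> None:
    subprocess.run(
        ["nvcc", "--time=shared_log.csv", "-c", f"tu_{i}.cu", "-o", f"tu_{i}.o"],
        cwd=workdir,
        check=True,
    )

# Compile with high parallelism to provoke concurrent writes to shared_log.csv.
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(compile_tu, range(300)))

# If nvcc's locking works, the shared csv should still parse cleanly.
print(pd.read_csv(workdir / "shared_log.csv").head())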

@robertmaynard
Contributor

I agree with @ahendriksen: in testing, this nvcc feature does seem to be multi-process safe. It most likely comes at some performance cost, as each process takes an exclusive lock on the file.

My only theory on why we are seeing segfaults in CI is that this tracking incurs some memory overhead, which is causing an OOM segfault.

@cjnolet
Member

cjnolet commented Mar 29, 2023

@ahendriksen @robertmaynard sounds good to me. If we aren't losing any information, I do agree that it's easier for analysis just to keep everything in the same file.

@cjnolet
Member

cjnolet commented Mar 29, 2023

/merge

@rapids-bot rapids-bot bot merged commit e963f5a into rapidsai:branch-23.04 Mar 29, 2023