last updated: 5/31/2023
This is a rewrite in Kotlin of parts of the devcdm and netcdf-java libraries.
The intention is to create a maintainable, read-only, pure JVM library allowing full access to netcdf3, netcdf4, hdf4, hdf5, hdf-eos2 and hdf-eos5 data files.
Please contact me if you'd like to help out. Especially needed are test datasets from all the important data archives!!
There is so much important scientific data stored in the NetCDF and HDF file formats that those formats will never go away. It is important that there be maintainable, independent libraries to read these files forever.
The Netcdf-Java library prototyped a "Common Data Model" (CDM) to provide a single API to access various file formats. The netcdf* and hdf* file formats are similar enough to make a common API a practical and useful goal. By focusing on read-only access to just these formats, the API and the code are kept simple.
In short, a library that focuses on simplicity and clarity is a safeguard for the huge investment in these scientific datasets.
The reference libraries are well maintained but complex. They are coded in C, which is a difficult language to master and keep bug-free, with implications for memory safety and security. The libraries require various machine- and OS-dependent toolchains. Shifts in funding could wipe out much of the institutional knowledge needed to maintain them.
The HDF file formats are overly complicated, which impacts code complexity and clarity. The data structures do not always map to a user-understandable data model. Semantics are left to data-writers to document (or not). While this problem isn't specific to HDF file users, it is exacerbated by a "group of messages" design approach.
The HDF4 C library is a curious hodgepodge of disjointed APIs. The HDF5 API is better and the Netcdf4 API much better. But all suffer from the limitations of the C language, the difficulty of writing good documentation for all skill levels of programmers, and the need to support legacy APIs.
HDF-EOS uses an undocumented "Object Descriptor Language (ODL)" text format, which adds a dependency on the SDP Toolkit and possibly other libraries. These toolkits also provide functionality such as handling projections and coordinate system conversions, and arguably it's impossible to process HDF-EOS without them. So the value added here by an independent library for data access is less clear. For now, we will provide a "best-effort" to expose the internal contents of the file.
Currently, the Netcdf-4 and HDF4 libraries are not thread safe, even when operating on different files. The HDF5 library can be built with MPI-IO for parallel file systems. The serial HDF5 library is apparently thread safe but does not support concurrent reading. These are serious limitations for high performance, scalable applications.
Our library aims to ameliorate these problems, so that scientists and the interested public can access the data without having to become specialists in the file formats and legacy APIs.
Kotlin is a modern, statically typed, garbage-collected language suitable for large development projects. It has many new features for safer (like null-safety) and more concise (like functional idioms) code, and is an important improvement over Java, without giving up any of Java's strengths. Kotlin will attract the next generation of serious open-source developers, and hopefully some of them will be willing to keep this library working into the unforeseeable future.
We are aiming to be within 2x of the C libraries for reading data. Preliminary tests indicate that's a reasonable goal. For HDF5 files using deflate filters, the deflate library dominates the read time, and standard Java deflate libraries are about 2x slower than native code. Unless the deflate libraries get better, there's not much gain in trying to make other parts of the code faster.
It's possible we can use Kotlin coroutines to speed up performance bottlenecks; TBD.
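As a hedged sketch of that idea (not code from this library): deflate-compressed chunks are independent of each other, so they could in principle be inflated in parallel:

```kotlin
import java.util.zip.Inflater
import kotlinx.coroutines.*

// Sketch only: decompress independent deflate-compressed chunks in parallel.
// The chunk list and uncompressedSize are illustrative inputs, not library API.
suspend fun inflateChunks(chunks: List<ByteArray>, uncompressedSize: Int): List<ByteArray> =
    coroutineScope {
        chunks.map { chunk ->
            async(Dispatchers.Default) {
                val inflater = Inflater()   // one Inflater per task; Inflater is not thread-safe
                inflater.setInput(chunk)
                val out = ByteArray(uncompressedSize)
                val n = inflater.inflate(out)
                inflater.end()
                out.copyOf(n)
            }
        }.awaitAll()
    }
```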
We are using the Foreign Function & Memory API (Java 19 Preview) for testing against the Netcdf, HDF5, and HDF4 C libraries. With these tools we can be confident that our library gives the same results as the reference libraries.
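As a hedged illustration of the approach (not the project's actual test code), a downcall to the netcdf-C library might look like this. The library name and the use of the finalized java.lang.foreign API (Java 22+) are assumptions; the Java 19 preview spells some of these methods differently:

```kotlin
import java.lang.foreign.*

// Sketch only: call `const char* nc_inq_libvers(void)` from libnetcdf to see
// which reference library version we are comparing against.
fun netcdfCLibraryVersion(): String {
    val linker = Linker.nativeLinker()
    val netcdf = SymbolLookup.libraryLookup("netcdf", Arena.global())
    val handle = linker.downcallHandle(
        netcdf.find("nc_inq_libvers").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.ADDRESS)   // returns a char*
    )
    val result = handle.invokeExact() as MemorySegment
    // give the zero-length native segment a readable extent before decoding the C string
    return result.reinterpret(Long.MAX_VALUE).getString(0)
}
```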
Currently we have this test coverage from core/test:
cdm 88% (1560/1764) LOC
hdf4 84% (1743/2071) LOC
hdf5 81% (2278/2792) LOC
netcdf3 77% (230/297) LOC
The core library has ~6500 LOC.
More and deeper test coverage is provided in the clibs module, which compares netchdf metadata and data against the Netcdf, HDF5, and HDF4 C libraries. The clibs module is not part of the released netchdf library and is only supported for test purposes.
Currently we have ~1500 test files:
hdf4 = 205 files
hdf-eos2 = 267 files
hdf5 = 113 files
hdf-eos5 = 18 files
netcdf3 = 664 files
netcdf3.2 = 81 files
netcdf3.5 = 1 file
netcdf4 = 121 files
We need to get representative samples of recent files for improved testing and code coverage.
Our goal is to give read access to all the content in NetCDF, HDF5, HDF4 and HDF-EOS files.
The library will be thread-safe for reading multiple files concurrently (see the sketch below).
We are focusing on earth science data, and don't plan to support other uses except as a byproduct.
We do not plan to provide write capabilities.
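As a hedged sketch of the concurrent-read goal (openNetchdf, rootGroup, and variables are illustrative names, not the settled API):

```kotlin
import kotlinx.coroutines.*

// Sketch only: read several files concurrently, one Netchdf instance per file.
suspend fun summarize(filenames: List<String>): List<String> = coroutineScope {
    filenames.map { name ->
        async(Dispatchers.IO) {
            openNetchdf(name)?.use { ncfile ->   // openNetchdf is a hypothetical factory
                "$name: ${ncfile.rootGroup().variables.size} variables"
            } ?: "$name: not recognized"
        }
    }.awaitAll()
}
```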
Datatype<T>, Attribute<T>, Variable<T>, StructureMember<T>, Array<T> and ArraySection<T> are all generics, with T indicating the data type returned when read, e.g.:
fun <T> readArrayData(v2: Variable<T>, section: SectionPartial? = null): ArrayTyped<T>
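A hedged usage sketch (openNetchdf and rootGroup are illustrative names; only readArrayData comes from the signature above):

```kotlin
// Sketch only: open a file, pick a variable, and read all of its data.
fun printFirstVariable(filename: String) {
    openNetchdf(filename)?.use { ncfile ->
        val v = ncfile.rootGroup().variables.first()
        val data = ncfile.readArrayData(v)   // section = null reads the entire variable
        println("${v.name}: $data")
    }
}
```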
- Datatype.ENUM returns an array of the corresponding UBYTE/USHORT/UINT. Call data.convertEnums() to turn this into an ArrayString of the corresponding enum names (see the sketch after this list).
- Datatype.CHAR: All Attributes of type CHAR are assumed to be Strings. All Variables of type CHAR return data as ArrayUByte. Call data.makeStringsFromBytes() to turn this into Strings with the array rank reduced by one.
- Netcdf-3 does not have STRING or UBYTE types. In practice, CHAR is used for either.
- Netcdf-4/HDF5 library encodes CHAR values as HDF5 string type with elemSize = 1, so we use that convention to detect legacy CHAR variables in HDF5 files. NC_CHAR should not be used in Netcdf-4, use NC_UBYTE or NC_STRING.
- HDF4 does not have a STRING type, but does have signed and unsigned CHAR, and signed and unsigned BYTE. We map both signed and unsigned to Datatype.CHAR and handle it as above (Attributes are Strings, Variables are UBytes).
- Datatype.STRING is variable length, whether the file storage is variable or fixed length.
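A sketch of the ENUM and CHAR conversions described above (the function and variable names are illustrative; convertEnums and makeStringsFromBytes are the calls named in this list):

```kotlin
// Sketch only: apply the ENUM and CHAR conversions.
fun <T, U> convertExamples(ncfile: Netchdf, enumVar: Variable<T>, charVar: Variable<U>) {
    val enumNames = ncfile.readArrayData(enumVar).convertEnums()       // ArrayString of enum names
    val strings = ncfile.readArrayData(charVar).makeStringsFromBytes() // Strings, rank reduced by one
    println("$enumNames $strings")
}
```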
Unlike Netcdf-Java, we follow Netcdf-4 "user defined types" and add typedefs for Compound, Enum, Opaque, and Vlen (a dispatch sketch follows this list):
- Datatype.ENUM typedef has a map from integer to name (same as Netcdf-Java)
- Datatype.COMPOUND typedef contains a description of the members of the Compound (aka Structure).
- Datatype.OPAQUE typedef may contain the byte length of OPAQUE data.
- Datatype.VLEN typedef has the base type. An array of VLEN may have different lengths for each object.
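A hedged dispatch sketch over those typedefs (the typedef class and accessor names here are assumptions for illustration, not the library's confirmed API):

```kotlin
// Sketch only: inspect a variable's user-defined type.
fun describeTypedef(v: Variable<*>) {
    when (val typedef = v.datatype.typedef) {   // typedef accessor is assumed
        is EnumTypedef     -> println("enum map: ${typedef.valueMap}")     // integer -> name
        is CompoundTypedef -> println("members: ${typedef.members}")
        is OpaqueTypedef   -> println("byte length: ${typedef.elemSize}")
        is VlenTypedef     -> println("base type: ${typedef.baseType}")
        else               -> println("not a user-defined type")
    }
}
```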
- Unlike Netcdf-3 and Netcdf-4, dimensions may be "anonymous", in which case they have a length but not a name, and are local to the variable they are referenced by.
- There are no UNLIMITED dimensions. These are unneeded since we do not support writing.
- Creation order is ignored.
- Symbolic links in a group are not included, since these point to an existing dataset (variable).
- Opaque: hdf5 makes arrays of Opaque all the same size, which gives up some of its usefulness. If there's a need, we will allow Opaque(*), indicating that the sizes can vary.
- Attributes can be of type REFERENCE, with value the full path name of the referenced dataset.
- All data access is unified under the netchdf API
- The StructMetadata ODL is gathered and applied to the file header metadata as well as possible. Contact us with example files if you see something we are missing.
An independent implementation of HDF4/HDF5/HDF-EOS in Kotlin.
I am working on an independent library implementation of HDF4/HDF5/HDF-EOS in Kotlin here. This will be complementary to the important work of maintaining the primary HDF libraries. The goal is to give read access to all the content in NetCDF, HDF5, HDF4 and HDF-EOS files.
The core library is pure Kotlin. Kotlin currently runs on JVMs as far back as Java 8. However, I am targeting the latest LTS (long-term support) Java version, and will not explicitly support older versions.
A separate library tests the core against the C libraries. The key to making this work reliably is for members of the HDF community to contribute test files, so we can make sure the libraries agree. I have a large cache of test files from my work on netcdf-java, but these are mostly 10-20 years old.
Currently the code is in alpha, and you must build it yourself with gradle. When it hits beta, I will start releasing compiled versions to Maven Central.
I welcome any feedback, questions and concerns. Thanks!