Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add known forth for ATLAS #1282

Merged
merged 14 commits into from
Nov 8, 2024

Conversation

nikoladze
Copy link
Contributor

@nikoladze nikoladze commented Aug 30, 2024

Since edge cases still keep popping up for types in PHYSLITE that actually have always the same interpretation (most prominently vector<vector<ElementLink<...>>> we were discussing to include a mechanism to pre-define known forth code for special cases.

This PR attempts to add such a mechanism and define a special treatment for vector<vector<ElementLink<...>>> in ATLAS PHYSLITE files. One needs to define 2 things:

  • A string containing the full forth code used to deserialize the binary data
  • An awkward form that references the correct array names used in the forth code via form_key

These 2 properties are picked up in the constructor of the generic AsObjects interpretation, ensuring that no forth code needs to be generated (and therefore also no python loop through any of the basket data) when the array is read with library="ak" (default).

In the current state it seems to be working, but at the moment breaks a lot of tests, mainly because the hardcoded form now does not match anymore excactly what it was before. Looking into how to resolve this.

tagging @matthewfeickert, @alexander-held

@nikoladze nikoladze marked this pull request as draft August 30, 2024 09:57
@nikoladze nikoladze changed the title Add known forth for ATLAS feat: Add known forth for ATLAS Sep 2, 2024
@matthewfeickert
Copy link
Member

@nikoladze is this ready for review? Or were there additional things that you needed to add to it?

@nikoladze
Copy link
Contributor Author

@nikoladze is this ready for review? Or were there additional things that you needed to add to it?

I think i'm getting there. Before merging i also want to write a bit more info into the docstrings, e.g. how to add further stuff in the future.

Current status:

  • i now implemented this based on a lookup in the AsObjects constructor as discussed with @jpivarski. I think this was necessary (as opposed to putting it into interpretation_of) since other places in the code need the Model of the interpretation. So now only the forth code is special cased, not the whole interpretation
  • In addition to that, also the awkward form needs to be provided since this has to reference the names of the buffers that come out of running the forth code. I implemented this now as a member function of a class that knows about the typename since the awkward forms coming out of uproot also carry the typename that corresponds to the RecordArray (other tests would fail if the form is different because of this)
  • For start i implemented this now for some of the vector<vector<ElementLink<DataVector<T>>>> types which cause most of the problems.

Some problems that this PR should fix (for vector<vector<ElementLink...):

  • In some cases all vectors of a basket are empty, causing the discovery mechanism to loop (in python) through the whole data of the basket which may be slow (performance issue). If the type for these is now part of the special cases then the predefined forth code will be used instead
  • In some cases there will be a CannotBeAwkward exception, e.g. in the test file some streamer info for std::vector<std::vector<ElementLink<DataVector<xAOD::NeutralParticle_v1>>>> is missing. Now also here the special cased forth code (and awkward form) will be picked up.

@@ -122,6 +130,13 @@ def awkward_form(
tobject_header=False,
breadcrumbs=(),
):
awkward = uproot.extras.awkward()
if self._form is not None: # TODO: is this really fine?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpivarski i'm a bit uncomfortable with the piece of code i added here. The _form attribute already existed before, but was never directly returned here. The tests seem to pass, but do you know if there are potential problems that may be caused by directly returning _form if it is not None?

One thing that i had to do (and also don't quite feel comfortable with) was to convert the form from a dict representation which seems to happen sometimes (but i'm a bit unsure where and why not always).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still trying to understand your first paragraph, but about the second paragraph: there are three ways that a Form can be represented,

  • as a ak.forms.Form object, which is immutable and therefore needs to be entirely known or at least built from the leaves up to the root
  • as a Python dict (JSON represented as dicts and lists), which isn't type-safe, but it can be modified in place
  • as a JSON string, which doesn't even ensure that pairs of square brackets are closed.

Thus, they have decreasing levels of safety and we prefer the safer ones when possible. The problem is that the process of discovering the type involves starting at the root and walking down toward the leaves, filling in a ListOffsetArray's content if we see a non-empty example of it. So the ideal, ak.forms.Form, isn't possible and the Form starts its life as a Python dict. When it gets converted into an object, that's final. The Forth is fully discovered at that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i now realized that the self._form attribute will always be in the Python dict form (either from generated or known forth code). So at the place where it is returned in AsObjects.awkward_form(...) it has to be converted to the final ak.forms.Form object. In principle the form from the known forth path could have been provided directly as ak.forms.Form, but to be consistent with the one created from the ForthGenerator it is now also a dict.

@nikoladze nikoladze marked this pull request as ready for review September 17, 2024 11:58
@jpivarski
Copy link
Member

@nikoladze, can I "Update branch" to see if this passes tests?

There are some good things in this PR and I'd like to merge it. When you marked it "Ready for Review," do you mean that there are no more features planned, only getting the tests to pass?

@nikoladze
Copy link
Contributor Author

@nikoladze, can I "Update branch" to see if this passes tests?

i did that now - looks like they do.

There are some good things in this PR and I'd like to merge it. When you marked it "Ready for Review," do you mean that there are no more features planned, only getting the tests to pass?

I'm not planning to add further stuff right now. Many of the issues with awkward forth and ElementLinks have been fixed in other PRs, but this PR will bring the benefit of solving two more edge cases, the first can be seen in this example:

import uproot
import skhep_testdata
tree = uproot.open({skhep_testdata.data_path("uproot-issue-123a.root"): "CollectionTree"})
tree["PrimaryVerticesAuxDyn.neutralParticleLinks"].arrays()

Uproot can generate a model for this branch

>>> tree["PrimaryVerticesAuxDyn.neutralParticleLinks"].interpretation.model
AsVector(True, AsVector(False, Unknown_ElementLink_3c_DataVector_3c_xAOD_3a3a_NeutralParticle_5f_v1_3e3e_))

But because of the "Unknown" it could, before this PR, only be read with library="np". I believe this is because this file has streamer info for vector<vector<ElementLink<DataVector<xAOD::NeutralParticle_v1>>>> but not for ElementLink<DataVector<xAOD::NeutralParticle_v1, the content of the vector<vector<. But we know that this branch can be read in the same way as the other vector<vector<ElementLink<...>> branches, so it is added in the list and works now with with the known forth code.

Now i realize i summarized the 2 edge cases already in the comment above, so i'm just quoting myself for the second edge case that should be solved by this PR:

In some cases all vectors of a basket are empty, causing the discovery mechanism to loop (in python) through the whole data of the basket which may be slow (performance issue). If the type for these is now part of the special cases then the predefined forth code will be used instead

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is great—I will merge it and it will be included in the next Uproot patch release. If you need to add more known Forth cases in the future, it can be a new PR.

@jpivarski jpivarski merged commit 7b39448 into scikit-hep:main Nov 8, 2024
26 checks passed
@matthewfeickert
Copy link
Member

Thank you @nikoladze and @jpivarski!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants