Maybe add retry logic for failed remote data access #1219

jpivarski · 2024-05-14T17:26:37Z

It wouldn't work for some failures, and that's the hard part: deciding when retrying is likely to work and when it's doomed and you're just extending the time needed before getting a useful error message to the user.

Here's the original #1217 (comment), which has some notes about how it could be implemented.

If this were to be done in Uproot, we'd have to dynamically adjust the request size and resubmit when we see such an error, and have some reasonable give-up policy after a specified number of retries to keep it from spiraling out of control. And if the OSError came from a local file, rather than XRootD, it's not retryable (if an open local file suddenly can't be read, someone must have unplugged the disk or something).

Possibly the smallest sections that would need to be retried are:

in HasBranches.arrays:

uproot5/src/uproot/behaviors/TBranch.py

Lines 797 to 835 in 13087b0

    
           arrays, expression_context, branchid_interpretation = _regularize_expressions( 
        
               self, 
        
               expressions, 
        
               cut, 
        
               filter_name, 
        
               filter_typename, 
        
               filter_branch, 
        
               keys, 
        
               aliases, 
        
               language, 
        
               get_from_cache, 
        
           ) 
        
           ranges_or_baskets = [] 
        
           checked = set() 
        
           for _, context in expression_context: 
        
               for branch in context["branches"]: 
        
                   if branch.cache_key not in checked: 
        
                       checked.add(branch.cache_key) 
        
                       for ( 
        
                           basket_num, 
        
                           range_or_basket, 
        
                       ) in branch.entries_to_ranges_or_baskets(entry_start, entry_stop): 
        
                           ranges_or_baskets.append((branch, basket_num, range_or_basket)) 
        
           interp_options = {"ak_add_doc": ak_add_doc} 
        
           _ranges_or_baskets_to_arrays( 
        
               self, 
        
               ranges_or_baskets, 
        
               branchid_interpretation, 
        
               entry_start, 
        
               entry_stop, 
        
               decompression_executor, 
        
               interpretation_executor, 
        
               library, 
        
               arrays, 
        
               False, 
        
               interp_options, 
        
           )

in HasBranches.iterate:

uproot5/src/uproot/behaviors/TBranch.py

Lines 1034 to 1072 in 13087b0

    
           ranges_or_baskets = [] 
        
           checked = set() 
        
           for _, context in expression_context: 
        
               for branch in context["branches"]: 
        
                   if branch.cache_key not in checked: 
        
                       checked.add(branch.cache_key) 
        
                       for ( 
        
                           basket_num, 
        
                           range_or_basket, 
        
                       ) in branch.entries_to_ranges_or_baskets( 
        
                           sub_entry_start, sub_entry_stop 
        
                       ): 
        
                           previous_basket = previous_baskets.get( 
        
                               (branch.cache_key, basket_num) 
        
                           ) 
        
                           if previous_basket is None: 
        
                               ranges_or_baskets.append( 
        
                                   (branch, basket_num, range_or_basket) 
        
                               ) 
        
                           else: 
        
                               ranges_or_baskets.append( 
        
                                   (branch, basket_num, previous_basket) 
        
                               ) 
        
           arrays = {} 
        
           interp_options = {"ak_add_doc": ak_add_doc} 
        
           _ranges_or_baskets_to_arrays( 
        
               self, 
        
               ranges_or_baskets, 
        
               branchid_interpretation, 
        
               sub_entry_start, 
        
               sub_entry_stop, 
        
               decompression_executor, 
        
               interpretation_executor, 
        
               library, 
        
               arrays, 
        
               True, 
        
               interp_options, 
        
           )

in Branch.array:

uproot5/src/uproot/behaviors/TBranch.py

Lines 1787 to 1831 in 13087b0

    
           arrays = {} 
        
           expression_context = [] 
        
           branchid_interpretation = {} 
        
           _regularize_branchname( 
        
               self, 
        
               self.name, 
        
               self, 
        
               interpretation, 
        
               get_from_cache, 
        
               arrays, 
        
               expression_context, 
        
               branchid_interpretation, 
        
               True, 
        
               False, 
        
           ) 
        
           ranges_or_baskets = [] 
        
           checked = set() 
        
           for _, context in expression_context: 
        
               for branch in context["branches"]: 
        
                   if branch.cache_key not in checked and not isinstance( 
        
                       branchid_interpretation[branch.cache_key], 
        
                       uproot.interpretation.grouped.AsGrouped, 
        
                   ): 
        
                       checked.add(branch.cache_key) 
        
                       for ( 
        
                           basket_num, 
        
                           range_or_basket, 
        
                       ) in branch.entries_to_ranges_or_baskets(entry_start, entry_stop): 
        
                           ranges_or_baskets.append((branch, basket_num, range_or_basket)) 
        
           interp_options = {"ak_add_doc": ak_add_doc} 
        
           _ranges_or_baskets_to_arrays( 
        
               self, 
        
               ranges_or_baskets, 
        
               branchid_interpretation, 
        
               entry_start, 
        
               entry_stop, 
        
               decompression_executor, 
        
               interpretation_executor, 
        
               library, 
        
               arrays, 
        
               False, 
        
               interp_options, 
        
           )

Maybe it's possible to retry only the _ranges_or_baskets_to_arrays call, but I'm not sure if there would be a counting problem if the arrays dict and ranges_or_baskets list are not reinitialized after a partial failure. (These are notes for implementation.) If the number of items in the arrays dict is wrong, the code will hang in this loop, rather than raise an error!

There would also need to be a way to change the granularity of the XRootD request while still requesting all the data the user wants. The Source.chunks method doesn't have a way to express that, but maybe the coalesce algorithm does? Unfortunately, coalesce arguments are specified per-FSSpecSource object, but maybe each retry could be a new, more finely granular FSSpecSource? (We don't know if another thread is using the same FSSpecSource object; we can't change it in place.)

Originally posted by @jpivarski in #1217 (comment)

The text was updated successfully, but these errors were encountered:

jpivarski added the feature New feature or request label May 14, 2024

jpivarski mentioned this issue May 14, 2024

xrootd read fails for large arrays #1217

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Maybe add retry logic for failed remote data access #1219

Maybe add retry logic for failed remote data access #1219

jpivarski commented May 14, 2024 •

edited

Loading

Maybe add retry logic for failed remote data access #1219

Maybe add retry logic for failed remote data access #1219

Comments

jpivarski commented May 14, 2024 • edited Loading

jpivarski commented May 14, 2024 •

edited

Loading