Skip to content

updated how custom filters are implemented for the Table Report#1976

Draft
lisaleemcb wants to merge 4 commits intoskrub-data:mainfrom
lisaleemcb:issue1969
Draft

updated how custom filters are implemented for the Table Report#1976
lisaleemcb wants to merge 4 commits intoskrub-data:mainfrom
lisaleemcb:issue1969

Conversation

@lisaleemcb
Copy link
Contributor

@lisaleemcb lisaleemcb commented Mar 18, 2026

Description

Previously, the custom_filters kwarg passed to the TableReport required a nested dict such as

CUSTOM_FILTERS = {'FILTER ID': {filter1: [0,1,2]}}.

Additionally, the only custom filter type that could be passed was a list of ints, indicating the indices of the desired columns.

My proposed change is to

a) streamline the user input such that only one dictionary is required, for example,
CUSTOM_FILTERS = {filter1: selector}

and

b) changed the required input to be a skrub Selector object.

Addresses #1969

TBC

  • I have added tests that verify the bug fix
  • I have added an entry to CHANGES.rst describing the fix
  • My code follows the code style of this project
  • I have checked my code and corrected any misspellings

How Has This Been Tested?

"display_name": display_name,
"columns": filter_index,
}
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will break out of the loop before processing the other filters. also we should use sbd to access the columns and the shape. regarding the name, in scikit-learn functions that validate input and return a potentially transformed or sanitized version of it are often called check_something and skrub tends to follow that habit, too

maybe we can refactor this part slightly along the lines of

def _check_col_filter(name, cols, df):
    err_msg = (
        "Custom column filters should be either a Selector or a list of column names"
        f"\n  or a list of column indices. Got a bad filter for key {name!r}: {cols!r}"
    )
    if isinstance(cols, s.Selector):
        return cols.expand_index(df)
    if not isinstance(cols, Sequence):
        raise TypeError(err_msg)
    if all(isinstance(c, str) for c in cols):
        all_col_names = set(sbd.column_names(df))
        bad_col_names = [c for c in cols if c not in all_col_names]
        if bad_col_names:
            raise ValueError(
                "The following column names passed for "
                f"filter {name!r} are not in the dataframe: {bad_col_names}"
            )
        return s.make_selector(cols).expand_index(df)
    if all(isinstance(c, numbers.Integral) for c in cols):
        bad_idx = [c for c in cols if not 0 <= c < sbd.shape(df)[1]]
        if bad_idx:
            raise ValueError(
                "The following column indices passed for "
                f"filter {name!r} are out of range: {bad_idx}"
            )
        return list(cols)
    raise TypeError(err_msg)


def _check_column_filters(column_filters, df):
    if column_filters is None:
        return None
    if not isinstance(column_filters, Mapping):
        raise TypeError(
            "column_filters should be a dict mapping names to column lists, "
            f"got object of type: {type(column_filters)}"
        )
    return {
        str(name): {
            "display_name": str(name),
            "columns": _check_col_filter(name, cols, df),
        }
        for name, cols in column_filters.items()
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the feedback! I'll take a look at your suggestions.



def _get_column_filters(summary):
def _get_column_filters(summary, column_filters):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the column_filters parameter does not seem to be used

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants