
Memory Management Issue in class ClassificationMetric #526

Open
@trivedi-nitesh

Description

I'm using the code below to compute SPD (Statistical Parity Difference) on the adult dataset. When I call get_spd_and_accuracy in a loop, memory consumption grows with each iteration in which a ClassificationMetric is instantiated and statistical_parity_difference() is called, and the memory is not released at the end of the iteration.

from copy import deepcopy

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from aif360.datasets import AdultDataset, StandardDataset
from aif360.metrics import ClassificationMetric



def get_spd_and_accuracy(df, protected, target):
    # Prepare the data for training and testing
    train, test = train_test_split(df, test_size=0.2, shuffle=True)
    X_train = train.drop([protected, target], axis=1).values
    y_train = train[target].values
    X_test = test.drop([protected, target], axis=1).values
    y_test = test[target].values

    # Train the model and predict labels for the training and testing data
    lmod = LogisticRegression(solver='liblinear', class_weight='balanced')
    lmod.fit(X_train, y_train)

    y_train_pred = lmod.predict(X_train)
    y_test_pred = lmod.predict(X_test)

    # Prepare the data for the AIF360 metrics
    train_transf = StandardDataset(train,
                                   label_name=target,
                                   favorable_classes=[1],
                                   protected_attribute_names=[protected],
                                   privileged_classes=[[1.0]],
                                   categorical_features=[],
                                   features_to_drop=[])
    train_transf_pred = deepcopy(train_transf)
    # AIF360 stores labels as a 2-D column array, so reshape the predictions
    train_transf_pred.labels = y_train_pred.reshape(-1, 1)
    un_p = [{protected: 0.0}]
    p = [{protected: 1.0}]

    # Calculate the Statistical Parity Difference and accuracy score
    class_metrics = ClassificationMetric(train_transf, train_transf_pred,
                                         unprivileged_groups=un_p,
                                         privileged_groups=p)
    print(round(class_metrics.statistical_parity_difference(), 2))
    print(round(accuracy_score(y_test, y_test_pred), 2))


dataset = AdultDataset()
df = dataset.convert_to_dataframe()[0]
target = df.columns[-1]
protected = 'sex'

for i in range(25):
  get_spd_and_accuracy(df, protected, target) 
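For reference, the release pattern I would expect to help is dropping the only strong reference to the metric object and forcing a collection pass at the end of each iteration. The sketch below illustrates that pattern with a stand-in class (holding a large buffer and a memoized result, mimicking what I suspect ClassificationMetric retains), so it does not depend on aif360 itself:

```python
import gc
import weakref


class FakeMetric:
    """Stand-in for ClassificationMetric: holds a reference to large data
    and memoizes a computed result, mimicking the suspected retention."""

    def __init__(self, data):
        self.data = data    # reference to a large buffer
        self._cache = {}    # memoized metric values

    def statistical_parity_difference(self):
        if "spd" not in self._cache:
            self._cache["spd"] = sum(self.data) / len(self.data)
        return self._cache["spd"]


refs = []
for _ in range(5):
    metric = FakeMetric([0.0] * 100_000)
    metric.statistical_parity_difference()
    refs.append(weakref.ref(metric))
    # Explicitly drop the only strong reference, then collect:
    del metric
    gc.collect()

# With the explicit del + collect, no metric instance survives the loop:
print(all(r() is None for r in refs))  # → True
```

If this pattern does not release the memory in the real loop, that would point to a reference being held somewhere inside aif360 rather than in my code.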

Any insights or recommendations regarding memory release strategies in this context would be greatly appreciated. Below are snapshots of the increase in memory.

Before executing the code:
[Screenshot from 2024-04-19 10-59-07]

During the execution of the code:
[Screenshot from 2024-04-19 11-03-13]
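To quantify the growth more precisely than the system-monitor screenshots, the standard-library `tracemalloc` module can compare Python-level allocation snapshots taken before and after the loop. A minimal sketch of such a harness, where a list append simulates per-iteration state that is never released (in the real case, the loop body would be the `get_spd_and_accuracy` call):

```python
import tracemalloc


def leaky_step(sink):
    # Simulates per-iteration state that is never released,
    # standing in for the aif360 call under investigation.
    sink.append(bytearray(1_000_000))


tracemalloc.start()
sink = []
before = tracemalloc.take_snapshot()
for _ in range(10):
    leaky_step(sink)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Sum the per-line allocation differences between the two snapshots:
growth = sum(stat.size_diff for stat in after.compare_to(before, "lineno"))
print(f"net growth: {growth / 1e6:.1f} MB")  # ~10 MB when memory is retained
```

Running the same harness around the aif360 loop would show whether the retained memory is visible to the Python allocator at all, or held at a lower level.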
