Feature Engineering - Add with conditional

First published on March 16, 2022

Last updated on April 19, 2022

 

8 minute read

Felicia Kuan

TLDR

It seems absurd to add more columns to an already large dataset, right? 👀 See how procedurally adding data derived from existing columns helps the model gain further insight when predicting.

Glossary

  • What is it?

  • The significance of boolean values in your dataset

  • Ways to implement in code! 👩‍💻

  • Magical no-code solution ✨🔮

What is it?

Adding a new column with a conditional is a feature engineering operation that takes existing data, possibly several columns of varying ranges, and performs a conditional comparison to condense those columns into a single value (often boolean) for each row.

The significance of boolean values in your dataset

To understand the importance of generalizing some of your features/columns, consider what boolean values represent: true or false values that clearly indicate to the model what the positive and negative outcomes are in this prediction.

A column of boolean values can improve model accuracy because it summarizes a range of values into a true or false metric that simply indicates whether certain conditions are fulfilled. It’s like a test study guide: a summary/indicator that is helpful in predicting the likelihood of an outcome (like passing or not passing a test).

Confused lady math meme

Some examples

If the definition was a little abstract, here are some examples! The following are problems that are simplified by adding new columns using a conditional separator: 

  • Classify a range and give it value -> Assign “Pass/No Pass” to a numerical grade. 

  • Simplify column data -> Reduce the complexity of store records by using a boolean value to indicate whether a line of seasonal clothing sales was a net gain or loss.
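To make the second bullet concrete, here is a minimal sketch of how several columns can collapse into one boolean. The store records, the column names ("line", "revenue", "cost", "net_gain"), and the figures are all made up for illustration.

```python
import pandas as pd

# Hypothetical store records; names and numbers are assumptions for this sketch.
sales = pd.DataFrame({
    "line": ["winter coats", "swimwear", "rain boots"],
    "revenue": [12000, 4500, 3000],
    "cost": [9000, 6000, 3000],
})

# One boolean per row: did this clothing line bring in more than it cost?
sales["net_gain"] = sales["revenue"] > sales["cost"]

print(sales)
```

Two numeric columns become a single true/false signal per row, which is the same idea as turning a grade into “Pass/No Pass.”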

When ranking student satisfaction with their colleges, academic performance is only one aspect of the student experience (alongside extracurriculars, leadership, and other life responsibilities). Thus, we don’t need to be so specific with typical 0-100 grades, as the wider range of values in a column can generate more noise. By adding a “Pass/No pass” feature, we can improve the accuracy of the model, because a numerical range of grades is much harder to generalize (and predict) than “Pass/No pass.”

UCLA student life (Source: Daily Bruin)

Ways to implement in code! 👩‍💻

Using the grades example discussed earlier, we can simplify a numerical grade like “95%” or “72%” into “pass” or “no pass” values. We have a dataset of grades (or “marks”) that we’ll go through to determine whether each student passed the exam.

import pandas as pd
df = pd.read_csv('Marks10.csv')
df

From scratch

In regular Python, we’re simply interested in the Exampoints column, which holds the grades. Thus, we’d simply save that column as a list of fifty exam scores named “grades.”

# Convert the column to a plain Python list of scores
grades = df['Exampoints'].tolist()
print(grades)

Next, we’d loop through all the scores and use a conditional comparison to check whether each grade is passing (which is anything at 60% or higher).

# included full list for readability
grades = [31, 45, 23, 69, 78, 45, 23, 89, 100, 97,
          56, 11, 9, 55, 43, 44, 45, 46, 47, 48,
          49, 23, 24, 25, 26, 69, 70, 71, 72, 73,
          74, 75, 76, 34, 35, 36, 37, 38, 39, 40,
          41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
pass_no_pass = []

# label each score: 60 or higher is a pass
for grade in grades:
    if grade >= 60:
        pass_no_pass.append('Pass')
    else:
        pass_no_pass.append('No pass')

Our list of “pass or no pass” values would then contain all of the records that we need to make a new column of data! 

print(pass_no_pass)

Now that you know the logic behind performing a conditional operation on an existing column to determine values in a new one, next we’ll show you how you’d add the additional column to the dataset for a side-by-side comparison of grades with pass/no pass:

df['pass_no_pass'] = pass_no_pass
df
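If you don’t have Marks10.csv on hand, the same steps can be tried end to end with a small stand-in. This is only a sketch: it builds the DataFrame from the first ten scores in the list above rather than reading the real file, and the final count is an extra sanity check that isn’t part of the original walkthrough.

```python
import pandas as pd

# First ten scores from the list above, standing in for Marks10.csv.
df = pd.DataFrame({"Exampoints": [31, 45, 23, 69, 78, 45, 23, 89, 100, 97]})

# Same conditional as before: 60 or higher is a pass.
pass_no_pass = []
for grade in df["Exampoints"]:
    pass_no_pass.append("Pass" if grade >= 60 else "No pass")

df["pass_no_pass"] = pass_no_pass

# Quick sanity check: how many rows landed in each bucket?
counts = df["pass_no_pass"].value_counts()
print(df)
print(counts)
```

The new list must have exactly one entry per row, otherwise Pandas will refuse the column assignment.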

“Best practice”: list comprehension

This method is a clean way for a Python developer to work through a column of data, and it is typically faster than iterating over a DataFrame row by row in Pandas. It’s a no-nonsense approach that extracts the column we’re analyzing as a list and only looks at that. In a way, it’s the method that’s most similar to our first “From scratch” example.

First, we write a function that takes in the exam score as an input and determines whether the grade is passing or not:

import pandas as pd
df = pd.read_csv('Marks10.csv')

def passed(grade):
    # A score of 60 or higher counts as passing
    if grade >= 60:
        return True
    else:
        return False

Then, while iterating through just the values in the “Exampoints” column, the list comprehension passes each value into the function we created earlier. We take the returned values and save them in our new column, “Passed.”

pass_no_pass = [passed(score) for score in df['Exampoints']]
df['Passed'] = pass_no_pass

df
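Curious about the speed difference on your own machine? Here is a rough sketch using timeit. Two assumptions to flag: the scores are randomly generated rather than read from Marks10.csv, and DataFrame.iterrows() is picked here as the row-by-row baseline. Timings vary by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic scores standing in for Marks10.csv (an assumption for this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Exampoints": rng.integers(0, 101, size=2_000)})


def passed(grade):
    return grade >= 60


def with_comprehension():
    # Only the one column we care about is iterated.
    return [passed(score) for score in df["Exampoints"]]


def with_iterrows():
    # Row-by-row baseline: every row is built as a Series first.
    return [passed(row["Exampoints"]) for _, row in df.iterrows()]


# Both approaches should agree before we compare their speed.
assert with_comprehension() == with_iterrows()

print("comprehension:", timeit.timeit(with_comprehension, number=3))
print("iterrows:     ", timeit.timeit(with_iterrows, number=3))
```

Both functions produce the same list of booleans; only the way they walk through the data differs.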

Although this method might be fast, we still need to write a helper function to calculate the boolean values. In the next section, we will show you how to use a built-in NumPy function to do the same thing!

Using numpy.where()

Now that you know the logic behind performing a conditional operation, let’s see how to do this using a Pandas DataFrame and the NumPy function np.where() to add a “Passed” column with boolean values!

We use booleans here instead of the “Pass/No pass” strings from the first example because they tell the model clearly that a “true” value means the outcome named by the column, like “Passed,” did occur.

The function np.where() takes three parameters:

  1. The comparison, which in our case is whether the student scored 60 or higher.

  2. The value to store when the comparison is true; if the student passed, we save this row’s value as “True.”

  3. The value to store otherwise; here, “False.”

import pandas as pd
import numpy as np

df = pd.read_csv('Marks10.csv')

# True where the score is at least 60, False otherwise
df['Passed'] = np.where(df['Exampoints'] >= 60, True, False)

df
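The second and third arguments of np.where() don’t have to be booleans. The sketch below, which assumes a handful of made-up scores instead of Marks10.csv and a passing mark of 60 or higher, produces string labels with np.where() and then shows that a plain comparison already yields a boolean column.

```python
import numpy as np
import pandas as pd

# A few made-up scores standing in for Marks10.csv.
df = pd.DataFrame({"Exampoints": [31, 60, 78, 59, 100]})

# The second and third arguments can be any values, not only booleans.
df["pass_no_pass"] = np.where(df["Exampoints"] >= 60, "Pass", "No pass")

# For a plain True/False column, the comparison on its own already produces one.
df["Passed"] = df["Exampoints"] >= 60

print(df)
```

In practice, np.where() is most useful when the two outcomes are something other than True and False, such as labels or numbers.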

With this true or false column, our model now knows what “good” grades look like! However, all of the outcomes in “Passed” are useful for helping the model predict student performance, including the scores of students who didn’t pass!

What’s fascinating about a data-driven analysis is that we can learn something valuable about student experience, performance, etc. from everyone, not just the high-achieving students. A model can only be accurate and generalizable when it has a high volume of diverse data points.

Therefore, even though a D or below may not satisfy you or your parents, know that to our model, your worth is immeasurable 😊. 

Source: Bugcat Capoo

Okay technically we can measure how influential a row of data is to training results thanks to ML, but let’s not be so cerebral all the time 😂

Magical no-code solution ✨🔮

Speaking of not thinking so hard: sometimes, when experimenting with data, we’d rather face data-related challenges than coding ones. Mage can alleviate the coding-related burdens for you!

To add a new column with a conditional operation, which can improve your model’s accuracy and generalizability, follow these steps:

  1. Edit data > Add column

  2. Fill in “conditional” for logic

  3. Then, fill in the logic for the columns you’re comparing (ex: “Exampoints >= 60”)

  4. Set the first outcome of the comparison evaluation as “True,” and the second as “False”

Happy experimenting! Hope Mage can give you a magical experience working with data 🌟
