# Tabular training
To illustrate the tabular application, we will use the
[Adult dataset](https://archive.ics.uci.edu/ml/datasets/Adult), where the
task is to predict whether a person earns more or less than $50k per year
from some general data.
``` python
from fastai.tabular.all import *
```
We can download a sample of this dataset with the usual
[`untar_data`](https://docs.fast.ai/data.external.html#untar_data)
command:
``` python
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
```
(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]
Then we can have a look at how the data is structured:
``` python
df = pd.read_csv(path/'adult.csv')
df.head()
```
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |

Some of the columns are continuous (like age) and we will treat them as
float numbers we can feed our model directly. Others are categorical
(like workclass or education) and we will convert them to a unique index
that we will feed to embedding layers. We can specify our categorical
and continuous column names, as well as the name of the dependent
variable in
[`TabularDataLoaders`](https://docs.fast.ai/tabular.data.html#tabulardataloaders)
factory methods:
``` python
dls = TabularDataLoaders.from_csv(
    path/'adult.csv', path=path, y_names="salary",
    cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names=['age', 'fnlwgt', 'education-num'],
    procs=[Categorify, FillMissing, Normalize])
```
The last argument is the list of pre-processors we apply to our data:
- [`Categorify`](https://docs.fast.ai/tabular.core.html#categorify)
  takes every categorical variable, builds a map from integer to unique
  categories, then replaces the values with the corresponding index.
- [`FillMissing`](https://docs.fast.ai/tabular.core.html#fillmissing)
  fills the missing values in the continuous variables with the median
  of the existing values (you can choose a specific value if you prefer).
  It also adds a boolean `<name>_na` column, such as `education-num_na`,
  that flags which rows were filled.
- [`Normalize`](https://docs.fast.ai/data.transforms.html#normalize)
  normalizes the continuous variables (subtracts the mean and divides
  by the standard deviation).
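
To make the three steps concrete, here is a minimal pure-Python sketch of what each pre-processor does conceptually. It is not fastai's implementation: the toy values are invented, the use of index 0 for missing categories mirrors the `#na#` entry seen in the outputs of this tutorial, and the population standard deviation is an assumption made for the illustration.

```python
from statistics import mean, median, pstdev

# Toy columns standing in for a few Adult-style records (illustrative values).
workclass = ["Private", "State-gov", "Private", None]
education_num = [12.0, None, 14.0, 10.0]

# Categorify-like step: index 0 is reserved for missing/unknown ("#na#"),
# known categories get 1..n in sorted order.
vocab = ["#na#"] + sorted({w for w in workclass if w is not None})
cat_to_idx = {c: i for i, c in enumerate(vocab)}
workclass_idx = [cat_to_idx.get(w, 0) for w in workclass]

# FillMissing-like step: remember which values were missing,
# then fill them with the median of the existing values.
is_na = [v is None for v in education_num]
fill = median([v for v in education_num if v is not None])
filled = [fill if v is None else v for v in education_num]

# Normalize-like step: subtract the mean and divide by the standard deviation.
mu, sigma = mean(filled), pstdev(filled)
normalized = [(v - mu) / sigma for v in filled]

print(vocab)
print(workclass_idx)
print(is_na, fill)
print(normalized)
```

After these steps every column is numeric: categorical columns hold integer indices suitable for embedding layers, and continuous columns are centered around zero.
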
To further expose what's going on below the surface, let's rewrite this
utilizing `fastai`'s
[`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas)
class. We will need to make one adjustment, which is defining how we
want to split our data. By default the factory method above used a
random 80/20 split, so we will do the same:
``` python
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
```
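
For intuition, a random 80/20 split over row indices can be sketched in plain Python as below. The helper `random_split` and its `seed` parameter are hypothetical names for this illustration, not part of `fastai`; the sketch only shows the idea of shuffling the indices and holding out a fraction for validation.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle indices 0..n_items-1 and hold out a valid_pct share for validation."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * n_items)
    # First `cut` shuffled indices form the validation set, the rest are training.
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

The result is a pair of index lists (training first, validation second), which is the shape of value passed as `splits` in the next cell.
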
``` python
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names=['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)
```
Once we build our
[`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas)
object, our data is completely preprocessed as seen below:
``` python
to.xs.iloc[:2]
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 15780 | 2 | 16 | 1 | 5 | 2 | 5 | 1 | 0.984037 | 2.210372 | -0.033692 |
| 17442 | 5 | 12 | 5 | 8 | 2 | 5 | 1 | -1.509555 | -0.319624 | -0.425324 |

Now we can build our
[`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders) again:
``` python
dls = to.dataloaders(bs=64)
```
> Later we will explore why using
> [`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas)
> to preprocess will be valuable.
The [`show_batch`](https://docs.fast.ai/data.core.html#show_batch)
method works as it does for every other application:
``` python
dls.show_batch()
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | 41.000000 | 75409.001182 | 13.0 | >=50k |
| 1 | Private | Some-college | Never-married | Craft-repair | Not-in-family | White | False | 24.000000 | 38455.005013 | 10.0 | <50k |
| 2 | Private | Assoc-acdm | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 101299.003093 | 12.0 | <50k |
| 3 | Private | HS-grad | Never-married | Other-service | Other-relative | Black | False | 42.000000 | 227465.999281 | 9.0 | <50k |
| 4 | State-gov | Some-college | Never-married | Prof-specialty | Not-in-family | White | False | 20.999999 | 258489.997130 | 10.0 | <50k |
| 5 | Local-gov | 12th | Married-civ-spouse | Tech-support | Husband | White | False | 39.000000 | 207853.000067 | 8.0 | <50k |
| 6 | Private | Assoc-voc | Married-civ-spouse | Sales | Husband | White | False | 36.000000 | 238414.998930 | 11.0 | >=50k |
| 7 | Private | HS-grad | Never-married | Craft-repair | Not-in-family | White | False | 19.000000 | 445727.998937 | 9.0 | <50k |
| 8 | Local-gov | Bachelors | Married-civ-spouse | #na# | Husband | White | True | 59.000000 | 196013.000174 | 10.0 | >=50k |
| 9 | Private | HS-grad | Married-civ-spouse | Prof-specialty | Wife | Black | False | 39.000000 | 147500.000403 | 9.0 | <50k |

We can define a model using the
[`tabular_learner`](https://docs.fast.ai/tabular.learner.html#tabular_learner)
function. When we define our model, `fastai` will try to infer the loss
function from the `y_names` we specified earlier.
**Note**: Sometimes with tabular data, your `y`'s may be encoded (such
as 0 and 1). In such a case you should explicitly pass
`y_block = CategoryBlock` in your constructor so `fastai` won't presume
you are doing regression.
``` python
learn = tabular_learner(dls, metrics=accuracy)
```
And we can train that model with the `fit_one_cycle` method (the
`fine_tune` method won't be useful here since we don't have a pretrained
model).
``` python
learn.fit_one_cycle(1)
```
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.369360 | 0.348096 | 0.840756 | 00:05 |

We can then have a look at some predictions:
``` python
learn.show_results()
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 12.0 | 3.0 | 8.0 | 1.0 | 5.0 | 1.0 | 0.324868 | -1.138177 | -0.424022 | 0.0 | 0.0 |
| 1 | 5.0 | 10.0 | 5.0 | 2.0 | 2.0 | 5.0 | 1.0 | -0.482055 | -1.351911 | 1.148438 | 0.0 | 0.0 |
| 2 | 5.0 | 12.0 | 6.0 | 12.0 | 3.0 | 5.0 | 1.0 | -0.775482 | 0.138709 | -0.424022 | 0.0 | 0.0 |
| 3 | 5.0 | 16.0 | 5.0 | 2.0 | 4.0 | 4.0 | 1.0 | -1.362335 | -0.227515 | -0.030907 | 0.0 | 0.0 |
| 4 | 5.0 | 2.0 | 5.0 | 0.0 | 4.0 | 5.0 | 1.0 | -1.509048 | -0.191191 | -1.210252 | 0.0 | 0.0 |
| 5 | 5.0 | 16.0 | 3.0 | 13.0 | 1.0 | 5.0 | 1.0 | 1.498575 | -0.051096 | -0.030907 | 1.0 | 1.0 |
| 6 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.555412 | 0.039167 | -0.424022 | 0.0 | 0.0 |
| 7 | 5.0 | 1.0 | 5.0 | 6.0 | 4.0 | 5.0 | 1.0 | -1.582405 | -1.396391 | -1.603367 | 0.0 | 0.0 |
| 8 | 5.0 | 3.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -1.362335 | 0.158354 | -0.817137 | 0.0 | 0.0 |

Or use the predict method on a row:
``` python
row, clas, probs = learn.predict(df.iloc[0])
```
``` python
row.show()
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.99788 | 12.0 | >=50k |
``` python
clas, probs
```
(tensor(1), tensor([0.4995, 0.5005]))
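
The predicted class is simply the index of the largest probability. The sketch below reproduces that step with the numbers printed above; the class order in `vocab` is an assumption for this illustration (labels sorted alphabetically, so `'<50k'` comes before `'>=50k'`).

```python
# Probabilities copied from the output above.
probs = [0.4995, 0.5005]
# Assumed class order: labels sorted alphabetically.
vocab = ["<50k", ">=50k"]

# The predicted class is the position of the highest probability (argmax).
pred_idx = max(range(len(probs)), key=probs.__getitem__)
print(pred_idx, vocab[pred_idx])
```

Note how close the two probabilities are here: the model picks class 1, but with very little margin.
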
To get predictions on a new dataframe, you can use the `test_dl` method
of the [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders).
That dataframe does not need to have the dependent variable among its
columns.
``` python
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
```
Then
[`Learner.get_preds`](https://docs.fast.ai/learner.html#learner.get_preds)
will give you the predictions:
``` python
learn.get_preds(dl=dl)
```
    (tensor([[0.4995, 0.5005],
             [0.4882, 0.5118],
             [0.9824, 0.0176],
             ...,
             [0.5324, 0.4676],
             [0.7628, 0.2372],
             [0.5934, 0.4066]]), None)
> **Note**
>
> Since machine learning models can't magically understand categories they
> were never trained on, the data should reflect this. If there are
> different missing values in your test data you should address this
> before training.
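
One way to check for this ahead of time is to compare the categories in the test data against the vocabulary built from the training data. The following is a minimal sketch with made-up values; `build_vocab` and `encode` are hypothetical helpers, and mapping unknown values to index 0 is an assumption that mirrors the `#na#` entry shown earlier.

```python
def build_vocab(values):
    """Index 0 is reserved for missing/unknown; known categories follow in sorted order."""
    return ["#na#"] + sorted({v for v in values if v is not None})

def encode(values, vocab):
    """Replace each value by its index, falling back to 0 for anything unseen."""
    lookup = {c: i for i, c in enumerate(vocab)}
    return [lookup.get(v, 0) for v in values]

train_occupation = ["Sales", "Tech-support", "Sales", "Craft-repair"]
test_occupation = ["Sales", "Astronaut", None]  # "Astronaut" never appears in training

vocab = build_vocab(train_occupation)
encoded = encode(test_occupation, vocab)
unseen = sorted({v for v in test_occupation if v is not None and v not in vocab})

print(encoded)
print(unseen)
```

An unseen category and a missing value end up with the same index, so the model cannot tell them apart. Listing `unseen` before training makes it easy to decide how to handle such values.
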
## `fastai` with Other Libraries
As mentioned earlier,
[`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas)
is a powerful and easy preprocessing tool for tabular data. Integration
with models such as Random Forests and XGBoost requires only one
extra step, which the `.dataloaders` call did for us. Let's look at our
`to` again. Its values are stored in a `DataFrame`-like object, from which
we can extract the `cats`, `conts`, `xs` and `ys` if we want to:
``` python
to.xs[:3]
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 25387 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 0.471582 | -1.467756 | -0.030907 |
| 16872 | 1 | 16 | 5 | 1 | 4 | 5 | 1 | -1.215622 | -0.649792 | -0.030907 |
| 25852 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 1.865358 | -0.218915 | -0.030907 |

Now that everything is encoded, you can then send this off to XGBoost or
Random Forests by extracting the train and validation sets and their
values:
``` python
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
```
And now we can directly send this in!
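
As an illustration, here is a minimal sketch of fitting a scikit-learn random forest on arrays of this shape. To keep the example self-contained, the arrays are synthetic stand-ins generated with NumPy rather than the real Adult features, and the hyperparameters are arbitrary choices; with the actual data you would pass the `X_train`, `y_train`, `X_test` and `y_test` extracted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the arrays extracted from `to` (assumption for this sketch).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 5))
X_test = rng.normal(size=(100, 5))
# Binary labels derived from the first two features, so there is a pattern to learn.
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Fit the forest on the training split and score it on the held-out split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
valid_acc = model.score(X_test, y_test)
print(f"validation accuracy: {valid_acc:.3f}")
```

Because the categorical columns are already integer-encoded and missing values are filled, no further preprocessing is needed before calling `fit`.
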