# Tabular training

To illustrate the tabular application, we will use the example of the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/Adult) where we have to predict if a person is earning more or less than $50k per year using some general data.

``` python
from fastai.tabular.all import *
```

We can download a sample of this dataset with the usual [`untar_data`](https://docs.fast.ai/data.external.html#untar_data) command:

``` python
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
```

    (#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]

Then we can have a look at how the data is structured:

``` python
df = pd.read_csv(path/'adult.csv')
df.head()
```

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |

Some of the columns are continuous (like age), and we will treat them as float numbers that we can feed to our model directly. Others are categorical (like workclass or education), and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the [`TabularDataLoaders`](https://docs.fast.ai/tabular.data.html#tabulardataloaders) factory methods:

``` python
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
```

The last part is the list of pre-processors we apply to our data:

- [`Categorify`](https://docs.fast.ai/tabular.core.html#categorify) is going to take every categorical variable and make a map from integer to unique categories, then replace the values by the corresponding index.
- [`FillMissing`](https://docs.fast.ai/tabular.core.html#fillmissing) will fill the missing values in the continuous variables with the median of existing values (you can choose a specific value if you prefer).
- [`Normalize`](https://docs.fast.ai/data.transforms.html#normalize) will normalize the continuous variables (subtract the mean and divide by the std).

To further expose what's going on below the surface, let's rewrite this using `fastai`'s [`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas) class. We will need to make one adjustment, which is defining how we want to split our data. By default the factory method above used a random 80/20 split, so we will do the same:

``` python
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
```

``` python
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)
```

Once we build our [`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas) object, our data is completely preprocessed as seen below:

``` python
to.xs.iloc[:2]
```

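|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 15780 | 2 | 16 | 1 | 5 | 2 | 5 | 1 | 0.984037 | 2.210372 | -0.033692 |
| 17442 | 5 | 12 | 5 | 8 | 2 | 5 | 1 | -1.509555 | -0.319624 | -0.425324 |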
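
To make the three pre-processors more concrete, here is a minimal standard-library sketch of what each one conceptually does to a single column. It is not `fastai`'s implementation: the helper names (`categorify`, `fill_missing`, `normalize`) are hypothetical, and both reserving index 0 for `#na#` and using the population standard deviation are assumptions made for illustration. The sample values are taken from the `df.head()` output shown earlier.

```python
import statistics

def categorify(values):
    """Map each category to an integer index; index 0 is kept for missing values (assumption)."""
    vocab = ['#na#'] + sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(vocab)}
    return [index.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace missing entries with the median and record which entries were missing."""
    median = statistics.median(v for v in values if v is not None)
    filled = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Columns from the first five rows of the dataset (None stands for NaN).
occupation = [None, 'Exec-managerial', None, 'Prof-specialty', 'Other-service']
education_num = [12.0, 14.0, None, 15.0, None]

codes, vocab = categorify(occupation)
filled, was_missing = fill_missing(education_num)
normed = normalize(filled)

print(codes, vocab)
print(filled, was_missing)
print([round(v, 3) for v in normed])
```

The `was_missing` flags play the role of the extra `education-num_na` column that appears in the preprocessed data.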

Now we can build our [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders) again:

``` python
dls = to.dataloaders(bs=64)
```

> Later we will explore why using
> [`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas)
> to preprocess will be valuable.

The [`show_batch`](https://docs.fast.ai/data.core.html#show_batch) method works like for every other application:

``` python
dls.show_batch()
```

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | 41.000000 | 75409.001182 | 13.0 | >=50k |
| 1 | Private | Some-college | Never-married | Craft-repair | Not-in-family | White | False | 24.000000 | 38455.005013 | 10.0 | <50k |
| 2 | Private | Assoc-acdm | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 101299.003093 | 12.0 | <50k |
| 3 | Private | HS-grad | Never-married | Other-service | Other-relative | Black | False | 42.000000 | 227465.999281 | 9.0 | <50k |
| 4 | State-gov | Some-college | Never-married | Prof-specialty | Not-in-family | White | False | 20.999999 | 258489.997130 | 10.0 | <50k |
| 5 | Local-gov | 12th | Married-civ-spouse | Tech-support | Husband | White | False | 39.000000 | 207853.000067 | 8.0 | <50k |
| 6 | Private | Assoc-voc | Married-civ-spouse | Sales | Husband | White | False | 36.000000 | 238414.998930 | 11.0 | >=50k |
| 7 | Private | HS-grad | Never-married | Craft-repair | Not-in-family | White | False | 19.000000 | 445727.998937 | 9.0 | <50k |
| 8 | Local-gov | Bachelors | Married-civ-spouse | #na# | Husband | White | True | 59.000000 | 196013.000174 | 10.0 | >=50k |
| 9 | Private | HS-grad | Married-civ-spouse | Prof-specialty | Wife | Black | False | 39.000000 | 147500.000403 | 9.0 | <50k |

We can define a model using the [`tabular_learner`](https://docs.fast.ai/tabular.learner.html#tabular_learner) method. When we define our model, `fastai` will try to infer the loss function based on the `y_names` we specified earlier.

**Note**: Sometimes with tabular data, your `y`'s may be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block = CategoryBlock` in your constructor so `fastai` won't presume you are doing regression.

``` python
learn = tabular_learner(dls, metrics=accuracy)
```

And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won't be useful here since we don't have a pretrained model).

``` python
learn.fit_one_cycle(1)
```

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.369360 | 0.348096 | 0.840756 | 00:05 |

We can then have a look at some predictions:

``` python
learn.show_results()
```

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 12.0 | 3.0 | 8.0 | 1.0 | 5.0 | 1.0 | 0.324868 | -1.138177 | -0.424022 | 0.0 | 0.0 |
| 1 | 5.0 | 10.0 | 5.0 | 2.0 | 2.0 | 5.0 | 1.0 | -0.482055 | -1.351911 | 1.148438 | 0.0 | 0.0 |
| 2 | 5.0 | 12.0 | 6.0 | 12.0 | 3.0 | 5.0 | 1.0 | -0.775482 | 0.138709 | -0.424022 | 0.0 | 0.0 |
| 3 | 5.0 | 16.0 | 5.0 | 2.0 | 4.0 | 4.0 | 1.0 | -1.362335 | -0.227515 | -0.030907 | 0.0 | 0.0 |
| 4 | 5.0 | 2.0 | 5.0 | 0.0 | 4.0 | 5.0 | 1.0 | -1.509048 | -0.191191 | -1.210252 | 0.0 | 0.0 |
| 5 | 5.0 | 16.0 | 3.0 | 13.0 | 1.0 | 5.0 | 1.0 | 1.498575 | -0.051096 | -0.030907 | 1.0 | 1.0 |
| 6 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.555412 | 0.039167 | -0.424022 | 0.0 | 0.0 |
| 7 | 5.0 | 1.0 | 5.0 | 6.0 | 4.0 | 5.0 | 1.0 | -1.582405 | -1.396391 | -1.603367 | 0.0 | 0.0 |
| 8 | 5.0 | 3.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -1.362335 | 0.158354 | -0.817137 | 0.0 | 0.0 |

Or use the predict method on a row:

``` python
row, clas, probs = learn.predict(df.iloc[0])
```

``` python
row.show()
```

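|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.99788 | 12.0 | >=50k |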

``` python
clas, probs
```

    (tensor(1), tensor([0.4995, 0.5005]))

To get predictions on a new dataframe, you can use the `test_dl` method of the [`DataLoaders`](https://docs.fast.ai/data.core.html#dataloaders). That dataframe does not need to have the dependent variable among its columns.

``` python
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
```

Then [`Learner.get_preds`](https://docs.fast.ai/learner.html#learner.get_preds) will give you the predictions:

``` python
learn.get_preds(dl=dl)
```

    (tensor([[0.4995, 0.5005],
             [0.4882, 0.5118],
             [0.9824, 0.0176],
             ...,
             [0.5324, 0.4676],
             [0.7628, 0.2372],
             [0.5934, 0.4066]]),
     None)
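
The predictions come back as one row of class probabilities per input row, so turning them into labels is a matter of picking the most probable class. Below is a minimal sketch in plain Python, using the first three probability rows from the output above. The label order `['<50k', '>=50k']` is an assumption, inferred from `predict` returning class index 1 for a row labelled `>=50k`, and `to_label` is a hypothetical helper rather than part of `fastai`.

```python
# Class labels in index order (assumed, see the note above).
vocab = ['<50k', '>=50k']

# First three probability rows copied from the get_preds output.
probs = [
    [0.4995, 0.5005],
    [0.4882, 0.5118],
    [0.9824, 0.0176],
]

def to_label(row, vocab):
    """Return the most probable label and its probability for one row."""
    best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
    return vocab[best], row[best]

labels = [to_label(row, vocab) for row in probs]
for label, p in labels:
    print(f"{label} (p={p:.4f})")
```

Note how close the first row is to 50/50: the predicted label alone hides how uncertain the model is, so it can be worth keeping the probability alongside it.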

> **Note**
>
> Since machine learning models can't magically understand categories they
> were never trained on, the data should reflect this. If there are
> different missing values in your test data you should address this
> before training.
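
One way to act on this note is to compare which columns contain missing values in the training data and in the new data before preprocessing. The sketch below uses plain dictionaries of lists as a stand-in for dataframes. The data is made up for illustration, and `columns_with_missing` is a hypothetical helper. With pandas, the same check could be written with `isna().any()`.

```python
import math

def columns_with_missing(table):
    """Return the names of columns that contain at least one missing value (None or NaN)."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return {name for name, column in table.items() if any(is_missing(v) for v in column)}

# Hypothetical training and test data: 'age' is complete in training but not in test.
train = {'age': [49, 44, 38], 'education-num': [12.0, None, 15.0]}
test = {'age': [42, None, 51], 'education-num': [None, 9.0, 10.0]}

# Columns whose missing values were never seen during training.
unexpected = columns_with_missing(test) - columns_with_missing(train)
print(sorted(unexpected))
```

Any column reported here needs attention (for example, filling or dropping the affected rows), since fill values learned from the training data would not cover it.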

## `fastai` with Other Libraries

As mentioned earlier, [`TabularPandas`](https://docs.fast.ai/tabular.core.html#tabularpandas) is a powerful and easy preprocessing tool for tabular data. Integration with other tools such as Random Forests and XGBoost requires only one extra step, which the `.dataloaders` call did for us. Let's look at our `to` again. Its values are stored in a `DataFrame`-like object, from which we can extract the `cats`, `conts`, `xs` and `ys` if we want to:

``` python
to.xs[:3]
```

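|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 25387 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 0.471582 | -1.467756 | -0.030907 |
| 16872 | 1 | 16 | 5 | 1 | 4 | 5 | 1 | -1.215622 | -0.649792 | -0.030907 |
| 25852 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 1.865358 | -0.218915 | -0.030907 |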

Now that everything is encoded, you can then send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:

``` python
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
```

And now we can directly send this in!
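
As a rough illustration of that last step, here is a minimal sketch that fits a scikit-learn random forest on arrays shaped like the ones extracted above. It assumes scikit-learn is installed, which is not covered elsewhere in this tutorial. The small hard-coded lists are made-up stand-ins for `to.train.xs` and `to.valid.xs`, so the example runs on its own, and the hyperparameters are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up, already-encoded rows standing in for to.train.xs / to.train.ys.
X_train = [
    [5, 16, 0.47], [1, 16, -1.22], [5, 12, 1.87],
    [2, 10, -0.48], [5, 12, 0.32], [1, 3, -1.36],
]
y_train = [1, 0, 1, 0, 1, 0]

# Stand-ins for to.valid.xs / to.valid.ys.
X_test = [[5, 16, 1.50], [1, 10, -1.51]]
y_test = [1, 0]

# Fit the forest on the training rows; random_state makes the run repeatable.
rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)

# score() reports mean accuracy on the held-out rows.
valid_accuracy = rf.score(X_test, y_test)
print(f"validation accuracy: {valid_accuracy:.3f}")
```

With the real data, `X_train`, `y_train`, `X_test` and `y_test` from the previous cell can be passed to `fit` and `score` in the same way. XGBoost's scikit-learn style `XGBClassifier` follows the same fit-then-predict pattern.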