Analysis of Filipino Family Households Income Classification and Expenditure Patterns Using Machine Learning

Dy S. Medel1,2 and Aristeo C. Salapa2
1 St. John Paul II College of Davao, Davao City, Philippines
2 College of Development Management, University of Southeastern Philippines, Davao City, Philippines
Corresponding email: [email protected]

ABSTRACT
This study employs machine learning algorithms to analyze
income and expenditure patterns in Filipino family households,
aiming to support socioeconomic development and effective
policy-making. Utilizing a comprehensive dataset from across
the Philippines, which includes demographic details, monthly
income, and expenditure data, the study applies Naïve Bayes,
IBk, and decision trees to categorize household income levels
and identify spending trends. Among the 11 algorithms tested,
Naïve Bayes proves most effective, achieving the highest
accuracy rate (70%), F-measure (0.695), and kappa statistic
(0.5579). By leveraging these machine learning techniques, the
research provides nuanced insights into the financial behaviors
of Filipino families, enhancing the precision of socioeconomic
planning and resource allocation. This analysis not only
facilitates a deeper understanding of economic stability at the
household level but also aids in tailoring governmental support
to those in need, particularly during challenging economic
times.
Keywords: machine learning, household income classification,
Filipino households, household expenditures, Philippines
INTRODUCTION
The financial health of families significantly influences
societal growth, economic stability, and the well-being of
individuals. The capacity of households to meet basic needs, afford healthcare and education, and engage effectively in economic activities is heavily contingent upon their financial status (Sri et al., 2021). To track these dynamics, the
Philippine Statistics Authority (2020) conducts the Family
Income and Expenditure Survey triennially. This survey
categorizes household income and serves as a fundamental
resource for socioeconomic planning, policy development,
resource distribution, poverty alleviation, and the monitoring of
governmental progress (Zoleta, 2023).
Understanding the trends in household income and
expenditure is essential for effective financial planning and
policy making. However, the complexity and volume of the data
involved pose significant challenges to traditional statistical
methods (Fan, Han & Liu, 2014). In response, recent
advancements have seen machine learning algorithms emerge
as powerful tools capable of extracting meaningful insights
from large datasets, thus enhancing decision-making processes
(Adadi, 2021; Elshawi et al., 2018; Sarker, 2021).
Despite the critical nature of this data, relatively few
researchers have delved into using data mining techniques to
analyze the financial behaviors of Filipino households. Doce et
al. (2020) utilized the 2015 Family Income and Expenditure
Survey to identify spending patterns and income classifications,
finding that the majority of households fall into the lower-
income brackets. This underscores the persistent economic
challenges within the population. Further applying machine
learning techniques, Sri et al. (2021) developed a model to
predict household income based on expenditure and other
relevant data, utilizing decision tree and RandomForest
algorithms. Their findings highlighted the RandomForest
model's superior predictive accuracy of 74.35%, compared to
the decision tree's 48%. Such models are instrumental in
offering a granular view of household financial dynamics.
The necessity of such analytical tools became
particularly apparent during the recent pandemic, which starkly
highlighted socio-economic disparities. The Philippine
government allocated approximately 200 billion pesos for a
social amelioration program to aid 18 million low-income
households during extended lockdowns. This situation sparked
widespread debate on social media about which groups should
receive government support, emphasizing the need for precise
income classification to better target social welfare programs
and interventions. Given these circumstances, this study aims to
harness machine learning algorithms to refine the classification
of household income and expenditure patterns, providing
policymakers with enhanced tools for identifying vulnerable
populations and improving resource allocation. By improving
the accuracy of these classifications, the study seeks to
contribute to more targeted and effective socio-economic
interventions.
METHOD
Dataset. Secondary data from Kaggle were used for this study. The dataset is composed of 40,000 instances of surveyed households covering 17 regions nationwide. As shown in Table 1, the family income and expenditure data comprise 61 attributes: one class attribute and 60 explanatory attributes. Following common practice in machine learning, 20,095 instances (70%) were used for the training set and 8,669 (30%) for the test set.
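For illustration, the 70/30 partition could be reproduced with Weka's Java API roughly as follows; the file name, class attribute position, and random seed are assumptions, since the paper does not report them.

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTestSplit {
    public static void main(String[] args) throws Exception {
        // Load the prepared dataset (file name is hypothetical).
        Instances data = DataSource.read("family_income_expenditure.arff");
        data.setClassIndex(2); // assumes A3 (Income Class) is the class attribute (zero-based index)

        // Shuffle with a fixed seed (assumed) so the split is reproducible.
        data.randomize(new Random(1));

        // 70% training / 30% test split, as described in the study.
        int trainSize = (int) Math.round(data.numInstances() * 0.70);
        int testSize = data.numInstances() - trainSize;
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, testSize);

        System.out.println("Training instances: " + train.numInstances());
        System.out.println("Test instances: " + test.numInstances());
    }
}
```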
Data Preparation. A visual examination of the dataset revealed that no attributes contained missing values. The data were, however, filtered by the number of family members in the household: only households with five or fewer members were retained. Consequently, the original 40,000 instances were reduced to 28,765 after this filtering. Additionally, the annual family income in the data was converted to a monthly amount. These restrictions were made in accordance with the income classification matrix used in this study.
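A minimal sketch of this preparation step is given below, assuming the household-size and income columns are numeric and using hypothetical attribute positions; the study itself carried out part of this work in spreadsheet software.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrepareHouseholds {
    // Hypothetical zero-based attribute positions, for illustration only.
    static final int INCOME_IDX = 1;   // A2: Total Household Income (annual)
    static final int MEMBERS_IDX = 34; // A35: Total Number of Family Members

    public static Instances prepare(Instances raw) {
        Instances data = new Instances(raw); // work on a copy

        // Keep only households with five or fewer members
        // (delete from the end so indices stay valid while removing).
        for (int i = data.numInstances() - 1; i >= 0; i--) {
            if (data.instance(i).value(MEMBERS_IDX) > 5) {
                data.delete(i);
            }
        }

        // Convert annual family income to a monthly amount.
        for (int i = 0; i < data.numInstances(); i++) {
            double annual = data.instance(i).value(INCOME_IDX);
            data.instance(i).setValue(INCOME_IDX, annual / 12.0);
        }
        return data;
    }

    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("family_income_expenditure.csv"); // hypothetical file
        System.out.println("Filtered instances: " + prepare(raw).numInstances());
    }
}
```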
Table 1. List of attributes of the family income and expenditure dataset

Attribute | Type | Description | Transformation Required
A1 | Nominal | Region | No
A2 | Numeric | Total Household Income | Yes
A3 | Nominal | Income Class | Yes
A4 | Numeric | Total Food Expenditure | Yes
A5 | Nominal | Main Source of Income | No
A6 | Nominal | Agricultural Household Indicator | No
A7 | Numeric | Bread and Cereals Expenditure | Yes
A8 | Numeric | Total Rice Expenditure | Yes
A9 | Numeric | Meat Expenditure | Yes
A10 | Numeric | Total Fish and Marine Products Expenditure | Yes
A11 | Numeric | Fruit Expenditure | Yes
A12 | Numeric | Vegetables Expenditure | Yes
A13 | Numeric | Restaurant and Hotels Expenditure | Yes
A14 | Numeric | Alcoholic Beverages Expenditure | Yes
A15 | Numeric | Tobacco Expenditure | Yes
A16 | Numeric | Clothing, Footwear, and Other Wear Expenditure | Yes
A17 | Numeric | Housing and Water Expenditure | Yes
A18 | Numeric | Imputed House Rental Value | Yes
A19 | Numeric | Medical Expenditure | Yes
A20 | Numeric | Transportation Expenditure | Yes
A21 | Numeric | Communication Expenditure | Yes
A22 | Numeric | Education Expenditure | Yes
A23 | Numeric | Miscellaneous Goods and Services Expenditure | Yes
A24 | Numeric | Special Occasions Expenditure | Yes
A25 | Numeric | Crop Farming and Gardening Expenses | Yes
A26 | Numeric | Total Income from Entrepreneurial Activities | Yes
A27 | Nominal | Household Head Sex | Yes
A28 | Numeric | Household Head Age | Yes
A29 | Nominal | Household Head Marital Status | No
A30 | Nominal | Household Head Highest Grade Completed | No
A31 | Nominal | Household Head Job or Business Indicator | No
A32 | Nominal | Household Head Occupation | No
A33 | Nominal | Household Head Class of Worker | No
A34 | Nominal | Type of Household | No
A35 | Nominal | Total Number of Family Members | No
A36 | Numeric | Members with Age Less Than 5 Years Old | Yes
A37 | Numeric | Members with Age 5-17 Years Old | Yes
A38 | Numeric | Total Number of Family Members Employed | Yes
A39 | Nominal | Type of Building/House | No
A40 | Nominal | Type of Roof | No
A41 | Nominal | Type of Walls | No
A42 | Numeric | House Floor Area | Yes
A43 | Numeric | House Age | Yes
A44 | Numeric | Number of Bedrooms | Yes
A45 | Nominal | Tenure Status | No
A46 | Nominal | Toilet Facilities | No
A47 | Nominal | Electricity Ownership | No
A48 | Nominal | Main Source of Water Supply | No
A49 | Numeric | Number of Television | Yes
A50 | Numeric | Number of CD/VCD/DVD | Yes
A51 | Numeric | Number of Component/Stereo Set | Yes
A52 | Numeric | Number of Refrigerator/Freezer | Yes
A53 | Numeric | Number of Washing Machine | Yes
A54 | Numeric | Number of Air Conditioner | Yes
A55 | Numeric | Number of Jeep or Van | Yes
A56 | Numeric | Number of Landline/Wireless Telephones | Yes
A57 | Numeric | Number of Cellular Phone | Yes
A58 | Numeric | Number of Personal Computer | Yes
A59 | Numeric | Number of Stove with Oven/Gas Range | Yes
A60 | Numeric | Number of Motorized Banca | Yes
A61 | Numeric | Number of Motorcycle/Tricycle | Yes
The numeric attributes were converted to nominal attributes using the NumericToNominal filter under Weka's unsupervised attribute filters. In the attributeIndices parameter, all the numeric attributes (A2, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, A17, A18, A19, A20, A21, A22, A23, A24, A25, A26, A28, A36, A37, A38, A42, A43, A44, A49, A50, A51, A52, A53, A54, A55, A56, A57, A58, A59, A60, and A61) were selected before the conversion. The change in color from black and white to red, blue, and green indicated that the numeric attributes had been converted to nominal attributes.
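The same conversion can be scripted with Weka's Java API instead of the Explorer interface; the sketch below applies the NumericToNominal filter to the attribute indices listed above, expressed as ranges (Weka uses 1-based indices in this setting), with a hypothetical input file.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class DiscretizeAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("family_income_expenditure.arff"); // hypothetical file

        NumericToNominal toNominal = new NumericToNominal();
        // 1-based attribute indices covering the attributes listed above.
        toNominal.setAttributeIndices("2,4-26,28,36-38,42-44,49-61");
        toNominal.setInputFormat(data);

        Instances nominalData = Filter.useFilter(data, toNominal);
        System.out.println(nominalData.attribute(1)); // A2 should now be nominal
    }
}
```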
Meanwhile, the classification of income classes of Filipino households was adopted from Albert et al. (2018), which outlines the indicative range of monthly family incomes for a family of five based on 2017 prices. Table 2 presents the complete information.
Table 2. Income clusters of Filipino households as of 2017

Income Cluster | Indicative Range (monthly)
Poor | Less than PHP 9,520
Low-income class (but not poor) | Between PHP 9,520 and PHP 19,040
Lower middle-income class | Between PHP 19,040 and PHP 38,080
Middle middle-income class | Between PHP 38,080 and PHP 66,640
Upper middle-income class | Between PHP 66,640 and PHP 114,240
Upper-income class (but not rich) | Between PHP 114,240 and PHP 190,400
Rich | At least PHP 190,400
Microsoft Excel was used to convert the monthly income into the seven categories of the income cluster. The file was then saved in comma-separated values (.csv) format and loaded into Weka for data analysis.
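The study performed this mapping in Excel; an equivalent rule, shown here only as an illustrative Java method, assigns each monthly income to one of the seven clusters in Table 2.

```java
public class IncomeCluster {
    /** Maps a monthly family income (PHP, 2017 prices) to its income cluster per Table 2. */
    public static String classify(double monthlyIncome) {
        // Boundary values are assigned to the higher cluster here; Table 2 leaves the convention open.
        if (monthlyIncome < 9_520) return "Poor";
        if (monthlyIncome < 19_040) return "Low-income class (but not poor)";
        if (monthlyIncome < 38_080) return "Lower middle-income class";
        if (monthlyIncome < 66_640) return "Middle middle-income class";
        if (monthlyIncome < 114_240) return "Upper middle-income class";
        if (monthlyIncome < 190_400) return "Upper-income class (but not rich)";
        return "Rich";
    }

    public static void main(String[] args) {
        System.out.println(classify(25_000)); // Lower middle-income class
    }
}
```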
Data Classification and Cross-Validation. Three widely used classifiers were applied to the dataset: Naïve Bayes, IBk (k-NN), and the decision tree (J48). Naïve Bayes selects the class that maximizes the likelihood of the observed feature values given the training data. For the IBk (k-NN) classifier, runs with k = 3, 5, 7, 9, 11, and 13 were performed, while confidence factors of 0.25, 0.5, 0.75, and 0.99 were selected for the decision tree (J48), wherein lower confidence factors incur more pruning and hence smaller trees, while higher confidence factors result in larger trees (Alejandrino, Bolacoy & Murcia, 2023).
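A hedged sketch of how these classifier variants could be built and evaluated with Weka's Java API is given below; the input file, class attribute position, 10-fold setting, and random seed are assumptions, since the paper reports only the number of runs.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunClassifiers {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // hypothetical training file
        train.setClassIndex(2); // assumes A3 (Income Class) as the class attribute

        // Naive Bayes with default settings.
        evaluate(new NaiveBayes(), "NaiveBayes", train);

        // IBk (k-NN) with the k values used in the study.
        for (int k : new int[] {3, 5, 7, 9, 11, 13}) {
            IBk knn = new IBk();
            knn.setKNN(k);
            evaluate(knn, "IBk k=" + k, train);
        }

        // J48 with the confidence factors used in the study.
        for (float cf : new float[] {0.25f, 0.5f, 0.75f, 0.99f}) {
            J48 tree = new J48();
            tree.setConfidenceFactor(cf);
            evaluate(tree, "J48 cf=" + cf, train);
        }
    }

    static void evaluate(Classifier cls, String label, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cls, data, 10, new Random(1)); // 10 folds assumed
        System.out.printf("%-12s accuracy=%.2f%% F=%.3f kappa=%.4f%n",
                label, eval.pctCorrect(), eval.weightedFMeasure(), eval.kappa());
    }
}
```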
RESULTS AND DISCUSSION
Eleven cross-validation runs were carried out using the three classifiers: one for Naïve Bayes, six for IBk (k-NN), and four for the decision tree (J48). Table 3 shows the percentage of correctly classified instances for the 11 runs. Among the classifiers, Naïve Bayes had the highest classification accuracy (70%), followed by the IBk variants, with k-NN 11 reaching 62.67%, while the decision tree (J48) with a confidence factor of 0.5 ranked third.
Table 3. Classification accuracy of the classifiers and their variants on the training dataset

Classifier | Variant | Correctly Classified Instances
Naïve Bayes | - | 14,067 (70.00%)
IBk (k-NN) | 3 | 12,243 (60.93%)
IBk (k-NN) | 5 | 12,409 (61.75%)
IBk (k-NN) | 7 | 12,525 (62.33%)
IBk (k-NN) | 9 | 12,559 (62.50%)
IBk (k-NN) | 11 | 12,594 (62.67%)
Decision Tree (J48) | 0.25 | 10,395 (51.73%)
Decision Tree (J48) | 0.5 | 10,494 (52.22%)
Decision Tree (J48) | 0.75 | 9,199 (45.78%)
Decision Tree (J48) | 0.99 | 9,199 (45.78%)
To evaluate the performance of the classifiers, Figures 1 to 3 display the confusion matrices for the three primary models. The Naïve Bayes classifier showed substantial accuracy in identifying the various income classes, with the majority of misclassifications occurring between adjacent classes. For instance, class 'a' (low income but not poor) saw 5,795 correct classifications, but also notable misclassifications into higher income brackets such as 'b' and 'c'. Similarly, class 'b' (poor) had 10,397 correct identifications, with some misclassifications into class 'a'.

Figure 1. Confusion matrix for Naïve Bayes used in the study
In Figure 2, the IBk (k-NN 11) algorithm displayed a pattern where class 'a' had 5,033 correctly classified instances but suffered from extensive misclassifications across other classes, highlighting the challenge of distinguishing closely related income groups with this method. Notably, class 'b' had over 10,535 correct classifications, but also a significant number of instances incorrectly identified as class 'a'.

Figure 2. Confusion matrix for IBk (k-NN 11) used in the study
Lastly, as seen in Figure 3, the decision tree (J48) with a 0.5 confidence factor showed varied performance across classes. For example, it performed well in classifying class 'b', with 11,286 correct classifications, but struggled with class 'd', where no instances were correctly identified. This variability underscores the challenges of using decision trees for nuanced classifications in complex datasets such as household income.
Figure 3. Confusion matrix for the decision tree (J48) used in the study
The average F-measure, the number of correctly classified instances, and the kappa statistic were considered to determine the effectiveness of the cross-validation on the training set. Among the 11 classifier runs, Naïve Bayes had the greatest number of correctly classified instances as well as the highest F-measure and kappa statistic (Table 4).
Table 4. Classification performance of the utilized classifiers and their variants on the training dataset

Classifier | Variant | F-measure | Kappa
Naïve Bayes | - | 0.695 | 0.5579
IBk (k-NN) | 3 | 0.594 | 0.4087
IBk (k-NN) | 5 | 0.325 | 0.4195
IBk (k-NN) | 7 | 0.326 | 0.4277
IBk (k-NN) | 9 | 0.312 | 0.4296
IBk (k-NN) | 11 | 0.312 | 0.4315
Decision Tree (J48) | 0.25 | 0.184 | 0.2634
Decision Tree (J48) | 0.5 | 0.192 | 0.2703
Decision Tree (J48) | 0.75 | 0.13 | 0.0718
Decision Tree (J48) | 0.99 | 0.13 | 0.0718

Finally, Naïve Bayes, having the best goodness of fit, was applied to the remaining 8,670 test instances. The classifier performed well in classifying instances belonging to the 'poor' class, with 2,373 true positives out of 2,384 instances and only 11 false negatives. The 'lower middle' class had a relatively high false negative rate, with 455 instances misclassified as 'low income but not poor' and 7 instances misclassified as 'upper middle'. The 'low income but not poor' class had a significant number of misclassifications, with 1,797 instances falsely classified as 'lower middle' and 767 instances classified as 'poor'. The 'middle' class had a considerable number of misclassifications, particularly being confused with 'poor', with 299 instances falsely classified as 'poor'. The 'upper middle' and 'upper income but not rich' classes had relatively few instances and varied misclassifications. The 'rich' class was not predicted correctly in any instance, with all instances misclassified as 'upper middle' or 'middle'.
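These test-set figures come from the confusion matrix that Weka prints after evaluating the trained model on the held-out instances; a minimal sketch of that final step, with hypothetical file names and class index, is shown below.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TestSetEvaluation {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // hypothetical files
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(2); // assumes A3 (Income Class) as the class attribute
        test.setClassIndex(2);

        // Train Naive Bayes, the best-performing classifier in the study.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        // Evaluate on the held-out test instances and print the confusion matrix.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString("Confusion matrix on the test set"));
    }
}
```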
CONCLUSION AND RECOMMENDATIONS
Conclusions
The study applied machine learning to identify the best classifier for the given family income and expenditure data. Results revealed that Naïve Bayes had the best accuracy among the 11 classifier runs, with the highest number of correctly classified instances, the highest F-measure, and the highest kappa statistic. It is recommended that the model be applied to the entire dataset to validate its accuracy.
Recommendations
To enhance the robustness and accuracy of predictive
models, it is advisable to expand the dataset by including a
broader range of socioeconomic variables such as employment
status, economic conditions, and educational attainment. This
expansion would provide a more comprehensive view of the
factors influencing income and expenditure patterns.
Additionally, there is a need to optimize existing machine
learning algorithms, particularly by fine-tuning the parameters
of models like Naïve Bayes and k-NN. Experimenting with more
complex machine learning approaches, such as ensemble
methods or deep learning, could yield further improvements in
prediction accuracy.
It is also recommended that policymakers integrate
machine learning insights into their decision-making processes.
By identifying households that fall into vulnerable economic
categories, policies can be better targeted to those most in
need. Developing systems capable of processing data in real-
time can offer ongoing insights into household income
dynamics, allowing for timely adjustments to socio-economic
policies.
Furthermore, enhancing public awareness and
education about the use of data analytics and machine learning
in public policy is crucial. This might include programs aimed at
improving digital literacy to help the general public understand
the benefits of these technologies. Finally, fostering cross-
sector collaboration among government, academia, and the
private sector can leverage shared expertise and innovation,
promoting the effective use of machine learning in public
planning and policy.
REFERENCES
Adadi, A. (2021). A survey on data-efficient algorithms in big
data era. Journal of Big Data, 8(1), 24.
Albert, J. R. G., Santos, A. G. F., & Vizmanos, J. F. V. (2018). Profile
and determinants of the middle-income class in the
Philippines (No. 2018-20). PIDS Discussion Paper Series.
Alejandrino, J. C., Bolacoy Jr, J., & Murcia, J. V. B. (2023).
Supervised and unsupervised data mining approaches in
loan default prediction. International Journal of Electrical
& Computer Engineering (2088-8708), 13(2).
Doce, L. J. P., Flores, R. T. G., Pitogo, V. A., & Caguiat, M. R. R.
(2020). Analysis of income classification and expenditure
patterns among Filipino households using data mining
applications. In ICCE 2020-28th Int. Conf. Comput. Educ.
Proc (Vol. 2, pp. 322-331).
Elshawi, R., Sakr, S., Talia, D., & Trunfio, P. (2018). Big data
systems meet machine learning challenges: towards big
data science as a service. Big Data Research, 14, 1-11.
Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis.
National Science Review, 1(2), 293-314.
Philippine Statistics Authority. (2020). 2018 Family Income and
Expenditure Survey.
Sarker, I. H. (2021). Machine learning: Algorithms, real-world
applications and research directions. SN Computer
Science, 2(3), 160.
Sri, Y. B., Sravani, Y., Surendra, Y. B., Rishitha, S., & Sobhana, M.
(2021). Family expenditure and income analysis using
machine learning algorithms. In 2021 Second
International Conference on Smart Technologies in
Computing, Electrical and Electronics (ICSTCEE).
https://doi.org/10.1109/icstcee54422.2021.9708583
Zoleta, V. (2023, April 12). Understanding Social Classes in the Philippines: Where Do You Belong? Moneymax. https://www.moneymax.ph/personal-finance/articles/social-class-philippines