Big Data Solution For Tourism PDF
Abstract—In this paper, the key aim is to provide a conceptual and technical solution design for implementing a supportive dashboard which integrates Big Data indicators with online transaction processing (OLTP) systems. The proposed solution focuses mainly on a case study of the travel and tourism industry in Sri Lanka. We carried out an in-depth analysis to identify the necessity of a dashboard and to examine the significant challenges that need to be considered when designing the solution. Moreover, we evaluate suitable big data technologies for implementation, and Hadoop, HBase and MapReduce have been proposed. A centralized repository for user metadata, Contextual Search, Early Warning Alerts, Index Indicators, Analytic tool and Reports, Marketing campaign optimization, Link with social media features, and Sales and marketing forecasting are the main dashboard features that have been designed based on the requirements. Our results attest to the importance of Index Indicators, one of the major functionalities built into the dashboard. In this work, we present a detailed analysis of a total efficiency index using four indexing strategies of varying complexity: Visit Index, Wealth Index, Health Index, and Lifestyle Index. We conclude by designing an open architecture that can track and leverage data on the behavior of tourists via a dashboard which considers trends, in order to make better decisions, reduce risks and drive personal tourist experiences.
Keywords— Big Data, MapReduce, Hadoop, HBase, Dashboard, Apache Sqoop, Apache Mahout
I. INTRODUCTION
Big Data has become very comprehensive and a tech buzzword these days. It is breaking down the barriers that existed with exploding amounts of data, historical data, and expensive and complicated databases. With the rapid evolution of the Information Technology (IT) industry, organizations are contemplating moving towards Big Data solutions instead of conventional Relational Database Management Systems (RDBMS) and data warehouses (DW) [1]. Despite the global financial crisis directly impacting the Sri Lankan economy, leaders in the public, commercial, and social sectors are focusing on new business opportunities within the country to boost revenue through foreign exchange [2]. There has been an increasing trend in the travel and tourism industry since the end of Sri Lanka's 30-year civil war. Sri Lanka as a nation is growing economically, with increasing GDP growth rates; it has become open to new business opportunities and constantly attracts foreign direct investment.
In a nutshell, most Big Data related research is based on two main areas: the technology of how big data is to be processed [3], and the marketing of tools by different vendors [4]. It is a significant constraint that big data concepts cannot be applied without extensive re-work by expert data scientists. This paper mainly discusses how to utilize existing big data technologies and tools in a more practical manner for the upliftment of society.
However, there is less research work on a centralized repository or personal dashboard which includes Big Data indicators to analyze trends, alerts, personal behavior and data. With the growing size of datasets, the repository is required to be scalable and highly efficient.
The research method followed a full life cycle process while providing descriptive and practical insight into the big data solution. We will discuss: the use of Hadoop and MapReduce big data distribution concepts; crawling big data from the internet and other sources and mapping it into key-value pairs; storing it in an HBase database; reducing the data set to the dashboard requirements; and the design of a dashboard with thresholds, especially with big data indicators. The dashboard is embedded with features to analyse web user clicks and categorise click data, to analyse a specific tourist, to examine specific tourist behavior to guide in various modes, and to turn customer transaction history into intelligent recommendations for logistics, selling products, and travel and tourism.
Personal indexes are very useful in many industries [5]. This paper examines the applicability, usability and design of such indexes. The dashboard screens personal data as an index, and exploration probes are categorized into four indexes: Visit Index, Wealth Index, Health Index and Lifestyle Index. Drawing on the theory behind Big Data architecture, MapReduce and the problem-solving process, these indexes are derived efficiently. The proposed Big Data solution helps to solve the real-world problem of summarizing massive quantities of data from different sources. Mainly, it enables potential value to be provided across stakeholders cost-effectively.
This paper mainly consists of the following sections. The literature review presents a summary of the research that was carried out, the findings and the decisions made. The case description and solution design sections present the analysis of the research, system requirements and methodology selection. They also outline the design phases of the solution; diagrams are presented that illustrate the design. Furthermore, this section discusses the process carried out, the problems faced and how the problems were given the best possible solution. The testing phase was carried out in order to identify weaknesses of the system and correct them. The discussion provides a critical evaluation of the research and outlines the limitations of the current solution and possible future enhancements. The conclusion presents a personal reflection on the research and the presented solution.
II. LITERATURE ANALYSIS
The literature recognizes the necessity of an integrated big data solution, while investigation shows that a dichotomy exists between big data technologies and solutions. The big data solution approach is still not prominent in the Sri Lankan travel and tourism industry as a way to generate massive revenue and deliver
Authorized licensed use limited to: Staffordshire University. Downloaded on July 31,2020 at 14:20:09 UTC from IEEE Xplore. Restrictions apply.
208 Big Data solution for Sri Lankan development
analytics for decision making and customer services. Analysis of recent studies carried out by vendors, market researchers and solution developers, together with government intervention statistics and policy, forms the focus on big data technologies, along with studies of the objectives and decisions confronted when designing a big data solution that fits the Sri Lankan travel and tourism industry. This research combines two disciplines: first, analyzing existing big data technologies and solutions used in the travel and tourism industry; secondly, how to utilize and leverage existing technologies and design big data solutions in a more practical manner that suits the travel and tourism industry in Sri Lanka.
The statistical analysis published by the Sri Lanka Tourism Development Authority (2011) shows the enlargement of trends and structural characteristics of tourist traffic, as international tourist arrivals grew to a total of 980 million in 2011. Revenue from tourism, scheduled airline operations and passenger movements, income and employment, and tourist prices indicate that the travel and tourism industry is directly involved among the core foreign exchange earners in the overall economy of Sri Lanka [6].
Various methods have been used to extract, store, process and present data. With the advancement of hardware, the price of storage and processing devices has gone down. The advancement of the web and of sophisticated devices generates the need to work with massive loads of different kinds of data. Big data can be defined as high-volume, high-velocity and high-variety data assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization [7]. The primary challenge is to build a frontier dashboard to analyze data by converting seemingly unstructured data into useful information.
Big data processing was initiated by commercial giants like Google and Yahoo. Currently, it has been taken up by open source organizations including Apache and commercial organizations like Oracle and IBM [8]. The Google/OTX study (2011) finds that most travellers gather information before booking [9] [45]. Davenport (2013) asserts that the travel industry should begin considering real big data solutions to provide better services and tailored travel experiences to its customers, as creating an integrated data source, data storage issues, and working in a hybrid technological environment become challenges to big data solutions in the travel industry. In fact, recent research reports case studies of big data adoption: the KAYAK travel meta search engine for finding the best possible flight or hotel, Amadeus IT solutions for the airlines' “look to book” ratio, Facebook’s ads, British Airways (BA) Opera Solutions, Marriott, Hipmunk and the Munich Airport solution all depend heavily on a variety of big data technologies and analytics for both internal decisions and customer services [10]. Mitra (2007) also identifies the gaps existing among travel search engines, including issues of poor content, search function and vertical search [11]. Ventana (2012) research highlights that large-scale organisations working with data ranging from one terabyte (TB) to the petabyte (PB) range are still using relational databases for their large-scale data processing [12]. As stated by recent research, there are different types of big data solutions provided by different vendors, and there are limitations in those solutions. Defining a solution which suits Sri Lankan travel and tourism while avoiding these limitations is a huge task. Apart from that, how this challenge is met is critical because organisations rely heavily on effective analysis of information to stay competitive. Therefore the given solution needs to align with stakeholder requirements, limitations and policies.
Furthermore, research shows that typically a high percentage of the technologies used to manage and analyze big data are RDBMS, flat files, data warehouse appliances and in-memory databases [12]. There is a combination of new technologies to manage and analyze big data, including Hadoop and specialized analytical technologies such as columnar databases [13]. According to Chen et al. (2012), organizations run reports and queries against historical data; however, with the growth of data it becomes impractical and hard to navigate through them to find the most useful items. Predictive analytics, planning and forecasting, and visualization techniques are processes for examining large amounts of different data. Moving forward, organisations are incorporating these techniques into big data solutions [14]. As data volumes grow, the use of Hadoop will assist predictive analytics and visualization [15]. Compared with other technologies, Ventana (2012) research asserts that 67% of organizations are using Hadoop to build their new products and services [12].
Apache promotes Sqoop to import data from numerous relational databases into HDFS, Hive or HBase [16]. We have used Apache Mahout to classify data and as a platform for our research, since the collection of data to be processed is very large and the work is based on collaborative filtering, clustering, and classification [17]. Several practical case studies imply the use of the standard Hadoop MapReduce model to investigate large data issues [18]. With the MapReduce framework in Hadoop it is easy to design efficient MapReduce algorithms. For instance, there may be a number of documents, where each document is a set of terms, and the total number of occurrences of each term across all documents needs to be computed. Using MapReduce algorithms and basic MapReduce patterns, including Counting and Summing, Collating, Filtering, Parsing and Validation, Distributed Task Execution, and Sorting, helps to design efficient solutions [19], [20].
GigaSpaces carried out a big data survey aimed at IT and business professionals. They assert that business decisions are heavily reliant on analysis of data or on handling rapidly growing data inputs. Therefore organisations consider Big Data processing mission critical, and it is essential for data to be processed in real time. Furthermore, the survey clearly indicates that organisations are seeking a combination of Big Data and cloud computing solutions to achieve maximum output [21]. Therefore, in future development, the authors need to focus on the combination of big data and cloud computing solutions as well.
III. CASE DESCRIPTION
With the end of the 30-year civil war, Sri Lanka is going through a rapid development phase, and one key goal is to transform Sri Lanka into a tourist hub [22]. Big Data technologies have the potential to accelerate a country's development in various aspects. This paper attempts to provide insights into a case study of the travel and tourism industry in Sri Lanka while designing a solution to build new conceptual breakthroughs.
The proposed dashboard can be used by organizations which already have Big Data assets and need to kick-start without any consultation fees and overheads. It is integrated with comprehensive Big Data technologies to build an
2013 International Conference on Advances in ICT for Emerging Regions (ICTer) 12th & 13th December 2013
Rinusha Irudeen #1and Sanjeeva Samaraweera*2 209
enterprise-level reporting and business insights platform. The main challenge in the Big Data industry is extracting proper value from Big Data. Nowadays organisations are investing millions in flashy dashboards and reporting tools. However, due to the lack of capabilities of these systems, they are poor at presenting useful insights, and organisations have not achieved the expected return on investment (ROI) from expensive dashboards and reports [23].
The need to understand Big Data solutions, their adoption, the demand for Big Data technologies and how they could revolutionize business in novel ways has increased, mainly due to the amount and proportion of unstructured data in the whole data volume [24].
Most solutions stick to conventional structured solutions, whereas others who are brave enough to get into unstructured data ultimately show information that is not worthwhile. Due to the popularity of the buzzword ‘Big Data’, the market is looking for proper solutions, and it is imperative that the industry comes up with usable solutions. Although it is not an easy task, the main idea of this paper is to show a practical way of using Big Data. The proposed solution includes examples for unstructured and unpredictable social network data. Organisations value their data as a corporate asset because data can be effectively transformed into valuable information that is used to make business decisions. With the growth of unstructured, unpredictable data, this initiative is really about instilling the concept of managing data and providing Big Data solutions in different aspects. In the travel and tourism industry there are many possibilities for leveraging Big Data solutions to predict tourists' desired travel destinations.
• Social networking data on purchase patterns and an idea of tourists' buying power can be used for promotions of commodities, hotels and travel agents.
• Information and feedback about visits to Sri Lanka in tourists' blogs will immensely help hotels, boutiques, airlines, airports and government institutes to improve their services.
• Government authorities can use summarized data to improve infrastructure and plan for capacity and accommodation trends.
• Stakeholders can estimate the number of tourists to cater for in the next season through predictions using Big Data analytics.
• Stakeholders can learn the current trends in tourism using social network data of similar stakeholders in tourist destinations in other countries.
• Stakeholders can plan customized tour packages, promotional discounts, end-of-tour souvenirs, etc. according to consumer needs.
This is done by designing personal dashboards and grouped total reports using Big Data analytics.
IV. CHALLENGES
As revealed in the introduction, the Sri Lankan government is looking to increase profitability and return on investment, modernize operations, improve tourist retention [25], extend product lines and reduce risk through a solution. Existing services, products and solutions are constrained by traditional data integration approaches that hinder productivity and the anticipated benefits.
There are significant challenges that need to be considered when designing the solution for the dashboard. Data storage and complexity are key factors. Therefore the dashboard should leverage data from multiple internal and external sources, including structured, semi-structured, unstructured and Big Data types such as Hadoop. A large volume of data needs to be analysed right through the solution, so performance issues should be considered. In a rapidly changing business environment, information has to be up-to-date, accurate, accessible and well governed [26]. New data sources need to be brought in rapidly, and existing sources need to be modified according to current requirements.
V. SOLUTION DESIGN
A) Requirement Gathering
Requirement gathering is extremely important to the success of the study. This solution was developed using a thorough review of the literature on stakeholder analysis [25] [28] and policy process [27], considering the requirements [22], benefits [25], obstacles and future work as well. Initially, it is essential to hold discussions with all stakeholders and survey the literature of the tourism industry. A brainstorming session was carried out on how best a manual reading of internet resources could help to uplift the industry. Furthermore, lists of reliable web resources were identified in order to gather information.
Identifying and understanding the stakeholders and their interests is important in order to provide an appropriate engagement solution. As explained above, the proposed solution is based on a case study from the travel and tourism industry to apply Big Data concepts. It is important to understand the stakeholders and their objectives in order to ensure that all aspects have been addressed. Therefore we define stakeholders based on interest, ownership, specialist knowledge, impact or influence, and contribution. In working on this paper we gathered information from all stakeholders, including businessmen working in the tourism industry, tourists, and government officers. With that information we designed a common personal dashboard customized for the tourism industry.
To get information onto the dashboard we researched the best available Big Data technologies to populate accurate, fast and important indicators for end users to make decisions [28]. Unlike an RDBMS, which stores only transactional data, or a data warehouse, which helps management to make decisions using aggregated data, the proposed personal dashboard will be very useful for an end user to get summarized information about a person from different sources such as the internet and to make decisions [29].
Here we are especially looking at the possibility of getting information on individual tourists. Going through all these resources manually and understanding the tourist's personal information is vital. The following are some basic criteria for evaluating appropriateness: wealth, health, purpose of visiting the country, and lifestyle. Regardless of the purpose of the visit, a tourist’s lifestyle is very important for a tourist hotel or any other organization which is trying to invest in a tourist [25] [30].
Even though manual research proves to be the best method, it is neither effective nor profitable to read about all tourists and keep the information beforehand, and reading about a particular tourist from all sources is not practical. Also, if the task is distributed among several persons, the criteria they use to judge how efficient a particular tourist is for investment differ vastly.
Due to these reasons it is necessary that the process should be automated and proper efficient algorithms are in place in
Figure 1: Architecture Diagram for Big Data Solution
[Figure 1 labels: phases ACQUIRE, ORGANIZE, ANALYZE AND STORE, DECIDE; unstructured, semi-structured and structured data; OLTP database; scraping; Apache Sqoop; HDFS; Hadoop MapReduce (MAP(), REDUCE()); index algorithms; output; HBase; Apache Mahout; dashboard.]
1) Crawling Data from Internet: Four main sites are considered for personal data extraction. These four sites, namely Facebook, Twitter, LinkedIn, and Google+, were analyzed for their behavior below.
The purpose of web crawling and scraping in this project is to collect information about a tourist from social websites in order to provide information about that particular person to the dashboard. This information feed will provide the dashboard with the relevant details of the person’s behavior and habits and will aid the stakeholders in the rating process.
Jsoup is the main tool that is used here for extracting specific data from HTML, and Twitter4J will be used to scrape tweets from Twitter [32] [33].
Table 2: Scraping tools
Scraping tool    Purpose
JTidy            To traverse the HTML with an XML-based tool.
Jsoup            To extract specific data from the HTML.
HtmlUnit         To unit test the HTML.
TagSoup          To parse a non-formatted HTML document.
NekoHTML         To parse an HTML document having many mistakes.
Twitter4J        To scrape tweets from Twitter.
2) Integration with OLTP Database: This is done using Apache Sqoop [34]. Tourist data from the Oracle database of the Department of Immigration and Emigration is planned to be imported into HDFS and will be used for scraping.
3) Algorithm to distribute scraped data in HDFS: This will map clients according to their country of origin [35]. This is done inside the crawling algorithm, which will check a proper country-of-origin data column from a reputed social network engine and shard data according to country of origin. This will be very important as it is easier to identify similar trends and behaviour among tourists from the same country. Natural language processing algorithms can also be enriched with inherent qualities of a certain nation [36].
4) MapReduce algorithm to get the indexes: The Total Efficiency Index Indicator for a tourist is the single indicator that is shown in this dashboard, as a single value calculated from this dashboard, which shows the possibility of doing business with a tourist.
• Greater than 75% - high possibility of doing business
• Greater than 50% - possible and need to incorporate strategies
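Steps 3 and 4 above can be illustrated with a minimal, in-memory sketch in Python. It is not the authors' implementation: the record layout, the `country` field name and the helper names (`map_record`, `shuffle`, `reduce_shard`, `band`) are assumptions made for illustration, and a real deployment would run the map and reduce steps on Hadoop rather than in a single process. Only the two bands listed above are encoded; the text shown here states no band for values of 50% or less, so the sketch does not invent one.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit a (country_of_origin, record) key-value pair."""
    # 'country' is a hypothetical field name for the country-of-origin column.
    country = (record.get("country") or "unknown").strip().upper()
    yield country, record


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    shards = defaultdict(list)
    for key, value in pairs:
        shards[key].append(value)
    return shards


def reduce_shard(country, records):
    """Reduce step: summarise one country shard (here, a tourist count)."""
    return country, len(records)


def band(index_pct):
    """Translate a Total Efficiency Index (0-100) into the bands listed above."""
    if index_pct > 75:
        return "high possibility of doing business"
    if index_pct > 50:
        return "possible, strategies needed"
    # No band is stated for 50% and below, so none is assigned.
    return "no band stated in the source"


scraped = [
    {"name": "A. Tourist", "country": "uk"},
    {"name": "B. Tourist", "country": "India"},
    {"name": "C. Tourist", "country": " UK "},
]
pairs = (pair for record in scraped for pair in map_record(record))
shards = shuffle(pairs)
counts = dict(reduce_shard(country, records) for country, records in shards.items())
```

Normalising the key before emitting it keeps spelling variants of the same country in one shard, which is what makes per-country trend comparison possible.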
correct individual. Hence the proposed solution discovers content provided via a centralized search engine and federated data sources. For instance, personal details taken from an integrated OLTP database are crawled.
Figure 2: Solution Design Process
[Figure 2 labels: data sources (unstructured data; semi-structured data; structured data from the OLTP database of the Department of Immigration and Emigration); Data Ingestion; Analyze (Hadoop, HBase); Dashboard (Index Indicators, Contextual Search, Centralized repository for user metadata, Early Warning Alerts, Analytic tool and Reports, Marketing campaign optimization, Link with social media features, Sales and marketing forecasting).]
The dashboard is enriched with early warning alerts, which are based on pre-stored social data on tourists, information given by the tourist himself and any other information communicated through regulatory institutes. In particular, end users can see alerts via the dashboard by searching for a specific tourist. For instance, based on the alerts a visa officer can decide whether a tourist can enter the country or not.
Index indicators are one of the major functionalities of the proposed dashboard. The initial information gathering process is defined based on information which is available on the web. This method allows the extended solution to gather data including the basic information, needs, desires and preferences of a tourist.
Searching for information on a particular tourist on the web is not an easy task. The selection of a unique key is one of the most critical decisions when deciding key criteria. Therefore constraints should ensure that the selected keys are unique. In particular, the issue is searching for tourist information on the web using only “Name” as the keyword, and it is a well-known fact that a similar name may refer to a different person. It is not sound to use only the name as a constraint when designing the index for the solution. Therefore the key criteria are defined based on a collection of constraints including fields like full name, age, country, address, and passport number. That may be useful in cross-checking, although it is highly unlikely that a person has published all these details on sites accurately.
The basic information shown for any person, such as age, race, gender, and country, makes up the main details taken. The published date of any information is also very important, as indexes are compared over time.
The main task in index calculation is to get information for the four indexes.
• Visit Index
• Wealth Index
• Health Index
• Lifestyle Index
The Visit Index is the probability that a tourist will visit a particular client on his tour. This is mainly derived from keywords used in the tourist’s blogs. If a tourist has indicated that he is visiting the client, the visit index becomes 100. If the person has commented that his lifestyle matches the client's business and he is visiting Sri Lanka, the visit index is 50. Likewise, sophisticated algorithms have to be designed in order to forecast the probability of a particular tourist visiting the client site.
The Wealth Index shows the tourist's wealth status. If a person has keywords saying his salary or his expenditure is in the millions, the index will be increased accordingly. Standard economic sites can also be used to find whether this person is listed as a wealthy man, in which case we can increase the index. His educational qualifications and professional data are also gathered, and if he is highly qualified and is occupied in a good profession, the index can be increased.
The Health Index is another index we calculate. A person’s health is very important in deciding whether to select that person as an investment for a particular client. For a health care provider an unhealthy customer may be a better prospect, while normal tourist hotels will look for healthy persons. It is possible to parameterize these indexes and their effect on the total efficiency index, so a health care provider can customize the calculation accordingly. Depending on the requirement, the algorithms involved for normal tourist care providers can be defined.
The Lifestyle Index is another factor that is analysed to understand the purchase decision patterns of a tourist. This is mainly the index that is customized according to the tourist care provider. An environmentally friendly hotel will look for environmentally concerned tourists, whereas an Eastern food specialist will look for tourists who like Eastern cuisine. A particular client may also add more weight to this index and influence the total efficiency index more towards lifestyle.
Considering each of these four indexes, the total efficiency indicator is calculated for each tourist. The average total efficiency indicator is also calculated by age, race, gender and country of a tourist, and this is another guideline for comparing a particular tourist’s efficiency with that of his peer groups. These four indexes, as well as the total index, are also shown over time for a tourist, and the trend as well as future predictions can be identified from this.
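The paper states that the four indexes feed a total efficiency indicator, that a client may weight one index more heavily, and that averages are kept per peer group, but it does not give a formula. The sketch below therefore assumes a weighted arithmetic mean of four 0-100 scores; the function names, the weights and the sample scores are hypothetical and serve only to make the idea concrete.

```python
from statistics import mean

INDEXES = ("visit", "wealth", "health", "lifestyle")


def total_efficiency(scores, weights=None):
    """Weighted mean of the four indexes, each assumed to lie on a 0-100 scale."""
    # Equal weights are an assumption; a client may supply its own weighting.
    weights = weights or {name: 1.0 for name in INDEXES}
    total_weight = sum(weights[name] for name in INDEXES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in INDEXES) / total_weight


def peer_average(tourists, field, weights=None):
    """Average total efficiency per peer group (e.g. per country or gender)."""
    groups = {}
    for tourist in tourists:
        score = total_efficiency(tourist["scores"], weights)
        groups.setdefault(tourist[field], []).append(score)
    return {group: mean(values) for group, values in groups.items()}


scores = {"visit": 100, "wealth": 60, "health": 40, "lifestyle": 80}
equal = total_efficiency(scores)

# A hypothetical client that cares three times as much about lifestyle.
lifestyle_heavy = total_efficiency(
    scores, {"visit": 1, "wealth": 1, "health": 1, "lifestyle": 3}
)

tourists = [
    {"country": "UK", "scores": scores},
    {"country": "UK", "scores": dict.fromkeys(INDEXES, 50)},
    {"country": "India", "scores": dict.fromkeys(INDEXES, 90)},
]
by_country = peer_average(tourists, "country")
```

Keeping the weights as a parameter mirrors the customization described for health care providers and lifestyle-focused clients: the same scores yield a different total for each client without changing how the individual indexes are computed.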
Analytic tools and reports which embedded into dashboard data. HBase provides two run modes including Standalone
capture most innovative statistics with open architecture. and Standalone HBase and distributed [40] [38]. Depends on
Because it is enrich with Index indicators and big data the requirement and with the minimum configuration has an
technologies such as Hadoop and Hbase. ability to switch between two modes. Each table is stored as a
In addition to gathering information about an individual multidimensional sparse map including rows and columns.
tourist, end user can track and leverage data on the behaviour One of the other major testing is on the performance with
of tourist based on clickstream data from the web and data on major volumes of data [41].
historical purchases. Basically proposed dashboard can utilize
as a predictive analytics models to make better decisions, E. Extract Data For Dashboard
reduce risks, to analyse trends, deliver more personal tourist End user allows making decisions more quickly, has
experiences. visibility into key metrics across visa details, personal
information, social media details and visualizing accurate
D. Analysis and Justification
A number of technologies were introduced with the evolution of Big Data. Database vendors demonstrate the advantages of their products together with particular hardware and software configurations, so it is not easy to choose an appropriate technology or database for a particular case study. To identify suitable technologies for the proposed solution design, a brief vendor-independent comparison based on productivity, performance, cost and effectiveness was carried out. The proposed solution is designed on the Hadoop Distributed File System (HDFS), MapReduce, HBase, Apache Sqoop and Apache Mahout.
Apache Hadoop is an open-source implementation [15], [16]. As a result, the proposed solution permits free redistribution and allows access to the end solution's design and implementation. One of the benchmarks of the proposed output is to design a solution within the budgetary constraints. There should be strategies for collecting massive amounts of data from multiple sources. Hadoop provides facilities to store, analyze and access massive amounts of data from a variety of sources across clusters of commodity hardware [34], [35]. In addition, Hadoop is integrated with components including the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Therefore Apache Hadoop faced both of these issues in employing it as part of our work in collaborative filtering [17].
As explained earlier, contextual search, early warning
Thus, MapReduce is a high-performance parallel data

data via dashboard. They can have a better understanding of the effects of marketing efforts. End users can rely on the proposed dashboard as a reliable source for finding information about a person and obtaining records about them. A tourist can be found by current name, maiden name, address, phone number, or email. A country or city can also be included to narrow the search results. The dashboard searches billions of records to locate an individual tourist and retrieves the data almost instantly. End users can sign in to preview the actual records of each individual.
As mentioned earlier, a similar name may refer to a different person, and nearly all email addresses are linked to more than one name. End users can customize the search criteria depending on their requirements. For instance, if the end user selects email address as the search criterion, the dashboard will show information about every person connected to that particular address. The end user therefore has the option to select the exact person and retrieve accurate information.

[Figure: dashboard prototype for Travel and Tourism. Panels: User Login (User Name, Password); Finder with Search, Background Check, Criminal Records Search, Public Records Search and Widgets tabs; Total Efficiency Index (Visit, Health, Life Style, Wealth); Alerts (Customs, Police).]
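The email-based lookup described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TouristRecord` type, the `ContactIndex` class and its in-memory map are hypothetical names introduced here, whereas in the proposed design the records would be held in HBase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical record type: the fields mirror some of the search criteria named in the text.
final class TouristRecord {
    final String name;
    final String email;
    final String country;

    TouristRecord(String name, String email, String country) {
        this.name = name;
        this.email = email;
        this.country = country;
    }
}

// Hypothetical in-memory index from an email address to every person linked to it.
final class ContactIndex {
    private final Map<String, List<TouristRecord>> byEmail = new HashMap<>();

    void add(TouristRecord record) {
        byEmail.computeIfAbsent(normalize(record.email), k -> new ArrayList<>()).add(record);
    }

    // Returns all candidates for an address; the end user then selects the exact person.
    List<TouristRecord> findByEmail(String email) {
        return new ArrayList<>(byEmail.getOrDefault(normalize(email), new ArrayList<>()));
    }

    // Narrows the candidates by country, as the text suggests for large result sets.
    List<TouristRecord> findByEmail(String email, String country) {
        List<TouristRecord> matches = new ArrayList<>();
        for (TouristRecord r : findByEmail(email)) {
            if (r.country.equalsIgnoreCase(country)) {
                matches.add(r);
            }
        }
        return matches;
    }

    // Addresses are compared case-insensitively and without surrounding whitespace.
    private static String normalize(String email) {
        return email.trim().toLowerCase(Locale.ROOT);
    }
}
```

Returning a list rather than a single record reflects the point made in the text: one address may belong to several people, so the final choice is left to the end user.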
12th & 13th December 2013 2013 International Conference on Advances in ICT for Emerging Regions (ICTer)
214 Big Data solution for Sri Lankan development
identify the accuracy of the details given by a visa holder. The dashboard can also search public records associated with that particular phone number. A Social Security number (SSN) can be used as one of the search criteria to follow an individual's accounts within the Social Security program. End users can run a comprehensive background check via the dashboard. For instance, a visa officer can check whether a person has a criminal record or has gone through bankruptcy.
In essence, the dashboard shows everything that should be known about a person before they enter the country, providing useful facts before important decisions are made. Any security concern is shown in the alert section as a security warning alert.
The Total Tourist Efficiency Index reflects the investment capability of a tourist. It makes clear how efficient a tourist is as a whole for a particular client.

Total Efficiency Index = (Visit Index + Health Index + Wealth Index + LifeStyle Index) / 4

Average Efficiency Indexes are defined for groups of tourists that share the same criterion. Criteria can be age, gender, country of residence, ethnicity, etc. This is a good indicator for comparing an individual tourist's efficiency with the average of each group the tourist belongs to.

Average Efficiency Index = (Sum of Total Efficiency Indexes) / number of tourists

Buying pattern is the buying frequency and the amount spent when a tourist purchases goods or services in a period of time

data files into HDFS. It is therefore necessary first to validate the data requirement and to compare the input data file with the source data, including social data, streaming data, and structured data from the OLTP database. It is then necessary to validate that the data files have been loaded into HDFS correctly.

2) Validating MapReduce Phase: Coding issues can occur in MapReduce jobs, so developers need to concentrate on identifying and fixing them. Business logic, data processing, the MapReduce process, aggregation, output files and file formats need to be validated during this phase.
This solution relies heavily on Java code. It is important that the code is written according to Java standards and best practices to ensure proper functioning of the application. Java unit testing of this application is out of scope.

3) Validating Analysing and Storing Phase: Once the data from the data sources has been loaded into HDFS and the MapReduce process is complete, the processed data moves to this phase. Data transformations are very complex and need more processing time. Validation must consider data integrity and data quality; hence the transformation rules and the aggregation of data need to be validated.
This solution involves inserting and selecting data from NoSQL databases such as HBase. There are still no hard and fast standards and best practices for NoSQL-based data operations. Although this is within the scope of this paper, time constraints did not allow the authors to explore this area in depth. The authors plan to explore standards and best practices for NoSQL data operations in a future research paper.
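The aggregation check in the analysing and storing phase can be illustrated with the Total Efficiency Index and Average Efficiency Index defined earlier in this section: recompute each index from its components and compare it with the stored value. The sketch below is an assumption-laden illustration rather than the authors' test harness. The class names `TouristIndexes` and `IndexValidator`, the `storedTotal` field and the comparison tolerance are hypothetical, and the paper does not state the scale on which the component indexes are measured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one tourist: four component indexes plus the
// aggregated value read back from storage (for example, an HBase row).
final class TouristIndexes {
    final double visit;
    final double health;
    final double wealth;
    final double lifeStyle;
    final double storedTotal;

    TouristIndexes(double visit, double health, double wealth,
                   double lifeStyle, double storedTotal) {
        this.visit = visit;
        this.health = health;
        this.wealth = wealth;
        this.lifeStyle = lifeStyle;
        this.storedTotal = storedTotal;
    }

    // Total Efficiency Index = (Visit + Health + Wealth + LifeStyle) / 4
    double totalEfficiency() {
        return (visit + health + wealth + lifeStyle) / 4.0;
    }
}

final class IndexValidator {
    // Assumed tolerance for floating-point comparison.
    static final double TOLERANCE = 1e-9;

    // Average Efficiency Index = sum of Total Efficiency Indexes / number of tourists
    static double averageEfficiency(List<TouristIndexes> group) {
        if (group.isEmpty()) {
            throw new IllegalArgumentException("group must contain at least one tourist");
        }
        double sum = 0.0;
        for (TouristIndexes t : group) {
            sum += t.totalEfficiency();
        }
        return sum / group.size();
    }

    // Returns the tourists whose stored aggregate differs from the recomputed one.
    static List<TouristIndexes> findMismatches(List<TouristIndexes> group) {
        List<TouristIndexes> mismatches = new ArrayList<>();
        for (TouristIndexes t : group) {
            if (Math.abs(t.totalEfficiency() - t.storedTotal) > TOLERANCE) {
                mismatches.add(t);
            }
        }
        return mismatches;
    }
}
```

A tester could call `IndexValidator.findMismatches` on a sample group drawn by one criterion (for example, country of residence) and treat any returned record as a transformation or aggregation defect to investigate.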
Rinusha Irudeen #1 and Sanjeeva Samaraweera *2 215
is very important for end user confidence. We checked whether the solution is functionally fit for use and behaves as expected, and validated the end-to-end business process, user access, and data availability, integrity and quality. This testing mainly compares the end result with the source web sites. For example, take a person whose visit index to Sri Lanka is very high. Testers have to access the web sites involved manually and see whether they reflect these visits. They also have to compare this with different users and see whether the visit index for this user is reasonable compared to web-published data and other users. In this way all the indexes have to be tested with sample users, checking whether the average indexes are realistic and whether other trends are correctly shown on the web.

VII. DISCUSSION
As discussed in the paper, this research was based mainly on end users such as tourist hotels and tourist guides. It was therefore essential to get their feedback on the solution design as well. We obtained their feedback mainly through the dashboard prototype, which was a very good tool for giving them a clear idea of our design.
Users were very enthusiastic about this kind of solution. Their main worry was whether they would have to pay for the application. They knew that we relied on data readily available on the internet, and some of them were already using web pages to get an idea of their tourists, so it was not easy to convince them to adopt a new application. However, the rich interface and the comparative indexes the authors designed made sense to them. After looking at the return on investment, most of them were happy with this venture.
One of the main valid points raised was whether we can rely on web data. As they pointed out, some of them had relied on such web information earlier and had mixed results. Some tourist sites were publishing genuine data whereas others were publishing fake data, which we also had to agree with. Their point was therefore that, rather than relying on data published by tourists themselves, we should use an independently confirmed data source, since they would also have to subscribe. The authors pointed out the possibility of integration with the Department of Immigration and Emigration, which users welcomed; they further suggested integrating with the official authorities of the tourists' countries as well. This was a good suggestion, and the authors think it is possible if they can obtain sponsorship from the government of Sri Lanka for this venture.
Big data solutions mainly go hand in hand with data warehousing solutions. Software companies market them as cheap options for data warehousing because of the open-source stack they use. However, penetrating the market with this strategy is difficult when there is an existing data warehousing solution. The client thinks he already has the given solution using the data warehousing data, and if he has already spent on software he may not be interested in a new solution. Because of these problems in penetrating the market with data warehousing solutions, the authors came up with the novel idea of integrating big data solutions with Online Transaction Processing (OLTP) solutions. Rather than marketing this as a Management Information System, the authors propose to use the solution to obtain additional data for front-office transaction systems.
One of the main attractions of this solution is that, rather than working on the historical data of the available databases, the authors go to social media to explore additional and independent data. This was the key factor in introducing big data into the tourist sector, as it impressed the tourist operators. Even though this data is known to the end users, they readily accepted the way the authors were trying to analyze it. Even though big data is a buzzword, its use is very limited, and this is a practical solution design which will attract users to the technology.
The authors' idea was to implement this solution design before the ICTer conference for the presentation. However, owing to other full-time commitments, they did not have time to show a fully executable prototype. In particular, as both authors are database specialists, it was no easy task for them to start on a Java-based project. The solution is therefore limited to the total solution design with fully tested Hadoop and HBase components.
We would have validated the data files on different data nodes in order to complete the validation process during the data crawling and scraping phase. In testing, we mainly tested a standalone node only. Different issues may occur when validating the MapReduce process run on multiple nodes.

VIII. CONCLUSIONS
In this paper we proposed a solution design for a dashboard integrated with big data technologies. Different types of big data technologies were analyzed and identified. After evaluating stakeholder requirements and the feasibility of implementation, the proposed solution was designed based on Hadoop.
Four fusion logics were applied to design the main index indicators. Index Indicators, Contextual Search, Centralized repository for user Meta data, Early Warning Alerts, Analytic tool and Reports, Marketing campaign optimization, Link with social media features, and Sales and marketing forecasting are the main functionalities that will be embedded in the proposed dashboard.
The proposed personal dashboard is enriched with customized features and enables the end user to get summarized information about a tourist from different sources. It is also very useful for making decisions in various respects. The concept has been applied to travel and tourism, one of Sri Lanka's foremost emerging industries, to show the practicality of this technology.

ACKNOWLEDGMENT
We express our gratitude to the Database Competency Excellence Group (DB-CEG) at Virtusa, and for the support provided by the Big Data Proof of Concept (POC) team, who shared their coding expertise in implementing the solution.

REFERENCES
[1] T. Lock (2012) The Register: The Big Data revolution. [Online]. Available: http://www.theregister.co.uk/2012/10/08/big_data_revolution/
[2] R. Henschke (2009) Asia Calling: Sri Lanka Calls for Tourism Development Boom. [Online]. Available: http://www.asiacalling.org/en/special-reports/after-the-war-the-hard-work-begins-in-sri-lanka/1441-sri-lanka-calls-for-tourism-development-boom
[3] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers (2011) Big data: The next frontier for innovation, competition, and productivity. California: McKinsey Global Institute. [Online]. Available: http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
[4] J. Kelly, D. Vellante, and D. Floyer (2013) Wikibon: Big Data Market Size and Vendor Revenues. [Online]. Available: http://wikibon.org/wiki/v/big_data_market_size_and_vendor_revenues
[5] M. Feuz, M. Fuller, and F. Stalder (2011) Personal Web searching in the age of semantic capitalism: Diagnosing the mechanisms of personalisation. [Online]. Available: http://firstmonday.org/article/view/3344/2766
[6] Research and International Relations Division, Sri Lanka Tourism Development Authority: Annual Statistical Report (2011). [Online]. Available: http://www.sltda.lk/sites/default/files/Annual_Statistical_Report-2011.pdf
[7] P. Russom, Big Data Analytics, TDWI Best Practices Report, Fourth Quarter, 2011.
[8] P. Zikopoulos and C. Eaton, "Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data", McGraw-Hill, 2011.
[9] Google/OTX (2011) Traveler's Road to Decision 2011: Google/IPSOS OTX MediaCT, USA. [Online]. Available: http://www.thinkwithgoogle.com/insights/emea/library/studies/travelers-road-to-decision-2011/
[10] T. H. Davenport (2013) At the Big Data Crossroads: turning towards a smarter travel experience: Amadeus IT Group. [Online]. Available: http://2013.amadeusblog.com/wp-content/uploads/2013/06/Amadeus-Big-Data-Report.pdf
[11] S. Mitra (2007) Web 3.0 and Travel Search Engines. [Online]. Available: http://www.sramanamitra.com/2007/06/01/web-30-travel-search-engines/
[12] Ventana Research (2012) The Challenge of Big Data: Benchmarking Large-Scale Data Management, California: USA. [Online]. Available: http://www.ventanaresearch.com/uploadedFiles/Content/Landing_Pages/Ventana%20Research%20Benchmark%20Research%20Big%20Data%20White%20Paper%202012.pdf
[13] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. Madden, Materialization Strategies in a Column-oriented DBMS. In: Proc. of ICDE (2007), pp. 466–475.
[14] H. Chen, R. H. L. Chiang, and V. C. Storey, Business intelligence and analytics: from big data to big impact: Business Intelligence Research, MIS Quarterly, Vol. 36, No. 4, 2012, pp. 1174-1175.
[15] T. White, Hadoop: The Definitive Guide: Storage and Analysis at internet scale, CA, USA: O'Reilly Media, 2012.
[16] K. Ting and J. J. Cecho, Apache Sqoop Cookbook: Unlocking Hadoop for your relational Databases, CA, USA: O'Reilly Media, 2013.
[17] I. Drost, Scaling Data Analysis with Apache Mahout, CA, USA: O'Reilly Media, 2011.
[18] J. Lin, Exploring Large-Data Issues in the Curriculum: A Case Study with MapReduce (Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics), Ohio: USA, Association for Computational Linguistic, 2008, pp. 54–61.
[19] R. Baraglia, G. D. F. Morales, and C. Lucchese, Document similarity self joins with MapReduce. In ICDM, 2010.
[20] Y. Kim and K. Shim, Parallel top-k similarity joins algorithms using MapReduce. In ICDE, 2012.
[21] Gigaspaces (2012) "Big Data Survey: Real-Time Stream Processing and Cloud-Based, Big Data Increasing in Today's Enterprises": USA. [Online]. Available: http://www.gigaspaces.com/sites/default/files/product/BigDataSurvey_Report.pdf
[22] Explore Sri Lanka, Laya: Comfort, Peace And Serenity (2012). [Online]. Available: http://exploresrilanka.lk/2012/12/laya-comfort-peace-and-serenity/
[23] S. Few and P. Edge (2007) Why most Dashboards Fails. [Online]. Available: http://www.perceptualedge.com/articles/misc/WhyMostDashboardsFail.pdf
[24] S. Rosenbush and M. Totty (2013) U.S. edition of The Wall Street Journal, with the headline: How Big Data Is Changing the Whole Equation for Business. [Online]. Available: http://online.wsj.com/article/SB10001424127887324178904578340071261396666.html
[25] Research and International Relations Division, Sri Lanka Tourism Development Authority: Annual Statistical Report (2011). [Online]. Available: http://www.sltda.lk/sites/default/files/Annual_Statistical_Report-2011.pdf
[26] Data Protection Acts 1988 and 2003: A Guide for Data Controllers. [Online]. Available: http://www.dataprotection.ie/documents/forms/NewAGuideForDataControllers.pdf
[27] L. Wijesiri (2012) DAILY NEWS: Developing tourism in Sri Lanka and challenges. [Online]. Available: http://www.dailynews.lk/2010/02/27/fea03.asp
[28] NewVantage Partners: Big Data Executive Survey (2013). [Online]. Available: http://newvantage.com/wp-content/uploads/2013/02/NVP-Big-Data-Survey-2013-Summary-Report.pdf
[29] R. E. Bryant, R. H. Katz, and E. D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version B, 2008, pp. 2-3.
[30] The Authority on World Travel & Tourism: Travel & Tourism Economic Impact 2012 WORLD (2012). [Online]. Available: http://www.wttc.org/site_media/uploads/downloads/world2012.pdf
[31] An Oracle White Paper in Enterprise Architecture - Information Architecture: An Architect's Guide to Big Data (2012). [Online]. Available: http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf
[32] jsoup: Java HTML Parser. [Online]. Available: http://jsoup.org/
[33] P. Houston, "Instant Jsoup How-To: Effectively extract and manipulate HTML content with the JSoup Library", Packt Publishing, Birmingham: UK, 2013.
[34] Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools (2011). [Online]. Available: http://www.hadoopworld.com/session/integrating-hadoop-with-enterprise-rdbms-using-apache-sqoop-and-other-tools/
[35] H. Liao, J. Han, and J. Fang (2010) Fifth IEEE International Conference on Networking, Architecture, and Storage: Multi-dimensional Index on Hadoop Distributed File System (2010). [Online]. Available: http://www.cs.odu.edu/~mukka/cs775s11/Presentations/papers/liao.pdf
[36] G. Satell (2013) Why Facebook's Graph Search Really Does Matter: Big Data + NLP. [Online]. Available: http://www.forbes.com/sites/gregsatell/2013/02/04/why-facebooks-graph-search-really-does-matter-big-data-nlp/
[37] The Blog of the International Computer Science Institute: Big Data or Expert Annotation - What's Best for Natural Language Processing? (2013). [Online]. Available: http://www.icsi.berkeley.edu/icsi/blog/data-versus-experts
[38] The Apache Software Foundation. Apache HBase. [Online]. Available: http://hbase.apache.org/
[39] K. Shvachko, H. Kuang, S. Radia, and R. Chansler (2010) The Hadoop Distributed File System. O'Reilly Media, Yahoo! Press.
[40] J. Dean and S. Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters. Vol 06.
[41] Avinash, Lakshman. Cassandra - A Decentralized Structured Storage system. In LADIS, 2009.
[42] M. Stonebraker, SQL databases v. NoSQL databases, Communications of the ACM, Vol. 53, No. 4, pp. 10-11, 2010.
[43] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, "Big Data For Dummies: Big Data management", NJ, USA: John Wiley & Sons, 2013.
[44] C. Lam, "Hadoop in action: Programming with Pig", Manning Publications, 2010.
[45] D. Borthakur (2008) Hadoop 1.2.1 Documentation: HDFS Architecture Guide. [Online]. Available: http://hadoop.apache.org/docs/stable/hdfs_design.html
[46] Big data and analytics in travel and transportation: Beyond the hype: Solutions that deliver big value, IBM Corporation, 2013. [Online]. Available: http://public.dhe.ibm.com/common/ssi/ecm/en/gbw03215usen/GBW03215USEN.PDF