Internet Census 2012 - Parth Shukla - Thesis
Internet Census 2012 - Parth Shukla - Thesis
Internet Census 2012 - Parth Shukla - Thesis
by
Parth Shukla
The School of Information Technology and
Electrical Engineering,
November 2013.
ii
Parth Shukla
41191545
PO Box 6080,
St Lucia, QLD 4068
In accordance with the requirements of the degree of Bachelor of Science (Computer Science) in the
School of Information Technology and Electrical Engineering, I present the following thesis entitled
This work was performed under the supervision of Dr. Marius Portmann. I declare that the work
submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not
been previously submitted for a degree at the University of Queensland or any other institution.
Yours sincerely,
Parth Shukla
iii
iv
To my wife
v
Acknowledgements
I would like to take this opportunity to thank a few people, without whose help and support, I
couldn’t possibly have successfully achieved the aims of this project in time.
Firstly, I’d like to thank Dr Marius Portmann for accepting to supervise me and for being a
supportive and understanding supervisor given my hectic work/study life. I thank you for reminding
me to keep my horizons wide during researching so as to not end up focusing only on one
implementation.
Also I couldn’t have done this project without the moral and organisational support of AusCERT
that I received through my wonderful manager Mike Holm. I thank you for getting me one step
closer to graduation!
I would also like to thank Dr. Xue Li for providing initial ideas for thought and research through a
brief but intensive meeting earlier in the semester. I thank you for your help in introducing me to
concepts which ultimately helped me grasp a better understanding of the “big data” field.
I am very thankful to Angus Gardner, the system administrator for AusCERT, for keeping the
primary resource of this project, the server “thumper” alive and functional during this project work. I
cannot thank you enough for coming into work on the Sunday before my project presentation in
order to revive the server which I’d foolishly managed to crash.
I am thankful to Dr. David Green from the High Performance Computing group at UQ for setting
up my access to the HPC and lending a helping hand to accelerate my learning. I greatly appreciate
the assistance you provided me via constant emails, even over weekends and during out-of-office
hours.
I am also thankful to the organising committee of the AusNOG conference for allowing me to
present on my thesis at the AusNOG 2013 conference. The conference provided me with an
invaluable opportunity to gain feedback and perspective on my research from the Australian
network operators’ viewpoint. I am deeply thankful to the committee for this unique opportunity.
I am thankful to Jason Scott from archive.org who worked closely with me in order to get a copy
of the recompressed gzip files onto archive.org for easy public access. I thank you for quickly setting
up access for me on archive.org servers and subsequently making the dataset publicly
downloadable.
Finally, I’m eternally thankful to my wife Trinh Nguyen for being there for me even when I
randomly yelled at the computer for mistakes I had made.
vi
Abstract
The aim of this project is to make the 9 terabyte dataset of the Internet Census 2012 easily
searchable, preferably with minimal resource requirements. This dataset contains information
gathered on the entire IPv4 space of over 3.7 billion IP addresses. It was released in a torrent as 568
GB of highly compressed ZPAQ files. This torrent is organised and stored in a manner which makes it
very difficult for any researcher to even access the contents of individual files, let alone conduct any
analysis on the entire dataset without having access to substantial computing resources. The aim of
this project, therefore, encompasses removing this initial hurdle in order to make the dataset easily
searchable.
As part of this project, the 568 GB of highly compressed ZPAQ files were successfully
recompressed into approximately 1.4 TB of gzip files that allow significantly easier access to the
data. These recompressed gzip files were released back into the public domain through archive.org.
The original aim of this project was to explore this dataset for interesting and informative
results. However, since this is the first academic paper exploring this particular dataset – and no
prior research exists on it – and furthermore since this project is time-limited to a single university
semester, it was vital to first focus on making the data easily searchable. Once the data is made
searchable, it would be considerably easier for future researchers to focus on actual analysis.
After detailed research, the aim of the project was successfully achieved. After assuming the
possibility of two different types of research interest in the data, two different solutions targeting
two different types of researchers and audiences were successfully implemented in this project.
Firstly, a flat file implementation that allows searching the gzip data directly was coded as part of
this project. This implementation is extremely easy to setup and requires minimal resources, but
comes with limitations on how the data can be searched. Interested researchers do not even need
access to the entire dataset to make use of this implementation. Using this implementation, one IP
address or a range of IP addresses can be searched in less than three minutes. This implementation
is aimed towards researchers interested in only one IP address or a small range IP addresses.
Other solutions are also feasible and are detailed in the “Research” chapter of this report. The
conclusion of this report provides suggestions on possible future directions of research as a result of
the findings of this report. Details on a collaborative project that will explore Google BigQuery as a
solution to making this dataset easily searchable can also be found in the conclusion. Google
BigQuery holds enormous promise but was not explored in this project due to budget limitations.
This project, thus, can serve as a launching pad for researchers wishing to analyse this dataset.
vii
Contents
Acknowledgements................................................................................................................................ vi
Abstract ................................................................................................................................................. vii
List of Figures .......................................................................................................................................... x
List of Tables .......................................................................................................................................... xi
Introduction ............................................................................................................................................ 1
Chapter 1 Background Information & Theory ..................................................................................... 3
1.1 ICMP Ping ................................................................................................................................ 3
1.2 Reverse DNS ............................................................................................................................ 3
1.3 Service probes ......................................................................................................................... 4
1.4 Host probes ............................................................................................................................. 5
1.5 Sync scans ............................................................................................................................... 5
1.6 TCP/IP Fingerprints ................................................................................................................. 6
1.7 IP ID Sequence ........................................................................................................................ 6
1.8 Trace route .............................................................................................................................. 7
1.9 Lists of IPs considered ‘active’ ................................................................................................ 8
1.10 Other Data............................................................................................................................... 9
Chapter 2 Reducing Problem Complexity .......................................................................................... 11
2.1 Storing the Decompressed ZPAQs ........................................................................................ 11
2.2 Using the High Performance Computing Grid ...................................................................... 13
2.3 Downloading the recompressed data ................................................................................... 14
Chapter 3 Research ............................................................................................................................ 17
3.1 Product Alternatives for Relational Search ........................................................................... 17
3.1.1 AT&T Daytona ............................................................................................................... 18
3.1.2 Splunk ............................................................................................................................ 18
3.1.3 MapReduce & Apache Hadoop ..................................................................................... 18
3.1.4 Google BigQuery ........................................................................................................... 19
3.1.5 Amazon Red Shift .......................................................................................................... 20
3.1.6 Relational Databases..................................................................................................... 20
Chapter 4 Flat file search implementation ........................................................................................ 23
4.1 Functionality and Requirements ........................................................................................... 23
4.2 Implementation Challenges .................................................................................................. 25
4.2.1 Version 1 ....................................................................................................................... 25
viii
4.2.2 Version 2 ....................................................................................................................... 25
4.2.3 Version 3 ....................................................................................................................... 26
4.2.4 Version 4 ....................................................................................................................... 27
4.2.5 Version 5 ....................................................................................................................... 27
4.2.6 Version 6 ....................................................................................................................... 28
4.2.7 Future Work .................................................................................................................. 28
4.3 Screenshots of the flat file search......................................................................................... 29
Chapter 5 Relational Search Implementation ................................................................................... 39
5.1 Significant Discovery ............................................................................................................. 40
5.2 Optimising MariaDB variables............................................................................................... 40
5.2.1 Data Directory ............................................................................................................... 41
5.2.2 InnoDB File-Per-Table Mode ......................................................................................... 41
5.2.3 The InnoDB Buffer Pool................................................................................................. 42
5.2.4 InnoDB Flush Method ................................................................................................... 42
5.2.5 InnoDB Log File Size ...................................................................................................... 43
5.2.6 InnoDB Log Buffer Size .................................................................................................. 43
5.2.7 InnoDB File Format ....................................................................................................... 44
5.2.8 MariaDB Settings File .................................................................................................... 44
5.3 Table Compression................................................................................................................ 45
5.4 Sample Table Schema ........................................................................................................... 46
5.5 Benchmarks........................................................................................................................... 48
5.5.1 INT or DATETIME to store timestamps ......................................................................... 48
5.5.2 Benchmarks Table ......................................................................................................... 49
5.6 Proof-of-Concept Insertion ................................................................................................... 50
Conclusion ............................................................................................................................................. 53
Future Work ...................................................................................................................................... 54
Google BigQuery Project ................................................................................................................... 54
Bibliography .......................................................................................................................................... 57
ix
List of Figures
Figure 1: Front page of flat file search .................................................................................................. 29
Figure 2: Error when searching private IP address ............................................................................... 29
Figure 3: Error when searching for reserved IP address ....................................................................... 30
Figure 4: Preliminary Search for a single IP address ............................................................................. 30
Figure 5: Full search for a single IP started ........................................................................................... 31
Figure 6: Search started email for a single IP ........................................................................................ 31
Figure 7: Page displayed when re-initiating search on same single IP ................................................. 31
Figure 8: Search completed email for a single IP .................................................................................. 32
Figure 9: Page displayed when conducting search over a previously searched IP address ................. 32
Figure 10: Log file for a single IP address .............................................................................................. 33
Figure 11: Example CSV file showing result of search .......................................................................... 33
Figure 12: Searching an IP range ........................................................................................................... 34
Figure 13: Preliminary search page for an IP range .............................................................................. 34
Figure 14: Full search for an IP range started ....................................................................................... 35
Figure 15: Search started email for an IP range.................................................................................... 35
Figure 16: Page displayed when re-initiating a search on the same IP range ...................................... 35
Figure 17: Search completed email for an IP range .............................................................................. 36
Figure 18: Page displayed when conducting search over a previously searched IP range ................... 36
Figure 19: Conducting a search using command line over an IP range ................................................ 37
Figure 20: Search started email for an IP range for a search started from command line .................. 37
Figure 21: Search completed email for an IP range for a search started from command line ............. 38
Figure 22: CSV files and log file storage locations ................................................................................ 38
x
List of Tables
Table 1: ICMP ping sample data.............................................................................................................. 3
Table 2: Reverse DNS sample data ......................................................................................................... 4
Table 3: Service probes sample data ...................................................................................................... 5
Table 4: Host probes sample data .......................................................................................................... 5
Table 5: Sync scans sample data ............................................................................................................. 6
Table 6: IP ID sequence sample data ...................................................................................................... 7
Table 7: Trace route sample data ........................................................................................................... 7
Table 8: MariaDB - different Integer types ........................................................................................... 47
Table 9: Table schema of 'search_rdns' ................................................................................................ 48
Table 10: Result of Benchmarks on MariaDB ....................................................................................... 49
Table 11: Table schema for 'search_icmp_ping' ................................................................................... 50
xi
xii
Introduction
It has no doubt been the dream of many to be able to scan and map the whole of the Internet in
order to perform different types of statistical analysis, including accurately analysing the current
technologies in use on the internet. Since IPv6 is only an emerging replacement of IPv4 and very few
devices, if any, run exclusively on IPv6, the complete IPv4 scan performed by the Internet Census
2012 can be considered to represent a historic first compilation of data on every single device
connected to the Internet [3].
The “Internet Census 2012” was a detailed scan of each and every one of the allocated IPv4
addresses on the Internet by an anonymous researcher that resulted in the publication of 9
terabytes of text data [1]. The data was split up and compressed using ZPAQ, and published as a 568
GB torrent [2]. This project focuses on creating a way to easily search and analyse any and all given
sections of the data gathered by the Internet Census 2012.
Such a scan has not been possible until now purely due to logistical reasons. The Internet
Assigned Numbers Authority has allocated 3,706,650,624 IPv4 addresses for use on the Internet [4].
Assuming a scan rate of 1 second per IP address in order to perform a comprehensive scan, it would
take over 117 years to complete a full scan with the use of one computer. However, with 117
devices setup to simultaneously scan different sections of the allocated IPv4 range, it would still take
1 year to complete the scan. Device failures may even further increase the scan time. Even with
1000 devices being used for simultaneous scanning, it would still take over 42 days to complete the
scan. Given the volatility of the Internet, the data gathered at the beginning of the scan would be
considered out-dated by the time the scan is completed.
Therefore, it is obvious that for a comprehensive scan of the allocated IPv4 addresses, a botnet
of considerable size would be required to be able to complete the scan within an acceptable enough
timeframe so as to avoid the data becoming out-dated. This requirement of a large botnet has been
the reason such a comprehensive scan has never been performed in the past. The anonymous
researcher overcame this difficulty by creating the Carna Botnet, which illegally compromised over
1.2 million devices [1]. The researcher claimed to have used 420,000 of these devices to complete
the whole scan in less than 24 hours [1]. Between March 2012 and December 2012, the anonymous
researcher used the Carna Botnet to perform the scan multiple times [1]. Gathering the data
multiple times might seem like a redundant and time-consuming task; however, it is important to
note that due to the volatility of the Internet and the tendency of devices to come online at different
times of the day around the world, re-scanning the same IP space repeatedly over different time
periods is necessary to be able to gather accurate data on the IPv4 space [1].
After aggregating the data and performing some initial analyses, all the gathered data along with
a detailed research paper was released to the public in March 2013 by the anonymous
researcher[1][3]. Please refer to the research paper for details on the methods used for gathering
the data with the use of the Carna Botnet.
Given that the sheer size of the raw data is 9 terabytes, it is very likely that researchers who wish
to analyse the data would not have access to the physical infrastructure required to store the data,
1
INTRODUCTION
let alone process it. This project provides two solutions that allow this data to be analysed with
minimal infrastructure requirements.
The primary resource for this project was a server named “thumper”, which was set up by
AusCERT and assembled for use specifically for this project. The following were the specifications of
thumper:
All project-related work, including testing, implementation and coding was conducted on this
server. The 568 GB torrent was downloaded at the author’s home and then transferred to the
AusCERT server on a portable hard drive. The download process took approximately 2 weeks and
was completed prior to the commencement of this project.
Access to the High Performance Computing Grid at the University of Queensland was also
obtained temporarily in order to decompress the 568 GB of ZPAQ files. More details on this are
located in the “Reducing Problem Complexity” chapter.
2
Chapter 1 Background Information & Theory
This section will present the information on the different types of data the torrent contains and how
the Internet Census 2012 dataset is stored and organised within the 568 GB torrent [2].
The download contains 8 different types of data. Six out of the 8 data types in the torrent are
organised by IPv4 class A’s. For example, for these data types, data for the whole of 1.0.0.0/8 or
1.0.0.0/255.0.0.0 range would be stored in one file called “1.zpaq”. Similarly, the next Class A
(2.0.0.0/8) would be stored in “2.zpaq” and so forth. Therefore, in order to find information on
130.102.40.14, the 130.zpaq file would be required to be unpacked for each of these data types. A
total of 221 class A ranges have been allocated by the Internet Assigned Numbers Authority [4]. As a
result, each of these six data types contains 221 ZPAQ files, representing each of the class A ranges.
For each of the other two data types, information for all IP addresses probed is stored in a single
file. Details on all of the eight data types from the scan are described below. Please note that all
timestamps in all data types are stored in Unix epoch time [24].
ICMP pings are stored in its own folder in the torrent. The folder contains one ZPAQ file per class
A as described above. This results in a total of 221 ZPAQ files in the ICMP ping folder.
Below is a sample of the ICMP ping data as shown on the download page of the Internet Census
2012 [14].
IP Timestamp Result
108.0.11.254 1355672700 alive, 24205
108.0.11.255 1335190500 Unreachable
108.0.18.118 1345646700 Icmp Type: 11,ICMP Time Exceeded, from 59.185.211.246
108.0.18.182 1335248100 Icmp Error: 13,ICMP Unreachable (Communication
Administratively Prohibited), from 10.23.41.2
3
1. BACKGROUND INFORMATION & THEORY
The IP address, timestamp and result of the lookup of that IP address are recorded in this data
type. Running these lookups multiple times resulted in 10.5 billion records being stored. The records
are compressed to 16.3 GB of ZPAQ files, which would decompress to 366 GB.
Reverse DNS records are also stored in their own folder. Same as ICMP Ping, the data is stored in
the folder by being split into the /8 IPv4 ranges. This results in a total of 221 ZPAQ files in the reverse
DNS folder.
Below is a sample of the reverse DNS data as shown on the download page of the Internet
Census 2012 [14].
IP Timestamp Result
108.0.140.255 1336954500 (3)
108.0.141.0 1336886100 (2)
108.0.141.1 1336914900 L100.LSANCA-VFTTP-165.verizon-gni.net
108.0.141.2 1336752900 pool-108-0-141-2.lsanca.fios.verizon.net
A total of 742 TCP and UDP ports were scanned. For every service probe, queries such as a get
request or an SSL session request were sent to a given port through a given protocol (TCP or UDP).
Some port and protocol combinations were probed multiple times, each time with a different query
[1]. As a result, a total of 1,101 different probes made up of different combinations of port, protocol
and query were sent to each allocated IP address [2]. The results of all of these are recorded in this
data type along with the IP addresses and a timestamp.
The records are stored in 374.5 GB of compressed ZPAQ files, which would decompress to 5.5
TB. Out of more than 4000 billion service probes that were sent in the duration of the whole scan,
only 180 billion were recorded and published since the rest represented timed out probes [1]. 5
billion of these records were queried by probing the top 30 ports in mid-December 2012 to obtain
up-to-date information before release [1].
The service probes data is organised slightly differently than others. The folder for the service
probes data in the torrent contains 1101 tar balls. Each tar file represents a port/protocol/query
combination used for probing. Each tar ball contains 221 ZPAQ files inside them, one ZPAQ file per
class A, which is the same as for ICMP ping and reveres DNS as described above. Since each tar ball
contained 221 ZPAQ files in them, the total number of ZPAQ files in this folder comes to 1101 x 221 =
243,321.
Below is a sample of the service probes data from a tar ball as shown on the download page of
the Internet Census 2012 [14]. The codes in the “State” column correspond to the following:
1=Open; 2=Open/Reset; 3=Open/Timeout; 4=Closed/Reset; and 5=Timeout.
4
1. BACKGROUND INFORMATION & THEORY
The state of the given IP (up or down) and the reason for that state is recorded in this data type
along with the IP address probed and a timestamp indicating the time of the host probe. Host
probes resulted in 19.5 billion records being stored. These records are stored in 27.9 GB of ZPAQ
files, which would decompress to 771 GB.
Host probes are stored in their own folder in the same manner as ICMP ping and reverse DNS.
Each class A range was stored in its own ZPAQ file, resulting in the host probes folder containing 221
ZPAQ files.
Below is a sample of the host probes data as shown on the download page of the Internet
Census 2012 [14].
This data type recorded the state of the port(s) (open/closed/filtered etc), the reason for the
state (e.g. unreachable), whether TCP or UDP was used and a list of ports that were scanned along
with the IP address and the timestamp of the scan. Sync scan resulted in 2.8 billion records being
stored. These records are stored in 65.5 GB of ZPAQ files, which would decompress to 435 GB.
5
1. BACKGROUND INFORMATION & THEORY
Same as host probes, ICMP pings and reverse DNS, sync scans are stored in their own folder with
one ZPAQ file per class A range. This results in 221 ZPAQ files in the sync scans folder.
Below is a sample of the sync scans data as shown on the download page of the Internet Census
2012 [14].
The generated fingerprint along with the IP address and the timestamp are stored in this data
type. This resulted in 80 million records being stored in 1.3 GB of ZPAQ files, which would
decompress to 50 GB.
TCP/IP fingerprints are also stored in their own folder with one ZPAQ file per class A range. This
results in 221 ZPAQ files in the TCP/IP fingerprint folder.
The sample data of TCP/IP Fingerprints is too large to fit on a page for print and therefore cannot
be included here. Please refer to the Internet Census 2012 download page to view a sample [14].
1.7 IP ID Sequence
IP ID sequence test is used to determine the IP ID sequence generation algorithm used by the target
IP address. This test is performed by comparing the IP IDs of reply packets [6]. Depending on the
results of this test, the IP ID sequence generation algorithm can be deduced, which can allow for
identification of the operating system of the given IP address. Same as TCP/IP fingerprints, only
some of the IP addresses were tested for IP ID sequence due to this functionality only being available
on some of the compromised devices of the Carna botnet with the supported version of Nmap.
6
1. BACKGROUND INFORMATION & THEORY
The result of the test, along with the IP address and the timestamps are recorded in this data
type. 75 million records for the IP ID sequence are stored in a 155 MB ZPAQ files, which would
decompress to 2.7 GB.
Unlike all other data types, IP ID sequences are all stored in a single ZPAQ file which contains all
the records.
Below is a sample of the IP ID Sequence data as shown on the download page of the Internet
Census 2012 [14].
IP Timestamp Result
1.0.16.2 1346710500 All zeros
1.0.16.6 1340090100 Incremental
1.0.16.8 1340081100 Busy server or unknown class
1.0.16.8 1346670900 Incremental
The complete result of the trace routes, protocol used for the trace route (ICMP or UDP) along
with timestamps, source IP and target IP are stored in this data type. 68 million records for the trace
routes are stored in a 1.9 GB ZPAQ file, which would decompress to 18 GB.
Same as IP ID sequences, all records of trace routes are stored in a single ZPAQ file.
Below is a sample of the trace route data as shown on the download page of the Internet Census
2012 [14].
7
1. BACKGROUND INFORMATION & THEORY
9:66.198.111.97:*,190ms,170ms;10:216.6.87.9:200m
s,200ms,160ms;
11:206.82.139.110:*,180ms,210ms;12:74.40.2.173:2
00ms,190ms,190ms;
13:74.40.2.193:*,200ms,200ms;14:74.40.1.54:220ms,
230ms,220ms;
1340158 189.81.25 29.103.13 icmp 1:200.164.176.130:20ms,30ms,30ms;2:200.164.28.10
500 .249 8.84 9:30ms,30ms,30ms;
3:200.223.44.233:20ms,30ms,30ms;4::*,*,*; 5::*,*,*;
These lists were used in this project as an add-on module for the flat file search implementation
to perform a preliminary search to indicate to researchers performing search whether interesting
data on a particular IP address exists or not. More information on how these lists are used is
available in the “Flat file search implementation” chapter. The combined size of these lists is 23.8 GB.
Details on these six lists and how they were derived as detailed by the anonymous researcher is
provided below [16].
1) List of IPs with ICMP ping responses: A list of 420 Million IP addresses that responded to ICMP
Ping at least 2 times between June 2012 and October 2012.
2) List of IPs with reverse DNS response: A list of 1051 Million IP addresses that had a reverse DNS
entry between June 2012 and October 2012.
3) List of IPs with ports open: A list of 165 Million IP addresses that had one or more of the Top
150 ports open and the port returned data in response to a service probe between June 2012
and October 2012.
4) List of IPs with open reset: A list of 7.6 Million IP addresses that had one or more of the Top 150
ports open but the connection was reset without sending data. Contains only IP addresses that
did that more than 5 times between June 2012 and October 2012.
5) List of IPs with open timeout: 15 Million IP addresses that had one or more of the Top 150 Port
open but the connection timed out without sending data. Contains only IP addresses that did
that more than 5 times between June 2012 and October 2012.
6) List of IPs with closed reset: 152 Million IP addresses that had one or more of the Top 150 Ports
closed or the connection was reset by a firewall. Contains only IP addresses that did that more
than 8 times between June 2012 and October 2012.
8
1. BACKGROUND INFORMATION & THEORY
A file containing a list of date/times and the number of records for each of the above data type
collected on each of those date/times is included in the torrent. This list shows an overview of when
the Internet Census dataset was gathered. Files containing the top most reverse DNS entries for all
the domains in the data have also been complied for multiple levels of domains (top level, second
level etc). This is assumed to be compiled by analysing the reverse DNS date type of the dataset.
Global statistics for each of the service probes were also compiled to indicate what responses
were given for each of the service probes. The statistics for each port/protocol indicated the number
of devices detected running different types of services on that port/protocol. For example, Port
3389/TCP, which is the default remote desktop port for Microsoft Windows, showed 6,203,805
devices detected running “Microsoft Terminal Services” and 4038 devices running an IRC server on
that port. Many other services with a number indicating the devices running that service were also
listed. These statistics provide a way for interested researchers to find the type of services that are
run on the Internet and their prevalence in the world for any given port and protocol.
Lastly, a single file containing a complete list of all the ZPAQ files in the torrent, with the SHA1
hash sum of each of them, was also included in the torrent. The hash sums from this file can be used
to check for corruption of the ZPAQ files that may have occurred during downloading or copying.
The hash sum of this file of hash sums itself was included in the “Read Me” file of the torrent. These
files were used for data integrity checks. More information on the use of these particular files can be
found in the “Using the High Performance Computing Grid” section on page 13.
9
10
Chapter 2 Reducing Problem Complexity
This chapter tackles the task of reducing the complexity of the problem faced in this project by
overcoming the first hurdle posed by the lack of easy access to the 568 GB of ZPAQ files.
Even with the necessary storage infrastructure, it still remains a difficult task to decompress the
ZPAQ files in order to shift through the data to find information on IP addresses of interest. The
ZPAQ format is known for its extremely good compression ratio - far better than RAR, ZIP, gzip or
even 7-zip [5]. However, this level of compression comes at a heavy cost to resources (CPU, RAM
and time) and therefore it is not possible to decompress these files swiftly [5]. Quick tests on the
data revealed that depending on the options chosen at compression time, a sample ZPAQ file of 100
MB could take anywhere from 1 hour to 3 hours to decompress on a modern computer such as the
AusCERT server thumper. Assuming an average of 2 hours per 100 MB, it would take over 470 days
to decompress 568 GB of data to the full 9 terabytes.
Furthermore, the “Background Information” chapter revealed the storage structure of all the
data types within the torrent. In order to locate the information for a single IP address (e.g.
130.102.50.50), a total of 1101 tar balls and 1108 ZPAQ files (130.zpaq) from each of the 8 data
types would need to be decompressed just to extract information on a single IP address. The
“Background Information” chapter further revealed that the torrent contains a total of 244,428
ZPAQ files. Therefore, anyone interested in analysing the entire data set would need to decompress,
1101 tar balls and 244,428 ZPAQ files before any analysis could even be started.
Given the high resources required to unpack just one ZPAQ, the effort required to extract all
data related to one IP address from the torrent data would far outweigh any information gain. The
need to overcome these harsh factors is mandatory for the continuation of this project.
Therefore, the first step of the project was to decompress the data to allow easy access. Given
the highly resource intensive ZPAQ algorithm detailed earlier, it makes sense to use parallel
computing nodes to help decompress the entire data set. Otherwise, as highlighted earlier, it would
take over 470 days just to decompress the ZPAQs using a single computer core. However, given that
the uncompressed data would end up using 9 TB of data, some preliminary research was completed
to determine if converting the data into another format may help save space without a significant
computational penalty on the CPU.
Further research revealed that web programming languages such as PHP, Python and Java also
support opening a gzip file directly and having it be decompressed on-the-fly as necessary [8][9][10].
11
2. REDUCING PROBLEM COMPLEXITY
As these are web programming languages, a check on these languages was necessary since these
languages can potentially be used in the flat file search implementation. Although PHP was
ultimately chosen due to various reasons, in order to leave as many options as possible open for
other researchers of the dataset, it was important to establish that native gzip reading support
existed in all of these languages.
Before proceeding, it is vital to grasp a basic understanding of gzip. GNU zip implements and
depends on the DEFLATE algorithm [11]. The DEFLATE algorithm compresses and decompresses the
data with only a single pass [12]. This means that during compression, as the algorithm reads a file
linearly, it looks for words it has previously read (if any) [12][13]. Whenever it comes across a word
previously seen, it replaces it with a reference pointing back to the original word [12][13]. It does not
perform a second pass to improve compression [11]. Similarly, upon decompression, it simply starts
to read the file from the beginning and whenever it comes across a reference, it replaces the
reference with the word being pointed at [12][13]. This allows it to decompress with a single pass as
well. As a result of the way the algorithm is structured, decompression on-the-fly, as performed by
Linux tools, is done easily and at an extremely nominal computational cost.
However, a quick performance test was done to compare parsing an uncompressed ZPAQ file
against a gzip re-compressed version of the same file. “130.zpaq” from the host probes folder was
chosen for this test. The Linux command to output the number of rows in a given file was used for
benchmarking, along with the Linux command “time” being used to determine the time taken for
the command to execute. The original ZPAQ file was 121 MB in size. It took approximately 1 hour
and 30 minutes to decompress the ZPAQ to the raw file, which came to 3.4 GB in size. The gzip
counterpart was 370 MB in size.
real 0m19.066s
user 0m0.668s
sys 0m2.192s
root@thumper:$ time zcat 130.gz | wc -l
84524917
real 0m14.591s
user 0m14.617s
sys 0m1.672s
The command took 19 seconds to complete on the uncompressed file and 14.5 seconds on the
gzip file. The gzip file seems to achieve better performance than the raw file. This is assumed to be
due to less disk reads being required for the gzip file due to its considerably smaller file size
(approximately 6 times smaller). These tests were repeated a number of times to ensure
consistency. Similar preliminary tests on small scripts with PHP and Python were also performed
with similar performance gains with the gzip file over the raw file.
12
2. REDUCING PROBLEM COMPLEXITY
Given the above information, the decision to decompress all the ZPAQs and recompress them in
to gzip files was made as tested indicated that this not only reduced space usage by a factor of 6 but
also provided an improvement in processing speed as well.
Firstly, once the 568 GB of torrent data was copied to the HPC from the AusCERT server, a PBS
script was written to schedule a job on all available nodes that would decompress the 1,101 tar balls
from the “service probes” folder. This resulted in 243,321 ZPAQ files being exposed from within the
tar balls.
Next, in order to determine that data was not corrupted during the multiple transfers, a PBS
script was scheduled on all available nodes to create and record the SHA1 hash sums of all 244,428
ZPAQ files on the HPC. The list of SHA1 hash sums created by this script was checked against the list
of SHA1 hash sums provided by the researcher in a file in the torrent. Files with hash sum
mismatches were deleted and re-downloaded or re-untarred. The re-downloaded or re-untarred
files were hashed again and checked against the provided list again. This process was repeated until
no more corrupt ZPAQ files existed in the data stored on the HPC. Once hash sums confirmed data
integrity for all the ZPAQ files, the 1,101 tar balls that revealed most of the ZPAQ files were deleted
in order to free up space.
As the data was now ready to be recompressed, PBS scripts were scheduled to decompress each
of the 244,428 ZPAQ files and recompress them into gzip files. After the re-compression scripts were
completed, corrupt gzip files were discovered due to nodes terminating the script early or due to
other I/O related error. Therefore, a PBS script was written to check each of the newly created gzip
file for corruption using multiple detection methods. The primary method of detecting a corrupt file
was checking if it ended abruptly without an end-of-file (EOF) character.
If a gzip file was found to be corrupt, the script would then delete the gzip, decompress the
original ZPAQ and recompress it into a new gzip and ensure the new gzip file was not corrupt. Once
final checks were made to ensure no more of gzip files were corrupt, the original ZPAQ files were
deleted from the HPC to free up space. A copy was left on the AusCERT server as backup.
Finally, a PBS script was scheduled on all available nodes to generate and record the SHA1 hash
sums of all of the newly generated gzip files. These hashes can be used to perform data corruption
checks in future data transfers. Once the hashes were compiled, all of the gzip files were copied
across to the AusCERT server.
The above is a simplified description of the methodology employed in order to recompress the
ZPAQ files. Multiple script failures were faced and numerous tests were conducted before much of
the above tasks were carried out successfully. Countless other challenges in the coding of the scripts
13
2. REDUCING PROBLEM COMPLEXITY
were also faced that are not covered in this report. Challenges from multiple nodes attempting to
write to the same file and causing corruption to equally and efficiently distributing the load across
the nodes were faced and resolved during this recompression task.
All the PBS scripts that were used to complete the recompression task are included in the CD
attached to this report as a reference for future researchers. They can be found in the “hpc” folder
on the CD. The scripts will provide an indication of the complexities involved when coding for
massive computational grids.
After the completion of this recompression task, in order to locate the information for a single IP
address (e.g. 130.102.50.50), a total of 1108 gzip files from each of the data type need to be
searched instead of unpacking 1101 tar balls and searching through 1108 ZPAQ files. Reading gzip
files would be drastically faster than reading ZPAQ files; and as benchmarks have shown also faster
than reading an uncompressed raw file.
A short 10 minute presentation was given on this project at the AusNOG conference in Sydney in
early September as part of the “Lightening Talks” of the conference [25]. AusNOG is the Australian
Network Operators Group [25]. The aim of the presentation was to spread the ideas behind this
project, seek to understand the level of interest by the network operators of Australia in this
dataset, and peruse ideas or recommendations on different products that may originate from any
discussion. Although the level of industry interest in this dataset was very minimal, individual
researchers did show interest in the work being conducted for this project. Through the discussion
that ensued during the 10 minute presentation and continued long after over email, a contact was
established with someone within archive.org who, upon being made aware on the existence of the
gzip version of the Internet Census 2012 files, sought to publish a copy of them on the archive.org
website to allow public access.
Prior to establishing contact with archive.org, a few requests from Australian researchers as well
as international researchers had been received for a copy of the gzip files. Some of these researchers
either did not have access to infrastructure similar to the HPC or did not want to waste the time
when the work had already been done. As there seemed to some interest in the dataset and
organising a 1.4 TB transfer for every individual request is bound to become impractical as the
requests increase, a decision to upload a copy of the data archive.org server was made to allow
future researchers easy access to this data.
The 1.4 TB of gzip data was uploaded to archive.org servers within a week. Due to the size of the
data, archive.org split the final public download into 6 parts. Files from each of these parts can either
be downloaded individually, through queuing via a download manager or through a torrent
14
2. REDUCING PROBLEM COMPLEXITY
containing all the files for each of the parts. As of writing this report, Part 1 of the torrent had been
downloaded 25 times already. Interested researchers can obtained a copy of the recompressed gzip
version of the Internet Census 2012 at the following URLs:
- http://archive.org/details/internetcensus_gzip_1
- http://archive.org/details/internetcensus_gzip_2
- http://archive.org/details/internetcensus_gzip_3
- http://archive.org/details/internetcensus_gzip_4
- http://archive.org/details/internetcensus_gzip_5
- http://archive.org/details/internetcensus_gzip_6
15
16
Chapter 3 Research
This is the first academic project on the Internet Census 2012. A few individuals around the world
have attempted to make the data publicly searchable with various levels of success [26]. Their
methodologies have never been published. As a result, no prior literature on searching this particular
dataset exists. However, there is a substantial amount of information available on “big data” analysis
[27]. Such information has formed the foundation of research for this particular project in order to
identify solutions that can be implemented in order to easily search this dataset.
It is vital to first understand the needs of the project before beginning the research phase. The
aim of this project is to “make a 9 terabyte dataset easily searchable, preferably with minimal
resource requirements.” This is an ambiguous aim and was chosen on purpose. The definition of
“easily searchable” will depend on different researchers and what they wish to search in the data
and how they wish to search it. This project recognises two primary types of research uses for this
dataset.
The first use of this dataset would be for a researcher or technical enthusiast who wishes to
obtain the results of just one IP address or a small range of IP addresses. This could be someone
from UQ’s IT department who wishes to analyse UQ’s public IP range as seen in this dataset. They
would only care to search for only a small range of the dataset. This type of research use is assumed
to form the majority of interest in this dataset as most businesses, researchers or institutions would
only want to look up their own IP ranges. For this type of research, a flat file search implementation
that searches through each of the relevant 1108 gzip files has been coded. Due to the nature of this
search, heavy implementation based research was required but very minor theoretical research was
needed. Therefore, details on this implementation are covered in the chapter “Flat file search
implementation” where the implementation challenges are detailed.
The second use of this dataset is for researchers who wish to understand and analyse the entire
dataset to draw conclusions about the entire IPv4 space or the use of the Internet in general. This
type of usage is likely to be of interest to only a minority of researchers who have the time and
resources to explore the dataset in detail. This type of searching requires the ability to search the
data in a relational manner depending on the researcher’s interest. Perhaps the researcher wishes to
analyse the timestamps and create a timeline or perhaps their interest is in all the IP addresses or
just types of results obtained. Therefore, for this type of search, it becomes vital that the
implementation is flexible in how it allows the data to be searched. In order to determine the best
solution of this project, this chapter of the thesis explores the different products that allow
searching large datasets.
17
3. RESEARCH
relational databases (MySQL and MariaDB), Apache Hadoop, MapReduce, Google BigQuery, AT&T
Daytona, Splunk and Amazon Redshift.
Each of these products was researched in detail to assess their suitability to the aim of this
project. The findings of this research is summarised below for each of these products.
3.1.2 Splunk
Splunk is an enterprise product developed for searching and analysing machine-generated big data
[31]. The Internet Census 2012 dataset can easily be considered machine-generated as can be seen
from the details and samples provided in the “Background Information & Theory” chapter.
Unfortunately, the enterprise product comes at a substantial cost [31]. Splunk is also available in a
free version. However, the free version comes with a 500 MB per day processing limit [32].
Processing nearly 9000 GB at a speed of 0.5 GB per day would take 18,000 days or approximately 50
years.
Given the processing limit of the free version and the costs associated with the enterprise
product, Splunk was not considered a serious contender for this project.
Given that MapReduce and Apache Hadoop are free and don’t come with a processing limit,
these were considered as the first serious contenders for this project. Both of these are well known
for processing large data quickly and research reveals that they would make for an ideal
implementation of the relational search as they are extremely flexible in what data is made
searchable and how the dataset is searched [33][35]. However, both of these products depend on a
cluster of computers for effective implementation. This would pose as a serious problem given that
the resources that are available for this project consists primarily of the AusCERT server thumper.
Nonetheless, the High Performance Computing Grid and ITEE’s Hadoop cluster can still be
utilised to achieve the primary aim of this project. Further research in the possibility of using these
two revealed a number of hindrances and problems with each of these grids.
The primary concern with the High Performance Grid (HPC) is that is currently not setup with
MapReduce or Hadoop. As mentioned in “Using the High Performance Computing Grid” section, the
HPC uses PBS as the system for load distribution and parallel data processing. It would be possible to
18
3. RESEARCH
request that an implementation of MapReduce be setup on the HPC. However, this would obviously
come at a substantial time cost given the need to research different softwares and then test it on a
smaller number of the HPC nodes before deploying it to the whole grid.
Another concern with the HPC is that it cannot provide fast 9 TB of storage. The storage of ~1.4
TB of the gzip files already exceeded the per-user limit of 1 TB. For storage requirements above 1 TB,
the data is meant to be stored on tape and comes with a cost of $100 per TB. Given this project does
not have any budget, this is option is infeasible. However, even if there were no costs associated
with storing the 9 TB, accessing the data from tapes would make the data extremely slow to access.
The slow speed would make it impractical for any type of swift searching to be conducted on the
entire dataset.
With the School of ITEE’s Hadoop Cluster, the safety of the data cannot be guaranteed on it as it
is made up of lab machines for students that are re-imaged often. Speed or usability of the grid
cannot be guaranteed or predicted as priority would be given to the console student user. These
complications introduce unnecessary uncertainty and unreliability to the processing and storage of
the dataset. Therefore, ITEE’s Hadoop Cluster is also not an ideal solution for the aims of this project.
Furthermore, even if more time, money and resources were available, neither of these options seem
to be optimal given their respective limitations on speed and reliability of the data.
Moreover, part of the aim of this project is to implement searching but “preferably with minimal
resource requirements”. The motivation for including this in the aim was to allow a wider range of
researchers to have access to analysing this dataset. As mentioned in “Downloading the
recompressed data” section on page 14, a few researchers that were after a copy of the gzip dataset
did not have access to research grids. By leaning towards implementations that have minimal
resource requirements, it would make it possible for these types of researchers to also analyse the
dataset easily. Additionally, given the outcomes of other research conducted for this project which is
detailed next in this chapter, it was ascertained that it was indeed possible to implement a relational
search without the need of a computational grid. Hence, even though an implementation of
MapReduce remains a strong alternative for achieving the aims of this project, other
implementations were given preference in order to fully meet the original aim of this project.
However, data storage and querying both come at a considerable cost [38]. This is the primary
reason why this implementation was not explored for this project. Another concern was the time it
would take to upload the uncompressed 9 terabytes to Google before any querying can be done.
This should be taken into consideration by future researchers.
19
3. RESEARCH
The largest hindrance to using relational databases that other products such Hadoop or Splunk
do not have, is the time required for data insertions. For a relational database, the data first has to
be inserted before it is indexed [42]. With some of the other products, the data can be indexed in-
place without the need for insertions [34][35]. This may be concern to future researchers. However,
it was discovered that once a relational database was finetuned, the time required for
insertions/indexation is acceptable. The finetuning will be presented in detail in the “Relational
Search Implementation” chapter.
Importing all the data into a relational database was one of the very first options considered for
this project. However it was originally quickly dismissed due to multiple reasons. For one, with or
without compression enabled on a modern relational database, it would take an enormous amount
of time and CPU cost to insert the over 265 billion records of the Internet Census 2012 dataset. A
database would also introduce a single point of disk I/O bottleneck even when spread over a large
cluster. A test of inserting approximately 1 million records into a relational database on the AusCERT
server took 2 hours to complete. When extended to 265 billion records, which is 265,000 times
larger, it would take over 60 years to insert all the records. All of these reasons originally eliminated
the option of using a relational database server in order make the dataset easily searchable.
It is vital to note that for the majority of this project, Google BigQuery was assumed to be the
most viable solution despite its costs. The original project solution was to upload a very small set of
data to Google Query (up to the free limit of 5 GB) and detail a proof-of-concept on that dataset and
present the findings in this report [38]. However, as with any research, an accidental discovery was
made. During the implementation of an add-on optional module for the flat file search
implementation, while trying to insert and index the 23.9 GB of data from the lists detailed in “Lists
of IPs considered ‘active’” on page 8, it was discovered, that it was not only possible but also very
practical to use relation databases for making the entire dataset searchable. It was after this
discovery that the decision to use relational databases for the relational search implementation of
this project was made. The ability to set up a relational database in a single server and the
availability of free open-source alternatives finally made this option the most desirable.
20
3. RESEARCH
Next, once the relational database option was chosen as the best option for the relational search
implementation of this project, it was necessary to decide which relational database would be most
the efficient and speedy. Given the very short timeframe of this project, it was entirely beyond the
purview of this project to determine what the best relational database is. Therefore, only brief
research was conducted to determine which relational database to employ for use in the relational
search implementation of this project.
Two major contenders were considered based on their popularity of use: Oracle database and
MySQL. Since the Oracle database is not a free product, MySQL seemed to remain as the only
choice. However, some quick research revealed that MariaDB was a contender to MySQL [43][44].
Wikipedia has used MySQL as the database at the core all its services since the inception of
Wikipedia [46]. However, earlier this year, Wikipedia conducted benchmark tests between MySQL
and MariaDB on their infrastructure and datasets and determined that there was a performance
gain when utilising MariaDB [46][47]. These extensive tests served as the basis for choosing MariaDB
as the relational database to be used for implementing a relational search in this project.
MariaDB is an open-source fork of MySQL. It is virtually same in almost all aspects (it even
responds to MySQL commands) [44][45]. The primary difference between the two is the internal
storage/search engine they use [44]. MariaDB was born as a result of disagreements with the
developers of MySQL and its owner Oracle. As a result of this disagreement, the original developer
of MySQL created an open-source fork and named it MariaDB [43]. Since MariaDB was created with
as a drop-in replacement for MySQL with no changes required and because of the developers’ of
MariaDB’s commitment to maintaining compatibility between the two databases, all MySQL
documentation also applies to MariaDB. Therefore all commands, settings and databases are
completely and easily transferable between MySQL and MariaDB [45]. Therefore, all findings in this
project related to MariaDB also applies to MySQL and researchers employing MySQL can also benefit
just as equally as researchers utilising MariaDB. Most of the references for the settings described in
this report are from the MySQL documentation but all of them apply to MariaDB as well [45].
21
22
Chapter 4 Flat file search implementation
This project has implemented a flat file search to allow searching for a single IP address or a given IP
address range through the Internet Census 2012 dataset. This chapter details the implementation of
the flat file search and provides an overview of its functionality with appropriate screenshots. As
detailed in the “Research” chapter, the flat file search is one of the two solutions implemented in
order to achieve the aims of this project.
The code for this implementation has been included in the attached CD. Resultant CSV files and
log files created by different searches have also been included in order to provide examples of the
expected outputs from searches conducted using this implementation.
The flat file search comes with a few clear advantages over the relational search
implementation. Researchers wishing to search only certain IP addresses or IP ranges need only
download or have access to the relevant files containing that data. Researchers do not need the
entire dataset to use the functionality of the flat file search. In other words, if one is only interested
in IP addresses within the 130.0.0/8 range, then only the relevant 1108 130.gz files need to be
obtained. These could easily be specifically selected for download using a download manager from
the archive.org links previously provided in the “Downloading the recompressed data” section on
page 14.
Once a researcher has obtained the necessary gzip files, the script can easily be run with only
minor configuration changes in the “inc.common.php” file. The scripts only require PHP and certain
Linux utilities that are usually installed by default on all Linux distributions. PHP was chosen for
coding as tests revealed that there was no major difference in speed between PHP, Python or Java
and the author is familiar with coding in PHP. A web server is optional since the scripts also allow
searching using the command line directly. An example of this is provided in one of the screenshots
23
4. FLAT FILE SEARCH IMPLEMENTATION
later in this chapter. Please refer to “Screenshots of the flat file search” section of this chapter for
the screenshots.
An optional module that is included in the PHP scripts uses the lists described in “Lists of IPs
considered ‘active’” section from page 8 to conduct preliminary searches. This module requires a
relational database server in order to function correctly. However, it is trivial to disable this module
if its functionality is not warranted and hence the database server is also optional. Disabling this
module does not limit or hinder the functionality of this implementation in any way.
This relational database module allows researchers to easily and instantly determine whether
conducting a complete dataset search would be worthwhile by performing a preliminary search
against the lists to determine if interesting information in the dataset exists for a particular IP
addresses or IP range. Please refer to “Lists of IPs considered ‘active’” section on page 8 to
understand the contents of these lists in order to understand what is considered “interesting”. That
section will also help understand how these lists may aid researchers to determine whether
interesting information on a particular IP exists in the full dataset. The screenshots of the final
implementation provided later in this chapter will supply further context of the role of this module in
the flat file search.
The implementation of this add-on module was done while finetuning MariaDB as described in
the “Relational Search Implementation” chapter. For researchers interested in enabling this module,
please refer to the details covered in that chapter in order to gain an insight into how to easily insert
and index the over 23 GB of data from the lists described in “Lists of IPs considered ‘active’” section
into a relational database to allow instant querying. A one column table needs to be created for each
of these tables to insert the data and index it before this module can be enabled. Setting up these
tables for use in this module is likely to take a day or two of work. Most of this time would be spent
understanding the “Relational Search Implementation” chapter.
The flat file search also sends two emails to the researcher, first when a search started and
secondly when a search is completed. This functionality is also optional. In order to enable this
functionality minimal configuration setup in PHP has to be completed to allow PHP to send emails.
When a researcher initiates a search over the entire dataset, this flat file search implementation
sends them an email informing them that the search has been started. This email also provides a
web link to a log file of the search that allows the researchers to view the progress of the search.
This log file simply shows raw output of the results as they are found by the search script with
timestamps included to allow monitoring of the progress of the search.
Once a search has been completed another email is sent to the researcher informing them that
the search has ended. This email contains a number of links to allow downloading CSV files where
the search results have been amalgamated. This email still contains a link to the log file. The log file
would contain “Search Ended” with a timestamp as its last line to indicate when the search had
ended. If the web server functionality has not been enabled, then the web links in the emails would
not work or would have to be changed to point at a local file system.
The above covers all the major functionality of the flat file search implementation. Other minor
functionality to check for correct IP range and email address etc. also exists to ensure smooth
24
4. FLAT FILE SEARCH IMPLEMENTATION
functioning of the search. However being minor functionality and in the interest of keeping this
report brief these are not covered in detail here.
4.2.1 Version 1
The very first implementation of this search took nearly 2 hours to complete a search for a single IP
address. Linux tools were not used in this implementation on purpose as using only PHP would allow
the implementation to be platform independent. This first implementation used PHP’s inbuilt gzip
reading libraries to read each of the 1108 gzip files one by one and traverse through each of them
looking for a row that contained the IP address being searched.
This first version had implemented the email functionality to notify the researcher when the
search begun and ended but it had not implemented the optional relation database module to
conduct the preliminary searchers. The functionality to allow results to be exported as a CSV file had
not been implemented either. In this version the results of the search could only be seen in the log
file that was created by the search script to indicate the progress of the search.
4.2.2 Version 2
Further analysis and some rudimentary benchmarks quickly showed that using the Linux utilities
“zcat” and “egrep” in combination with each other in order to search a single IP address through a
single file was considerably faster than when the same IP address was searched through the same
file using the PHP implementation. The reason for the difference in speed is assumed to PHP being
an interpreted language while “zcat” and “egrep” being specifically compiled to be most efficient at
their functionality on a Linux platform.
A second version was developed which replaced the core searching of an IP address through a
gzip file from using PHP’s libraries to using the Linux utilities zcat and egrep by calling an external
command in PHP. At the completion of this version of the implementation, tests showed that a
single IP can be searched through all the 1108 gzip files in approximately 15 minutes - a massive
improvement from the 2 hours taken by the first version of the implementation. In this version, the
optional relational database module was also completed to allow conducting preliminary searches
before continuing to a full search that would take approximately 15 minutes. The aim for
implementing this module to conduct preliminary searches was to allow researchers to decide if it
was worth proceeding to a full search for a given IP address, given the full search would take
approximately 15 minutes to complete.
At this point, a serious consideration was made whether the scripts should be changed from PHP
to Linux bash scripts since the core searching in PHP was being done by externalising command to
Linux tools. Bash scripting is the easiest method of scripting over Linux tools. By changing the scripts
to bash scripts, it would remove the requirement of PHP and would make it that much easier for
future researchers to deploy this script. However, from previous experience in both bash and PHP it
was determined that bash scripting does not allow for complex coding and handling of strings of
arrays.
25
4. FLAT FILE SEARCH IMPLEMENTATION
Also, complex functions are a lot more difficult to code in bash scripts then they are in PHP.
Furthermore, it is trivial to allow access to PHP scripts using a web server but considerably difficult to
allow bash scripts to be executed through a web interface. Therefore a decision to keep the scripts
in PHP was made in order to allow for easier coding and to maintain the web front for searching. The
gain in speed for switching to bash script was also assessed to be minimal since the grunt work of
the search would be conducted by Linux tools as well.
4.2.3 Version 3
Up until the second version of the flat file search implementation, searching was done in a linear
fashion. Each file was searched one after another. In other words, when a search was initiated for an
IP address, a single file would be searched first, the results of the search would be output to the log
file then the next file would be searched and so on and on until all 1101 gzip files had been searched.
It was noticed that the implementation of the flat file search so far was only using a single core.
Given that the AusCERT server has 8 cores available, it was possible to change the implementation to
conduct searches over multiple files in parallel in order to use all available 8 cores and speed up the
searching process as a result.
This option was explored and research revealed that PHP does not natively support creating or
managing threads. At this point, a consideration to re-code the flat file search implementation in C
or Java was made as a result of these languages supporting threading. C was quickly discarded due
to multiple reasons. Firstly, coding in C it would remove the possibility searching through a web
front. Secondly, C needs to be compiled and this would make this implementation a lot less platform
independent and a lot harder for future researchers to modify. Finally as C is a very low level
language, a lot more coding would be involved in order to achieve the same functionality already
achieved with the PHP scripts. Java was a serious consideration. However, further research in PHP
revealed that even though it does not have threading support, it was still possible to achieve the
objective of multiple files being searched in parallel using other techniques. Since the functionality
could be achieved in PHP, it would be a waste of valuable project time to re-coding the
implementation in Java. Therefore, PHP was continued to be used for scripting.
The technique for searching multiple files in parallel in PHP involved initiating a search for an IP
address on a given file through another PHP script. The secondary search script would be called as
an externalised command by the main search process and then detached to allow its execution to
continue separately from the main search process. This allowed the parent PHP script to continue to
initiate further searches on other files while the secondary search script performed the actual
search.
A third version of the flat file search implementation was created utilising this technique. Using
this technique to initiate a search, results in the parent search process spawning 1108 children
processes, in a matter of seconds, in order to conduct a search through each of the 1108 gzip files.
Each of these children will run in parallel using all available CPU resources to conduct the search.
Through lessons learnt while coding for the HPC, the PHP code responsible for amalgamating the
search result to the log file, was modified in order to ensure that the log file did not become corrupt
when multiple children processes attempted to write to it simultaneously.
At the completion of this third implementation, searching for a single IP address through the
entire Internet Census 2012 dataset could now be completed in less than three minutes. This is a
26
4. FLAT FILE SEARCH IMPLEMENTATION
considerable speed gain from the previous versions of the implementation. However, certain
unexpected behaviours were noticed as a result of the change from linear processing to parallel
processing. Firstly, the second email informing the searcher that the search had been completed was
now being sent prematurely on every search. It was sent only a few second after the first email was
sent. Secondly, the words “Search Completed” were being added to the log file near the top of the
file just a few seconds after the search had even begun.
4.2.4 Version 4
In the linear implementation, the addition of the words “Search Completed” and the secondary
email were sent as the last command on the parent script. However, now that the parent was not
conducting the actual searches but externalising them to a sub script, it was finishing quickly and
therefore detecting the end of the search prematurely. Since PHP does not have support for threads
and children processes cannot be monitored for completion, another solution was implemented in
order to detect the true end of the search.
This technique used dummy lock files to detect when all sub scripts had finished. In this
technique, when a search is initiated, the parent script creates a temporary locks folder unique to
this search. It then spawns the 1108 children to conduct the actual search. Each of these children,
before they began the search, creates a dummy file in the temporary locks folder created by the
parent script to indicate that it was still executing. The child would then conduct the actual search,
save the results to the log file and then delete the dummy file to indicate that the search had been
completed. The parent process, once it had spawned all the 1108 children processes would sleep as
long as the temporary locks folder it had created contained at least one file. Once it was detected
the folder contained no files, the folder was deleted before detecting an end of search and sending
the second email and writing “Search Completed” in the log file.
The fourth implementation of the flat file search added the above technique in order to detect
the true end of a search. The technique worked effectively since as long as there is a child still
conducting a search, a file would always exist in the temporary locks folder. The time required to
search a single IP address through the entire dataset remained unchanged at less than three
minutes.
4.2.5 Version 5
So far the flat file search only allowed searching for a single IP address. Functionality to allow
searching an IP address range was added in the fifth version of the implementation by simply adding
a loop over existing code to initiate a search for each of the IP addresses in a given IP range. To
prevent the server from being overloaded, an arbitrary limit of searching a maximum of 256 IP
addresses in any given search IP range was imposed.
With this version, now it was possible to search for an IP range through the entire dataset.
However, the time taken for such a search increased linearly by the number of IP addresses being
searched. In other words, if searching for an IP range that contained 5 IP addresses, it would take 15
minutes (3 minutes per IP address x 5 IP address) to complete. When searching an IP range
containing 256 IP addresses it would take 768 minutes or over 12.5 hours to complete.
27
4. FLAT FILE SEARCH IMPLEMENTATION
4.2.6 Version 6
Further research was conducted to see if the time taken by the IP range search could be reduced. It
was discovered that the “egrep” Linux tool that was used by the PHP sub script to conduct the actual
searches allowed searching multiple items in a single search using an OR operator of “|”.
In theory, according to the documentation of the “egrep” command, searching for a single item
or multiple items through the same file should both take the same amount of time to complete.
Benchmarks were conducted to test this theory. The time taken to perform the following two
commands was compared.
Multiple benchmarks revealed that both commands took on average the same amount of time
to conduct. Small variations in their timing were detected but nothing statistically significant. Given
these findings, the flat search implementation was modified so that instead of simply looping over
existing code to search for multiple IP addresses, only one search per file was conducted which
included all the IP addresses from the IP range being searched.
In the sixth version, the code that creates CSV files from the gathered results was also added to
allow researchers conducting the searchers to export the search results at the completion of the
search. The decision to allow exporting using the CSV format was based on its popularity and it
would be trivial to change the code to allow exporting in another format. The links for downloading
the CSV files were also added to the second email in this version.
Other minor changes to the web front were implemented to ensure if a search for an IP address
or IP range was already under way then it would not be re-initiated if the researcher refreshed the
web page or if another researcher requested the same IP address or IP range for searching.
Furthermore, if a search for a given IP address had been conducted and completed in the past then
the search would also not be initiated but links to the CSV results generated in the past made
available for download instead. Examples of what this looks like can be seen in the screenshots
presented in the “Screenshots of the flat file search” section of this chapter.
With the completion of this version of the implementation, it was now possible to search an IP
range containing hundreds of IP addresses in the same amount of time (in less than three minutes)
as it was to search for just a single IP address. This is the final version of the flat file search. This
version has successfully implemented all the functionality of the flat file search implementation. This
is the version of the implementation that is included on the attached CD.
28
4. FLAT FILE SEARCH IMPLEMENTATION
Researchers looking to improve the speeds of this search should look at running this script on a
device that has even more cores and faster disks (such as SSD disks or RAM disks). Another way of
increasing searching speed would be to insert some of the larger gzip files (such as traceroute.gz)
into a relational database as per the methodologies outlined in “Relational Search Implementation”
chapter. After inserting and indexing of some of the larger gzip files, a hybrid approach of using both
flat file search and relational search can be used to search over the entire dataset even faster than
has been implemented so far in version six of this implementation.
All the CSV files produced by the searches conducted in the screenshots are included on the
attached CD. A copy of these screenshots and is also included on the CD.
29
4. FLAT FILE SEARCH IMPLEMENTATION
30
4. FLAT FILE SEARCH IMPLEMENTATION
31
4. FLAT FILE SEARCH IMPLEMENTATION
Figure 9: Page displayed when conducting search over a previously searched IP address
32
4. FLAT FILE SEARCH IMPLEMENTATION
33
4. FLAT FILE SEARCH IMPLEMENTATION
34
4. FLAT FILE SEARCH IMPLEMENTATION
Figure 16: Page displayed when re-initiating a search on the same IP range
35
4. FLAT FILE SEARCH IMPLEMENTATION
Figure 18: Page displayed when conducting search over a previously searched IP range
36
4. FLAT FILE SEARCH IMPLEMENTATION
Figure 20: Search started email for an IP range for a search started from command line
37
4. FLAT FILE SEARCH IMPLEMENTATION
Figure 21: Search completed email for an IP range for a search started from command line
38
Chapter 5 Relational Search Implementation
This project has implemented a proof-of-concept relational search that, when fully implemented,
would allow searching the entirety of the Internet Census 2012 dataset on any particular field. This
chapter provides details on the poof-of-concept implementation of the relational search and steps
necessary to reproduce such an implementation. As detailed in the “Research” chapter, this is one of
the two solutions implemented to achieve the aims of this project.
As highlighted by the “Relational Databases” section on page 20, the original time estimation
just for insertion into a relational database such as MariaDB was over 60 years. These estimates
were completed by measuring the time taken to insert 1 million records on thumper with the use of
the “insert into table” SQL statements. The first one million lines of data from 130.0.0.0/8 range of
the ICMP ping were extracted for use in this estimation. This original test was conducted by writing a
PHP script that would read a raw file (can easily be a gzip file) one line at a time. Each line was
converted to an SQL insert statement by the script and executed on the database on thumper. The
insertions of these 1 million records using this script took over 2 hours to complete. Given there are
265 billion records that need to be inserted, it was initially assessed infeasible to even consider
relational databases as an option for use in this project.
However, as mentioned in the “Relational Databases” section, as well as in the lists detailed in
the “Lists of IPs considered ‘active’” section on page 8, it was discovered, that after some fine-tuning
of a relational database and using a different command for data insertion, it was not only possible
but also very practical to use relation databases for making the entire dataset searchable.
This chapter of this report highlights some of the discoveries that were made when attempting
to insert the over 23 GB of lists detailed in the section “Lists of IPs considered ‘active’” that made it
possible to reduce the insertion time dramatically down to just 5 days for inserting over 1.8TB data
totalling over 52 billion records. Other research findings that helped make relational databases a
strong candidate for this project are also detailed in this chapter.
This chapter also presents some benchmarks that were conducted on a sample data with
different options enabled on a MariaDB table. These benchmarks should serve as guide to all future
researchers in helping decide which options to use when inserting and/or indexing the dataset.
Please note that this whole chapter constantly refers to MariaDB, however all comments and
commands below are equally valid on MySQL [45]. Please note that the default storage engine used
by MariaDB and MySQL is called InnoDB and this was used in throughout this project for various
reasons [44][45]. Some of these reasons are detailed throughout this chapter.
It is quintessential that the machine being used for setting up a database server has a reliable
backup power source. If data insertions or indexation is interrupted by a power outage, the entire
work would have to be started again. Moreover, since disk I/O is the bottleneck on database servers
when inserting large amounts of data, it is essential when performing large insertions, to ensure no
other disk intensive operations are being conducted on the system.
39
5. RELATIONAL SEARCH IMPLEMENTATION
An example of such a command when executed directly on MariaDB command line would look
like:
The above command would load the contents of the file “traceroute-head” into the table
“search_traceroute”. The columns in “traceroute-head” need to be separated by tabs for this
command to succeed as “load data infile” by default insert values separated by tabs into different
columns in the table. Luckily, all of the Internet Census 2012 data is stored in tab separated format.
However, if the columns were not tab separated then the “load data infile” easily allows changing
the default separator of tab to anything else such a comma to allow inserting other types of files
such as CSV files directly into a table.
As benchmarks show later in this chapter, inserting 10 million records into MariaDB using this
command was done in just over 30 seconds. This is a speed increase of over 2400 times compared
with the insertions completed using the SQL “insert” command. Therefore, the discovery of this
command made the relational databases the most prominent contender as a solution for the
relational search implementation.
However, many other variables had to be finetuned prior to beginning the proof-of-concept
insertions. These variables are detailed in the next section of this chapter. In this project, most of
these variables had already been finetuned prior to the discovery of the “load data infile” command
in an attempt to increase database speed. However, it was the discovery of this command that really
changed the game as the speed increases added by the variable changes described below are
minuscule compared to the speed increase provided by the “load data infile” command.
Please note that since MariaDB is identical to MySQL in almost every aspect, it also uses the
exact same configuration file as MySQL and stores its information in the same directories as MySQL.
A copy of the /etc/mysql/my.cnf file from thumper containing all the variables discussed below is
also provided in the last sub-section below.
40
5. RELATIONAL SEARCH IMPLEMENTATION
Although storing the database on the SSD would provide exceptional data speed, the database
would quickly run out of space due to being limited to only 120 GB.
There are two primary disadvantages to this default variable value. The first is that all the
database files will be located on a single storage medium. Since everything is stored in a single file, it
is not possible to split different tables and databases over different storage mediums depending on
need. Secondly, it is vital, given that space is precious when dealing with large datasets such as the
Internet Census 2012, to have the option of recovering space from the database server by dropping
unnecessary tables or databases when needed. Therefore, if possible, it would ideal to store each
table in its own file on disk to allow for moving between different storage mediums as well as easy
deletion for space recovery.
Ideally this variable must be enabled before any large insertions are done to take advantage of
the functionality. Only the tables created after enabling this option are stored in separate files from
the “ibdata1” file. All the previously created tables and the respective inserted/indexed data will
remain as part of the “ibdata1” file. If an “alter table” command is issued after this option is enabled
then the table will be separated from “ibdata1” into its own file. However, “ibdata1” will not shrink
as a result of this. Therefore, it is highly recommended that this variable is permanently enabled in
the MariaDB configuration file before any work on the database server is done.
There are certain disadvantages to enabling this option. The primary disadvantage is the need
for MariaDB to have more file handles open simultaneously. This shouldn’t be a huge problem when
the number of databases and tables on the database server is relatively small. However, this variable
change can have adverse affects when there hundreds of thousands of tables on the server since
41
5. RELATIONAL SEARCH IMPLEMENTATION
MariaDB will need to have a dedicated file descriptor for each of these table. This can quickly
overload the operating system. This does not have any impact on this project since it is highly
unlikely there would be more than 10 or 20 tables ever needed. Even with a few hundred tables, no
adverse affects are introduced. However, future researchers using a shared database server may
need to evaluate if enabling this option will have an adverse impact on the performance of the
server.
There are many other advantages and disadvantages that are not discussed in detail here as
they do not pertain to this project but it is recommended that all researchers read MySQL’s
complete list of advantages and disadvantages before enabling or disabling this option [17].
The default choice of InnoDB was continued to be used instead of other storage engines
primarily because it is the only storage engine that allows storing one table per file. All other storage
engines only support storing in the “ibdata1” file. For the reasons highlighted above, it is highly
desirable to be able to store one table per file on disk. Hence, the InnoDB storage engine in MariaDB
was used for all the tables created and inserted for the relational search implementation of this
project.
For thumper, this value was changed to 8 GB, which is half of all RAM available on thumper. This
gave insertions, indexing and even searching a massive boost of speed. It is highly recommended not
to set this variable to equal the size of the installed RAM in the system, as no memory would remain
available for other system resources, which is likely to cause issues [19]. The highest recommended
value for this variable is 80% of available RAM. In the case of thumper, a conservative value of only
50% of RAM was chosen to ensure other scripts and work can still be carried out. It is highly
recommended that researchers read MySQL documentation on this variable to understand its
impact on speed [18][19].
This variable controls how both the data and log files are flushed by the database server to disk.
By default InnoDB uses the fsync() system call in order to flush the data for each of the tables to the
operating system [20]. The default behaviour pushes the data to a system call which would cache
the data before reading/writing it to disk [20]. Depending on the disk and system configuration, this
may be slower than having MariaDB access the disk directly by passing the system’s caching
mechanisms [21].
If the tables are stored on a network link then having MariaDB make direct calls to the storage
would be slower than letting the underlying operating system cache requests in order obtain larger
42
5. RELATIONAL SEARCH IMPLEMENTATION
data blocks in one request [20][21]. However, when using SSD disk or RAID 0 arrays, the system
caching may actually slow down the response time of the database server [20][21]. In such a
scenario, this option can be changed to O_DIRECT to force MariaDB to read and write directly to disk
[20][21].
This is not a clear cut variable and the performance gain or loss depends exclusively on the
configuration of the system. Quick tests performed earlier during the project had revealed that for
thumper choosing the O_DIRECT option provided a slight speed increase. Hence, for this project this
variable was changed from the default to O_DIRECT.
Future researchers should conduct tests on their particular system to determine if the default
variable value improves or degrades the performance of their database server compared to the
O_DIRECT option [20][21].
Larger values for this variable decrease the checkpoint flush activity needed in the buffer pool
leading to a reduction in the amount of disk I/O performed by the log file. However, larger log files
means that a recovery after a database crash is slower as it has to read the entire log file in order to
determine what needs to be recovered, if anything. A larger log file also means a longer commit log
is stored. This is useful if big data operations are being performed on the database, then on the
event of a crash, more data can be recovered – although it would be very slow if the log file size is
very large.
For this project, on thumper, the value of this variable was changed to 1GB. This value was
chosen as a middle ground between the need for faster recovery and the need to decrease disk I/O
activity of the log file. Future researcher should consider their individual circumstances before
changing this variable [20][22]. For a large dataset such as the Internet Census 2012, 5 MB is
definitely not an optimal value for this variable.
A larger InnoDB log buffer allows more and larger transactions to run without needing to write
the log to disk before the transactions are committed to the table or database. This is ideal when
running big transactions. However, setting this variable to very large value can create problems for
database server recovery as the buffer may be lost on server crash and if it was not committed to
the log file then certain tables may end up being corrupted.
For thumper 128 MB was chosen as the log buffer size to allow for larger transactions to occur
without too much disk I/O. A larger value was not chosen in order to prevent data corrupt or data
loss on server crash or power loss.
43
5. RELATIONAL SEARCH IMPLEMENTATION
This option can only be set to one of two options, either “Antelope” or “Barracuda” [23].
“Antelope” is the default choice as that is the only way tables in “ibdata1” are stored [23]. The
“Barracuda” file format is a new file format that is not backward compatible as it introduces certain
new features, including the support for table compression [23]. No other storage engine other than
InnoDB and no other file format other than “Barracuda” support table compression [20][23]. This is
another reason why the default storage engine of InnoDB was continued to be used as the storage
engine in MariaDB for this implementing the relational search for this project.
To allow testing and benchmarking compression on tables, this option was enabled on thumper.
As part of the benchmarks it was necessary to determine if enabling compression before or after
insertion may be feasible for future researchers. Future researchers that are not considering using
compression after assessing their needs against the benchmarks detailed later in this chapter, may
consider not changing this variable [20][23].
The content of the MariaDB settings file /etc/mysql/my.cnf from thumper is included below for
reference. A copy of it with comments can also be found on the attached CD.
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice =0
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /data/carna/mysql/db
tmpdir = /data/carna/mysql/tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
innodb_file_per_table
innodb_flush_method=O_DIRECT
innodb_log_file_size=1G
innodb_buffer_pool_size=8G
44
5. RELATIONAL SEARCH IMPLEMENTATION
innodb_file_format=Barracuda
innodb_log_buffer_size=128M
bind-address = 127.0.0.1
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size =8
myisam-recover = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
expire_logs_days = 10
max_binlog_size = 100M
[mysqldump]
quick
quote-names
max_allowed_packet = 16M
[mysql]
[isamchk]
key_buffer = 16M
!includedir /etc/mysql/conf.d/
Moreover, data stored in a database server always takes more space than the raw data due to
some space being wasted by different MariaDB column data types. Indexes take up even further
storage. Therefore, given 10 TB of space, it would not be possible to insert all the 9TB of raw data as
approximately 13 to 14 TB of storage in the database would actually be required to insert 9 TB of
raw data. To be safe, future researchers inserting and indexing the entire dataset are recommended
to have double the storage, 18 TB, available in order to allow for some room for movement.
Therefore, given the large storage requirements for database storage, if compression is enabled,
it might be easily possible to insert the entire 9 terabytes of data into the 10 TB available on
thumper. Compression on a table can be enabled after data insertion or before data insertion.
Having compression enabled prior to insertion would slow down the data insertion and enabling
compression after insertion would result in a long wait period where the table would become
unavailable for querying while it was being compressed. Tests showed that enabling compression on
InnoDB came at a large processing cost irrelevant of whether compression was enabled before data
45
5. RELATIONAL SEARCH IMPLEMENTATION
insertion or afterwards. Benchmarks revealing time the taken for insertions in each of these
scenarios are explored later in this chapter.
However, it is important to note that compressed tables were no slower to generic select
queries than other tables. Only table changes, such insertions, deletions and alterations came at
significant CPU cost. Simple searching was equally as quick as uncompressed tables.
InnoDB compression can be enabled with different key block sizes. The default key block size for
compression is 8 KB. Other possible values for key block sizes are 1 KB, 2 KB, 3 KB, 8 KB or 16 KB.
Setting the key block size to 16 typically does not result in much compression, since the normal
InnoDB page size is 16KB. In general, the smaller the key block size the better the compression
ration but the more processing power required for compression. The time taken to compress can
increase exponentially with smaller key block sizes.
Once all the correct database settings detailed above were applied, a table called “search_rdns”
was created to allow storing all the data from the reverse DNS data type. The reverse DNS data type
stored three different pieces of information. The IP addressed of the reverse DNS, timestamp
indicating when the reverse DNS was performed and the result of the reverse DNS. This section
shows how the data type for each of these three different fields were decided in order to create a
table schema for storing the reverse DNS data in the database.
The data type for “ip” was chosen as VARCHAR (variable character) with a maximum of 15
characters being allowed to be stored in the field. Including the full stops, the maximum an IP field
can be is 15 characters. However in many instances in the data it is likely to be smaller. The
VARCHAR data type only takes up storage space of the actual characters used plus an extra 1 byte.
Hence it is ideal for the “ip” field where the data itself is variable in length.
Another contender for the “ip” field is the CHAR data type. This data type takes up a fixed
amount of space and is right-padded with spaces if less than the maximum number of characters is
inserted. These spaces are removed when values are retrieved for searching or displaying. This not
only waste storage, since 15 bytes of storage per row would be taken up irrespective of the actual
data stored, but it would also introduce an overhead of processing time to allow the addition and
removal of the white space padding on insertion and retrieval. Therefore, variable character data
type is the most optimal for storing the “ip” field in the table.
The timestamps in all of the 8 date types of the Internet Census 2012 are stored as Unix epoch
time. Unix epoch time is represented by an integer that indicates the number of seconds that have
passed since midnight UTC time on 1st of January 1970. The following table shows the different
integer type options that are available for storing an integer in a table in MariaDB.
46
5. RELATIONAL SEARCH IMPLEMENTATION
The epoch time for 11th of November 2013 at 12 PM AEST is 1384135200. This number is
smaller than the maximum value possible for a Signed INT but bigger than the maximum value
possible for a signed or an unsigned MEDIUMINT. Therefore INT seems like the obvious choice for
storing timestamps in the table.
Since the epoch time for the Internet Census 2012 dataset will not be negative, it makes sense to
choose an unsigned integer. However, the default INT type is a signed integer. By specifically
declaring a field an unsigned INT, extra processing needs to be done in order to interpret the stored
values differently. Since, a signed INT can also store all possible value of the timestamps field from
the dataset easily, it was ultimately chosen in order reduce unnecessary processing that would have
been introduced by choosing an unsigned INT to store timestamps.
Another method of storing the timestamp is the DATETIME data type of MariaDB. This date type
is ideal for storing date and timestamps information. A MariaDB function exists that will convert
epoch time to the correct DATETIME format during insertion. This function is called
“from_unixtime()”. This function can be used with the “load data infile” command to convert the
epoch time integers of the timestamps to the DATETIME format in order to store the data in the
native format. An example of a “load data infile” command using the “from_unixtime()” function can
be seen in the “Benchmarks” section of this chapter.
Benchmarks, detailed in the next section, showed that a considerable overhead was introduced
when using such a command to convert the integers into timestamps. It was much faster to insert
the original integers. For this reason, the data type of INT was chosen to store the timestamps.
Please refer to the “Benchmarks” section of this chapter for details on these benchmarks.
On further investigation it was discovered that storing the timestamps data as DATETIME would
also take double the storage space of an INT. As can be seen from the above table, an INT takes 4
bytes of storage. A DATETIME data type on the other hand takes 8 bytes of storage. This seems like a
waste when an INT can store the same information using half the storage. Furthermore, it is
relatively easy to convert these integers into DATETIME format when retrieving records by using
“from_unixtime()” function with select:
47
5. RELATIONAL SEARCH IMPLEMENTATION
Finally, the data type TEXT allows storing text of arbitrary length with a maximum storage of up
to 216 bytes. Since the data stored in the field “result” cannot be predicted and there is no way to
indentify the largest string stored in the field, using TEXT as the data type for storing “result” was the
obvious choice to allow for maximum flexibility. The space taken up for storage of TEXT data type is
2 bytes more than the actual text inserted, therefore there is a bit of storage wasted using this
technique. However, given the unpredictability of this data type, it is best available option.
The above choices combine to produce the below schema for the “search_rdns” table.
5.5 Benchmarks
In order to determine the preferred way of inserting and indexing the data, a number of benchmarks
were conducted on the “search_rdns” table described in the previous section. In order to perform
various benchmarks, the first 10 million records from the file 156.gz (representing 156.0.0.0/8) from
the reverse DNS data type were extracted and saved into a file called “rdns_test”. The size of this
raw file on disk was 287 MB. This file was used for various different tests.
Please note that between every insertion test, the entire table was deleted and the table
schema re-created. Therefore, all insertions are being made into an empty table.
As can be seen from the first command above, inserting 10 million records into a table with the
data type for timestamps set to INT took 33 seconds. This table was dropped and re-created before
the second command above was run. The second command, inserting the same 10 million records
into a table with the timestamps data type set to DATETIME took 15.99 seconds longer to execute.
48
5. RELATIONAL SEARCH IMPLEMENTATION
This is an increase of over 48% in time taken for insertion. This can have a dramatic impact when
inserting an entire dataset. Therefore, as summarised in the previous section of this chapter, for this
project, it was decided to use INT for storing the values of the timestamps field.
For the purposes of these benchmarks, a key block size of 4 KB was used whenever compression
was enabled. The column “Index or Compress first” in the table below indicates whether after the
data had been inserted, indexing was performed first or compression enabled first. The sizes in
brackets in each of the cells indicate the size of the table on disk after each operation was completed
successfully.
The above table will assist all future researchers determine the best combinations to follow for
inserting, indexing and enabling compression for the dataset of the Internet Census 2012. Depending
on the type of resources and time available, any combination of the options listed in the above table
can be chosen.
Since this project is considerably short on time, the option of insertion without any indexes or
compression was chosen as that combination has the fastest insertion time. If time permitted, the
plan was to then also create an index over the “ip” filed after insertion was completed. Choosing this
option results in a lot wasted space as can be seen above but it remains the fastest option for
inserting the dataset and making it searchable. In the benchmarks above, inserting the sample data
49
5. RELATIONAL SEARCH IMPLEMENTATION
with no index or compression enabled and then creating an index after insertion took a total of
57.37 seconds. The table ended up taking 725 MB of space on disk after these operations.
On the hand, creating a table with an index already defined, then inserting the sample data
produced slightly longer total insertion times (1 minute and 4.41 seconds) and bigger table size on
disk (1.1GB). 1.1GB is massive waste of space considering that the raw size of the test data is only
297 MB. The above benchmarks definitely reveal that it significantly better to create an index after
the data has been inserted into table rather than to insert data into a table with a pre-defined index.
No compression was used in the final proof-of-concept implementation detailed in the next
section due to the time limitations of this project. If, however, data storage was a concern then
future researchers may enable compression before data insertion. The benchmarks show that the
best time to enable compression is prior to data insertion as enabling compression after insertion
takes longer to process than if compression was enabled prior to insertion. However, with
compression enabled, a time increase of over 639% results compared to insertion without
compression. This time increase may not be acceptable if inserting the entire dataset but may not be
huge concern when inserting a relatively small number of records such as the 10 million records
inserted for this benchmark. Future researchers will have to decide the best option for them.
For the purposes of the proof-of-concept insertion, the ICMP ping data type from the dataset
was chosen for insertion. The ICMP ping is the second largest data type in the Internet Census 2012
dataset at a total size of 1.8 TB and containing over 52 billion records. It was assumed that if a proof-
of-concept insertion for the second largest data type can be done in an acceptable timeframe then
most of the other data types that are smaller would definitely be a lot easier to insert.
A table schema for “search_icmp_ping” was created that is identical to the “search_rdns”
schema presented in the “Sample Table Schema” section previously in this chapter. The identical
table schema is a result of the fact that ICMP ping data contains the same type and number of fields
as the reverse DNS data.
Once the above table was created in MariaDB, a simple bash script was written to automatically
decompress each of the gzip files from the ICMP ping data type and insert it into the
“search_icmp_ping” table in the database.
50
5. RELATIONAL SEARCH IMPLEMENTATION
For ease of insertion, the entire ICMP ping folder was copied to a secondary folder
“icmp_ping_raw” where the script was run from. The script deletes each of the Class A files once it
has successfully inserted it into the database. This was done in order to easily allow keeping track of
which parts of the data had already been inserted. This was the primary reason for creating a copy of
the ICMP ping folder for the purpose of insertions into the database.
Below is the “process.sh” script that was coded to perform automatic insertion. A copy of this
file can be found on the CD as well.
#!/bin/bash
for f in *.gz
do
filename="${f%.*}"
In this script “census2012” was the name of the database in MariaDB where the table
“search_icmp_ping” was located. The text “YOURPASSWORD” should be replaced with the correct
password for connecting to MariaDB. This script should be run from the same folder as the gzip files.
The best way of executing this script is to background it and redirect its output to a file. This can
be done by executing the following in a shell on Linux in the directory where this script is located:
“./processing.sh &> processing-log.txt”. This command redirects all output and errors from the script
into the file ”processing-log.txt”. The “processing-log.txt” file can be easily read by researchers
without interrupting the insertion process. For the proof-of-concept insertion, this script was run in
this manner. A copy of the output to the redirected file is also included on the CD as reference.
The above script for inserting the ICMP ping data into the relational database was started on
14:35 on 25th October 2013. The script finished at 12:46 on 30th October 2013. The whole insertion
process took just under 5 days and resulted in a table that took 2.4 TB of space on disk. 1.8 TB of raw
data containing over 52 billion records was inserted into MariaDB in less than 5 days and ended up
using 2.4 TB of space on disk.
51
5. RELATIONAL SEARCH IMPLEMENTATION
This command executed for approximately 5 days before completing. Once the indexing had
been done, test revealed that searches conducted on the table with commands such as
were executed and displayed in around 0.03 seconds. This is proof of instant search over 1.8 TB of
data. If there was a wish to allow instant search over either of the other fields, such as timestamps
or results, then they could also be indexed. This implementation allows this flexibility and lets the
researcher choose which fields are needed to be indexed for their research purposes.
A total of 10 days spent inserting and indexing 1.8 TB of data containing more than 52 billion
records to make it instantly searchable is an acceptable time frame for anyone conducting serious
research into the entire dataset of the Internet Census 2012. Therefore, this proof-of-concept
insertion shows clearly how feasible and practical it is to use a relational database, MariaDB, to
insert and index the entire dataset of the Internet Census 2012 for easy large scale searching and
analysis.
52
Conclusion
In summary, the stated objectives of this project were successfully achieved. At the conclusion of
this project, the Internet Census 2012 data has been made easily searchable with minimal resource
requirements.
The first major contribution of this project is the recompressed gzip version of the Internet
Census 2012 that is available for download from archive.org. This recompressed version will allow
countless more researchers easier access to the data than the ZPAQ files could have ever allowed.
The second contribution of this project is the flat file search implementation. This is a complete
implementation that allows searching the dataset in an extremely simple and intuitive manner. This
implementation is easy to transfer between systems and requires minimal work from a researcher in
order to set up. It also comes with minimal resource and software requirements.
For researchers who find the flat file search implementation limiting in the types and ranges of
data that it allows to be searched, a proof-of-concept relational database implementation has also
been covered in complete detail. The relational search implementation is more resource and time
intensive to set up, but it allows the greatest flexibility in the types of data that can be searched with
it. Due to time limitations, only a proof-of-concept could be completed for the relational database
implementation. However, this proof-of-concept implementation is extremely detailed in the
methodologies it used to achieve the results of inserting and indexing 1.8 TB of data in 10 days. The
details provided in the “Relational Search Implementation” chapter would allow interested
researchers to easily continue research in this area.
As this project is at an end, the AusCERT server thumper will shortly be repurposed. However,
the findings of this research will not be lost. All scripts and code used throughout this project are all
included in the attached CD. As mentioned, the 1.4 TB of gzip data is easily available for download
from the archive.org links listed in “Downloading the recompressed data” section on page 14.
It would have been ideal to have had a chance to insert the entire Internet Census 2012 dataset
into the relational database instead of only completing a proof-of-concept implementation.
However, as detailed in the “Table Compression” section on page 45, nearly 14 TB of space would be
required to implement this successfully. As this space is not available on thumper, even if time had
permitted a complete implementation of the relational search, resources requirements would have
been the ultimate limiting factor.
Furthermore, even if the data had been indexed and inserted during this project, it would not
have been possible to transfer this to someone else easily. Unfortunately, with the way relational
databases work, it is not easy to move indexed/inserted version of the data around to different
devices. This is probably the largest negative for a relational search. For the flat file search, little to
no work is required for setup, and the setup is easily transferable between devices.
For the relational search, considerable initial time must be invested in setting up the database
for searching. However, transferring this indexed data around is not easy, and the
insertion/indexation work would need to be re-done on other systems instead of transferring over
53
CONCLUSION
the already completed work. It is possible to transfer existing inserted/indexed table files, but much
time has to be invested in order to ensure that the setup of the database servers on both devices are
identical so as to avoid data corruption after transfer.
Future Work
Improving the flat file search and relational search is possible. Recommendations for improving the
flat file search have been covered in the “Future Work” sub-section on page 28.
For the relational search implementation, the primary source of bottleneck in speed is the
speeds available for Disk I/O. Faster disks are likely to dramatically improve the speed of insertion
for the relational search implementation. Placing the database on SSD disk or a RAM disk would
allow faster insertions and indexing of the dataset. For the proof-of-concept implementation in this
project, the data was being read from the same disk as where the database was inserting the
records. This slowed disk performance in both reading and writing. It was noticed that almost all of
the time, CPUs on the server were not being utilised to their full potential by MariaDB due to the
database having to wait for disk read/writes. Therefore, future researcher should consider placing
the source gzip files on a different storage device then where the MariaDB database is being stored
to allow an increase in the insertion speed.
Ideally, the all the gzip files would be loaded into RAM with the use of a RAM disk, and the entire
database would also be stored in a RAM disk. This would allow maximum speeds to be achieved
when inserting and indexing the data. Unfortunately in order to achieve this, in excess of 15 TB of
ram would be required.
Some of the other products that were considered serious contenders for this project (see
“Product Alternatives for Relational Search” section on page 17) are worth perusing in order to
attempt to locate faster/easier methods of indexing and searching this dataset. Most of these
products were discarded in this project due to the costs involved. Some of these products were
discarded due to high resource requirements.
However, these products still showed promise and researchers with appropriate budgets and
resources may find a faster implementation than the relational search implementation detailed in
this report. If computation grid resources similar to the HPC are available then an implementation in
MapReduce or similar may reveal faster data processing capabilities. However, as highlighted in this
report, having a requirement of grid reduces the number of researchers that can implement such a
solution. Nonetheless, this is still an avenue worth pursuing.
Google BigQuery holds considerable number of positive attributes. Firstly, once Google BigQuery
has been setup with the Internet Census 2012 dataset, it can easily be shared with anyone or even
made publicly available for querying. It allows data storage and processing to be externalised from
54
CONCLUSION
the researcher. This allows non-technical researchers to be involved with analysing the data. Using
Google BigQuery also moves problems associated with storage and backup to Google. The primary
concern with Google BigQuery has been the costs associated with it. Otherwise it is an ideal
implementation to allow continuation of research on this data without any repeat work needing to
be done by future researchers in order to re-setup the data.
Dr Krishnan S.P.T from the Institute for Infocomm Research in Singapore has proposed working
on a collaborative project in order to upload the 9 TB of data onto Google BigQuery. He wishes to
test the strength of Google BigQuery by performing interesting queries over such a large dataset.
The understanding achieved in this project will be shared with Dr Krishnan through this collaboration
to allow easy uploading of the 9 TB to Google BigQuery. Once the data has been uploaded, it would
be necessary to create interesting queries to search over the dataset. Currently, an IPv4 utilisation
map for different countries/region for different times of the year is being considered as one of the
final products of this collaboration. The hope is that other interesting queries and results will also be
created as part of the final product.
The final goal of this collaboration is to present the findings of our research at a conference in
Singapore next year [48]. The conference is called SyScan 2014 and Dr Krishnan will be submitting an
abstract for consideration in the coming month. For his project on Google BigQuery Dr Krishnan has
received some credit from Google to compensate for the storage and query costs.
If the collaboration is successful, it would be trivial to set up a pay-for-use query system for
researchers around the world to use. Only time will reveal the final outcomes of the project with Dr
Krishnan. However, the collaboration itself is another excellent outcome of this thesis.
55
56
Bibliography
[1] Anonymous. (undated). Internet Census 2012. [Online]. Viewed 2013 August 8. Available:
http://internetcensus2012.bitbucket.org/paper.html
[2] Anonymous. (undated). InternetCensus 2012 Torrent. [Online]. Viewed 2013 August 18.
Available: http://internetcensus2012.bitbucket.org/download/internet_census_2012.torrent
[3] Anonymous. (2013, March 17). Port scanning /0 using insecure embedded devices. [Online].
Available e-mail: http://seclists.org/fulldisclosure/2013/Mar/166
[4] Internet Assigned Numbers Authority. (2013, June 20). IANA IPv4 Address Space Registry.
[Online]. Viewed 2013 August 18. Available: http://www.iana.org/assignments/ipv4-address-
space/ipv4-address-space.xhtml
[6] G. Lyon. (2011). The Official Nmap Project Guide to Network Discovery and Security Scanning.
[Online]. Viewed 2013 August 8. Available: http://nmap.org/book/toc.html
[7] E. Siever, S. Figgins, R. Love and A. Robbins, “Linux Commands, ” in Linux in a Nutshell, 6th ed.
USA: O'Reilly Media, 2009, ch. 3, pp. 501-502.
[8] PHP. (2013, August 16). gzfile. [Online]. Viewed 2013 August 18. Available:
http://php.net/manual/en/function.gzfile.php
[9] Python. (undated). 12.2 gzip – Support for gzip files. [Online]. Viewed 2013 August 18.
Available: http://docs.python.org/2/library/gzip.html
[10] Oracle. (undated). Class GZIPInputStream. [Online]. Viewed 2013 August 18. Available:
http://docs.oracle.com/javase/7/docs/api/java/util/zip/GZIPInputStream.html
[11] J. Gailly and M. Adler. (undated). 1. Compression algorithm (deflate). [Online]. Viewed 2013
August 8. Available: http://www.gzip.org/algorithm.txt
[12] A. Feldspar (1997, August 23). An Explanation of the DEFLATE Algorithm. [Online]. Viewed
2013 August 8. Available: http://www.gzip.org/deflate.html
[13] J. Davies (2011, April 24). Dissecting the GZIP format. [Online]. Viewed 2013 August 8.
Available: http://www.infinitepartitions.com/art001.html
[14] Anonymous. (undated). Internet Census 2012. [Online]. Viewed 2013 October 10. Available:
http://internetcensus2012.bitbucket.org/download.html
[15] Anonymous. (undated). Browsing the Internet Address Space. [Online]. Viewed 2013 October
20. Available: http://internetcensus2012.bitbucket.org/hilbert/index.html
57
BIBLIOGRAPHY
[16] Wikipedia. (2013, October 29). Hilbert Curve. [Online]. Viewed 2013 October 29. Available:
http://en.wikipedia.org/wiki/Hilbert_curve
[17] MySQL. (undated). 14.2.6.2. InnoDB File-Per-Table Mode. [Online]. Viewed 2013 October 20.
Available: http://dev.mysql.com/doc/refman/5.6/en/innodb-multiple-tablespaces.html
[18] MySQL. (undated). 8.9.1 The InnoDB Buffer Pool. [Online]. Viewed 2013 October 10.
Available: http://dev.mysql.com/doc/refman/5.6/en/innodb-buffer-pool.html
[19] P. Zaitsev. (2007, November 3). Choosing innodb_buffer_pool_size. [Online]. Viewed 2013
October 2. Available: http://www.mysqlperformanceblog.com/2007/11/03/choosing-
innodb_buffer_pool_size/
[20] MySQL. (undated). 14.6.8. InnoDB Startup Options and System Variables. [Online]. Viewed
2013 October 1. Available: http://dev.mysql.com/doc/refman/5.1/en/innodb-
parameters.html
[21] P. Zaitsev. (2013, January 3). Is there room for more MySQL IO Optimization?. [Online].
Viewed 2013 October 1. Available: http://www.mysqlperformanceblog.com/2013/01/03/is-
there-a-room-for-more-mysql-io-optimization/
[22] P. Zaitsev. (2006, July 3). Chossing proper innodb_log_file_size?. [Online]. Viewed 2013
October 1. Available: http://www.mysqlperformanceblog.com/2006/07/03/choosing-
proper-innodb_log_file_size/
[23] MySQL. (undated). 14.2.9.1. Enabling File Formats. [Online]. Viewed 2013 October 10.
Available: http://dev.mysql.com/doc/refman/5.6/en/innodb-file-format-enabling.html
[24] UnixTime. (undated). What is Unix Time? [Online]. Viewed 2013 October 20. Available:
http://unixtime.info/
[25] AusNOG. (undated). AusNOG 2013 Programme. [Online]. Viewed 2013 October 2. Available:
http://www.ausnog.net/events/ausnog-2013/programme
[26] W. Wineberg. (2013, June). EXFiLTRATED. [Online]. Viewed 2013 October 30. Available:
http://exfiltrated.com/
[27] A. Katal, M. Wazid, R.H. Goudar, "Big data: Issues, challenges, tools and Good practices,"
Collaboration Technologies and Systems (CTS), 2013 Sixth International Conference on
Contemporary Computing (IC3), pp.404-409, 8-10 Aug. 2013.
[28] S. Sagiroglu, D. Sinanc, "Big data: A review," Collaboration Technologies and Systems (CTS),
2013 Sixth International Conference on Contemporary Computing (IC3), pp.42-47, 20-24
May 2013.
[29] J.M. Tien, "Big Data: Unleashing information," Service Systems and Service Management
(ICSSSM), 2013 10th International Conference on Service Systems and Service Management
(ICSSSM), pp.4, 17-19 July 2013.
58
BIBLIOGRAPHY
[30] AT&T. (2007, June). Dayton. [Online]. Viewed 2013 October 1. Available:
http://www2.research.att.com/~daytona/
[31] Splunk. (undated). What is Splunk Enterprise? [Online]. Viewed 2013 October 10. Available:
http://www.splunk.com/view/splunk/SP-CAAAG57
[32] Splunk. (undated). Free vs Enterprise. [Online]. Viewed 2013 October 10. Available:
http://www.splunk.com/view/SP-CAAAE8W
[33] J. Dean, S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI'04:
Sixth Symposium on Operating System Design and Implementation, December, 2004,
http://research.google.com/archive/mapreduce-osdi04.pdf
[34] IMB. (undated). What is MapReduce? [Online]. Viewed 2013 October 10. Available:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
[35] Apache. (2013, October 16). Welcome to Apache Hadoop! [Online]. Viewed 2013 October
20. Available: http://hadoop.apache.org/
[36] Google. (undated). What is BigQuery? [Online]. Viewed 2013 October 1. Available:
https://developers.google.com/bigquery/what-is-bigquery
[37] Google. (undated). Google BigQuery. [Online]. Viewed 2013 October 10. Available:
https://cloud.google.com/products/big-query
[38] Google. (undated). Google App Engine Pricing. [Online]. Viewed 2013 October 20. Available:
https://cloud.google.com/pricing/
[39] Amazon Web Services. (undated). Amazon Redshift (beta). [Online]. Viewed 2013 October
10. Available: http://aws.amazon.com/redshift/
[40] MySQL. (undated). Why MySQL? [Online]. Viewed 2013 October 10. Available:
http://www.mysql.com/why-mysql/
[41] Oracle. (undated). Why Oracle? [Online]. Viewed 2013 October 2. Available:
http://www.oracle.com/us/products/database/overview/index.html
[42] Wikipedia. (2013, November 3). Database Index. [Online]. Viewed 2013 November 3.
Available: http://en.wikipedia.org/wiki/Index_(database)
[43] MariaDB Foundation. (undated). About MariaDB. [Online]. Viewed 2013 October 3.
Available: https://mariadb.org/en/about/
[44] MariaDB Foundation. (undated). MariaDB versus MySQL - Features. [Online]. Viewed 2013
October 5. Available: https://mariadb.com/kb/en/mariadb-versus-mysql-features/
[45] MariaDB Foundation. (undated). MariaDB versus MySQL – Compatibility. [Online]. Viewed
2013 October 5. Available: https://mariadb.com/kb/en/mariadb-versus-mysql-compatibility/
59
BIBLIOGRAPHY
[46] Wikimedia Foundation. (2013, April 22). Wikipedia Adopts MariaDB. [Online]. Viewed 2013
October 10. Available: http://blog.wikimedia.org/2013/04/22/wikipedia-adopts-mariadb/
[47] A. Feldman. (2013, April 16). English Wikipedia mysql 5.1-facebook vs mariadb 5.5.30 read
query comparisons. [Online]. Viewed 2013 October 10. Available:
http://upload.wikimedia.org/wikipedia/commons/1/16/English_Wikipedia_mysql_5.1-
facebook_vs_mariadb_5.5.30_read_query_comparisons.pdf
[48] SyScan. (undated). SyScan 2014. [Online]. Viewed 2013 November 10. Available:
http://www.syscan.org/
60