Data Science Lecture 2 Four Dimensions
Tools Abstraction
“Data Jujitsu”
“Data Wrangling” Translation: “We have no idea what
• Current practice is based on data download (FTP/GREP), which will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh, and 1 PB is ~5,000 disks
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~$1)
• … 2 days and $1K
• … 3 years and $1M
• At some point you need indices to limit search, and parallel data search and analysis
This is where databases can help
[slide src: Jim Gray]
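The scaling argument on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the slide gives no single scan rate, so the 10 MB/s figure (`ASSUMED_BYTES_PER_SECOND`) is an assumption chosen because it lands in the same range as the slide's TB and PB numbers. The 1,000-worker count and the helper names are likewise hypothetical.

```python
# Assumed sustained scan rate; the slide does not state one.
ASSUMED_BYTES_PER_SECOND = 10 * 10**6  # 10 MB/s (hypothetical)

SIZES = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}


def scan_seconds(n_bytes, rate=ASSUMED_BYTES_PER_SECOND, workers=1):
    """Time to read every byte once, split evenly across workers."""
    return n_bytes / (rate * workers)


def index_probes(n_records):
    """Worst-case comparisons for a binary search over a sorted index.

    For n >= 1 this is floor(log2(n)) + 1, which equals n.bit_length().
    """
    return n_records.bit_length()


def humanize(seconds):
    """Render a duration in the largest unit that fits."""
    for unit, size in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for label, n_bytes in SIZES.items():
        serial = humanize(scan_seconds(n_bytes))
        parallel = humanize(scan_seconds(n_bytes, workers=1000))
        print(f"{label}: serial scan {serial}, 1000 workers {parallel}")
    print("probes into a sorted index of 10**12 records:", index_probes(10**12))
```

Under these assumptions a serial scan of 1 PB takes about three years, splitting it across 1,000 workers brings it down to roughly a day, and a sorted index over a trillion records needs only about 40 comparisons per lookup. That is the case for indices and parallelism in miniature.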
Hackers vs Analysts
Empirical
Theoretical
Computational
eScience
Science is about asking questions
• Traditionally: “Query the world”
Data acquisition activities coupled to a specific hypothesis
• eScience: “Download the world”
Data acquired en masse in support of many hypotheses
• The cost of data acquisition has dropped precipitously thanks to advances in
technology
• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Life Sciences: lab automation, high-throughput sequencing
• Oceanography: high-resolution models, cheap sensors, satellites
• The cost of finding, integrating, and analyzing data, then communicating
results, is the new bottleneck
eScience is driven by data more than by
computation
Massive volumes of data from sensors and networks of sensors
Apache Point telescope, SDSS
• 80 TB of raw image data (80,000,000,000,000 bytes) over a 7-year period
Large Synoptic Survey Telescope (LSST)
• 40 TB/day (an SDSS every two days)
• 100+ PB in its 10-year lifetime
Illumina HiSeq 2000 Sequencer
• ~1 TB/day
• Major labs have 25-100 of these machines
Nodes of the NSF Ocean Observatories Initiative
• 1000 km of fiber optic cable on the seafloor, connecting thousands of chemical, physical, and biological sensors
The Web
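The instrument figures quoted above can be cross-checked against each other. This is a back-of-the-envelope sketch rather than anything from the lecture: it takes the SDSS, LSST, and Illumina numbers from the slide and assumes decimal units (1 TB = 10^12 bytes) and data collection on every calendar day, so the LSST total is an upper bound.

```python
TB = 10**12
PB = 10**15

# Figures taken from the slide.
SDSS_TOTAL_BYTES = 80 * TB        # 80 TB of raw image data over 7 years
LSST_BYTES_PER_DAY = 40 * TB      # 40 TB/day
LSST_LIFETIME_YEARS = 10
SEQUENCER_BYTES_PER_DAY = 1 * TB  # ~1 TB/day per Illumina HiSeq 2000
MACHINES_PER_LAB = (25, 100)      # "major labs have 25-100 of these"

# How many days of LSST output equal the whole SDSS archive?
days_per_sdss = SDSS_TOTAL_BYTES / LSST_BYTES_PER_DAY

# Assumption: LSST observes every calendar day, so this is an upper bound.
lsst_upper_bound_pb = LSST_BYTES_PER_DAY * 365 * LSST_LIFETIME_YEARS / PB

# Daily output range of a major sequencing lab, in TB.
lab_tb_per_day = tuple(n * SEQUENCER_BYTES_PER_DAY / TB for n in MACHINES_PER_LAB)

if __name__ == "__main__":
    print(f"LSST produces one SDSS every {days_per_sdss:.0f} days")
    print(f"LSST lifetime upper bound: {lsst_upper_bound_pb:.0f} PB")
    print(f"Major sequencing lab: {lab_tb_per_day[0]:.0f}-{lab_tb_per_day[1]:.0f} TB/day")
```

The numbers agree with the slide: 80 TB at 40 TB/day is two days, and ten years at that rate caps out at 146 PB, consistent with "100+ PB". A large sequencing lab at 25 to 100 TB/day sits in the same range as the telescope.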