Data Science Lecture 2: Four Dimensions


Four Dimensions
Biasness of Data Science

The four dimensions (axes):
• Tools ↔ Abstractions
• Desktop ↔ Cloud
• Hackers ↔ Analysts
• Structs ↔ Stats

What Goes Around Comes Around (Tools ↔ Abstractions)

• Pre-2004: commercial RDBMS, some open source


• 2004 Dean et al. MapReduce
• 2008 Hadoop 0.17 release
• 2008 Olston et al. Pig: Relational Algebra on Hadoop
• 2008 DryadLINQ: Relational Algebra in a Hadoop-like system
• 2009 Thusoo et al. HIVE: SQL on Hadoop
• 2009 HBase: Indexing for Hadoop
• 2010 Dittrich et al. Schemas and Indexing for Hadoop
• 2012 Transactions in HBase (plus VoltDB, other NewSQL systems)
Relational algebra and SQL re-emerged on top of distributed, Hadoop-like systems.
But also some permanent contributions:
• Fault tolerance
• Schema-on-Read
• User-defined functions that don’t suck
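To make "schema-on-read" concrete: the raw bytes are stored as-is, and a schema is imposed only when the data is read. A minimal Python sketch (the log format, field names, and sample rows are invented for illustration):

```python
# Schema-on-read: keep raw lines untouched; impose structure at read time.
# (Hypothetical log data -- fields and values are invented.)
raw_log = [
    "2008-05-20,GET,/index.html,200",
    "2008-05-20,GET,/missing,404",
]

def read_with_schema(lines, schema):
    """Apply a (name, parser) schema to raw text as it is read."""
    for line in lines:
        yield {name: parse(tok)
               for (name, parse), tok in zip(schema, line.split(","))}

# A different schema could be applied to the same raw bytes tomorrow.
schema = [("date", str), ("method", str), ("path", str), ("status", int)]
rows = list(read_with_schema(raw_log, schema))
errors = [r for r in rows if r["status"] >= 400]
```

Contrast with schema-on-write (the classic RDBMS model), where the data must fit a declared schema before it can be loaded at all.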
What are the abstractions of data science? (Tools ↔ Abstractions)

"Data Jujitsu", "Data Wrangling", "Data Munging"
Translation: "We have no idea what this is all about"

What are the abstractions of data science? (Tools ↔ Abstractions)

• Matrices and linear algebra?
• Relations and relational algebra?
• Objects and methods?
• Files and scripts?
• Data frames and functions?
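One way to see that these candidate abstractions overlap: the same aggregation can be phrased in relational algebra (via SQL) or in a data-frame style. A sketch using Python's built-in sqlite3 for the relational version (toy data; the table and column names are invented):

```python
import sqlite3

# The same group-by average expressed against two abstractions.
data = [("A", 1.0), ("A", 3.0), ("B", 2.0)]

# 1) Relations + relational algebra (SQL):
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", data)
sql_result = dict(conn.execute("SELECT grp, AVG(val) FROM t GROUP BY grp"))

# 2) Data frames + functions (sketched with plain dicts of columns):
frame = {"grp": [g for g, _ in data], "val": [v for _, v in data]}
by_group = {}
for g, v in zip(frame["grp"], frame["val"]):
    by_group.setdefault(g, []).append(v)
df_result = {g: sum(vs) / len(vs) for g, vs in by_group.items()}

assert sql_result == df_result  # same answer, different abstraction
```

The computation is identical; what differs is which abstraction the user thinks in, and which optimizations the system can apply behind it.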
Data Access: Hitting a Wall (Desktop ↔ Cloud)

• Current practice based on data download (FTP/GREP) will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh, and 1 PB ≈ 5,000 disks
• You can FTP 1 MB in 1 second
• You can FTP 1 GB in a minute (~$1) … 1 TB in 2 days and $1K … 1 PB in 3 years and $1M
• At some point you need indices to limit search
• Parallel data search and analysis
This is where databases can help
[slide src: Jim Gray]
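The slide's punchline, that at some point you need indices, can be sketched by counting comparisons: a GREP-style scan touches every record, while an index touches only O(log n) of them. A sketch under assumptions: synthetic integer keys, with a sorted array plus binary search standing in for a real B-tree index:

```python
# Why indices matter: full scan vs. indexed lookup, counting comparisons.
# (A sorted array + binary search stands in for a database index --
# an illustrative sketch, not a benchmark.)
n = 1_000_000
records = list(range(n))  # pretend each int is a record key

def full_scan(key):
    """GREP-style search: compare against every record until found."""
    steps = 0
    for r in records:
        steps += 1
        if r == key:
            break
    return steps

def index_probes(key):
    """Index-style search: binary search over the sorted keys."""
    lo, hi, probes = 0, len(records), 0
    while lo < hi:
        probes += 1
        mid = (lo + hi) // 2
        if records[mid] < key:
            lo = mid + 1
        elif records[mid] > key:
            hi = mid
        else:
            break
    return probes

print(full_scan(n - 1), "comparisons without an index")
print(index_probes(n - 1), "probes with an index")
```

At a million records the scan needs a million comparisons where the index needs about twenty; at a petabyte the gap is what separates "3 years" from "interactive".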
Hackers vs. Analysts

• The US faces a shortage of 140,000 to 190,000 people "with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."
-- McKinsey Global Institute
Biologists are beginning to write very complex queries (rather than
relying on staff programmers)
Structs ↔ Stats

"80% of analytics is sums and averages"
-- Aaron Kimball, WibiData
Three types of tasks
1) Preparing to run a model
Gathering, cleaning, integrating, restructuring, transforming, loading,
filtering, deleting, combining, merging, verifying, extracting, shaping,
massaging
2) Running the model
3) Interpreting the results
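The three task types can be sketched on a toy example in which the "model" really is just an average, per the Kimball quote above (the sensor readings and the threshold are invented for illustration):

```python
# Hypothetical raw sensor readings, with the usual real-world mess.
raw = ["12.1", "11.8", "", "13.0", "bad", "12.4"]

# 1) Preparing: cleaning, filtering, transforming.
def clean(values):
    out = []
    for v in values:
        try:
            out.append(float(v))
        except ValueError:
            pass  # drop malformed entries
    return out

readings = clean(raw)

# 2) Running the "model" -- here, just a sum and an average.
mean = sum(readings) / len(readings)

# 3) Interpreting the result against a (made-up) expected range.
verdict = "in range" if 10.0 <= mean <= 14.0 else "out of range"
```

Even in this tiny example, step 1 is most of the code, which is exactly the "90% handling data" complaint on the next slide.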
Big Data
• “…no greater barrier to effective data management will exist than the
variety of incompatible data formats, non-aligned data structures, and
inconsistent data semantics.”
Doug Laney, “3-D Data Management: Controlling Data Volume, Velocity
and Variety”, Gartner, 2001
Problem
How much time do you spend “handling data” as opposed to “doing
science”?
Mode answer: “90%”
Databases vs. Other Tools
[figure omitted; src: Christian Grant, MAD Skills]
eScience
"eScience" = "Data Science"
Astronomy, oceanography, and biology = Business

Paradigms of science, in order of emergence:
Empirical → Theoretical → Computational → eScience
Science is about asking questions
• Traditionally: "Query the world" -- data acquisition activities coupled to a specific hypothesis
• eScience: "Download the world" -- data acquired en masse in support of many hypotheses
• The cost of data acquisition has dropped precipitously thanks to advances in
technology
• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Life Sciences: lab automation, high-throughput sequencing
• Oceanography: high-resolution models, cheap sensors, satellites
• The cost of finding, integrating, and analyzing data, then communicating
results, is the new bottleneck
eScience is driven by data more than by computation
Massive volumes of data from sensors and networks of sensors
Apache Point telescope (SDSS)
• 80 TB of raw image data (80,000,000,000,000 bytes) over a 7-year period
Large Synoptic Survey Telescope (LSST)
• 40 TB/day (an SDSS every two days)
• 100+ PB in its 10-year lifetime
Illumina HiSeq 2000 sequencer
• ~1TB/day
• Major labs have 25-100 of these machines
Nodes of the NSF Ocean Observatories Initiative
• 1000 km of fiber optic cable on
the seafloor, connecting
thousands of chemical,
physical, and biological sensors
The Web
• 20+ billion web pages × 20 KB = 400+ TB
• One computer can read 30-35 MB/sec from one disk ⇒ ~4 months just to read the web
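A quick sanity check of the slide's web arithmetic, using the sizes and rates from the bullets above (the 30-day month and the 32.5 MB/s midpoint are my simplifications):

```python
# Back-of-envelope: time for one computer to read the whole web.
pages = 20e9          # 20+ billion pages (from the slide)
page_bytes = 20e3     # ~20 KB each (from the slide)
total = pages * page_bytes          # bytes

rate = 32.5e6         # midpoint of the slide's 30-35 MB/sec
seconds = total / rate
months = seconds / 86_400 / 30      # 30-day months, for simplicity

print(f"{total / 1e12:.0f} TB total")
print(f"~{months:.1f} months to read sequentially")
```

This lands around 400 TB and roughly 4-5 months, consistent with the slide's estimate, and is the same argument for parallel scans and indices made earlier in the desktop-vs-cloud discussion.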
eScience is about the analysis of data
• The automated or semi-automated extraction of knowledge from
massive volumes of data
• There’s simply too much of it to look at
• But it’s not just a matter of volume
• The Three V’s of Big Data:
• Volume: number of rows / objects / bytes
• Variety: number of columns / dimensions / sources
• Velocity: number of rows / bytes per unit time
• More V’s:
• Veracity: Can we trust this data?
Summary
• Science is in the midst of a generational shift from a data-poor
enterprise to a data-rich enterprise
• Data analysis has replaced data acquisition as the new bottleneck to
discovery
• What does this have to do with business?
Business is beginning to look a lot like science
• Acquire data aggressively and keep it around
• Hire data scientists
• Make empirical decisions
