Hive Query Language (HQL) is excellent for productivity and enables reuse of SQL skills, but falls short in advanced analytic queries. Hive`s Map & Reduce scripts mechanism lacks the simplicity of SQL and specifying new analysis is cumbersome. We developed SQLWindowing for Hive(SQW) to overcome these issues. SQW introduces both Windowing and Table Functions to the Hive user. SQW appears as a HQL extension with table functions and windowing clauses interspersed with HQL. This means the user stays within a SQL-like interface, while simultaneously having these capabilities available. SQW has been published as an open source project. It is available as both a CLI and an embeddable jar with a simple query API. There are pre-built functions for windowing to do Ranking, Aggregation, Navigation and Linear Regression. There are Table functions to do Time Series Analysis, Allocations, and Data Densification. Functions can be chained for more complex analysis. Under the covers MR mechanics are used to partition and order data. The fundamental interface is the tableFunction, whose core job is to operate on data partitions. Function implemenations are isolated from MR mechanics, focus purely on computation logic. Groovy scripting can be used for core implementation and parameterizing behavior. Writing functions typically involves extending one of the existing Abstract functions.
1 of 54
More Related Content
Analytical Queries with Hive: SQL Windowing and Table Functions
2. Agenda
Ø Why
Ø What are Partitioned Table Functions (PTFs)?
Ø Why are they interesting?
Ø What
Ø Our solution
Ø Demo
Ø How
Ø Our Implementation, briefly
Ø Expand on concept of PTFs:
Ø Multi Pass and Recursive Algorithms
Ø Next steps and Summary
4. What are PTFs
Ø AreFunction invocations that can appear in place of a
table in SQL.
Contract is Table in à Table out.
Ø Input is partitioned (optionally ordered) .
An instance of a PTF operates on a Partition.
Partitioning drives parallel execution
Ø Similar to MR, but predates MR.
Ø Available in many DBs: Oracle, Aster, DB2 etc.
5. Analytics expressed using PTFs
} Aggregations by Partition à Ranking,Top N.
} Rank products within manufacturer by price
} List three largest census tracts by area within each US county
} Inter row Calculations à Time Series Analysis
} Find occurrences where a flight was more than 15 mins. late, five or
more times in a row
} Multi Pass Algorithms à Market Basket Analysis
} Find items bought frequently together
} Find Web pages visited in the same session
} Graph Algorithms à implemented as Recursive Queries
} Find lowest cost flights between two cities
… exposed in SQL as Table Function invocations
6. Analytics expressed using PTFs
Ø Aster SQL/MR Function Library
Ø Time Series Analysis
Ø Graph Analysis
Ø Fraud Detection
Ø Sessionization
Ø …
7. PTFs: bottom-line
Ø Enable
more interesting Questions more simply in the SQL
context
enable analysis not expressible in SQL
Simplify expressing analysis
Ø Foster
Reuse by providing Function Libraries and bridging to
external engines.
Ø App. Developers expect and rely on this in other DBs.
Ø Our solution:
Ø An attempt to provide this for Hive
Ø Still under development
8. PTF Invocation Example
Example: Market Basket Analysis
Ø Input is a large set of Baskets, each contains a set of Items
Ø Find Items that occur frequently together.
No Standard Form. But typical structure is:
from FrequentItemSets(
Basket
partition by basketId order by itemName,
supportThreshold= 0.15) Ø From clause invokes PTF: FIS
select itemset Ø Fn. told how to partition and
order Input
Ø Other Args.: Support threshold
in this case
Ø No change in other parts of SQL
Basket
ItemSets
BasketId
ItemName
{"items":["apples","baguette"]}
1
Apples
{"items":["apples","corned_b"]}
1
Baguette
{"items":["apples","hering"]}
2
Apples
{"items":["apples","olives"]}
2
Avocado
{"items":["apples"]}
2
Olives
9. SQL Windowing
Ø Used to express Aggregations on Partitions
Ø Further Window expressions enable aggregations on a
Window surrounding a row. So row specific Aggregations.
Ø Functions available:
Ø Ranking: Rank, DenseRank, PercentRank, NTile
Ø Aggregation: Sum, Min, Max, Avg, StdDev,Variance
Ø Navigation: First Value, Last Value, Lead, Lag
Ø Statistics: CoVariance, Linear Regression: Slope, Intercept
Ø Enable expressing
Ø Cumulative Sums
Ø Delta Analysis
Ø Ratios
10. SQL Windowing, as a PTF
Simple Example:
Ø Group Sales data by Channel and Month
Ø Within each Channel: compute Rank, DenseRank, for each
Month by sales amount
Ø Also compute Rank over all Months across Channels
select channel, month, sum(amount),
rank() over (order by sum(amount) desc) AS ra,
denserank() over (partition by channel order by sum(amount) desc) as dr,
rank() over (partition by channel order by sum(amount) desc) as r
From sales
Group by channel, month;
Key Observation:
Ø Processing happens in 3 stages
Ø First everything else is executed: join, group by, and having
clauses
Ø Result Set is made available to Windowing Functions: Partition
by Partition.
Similar to how PTFs are executed!
11. SQL Windowing, as a PTF
If all windowing clauses have the same
partition & order expression,
1. Assume Rank
then this can be expressed as a PTF on all rows not in
original Query
select channel, month, r, dr
denserank() AS dr,
rank() as r 2. Form makes
From <Group By Query> explicit: first
partition by channel order by sum(amount) everything else is
executed; then do
Windowing
In PTF form calculations.
- Assume a PTF fn. called WindowingTableFunction
- Input is the Group By Query
- Args. to function are Windowing Clauses
select channel, month, s, r, dr
From WindowingTableFunction(
<select channel, month, sum(amount_sold) as s
from sales
Group by channel, month>
partition by channel Our Solution:
order by s,
[r : <rank()>, dr: <denserank()>] )
1. Windowing Clause Support at this level
2. Can easily add multiple order support
3. Multiple partition support requires more
thought
13. PTFs with Hive
} To use
} Download jar from https://github.com/hbutani/SQLWindowing/wiki
} Setup windowingCli.sh in bin/ext directory
} Use in CLI mode
} Also usable from API
14. PTFs with Hive CLI
Windowing CLI
Hive CLI
Hive Translator
Windowing
Shell
Ø In CLI enter HQL or PTF Queries
Metadata
Ø WindowingCLI embeds Hive
Hive callbacks: Ø HQL passed to Hive
- Metadata Ø Windowing CLI interacts with Hive for
- Execute Embedded Hive metadata and executing embedded Hive
Queries
Queries
Execute PTF Queries
as MR jobs
Ø PTF works in a similar fashion to Hive
Execute Hive Ø Translates Queries into MR jobs
Queries as MR
jobs
Cluster
15. Query Structure
Query abstraction is a select statement:
Ø Input is a Table Function Invocation
Ø Filter & Project on Table Function output
Ø Table Function call:
Ø Input is Hive Table, Query or another PTF => can chain PTFs
Ø Specify partitioning and order of Input
Ø Other Function Args. Input can be:
• Hive Table
• Hive Query
From PartitionedTableFunction(
Input Specification • Another PTF
Partition by …
Order by …
Function Arguments…
Not shown here:
)
Select (ColumnName | Expression)+ 1. Output of Query can be written to Hive
Where Expression Table or Partition
2. Can this form be a SubQuery in HQL? Not
yet.
16. Query Structure: Windowing Clauses
1. Windowing Clauses one
From Input Specification, variation to Simple Query
form
Partition by …
Order by …
with
windowing clause….,
windowing clause…
Select (ColumnName | Expression)+
Where Expression
Syntactic sugar
for From WindowingTableFunction(
Input Specification
Partition by …
Order by …
Windowing Clauses …
)
Select (ColumnName | Expression)+
Where Expression
17. Query Examples: Basic Query
Rank Parts within Manufacturer by price
Part 1. On TPCH Part table
MfrName
PartName
Price
2. Rank Parts within each Manufacturer by
Manufacturer#1
violet almond 2095.99
Price
orange lavender
peach
Manufacturer#2
yellow magenta 2094.99
gainsboro almond
turquoise
Not so straightforward w/o windowing because no
Manufacturer#2
papaya cream 2095.99
smoke yellow khaki
inter row expressions
Manufacturer#1
pink orange peach 2094.99
1. Rank over all Rows: (Mfr, Part, Overall Rank)
beige steel
2. Rank Min Query: (Mfr, Min(Rank) )
3. Join 1 & 2 on Mfr, subtract Rank from Min
(Rank)
from part_rc
partition by p_mfgr Manufacturer#1
violet almond 2095.99
1
orange lavender
order by p_mfgr, p_retailprice desc
peach
with
Manufacturer#1
pink orange peach 2094.99
2
rank() as r
beige steel
select p_mfgr,p_name, p_retailprice, r Manufacturer#2
papaya cream 2095.99
1
smoke yellow khaki
Manufacturer#2
yellow magenta 2094.99
2
gainsboro almond
turquoise
19. Query Examples: Top N
Calculate the Top 3 Tracts(based on land area) by County.
Census Geo Header data:
County
Tract
AreaLand
SumLev
1. Geography dimension for Census data.
001
451101
300
140
2. Contains data from multiple hierarchies and levels.
001
1200
120
005
000102
35
140
3. Query on County-> Census Tract -> Census Block
004
000200
15
140
hierarchy
4. Summary Level column used to identity level
from <select county, tract, arealand
from geo_header_sf1
where sumlev = 140>
partition by county
order by county, arealand desc
with County
Tract
AreaLand
R
Cum_Area
rank() as r, 001
451101
300
1
300
sum(arealand) over rows
between unbounded preceding and current row as cum_area 001
450701
250
2
550
select county, tract, arealand, r, cum_area
where <r <= 3> 001
441503
150
3
700
005
000102
450
1
450
005
000200
200
2
650
1. Input is a HQL 3. Only output top 2. Sum from start
query 3 rows from each of Partition up to
partition. Current row.
20. PTF Example: NPath
Now example of a PTF: NPath
} Look for patterns in Time
} User specifies Labels: interesting conditions, for e.g. LATE : arr_delay > 15 mins
} Then specifies Patterns on Labels. Patterns are simple Regexes. For e.g.
} LATE.LATE.LATE.LATE.LATE+ à look for occurrences where a flight is 5 or more times late.
} On Occurrences found (Occurrences are a set of rows) specify aggregation calculations. For e.g.
} Average Delay among late occurrences
} Number of delays
Note
} This is a non trivial function to implement.
} But from User point of view just another Function invocation. Can specify Function behavior
through arguments
} Also Query executed in the same way: Partition input, invoke function on each Partition…
21. PTF Example: NPath
Find incidents where a Flight(to NY) has been more than 15 minutes
late 5 or more times in a row. 1. Query on
FlightsData table,
restrict to flights
to NY
from npath(<select origin_city, year, month, day_of_month, arr_delay, fl_num
from flightsdata 2. Looking at data
where dest_city = 'New York' and dep_time != ''> per Flight; order
partition by fl_num within partition
order by year, month, day_of_month, by time
<[LATE : "arr_delay > 15"]>,
'LATE.LATE.LATE.LATE.LATE+',
4. This is very hard in SQL.
<["origin_city", "fl_num", "year", "month", "day_of_month", Remember the LABEL and
["(path.sum() { it.arr_delay})/((double)count)", "double", "avgDelay"], PATTERNS are specified at
["count", "int", "numOfDelays"] Query execution time. So
]> window of analysis is
) dynamic.
select origin_city, fl_num, year, month, day_of_month, avgDelay, numOfDelays
Origin
FlNum
Year
Month
Day
AvgDelay
NumOfDely
3. Boston
1017
2010
10
25
59.37
8
- Arg. 3 specify conditions as LABELS Boston
1017
2010
10
26
58.14
7
- Arg. 4 specify PATTERN Boston
1017
2010
10
28
30.83
6
- Arg. 5 specify AGGR. EXPRESSIONS
Boston
1017
2010
10
29
25.67
5
Pittsburgh
1058
2010
12
26
82.62
8
22. Our Solution: What’s available
Ø Windowing Functions
Ø 21 functions available
Ø For Ranking, Aggregation, Navigation and Statistics
Ø Both Row based and Value based windows.
Ø One Pass PTFs
Ø NPath
Ø Others in the works: Allocation, Deallocation etc. in the bucket of
Lightweight Dimensional Analysis: See wiki for details.
Ø Multi Pass PTFs
Ø Market Basket Analysis: using Dynamic Item Set counting Algorithm.
Ø Plans to do Generalized Transitive Closure.
Looking for your input and help implementing others
24. Query Evaluation: a PTF
Hopefully No Surprise
Ø Shuffle to Partition and Sort
Input Map Ø PTFs work on Partitions instead of
DataSet Splits Rows
Ø PTFs use Groovy for expression
evaluation today.
Writables Shuffle controlled by
+ partition and order
SerDe specification
Output
Partition Partition
Ø A PartitionedTableFunction (PTF) given a
Writables Writables
Partition computes an output Partition.
+ PTF +
Ø An invocation of PTF specifies how input SerDe SerDe
dataset should be partitioned and ordered.
Ø A PTF defines shape of Output. Ø Partitions are containers of Rows.
Ø A PTF may operate on raw data before it is Ø List of Writables + ObjectInspector
partitioned and ordered. Ø Rows exposed as Groovy Binding; Partition
exposed as Groovy iterable.
For details see doc. directory on GitHub Ø All Evaluation in context of Row and Partition
and optionally Window
26. Multi Pass and Recursive Queries PTFs
} So far
} functions perform one pass over the input
} function acts on input after it is partitioned.
} But other use cases require multiple passes on the input
Ø Since each Partition executed independently:
Ø you may need to consolidate Output
Ø and based on consolidation revisit data.
Ø Recursive Queries used for implementing Graph Algorithms
are an important subclass of such problems
Ø In the Context of this talk the focus is:
Ø How do these fit into the PTF model: both from an interface
perspective and also from an execution model
27. Multi Pass PTFs
} Key Observation:
} Execution model can be extended with following changes:
} Partition input and persist
} Repeat à a fixed # of iterations or dynamically determined
} Map-side: operate on persisted Input partitions; do Partition à Partition
} use shuffle to consolidate output across partitions.
} Output from one pass read as Input in next pass.
Ø But from Interface Perspective
Ø End User still sees this as a PTF invocation
Multi Pass mechanics still being worked out
28. Query Evaluation: Multi Pass PTF
repeat
Input
DataSet Map
Splits
Partition Partition
Writables Shuffle controlled by
Partition and
Writables
+ PTF mapside + partition and order
optionally order SerDe SerDe specification
Partition Partition
Ø Input to Function is Partitioned.
Writables Writables
Function is applied on Map-Side
PTF
Ø
+ reduceside
+
Ø Shuffle used to collect Output SerDe SerDe
Ø Process repeated
Ø Pass n reads output from Pass n-1
Output
29. PTF Example: Market Basket Analysis
Ø Frequent ItemSets computed by doing 2 passes (the SON alg.):
Ø Pass 1 compute Frequent ItemSets for each Partition independently
and consolidate across Partitions.
Ø Pass 2 go over input again, eliminate false negatives.
Ø Note: computing Frequent ItemSets is complex
Ø number of possible ItemSets is exponential
Ø Many interesting algorithms: we have implemented Dynamic
Item Counting alg.
Ø See wiki for details….
Ø This is not a big jump from 1 pass
Ø Output of Pass 1 is very small relative to input; so no issue of
communication cost
Ø Only 2 passes involved.
30. PTF Example: Market Basket Analysis
Basket
Implementation and Interface needs work:
BasketId
ItemName
• User exposed to the fact that input initially
1
Apples
partitioned. And then output of function is
1
Baguette
partitioned.
• Also Multi Pass mechanics not ready.
2
Apples
2
Avocado
2
Olives
from candidateFrequentItemSets( from FrequentItemSets(
<select * from basketdata Basket
distribute by basketName partition by basketId order by itemName,
sort by basketName, itemName> supportThreshold= 0.15)
partition by itemset order by itemset, select itemset
'basketName', 'itemName', <0.15>)
select itemset
ItemSets
{"items":["apples","baguette"]}
{"items":["apples","corned_b"]}
{"items":["apples","hering"]}
{"items":["apples","olives"]}
{"items":["apples"]}
31. Recursive Queries as PTFs
} Heart of Graph Algorithms is traversal: the discovery and selection of Paths
} In SQL context Graph held in Table R(Src., Dest.) and traversal expressed
as Relational Operators run iteratively to a fixed point.
} Typically until no new Paths discovered.
} Lot of work in DB community to parallelize these Algorithms:
} Partition work
} Reduce communication à naïve impl. will suffer from high
comm. cost.
} Of late revival of interest:
} HaLoop project: Relation based implementation. Based on tweaking
MR mechanics: changes to JobTracker and TaskTracker.
} Giraph project: Matrix based Direct Algorithms
32. Recursive Queries as PTFs
} In PTF context:
} A Graph Algorithm involves
} An Input Relation R(Source, Destination)
} Output contains Paths that meet a certain criteria.
} A Recursive Query is processed in a fashion very similar to a Multi Pass
Algorithms
} Partition input and persist ß Partition R by Destination
} Repeat
} operate on persisted partitions ß Do map-side Join of R with newly discovered
Paths
} use partitioning to consolidate output across partitions. ß partition output of join (the
new Paths) by Source. Dedup; output only new Paths.
} But for End User still sees this as a PTF invocation
Under the covers could use HaLoop or Giraph to implement Alg.?
33. PTF Example: Transitive Closure
Flights
Simple example
Source
Destination
• List all possible Routes
New York
London
New York
Paris
London
Bombay
London
Dubai
Dubai
Bombay
from transitiveClosure(
<select * from Flights
distribute by Dest>
partition by Src,
‘Src', ‘Dest’)
select Src, Dest Source
Destination
New York
London
New York
Paris
London
Bombay
London
Dubai
Dubai
Bombay
New York Bombay
New York Dubai
34. PTF Example: Generalized TC
But TC Mechanics can be generalized. More Interesting example:
Specify: Input: FlightsTable(Src, Dest, ArrTm, DepTm, Cost)
Ø Path Joining Condition: how to generate new Ø Find me the best routes from any airport in New York to
Paths Bombay.
Ø Path Attributes: how to calculate Aggregation Ø The waits at intermediate points must be between 2 to 5 hours.
attributes on Paths à Sets of Edges Ø Pick the lowest cost flight, but for a direct flight I am willing to
Ø Path selection criteria: how to pick from multiple pay a $100 premium.
Paths between a Source and Destination.
from GeneralizedTC(
from GeneralizedTC( <select * from Flights
<select * from Flights where Src = ‘New York’
distribute by Dest> distribute by Dest>
partition by Src, partition by Src,
SourceColumn, ‘Src’,
DestColumn, ‘Dest’,
Path Joining Condition, Path Joining Condition = <Dest = Path.Src &&
[Path Attributes], ArrTm <= Path.DepTm - 120 &&
Path selection condition, ArrTm >= Path.DepTm - 300>,
select Src, Dest,…. Path Attributes = [
Where … totalCost: <sum(Price)>,
hops : <count(*)>
],
Path selection condition = <
p1.cost <= p2.cost ||
Google for ‘Shaul Dhar Generalized
Transitive Closure in SQL.’
(p1.hops == 0 && p1.cost <= p2.cost + 100)
1993 Phd thesis from Univ. of Wisconsin, >,
Madison. select Src, Dest,….
Where Dest = ‘Bombay’
36. Benefits
Ø Enable
more interesting Questions more simply in the SQL
context
enable analysis not expressible in SQL
Simplify expressing analysis
Ø Foster
Reuse by providing Function Libraries and bridging to
external engines.
Ø App. Developers expect and rely on this in other DBs.
37. Next Steps
Ø Make Windowing clauses closer to standard SQL
Ø Use Hive Expressions Evaluation at runtime instead of Groovy
Ø So why cannot this be part of Hive?
Ø Yes makes a lot of sense. See wiki for a step-by-step plan.
Ø Build out Multi-Pass and Recursive Query mechanics
Ø Enhance PTF interface. Ensure simple ifc for End User. No leakage of
implementation.
Ø Investigate using HaLoop/Giraph as execution model
Ø Flush out function library
40. So Why PTFs?
Ø Usability
Ø Table Functions: enable more interesting Questions more simply
Ø Enable analysis not expressible in SQL
Ø Simplification of existing queries: both syntactically and in performance
Ø replace self-joins with intra row computations
Ø Reusability
Ø Functions are parameterized, so reusable in wide range of contexts
Ø Functions can reshape Output Schema at runtime
Ø Bridge
Ø Since ifc to SQL engine is coarse, execution of PTFs may involve other MPP
engines (multi-pass)
Ø Speed
Ø Partitioned Table Functions enable working at scale by breaking down dataset
into manageable Partitions
41. So Why PTFs?
Ø Use Cases:
Ø Time Series Analysis: NPath
Ø Market Basket Analysis
Ø Lightweight Dimensional Analytics
Ø Graph Algorithms: Transitive Closure
Ø Sessionization
Ø SQL Windowing Clauses
Ø Can be treated as a PTF if all windowing clauses are on the same Partition.
42. PTF Interface
3. Function may have Map
side processing
1.
- Controls Shape of Output
- Responsible for Output
2. Enables chaining of
Functions
43. Partitioned Table Function Mechanics
Node
1. Partition input into n Input DataSet
Partition 0
partitions, optionally
order partitions Func.
2. distribute over set of
processing nodes
Partition and Node
optionally order
3. execute Fn. at each node Partition 0
Next Operator
4. merge data from all Func.
instances and stream
merged result to next
Operator .
.
.
Node
Partition 0
Why?
Ø for Performance: enables parallel
Func.
operation on large datasets
Ø for providing advanced analytics This is not the
same as Hive
Supported in many Databases: Oracle, DB2, Aster etc. UDTF
44. Support in RDBMS: Aster SQL/MR
Problem: Group clickstream data into sessions
Ø Consider all rows of a User as a Partition
Ø Order rows by time
Ø Group rows within 60 seconds of each other as
belonging to a Session.
Benefits
Ø Work at Scale: designed to work over 100s of terabytes
Ø Usability and Reusability:
Ø SQL/MR looks like a table, it can appear anywhere a table can appear in SQL
Ø Enable analysis not expressible in SQL
Ø Simplification of existing queries
Ø Functions are parameterized so usable in wide range of contexts.
Aster Function Library
Ø Time Series Analysis
Ø Graph Analysis
Ø Fraud Detection
Ø Sessionization
Ø ....
45. Support in RDBMS : Oracle
FUNCTION test_ptf (p_cursor IN t_parallel_test_ref_cursor)
RETURN t_parallel_test_tab PIPELINED
PARALLEL_ENABLE(PARTITION p_cursor BY HASH (id));
END parallel_ptf_api;
SELECT sid, count(*)
FROM TABLE(test_ptf(CURSOR(SELECT * FROM input) ) ) t2
GROUP BY sid;
46. Hive Script Operators
But one can do PTFs in Hive using Script Operators
From
(From clicks
Select transform(userid, pageid, ts)
Using ‘/bin/cat’
As (userid, pageid, ts)
Distribute by userid
Sort by userid, ts) map_output
Select transform(userid, pageid, sessionId)
Using ‘your java pgm/python script/… pass in args..’
As (userid, pageid, sessionId)
But
Ø Not as Usable
Ø compare to Aster Query
Ø Painful for App. Developer to use. Imagine trying to
chain functions.
Ø Not very Reusable Analogy
Ø arg passing embedded in strings that spawn process Integration via stdin stdout
And implementation artifacts Vs
Ø Data streamed across process boundaries. An embedded Func Library
Ø Type information passing limited: tab separated files.
47. PTF Example: Hierarchical Evaluation
For a Country, State, City Geo hierarchy compute: % of Country Sales,
Top City Sales, Avg. City Sales
from hierarchyEvaluate(
< select Country, State, City, sum(Sales) from Sales group by Country, State, City>
partition by Country
order by Country, State, City,
<['Country', 'State', 'City']>,
<[
["Sales / Ancestor('Country', 'Sales') * 100.0", "% Country"],
["TopN(Descendants('City'), 'Sales', ,1)", "Top City"],
["Avg(Descendants('City'), Sales)", "Avg. City"]
]>)
select Country, State, City, '% Country', 'Top City', 'Avg. City'
48. Simple PTF implementation
Annotation specifies Function
Args, Name, behavior
Shape based on Function
before it in chain; or the
Query Input if this is the first
function.
49. Query Evaluation: complete picture
Ø A Query is a chain of PTFs.
Ø Input of chain is a Hive Query or table
Ø Windowing clauses are syntax sugar for the
special Windowing Table function.
Ø Query translated to a series of Jobs. Each Job
executes part of the Function Chain
Ø For intermediate steps: the output is written
to hdfs and exposed as a Temporary Table to
be used by the next Job in the Chain.
50. Query Evaluation: Windowing Clauses
Input Map
DataSet Splits
Writables Shuffle controlled by
+ partition and order
SerDe specification
Output
Partition Partition
Windowing Table Function:
Writables Writables
1. Pass Partition to each Agg. Function + PTF +
SerDe SerDe
2. If Window Clause specified, pass a Window
also
3. Collect output from each Agg. Function
4. Union over all Agg. Functions + Input used
to evaluate select list
51. Query Eval.: Market Basket Analysis
Based on the 2 pass SON algorithm as described in the
Mining Massive Datasets book.
Job 1 Node Ø Apply Dynamic Item Counting
with support appropriately
Node
Input Basket Partition 0 scaled down.
DataSet Ø Output Candidate IS from
Partition 0
each Mapper.
Map
Reduce
Distribute by itemset
Partition by basket .
Order by basket, item .
Candidate
.
ItemSets
Job 2 Node
Node
Partition 0
Scan baskets for Partition 0
Candidate Itemsets
Map
Reduce
Final
Distribute by itemset ItemSets
52. Query Eval.: Transitive Closure
Sample invocation
From transitiveClosure(<select X, Y from InputTable cluster by Y>
partition by X) } In the ith iteration:
Select X, Y; } On the Map side a partition of
T is joined with the
corresponding deltaR; this is
First Pass: similar to a MapJoin in Hive.
Generates new Paths.
} On the Reduce side the new
Cluster:
Create ith Partition of R paths from deltaNextR are held
Distribute by X
in a searchable structure; the
Node i
Map Function:
ReduceFunction:
Node j rows of the corresponding R
partition are streamed to mark
T(X, i) O/P (X, Row) O/P Row R(i, Y) any rows in deltaNextR that
repeat are duplicates. At the end a new
file is stored in the
correponding R partition with
the new paths
ith Pass:
Cluster:
1. Read in R(i, Y) Create ith deltaR file
1. Read in dR(I, Y) 2. Perform dRnext – R in R(i, Y)
3. Add dR rows to R
Node i
2. Perform dR join T
Map Function: ReduceFunction:
Node j
T(X, i) O/P (X, Row)
3. Distribute by X
Can we use HaLoop
or Giraph to do
Graph Processing?
53. Can this be part of Hive?
} There are undeniable reasons for doing this, briefly:
} end users would want this functionality inside Hive for reasons of consistent behavior, support etc.
} use of a consistent expression language. For e.g. reuse of Hive functions in Windowing
clauses.
} Implementation wise:
} Windowing is orders of magnitude simpler than Hive, and can benefit from using equivalent components that are in
Hive.
} Avoid the trap of constantly chasing changes in the Hive code base.
} Folding in Table function mechanics may open up optimizations not possible with the approach today.
54. Path to folding into Hive
Possible Path to moving into Hive ( Details here)
Ø Step 1: Move to Hive MR mechanics for Job execution
Ø Step 2: Move to Hive Evaluators and Expressions
Ø Step 3: Introduce the concept of Partition Table Functions in
Hive; allow users to invoke PTFs via an exec Function
mechanism.
Ø Step 4: Allow Table function invocations to appear in Table
Expressions; do AST transformations to translate to HQL
followed by exec PTF.
Ø Step 5: Extend HQL with Windowing Clauses