Pig’s Logical Plan Optimizer

Hello from *sunny* Stockholm!

It’s been almost a month since my last thesis post and as I was hoping, Spring is finally here 😀

It’s been a crazy, busy and productive month though, so I will be updating you on my progress by writing two posts today!

This one is about Pig’s Logical Plan Optimizer. In my previous posts (here and here) I have explained how Pig creates a data-flow graph from the Pig Latin script, the Logical Plan, and then transforms this graph into a set of Map-Reduce jobs. The Logical Plan goes through the first compiler and is transformed into a Physical Plan, and the Physical Plan is then sent to the Map-Reduce compiler, which transforms it into a DAG of Map-Reduce jobs:

[Diagram: Logical Plan → Physical Plan → DAG of Map-Reduce jobs]

An intermediate and quite interesting stage, which is not visible in the above diagram, is the optimization of the Logical Plan. The initial Logical Plan is created by a one-to-one mapping of the Pig Latin statements to Logical Operators. The structure of this plan is of course totally dependent on the scripting skills of the user and can result in highly inefficient execution.

Pig performs a set of transformations on this plan before it compiles it to a Physical one. Most of them are trivial and have been long used in database systems and other high-level languages. However, I think they’re still interesting to discuss in the “Pig context”.

 

Rules, RuleSets, Patterns and Transformers

The base optimizer class is designed to accept a list of RuleSets, i.e. sets of rules. Each RuleSet contains rules that can be applied together without conflicting with each other. Pig applies each rule in a set repeatedly, until no rule is applicable any longer or a maximum number of iterations has been reached. It then moves on to the next set and never returns to a previous one.

Each rule has a pattern and an associated transformer. A pattern is essentially a sub-plan with specific node types. The optimizer tries to find this pattern inside the Logical Plan, and if it exists, we have a match. When a match is found, the optimizer then looks deeper into the matched sub-plan and decides whether the rule fulfils some additional requirements. If it does, the rule is applied and the transformer is responsible for making the corresponding changes to the plan.
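To make this loop concrete, here is a toy sketch in Python. The names and the list-shaped "plan" are my own invention for illustration, nothing like Pig's actual Java classes, but the shape is the same: each rule offers a match step and a transform step, and the optimizer applies the rules in a set until nothing changes (or an iteration cap is hit), then moves on to the next set:

```python
# Toy sketch of a rule-driven optimizer loop (hypothetical names, not
# Pig's real classes). The "plan" is just a list of operator names.

class FilterAboveForeach:
    """Toy rule: swap a 'filter' above an immediately preceding 'foreach'."""

    def match(self, plan):
        # Return the index of a foreach directly followed by a filter,
        # or None if the pattern does not occur in the plan.
        for i in range(len(plan) - 1):
            if plan[i] == "foreach" and plan[i + 1] == "filter":
                return i
        return None

    def transform(self, plan, i):
        # Apply the rewrite in place on the matched sub-plan.
        plan[i], plan[i + 1] = plan[i + 1], plan[i]


def optimize(plan, rule_sets, max_iter=100):
    for rules in rule_sets:          # never return to a previous set
        for _ in range(max_iter):    # bounded fixed-point iteration
            changed = False
            for rule in rules:
                m = rule.match(plan)
                if m is not None:
                    rule.transform(plan, m)
                    changed = True
            if not changed:          # no rule applicable any longer
                break
    return plan


plan = ["load", "foreach", "foreach", "filter", "store"]
optimize(plan, [[FilterAboveForeach()]])
print(plan)  # ['load', 'filter', 'foreach', 'foreach', 'store']
```

Note that two iterations are needed here: the filter bubbles up past one foreach per pass, which is exactly why the repeated application up to a fixed point matters.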

Some extra caution is needed in two places. The current pattern matching logic assumes that all the leaves in the pattern are siblings. You can read more on this issue here. This assumption creates no problems with the existing rules. However, when new rules are designed, it should be kept in mind that the pattern matching logic might need to be changed.

Another point that needs highlighting has to do with the actual Java implementation. When searching for a matching pattern, the match() method will return a list of all matched sub-plans. Each one of them is a subset of the original plan, and the operators returned are the same objects as in the original plan, so any changes a transformer makes to a matched sub-plan are reflected directly in the original plan.

 

Some Examples

  • ColumnMapKeyPrune

This rule prunes columns and map keys that are not needed. More specifically, it removes a column if it is mentioned in the script but never used, and a map key if it is never mentioned in the script.
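The core idea can be sketched in a few lines of toy Python (again, illustrative only, with made-up names): given the set of columns actually referenced downstream, the loader's schema is narrowed to just those:

```python
# Toy sketch of column pruning (hypothetical representation):
# keep only the columns the rest of the script actually references.

def prune_columns(schema, used):
    """schema: ordered column names at the load; used: columns referenced later."""
    return [col for col in schema if col in used]


schema = ["name", "age", "address", "phone"]
used = {"name", "age"}  # the script mentions 'address' and 'phone' nowhere else
print(prune_columns(schema, used))  # ['name', 'age']
```

In real Pig the payoff is that unused fields never need to be deserialized and shipped through the pipeline at all.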

  • FilterAboveForeach

Guess what? It pushes Filter operators above Foreach operators! Before doing so, however, it checks that the fields the Filter's condition uses are present in the predecessor of the Foreach:

[Diagram: FilterAboveForeach]

  • MergeFilter

As you can imagine, it merges two consecutive Filter operators, adding the condition of the second Filter to the condition of the first Filter with an AND operator:

[Diagram: MergeFilter]
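A toy Python sketch of the same idea (my own simplified representation, not Pig's API): whenever two filters are adjacent in the plan, they collapse into one whose condition is the AND of both:

```python
# Toy sketch of MergeFilter: adjacent filters collapse into a single
# filter whose predicate ANDs the two conditions together.

def merge_filters(ops):
    """ops is a list of (operator_name, argument) tuples."""
    out = []
    for name, arg in ops:
        if name == "filter" and out and out[-1][0] == "filter":
            _, prev_cond = out.pop()
            out.append(("filter", f"({prev_cond}) AND ({arg})"))
        else:
            out.append((name, arg))
    return out


ops = [("load", "users"),
       ("filter", "age > 18"),
       ("filter", "country == 'SE'")]
print(merge_filters(ops))
# [('load', 'users'), ('filter', "(age > 18) AND (country == 'SE')")]
```

One pass over the plan fewer means one less operator to evaluate per record.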

  • MergeForeach

This rule merges consecutive Foreach operators, but it’s not as simple as it sounds. There are a few additional requirements that need to be met. For example, if the first Foreach operator has a Flatten in its internal plan, the rule cannot be applied. The optimizer also checks how many times the outputs of the first Foreach are used by the second Foreach. The assumption is that if an output is referred to more than once, the overhead of evaluating the same expression multiple times might cancel out the benefits of applying this rule:

[Diagram: MergeForeach]
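The precondition check is the interesting part, so here is a toy Python sketch of it (hypothetical field names, not Pig's internals): merging is refused when the first Foreach flattens, or when the second Foreach refers to any output of the first more than once:

```python
# Toy sketch of the MergeForeach preconditions: no Flatten in the first
# foreach, and no output of the first foreach used more than once by
# the second (otherwise its expression would be computed repeatedly).

def can_merge(first, second):
    """first: {'outputs': [...], 'has_flatten': bool}; second: {'exprs': [...]}."""
    if first["has_flatten"]:
        return False
    for col in first["outputs"]:
        uses = sum(expr.count(col) for expr in second["exprs"])
        if uses > 1:
            return False  # duplicated expression evaluation after merging
    return True


f1 = {"outputs": ["x"], "has_flatten": False}
print(can_merge(f1, {"exprs": ["x + 1"]}))  # True: 'x' used once
print(can_merge(f1, {"exprs": ["x * x"]}))  # False: 'x' used twice
```

(Counting occurrences in an expression string is of course a crude stand-in for walking the second Foreach's inner plan, but it shows the cost model behind the rule.)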

There are several more optimization rules, but I hope the idea is clear from the examples I have already mentioned. All the optimizations performed at this level are general-purpose transformations, decoupled from the execution engine and the Map-Reduce model. This is no longer true after the transformation to a Physical Plan, and it is why I now understand that the integration alternatives I had in mind in late February are not worth implementing.

 

The reason will become clear with my next post very very soon.

Until then, happy coding 🙂

V.
