Apache Tez

Apache Tez is an open-source distributed data processing framework built on top of Apache Hadoop and tightly integrated with Hadoop YARN. It’s designed to execute complex data processing workflows more efficiently than traditional MapReduce.

Tez improves on the MapReduce paradigm by dramatically increasing processing speed while retaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects such as Apache Hive and Apache Pig use Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

The Apache Tez™ project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

Why use Tez?

• To speed up Hadoop-based processing
• To run complex workflows (DAGs) instead of simple MapReduce jobs
• To improve performance of tools like Apache Hive and Apache Pig
• To reduce disk I/O and latency

When should you use Tez?

Tez is a good fit when:

• You are already using the Hadoop ecosystem
• You use Hive or Pig and want faster performance
• Your jobs involve multiple stages or complex pipelines
• You want better performance than MapReduce without switching frameworks

Not ideal when:

• You want a standalone processing engine (Tez mainly serves as a backend for tools such as Hive and Pig)
• You need real-time streaming processing
• You prefer modern unified engines like Apache Spark

Key features of Tez

• DAG-based execution model (Directed Acyclic Graph)
• Optimized task execution (fewer unnecessary steps)
• Better resource utilization
• Reduced disk writes (intermediate data moves between stages instead of being written to HDFS after every job)
• Tight integration with Hadoop ecosystem
• Reusable containers (less task startup overhead)

Key components of Tez

• DAG (Directed Acyclic Graph): Represents the workflow of tasks
• Vertices: Individual processing steps
• Edges: Data movement between steps
• Tez Application Master: Manages execution of DAGs
• YARN: Handles cluster resource allocation
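
How these components relate can be sketched in a few lines of Python. This is an illustrative model only, not the Tez API (which is Java): the vertex names and the run_dag helper are hypothetical, and the standard-library graphlib module (Python 3.9+) stands in for the ordering work that the Tez Application Master performs on a real cluster.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a vertex (a processing step) and each
# value lists the upstream vertices it reads from (its incoming edges).
dag = {
    "scan_orders": [],
    "scan_customers": [],
    "join": ["scan_orders", "scan_customers"],
    "aggregate": ["join"],
}


def run_dag(graph):
    """Return the vertices in an order that respects every edge."""
    order = []
    for vertex in TopologicalSorter(graph).static_order():
        # A real engine would launch the tasks of this vertex here.
        order.append(vertex)
    return order


if __name__ == "__main__":
    print(run_dag(dag))
```

Because the graph must be acyclic, graphlib raises a CycleError if an edge ever points back to an upstream vertex, which mirrors the "acyclic" requirement of a DAG.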

Advantages

• Faster than MapReduce
• Efficient for multi-stage data processing
• Improves performance of Hive and Pig significantly
• Better resource management
• Reduces latency and overhead

Disadvantages

• Not a standalone system (used behind other tools)
• Less flexible than Spark
• Smaller ecosystem than newer frameworks
• Primarily tied to Hadoop
• Learning curve for DAG-based thinking

Alternatives

• Apache Spark: More general-purpose, faster, widely adopted
• Apache Flink: Better for real-time streaming
• Apache Hadoop MapReduce: Older and less efficient model

Design themes of Tez

The two main design themes for Tez are:

Empowering end users through

• Expressive dataflow definition APIs
• A flexible Input-Processor-Output runtime model
• Data-type-agnostic processing
• Simplified deployment
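
The Input-Processor-Output idea can be illustrated with a small Python sketch. It is a simplified analogy, not the Tez runtime interfaces: the class names ListInput, ListOutput, and SumByKeyProcessor are hypothetical. The point is that the processor holds only the business logic and does not know where its data comes from or where it goes.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

KeyValue = Tuple[str, int]


class ListInput:
    """Hypothetical input: serves key-value pairs from an in-memory list."""

    def __init__(self, records: Iterable[KeyValue]) -> None:
        self._records = list(records)

    def read(self) -> Iterator[KeyValue]:
        return iter(self._records)


class ListOutput:
    """Hypothetical output: collects key-value pairs in memory."""

    def __init__(self) -> None:
        self.records: List[KeyValue] = []

    def write(self, record: KeyValue) -> None:
        self.records.append(record)


class SumByKeyProcessor:
    """Hypothetical processor: sums values per key, independent of I/O."""

    def run(self, source: ListInput, sink: ListOutput) -> None:
        totals: Dict[str, int] = {}
        for key, value in source.read():
            totals[key] = totals.get(key, 0) + value
        # Emit in sorted key order so the result is deterministic.
        for key in sorted(totals):
            sink.write((key, totals[key]))


if __name__ == "__main__":
    sink = ListOutput()
    SumByKeyProcessor().run(ListInput([("a", 1), ("b", 2), ("a", 3)]), sink)
    print(sink.records)
```

Swapping ListInput for a class that reads from a file or a network shuffle would leave the processor unchanged, which is the flexibility this runtime model aims for.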

Execution performance

• Performance gains over MapReduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions
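
One way to picture plan reconfiguration at runtime is a downstream stage whose task count is chosen only after the actual size of the upstream output is known. The sketch below is a toy model under that assumption, not how Tez itself is implemented: the function choose_parallelism and its parameters are hypothetical.

```python
def choose_parallelism(observed_bytes: int, bytes_per_task: int, max_tasks: int) -> int:
    """Pick a downstream task count from the observed upstream output size.

    The result is clamped to the range 1..max_tasks, where max_tasks is the
    parallelism that was planned before any data was seen.
    """
    if bytes_per_task <= 0 or max_tasks <= 0:
        raise ValueError("bytes_per_task and max_tasks must be positive")
    # Integer ceiling division: tasks needed to cover the observed data.
    needed = (observed_bytes + bytes_per_task - 1) // bytes_per_task
    return max(1, min(max_tasks, needed))


if __name__ == "__main__":
    # Planned for 100 tasks, but only 1,000 bytes arrived at 256 bytes per task.
    print(choose_parallelism(observed_bytes=1_000, bytes_per_task=256, max_tasks=100))
```

A static plan would have to fix the task count before the job starts; deferring the decision lets a small intermediate result run with far fewer tasks than the worst case planned for.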