Dagster provides the Partition Set abstraction for pipelines where each run deals with a subset of data.
Name | Description |
---|---|
PartitionSetDefinition | The class to define a partition set. |
Partition sets let you define a set of logical "partitions", usually time windows, along with a scheme for building pipeline run config for a partition. Having defined a partition set, you can kick off a pipeline run or set of pipeline runs by simply selecting partitions in the set.
Partitions have two main uses:
Here's a pipeline that computes some data for a given date.
@solid(config_schema={"date": str})
def process_data_for_date(context):
date = context.solid_config["date"]
context.log.info(f"processing data for {date}")
@solid
def post_slack_message(context):
context.log.info("posting slack message")
@pipeline
def my_data_pipeline():
process_data_for_date()
post_slack_message()
The solid process_data_for_date
takes, as config, a string date
. This piece of config will
define which date to compute data for. For example, if we wanted to compute for May 5th, 2020,
we would execute the pipeline with the following config:
solids:
process_data_for_date:
config:
date: "2020-05-05"
You can define a PartitionSetDefinition
that defines the full set of
partitions and how to define the run config for a given partition.
def get_date_partitions():
"""Every day in the month of May, 2020"""
return [Partition(f"2020-05-{str(day).zfill(2)}") for day in range(1, 32)]
def run_config_for_date_partition(partition):
date = partition.value
return {"solids": {"process_data_for_date": {"config": {"date": date}}}}
date_partition_set = PartitionSetDefinition(
name="date_partition_set",
pipeline_name="my_data_pipeline",
partition_fn=get_date_partitions,
run_config_fn_for_partition=run_config_for_date_partition,
)
In Dagit, you can view runs by partition in the Partitions tab of a Pipeline page.
In the "Run Matrix", each column corresponds to one of the partitions in the partition set. Each row corresponds to one of the steps in the pipeline.
You can click on individual boxes to see the history of runs for that step and partition.
You can view and use partitions in the Dagit playground view for a pipeline. In the top bar, you can select from the list of all available partition sets, then choose a specific partition. Within the config editor, the config for the selected partition will be populated.
In the screenshot below, we select the date_partition_set
and the 2020-05-01
partition, and we can see that
the correct run config for the partition has been populated in the editor.