The dagster_databricks package provides two main pieces of functionality:
A resource, databricks_pyspark_step_launcher, which will execute a solid within a Databricks context on a cluster, such that the pyspark resource uses the cluster’s Spark instance.
A function, create_databricks_job_solid, which creates a solid that submits an externally defined, configurable job to Databricks using the ‘Run Now’ API.
Note that, for the databricks_pyspark_step_launcher, either S3 or Azure Data Lake Storage config must be specified for solids to succeed, and the credentials for this storage must be stored as a Databricks Secret and referenced in the resource config so that the Databricks cluster can access storage.
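As a minimal sketch, the storage portion of the step launcher’s resource config could reference a Databricks secret scope along the following lines; the key names shown (secret_scope, access_key_key, secret_key_key) are illustrative assumptions and should be checked against the resource’s config schema.

    # Illustrative only: assumes AWS credentials are stored as Databricks
    # secrets in a scope named "dagster". Verify the exact keys against the
    # databricks_pyspark_step_launcher config schema.
    storage_config = {
        "s3": {
            "secret_scope": "dagster",
            "access_key_key": "aws_access_key_id",
            "secret_key_key": "aws_secret_access_key",
        }
    }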
dagster_databricks.create_databricks_job_solid(name='databricks_job', num_inputs=1, description=None, required_resource_keys=frozenset({'databricks_client'}))
Creates a solid that launches a Databricks job.
As config, the solid accepts a blob of the form described in Databricks’ job API: https://docs.databricks.com/dev-tools/api/latest/jobs.html.
Returns: A solid definition.
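A rough usage sketch follows. It assumes that dagster_databricks exposes a databricks_client resource configured with host and token fields, and that the solid’s config blob is passed through directly as the ‘Run Now’ request body; treat these details as assumptions to verify against the installed package.

    from dagster import ModeDefinition, execute_pipeline, pipeline
    from dagster_databricks import create_databricks_job_solid, databricks_client

    # Build a solid that triggers an existing Databricks job via the 'Run Now' API.
    # num_inputs=0 is assumed valid here so the solid requires no upstream inputs.
    run_databricks_job = create_databricks_job_solid(
        name="run_databricks_job", num_inputs=0
    )

    @pipeline(
        mode_defs=[ModeDefinition(resource_defs={"databricks_client": databricks_client})]
    )
    def my_databricks_pipeline():
        run_databricks_job()

    # The solid's config is assumed to mirror the 'Run Now' request body; job_id
    # is a field from the Databricks jobs API.
    run_config = {
        "resources": {
            "databricks_client": {
                "config": {
                    "host": "https://<workspace>.cloud.databricks.com",
                    "token": "<databricks-token>",
                }
            }
        },
        "solids": {"run_databricks_job": {"config": {"job_id": 42}}},
    }

    execute_pipeline(my_databricks_pipeline, run_config=run_config)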
dagster_databricks.databricks_pyspark_step_launcher ResourceDefinition
Resource for running solids as a Databricks Job.
When this resource is used, the solid will be executed in Databricks using the ‘Run Submit’ API. Pipeline code will be zipped up and copied to a directory in DBFS along with the solid’s execution context.
Use the ‘run_config’ configuration to specify the details of the Databricks cluster used, and the ‘storage’ key to configure persistent storage on that cluster. Storage is accessed by setting the credentials in the Spark context, as described in the Databricks documentation for S3 and for ADLS.
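Putting this together, a hedged sketch of a mode that runs solids on Databricks might look like the following; the nested keys under ‘run_config’ and ‘storage’, and the databricks_host/databricks_token field names, are illustrative assumptions rather than the exact config schema.

    from dagster import ModeDefinition, execute_pipeline, pipeline, solid
    from dagster_databricks import databricks_pyspark_step_launcher
    from dagster_pyspark import pyspark_resource

    databricks_mode = ModeDefinition(
        name="databricks",
        resource_defs={
            # The step launcher is bound to the 'pyspark_step_launcher' key
            # required by the solid below.
            "pyspark_step_launcher": databricks_pyspark_step_launcher,
            "pyspark": pyspark_resource,
        },
    )

    @solid(required_resource_keys={"pyspark", "pyspark_step_launcher"})
    def count_rows(context):
        # Executes on the Databricks cluster and uses the cluster's Spark session.
        return context.resources.pyspark.spark_session.range(100).count()

    @pipeline(mode_defs=[databricks_mode])
    def my_pipeline():
        count_rows()

    # Illustrative resource config: verify every nested key against the
    # resource's config schema before use.
    run_config = {
        "resources": {
            "pyspark_step_launcher": {
                "config": {
                    "databricks_host": "https://<workspace>.cloud.databricks.com",
                    "databricks_token": "<databricks-token>",
                    "run_config": {
                        "run_name": "dagster-step",
                        "cluster": {
                            "new": {
                                "num_workers": 2,
                                "spark_version": "7.3.x-scala2.12",
                                "node_type_id": "i3.xlarge",
                            }
                        },
                    },
                    "storage": {
                        "s3": {
                            "secret_scope": "dagster",
                            "access_key_key": "aws_access_key_id",
                            "secret_key_key": "aws_secret_access_key",
                        }
                    },
                }
            }
        }
    }

    execute_pipeline(my_pipeline, run_config=run_config, mode="databricks")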