Skip to main content
Version: Next

DataHub

DataHub

DataHub Rest

For context on getting started with ingestion, check out our metadata ingestion guide.

Setup

To install this plugin, run pip install 'acryl-datahub[datahub-rest]'.

Capabilities

Pushes metadata to DataHub using the GMS REST API. The advantage of the REST-based interface is that any errors can immediately be reported.

Quickstart recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide. This should point to the GMS server.

source:
# source configs
sink:
type: "datahub-rest"
config:
server: "http://localhost:8080"

If you are connecting to a hosted DataHub Cloud instance, your sink will look like

source:
# source configs
sink:
type: "datahub-rest"
config:
server: "https://<your-instance>.acryl.io/gms"
token: <token>

If you are running the ingestion in a container in docker and your GMS is also running in docker then you should use the internal docker hostname of the GMS pod. Usually it would look something like

source:
# source configs
sink:
type: "datahub-rest"
config:
server: "http://datahub-gms:8080"

If GMS is running in a kubernetes pod deployed through the helm charts and you are trying to connect to it from within the kubernetes cluster then you should use the Kubernetes service name of GMS. Usually it would look something like

source:
# source configs
sink:
type: "datahub-rest"
config:
server: "http://datahub-datahub-gms.datahub.svc.cluster.local:8080"

If you are using UI based ingestion then where GMS is deployed decides what hostname you should use.

Config details

Note that a . is used to denote nested fields in the YAML recipe.

FieldRequiredDefaultDescription
serverURL of DataHub GMS endpoint.
timeout_sec30Per-HTTP request timeout.
retry_max_times1Maximum times to retry if HTTP request fails. The delay between retries is increased exponentially
retry_status_codes[429, 502, 503, 504]Retry HTTP request also on these status codes
tokenBearer token used for authentication.
extra_headersExtra headers which will be added to the request.
max_threads15Experimental: Max parallelism for REST API calls
ca_certificate_pathPath to server's CA certificate for verification of HTTPS communications
client_certificate_pathPath to client's CA certificate for HTTPS communications
disable_ssl_verificationfalseDisable ssl certificate validation

DataHub Kafka

For context on getting started with ingestion, check out our metadata ingestion guide.

Setup

To install this plugin, run pip install 'acryl-datahub[datahub-kafka]'.

Capabilities

Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based interface is that it's asynchronous and can handle higher throughput.

Quickstart recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
# source configs

sink:
type: "datahub-kafka"
config:
connection:
bootstrap: "localhost:9092"
schema_registry_url: "http://localhost:8081"

Config details

Note that a . is used to denote nested fields in the YAML recipe.

FieldRequiredDefaultDescription
connection.bootstrapKafka bootstrap URL.
connection.producer_config.<option>Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.SerializingProducer
connection.schema_registry_urlURL of schema registry being used.
connection.schema_registry_config.<option>Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
topic_routes.MetadataChangeEventMetadataChangeEventOverridden Kafka topic name for the MetadataChangeEvent
topic_routes.MetadataChangeProposalMetadataChangeProposalOverridden Kafka topic name for the MetadataChangeProposal

The options in the producer config and schema registry config are passed to the Kafka SerializingProducer and SchemaRegistryClient respectively.

For a full example with a number of security options, see this example recipe.

DataHub Lite (experimental)

A sink that provides integration with DataHub Lite for local metadata exploration and serving.

Setup

To install this plugin, run pip install 'acryl-datahub[datahub-lite]'.

Capabilities

Pushes metadata to a local DataHub Lite instance.

Quickstart recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
# source configs
sink:
type: "datahub-lite"

By default, datahub-lite uses a DuckDB database and will write to a database file located under ~/.datahub/lite/.

To configure the location, you can specify it directly in the config:

source:
# source configs
sink:
type: "datahub-lite"
config:
type: "duckdb"
config:
file: "<path_to_duckdb_file>"
note

DataHub Lite currently doesn't support stateful ingestion, so you'll have to turn off stateful ingestion in your recipe to use it. This will be fixed shortly.

Config details

Note that a . is used to denote nested fields in the YAML recipe.

FieldRequiredDefaultDescription
typeduckdbType of DataHub Lite implementation to use
config{"file": "~/.datahub/lite/datahub.duckdb"}Config dictionary to pass through to the DataHub Lite implementation. See below for fields accepted by the DuckDB implementation

DuckDB Config Details

FieldRequiredDefaultDescription
file"~/.datahub/lite/datahub.duckdb"File to use for DuckDB storage
options{}Options dictionary to pass through to DuckDB library. See the official spec for the options supported by DuckDB.