Removing Metadata from DataHub
To follow this guide, you'll need the DataHub CLI.
There are two ways to delete metadata from DataHub:
- Delete metadata attached to entities by providing a specific urn or filters that identify a set of urns (delete CLI).
- Delete metadata created by a single ingestion run (rollback).
- Always use --dry-run to test your delete command before executing it.
- Prefer reversible soft deletes (--soft) over irreversible hard deletes (--hard).
Delete CLI Usage
Deleting metadata using DataHub's CLI is a simple, systems-level action. If you attempt to delete an entity with children, such as a container, it will not delete those children. Instead, you will need to delete each child by URN in addition to deleting the parent.
All the commands below support the following options:
- -n/--dry-run: Execute a dry run instead of the actual delete.
- --force: Skip confirmation prompts.
Selecting entities to delete
You can either provide a single urn to delete, or use filters to select a set of entities to delete.
# Soft delete a single urn.
datahub delete --urn "<my urn>"
# Soft delete using a filter.
datahub delete --platform snowflake
# Filters can be combined, which will select entities that match all filters.
datahub delete --platform looker --entity-type chart
datahub delete --platform bigquery --env PROD
# You can also do recursive deletes for container and dataPlatformInstance entities.
datahub delete --urn "urn:li:container:f76..." --recursive
When performing hard deletes, you can optionally add the --only-soft-deleted flag to only hard delete entities that were previously soft deleted.
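When deletes are scripted, the selection options above can be assembled programmatically. The following is a minimal Python sketch, not part of DataHub: build_delete_args is a hypothetical helper that only uses flags shown in this guide and defaults to a dry run. The rule that at least one of a urn or a filter must be given is an assumption of the sketch.

```python
from typing import List, Optional


def build_delete_args(
    urn: Optional[str] = None,
    platform: Optional[str] = None,
    env: Optional[str] = None,
    entity_type: Optional[str] = None,
    recursive: bool = False,
    hard: bool = False,
    only_soft_deleted: bool = False,
    dry_run: bool = True,
) -> List[str]:
    """Assemble an argv list for `datahub delete` (hypothetical helper)."""
    # Assumption of this sketch: refuse to build a command that selects nothing.
    if urn is None and not (platform or env or entity_type):
        raise ValueError("provide a urn or at least one filter")

    args = ["datahub", "delete"]
    if urn is not None:
        args += ["--urn", urn]
    if platform:
        args += ["--platform", platform]
    if env:
        args += ["--env", env]
    if entity_type:
        args += ["--entity-type", entity_type]
    if recursive:
        args.append("--recursive")
    if only_soft_deleted:
        args.append("--only-soft-deleted")
    if hard:
        args.append("--hard")
    if dry_run:
        # Dry run is the default here so nothing is deleted by accident.
        args.append("--dry-run")
    return args


if __name__ == "__main__":
    print(build_delete_args(platform="looker", entity_type="chart"))
```

Because the result is a list rather than a shell string, it can be passed to subprocess.run(args, check=True) without any extra quoting of the urn.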
Performing the delete
Soft delete an entity (default)
By default, the delete command will perform a soft delete.
This will set the status aspect's removed field to true, which will hide the entity from the UI. However, you'll still be able to view the entity's metadata in the UI with a direct link.
# The `--soft` flag is redundant since it's the default.
datahub delete --urn "<urn>" --soft
# or using a filter
datahub delete --platform snowflake --soft
Hard delete an entity
This will physically delete all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
datahub delete --urn "<my urn>" --hard
# or using a filter
datahub delete --platform snowflake --hard
As of datahub v0.10.2.3, hard deleting tags, glossary terms, users, and groups will also remove references to those entities across the metadata graph.
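To make the difference between soft and hard deletes concrete, here is a toy in-memory model in Python. It is only an illustration of the behaviour described above and is not how DataHub stores metadata. The store layout and all function names are hypothetical.

```python
from typing import Dict

# Toy store: urn -> {aspect name -> aspect value}. Purely illustrative.
Store = Dict[str, Dict[str, dict]]


def soft_delete(store: Store, urn: str) -> None:
    """Mark the entity as removed; every other aspect stays in place."""
    store[urn]["status"] = {"removed": True}


def restore(store: Store, urn: str) -> None:
    """Undo a soft delete in this toy model by clearing the removed flag."""
    store[urn]["status"] = {"removed": False}


def hard_delete(store: Store, urn: str) -> None:
    """Drop every aspect of the entity; nothing is left to restore."""
    store.pop(urn, None)


def is_listed(store: Store, urn: str) -> bool:
    """An entity is listed only if it exists and is not marked removed."""
    aspects = store.get(urn)
    if aspects is None:
        return False
    return not aspects.get("status", {}).get("removed", False)


if __name__ == "__main__":
    store: Store = {"urn:li:tag:Legacy": {"tagProperties": {"name": "Legacy"}}}
    soft_delete(store, "urn:li:tag:Legacy")
    print(is_listed(store, "urn:li:tag:Legacy"), store)
    hard_delete(store, "urn:li:tag:Legacy")
    print(store)
```

In the sketch, a soft-deleted entity keeps all of its aspects and can be brought back, while a hard-deleted entity is gone from the store entirely.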
Hard delete a timeseries aspect
It's also possible to delete a range of timeseries aspect data for an entity without deleting the entire entity.
For these deletes, the aspect and time ranges are required. You can delete all data for a timeseries aspect by providing --start-time min --end-time max.
datahub delete --urn "<my urn>" --aspect <aspect name> --start-time '-30 days' --end-time '-7 days'
# or using a filter
datahub delete --platform snowflake --entity-type dataset --aspect datasetProfile --start-time '0' --end-time '2023-01-01'
The start and end time fields filter on the timestampMillis field of the timeseries aspect. Allowed start and end time formats:
- YYYY-MM-DD: a specific date
- YYYY-MM-DD HH:mm:ss: a specific timestamp, assumed to be in UTC unless otherwise specified
- +/-<number> <unit> (e.g. -7 days): a relative time, where <number> is an integer and <unit> is one of days, hours, minutes, seconds
- ddddddddd (e.g. 1684384045): a unix timestamp
- min, max, now: special keywords
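The formats above can be illustrated with a small Python parser. This is a sketch of how such values could be interpreted, not the CLI's actual implementation: parse_time_bound and to_millis are hypothetical names, and mapping min and max to the smallest and largest representable datetimes is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Relative times such as "-7 days" or "+2 hours".
_RELATIVE = re.compile(r"^([+-]\d+)\s+(days|hours|minutes|seconds)$")
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_time_bound(text: str, now: Optional[datetime] = None) -> datetime:
    """Interpret one start/end time value as a UTC datetime (illustrative)."""
    now = now or datetime.now(timezone.utc)
    text = text.strip()

    # Special keywords. The min/max mapping is an assumption of this sketch.
    if text == "now":
        return now
    if text == "min":
        return datetime.min.replace(tzinfo=timezone.utc)
    if text == "max":
        return datetime.max.replace(tzinfo=timezone.utc)

    # Unix timestamp in seconds, e.g. "1684384045".
    if text.isdigit():
        return datetime.fromtimestamp(int(text), tz=timezone.utc)

    # Relative time, e.g. "-7 days".
    match = _RELATIVE.match(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now + timedelta(**{unit: amount})

    # Absolute date or timestamp, assumed to be UTC.
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized time value: {text!r}")


def to_millis(moment: datetime) -> int:
    """Convert to epoch milliseconds, the unit of the timestampMillis field."""
    return int((moment - _EPOCH).total_seconds() * 1000)


if __name__ == "__main__":
    for value in ("2023-01-01", "-7 days", "1684384045", "now"):
        print(value, "->", parse_time_bound(value).isoformat())
```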
Delete CLI Examples
Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command.
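Urns contain parentheses and commas, which many shells treat specially. When a command line is generated from a script, Python's standard shlex module can do the quoting. The snippet below is a small illustration using the example urn from this guide.

```python
import shlex

urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"

# shlex.quote wraps the urn so a POSIX shell passes it through as one argument.
command = "datahub delete --urn " + shlex.quote(urn)
print(command)

# Splitting the command the way a POSIX shell would recovers the urn unchanged.
print(shlex.split(command))
```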
Note: All of the commands below support --dry-run and --force (skips confirmation prompts).
Soft delete a single entity
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
Hard delete a single entity
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --hard
Delete everything from the Snowflake DEV environment
datahub delete --platform snowflake --env DEV
Delete everything within a specific Snowflake DB
# You can find your container urn by navigating to the relevant
# DB in the DataHub UI and clicking the "copy urn" button.
datahub delete --urn "urn:li:container:77644901c4f574845578ebd18b7c14fa" --recursive
Delete all BigQuery datasets in the PROD environment
# Note: this will leave BigQuery containers intact.
datahub delete --env PROD --entity-type dataset --platform bigquery
Delete everything within a MySQL platform instance
# The instance name comes from the `platform_instance` config option in the ingestion recipe.
datahub delete --urn 'urn:li:dataPlatformInstance:(urn:li:dataPlatform:mysql,my_instance_name)' --recursive
Delete all pipelines and tasks from Airflow
datahub delete --platform "airflow"
Delete all containers for a particular platform
# Note: this will leave S3 datasets intact.
datahub delete --entity-type container --platform s3
Delete everything in the DEV environment
# This is a pretty broad filter, so make sure you know what you're doing!
datahub delete --env DEV
Delete all Looker dashboards and charts
datahub delete --platform looker
Delete all Looker charts (but not dashboards)
datahub delete --platform looker --entity-type chart
Clean up old datasetProfiles
datahub delete --entity-type dataset --aspect datasetProfile --start-time 'min' --end-time '-60 days'
Delete a tag
# Soft delete.
datahub delete --urn 'urn:li:tag:Legacy' --soft
# Or, using a hard delete. This will automatically clean up all tag associations.
datahub delete --urn 'urn:li:tag:Legacy' --hard
Delete all datasets that match a query
# Note: --query is an advanced feature that can sometimes select extra entities, so use it with caution!
datahub delete --entity-type dataset --query "_tmp"
Hard delete everything in Snowflake that was previously soft deleted
datahub delete --platform snowflake --only-soft-deleted --hard
Deletes using the SDK and APIs
The Python SDK's DataHubGraph client supports deletes via the following methods:
soft_delete_entity
hard_delete_entity
hard_delete_timeseries_aspect
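A minimal sketch of calling these methods from Python follows. It assumes the acryl-datahub package is installed and that a DataHub server is reachable at http://localhost:8080. The helper names connect and delete_entities are hypothetical, and any authentication your deployment requires is left out.

```python
from typing import Iterable, List


def connect(server: str = "http://localhost:8080"):
    """Create a DataHubGraph client. The server address is an assumption."""
    # Imported lazily so delete_entities can be tried out without the package.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

    return DataHubGraph(DatahubClientConfig(server=server))


def delete_entities(graph, urns: Iterable[str], hard: bool = False) -> List[str]:
    """Soft delete (default) or hard delete each urn; return the urns processed."""
    processed = []
    for urn in urns:
        if hard:
            # Irreversible: removes all aspects of the entity.
            graph.hard_delete_entity(urn=urn)
        else:
            # Reversible: marks the entity as removed.
            graph.soft_delete_entity(urn=urn)
        processed.append(urn)
    return processed


if __name__ == "__main__":
    graph = connect()
    delete_entities(
        graph,
        ["urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"],
    )
```

Keeping the client as a parameter makes it easy to swap in a stub when testing deletion logic without touching a real server.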
Deletes via the REST API are also possible, although we recommend using the SDK instead.
# hard delete an entity by urn
curl "http://localhost:8080/entities?action=delete" -X POST --data '{"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"}'
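For scripts that cannot use the SDK, the same request can be built with Python's standard library. This is a sketch mirroring the curl command above. The function name is hypothetical, the Content-Type header is an assumption that the curl example does not set, and authentication is not handled.

```python
import json
import urllib.request


def build_hard_delete_request(
    urn: str, server: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build (but do not send) the POST request that hard deletes one entity."""
    body = json.dumps({"urn": urn}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server}/entities?action=delete",
        data=body,
        method="POST",
        # Assumption: the server accepts a JSON content type for this body.
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_hard_delete_request(
        "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
    )
    # Sending the request performs an irreversible hard delete.
    with urllib.request.urlopen(request) as response:
        print(response.status)
```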
Rollback Ingestion Run
The second way to delete metadata is to identify entities (and the aspects affected) by using an ingestion run-id. Whenever you run datahub ingest -c ..., all the metadata ingested with that run will have the same run id.
To view the ids of the most recent set of ingestion batches, execute
datahub ingest list-runs
That will print out a table of all the runs. Once you have an idea of which run you want to roll back, run
datahub ingest show --run-id <run-id>
to see more info about the run.
Alternatively, you can execute a dry-run rollback to achieve the same outcome.
datahub ingest rollback --dry-run --run-id <run-id>
Finally, once you are sure you want to delete this data forever, run
datahub ingest rollback --run-id <run-id>
to roll back all aspects added with this run and all entities created by this run. This deletes both the versioned and the timeseries aspects associated with these entities.
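The inspect-then-rollback sequence above can be scripted. The following Python sketch runs the show and dry-run steps for a known run id and only issues the real rollback after an explicit confirmation. The function name and the confirm callback are hypothetical. The commands themselves are the ones shown in this section.

```python
import subprocess
from typing import Callable, List


def inspect_then_rollback(
    run_id: str,
    confirm: Callable[[], bool],
    run: Callable[..., object] = subprocess.run,
) -> List[List[str]]:
    """Show a run, dry-run its rollback, then roll back only if confirmed."""
    executed: List[List[str]] = []

    steps = [
        ["datahub", "ingest", "show", "--run-id", run_id],
        ["datahub", "ingest", "rollback", "--dry-run", "--run-id", run_id],
    ]
    for argv in steps:
        run(argv, check=True)
        executed.append(argv)

    # The real rollback deletes data, so it is gated on the caller's decision.
    if confirm():
        final = ["datahub", "ingest", "rollback", "--run-id", run_id]
        run(final, check=True)
        executed.append(final)
    return executed


if __name__ == "__main__":
    inspect_then_rollback(
        "<run-id>",
        confirm=lambda: input("Roll back this run? [y/N] ").lower() == "y",
    )
```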
Unsafe Entities and Rollback
In some cases, entities that were initially ingested by a run might have had further modifications to their metadata (e.g. adding terms, tags, or documentation) through the UI or other means. During a rollback of the ingestion that initially created these entities (technically, if the key aspect for these entities is being rolled back), the ingestion process will analyse the metadata graph for aspects that will be left "dangling" and will:
- Leave these aspects untouched in the database, and soft delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
- Save information about these unsafe entities as a CSV, so that operators can review it later and decide on next steps (keep or remove).
The rollback command will report how many entities have such aspects and save the urns of these entities as a CSV under a rollback reports directory, which defaults to rollback_reports under the current directory where the cli is run, and can be configured further using the --reports-dir command line arg.
The operator can use datahub get --urn <> to inspect the aspects that were left behind and either keep them (do nothing) or delete the entity (and its aspects) completely using datahub delete --urn <urn> --hard. If the operator wishes to remove all the metadata associated with these unsafe entities, they can re-issue the rollback command with the --nuke flag.
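Reviewing the unsafe entities can be partly automated. The Python sketch below collects urns from the CSV files in the reports directory and prepares one datahub get command per urn. The exact file names and column layout of the reports are not described in this guide, so the sketch assumes nothing about them beyond the presence of urns: it scans every cell of every CSV file for values starting with urn:li:. Both function names are hypothetical.

```python
import csv
from pathlib import Path
from typing import List


def urns_from_reports(reports_dir: str = "rollback_reports") -> List[str]:
    """Collect unique urns found in any CSV file under the reports directory."""
    urns: List[str] = []
    # Assumption: the report layout is unknown, so search all CSVs recursively.
    for path in sorted(Path(reports_dir).rglob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                for cell in row:
                    if cell.startswith("urn:li:") and cell not in urns:
                        urns.append(cell)
    return urns


def inspect_commands(urns: List[str]) -> List[List[str]]:
    """Build one `datahub get --urn ...` argv per urn for manual review."""
    return [["datahub", "get", "--urn", urn] for urn in urns]


if __name__ == "__main__":
    for argv in inspect_commands(urns_from_reports()):
        print(argv)
```

The sketch only prepares read-only inspection commands. Deciding whether to keep an entity, hard delete it, or re-run the rollback with --nuke remains a manual step.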