Retention and Expiration

About

A data retention and expiration management subsystem for CrateDB, implementing multiple strategies.

Details

The application manages the life-cycle of data stored in CrateDB, handling concerns of data expiry, size reduction, and archival. Within a system storing and processing large amounts of data, it is crucial to manage data flows between “hot”, “warm”, and “cold” storage types better than using ad hoc solutions.

Data retention policies can be flexibly configured by adding records to the retention policy database table, which is also stored within CrateDB.

Background

With other databases, this technique, or variants thereof, are known as rolling up historical data, downsampling a time series data stream, downsampling and data retention, or just downsampling.

They are useful to reduce the storage size of historical data by decreasing its resolution, where this is acceptable related to your needs.

ES: The Rollup functionality summarizes old, high-granularity data into a reduced granularity format for long-term storage. By “rolling” the data up, historical data can be compressed greatly compared to the raw data.

Side looks

In classical OLAP operations,

a roll-up involves summarizing the data along a dimension. The summarization rule might be an aggregate function, such as computing totals along a hierarchy or by applying a set of formulas.

In classical dashboarding applications, reducing the amount of timeseries-data on-demand is also applicable.

When rendering data to your screen, and its density is larger than the amount of pixels available to display, data is reduced by using time bucketing for grouping records into equal-sized time ranges, before applying a resampling function on it. The width of those time ranges is usually automatically derived from the zoom level, i.e. the total time range the user is looking at.

Contrary to other techniques which may compute data on-demand, the operations performed by this application permanently reduce the amount, size, or resolution of data.

Details

The retention policy database table is also stored within CrateDB.

By default, the ext schema is used for that, so the effective full-qualified database table name is "ext"."retention_policy". The schema is configurable by using the --schema command-line option, or the CRATEDB_EXT_SCHEMA environment variable.

Strategies

This section enumerates the available data retention and expiration strategies. More strategies can be added, and corresponding contributions are welcome.

DELETE

A retention policy algorithm that deletes records from expired partitions.

cratedb-retention create-policy --strategy=delete \
  --table-schema=doc --table-name=raw_metrics \
  --partition-column=ts_day --retention-period=1 \
  "${CRATEDB_URI}"

This retention policy implements the following directive.

Delete all data from the "doc"."raw_metrics" table, on partitions defined by the column ts_day, which is older than 1 day at the given cutoff date when running the retention job.

implementation | tutorial

REALLOCATE

A retention policy algorithm that reallocates expired partitions from “hot” nodes to “warm” nodes.

Because each cluster member is assigned a designated node type by using the -Cnode.attr.storage=hot|warm parameter, this strategy is only applicable in cluster/multi-node scenarios.

On the data expiration run, corresponding partitions will get physically moved to cluster nodes of the warm type, which are mostly designated archive nodes, with large amounts of storage space.

cratedb-retention create-policy --strategy=reallocate \
  --table-schema=doc --table-name=raw_metrics \
  --partition-column=ts_day --retention-period=60 \
  --reallocation-attribute-name=storage --reallocation-attribute-value=warm \
  "${CRATEDB_URI}"

This retention policy implements the following directive.

Reallocate data from the "doc"."raw_metrics" table, on partitions defined by the column ts_day, which is older than 60 days at the given cutoff date, to nodes tagged with the storage=warm attribute.

implementation | tutorial

SNAPSHOT

A retention policy algorithm that snapshots expired partitions to a repository, and prunes them from the database afterwards. It is suitable for long-term data archival purposes.

In CrateDB jargon, a repository is a storage unit, where data in form of snapshots can be exported to, and imported from. A repository can be located on a filesystem, or on a supported object storage backend. CrateDB is able to use buckets on S3-compatible storage backends, or on Azure blob storage, using the CREATE REPOSITORY ... TYPE =  s3|azure|fs SQL statement.

CREATE REPOSITORY
    export_cold
TYPE
    s3
WITH (
    protocol   = 'https',
    endpoint   = 's3-store.example.org:443',
    access_key = '<USERNAME>',
    secret_key = '<PASSWORD>',
    bucket     = 'cratedb-cold-storage'
);
cratedb-retention create-policy --strategy=snapshot \
  --table-schema=doc --table-name=sensor_readings \
  --partition-column=time_month --retention-period=365 \
  --target-repository-name=export_cold \
  "${CRATEDB_URI}"

This retention policy implements the following directive.

Run a snapshot on the data within the "doc"."sensor_readings" table, on partitions defined by the column time_month, which is older than 365 days at the given cutoff date, to a previously created repository called export_cold. Delete the corresponding data afterwards.

implementation | tutorial

Data model

SQL DDL schema

The database schema for the retention policy table is defined like this. Each record represents a rule how to expire and retain data per table. At runtime, each record will correspond to a dedicated retention job task.

-- Set up the retention policy database table schema.
CREATE TABLE IF NOT EXISTS "ext"."retention_policy" (

    -- Strategy to apply for data retention.
    "strategy" TEXT NOT NULL,

    -- Tags: For grouping, multi-tenancy, and more.
    "tags" OBJECT(DYNAMIC),

    -- Source: The database table operated upon.
    "table_schema" TEXT,                        -- The source table schema.
    "table_name" TEXT,                          -- The source table name.
    "partition_column" TEXT NOT NULL,           -- The source table column name used for partitioning.

    -- Retention parameters.
    "retention_period" INTEGER NOT NULL,        -- Retention period in days. The number of days data gets
                                                -- retained before applying the retention policy.

    -- Target: Where data is moved/relocated to.

    -- Targeting specific nodes.
    -- You may want to designate dedicated nodes to be responsible for "hot" or "warm" storage types.
    -- To do that, you can assign attributes to specific nodes, effectively tagging them.
    -- https://crate.io/docs/crate/reference/en/latest/config/node.html#custom-attributes
    "reallocation_attribute_name" TEXT,         -- Name of the node-specific custom attribute.
    "reallocation_attribute_value" TEXT,        -- Value of the node-specific custom attribute.

    -- Targeting a repository.
    "target_repository_name" TEXT,              -- The name of a repository created with `CREATE REPOSITORY ...`.

    PRIMARY KEY ("table_schema", "table_name", "strategy")
)
CLUSTERED INTO 1 SHARDS;

Record examples

You can add a retention policy by using an SQL statement like this.

-- Add a retention policy using the DELETE strategy.
INSERT INTO "ext"."retention_policy"
  (strategy, table_schema, table_name, partition_column, retention_period)
VALUES
  ('delete', 'doc', 'raw_metrics', 'ts_day', 1);

Using the corresponding CLI command is equivalent to the SQL statement.

cratedb-retention create-policy --strategy=delete \
  --table-schema=doc --table-name=raw_metrics \
  --partition-column=ts_day --retention-period=1 \
  "${CRATEDB_URI}"

After running that command, and creating a record, the program reports about its identifier.

Created new retention policy: adb90608-c20f-4de7-be98-93588d8358dc

You can use it to delete the rule again.

cratedb-retention delete-policy --id=adb90608-c20f-4de7-be98-93588d8358dc \
  "${CRATEDB_URI}"

For enumerating all policies, use the list-policies subcommand.

cratedb-retention list-policies "${CRATEDB_URI}"

Tags

By using tags, you can conveniently define groups of retention policies. This policy is being tagged with both foo, and bar.

cratedb-retention create-policy --strategy=delete \
  --tags=foo,bar \
  --table-schema=doc --table-name=raw_metrics \
  --partition-column=ts_day --retention-period=1 \
  "${CRATEDB_URI}"

In order to select retention policies tagged with foo, use such an SQL statement.

SET error_on_unknown_object_key=false;
SELECT * FROM "ext"."retention_policy" WHERE tags['foo'] IS NOT NULL;

Delete all retention policy records tagged with foo and bar.

DELETE FROM "ext"."retention_policy"
       WHERE tags['foo'] IS NOT NULL
       AND   tags['bar'] IS NOT NULL;
cratedb-retention delete-policy --tags=foo,bar "${CRATEDB_URI}"

In order to use tags when running retention jobs, the program accepts an optional command-line option like --tags=foo,bar.

It will limit the retention policies to all records which contain both tags, i.e. they are combined using AND. Currently, there is no interface to obtain multiple tag values, and combine them using OR. Running retention jobs for multiple tags needs multiple invocations.

cratedb-retention run --strategy=delete --tags=foo,bar "${CRATEDB_URI}"

Install

The CrateDB data retention and expiration toolkit program cratedb-retention can be installed by using the corresponding Python package, or it can alternatively be invoked on the command line using Podman or Docker.

For installing the package, please refer to the install documentation section.

cratedb-retention --version

Usage

The toolkit can be used both as a standalone program, and as a library.

Command-line use

This section outlines how to connect to, and run data retention jobs, both on a database cluster running on your premises, and on CrateDB Cloud.

The steps are the same, only the connection parameters are different.

Todo

Note that, currently, the Python driver and the crash database shell need to obtain slightly different parameters.

By using the --verbose option, directly after the main command, before the subcommand, you can inspect the SQL statements issued to the database.

cratedb-retention --verbose setup "${CRATEDB_URI}"

By using the --dry-run option, after the subcommand, the program will not issue altering or modifying SQL statements to the database. You can use it to learn about which actions would be performed.

cratedb-retention setup --dry-run "${CRATEDB_URI}"

The DBURI command-line argument can be omitted when defining the CRATEDB_URI environment variable.

export CRATEDB_URI='crate://localhost/'

There are also a few shortcuts for subcommands. For example, the shortest form of incantation for deleting a retention policy would be:

cratedb-retention rm --tags=baz

Workstation / on-premise

export CRATEDB_URI='crate://localhost/'
export CRATEDB_HOST='http://localhost:4200/'

CrateDB Cloud

export CRATEDB_URI='crate://admin:<PASSWORD>@<CLUSTERNAME>.aks1.eastus2.azure.cratedb.net:4200?ssl=true'
export CRATEDB_HOST='https://admin:<PASSWORD>@<CLUSTERNAME>.aks1.eastus2.azure.cratedb.net:4200/'

General

Install retention policy bookkeeping tables.

cratedb-retention setup "${CRATEDB_URI}"

Add a retention policy rule.

cratedb-retention create-policy --strategy=delete \
  --table-schema=doc --table-name=raw_metrics \
  --partition-column=ts_day --retention-period=1 \
  "${CRATEDB_URI}"

Invoke the data retention job, expiring all data older than one day.

cratedb-retention run --strategy=delete "${CRATEDB_URI}"

Invoke the data retention job, expiring all data older than one day before a specific cutoff date.

cratedb-retention run --cutoff-day=2023-06-27 --strategy=delete "${CRATEDB_URI}"

Simulate the data retention job, display SQL statements only.

cratedb-retention run --dry-run --strategy=delete "${CRATEDB_URI}"

Podman or Docker

This section offers a complete walkthrough, how work with the cratedb-retention program interactively using Podman or Docker, step by step. It starts a single- node CrateDB instance on your workstation for demonstration purposes. Please note that some retention policy strategies will not work on single-node installations.

export CRATEDB_URI='crate://localhost/'
export CRATEDB_HOST='http://localhost:4200/'
export OCI_IMAGE='ghcr.io/crate/cratedb-retention:nightly'

Display version number.

docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention --version

Start CrateDB.

docker run --rm -it \
  --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 \
  --env=CRATE_HEAP_SIZE=4g crate \
  -Cdiscovery.type=single-node

Install retention policy bookkeeping tables.

docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention setup "${CRATEDB_URI}"

Add a retention policy rule using the DELETE strategy.

docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention create-policy --strategy=delete \
  --table-schema=doc --table-name=raw_metrics \
  --partition-column=ts_day --retention-period=1 \
  "${CRATEDB_URI}"

Invoke the data retention job.

docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention run --strategy=delete "${CRATEDB_URI}"

Library use

This section outlines how to use the toolkit as a library within your own applications. The code displayed below is a stripped-down version of the runnable example program examples/retention_retire_cutoff.py, located within this repository.

from cratedb_toolkit.retention.core import RetentionJob
from cratedb_toolkit.retention import JobSettings, RetentionPolicy, RetentionStrategy
from cratedb_toolkit.retention.setup import setup_schema
from cratedb_toolkit.retention.store import RetentionPolicyStore
from cratedb_toolkit.model import DatabaseAddress

# Define the database URI to connect to.
DBURI = "crate://localhost/"

# A. Set up adapter to retention policy store.
settings = JobSettings(
    database=DatabaseAddress.from_string(DBURI),
)
store = RetentionPolicyStore(settings=settings)

# B. Initialize the subsystem, and create a data retention policy.
# TODO: Refactor to `RetentionPolicyStore`.
setup_schema(settings=settings)

# Add a basic retention policy.
policy = RetentionPolicy(
    strategy=RetentionStrategy.DELETE,
    table_schema="doc",
    table_name="raw_metrics",
    partition_column="ts_day",
    retention_period=1,
)
identifier = store.create(policy, ignore="DuplicateKeyException")

# C. Define job settings, and invoke the data retention job.
settings = JobSettings(
    database=DatabaseAddress.from_string(DBURI),
    strategy=RetentionStrategy.DELETE,
)

job = RetentionJob(settings=settings)
job.start()