Retention and Expiration¶
About¶
A data retention and expiration management subsystem for CrateDB, implementing multiple strategies.
Details¶
The application manages the life-cycle of data stored in CrateDB, handling concerns of data expiry, size reduction, and archival. Within a system storing and processing large amounts of data, it is crucial to manage data flows between “hot”, “warm”, and “cold” storage types better than using ad hoc solutions.
Data retention policies can be flexibly configured by adding records to the retention policy database table, which is also stored within CrateDB.
Background¶
With other databases, this technique, or variants thereof, are known as rolling up historical data, downsampling a time series data stream, downsampling and data retention, or just downsampling.
They are useful to reduce the storage size of historical data by decreasing its resolution, where this is acceptable related to your needs.
ES: The Rollup functionality summarizes old, high-granularity data into a reduced granularity format for long-term storage. By “rolling” the data up, historical data can be compressed greatly compared to the raw data.
Side looks¶
In classical OLAP operations,
a roll-up involves summarizing the data along a dimension. The summarization rule might be an aggregate function, such as computing totals along a hierarchy or by applying a set of formulas.
In classical dashboarding applications, reducing the amount of timeseries-data on-demand is also applicable.
When rendering data to your screen, and its density is larger than the amount of pixels available to display, data is reduced by using time bucketing for grouping records into equal-sized time ranges, before applying a resampling function on it. The width of those time ranges is usually automatically derived from the zoom level, i.e. the total time range the user is looking at.
Contrary to other techniques which may compute data on-demand, the operations performed by this application permanently reduce the amount, size, or resolution of data.
Details¶
The retention policy database table is also stored within CrateDB.
By default, the ext
schema is used for that, so the effective full-qualified
database table name is "ext"."retention_policy"
. The schema is configurable
by using the --schema
command-line option, or the CRATEDB_EXT_SCHEMA
environment variable.
Strategies¶
This section enumerates the available data retention and expiration strategies. More strategies can be added, and corresponding contributions are welcome.
DELETE¶
A retention policy algorithm that deletes records from expired partitions.
cratedb-retention create-policy --strategy=delete \
--table-schema=doc --table-name=raw_metrics \
--partition-column=ts_day --retention-period=1 \
"${CRATEDB_URI}"
This retention policy implements the following directive.
Delete all data from the
"doc"."raw_metrics"
table, on partitions defined by the columnts_day
, which is older than 1 day at the given cutoff date when running the retention job.
REALLOCATE¶
A retention policy algorithm that reallocates expired partitions from “hot” nodes to “warm” nodes.
Because each cluster member is assigned a designated node type by using the
-Cnode.attr.storage=hot|warm
parameter, this strategy is only applicable in
cluster/multi-node scenarios.
On the data expiration run, corresponding partitions will get physically moved to
cluster nodes of the warm
type, which are mostly designated archive nodes, with
large amounts of storage space.
cratedb-retention create-policy --strategy=reallocate \
--table-schema=doc --table-name=raw_metrics \
--partition-column=ts_day --retention-period=60 \
--reallocation-attribute-name=storage --reallocation-attribute-value=warm \
"${CRATEDB_URI}"
This retention policy implements the following directive.
Reallocate data from the
"doc"."raw_metrics"
table, on partitions defined by the columnts_day
, which is older than 60 days at the given cutoff date, to nodes tagged with thestorage=warm
attribute.
SNAPSHOT¶
A retention policy algorithm that snapshots expired partitions to a repository, and prunes them from the database afterwards. It is suitable for long-term data archival purposes.
In CrateDB jargon, a repository is a storage unit, where data in form of snapshots
can be exported to, and imported from. A repository can be located on a filesystem,
or on a supported object storage backend. CrateDB is able to use buckets on S3-compatible
storage backends, or on Azure blob storage, using the CREATE REPOSITORY ... TYPE = s3|azure|fs
SQL statement.
CREATE REPOSITORY
export_cold
TYPE
s3
WITH (
protocol = 'https',
endpoint = 's3-store.example.org:443',
access_key = '<USERNAME>',
secret_key = '<PASSWORD>',
bucket = 'cratedb-cold-storage'
);
cratedb-retention create-policy --strategy=snapshot \
--table-schema=doc --table-name=sensor_readings \
--partition-column=time_month --retention-period=365 \
--target-repository-name=export_cold \
"${CRATEDB_URI}"
This retention policy implements the following directive.
Run a snapshot on the data within the
"doc"."sensor_readings"
table, on partitions defined by the columntime_month
, which is older than 365 days at the given cutoff date, to a previously created repository calledexport_cold
. Delete the corresponding data afterwards.
Data model¶
SQL DDL schema¶
The database schema for the retention policy table is defined like this. Each record represents a rule how to expire and retain data per table. At runtime, each record will correspond to a dedicated retention job task.
-- Set up the retention policy database table schema.
CREATE TABLE IF NOT EXISTS "ext"."retention_policy" (
-- Strategy to apply for data retention.
"strategy" TEXT NOT NULL,
-- Tags: For grouping, multi-tenancy, and more.
"tags" OBJECT(DYNAMIC),
-- Source: The database table operated upon.
"table_schema" TEXT, -- The source table schema.
"table_name" TEXT, -- The source table name.
"partition_column" TEXT NOT NULL, -- The source table column name used for partitioning.
-- Retention parameters.
"retention_period" INTEGER NOT NULL, -- Retention period in days. The number of days data gets
-- retained before applying the retention policy.
-- Target: Where data is moved/relocated to.
-- Targeting specific nodes.
-- You may want to designate dedicated nodes to be responsible for "hot" or "warm" storage types.
-- To do that, you can assign attributes to specific nodes, effectively tagging them.
-- https://crate.io/docs/crate/reference/en/latest/config/node.html#custom-attributes
"reallocation_attribute_name" TEXT, -- Name of the node-specific custom attribute.
"reallocation_attribute_value" TEXT, -- Value of the node-specific custom attribute.
-- Targeting a repository.
"target_repository_name" TEXT, -- The name of a repository created with `CREATE REPOSITORY ...`.
PRIMARY KEY ("table_schema", "table_name", "strategy")
)
CLUSTERED INTO 1 SHARDS;
Record examples¶
You can add a retention policy by using an SQL statement like this.
-- Add a retention policy using the DELETE strategy.
INSERT INTO "ext"."retention_policy"
(strategy, table_schema, table_name, partition_column, retention_period)
VALUES
('delete', 'doc', 'raw_metrics', 'ts_day', 1);
Using the corresponding CLI command is equivalent to the SQL statement.
cratedb-retention create-policy --strategy=delete \
--table-schema=doc --table-name=raw_metrics \
--partition-column=ts_day --retention-period=1 \
"${CRATEDB_URI}"
After running that command, and creating a record, the program reports about its identifier.
Created new retention policy: adb90608-c20f-4de7-be98-93588d8358dc
You can use it to delete the rule again.
cratedb-retention delete-policy --id=adb90608-c20f-4de7-be98-93588d8358dc \
"${CRATEDB_URI}"
For enumerating all policies, use the list-policies
subcommand.
cratedb-retention list-policies "${CRATEDB_URI}"
Install¶
The CrateDB data retention and expiration toolkit program cratedb-retention
can be
installed by using the corresponding Python package, or it can alternatively be invoked
on the command line using Podman or Docker.
For installing the package, please refer to the install documentation section.
cratedb-retention --version
Usage¶
The toolkit can be used both as a standalone program, and as a library.
Command-line use¶
This section outlines how to connect to, and run data retention jobs, both on a database cluster running on your premises, and on CrateDB Cloud.
The steps are the same, only the connection parameters are different.
Todo
Note that, currently, the Python driver and the crash database shell need to obtain slightly different parameters.
By using the --verbose
option, directly after the main command, before the
subcommand, you can inspect the SQL statements issued to the database.
cratedb-retention --verbose setup "${CRATEDB_URI}"
By using the --dry-run
option, after the subcommand, the program will not
issue altering or modifying SQL statements to the database. You can use it
to learn about which actions would be performed.
cratedb-retention setup --dry-run "${CRATEDB_URI}"
The DBURI
command-line argument can be omitted when defining the CRATEDB_URI
environment variable.
export CRATEDB_URI='crate://localhost/'
There are also a few shortcuts for subcommands. For example, the shortest form of incantation for deleting a retention policy would be:
cratedb-retention rm --tags=baz
Workstation / on-premise¶
export CRATEDB_URI='crate://localhost/'
export CRATEDB_HOST='http://localhost:4200/'
CrateDB Cloud¶
export CRATEDB_URI='crate://admin:<PASSWORD>@<CLUSTERNAME>.aks1.eastus2.azure.cratedb.net:4200?ssl=true'
export CRATEDB_HOST='https://admin:<PASSWORD>@<CLUSTERNAME>.aks1.eastus2.azure.cratedb.net:4200/'
General¶
Install retention policy bookkeeping tables.
cratedb-retention setup "${CRATEDB_URI}"
Add a retention policy rule.
cratedb-retention create-policy --strategy=delete \
--table-schema=doc --table-name=raw_metrics \
--partition-column=ts_day --retention-period=1 \
"${CRATEDB_URI}"
Invoke the data retention job, expiring all data older than one day.
cratedb-retention run --strategy=delete "${CRATEDB_URI}"
Invoke the data retention job, expiring all data older than one day before a specific cutoff date.
cratedb-retention run --cutoff-day=2023-06-27 --strategy=delete "${CRATEDB_URI}"
Simulate the data retention job, display SQL statements only.
cratedb-retention run --dry-run --strategy=delete "${CRATEDB_URI}"
Podman or Docker¶
This section offers a complete walkthrough, how work with the cratedb-retention
program interactively using Podman or Docker, step by step. It starts a single-
node CrateDB instance on your workstation for demonstration purposes. Please note
that some retention policy strategies will not work on single-node installations.
export CRATEDB_URI='crate://localhost/'
export CRATEDB_HOST='http://localhost:4200/'
export OCI_IMAGE='ghcr.io/crate/cratedb-retention:nightly'
Display version number.
docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention --version
Start CrateDB.
docker run --rm -it \
--name=cratedb \
--publish=4200:4200 --publish=5432:5432 \
--env=CRATE_HEAP_SIZE=4g crate \
-Cdiscovery.type=single-node
Install retention policy bookkeeping tables.
docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention setup "${CRATEDB_URI}"
Add a retention policy rule using the DELETE strategy.
docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention create-policy --strategy=delete \
--table-schema=doc --table-name=raw_metrics \
--partition-column=ts_day --retention-period=1 \
"${CRATEDB_URI}"
Invoke the data retention job.
docker run --rm -i --network=host "${OCI_IMAGE}" \
cratedb-retention run --strategy=delete "${CRATEDB_URI}"
Library use¶
This section outlines how to use the toolkit as a library within your own
applications. The code displayed below is a stripped-down version of the
runnable example program examples/retention_retire_cutoff.py
, located
within this repository.
from cratedb_toolkit.retention.core import RetentionJob
from cratedb_toolkit.retention import JobSettings, RetentionPolicy, RetentionStrategy
from cratedb_toolkit.retention.setup import setup_schema
from cratedb_toolkit.retention.store import RetentionPolicyStore
from cratedb_toolkit.model import DatabaseAddress
# Define the database URI to connect to.
DBURI = "crate://localhost/"
# A. Set up adapter to retention policy store.
settings = JobSettings(
database=DatabaseAddress.from_string(DBURI),
)
store = RetentionPolicyStore(settings=settings)
# B. Initialize the subsystem, and create a data retention policy.
# TODO: Refactor to `RetentionPolicyStore`.
setup_schema(settings=settings)
# Add a basic retention policy.
policy = RetentionPolicy(
strategy=RetentionStrategy.DELETE,
table_schema="doc",
table_name="raw_metrics",
partition_column="ts_day",
retention_period=1,
)
identifier = store.create(policy, ignore="DuplicateKeyException")
# C. Define job settings, and invoke the data retention job.
settings = JobSettings(
database=DatabaseAddress.from_string(DBURI),
strategy=RetentionStrategy.DELETE,
)
job = RetentionJob(settings=settings)
job.start()