Main Backlog
Iteration +1
Table Loader: Refactor URL dispatcher, use fsspec
Table Loader/Docs: Advertise using OCI image
MongoDB: Load table with querying by single object id
MongoDB: Multi-phase BulkProcessor batch size adjustments
MongoDB: Report byte sizes (cur/avg/total) in progress bar
Documentation:
The procedure employed by CTK uses the catch-all data OBJECT(DYNAMIC) storage strategy, which sinks the source record/document into a single column in CrateDB.
The transformation recipe attempts to outline a few features provided by Zyp Transformations, in this case exclusively applying transformations described by expressions written in jqlang.
MongoDB/Docs: Describe usage of mongoimport and mongoexport.
mongoimport --uri 'mongodb+srv://MYUSERNAME:SECRETPASSWORD@mycluster-ABCDE.azure.mongodb.net/test?retryWrites=true&w=majority'
MongoDB: Convert dates like "date": "Sep 18 2015", see testdrive.city_inspections.
Table Loader: Propagate offset/limit to progress bar
Multi-file import from Tar files? https://github.com/crate/crate/issues/17770
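The date conversion item for testdrive.city_inspections could start from something like the following. This is only a minimal sketch, assuming the values always follow the "Sep 18 2015" shape (abbreviated English month, day, four-digit year). The function name parse_inspection_date is hypothetical and not part of CTK.

```python
from datetime import date, datetime


def parse_inspection_date(value: str) -> date:
    """Parse a date like "Sep 18 2015" into a `datetime.date`.

    Assumption: abbreviated English month names. `%b` is locale-dependent,
    so this relies on the process running with an English or C locale.
    """
    # Surrounding whitespace is stripped before parsing.
    return datetime.strptime(value.strip(), "%b %d %Y").date()


if __name__ == "__main__":
    print(parse_inspection_date("Sep 18 2015").isoformat())
```

Returning a plain date keeps the sketch independent of how the value is eventually stored; converting to a timestamp would be a separate decision.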
Iteration +2
Address fix_job_info_table_name
Add more items about ctk load table to the examples/ folder: Python, Bash
Cloud: Parallelize import jobs?
Bug: Use CRATEDB_USERNAME=admin from cluster-info
Cloud: Tests for uploading a local file
Cloud: Use .ini file and keyring for storing CrateDB Cloud Cluster ID and credentials
Cloud: List RUNNING/FAILED/SUCCEEDED jobs
Cloud: Sanitize file name yc.2019.07-tiny.parquet to be accepted as table name
ctk load table: Accept offset/limit and start/stop options
Humanized: https://github.com/panodata/aika
UX: Unlock testdata:// data sources from influxio
UX: No stack traces when cratedb_toolkit.util.croud.CroudException: 401 - Unauthorized
UX: Explain cratedb_toolkit.util.croud.CroudException: Another cluster operation is currently in progress, please try again later.
UX: Explain cratedb_toolkit.util.croud.CroudException: Resource not found. when accessing unknown cluster id.
UX: Make ctk list-jobs respect "status": "SUCCEEDED" etc.
UX: Improve textual report from ctk load table
UX: Accept alias --format {jsonl,ndjson} for --format json_row
Catch recursion errors: CRATEDB_SQLALCHEMY_URL=crate://crate@localhost:4200/
CLI: Verify exit codes.
UX: Rename ctk cluster info to ctk status cluster --id=foo-bar-baz
UX: Add ctk start cluster --id=foo-bar-baz
UX: Provide Bash/zsh completion
Beautify list-jobs output
ctk list-clusters
Store CRATEDB_CLOUD_CLUSTER_ID into cratedb_toolkit.constants
Cloud Tests: Verify file uploads
Docs: Add examples in more languages: Java, JavaScript, Lua, PHP
Docs:
Kafka:
CTK INFO/CFR
Migrate / I/O adapter
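For the file name sanitization item (yc.2019.07-tiny.parquet), one possible approach is sketched below. It assumes a deliberately conservative rule set: lowercase ASCII letters, digits, and underscores only, and no leading digit or underscore. The helper name sanitize_table_name and the t_ prefix are hypothetical, and the rules are an assumption rather than CrateDB's exact naming constraints.

```python
import re
from pathlib import PurePosixPath


def sanitize_table_name(filename: str) -> str:
    """Derive a conservative table name from a file name (hypothetical helper)."""
    # Drop directories and the final extension:
    # "yc.2019.07-tiny.parquet" -> "yc.2019.07-tiny".
    stem = PurePosixPath(filename).stem
    # Replace every run of characters outside [a-z0-9_] with one underscore,
    # then trim underscores from both ends.
    name = re.sub(r"[^a-z0-9_]+", "_", stem.lower()).strip("_")
    # Prefix names that are empty or start with a digit.
    if not name or name[0].isdigit():
        name = f"t_{name}"
    return name


if __name__ == "__main__":
    print(sanitize_table_name("yc.2019.07-tiny.parquet"))
```

Mapping several punctuation characters to the same underscore can make two different file names collide, so a real implementation would probably need to report or resolve such clashes.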
Iteration +2.5
Retention: Improve retention subsystem CLI API.
ctk retention create-policy lalala
ctk materialized create lalala
ctk schedule add lalala
Retention: Make --cutoff-day optional, use today() as default.
Retention: Refactor “partition”-based strategies into subfamily/category, in order to make room for other types of strategies not necessarily using partitioned tables.
Retention: Add examples/retention_tags.py.
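Making --cutoff-day optional mostly comes down to a fallback at the point where the option value is resolved. A minimal sketch, assuming the option carries an ISO 8601 calendar date when given; the function name resolve_cutoff_day is hypothetical and the real CLI wiring is not shown.

```python
from datetime import date
from typing import Optional


def resolve_cutoff_day(value: Optional[str] = None) -> date:
    """Return the cutoff day, falling back to today when the option is omitted."""
    if value is None:
        return date.today()
    # Assumption: the option value is an ISO 8601 date, e.g. "2023-06-27".
    return date.fromisoformat(value)


if __name__ == "__main__":
    print(resolve_cutoff_day("2023-06-27"))
    print(resolve_cutoff_day())
```

Resolving the default at call time, rather than as a default argument value, avoids freezing "today" at import time in long-running processes.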
Iteration +3
CI: Nightly builds, to verify regressions on CrateDB
CI: Also build OCI images for ARM, maybe only on PRs to main, and releases?
CI: Add code coverage tracking and reporting.
More subcommands, like list-policies (list) and check-policies (check).
Improve testing for the reallocate strategy.
Provide components for emulating materialized views
Example:
cratedb-retention create-materialized-view doc.raw_metrics \
  --as='SELECT * FROM <table_name>;'
cratedb-retention refresh-materialized-view doc.raw_metrics
CI Testcontainers
Iteration +4
Add two non-partition-based strategies. Category: timerange.
Add a shortcut interface for adding policies.
Provide a TTL-like interface.
Rename retention period to duration. It is shorter, and aligns with InfluxDB.
Example:
cratedb-retention set-ttl doc.raw_metrics \
  --strategy=delete --duration=1w
Provide a solid (considering best-practices, DWIM) cascaded/multi-level downsampling implementation/toolkit, similar to RRDtool or Munin.
Naming things: Generalize to supertask.
Examples
# Example for a shortcut form of `supertask create-retention-policy`.
# <TABLE> <STRATEGY>:<TARGET> <DURATION>
st ttl doc.sensor_readings snapshot:export_cold 365d
When using partition-based retention, previously expressed with the --partition-column=time_month option, this syntax might be suitable:
# <TABLE>:<PARTCOL> <STRATEGY>:<TARGET> <DURATION>
st ttl doc.sensor_readings:time_month snapshot:export_cold 4w
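To make the proposed shortcut syntax more tangible, here is a rough sketch of how the three positional arguments of st ttl could be parsed. Everything in it is an assumption for illustration: the names TtlRule, parse_duration, and parse_ttl are hypothetical, and only the d and w duration suffixes used in the examples are supported.

```python
import re
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

# Assumption: only day and week suffixes, as used in the examples above.
_UNITS = {"d": "days", "w": "weeks"}


def parse_duration(value: str) -> timedelta:
    """Parse durations like "365d" or "4w" into a timedelta."""
    match = re.fullmatch(r"(\d+)([dw])", value)
    if match is None:
        raise ValueError(f"Unsupported duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


@dataclass
class TtlRule:
    table: str
    partition_column: Optional[str]
    strategy: str
    target: Optional[str]
    duration: timedelta


def parse_ttl(args: List[str]) -> TtlRule:
    """Parse `<TABLE>[:<PARTCOL>] <STRATEGY>[:<TARGET>] <DURATION>`."""
    table_spec, strategy_spec, duration_spec = args
    # str.partition yields an empty tail when the separator is absent.
    table, _, partition_column = table_spec.partition(":")
    strategy, _, target = strategy_spec.partition(":")
    return TtlRule(
        table=table,
        partition_column=partition_column or None,
        strategy=strategy,
        target=target or None,
        duration=parse_duration(duration_spec),
    )


if __name__ == "__main__":
    print(parse_ttl("doc.sensor_readings:time_month snapshot:export_cold 4w".split()))
```

Using the colon as separator in both the table and the strategy argument keeps the form compact, at the price of ruling out colons inside table or target names.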
Iteration +5
Periodic/recurrent queries via scheduling.
Humanized: https://github.com/panodata/aika
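Either use classic cron, or systemd-timers, or use one of APScheduler, schedule, or scheduler.
import datetime as dt
import pendulum
# Assumption: the @dag decorator is the one from Apache Airflow.
from airflow.decorators import dag

@dag(
    # strptime needs an explicit format string.
    start_date=dt.datetime.strptime("2021-11-19", "%Y-%m-%d"),
    # start_date=pendulum.datetime(2021, 11, 19, tz="UTC"),
    schedule="@daily",
    catchup=False,
)
def retention():  # hypothetical placeholder, the decorated function is not part of the note
    ...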
Document complete “Docker Compose” setup variant, using both CrateDB and cratedb-retention
Generalize from cutoff_day to cutoff_date? For example, use ms. See https://iotdb.apache.org/UserGuide/latest/Basic-Concept/TTL-Delete.html#ttl-delete-data.
More battle testing, in sandboxes and on production systems.
Use storage classes
Iteration +6
Review SQL queries: What about details like ORDER BY 5 ASC?
Use SQLAlchemy as query builder, to prevent SQL injection (S608), see render_delete.py spike.
Improve configurability by offering to configure schema names and such.
Document how to run multi-tenant operations using “tags”.
Add an audit log ("ext"."jobs_log"), which records events when retention policy rules are changed, or executed.
Add Webhooks, to connect to other systems
Document usage with Kubernetes, and Nomad/Waypoint.
Job progress
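The SQL injection item (S608) is about not assembling statements from untrusted strings. The sketch below illustrates the underlying idea only: it deliberately swaps in the standard library's sqlite3 module instead of SQLAlchemy and CrateDB, so that it stays self-contained. The helper names quote_identifier and delete_before are hypothetical, and the identifier rule is an assumption.

```python
import re
import sqlite3

# Assumption: only plain ASCII identifiers are accepted.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Accept only plain identifiers, then double-quote them."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Refusing suspicious identifier: {name!r}")
    return f'"{name}"'


def delete_before(conn: sqlite3.Connection, table: str, column: str, cutoff: str) -> int:
    """Delete rows older than `cutoff` and return the number of deleted rows."""
    # Identifiers cannot be bound as parameters, so they are validated and quoted.
    sql = f"DELETE FROM {quote_identifier(table)} WHERE {quote_identifier(column)} < ?"
    # The cutoff value travels as a bound parameter, never as SQL text.
    cursor = conn.execute(sql, (cutoff,))
    return cursor.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_metrics (ts TEXT)")
    conn.executemany("INSERT INTO raw_metrics VALUES (?)", [("2023-01-01",), ("2023-12-01",)])
    print(delete_before(conn, "raw_metrics", "ts", "2023-06-27"))
```

A linter may still flag the f-string that carries the identifiers. A query builder avoids that string assembly altogether, which is presumably the motivation for the SQLAlchemy item.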
Iteration +7
More packaging: Use fpm
More packaging: What about an Ubuntu Snap, a Helm chart, or a Nomad Pack?
Clarify how to interpret the --cutoff-day option.
Add policy rule editor UI.
Is “day”-granularity fine with all use-cases? Should it better be generalized?
Currently, the test for the reallocate strategy apparently does not remove any records. The reason is probably that the scenario can’t easily be simulated on a single-node cluster.
Ship more package variants: rpm, deb, snap, buildpack?
Verify Docker setup on Windows
Done
Use a dedicated schema for retention policy tables, other than doc.
Refactoring: Manifest the “retention policy” as code entity, using dataclasses, or SQLAlchemy.
Document how to connect to CrateDB Cloud
Add DatabaseAddress entity, with .safe property to omit any passwords
Document library and Docker use
README: Add a good header, with links to relevant resources
Naming things: Use “toolkit” instead of “manager”.
Document the layout of the retention policy entity, and the meaning of its attributes.
CI: Rename OCI workflow build steps.
Move strategy column to the first position of the retention policy table, and update all corresponding occurrences.
Add “tags” to data model, for grouping, multi-tenancy, and more.
Improve example
Introduce database and CLI API for editing records
List all tags
Examples: Add “full” example to basic.py, rename to full.py
Improve tests by using generate_series
Document compact invocation, after applying an alias and exporting an environment variable:
cratedb-retention rm --tags=baz
Default value for "${CRATEDB_URI}" aka. dburi argument
Add additional check if data table(s) exist, or not.
Dissolve JOIN-based retention task gathering, because, when the application does not discover any retention policy job, it cannot discriminate between “no retention policy” and “no data”, and thus is not able to report about it accordingly.
CLI: Provide --dry-run option
Docs: Before running the examples, need to invoke
cratedb-retention setup --schema=examples
For testing the snapshot strategy, provide an embedded MinIO S3 instance to the test suite.
Improve SNAPSHOT testing: Microsoft Azure Blob Storage
Improve SNAPSHOT testing: Filesystem
UX: Refactoring towards cratedb-toolkit.
UX: ctk load: Clearly disambiguate between loading data into RDBMS database tables, blob tables, or filesystem objects.
ctk load table https://s3.amazonaws.com/my.import.data.gz
ctk load blob /path/to/image.png
ctk load object /local/path/to/image.png /dbfs/assets
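The DatabaseAddress item with its .safe property can be illustrated with a small stand-in. This is a sketch under assumptions, not the toolkit's actual implementation: the class name AddressSketch is hypothetical, and it relies only on urllib.parse from the standard library.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class AddressSketch:
    """Hypothetical stand-in for a DatabaseAddress-like entity."""

    dburi: str

    @property
    def safe(self) -> str:
        """Return the URI without the password, for logging and reporting."""
        parts = urlsplit(self.dburi)
        # Limitation: `hostname` is lowercased and drops IPv6 brackets.
        host = parts.hostname or ""
        if parts.port is not None:
            host = f"{host}:{parts.port}"
        # Keep the username, omit the password.
        netloc = f"{parts.username}@{host}" if parts.username else host
        return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(AddressSketch("crate://admin:secret@localhost:4200/?ssl=true").safe)
```

Keeping the username while dropping the password still tells a reader which account was used, which tends to be useful in log output.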