MongoDB Table Loader¶

About¶

Load data from MongoDB and its file formats into CrateDB using a one-stop command ctk load, to facilitate convenient data transfers to be used within data pipelines or ad hoc operations.

Coverage¶

CrateDB Toolkit supports different variants to load MongoDB data from server instances and filesystems.

mongodb://

Connect to MongoDB Community or Enterprise Edition.
mongodb+srv://

Connect to MongoDB Atlas.
file+bson://

Read files in MongoDB Extended JSON or BSON format from filesystem.
http+bson://

Read files in MongoDB Extended JSON or BSON format from HTTP resource.

Install¶

pip install --upgrade 'cratedb-toolkit[mongodb]'

Usage¶

The MongoDB I/O adapter can process MongoDB data from different sources. This section enumerates relevant connectivity options on behalf of concrete usage examples.

MongoDB Atlas¶

Transfer a single collection from MongoDB Atlas.

export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/ticker/stocks
ctk load "mongodb+srv://john:EeY6OocooL8rungu@testdrive.ahnaik1.mongodb.net/ticker/stocks?batch-size=5000"

Transfer all collections in database from MongoDB Atlas.

export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/ticker
ctk load "mongodb+srv://john:EeY6OocooL8rungu@testdrive.ahnaik1.mongodb.net/ticker?batch-size=5000"

Important

When transferring multiple collections, make sure to use a CrateDB database address which DOES NOT reference an individual table. It MUST stop at the schema label, here, ticker. Likewise, the MongoDB database address also MUST reference a database, NOT a specific collection.

MongoDB Community and Enterprise¶

Transfer data from MongoDB database/collection into CrateDB schema/table.

export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/testdrive/demo
ctk load "mongodb://localhost:27017/testdrive/demo"

Query data in CrateDB.

export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/testdrive/demo
ctk shell --command "SELECT * FROM testdrive.demo;"
ctk show table "testdrive.demo"

MongoDB JSON/BSON files¶

Load data from MongoDB JSON/BSON files, for example produced by the mongoexport or mongodump programs.

# Extended JSON, filesystem, full path.
ctk load "file+bson:///path/to/mongodb-json-files/datasets/books.json"

# Extended JSON, HTTP resource.
ctk load "https+bson://github.com/ozlerhakan/mongodb-json-files/raw/master/datasets/books.json"

# BSON, filesystem, relative path, compressed.
ctk load "file+bson:./var/data/testdrive/books.bson.gz"

# Extended JSON, filesystem, multiple files.
ctk load \
  "file+bson:///path/to/mongodb-json-files/datasets/*.json?batch-size=2500" \
  "crate://crate@localhost:4200/datasets"

Important

When transferring multiple collections, make sure to use a CrateDB database address which DOES NOT reference an individual table. It MUST stop at the schema label, here, datasets. Likewise, the path to the MongoDB JSON files also MUST reference the parent folder, NOT a specific JSON or BSON file.

To exercise a full example importing multiple MongoDB Extended JSON files, see File Import Tutorial.

Options¶

Batch Size¶

The default batch size is 100, but for many datasets a much larger batch size is applicable for most efficient data transfers. You can adjust the value by appending the HTTP URL query parameter batch-size to the source URL, like mongodb+srv://managed.mongodb.net/ticker/stocks?batch-size=5000.

Filter¶

Use the HTTP URL query parameter filter on the source URL, like &filter={"exchange":{"$eq":"NASDAQ"}}, or &filter={"_id":"66f0002e98c00fb8261d87c8"}, in order to provide a MongoDB query filter as a JSON string. It works in the same way like mongoexport’s --query option. On more complex query expressions, make sure to properly encode the right value using URL/Percent Encoding.

Limit¶

Use the HTTP URL query parameter limit on the source URL, like &limit=100, in order to limit processing to a total number of records.

Offset¶

Use the HTTP URL query parameter offset on the source URL, like &offset=42, in order to start processing at this record from the beginning.

Transformations¶

You can use Tikray Transformations to change the shape of the data while being transferred. To add it to the pipeline, use the --transformation command line option.

It is also available for the migr8 extract and migr8 export commands. Example transformation files in YAML format can be explored at examples/tikray.

Appendix¶

Insert Exercise¶

Import two data points into MongoDB database testdrive and collection demo, using the mongosh CLI program.

mongosh mongodb://localhost:27017/testdrive <<EOF
db.demo.remove({})
db.demo.insertMany([
  {
    timestamp: new Date(1556896326),
    region: "amazonas",
    temperature: 42.42,
    humidity: 84.84,
  },
  {
    timestamp: new Date(1556896327),
    region: "amazonas",
    temperature: 45.89,
    humidity: 77.23,
    windspeed: 5.4,
  },
])
db.demo.find({})
EOF

File Import Tutorial¶

The mongodb-json-files repository includes a few samples worth of data in MongoDB JSON/BSON format.

Load

Acquire a copy of the repository.

git clone https://github.com/ozlerhakan/mongodb-json-files.git

The data import uses a Tikray project file tikray-mongodb-json-files.yaml that describes a few adjustments needed to import all files flawlessly. Let’s acquire that file.

wget https://github.com/crate/cratedb-toolkit/raw/v0.0.36/examples/tikray/tikray-mongodb-json-files.yaml

Load all referenced .json files into corresponding tables within the CrateDB schema datasets, using a batch size of 2,500 items.

ctk load \
  "file+bson:///path/to/mongodb-json-files/datasets/*.json?batch-size=2500" \
  "crate://crate@localhost:4200/datasets" \
  --transformation tikray-mongodb-json-files.yaml

Query

After importing the example files, you may want to exercise those SQL queries, for example using Admin UI or crash, to get an idea about how to work with CrateDB SQL.

books.json

SELECT 
    data['title'] AS title, 
    LEFT(data['shortDescription'], 60) AS description, 
    DATE_FORMAT('%Y-%m-%d', data['publishedDate']) AS date,
    data['isbn'] AS isbn
FROM datasets.books 
WHERE 'Java' = ANY(data['categories'])
ORDER BY title;

city_inspections.json

SELECT
    data['sector'] AS sector, 
    data['business_name'] AS name
FROM datasets.city_inspections
WHERE 
    data['result'] = 'Violation Issued' AND 
    UPPER(data['address']['city']) = 'STATEN ISLAND'
ORDER BY sector, name;

Tip

Alternatively, have a look at the canonical MongoDB C driver’s libbson test files.

Troubleshooting¶

When importing from a BSON file, and observing a traceback like this,

Traceback (most recent call last):
  File "/path/to/site-packages/bson/__init__.py", line 1356, in decode_file_iter
    yield _bson_to_dict(elements, opts)  # type:ignore[type-var, arg-type, misc]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
bson.errors.InvalidBSON: not enough data for a BSON document

please try to import the file into a MongoDB Server using mongorestore, and export it again using mongodump or mongoexport, preferably using recent versions like MongoDB 7 and tools version 100.9.5 or higher.