MongoDB Table Loader¶
About¶
Load data from MongoDB and its file formats into CrateDB using a one-stop
command ctk load table
, in order to facilitate convenient data transfers
to be used within data pipelines or ad hoc operations.
Coverage¶
CrateDB Toolkit supports different variants to load MongoDB data from server instances and filesystems.
mongodb://
Connect to MongoDB Community or Enterprise Edition.
mongodb+srv://
Connect to MongoDB Atlas.
file+bson://
Read files in MongoDB Extended JSON or BSON format from filesystem.
http+bson://
Read files in MongoDB Extended JSON or BSON format from HTTP resource.
Install¶
pip install --upgrade 'cratedb-toolkit[mongodb]'
Usage¶
The MongoDB I/O adapter can process MongoDB data from different sources. This section enumerates relevant connectivity options on behalf of concrete usage examples.
MongoDB Atlas¶
Transfer a single collection from MongoDB Atlas.
export CRATEDB_SQLALCHEMY_URL=crate://crate@localhost:4200/ticker/stocks
ctk load table "mongodb+srv://john:EeY6OocooL8rungu@testdrive.ahnaik1.mongodb.net/ticker/stocks?batch-size=5000"
Transfer all collections in database from MongoDB Atlas.
export CRATEDB_SQLALCHEMY_URL=crate://crate@localhost:4200/ticker
ctk load table "mongodb+srv://john:EeY6OocooL8rungu@testdrive.ahnaik1.mongodb.net/ticker?batch-size=5000"
Important
When transferring multiple collections, make sure to use a CrateDB database
address which DOES NOT reference an individual table.
It MUST stop at the schema label, here, ticker
. Likewise, the MongoDB
database address also MUST reference a database, NOT a specific collection.
MongoDB Community and Enterprise¶
Transfer data from MongoDB database/collection into CrateDB schema/table.
export CRATEDB_SQLALCHEMY_URL=crate://crate@localhost:4200/testdrive/demo
ctk load table "mongodb://localhost:27017/testdrive/demo"
Query data in CrateDB.
export CRATEDB_SQLALCHEMY_URL=crate://crate@localhost:4200/testdrive/demo
ctk shell --command "SELECT * FROM testdrive.demo;"
ctk show table "testdrive.demo"
MongoDB JSON/BSON files¶
Load data from MongoDB JSON/BSON files, for example produced by the
mongoexport
or mongodump
programs.
# Extended JSON, filesystem, full path.
ctk load table "file+bson:///path/to/mongodb-json-files/datasets/books.json"
# Extended JSON, HTTP resource.
ctk load table "https+bson://github.com/ozlerhakan/mongodb-json-files/raw/master/datasets/books.json"
# BSON, filesystem, relative path, compressed.
ctk load table "file+bson:./var/data/testdrive/books.bson.gz"
# Extended JSON, filesystem, multiple files.
ctk load table \
"file+bson:///path/to/mongodb-json-files/datasets/*.json?batch-size=2500" \
--cratedb-sqlalchemy-url="crate://crate@localhost:4200/datasets"
Important
When transferring multiple collections, make sure to use a CrateDB database
address which DOES NOT reference an individual table.
It MUST stop at the schema label, here, datasets
. Likewise, the path to
the MongoDB JSON files also MUST reference the parent folder, NOT a specific
JSON or BSON file.
To exercise a full example importing multiple MongoDB Extended JSON files, see File Import Tutorial.
Options¶
Batch Size¶
The default batch size is 100, but for many datasets a much larger batch size
is applicable for most efficient data transfers. You can adjust the value by
appending the HTTP URL query parameter batch-size
to the source URL, like
mongodb+srv://managed.mongodb.net/ticker/stocks?batch-size=5000
.
Filter¶
Use the HTTP URL query parameter filter
on the source URL, like
&filter={"exchange":{"$eq":"NASDAQ"}}
, or
&filter={"_id":"66f0002e98c00fb8261d87c8"}
,
in order to provide a MongoDB query filter as a JSON string.
It works in the same way like mongoexport
’s --query
option.
On more complex query expressions, make sure to properly encode the right
value using URL/Percent Encoding.
Limit¶
Use the HTTP URL query parameter limit
on the source URL, like
&limit=100
, in order to limit processing to a total number of
records.
Offset¶
Use the HTTP URL query parameter offset
on the source URL, like
&offset=42
, in order to start processing at this record from the
beginning.
Transformations¶
You can use Zyp Transformations to change the shape of the data while being
transferred. In order to add it to the pipeline, use the --transformation
command line option.
It is also available for the migr8 extract
and migr8 export
commands.
Example transformation files in YAML format can be explored at examples/zyp.
Appendix¶
Insert Exercise¶
Import two data points into MongoDB database testdrive
and collection demo
,
using the mongosh
CLI program.
mongosh mongodb://localhost:27017/testdrive <<EOF
db.demo.remove({})
db.demo.insertMany([
{
timestamp: new Date(1556896326),
region: "amazonas",
temperature: 42.42,
humidity: 84.84,
},
{
timestamp: new Date(1556896327),
region: "amazonas",
temperature: 45.89,
humidity: 77.23,
windspeed: 5.4,
},
])
db.demo.find({})
EOF
File Import Tutorial¶
The mongodb-json-files repository includes a few samples worth of data in MongoDB JSON/BSON format.
Load
Acquire a copy of the repository.
git clone https://github.com/ozlerhakan/mongodb-json-files.git
The data import uses a Zyp project file zyp-mongodb-json-files.yaml that describes a few adjustments needed to import all files flawlessly. Let’s acquire that file.
wget https://github.com/crate/cratedb-toolkit/raw/v0.0.22/examples/zyp/zyp-mongodb-json-files.yaml
Load all referenced .json
files into corresponding tables within the CrateDB
schema datasets
, using a batch size of 2,500 items.
ctk load table \
"file+bson:///path/to/mongodb-json-files/datasets/*.json?batch-size=2500" \
--cratedb-sqlalchemy-url="crate://crate@localhost:4200/datasets" \
--transformation zyp-mongodb-json-files.yaml
Query
After importing the example files, you may want to exercise those SQL queries, for example using Admin UI or crash, to get an idea about how to work with CrateDB SQL.
books.json
SELECT
data['title'] AS title,
LEFT(data['shortDescription'], 60) AS description,
DATE_FORMAT('%Y-%m-%d', data['publishedDate']) AS date,
data['isbn'] AS isbn
FROM datasets.books
WHERE 'Java' = ANY(data['categories'])
ORDER BY title;
city_inspections.json
SELECT
data['sector'] AS sector,
data['business_name'] AS name
FROM datasets.city_inspections
WHERE
data['result'] = 'Violation Issued' AND
UPPER(data['address']['city']) = 'STATEN ISLAND'
ORDER BY sector, name;
Tip
Alternatively, have a look at the canonical MongoDB C driver’s libbson test files.
Troubleshooting¶
When importing from a BSON file, and observing a traceback like this,
Traceback (most recent call last):
File "/path/to/site-packages/bson/__init__.py", line 1356, in decode_file_iter
yield _bson_to_dict(elements, opts) # type:ignore[type-var, arg-type, misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
bson.errors.InvalidBSON: not enough data for a BSON document
please try to import the file into a MongoDB Server using mongorestore
,
and export it again using mongodump
or mongoexport
, preferably using
recent versions like MongoDB 7 and tools version 100.9.5 or higher.