The source code is available in our GitHub repository at https://github.com/kubox-ai/1trc. This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS us-east-1 and process 1 trillion rows of data. It gives you two options for tackling this challenge:
- ClickHouse – A powerful, high-performance analytics database.
- Daft and Ray – A dynamic duo for distributed computing and cutting-edge data processing.
The calculations below use USD prices for AWS EC2 spot instances; these prices vary with current spot-market rates.
| Metric/Framework | Daft + Ray | ClickHouse |
|---|---|---|
| Startup time | 320s | 313s |
| Running time | 1189s | 527s |
| Delete time | 122s | 123s |
| Estimated cost | $2.75 | $1.37 |
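As a rough sketch of how a spot-instance cost estimate like the one in the table can be derived: total wall-clock time (startup + running + delete) is billed across every node in the cluster at the spot hourly rate. The hourly rate and node count below are illustrative assumptions, not the actual cluster configuration.

```python
# Hypothetical values for illustration only -- the real cluster's
# instance type, spot price, and node count may differ.
HOURLY_RATE_USD = 0.50   # assumed spot price per instance-hour
NODE_COUNT = 10          # assumed cluster size

def estimate_cost(startup_s: float, running_s: float, delete_s: float) -> float:
    """Estimate total cluster cost: wall-clock hours billed on all nodes."""
    total_hours = (startup_s + running_s + delete_s) / 3600
    return round(total_hours * HOURLY_RATE_USD * NODE_COUNT, 2)

# Wall-clock times from the table above (seconds):
print(estimate_cost(320, 1189, 122))  # Daft + Ray
print(estimate_cost(313, 527, 123))   # ClickHouse
```

With these assumed rates the estimates land close to the table's figures, but the real numbers depend on the spot prices in effect during the run.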
Dataset
The One Trillion Row Challenge originated as an ambitious benchmark task:
- Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
- Dataset:
  - Format: Parquet
  - Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
  - Location: `s3://coiled-datasets-rp/1trc` (an AWS S3 Requester Pays bucket)
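The aggregation the challenge asks for can be sketched in a few lines of plain Python. This single-process version runs on a tiny in-memory sample standing in for the Parquet data; the actual submission distributes the same logic across 100,000 files with ClickHouse or Daft + Ray. The column names `station` and `temperature` in the sample are illustrative.

```python
from collections import defaultdict

def aggregate(rows):
    """Compute (min, mean, max) temperature per station, sorted by station.

    rows: iterable of (station, temperature) pairs.
    """
    # Accumulator per station: [running min, running sum, count, running max]
    acc = defaultdict(lambda: [float("inf"), 0.0, 0, float("-inf")])
    for station, temp in rows:
        a = acc[station]
        a[0] = min(a[0], temp)
        a[1] += temp
        a[2] += 1
        a[3] = max(a[3], temp)
    return {s: (a[0], a[1] / a[2], a[3]) for s, a in sorted(acc.items())}

sample = [("Oslo", -3.0), ("Lima", 18.5), ("Oslo", 1.0), ("Lima", 20.5)]
print(aggregate(sample))
# {'Lima': (18.5, 19.5, 20.5), 'Oslo': (-3.0, -1.0, 1.0)}
```

Distributed engines execute the same shape of computation: a partial min/sum/count/max per partition, followed by a merge and a final sort by station name.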
24 MiB file: