The source code is available in our GitHub repository at https://github.com/kubox-ai/1trc. This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS us-east-1 and process 1 trillion rows of data. It gives you two options for tackling this challenge:
- ClickHouse – A powerful, high-performance analytics database.
- Daft and Ray – A dynamic duo for distributed computing and cutting-edge data processing.
The calculations below use USD prices for AWS EC2 spot instances; these prices vary with current spot-market rates.
| Metric/Framework | Daft + Ray | ClickHouse |
|---|---|---|
| Startup time | 320s | 313s |
| Running time | 1189s | 527s |
| Delete time | 122s | 123s |
| Estimated cost | $2.75 | $1.37 |
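As a rough sketch of how a spot-instance cost estimate like the one in the table can be derived: total wall-clock time (startup + running + delete) is billed across every node in the cluster at the spot hourly rate. The hourly rate and node count below are illustrative assumptions, not the actual cluster configuration.

```python
# Hypothetical values for illustration only -- the real cluster's
# instance type, spot price, and node count may differ.
HOURLY_RATE_USD = 0.50   # assumed spot price per instance-hour
NODE_COUNT = 10          # assumed cluster size

def estimate_cost(startup_s: float, running_s: float, delete_s: float) -> float:
    """Estimate total cluster cost: wall-clock hours billed on all nodes."""
    total_hours = (startup_s + running_s + delete_s) / 3600
    return round(total_hours * HOURLY_RATE_USD * NODE_COUNT, 2)

# Wall-clock times from the table above (seconds):
print(estimate_cost(320, 1189, 122))  # Daft + Ray
print(estimate_cost(313, 527, 123))   # ClickHouse
```

With these assumed rates the estimates land close to the table's figures, but the real numbers depend on the spot prices in effect during the run.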
Dataset
The One Trillion Row Challenge originated as an ambitious benchmark task:
- Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
- Dataset:
  - Format: Parquet
  - Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
  - Location: `s3://coiled-datasets-rp/1trc` (an AWS S3 Requester Pays bucket)
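The aggregation the challenge asks for can be sketched in a few lines of plain Python. This single-process version runs on a tiny in-memory sample standing in for the Parquet data; the actual submission distributes the same logic across 100,000 files with ClickHouse or Daft + Ray. The column names `station` and `temperature` in the sample are illustrative.

```python
from collections import defaultdict

def aggregate(rows):
    """Compute (min, mean, max) temperature per station, sorted by station.

    rows: iterable of (station, temperature) pairs.
    """
    # Accumulator per station: [running min, running sum, count, running max]
    acc = defaultdict(lambda: [float("inf"), 0.0, 0, float("-inf")])
    for station, temp in rows:
        a = acc[station]
        a[0] = min(a[0], temp)
        a[1] += temp
        a[2] += 1
        a[3] = max(a[3], temp)
    return {s: (a[0], a[1] / a[2], a[3]) for s, a in sorted(acc.items())}

sample = [("Oslo", -3.0), ("Lima", 18.5), ("Oslo", 1.0), ("Lima", 20.5)]
print(aggregate(sample))
# {'Lima': (18.5, 19.5, 20.5), 'Oslo': (-3.0, -1.0, 1.0)}
```

Distributed engines execute the same shape of computation: a partial min/sum/count/max per partition, followed by a merge and a final sort by station name.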
24 MiB file: