Skill v1.0.1
version: "1.0.1"
name: chdb-datastore
description: >-
  Use when the user has tabular data (pandas DataFrame, parquet, csv, Arrow, json)
  and wants to filter, group, aggregate, join, or speed up slow pandas.
  Provides chDB DataStore: same pandas API, ClickHouse engine underneath.
  Also handles reading from S3, MySQL, PostgreSQL, MongoDB, ClickHouse Cloud,
  Iceberg, Delta Lake as DataFrames and joining across sources.
  TRIGGER when: user mentions DataFrame, parquet, csv, "fast pandas",
  "speed up pandas", or cross-source DataFrame joins; user imports
  chdb.datastore or from datastore import DataStore.
  SKIP this skill for raw SQL syntax (use chdb-sql instead), ClickHouse server
  administration, or non-Python DataStore API work.
license: Apache-2.0
compatibility: Requires Python 3.9+, macOS or Linux. pip install chdb.
metadata:
  author: chdb-io
  version: "4.1"
  homepage: https://clickhouse.com/docs/chdb
chdb DataStore — It's Just Faster Pandas
The Key Insight

    # Change this:
    import pandas as pd

    # To this:
    import chdb.datastore as pd

    # Everything else stays the same.

DataStore is a lazy, ClickHouse-backed pandas replacement. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., print(), len(), iteration).
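
To make "lazy" concrete, here is a toy sketch in plain Python. It is not chDB's implementation: `LazyTable`, its methods, and the `executions` counter are hypothetical names invented for illustration. It only shows the general pattern of recording operations, rendering them as SQL on request, and doing the real work when a result is needed (here, `len()`).

```python
import operator

# Comparison operators the toy understands, keyed by their SQL spelling.
_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


class LazyTable:
    """Hypothetical lazy table: records operations, runs nothing until asked."""

    def __init__(self, name, rows, filters=(), columns=None):
        self.name = name
        self._rows = rows                # list of dicts standing in for stored data
        self._filters = tuple(filters)   # (column, op, value) triples recorded so far
        self._columns = columns          # None means "all columns"
        self.executions = 0              # how many times this object did real work

    def filter(self, column, op, value):
        # Record the predicate and return a new lazy object; no rows are touched.
        new_filters = self._filters + ((column, op, value),)
        return LazyTable(self.name, self._rows, new_filters, self._columns)

    def select(self, *columns):
        # Record the projection; still no execution.
        return LazyTable(self.name, self._rows, self._filters, columns)

    def to_sql(self):
        # Render the recorded operations as a SQL string without running them.
        cols = ", ".join(self._columns) if self._columns else "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(
                f"{c} {op} {v!r}" for c, op, v in self._filters
            )
        return sql

    def _execute(self):
        # The only place where rows are actually scanned.
        self.executions += 1
        rows = [
            r for r in self._rows
            if all(_OPS[op](r[c], v) for c, op, v in self._filters)
        ]
        if self._columns:
            rows = [{c: r[c] for c in self._columns} for r in rows]
        return rows

    def __len__(self):
        # Asking for a result triggers execution.
        return len(self._execute())


people = LazyTable("people", [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 22},
    {"name": "Cy", "age": 40},
])
query = people.filter("age", ">", 25).select("name")
```

Building `query` scans nothing; `query.to_sql()` only renders text, and the first `len(query)` is what walks the rows.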
pip install chdb
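
If the same script has to run on machines where chdb may not be installed yet, the one-line import swap can be guarded. The sketch below is one possible approach, not something the skill prescribes: `pick_backend` is a hypothetical helper, and it assumes that falling back to plain pandas is acceptable for your workload.

```python
import importlib
import importlib.util


def pick_backend():
    """Return the name of the first importable DataFrame backend, or None."""
    # Check the top-level package first: find_spec on a dotted name imports
    # the parent package and raises ModuleNotFoundError if it is missing.
    if importlib.util.find_spec("chdb") is not None:
        if importlib.util.find_spec("chdb.datastore") is not None:
            return "chdb.datastore"
    if importlib.util.find_spec("pandas") is not None:
        return "pandas"
    return None


backend = pick_backend()
# Bind the chosen module to the usual alias; None if neither is installed.
pd = importlib.import_module(backend) if backend else None
print(f"DataFrame backend: {backend}")
```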
Decision Tree: Pick the Right Approach
1. "I have a file/database and want to analyze it with pandas"
   → DataStore.from_file() / from_mysql() / from_s3() etc.
   → See references/connectors.md
2. "I need to join data from different sources"
   → Create DataStores from each source, use .join()
   → See examples/examples.md #3-5
3. "My pandas code is too slow"
   → import chdb.datastore as pd (change one line, keep the rest)
4. "I need raw SQL queries"
   → Use the chdb-sql skill instead
Connect to Any Data Source — One Pattern

    from datastore import DataStore

    # Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
    ds = DataStore.from_file("sales.parquet")

    # Database
    ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")

    # Cloud storage
    ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)

    # URI shorthand: auto-detects source type
    ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")

All 16+ sources and URI schemes → connectors.md
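
To see how the pieces of a URI such as `mysql://root:pass@db:3306/shop/orders` line up with the keyword arguments of `from_mysql()`, the standard library can split it. This is only an illustration of the URI layout used in the example above; `split_source_uri` is a hypothetical helper and says nothing about how `DataStore.uri()` parses its input internally.

```python
from urllib.parse import urlsplit


def split_source_uri(uri):
    """Split scheme://user:password@host:port/database/table into named parts."""
    parts = urlsplit(uri)
    # Path looks like "/shop/orders": first segment is the database, second the table.
    path_parts = [p for p in parts.path.split("/") if p]
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        # Keep the port attached to the host, matching host="db:3306".
        "host": f"{parts.hostname}:{parts.port}" if parts.port else parts.hostname,
        "database": path_parts[0] if path_parts else None,
        "table": path_parts[1] if len(path_parts) > 1 else None,
    }


info = split_source_uri("mysql://root:pass@db:3306/shop/orders")
print(info)
```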
After Connecting — Full Pandas API

    result = ds[ds["age"] > 25]                                       # filter
    result = ds[["name", "city"]]                                     # select columns
    result = ds.sort_values("revenue", ascending=False)               # sort
    result = ds.groupby("dept")["salary"].mean()                      # groupby
    result = ds.assign(margin=lambda x: x["profit"] / x["revenue"])   # computed column
    ds["name"].str.upper()                                            # string accessor
    ds["date"].dt.year                                                # datetime accessor
    result = ds1.join(ds2, on="id")                                   # join
    result = ds.head(10)                                              # preview
    print(ds.to_sql())                                                # see generated SQL

209 DataFrame methods supported. Full API → api-reference.md
Cross-Source Join — The Killer Feature

    from datastore import DataStore

    customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
    orders = DataStore.from_file("orders.parquet")

    result = (
        orders.join(customers, left_on="customer_id", right_on="id")
        .groupby("country")
        .agg({"amount": "sum", "rating": "mean"})
        .sort_values("sum", ascending=False)
    )
    print(result)

More join examples → examples.md
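
For a sense of what a join, groupby, and sort pipeline across two sources computes, here is the same logic written by hand with the standard library. Everything in it is a stand-in: an in-memory SQLite table plays the role of the MySQL `customers` table, an in-memory CSV plays the role of `orders.parquet`, the sample rows are invented, and inner-join semantics are assumed.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Source 1: a relational table (in-memory SQLite stands in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "DE"), (2, "US"), (3, "US")],
)

# Source 2: a file (an in-memory CSV stands in for orders.parquet).
orders_csv = io.StringIO(
    "customer_id,amount,rating\n"
    "1,10.0,4\n"
    "2,30.0,5\n"
    "3,20.0,3\n"
    "2,5.0,4\n"
)

# Join step: build a lookup from the database side, probe it with file rows.
country_by_id = dict(conn.execute("SELECT id, country FROM customers"))
amounts = defaultdict(float)
ratings = defaultdict(list)
for row in csv.DictReader(orders_csv):
    # CSV values arrive as strings, so cast the key before looking it up.
    country = country_by_id.get(int(row["customer_id"]))
    if country is None:
        continue  # assumed inner join: drop orders without a matching customer
    amounts[country] += float(row["amount"])
    ratings[country].append(float(row["rating"]))

# Aggregate per country, then sort by total amount, descending.
result = sorted(
    ((c, amounts[c], sum(ratings[c]) / len(ratings[c])) for c in amounts),
    key=lambda r: r[1],
    reverse=True,
)
print(result)
```

The hand-written version needs explicit casting, lookup tables, and accumulators for two small sources; that bookkeeping is what the chained DataStore call is meant to replace.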
Writing Data

    from datastore import DataStore

    # Read from MySQL
    source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")

    # Write target: a local Parquet file
    target = DataStore("file", path="summary.parquet", format="Parquet")

    # Insert the per-category aggregate of source into target
    target.insert_into("category", "total", "count").select_from(
        source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
    ).execute()

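
The `insert_into(...).select_from(...)` chain reads like SQL's `INSERT INTO ... SELECT`. As an analogy only, the sketch below runs that statement shape against an in-memory SQLite database with invented rows. It is not the SQL that DataStore generates (call `to_sql()` to see that), and SQLite's `count(*)` is used where the snippet above uses ClickHouse's `count()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source table with a few invented orders.
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("books", 12.5), ("books", 7.5), ("toys", 30.0)],
)

# Target table with the three columns named in insert_into(...).
conn.execute("CREATE TABLE summary (category TEXT, total REAL, count INTEGER)")

# One statement: aggregate the source and write the result into the target.
conn.execute(
    "INSERT INTO summary (category, total, count) "
    "SELECT category, sum(amount) AS total, count(*) AS count "
    "FROM orders GROUP BY category"
)

summary = conn.execute(
    "SELECT category, total, count FROM summary ORDER BY category"
).fetchall()
print(summary)
```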
Troubleshooting
| Problem | Fix |
|---|---|
| ImportError: No module named 'chdb' | pip install chdb |
| ImportError: cannot import 'DataStore' | Use from datastore import DataStore or from chdb.datastore import DataStore |
| Database connection timeout | Include port in host: host="db:3306" not host="db" |
| Join returns empty result | Check key types match (both int or both string); use .to_sql() to inspect |
| Unexpected results | Call ds.to_sql() to see the generated SQL and debug |
| Environment check | Run python scripts/verify_install.py (from skill directory) |
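
The "join returns empty result" row is easy to reproduce without any database. The plain-Python sketch below uses invented rows and a hypothetical helper, `join_key_types`, to show why integer keys never match string keys and how casting one side fixes it. The same check applies before joining two DataStores, though how strictly the engine compares mixed types is not covered here.

```python
def join_key_types(left_rows, right_rows, left_key, right_key):
    """Return the sets of Python type names seen in each join key column."""
    left = {type(r[left_key]).__name__ for r in left_rows}
    right = {type(r[right_key]).__name__ for r in right_rows}
    return left, right


def inner_join(left_rows, right_rows, left_key, right_key):
    """Naive nested-loop inner join on equality of the two key columns."""
    return [
        {**l, **r}
        for l in left_rows
        for r in right_rows
        if l[left_key] == r[right_key]
    ]


orders = [{"customer_id": "1"}, {"customer_id": "2"}]   # strings, e.g. read from a CSV
customers = [{"id": 1}, {"id": 2}]                       # integers, e.g. read from a database

left_types, right_types = join_key_types(orders, customers, "customer_id", "id")
matches_before = inner_join(orders, customers, "customer_id", "id")

# Cast the string keys to int so both sides share one type, then join again.
fixed_orders = [{**o, "customer_id": int(o["customer_id"])} for o in orders]
matches_after = inner_join(fixed_orders, customers, "customer_id", "id")

print(left_types, right_types, len(matches_before), len(matches_after))
```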
References
- API Reference — Full DataStore method signatures
- Connectors — All 16+ data source connection methods
- Examples — 10+ runnable examples with expected output
- Verify Install — Environment verification script
- Official Docs
Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the chdb-sql skill. For contributing to chdb source code, see CLAUDE.md in the project root.