Skill v1.0.1
current · Automated scan: 100/100 · artificialanalysis/stirrup/data-analysis
3 files
Details
Published: May 18, 2026 at 05:45 AM
Content Hash: sha256:365334b22d4d8e15...
Git SHA: cd7f51f999af
Bump Type: patch
Files
Files (1 file, 4.3 KB)
SKILL.md · 4.3 KB · active
SKILL.md · 159 lines · 4.3 KB
version: "1.0.1"
name: data_analysis
description: High-performance data analysis using Polars - load, transform, aggregate, visualize and export tabular data. Use for CSV/JSON/Parquet processing, statistical analysis, time series, and creating charts.
Data Analysis Skill
Comprehensive data analysis toolkit built on Polars, a fast DataFrame library. This skill provides instructions, reference documentation, and ready-to-use scripts for common data analysis tasks.
Iteration Checkpoints
| Step | What to Present | User Input Type |
|---|---|---|
| Data Loading | Shape, columns, sample rows | "Is this the right data?" |
| Data Exploration | Summary stats, data quality issues | "Any columns to focus on?" |
| Transformation | Before/after comparison | "Does this transformation look correct?" |
| Analysis | Key findings, charts | "Should I dig deeper into anything?" |
| Export | Output preview | "Ready to save, or any changes?" |
Quick Start
```python
import polars as pl
from polars import col

# Load data
df = pl.read_csv("data.csv")

# Explore
print(df.shape, df.schema)
print(df.describe())

# Transform and analyze
result = (
    df.filter(col("value") > 0)
    .group_by("category")
    .agg(col("value").sum().alias("total"))
    .sort("total", descending=True)
)

# Export
result.write_csv("output.csv")
```
When to Use This Skill
- Loading datasets (CSV, JSON, Parquet, Excel, databases)
- Data cleaning, filtering, and transformation
- Aggregations, grouping, and pivot tables
- Statistical analysis and summary statistics
- Time series analysis and resampling
- Joining and merging multiple datasets
- Creating visualizations and charts
- Exporting results to various formats
Skill Contents
Reference Documentation
Detailed API reference and patterns for specific operations:
- `reference/loading.md` - Loading data from all supported formats
- `reference/transformations.md` - Column operations, filtering, sorting, type casting
- `reference/aggregations.md` - Group by, window functions, running totals
- `reference/time_series.md` - Date parsing, resampling, lag features
- `reference/statistics.md` - Correlations, distributions, hypothesis testing setup
- `reference/visualization.md` - Creating charts with matplotlib/plotly
Ready-to-Use Scripts
Executable Python scripts for common tasks:
- `scripts/explore_data.py` - Quick dataset exploration and profiling
- `scripts/summary_stats.py` - Generate comprehensive statistics report
Core Patterns
Loading Data
```python
# CSV (most common)
df = pl.read_csv("data.csv")

# Lazy loading for large files
df = pl.scan_csv("large.csv").filter(col("x") > 0).collect()

# Parquet (recommended for large datasets)
df = pl.read_parquet("data.parquet")

# JSON
df = pl.read_json("data.json")
df = pl.read_ndjson("data.ndjson")  # Newline-delimited
```
Filtering and Selection
```python
# Select columns
df.select("col1", "col2")
df.select(col("name"), col("value") * 2)

# Filter rows
df.filter(col("age") > 25)
df.filter((col("status") == "active") & (col("value") > 100))
df.filter(col("name").str.contains("Smith"))
```
Transformations
```python
# Add/modify columns
df = df.with_columns(
    (col("price") * col("qty")).alias("total"),
    col("date_str").str.to_date("%Y-%m-%d").alias("date"),
)

# Conditional values
df = df.with_columns(
    pl.when(col("score") >= 90).then(pl.lit("A"))
    .when(col("score") >= 80).then(pl.lit("B"))
    .otherwise(pl.lit("C"))
    .alias("grade")
)
```
Aggregations
```python
# Group by
df.group_by("category").agg(
    col("value").sum().alias("total"),
    col("value").mean().alias("avg"),
    pl.len().alias("count"),
)

# Window functions
df.with_columns(
    col("value").sum().over("group").alias("group_total"),
    col("value").rank().over("group").alias("rank_in_group"),
)
```
Exporting
```python
df.write_csv("output.csv")
df.write_parquet("output.parquet")
# Polars 1.0+ always writes row-oriented JSON; older releases needed row_oriented=True
df.write_json("output.json")
```
Best Practices
- Use lazy evaluation for large datasets: `pl.scan_csv()` + `.collect()`
- Filter early to reduce data volume before expensive operations
- Select only needed columns to minimize memory usage
- Prefer Parquet for storage - faster I/O, better compression
- Use `.explain()` on a lazy query to inspect and optimize its query plan