DataClean Studio — Fábio Silva

DataClean Studio end to end: connect a dataset, profile and validate it, clean it with logged and reversible steps, then export a clean file, a reproducible script, or a dashboard — all in one app.

The challenge no one talks about

Ask any data analyst where their time actually goes. Not where they want it to go — where it actually goes.

The honest answer is almost always the same: preparation. Fixing column types. Hunting duplicates. Chasing missing values. Writing the same pandas snippet for the hundredth time. Rebuilding the same Power Query flow in a new file, in a new project, for a new client.

The analysis — the part that requires thinking — comes after all of that. If there’s time left.

This is the most persistent and least glamorous challenge in data work. The tools that were supposed to solve it — SQL, Python, Power Query, Excel — are excellent at what they do. But they were never designed to work together. Every project starts from scratch. Every handoff creates friction. The thinking is always waiting for the plumbing.

DataClean Studio is the workbench built to remove that friction entirely — from raw data to boardroom-ready output, in one place, without rebuilding anything from scratch.

What it actually does

DataClean Studio takes a dataset — from a file or a live database — and walks it through a complete, structured workflow. Each stage is a discrete step with its own interface, its own logic, and its own outputs. None of them require leaving the application.

Connect

Upload CSV, Excel, or Parquet. Or connect directly to a live database: PostgreSQL, MySQL, SQL Server, SQLite. Load up to five datasets simultaneously into a shared workspace and relate them — the same multi-table model you’d build in Power BI, but without the licensing overhead.

Explore

Before any cleaning happens, the analyst needs to know what they’re dealing with. The Explore page delivers an instant structural overview: row counts, column types, null distributions, summary statistics, and distribution charts — without writing a single line of code. The paginated table view includes live row search. The entire profile of a dataset is visible in seconds.

Validate

This is where most projects lose hours they never get back.

The Quality engine runs a full automated diagnostic across every column: missing values flagged by severity, duplicate rows detected and counted, outliers identified at 3× IQR, type mismatches surfaced (numeric data stored as text is one of the most common silent errors in real-world datasets), inconsistent casing, invalid email formats, leading and trailing whitespace. Every column gets an individual quality score. The entire dataset gets an overall health score.

The output is a prioritised list of what to fix, not a wall of raw statistics. The analyst sees exactly what needs attention and in what order.

Clean

The Clean page offers two modes that work side by side.

The rule-based engine generates smart suggestions based on the Quality report — with before/after previews before anything is applied. Every operation is logged. Undo and redo are always available. No destructive changes, no irreversible steps.

For anything beyond the standard operations, the built-in AI Analyst handles freeform transformations via natural language. “Flag all rows where the date is before 2020 and the status is still active.” The analyst describes the transformation; the engine applies it. The AI operates on the actual dataset — it doesn’t hallucinate column names or invent values.

Model

When working across multiple datasets, the Model page replaces the ad-hoc join logic that typically lives in scattered SQL scripts or undocumented Power Query steps. Define relationships between tables visually: link columns, set cardinality, generate a Calendar date dimension table automatically and connect it to any date column in the workspace. The model is validated with an instant join preview before it’s used anywhere. Once defined, it propagates automatically to the analysis layer.

Analyse

The Analyse page is a full BI environment built on top of the cleaned, modelled data.

Auto-generated KPI dashboards give an immediate overview of any dataset on load. The chart builder supports 16+ chart types and 16+ advanced calculation types — running totals, moving averages, YoY comparisons, cohort retention, Pareto distributions, funnel drop-off — applied directly in the UI without writing formulas.

The AI dashboard builder takes a plain-English prompt and generates a complete multi-chart dashboard, using the full workspace model when multiple tables are active. The conversation continues after the build: add charts, modify types, recolour, remove. The slicer panel applies categorical, numeric, and date filters live across the entire dashboard.

Export

Four export paths from a single cleaned dataset:

CSV, Excel, or Parquet — the cleaned file, ready for downstream use
Python (pandas) or SQL script — a fully reproducible cleaning pipeline, auto-generated from the operations log
Database push-back — write the cleaned data directly back to the source database
PDF dashboard report — the current Analyse dashboard, exported as a styled document

The Python and SQL script exports are the feature most analysts underestimate until they use them. Every cleaning operation applied in the UI is translated into executable, version-controllable code. The session becomes documentation. The documentation becomes the pipeline.

The AI layer — and what it doesn’t do

DataClean Studio uses GPT-4o in two places: the Clean page AI Analyst, and the Analyse page dashboard builder and chat interface.

In both cases, the model operates on real data and real structure. It does not estimate column contents, infer schema, or generate synthetic values. It translates natural language into operations that the deterministic engine then executes — and those operations are logged, reviewable, and undoable.

The application runs fully without an API key. All AI features are disabled; everything else works identically. This is a deliberate architectural choice: the core workflow — connect, explore, validate, clean, model, export — does not depend on an external model. The AI accelerates work that already has a deterministic foundation. It does not replace it.

No vendor lock-in. No data sent anywhere except the optional OpenAI API call, which the analyst controls. Runs 100% locally.

Architecture

The backend is a Python 3.11 FastAPI service. Data processing runs on pandas, with file parsing for CSV, Excel, and Parquet via standard loaders and database connectivity via psycopg2, pymysql, and pyodbc. The quality engine and cleaning engine are deterministic rule-based services — no model involved. Session state is held in memory; the workspace resolver uses BFS join logic to traverse multi-table relationships. The entire backend exposes a documented REST API via Swagger.

The frontend is React 18 with a page-per-stage architecture that mirrors the workflow — Upload, Explore, Quality, Clean, Model, Analyse, Export. Global state is managed via React Context. The application is self-contained: no external dashboard service, no BI cloud dependency, no SaaS subscription required.

Deployment is a single command.

docker compose up --build

Backend and frontend spin up together. A Windows one-click launcher (start.bat) and a Mac/Linux shell script (start.sh) cover non-Docker environments. Render and Railway deployment blueprints are included in the repository.

Layer	Technologies
Backend	Python 3.11 · FastAPI · pandas · psycopg2 · pymysql · pyodbc · Uvicorn
Frontend	React 18 · Vite · Axios · Nginx
AI	GPT-4o (OpenAI) — cleaning analyst and dashboard builder only; fully optional
Deployment	Docker · Render · Railway
License	MIT — self-hosted, no vendor lock-in

The broader point

The data-preparation challenge isn’t a tooling gap in the sense of needing better tools. SQL, Python, and Power Query are mature, capable, and trusted. The gap is that none of them were designed to work together as a workflow — and every project pays the tax of reconnecting them from scratch.

DataClean Studio is not a replacement for those tools. It is the connective layer that makes the workflow coherent: a single environment where raw data enters, moves through a structured, auditable preparation process, and exits as either a clean file, a reproducible script, or a presentation-ready dashboard.

The AI layer is narrow and specific. It translates language into operations and prompts into charts. It does not carry the analysis — the deterministic engine does. The model is the last 5%, applied on top of a foundation that doesn’t need it to function.

That is the architecture worth building. The one where the AI is genuinely useful because everything underneath it was done properly first.