Tools & Platforms · MarkTechPost · June 2, 2026

TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions


BigSet is an open-source multi-agent system that builds structured datasets from natural-language descriptions. It runs a multi-step pipeline in which AI models infer a schema and agents gather the matching data from the web. The aim is to simplify data collection for researchers and businesses alike.

Author: Morein.ai Editorial

Building structured datasets from the web typically involves a complex workflow, including identifying data sources, writing scrapers, designing schemas, and managing updates. This process is often time-consuming and prone to breakage when upstream sites change.

TinyFish has released BigSet, an open-source multi-agent system designed to streamline this process. Licensed under AGPL-3.0, BigSet takes a natural-language description as input and produces a structured, exportable dataset derived from live web data. The complete codebase is accessible on GitHub.

BigSet acts as an intermediary between data requirements and a usable table. Users describe their data needs in a simple sentence, and the system infers the schema, dispatches agents to collect data, deduplicates results, and generates a downloadable CSV or XLSX file.
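The last two stages, deduplication and export, can be illustrated with a minimal sketch. This is not BigSet's code: the function names `dedupe` and `to_csv`, the sample rows, and the choice to deduplicate on a normalized company name are all assumptions made for illustration.

```python
import csv
import io


def dedupe(rows, key):
    """Keep the first row seen for each normalized value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        normalized = row[key].strip().lower()
        if normalized not in seen:
            seen.add(normalized)
            unique.append(row)
    return unique


def to_csv(rows, columns):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical rows, as agents might return them (note the near-duplicate).
rows = [
    {"company": "Acme", "location": "Berlin"},
    {"company": "acme ", "location": "Berlin"},
    {"company": "Globex", "location": "Austin"},
]
csv_text = to_csv(dedupe(rows, "company"), ["company", "location"])
print(csv_text)
```

A real system would likely need fuzzier entity matching than lowercasing and trimming; the sketch only shows where that step sits relative to export.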

For example, a user could request "YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles." BigSet would then automatically identify relevant columns, locate entities on the web, and populate the table. The system also supports scheduled refreshes, allowing datasets to update automatically at specified intervals.
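To make the example concrete, here is one way the inferred schema and refresh schedule for that request could be represented. The `Column` and `DatasetSpec` classes, the column names and types, and the daily interval are hypothetical; the article does not describe BigSet's actual schema format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str


@dataclass
class DatasetSpec:
    prompt: str
    columns: list[Column]
    refresh_every: timedelta  # assumed: refreshes run at a fixed interval

    def next_refresh(self, last_run: datetime) -> datetime:
        """Return when the dataset is next due for a refresh."""
        return last_run + self.refresh_every


# Hypothetical result of schema inference for the request quoted above.
spec = DatasetSpec(
    prompt=(
        "YC companies that are currently hiring engineers, with their "
        "funding stage, location, and number of open roles."
    ),
    columns=[
        Column("company", "string", "Company name"),
        Column("funding_stage", "string", "Most recent funding stage"),
        Column("location", "string", "Headquarters or primary office"),
        Column("open_roles", "integer", "Number of open engineering roles"),
    ],
    refresh_every=timedelta(days=1),
)

print([c.name for c in spec.columns])
print(spec.next_refresh(datetime(2026, 6, 2)))
```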

BigSet's architecture is a two-tier agent system rather than a single LLM call. The pipeline runs in five steps: schema inference using a model such as Claude Sonnet, broad discovery by an orchestrator agent, parallel fan-out to sub-agents that each handle an individual entity, deduplication with source attribution, and data export.
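The two-tier pattern described here can be sketched with `asyncio`. This is an illustrative outline under stated assumptions, not BigSet's implementation: `discover`, `enrich`, `merge`, and `build` are hypothetical names, both tiers are stubbed so that nothing touches the network, and the URLs are placeholders.

```python
import asyncio


async def discover(query):
    """Tier 1 (stub): the orchestrator returns (entity, source URL) leads."""
    return [
        ("Acme", "https://example.com/jobs"),
        ("Globex", "https://example.com/jobs"),
        ("Acme", "https://example.org/directory"),  # same entity, second source
    ]


async def enrich(entity, source):
    """Tier 2 (stub): one sub-agent handles a single lead."""
    await asyncio.sleep(0)  # stands in for real web requests
    return {"company": entity, "source": source}


def merge(records):
    """Deduplicate by company, keeping every source that reported it."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(
            rec["company"], {"company": rec["company"], "sources": []}
        )
        if rec["source"] not in entry["sources"]:
            entry["sources"].append(rec["source"])
    return list(merged.values())


async def build(query):
    leads = await discover(query)
    # Fan out: one sub-agent task per lead, all run concurrently.
    records = await asyncio.gather(*(enrich(e, s) for e, s in leads))
    return merge(records)


table = asyncio.run(build("YC companies hiring engineers"))
print(table)
```

`asyncio.gather` returns results in the order the tasks were passed in, so the merged table is deterministic even though the sub-agents run concurrently.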

BigSet is self-hosted, running on a user's own infrastructure via Docker. It requires API keys from OpenRouter, TinyFish, and Clerk for authentication and model calls.
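Before starting a self-hosted deployment, it can help to check that all three keys are present. The sketch below assumes the keys are supplied as environment variables; the variable names are hypothetical placeholders, not taken from BigSet's documentation, so the repository's own configuration files should be consulted for the real ones.

```python
import os

# Hypothetical variable names: check the project's docs for the actual ones.
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "TINYFISH_API_KEY", "CLERK_SECRET_KEY")


def missing_keys(env=None):
    """Return the required keys that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENROUTER_API_KEY": "sk-example"}))
```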

