2025-02-18 | By Mariusz Jażdżyk
When we use pre-built LLMs, we get responses that blend truth with half-truths, delivered with the finesse of a top consultant.
Meanwhile, real data is the crude oil of the economy.
Consciously or unconsciously, we rely on internet resources locked within these models. These resources are universal, not yet tailored to specific companies, and often even fabricated.
To change this, we need our own, real data.
In my book Chief Data Officer, I address this question in a chapter on data acquisition strategy. The opportunities have expanded, and they don't always align with tool licenses, but here are some sources:
First-party data (private data) – The most valuable and theoretically the easiest to obtain, yet practically difficult for analytics teams to access. Even when stored on company servers or in the cloud, firms struggle to collect and use it effectively.
Open data (from the internet) – Downloading and scraping it is often restricted, yet many companies still acquire it. Despite scraping bans, one of our services is visited by 3 million bots each month attempting to extract our content.
Data from LLMs – Knowledge stored in models can be valuable. Increasingly, we hear about projects from the other side of the world achieving great results by leveraging knowledge from existing models. Although licenses often prohibit this, extracting data from LLMs is possible with the right tools.
These are just a few examples. Once we have well-prepared data, only ~20% of the work remains—developing algorithms, which have become incredibly cheap and accessible (and will become even cheaper).
We don't focus on universal models, which are becoming commodities—cheap products available from major players. Instead, we seek industry-specific knowledge, which, combined with exclusive data, can enhance universal algorithms.
Author: Mariusz Jażdżyk
The author is a lecturer at Kozminski University, specializing in building data-driven organizations in startups. He teaches courses based on his book Chief Data Officer, where he explores the practical aspects of implementing data strategies and AI solutions.