Your entire analytics stack, just a few lines of code away

Amrutha Gujjar · 4 min read

Category: Trends


Data teams are often tasked with answering questions from departments like sales, revenue, and operations. You likely store your data in structured formats, such as relational databases or data lakes. Traditionally, setting up an analytics stack involves moving the data into a data warehouse, setting up complex ETL pipelines, and configuring different systems for storage, compute, querying, and dashboarding. The total cost of ownership is high. 

But what if there was a way to make the setup process simpler? What if you could automate the setup of your entire analytics stack with just a few lines of code?

The Traditional Approach: Complex, Expensive, and Slow

Many organizations have large amounts of structured data. However, to run analytics, this data is usually transferred into a data warehouse. This process often involves ETL pipelines to clean and prepare the data before it can be analyzed. These extra steps add time, complexity, and cost to the overall process.

This process involves several layers of complexity:

  • Data must often be moved into a data warehouse, creating redundancies and increased storage costs.

  • Managing multiple systems—databases, ETL pipelines, data warehouses, and BI tools—leads to inefficiencies.

  • Transformations and metadata management are often manual, error-prone, and bottlenecked on engineering resources.

These inefficiencies slow down teams, increase costs, and create barriers to scaling analytics capabilities.

The New Way: Define Your Entire Analytics Stack in Code

1. Automating Metadata Management with Code and AI

Metadata plays a crucial role in making your data accessible and easy to query. Traditionally, managing metadata involves manual work that can quickly become cumbersome. By using a code-driven approach, you can automate metadata management and create a dynamic, self-updating catalog that’s accessible to both technical and non-technical users.

Here’s how it works:

  • Metadata is automatically generated to describe the data's schema, data types, relationships, and partitioning. This is all handled in the background through code—no manual entry required.

  • With the help of AI, your metadata catalog is enriched with automatic classifications, data relationships, and tagging. AI identifies key patterns and relationships in the data, helping business users understand the context and use of each dataset without requiring technical expertise.

  • The generated metadata catalog isn’t just for data engineers. It’s built to be user-friendly and accessible to business users, allowing them to search, explore, and understand the available datasets. This self-service access empowers teams across the organization to make data-driven decisions without waiting for IT support.

  • As new data is ingested into storage, the metadata catalog is automatically updated, ensuring it remains current without requiring any manual intervention.
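
To make this concrete, here is a minimal sketch of what code-driven metadata generation can look like, using SQLAlchemy to introspect a PostgreSQL table and emit a catalog entry. The connection string, table name, and the build_catalog_entry helper are illustrative assumptions, and the AI enrichment step is stubbed out as a placeholder.

import json
from sqlalchemy import create_engine, inspect

def build_catalog_entry(connection_string, table_name):
    # Introspect the live schema instead of documenting it by hand
    engine = create_engine(connection_string)
    inspector = inspect(engine)
    return {
        "table": table_name,
        "columns": [
            {"name": col["name"], "type": str(col["type"]), "nullable": col["nullable"]}
            for col in inspector.get_columns(table_name)
        ],
        "primary_key": inspector.get_pk_constraint(table_name).get("constrained_columns", []),
        "foreign_keys": [fk["referred_table"] for fk in inspector.get_foreign_keys(table_name)],
        # Placeholder: an AI step could add tags, descriptions, and classifications here
        "tags": [],
    }

if __name__ == "__main__":
    entry = build_catalog_entry(
        "postgresql://my_user:my_password@localhost:5432/my_database",  # hypothetical credentials
        "sales_data",
    )
    print(json.dumps(entry, indent=2))

Regenerating an entry like this on a schedule, or whenever new data lands, is what keeps the catalog current without manual intervention.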

2. Connecting Data Sources Directly with Code/Config

In traditional analytics workflows, the metadata that defines how data is transformed, ingested, and visualized often gets locked away in SaaS tools or proprietary systems. This creates fragmentation and a lack of transparency. Now, imagine a system where the control plane — your code — governs the entire pipeline, so you have consistency across transformation, ingestion, visualization, etc.

Using code, you can connect to your data sources — whether it’s a relational database like PostgreSQL or structured data in a data lake — by defining everything in configuration files or code snippets. These settings include connection details, credentials, and transformation logic, all directly managed in the codebase. 

With this setup, the control plane is unified. Everything from schema definitions to transformation steps is visible and editable in code, giving you transparency, flexibility, and real-time adaptability as your data evolves. The configuration is simple and declarative. By managing metadata directly in code, you can maintain a consistent and extensible pipeline without being locked into rigid, tool-specific implementations.

This makes your data pipeline simpler, cleaner, and much easier to manage. The same setup works whether you're connecting to a database or pulling from a data lake—it's just code.
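
As a rough sketch, a code-managed source registry might look like the following. The source names, paths, and the SOURCES structure are hypothetical; in practice, credentials live in environment variables or a secrets file rather than in the codebase.

import os

# Hypothetical declarative registry of data sources, versioned alongside
# the transformation logic that consumes them.
SOURCES = {
    "postgres_sales": {
        "type": "postgres",
        # Credentials are read from the environment rather than hard-coded
        "connection": os.environ.get("SALES_DB_URL", "postgresql://localhost:5432/my_database"),
        "tables": ["sales_data", "customers"],
    },
    "lake_events": {
        "type": "parquet",
        "path": "s3://my-data-lake/events/",
    },
}

def get_source(name):
    # Every pipeline looks up connections here, so there is one source of truth
    return SOURCES[name]

Because this registry lives in the repository, changing a connection or adding a source becomes a reviewable pull request rather than a click-through in a SaaS admin panel.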

DLTHub simplifies the process of loading, transforming, and managing your data pipeline. With just a few lines of code, you can define the entire pipeline, from data ingestion to transformation.

ETL Example with DLTHub

Here’s how you can extract data from a PostgreSQL database, transform it, and load it into BigQuery using dlt (the example below reads the source table with SQLAlchemy inside a dlt resource):

This setup:

  1. Extracts data from PostgreSQL.

  2. Transforms it by calculating total_sales.

  3. Loads it into a destination (e.g., BigQuery).

import dlt
from sqlalchemy import create_engine, text

# Define a pipeline that loads into BigQuery
pipeline = dlt.pipeline(pipeline_name='sales_pipeline', destination='bigquery', dataset_name='sales_data')

# Extract data from PostgreSQL (via SQLAlchemy here; in practice, keep
# credentials in dlt's secrets.toml or environment variables, not inline)
@dlt.resource(table_name="sales_data")
def postgres_data():
    engine = create_engine("postgresql://my_user:my_password@localhost:5432/my_database")
    with engine.connect() as conn:
        for row in conn.execute(text("SELECT * FROM sales_data")):
            yield dict(row._mapping)

# Transformation logic: compute total_sales for each row
@dlt.transformer(data_from=postgres_data)
def add_total_sales(row):
    row['total_sales'] = row['quantity'] * row['price_per_unit']
    yield row

# Run the pipeline
if __name__ == "__main__":
    pipeline.run([add_total_sales])

3. BI as Code

Traditional BI tools like Tableau, Power BI, and Looker lock you into their proprietary systems, where the logic for dashboards, data computations, and visualizations is defined within the tool itself. BI as code changes this by allowing you to define the logic and structure of your dashboards directly in code, giving you more flexibility and control.

  • The logic of which dashboards to create and how the data should be processed is entirely defined in code. This replaces the need for manually configuring dashboards within a specific BI tool. You can create and modify dashboards through simple code, ensuring everything aligns with your unique business logic and requirements.

  • With traditional BI tools, the computation logic is hidden behind the tool's interface. BI as code allows you to have full control over how data is calculated and presented, ensuring your analytics pipeline is flexible and can evolve as your needs change.

  • BI as code replaces traditional BI tools by shifting the logic of your data visualizations and computations into code, offering more customization, automation, and control while avoiding rigid, vendor-locked systems.

Evidence.dev provides a streamlined way to define your BI layer using Markdown and SQL. This code-first approach ensures your dashboards remain version-controlled and reproducible.

Dashboard Example with Evidence.dev

Here’s how to create a simple report in Evidence.dev:

  1. Define Your Query: Add a SQL file in the queries/ folder.
-- queries/total_sales.sql
SELECT 
    customer_id, 
    product_id, 
    SUM(total_sales) AS total_sales, 
    DATE(sales_date) AS sales_date
FROM transformed_sales_data
GROUP BY customer_id, product_id, DATE(sales_date)
  2. Build the Report: Use Markdown in the pages/ folder to create a report.
# Sales Performance

## Total Sales Over Time

Use this chart to track how total sales evolve over time:

<LineChart data={total_sales} x=sales_date y=total_sales series=product_id />
  3. Run Your Report: Start the Evidence.dev dev server and navigate to your dashboard.
npm run dev

Conclusion

Building a modern analytics stack doesn’t have to be complicated. With a code-first approach, you can automate the entire setup process and start running analytics directly on structured data with minimal effort. This makes it easier, faster, and more cost-effective to access insights, without needing to move data, manage complex systems, or maintain separate databases.

By adopting a code-first approach to manage your analytics stack, you can unlock the power of your structured data while keeping costs low and operations simple.

Try Preswald today!

https://github.com/StructuredLabs/preswald