A few weeks ago I reviewed a production Fabric workspace. It had been running for 8 months with daily MERGE pipelines. Nobody had touched maintenance. The result: OneLake was storing triple what was needed. Obsolete Parquet files that served no purpose but sat there taking up space and running up costs.
It's the dumbest and most expensive problem I encounter in Fabric. And it's fixed with one command.
Why this happens
Delta tables store data in Parquet files. Every time you do an INSERT, UPDATE, DELETE, or MERGE, new files are created and the old ones get marked as obsolete. But they're not deleted — they stick around for time travel (querying previous table versions). Sounds useful in theory. In practice, if you have daily pipelines, within months a huge chunk of your OneLake is ghost files. And OneLake charges per GB stored. It's money thrown away, literally.
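If you want to see those versions piling up, Delta's history and time travel syntax make it visible. A minimal sketch from a Fabric notebook, assuming a hypothetical Lakehouse table called dim_customers:

# Minimal sketch, assuming a hypothetical Lakehouse table called dim_customers.
# Every MERGE adds a new version; the Parquet files behind old versions stay in OneLake.
spark.sql("DESCRIBE HISTORY dim_customers") \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)

# Time travel: query the table as it looked at version 3.
spark.sql("SELECT COUNT(*) FROM dim_customers VERSION AS OF 3").show()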
VACUUM: the command nobody schedules
VACUUM delta.`table_path` RETAIN 168 HOURS. That's it. It deletes obsolete files older than 7 days. You lose time travel beyond that window, which in production you rarely need. By default, Delta's safety check won't let you set a retention below 7 days unless you explicitly disable it, so you don't break anything by accident.
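If you want to see what a VACUUM would remove before committing to it, Delta's DRY RUN option lists the candidate files without deleting anything. A minimal sketch from a Fabric notebook, with a hypothetical table path you would replace with your own:

# Hypothetical OneLake path: replace with one of your own Lakehouse tables.
table_path = "abfss://<workspace_id>@onelake.dfs.fabric.microsoft.com/<lakehouse_id>/Tables/sales"

# DRY RUN only lists the files that would be deleted; nothing is removed yet.
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS DRY RUN").show(truncate=False)

# Once the list looks right, run the same statement without DRY RUN.
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")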
What you need to do: automate VACUUM as part of your pipeline (don't run it manually whenever you remember), combine it with OPTIMIZE to compact the small Parquet files that fragment queries, and set a 7-day retention, which is more than enough for most cases.
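In practice that boils down to a small maintenance cell per table, something like the sketch below (fact_sales is a hypothetical table name; adjust the retention to how much time travel you actually need):

# Minimal per-table maintenance step, assuming a hypothetical table fact_sales.
# OPTIMIZE compacts small Parquet files; VACUUM then removes obsolete files older than 7 days.
spark.sql("OPTIMIZE fact_sales")
spark.sql("VACUUM fact_sales RETAIN 168 HOURS")

Drop that cell at the end of the pipeline that writes the table and it runs without anyone having to remember it.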
What worries me: nobody watches this
This case is a perfect example of why a Fabric workspace can't be left on autopilot. If nobody understands how Delta Lake works underneath, if nobody schedules maintenance, if nobody checks how much storage is being consumed, the bill grows silently until someone gets a shock. I've seen workspaces where weekly VACUUM would have saved thousands per year. And nobody had noticed because nobody was looking.
Script: recursive VACUUM across the entire workspace
This Fabric notebook automatically traverses all Lakehouses in the workspace, finds all Delta tables, executes VACUUM, and generates a report with space recovered per table.
import sempy.fabric as fabric
from notebookutils import mssparkutils
from pyspark.sql.functions import col, round, desc
import time
# --- CONFIGURATION ---
RETENTION_HOURS = 168 # 168 hours = 7 days
WORKSPACE_ID = fabric.get_workspace_id()
# Optional: list of specific Lakehouse names.
# If empty [], the script processes ALL Lakehouses in the workspace.
LAKEHOUSES_FILTER = []  # e.g. ["LH_Sales", "LH_Logistics"]
# Optional: disable Delta's retention safety check only if you ever need to go
# below the default 7-day minimum (e.g. RETAIN 0). Not needed for RETAIN 168 HOURS.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# --- 1. GETTING LAKEHOUSES ---
print(">>> Inventorying Lakehouses...")
df_items = fabric.list_items(workspace=WORKSPACE_ID, type="Lakehouse")
if LAKEHOUSES_FILTER:
    df_items = df_items[df_items['Display Name'].isin(LAKEHOUSES_FILTER)]
lakehouses_info = []
for _, row in df_items.iterrows():
    lh_name = row['Display Name']
    lh_id = row['Id']
    abfss_path = f"abfss://{WORKSPACE_ID}@onelake.dfs.fabric.microsoft.com/{lh_id}/Tables"
    lakehouses_info.append((lh_name, abfss_path))
print(f"Detected {len(lakehouses_info)} Lakehouses to process.")
# --- 2. SUPPORT FUNCTIONS ---
all_tables = []
def find_delta_tables(path, lh_name):
    """Recursively searches for Delta tables and saves context."""
    try:
        items = mssparkutils.fs.ls(path)
        # A folder containing a _delta_log directory is a Delta table: record it and stop descending
        if any(item.name.rstrip("/") == "_delta_log" and item.isDir for item in items):
            all_tables.append((lh_name, path))
            return
        subdirs = [item.path for item in items if item.isDir and item.name.rstrip("/") != "_delta_log"]
        for subdir in subdirs:
            find_delta_tables(subdir, lh_name)
    except Exception:
        # Ignore paths that cannot be listed (permissions, broken shortcuts, etc.)
        pass
def get_folder_size(path):
    """Calculates folder size in bytes, recursively."""
    total = 0
    try:
        files = mssparkutils.fs.ls(path)
        for file in files:
            if file.isDir:
                total += get_folder_size(file.path)
            else:
                total += file.size
    except Exception:
        # Unreadable paths simply contribute 0 bytes
        pass
    return total
# --- 3. GLOBAL SEARCH ---
print("-" * 60)
for lh_name, lh_path in lakehouses_info:
    print(f">>> Scanning Lakehouse: {lh_name} ...")
    find_delta_tables(lh_path, lh_name)
print(f">>> Total tables found: {len(all_tables)}")
print("-" * 60)
# --- 4. RUNNING VACUUM ---
results = []
for i, (lh_name, table_path) in enumerate(all_tables):
    table_name_short = table_path.split("/")[-1]
    full_display_name = f"{lh_name}.{table_name_short}"
    print(f"[{i+1}/{len(all_tables)}] Processing: {full_display_name} ...", end=" ")
    size_pre_bytes = get_folder_size(table_path)
    size_pre_gb = size_pre_bytes / (1024**3)
    status = "OK"
    size_post_gb = size_pre_gb
    try:
        spark.sql(f"VACUUM delta.`{table_path}` RETAIN {RETENTION_HOURS} HOURS")
        # Short pause so the post-VACUUM listing reflects the deletions
        time.sleep(2)
        size_post_bytes = get_folder_size(table_path)
        size_post_gb = size_post_bytes / (1024**3)
        ahorro = size_pre_gb - size_post_gb
        print(f"Cleaned. Savings: {ahorro:.4f} GB")
    except Exception as e:
        print("FAILED.")
        status = f"Error: {str(e)[0:100]}"
    results.append((
        lh_name, table_name_short,
        size_pre_gb, size_post_gb,
        size_pre_gb - size_post_gb,
        status, table_path
    ))
# --- 5. FINAL REPORT ---
print("\n--- FINAL SUMMARY ---")
df_res = (
    spark.createDataFrame(
        results,
        ["Lakehouse", "Table", "GB_Before", "GB_After", "GB_Saved", "Status", "Path"]
    )
    .withColumn("GB_Before", round(col("GB_Before"), 4))
    .withColumn("GB_After", round(col("GB_After"), 4))
    .withColumn("GB_Saved", round(col("GB_Saved"), 4))
    .orderBy(desc("GB_Saved"))
)
display(df_res)
try:
    total_saved = df_res.agg({"GB_Saved": "sum"}).collect()[0][0]
    print(f"Total space recovered in Workspace: {total_saved if total_saved else 0:.4f} GB")
except Exception:
    print("Could not calculate total.")
The final report shows GB before, after, and saved per table, sorted from highest to lowest. Schedule the notebook to run weekly with a Fabric schedule and maintenance takes care of itself.