If you work with Lakehouses in Microsoft Fabric and no one takes care of their maintenance, your storage bill grows every day without you noticing. It's one of the most common — and most avoidable — problems I encounter in production workspaces.
The problem: ghost files that cost money
Delta tables store data in Parquet files. Every write operation (INSERT, UPDATE, DELETE, MERGE) creates new files and marks the previous ones as obsolete — but doesn't delete them. They're kept for time travel. If you have pipelines that run MERGE daily, in a few months a significant portion of your OneLake storage is files that are no longer useful. And OneLake charges by GB stored.
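To get a feel for the scale, here is a back-of-the-envelope sketch. The numbers are hypothetical, not measured from any real workspace:

```python
def obsolete_gb(table_gb: float, rewritten_fraction: float, days: int) -> float:
    """GB of obsolete Parquet files left behind after `days` daily MERGEs,
    assuming each MERGE rewrites a fixed fraction of the table's files
    and nothing is ever vacuumed. Illustrative model only."""
    return table_gb * rewritten_fraction * days

# A 100 GB table whose daily MERGE touches 10% of its files carries
# roughly 300 GB of dead data after a month: three times the live table.
print(obsolete_gb(100, 0.10, 30))  # 300.0
```

The real rewrite fraction depends on your merge keys and file layout, but the linear growth is the point: it never stops on its own.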
The solution: VACUUM
The VACUUM command deletes obsolete files once they exceed a retention period. The syntax: VACUUM delta.`table_path` RETAIN 168 HOURS. This removes everything older than 7 days: you give up time travel to versions beyond that window, but you recover the space. Delta includes a safety check that blocks retention periods shorter than 7 days.
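As a sketch of the mechanics, a tiny helper that builds the statement, converting days into the hours RETAIN expects. The helper and the table path are illustrative, not a Fabric API. Delta also accepts a trailing DRY RUN, which lists the files that would be deleted without removing anything:

```python
def vacuum_sql(table_path: str, retention_days: int = 7, dry_run: bool = False) -> str:
    """Build a VACUUM statement for the Delta table at `table_path`.
    RETAIN takes hours, so 7 days becomes 168 hours."""
    sql = f"VACUUM delta.`{table_path}` RETAIN {retention_days * 24} HOURS"
    return sql + " DRY RUN" if dry_run else sql

print(vacuum_sql("Tables/sales"))
# VACUUM delta.`Tables/sales` RETAIN 168 HOURS
# In a Fabric notebook you would run: spark.sql(vacuum_sql("Tables/sales"))
```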
What you need to do is simple: automate VACUUM as part of your data pipeline, combine with OPTIMIZE (which compacts small Parquet files), and define coherent retention (7 days is usually sufficient in production).
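That maintenance step can be sketched as code. The helper name and defaults are mine, not part of Fabric: OPTIMIZE runs first to compact small Parquet files, then VACUUM drops unreferenced files older than the retention window.

```python
def maintenance_sql(table_path: str, retention_hours: int = 168) -> list[str]:
    """Statements for one maintenance pass over a Delta table:
    OPTIMIZE compacts small Parquet files; VACUUM then deletes
    unreferenced files older than the retention window."""
    return [
        f"OPTIMIZE delta.`{table_path}`",
        f"VACUUM delta.`{table_path}` RETAIN {retention_hours} HOURS",
    ]

# In the notebook: for stmt in maintenance_sql(path): spark.sql(stmt)
```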
Why this matters: without administration, the bill skyrockets
This is a perfect example of why a Fabric workspace needs someone to administer it. Without someone who understands how Delta Lake works, who configures maintenance, and who monitors storage, the bill grows silently and indefinitely. I've seen workspaces where a simple scheduled VACUUM would have saved a significant amount every month. If a junior starts creating Lakehouses and loading data without knowing that maintenance exists, the result is the same: uncontrolled storage and unnecessary cost.
Script: recursive VACUUM across the entire workspace
This Fabric notebook automatically traverses all Lakehouses in the workspace, finds all Delta tables, executes VACUUM, and generates a report with space recovered per table.
import sempy.fabric as fabric
from notebookutils import mssparkutils
from pyspark.sql.functions import col, round, desc
import time
# --- CONFIGURATION ---
RETENTION_HOURS = 168 # 168 hours = 7 days
WORKSPACE_ID = fabric.get_workspace_id()
# Optional: List of specific Lakehouse names.
# If empty [], will process ALL in the workspace.
LAKEHOUSES_FILTER = [] # E.g: ["LH_Sales", "LH_Logistics"]
# Safety check: Delta blocks retention below 168 hours because files still in
# use by concurrent readers could be deleted. Only disable it if you
# deliberately need a shorter window (e.g. RETAIN 0).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# --- 1. GETTING LAKEHOUSES ---
print(">>> Inventorying Lakehouses...")
df_items = fabric.list_items(workspace=WORKSPACE_ID, type="Lakehouse")
if LAKEHOUSES_FILTER:
    df_items = df_items[df_items['Display Name'].isin(LAKEHOUSES_FILTER)]
lakehouses_info = []
for _, row in df_items.iterrows():
    lh_name = row['Display Name']
    lh_id = row['Id']
    abfss_path = f"abfss://{WORKSPACE_ID}@onelake.dfs.fabric.microsoft.com/{lh_id}/Tables"
    lakehouses_info.append((lh_name, abfss_path))
print(f"Detected {len(lakehouses_info)} Lakehouses to process.")
# --- 2. SUPPORT FUNCTIONS ---
all_tables = []
def find_delta_tables(path, lh_name):
    """Recursively searches for Delta tables and records them with their Lakehouse."""
    try:
        items = mssparkutils.fs.ls(path)
        # A folder containing _delta_log is a Delta table: record it and stop descending.
        if any(item.name == "_delta_log" and item.isDir for item in items):
            all_tables.append((lh_name, path))
            return
        subdirs = [item.path for item in items if item.isDir and item.name != "_delta_log"]
        for subdir in subdirs:
            find_delta_tables(subdir, lh_name)
    except Exception:
        # Skip folders we cannot list (permissions, transient errors).
        pass
def get_folder_size(path):
    """Calculates a folder's size in bytes, recursively."""
    total = 0
    try:
        files = mssparkutils.fs.ls(path)
        for file in files:
            if file.isDir:
                total += get_folder_size(file.path)
            else:
                total += file.size
    except Exception:
        pass
    return total
# --- 3. GLOBAL SEARCH ---
print("-" * 60)
for lh_name, lh_path in lakehouses_info:
    print(f">>> Scanning Lakehouse: {lh_name} ...")
    find_delta_tables(lh_path, lh_name)
print(f">>> Total tables found: {len(all_tables)}")
print("-" * 60)
# --- 4. RUNNING VACUUM ---
results = []
for i, (lh_name, table_path) in enumerate(all_tables):
    table_name_short = table_path.split("/")[-1]
    full_display_name = f"{lh_name}.{table_name_short}"
    print(f"[{i+1}/{len(all_tables)}] Processing: {full_display_name} ...", end=" ")
    size_pre_bytes = get_folder_size(table_path)
    size_pre_gb = size_pre_bytes / (1024**3)
    status = "OK"
    size_post_gb = size_pre_gb
    try:
        spark.sql(f"VACUUM delta.`{table_path}` RETAIN {RETENTION_HOURS} HOURS")
        time.sleep(2)  # brief pause before re-measuring the folder
        size_post_bytes = get_folder_size(table_path)
        size_post_gb = size_post_bytes / (1024**3)
        ahorro = size_pre_gb - size_post_gb
        print(f"Cleaned. Savings: {ahorro:.4f} GB")
    except Exception as e:
        print("FAILED.")
        status = f"Error: {str(e)[0:100]}"
    results.append((
        lh_name, table_name_short,
        size_pre_gb, size_post_gb,
        size_pre_gb - size_post_gb,
        status, table_path
    ))
# --- 5. FINAL REPORT ---
print("\n--- FINAL SUMMARY ---")
df_res = (
    spark.createDataFrame(
        results,
        ["Lakehouse", "Table", "GB_Before", "GB_After", "GB_Saved", "Status", "Path"]
    )
    .withColumn("GB_Before", round(col("GB_Before"), 4))
    .withColumn("GB_After", round(col("GB_After"), 4))
    .withColumn("GB_Saved", round(col("GB_Saved"), 4))
    .orderBy(desc("GB_Saved"))
)
display(df_res)
try:
    total_saved = df_res.agg({"GB_Saved": "sum"}).collect()[0][0]
    print(f"Total space recovered in Workspace: {total_saved if total_saved else 0:.4f} GB")
except Exception:
    print("Could not calculate total.")
The final report shows GB before, GB after, and GB saved per table, sorted from highest to lowest. Schedule the notebook to run weekly in Fabric and maintenance takes care of itself.