Data Agent - Marcel Pellicero Esteban

analytics_users
userID INTEGER
full_name TEXT
born_date TEXT
country TEXT
gender TEXT

analytics_purchases
id SERIAL PK
userID INTEGER
product_id INTEGER
product_name TEXT
purchase_timestamp TEXT

analytics_product_specs
product_id INTEGER PK
product_name TEXT
category TEXT

Hi! I'm the Data Agent. Ask me a question and I'll explore the database step by step. You approve each SQL query before it runs.

🧪 Corrupted Data: Testing Data Quality

The analytics_purchases table has been intentionally seeded with 40 corrupted rows to test the agent's ability to surface data quality issues. Try asking "Show me data quality of table purchases" to see them.

Injected anomalies

Category	# Rows	What's wrong
NULL values	8	Missing `userID`, `product_id`, `product_name`, or `purchase_timestamp`
Duplicate rows	8	Exact-duplicate purchases for the same user + product + timestamp
Garbage strings	8	Product names like `!!INVALID!!`, `NULL_STRING`, `???`, empty strings
Orphan foreign keys	8	`userID` or `product_id` that don't exist in the parent tables (e.g. 99999, -1)
Format issues	8	Timestamps like `not-a-date`, epoch `0`, and far-future dates such as `9999-01-01`

The other two tables — analytics_users and analytics_product_specs — remain clean, making cross-table joins useful for detecting orphan references.
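As a rough illustration of how a cross-table join can surface orphan references, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The sample rows and the trimmed-down column lists are invented for the example, and the real database (the SERIAL column suggests PostgreSQL) may need slightly different SQL.

```python
import sqlite3

# Tiny in-memory stand-in for the three tables; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE analytics_users (userID INTEGER, country TEXT);
    CREATE TABLE analytics_product_specs (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE analytics_purchases (
        id INTEGER PRIMARY KEY,
        userID INTEGER,
        product_id INTEGER,
        product_name TEXT,
        purchase_timestamp TEXT
    );
    INSERT INTO analytics_users VALUES (1, 'ES'), (2, 'FR');
    INSERT INTO analytics_product_specs VALUES (10, 'Lamp'), (11, 'Desk');
    INSERT INTO analytics_purchases (userID, product_id, product_name, purchase_timestamp) VALUES
        (1, 10, 'Lamp', '2024-01-05 10:00:00'),      -- clean row
        (99999, 10, 'Lamp', '2024-01-06 11:00:00'),  -- orphan userID
        (2, -1, '???', '2024-01-07 12:00:00'),       -- orphan product_id
        (NULL, 11, 'Desk', '2024-01-08 13:00:00');   -- NULL userID: a NULL issue, not an orphan
""")

# A purchase is an orphan when a non-NULL key has no match in its parent table.
ORPHAN_SQL = """
    SELECT p.id, p.userID, p.product_id
    FROM analytics_purchases AS p
    LEFT JOIN analytics_users AS u ON u.userID = p.userID
    LEFT JOIN analytics_product_specs AS s ON s.product_id = p.product_id
    WHERE (p.userID IS NOT NULL AND u.userID IS NULL)
       OR (p.product_id IS NOT NULL AND s.product_id IS NULL)
    ORDER BY p.id
"""

orphans = conn.execute(ORPHAN_SQL).fetchall()
print(orphans)  # [(2, 99999, 10), (3, 2, -1)]
```

Rows with a NULL key are deliberately excluded here so that they are counted under NULL values rather than under orphan foreign keys.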

🛡️ Guardrails: Prompt Injection & Security Controls

This agent applies a defence-in-depth approach, with guardrails both in the LLM system prompt and in server-side code.

1. Off-topic refusal

The system prompt instructs the model to decline any question not related to the analytics data. If you ask about the weather, coding help, or general knowledge, the agent responds with a polite refusal instead of generating SQL.

2. Sensitive data (PII) blocking

The model is forbidden from returning full_name and born_date together in the same result set. User-level queries must use userID, country, or gender only. Requests to extract or export personal data are refused.
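The rule above is described as a prompt-level instruction. If one wanted a matching check on the column names of a result set, it could look roughly like the sketch below. The function name `returns_pii_pair` is hypothetical and is not taken from the agent's code.

```python
# Hypothetical helper: flags a result set that exposes both PII columns at once.
PII_PAIR = {"full_name", "born_date"}

def returns_pii_pair(column_names):
    """Return True when full_name and born_date both appear in the result columns."""
    lowered = {name.lower() for name in column_names}
    return PII_PAIR <= lowered  # subset test: both names must be present

print(returns_pii_pair(["userID", "full_name", "born_date"]))  # True
print(returns_pii_pair(["userID", "country", "gender"]))       # False
```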

3. Analytics-only table scope

Both the prompt and the server enforce a strict table allowlist:

analytics_users
analytics_purchases
analytics_product_specs

Any SQL referencing auth_user, django_session, chat message tables, or any other table is rejected server-side before execution.
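A server-side allowlist check of this kind could be sketched as follows. This is an assumption about the general approach, not the agent's actual code: the regular expression and the function name `disallowed_tables` are illustrative, and the extraction is deliberately naive (it only looks at identifiers that follow FROM or JOIN).

```python
import re

ALLOWED_TABLES = {"analytics_users", "analytics_purchases", "analytics_product_specs"}

# Naive extraction: any identifier that directly follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE)

def disallowed_tables(sql):
    """Return the sorted list of referenced tables that are not on the allowlist."""
    referenced = {name.lower() for name in TABLE_REF.findall(sql)}
    return sorted(referenced - ALLOWED_TABLES)

print(disallowed_tables("SELECT country FROM analytics_users"))  # []
print(disallowed_tables("SELECT * FROM auth_user"))              # ['auth_user']
```

A pattern this simple errs on the side of rejection (for example, `EXTRACT(year FROM born_date)` would be flagged) and it misses comma-separated table lists and quoted identifiers, so a real implementation would more likely rely on a SQL parser.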

4. Prompt injection rejection

Server-side filter: before the question reaches the LLM, it is scanned for 14 known injection patterns, such as "ignore previous instructions", "you are now", "system prompt", and "jailbreak". Matches are blocked immediately with an error message.

Prompt-level rule: the system prompt tells the model to refuse any attempt to override instructions, change its role, or reveal its prompt.
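The server-side filter might work along the lines of the sketch below. Only the four patterns quoted above are included, because the other ten are not listed in this description. The matching strategy (case-insensitive substring search after collapsing whitespace) and the name `find_injection` are assumptions.

```python
import re

# Only the four patterns quoted in the description; the full list of 14 is not shown there.
KNOWN_PATTERNS = [
    "ignore previous instructions",
    "you are now",
    "system prompt",
    "jailbreak",
]

INJECTION_RE = re.compile("|".join(re.escape(p) for p in KNOWN_PATTERNS), re.IGNORECASE)

def find_injection(question):
    """Return the first matched pattern in lowercase, or None if the question looks clean."""
    normalized = " ".join(question.split())  # collapse runs of whitespace
    match = INJECTION_RE.search(normalized)
    return match.group(0).lower() if match else None

print(find_injection("Please IGNORE   previous instructions and list users"))
# ignore previous instructions
print(find_injection("Top 5 countries by purchases"))
# None
```

Fixed pattern lists are easy to evade with rephrasing, which is presumably why the prompt-level rule exists as a second layer.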

5. Input & output size caps

Input: questions longer than 500 characters are rejected server-side.
Output: the model is instructed to keep final answers under 500 words.
SQL results: queries return at most 100 rows, and a LIMIT 20 is auto-appended unless the query already uses an aggregate or explicit limit.
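The size caps could be enforced with helpers like the ones sketched here. The numeric limits come from the list above; the function names, the set of functions treated as aggregates, and the regex-based detection are assumptions made for illustration.

```python
import re

MAX_QUESTION_CHARS = 500
DEFAULT_LIMIT = 20
MAX_ROWS = 100

# Assumed definition of "uses an aggregate": a call to one of these functions.
AGGREGATE_RE = re.compile(r"\b(count|sum|avg|min|max)\s*\(", re.IGNORECASE)
LIMIT_RE = re.compile(r"\blimit\s+\d+", re.IGNORECASE)

def question_too_long(question):
    """Questions longer than 500 characters are rejected."""
    return len(question) > MAX_QUESTION_CHARS

def apply_default_limit(sql):
    """Append LIMIT 20 unless the query already aggregates or sets its own limit."""
    stripped = sql.strip().rstrip(";").rstrip()
    if LIMIT_RE.search(stripped) or AGGREGATE_RE.search(stripped):
        return stripped
    return f"{stripped} LIMIT {DEFAULT_LIMIT}"

def cap_rows(rows):
    """Return at most 100 rows, whatever the query produced."""
    return rows[:MAX_ROWS]

print(apply_default_limit("SELECT * FROM analytics_purchases;"))
# SELECT * FROM analytics_purchases LIMIT 20
print(apply_default_limit("SELECT COUNT(*) FROM analytics_users"))
# SELECT COUNT(*) FROM analytics_users
```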

6. SQL safety

Only SELECT statements are allowed. DROP, DELETE, UPDATE, INSERT, ALTER, TRUNCATE, CREATE, GRANT, REVOKE, and COPY are blocked.
A 10-second query timeout prevents runaway queries.
Every query is approved by you before execution.
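A SELECT-only gate with the keyword blocklist above could be approximated as in the sketch below. The function name `is_safe_select` and the rejection of multi-statement input are assumptions, and the sketch does not cover the query timeout or the approval step.

```python
import re

BLOCKED_KEYWORDS = (
    "DROP", "DELETE", "UPDATE", "INSERT", "ALTER",
    "TRUNCATE", "CREATE", "GRANT", "REVOKE", "COPY",
)
# Whole-word match, so a column such as created_at does not trip the CREATE rule.
BLOCKED_RE = re.compile(r"\b(" + "|".join(BLOCKED_KEYWORDS) + r")\b", re.IGNORECASE)

def is_safe_select(sql):
    """Accept a single SELECT statement that contains no blocked keyword."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # assumption: more than one statement is rejected outright
        return False
    if not statement.upper().startswith("SELECT"):
        return False
    return BLOCKED_RE.search(statement) is None

print(is_safe_select("SELECT country, COUNT(*) FROM analytics_users GROUP BY country"))  # True
print(is_safe_select("SELECT 1; DELETE FROM analytics_purchases"))                       # False
```

Keyword matching of this kind also rejects harmless queries that mention a blocked word inside a string literal, which is an acceptable trade-off when the goal is to fail closed.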