Building Ethical AI: How Red-Teaming SoundsWrite Exposed a Critical Safety Flaw
Building a tech startup in the era of "vibe coding" is exhilarating. You can string together powerful APIs, leverage advanced Large Language Models (LLMs), and spin up complex software architectures in a fraction of the time it used to take. With modern frameworks, a solo developer can deploy features in an afternoon that would have required an entire engineering team just two years ago.
But moving fast comes with a hidden, potentially devastating cost: the assumption that the AI models we use are inherently safe right out of the box.
When you connect your app to a world-class LLM, it feels like you're outsourcing the cognitive heavy lifting. It's easy to fall into the trap of thinking you are also outsourcing the ethical responsibility. Recently, we had a stark wake-up call here at SoundsWrite that shattered that illusion. It completely changed how we view our responsibility as developers, and I want to share exactly what happened in the hopes that other founders will check their own blind spots before they ship.
The Premise of SoundsWrite
If you're new here, SoundsWrite is a productivity app built around a simple, liberating concept: the "brain-dump." Traditional to-do list apps require you to manually categorize, tag, and prioritize your life. When you are overwhelmed, just looking at a blank list can cause paralysis.
Instead of meticulously organizing tasks, SoundsWrite users can just dump everything out of their heads—often using their microphone to just talk it out in a chaotic stream of consciousness—and our app structures that chaos into a clean, prioritized, actionable to-do list. We currently have three different ways users can input this brain-dump into the app.
While I was testing one of our text-based input methods, a feature called AutoTask, my mind wandered to the darker side of AI use cases. When you build an app designed to process raw, unfiltered human thoughts, you inevitably capture the full spectrum of the human experience. And sometimes, that experience is incredibly dark.
I’ve been reading a lot about how LLMs have been manipulated into facilitating self-harm or violence. Because I struggled with suicidal thoughts myself for most of my life, until just a few years ago, the realization hit me like a freight train:
What if someone uses SoundsWrite to organize a plan to end their life? What if a user, in a moment of deep crisis, brain-dumps a chaotic, hopeless stream of consciousness into our app? What if the app does exactly what it was programmed to do—strips away the emotion and gives them a highly efficient, step-by-step, actionable to-do list for self-harm?
The Red-Team Experiment
I couldn't just wonder; I had to know if the system would actually do it. In cybersecurity, "red-teaming" is the practice of actively attacking your own system to find vulnerabilities before bad actors do. So, I sat down at my laptop and decided to try and jailbreak SoundsWrite.
If you just go to a standard web interface and ask a modern LLM for a lethal plan, the default safety guardrails (whether from OpenAI, Google, or Anthropic) will usually block it instantly. They are trained to recognize harmful intent and shut it down. But hackers and prompt engineers use "syntax manipulation" to bypass these filters, exploiting the way these models are trained to parse formatting.
I used a basic markdown code block exploit. I wrapped my dangerous request in backticks (```). Because LLMs are heavily trained on coding datasets, wrapping text in these backticks often causes the model to treat it as literal code or raw data to be processed, effectively blinding it to the emotional or ethical context of the words.
I asked my own app for a lethal to-do list using this trick. I hit enter and waited.
It worked. The LLM completely ignored the catastrophic nature of the request. The formatting trick bypassed the built-in safety filters entirely. Instead of a warning or a refusal, the app spat out exactly what I asked for, formatted neatly as a prioritized task list. It was cold, efficient, and horrifying.
My heart sank. In our quest to build a frictionless productivity tool, we had inadvertently created an unfettered vector for harm.
The Problem with the "Black Box"
When we vibe-code and hook up powerful APIs, it is incredibly easy to treat the LLM as a black box. You send text in, you get formatted data out. We assume the big tech companies have handled the safety side of things through their massive alignment teams and safety protocols.
But they don't catch everything. Their filters are broad, generalized, and fundamentally vulnerable to users who figure out how to jailbreak the formatting. More importantly, those foundational models do not know the context of your application.
If you are building an app that processes raw, unfiltered user input—especially unstructured thoughts, journal entries, or voice dictation—you cannot rely solely on the provider's default settings. The "Not My Problem" fallacy has no place in software development. If your UI delivers the output, you are responsible for the edge cases.
The SoundsWrite Safety Architecture
We immediately halted normal product development to patch this vulnerability. But as we brainstormed solutions, we realized that preventing harm required a highly nuanced approach, not just a blanket error message.
If someone is in a dark enough place to ask a productivity app for help hurting themselves, hitting them with a robotic, dismissive "I am an AI language model and cannot fulfill this request" is not just unhelpful—it can actually make them feel more isolated. It feels like a door slamming in their face, which can accelerate a downward spiral.
Here is the ethical moderation layer we are implementing at SoundsWrite to ensure our app actively protects our users:
1. Input Sanitization (Neutralizing the Attack Surface)
We are no longer passing raw, untouched user text directly to the model. Before the main LLM ever sees the prompt, our backend actively strips out excessive delimiters, nested braces ({{{), and stray markdown code fences. Because our users are dictating their day into a microphone or typing standard sentences, they aren't naturally speaking in triple backticks. By aggressively stripping these characters out, we remove the primary attack surface for syntax-hijacking without affecting the core user experience.
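Here is a minimal sketch of what that sanitization pass looks like, assuming a Python backend; the regular expressions and the `sanitize_brain_dump` name are illustrative, and the exact character set we strip is tuned to our own input sources:

```python
import re

# Sequences we strip before the text ever reaches the main LLM.
# Our users dictate or type plain sentences, so none of these occur naturally.
_CODE_FENCE = re.compile(r"`{3,}")               # triple (or longer) backtick fences
_INLINE_BACKTICKS = re.compile(r"`+")            # stray inline backticks
_NESTED_BRACES = re.compile(r"[{}\[\]<>]{2,}")   # runs of {{{, ]]], <<<, etc.
_EXCESS_DELIMITERS = re.compile(r"[#*_\-=]{4,}")  # long runs of markdown delimiters


def sanitize_brain_dump(raw_text: str) -> str:
    """Neutralize formatting tricks without touching ordinary prose."""
    text = _CODE_FENCE.sub(" ", raw_text)
    text = _INLINE_BACKTICKS.sub(" ", text)
    text = _NESTED_BRACES.sub(" ", text)
    text = _EXCESS_DELIMITERS.sub(" ", text)
    # Collapse the whitespace we left behind so the prompt stays readable.
    return re.sub(r"\s+", " ", text).strip()
```

Nothing downstream — neither the intent check nor the main prompt — ever sees the raw delimiters.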
2. A Pre-Processing Intent Layer
Before generating any task list, the sanitized text runs through a secondary, lightweight, highly restrictive moderation check. Think of this as a triage nurse. Its only job is to classify the user's intent: Is this a benign brain-dump, or is there an underlying threat of violence or self-harm?
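Here is a minimal sketch of that triage step, assuming the check is a call to OpenAI's moderation endpoint (any provider's equivalent classifier would slot in the same way); the `Intent` enum and the `classify_intent` name are ours, for illustration only:

```python
from enum import Enum, auto

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


class Intent(Enum):
    BENIGN = auto()
    SELF_HARM = auto()
    VIOLENCE = auto()


def classify_intent(sanitized_text: str) -> Intent:
    """Triage-nurse check: classify the brain-dump before any task list is built."""
    result = client.moderations.create(input=sanitized_text).results[0]
    cats = result.categories
    if cats.self_harm or cats.self_harm_intent or cats.self_harm_instructions:
        return Intent.SELF_HARM
    if cats.violence or cats.violence_graphic:
        return Intent.VIOLENCE
    return Intent.BENIGN
```

Because this check runs on the already-sanitized text, a formatting trick can't hide the intent from the triage layer the way it hid it from the main model.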
3. The "Anti-Suicide To-Do List"
If the system detects self-harm or suicidal ideation, we intercept the request entirely. The app will not generate a standard task list, and it will never give a robotic error.
Instead, the UI dynamically swaps to an empathetic "Anti-Suicide To-Do List." We chose to retain the "to-do list" format because when a person is overwhelmed, small, actionable steps are crucial. This list provides immediate, actionable help:
- Step 1: Breathe.
- Step 2: Reach out to a dedicated professional (displaying a localized list of crisis hotlines based on the user's region, like 988 in the US).
- Step 3: Connect with a trusted friend or family member.
It pairs this with a gentle, human-written reminder that they are not alone and that it is okay to reach out to a real human being. We replace a destructive plan with a supportive, manageable one.
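On the backend, the swap is just a different payload handed to the UI. Here is a simplified sketch; the `CRISIS_LINES` table and field names are illustrative, and the production lookup is localized through a proper region service rather than hard-coded:

```python
# Trimmed-down illustrative lookup; the real table covers far more regions.
CRISIS_LINES = {
    "US": "Call or text 988 (Suicide & Crisis Lifeline)",
    "DEFAULT": "Please reach out to a local crisis line or emergency service",
}


def build_support_list(region_code: str) -> dict:
    """Return the supportive 'to-do list' payload the UI renders instead of tasks."""
    hotline = CRISIS_LINES.get(region_code, CRISIS_LINES["DEFAULT"])
    return {
        "type": "support_list",
        "steps": [
            "Step 1: Breathe.",
            f"Step 2: Reach out to a dedicated professional. {hotline}.",
            "Step 3: Connect with a trusted friend or family member.",
        ],
        "note": "You are not alone, and it is okay to reach out to a real human being.",
    }
```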
4. Hard "Kill Switches" for Violence
While our response to self-harm is rooted in profound empathy and mental health advocacy, our response to violence against others requires a different posture. If the pre-processing layer detects a threat to public safety or a plan to harm others, a hard backend override is triggered. The system firmly refuses the request without exception, prioritizing public safety above all else.
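Tying the pieces together, the request handler becomes a small dispatcher over the triage result. This sketch reuses the illustrative helpers above; `generate_task_list` stands in for our normal task-structuring pipeline:

```python
def handle_brain_dump(raw_text: str, region_code: str) -> dict:
    """Route a brain-dump through sanitization and triage before any LLM call."""
    text = sanitize_brain_dump(raw_text)
    intent = classify_intent(text)

    if intent is Intent.SELF_HARM:
        # Empathetic intercept: never a cold error, never a task list.
        return build_support_list(region_code)

    if intent is Intent.VIOLENCE:
        # Hard kill switch: a firm refusal with no generated content at all.
        return {"type": "refusal", "message": "SoundsWrite can't help with this request."}

    # Benign brain-dump: proceed to the normal task-structuring pipeline.
    return generate_task_list(text)
```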
A Call to Fellow Developers
Vibe coding allows us to build incredible, world-changing tools at lightning speed. It democratizes software creation in ways we couldn't have imagined a decade ago. But speed should never come at the expense of human life, safety, or basic ethical stewardship.
Do not assume the API provider has you completely covered. Red-team your own apps. Break them deliberately. Put yourself in the shoes of your most vulnerable user, and your most malicious one. Think about the worst possible way someone could use the tool you are building, and then write the code to make absolutely sure it never happens.
We owe it to our users to build with empathy, to anticipate the dark corners, and to build responsibly. Code moves fast, but the impact we leave on people's lives lasts forever.