Voice To Structured Vikunja Tasks (Offline-First, AI)

I wanted to share a personal productivity workflow I’ve developed for capturing ideas as voice notes, especially when offline, and having them automatically processed and turned into structured tasks in Vikunja.

The inspiration for this comes from the idea that you should never lose a thought.

“Many times you forget them. And if you forget a good idea - you wanna commit suicide.”
(David Lynch)

https://youtu.be/LVeAuwU-uuU?si=s159lKLdZFjD-jnQ

While extreme xD, it highlights how crucial it is to capture ideas with all their details intact before they fade.

This setup is especially for those moments - in an underground parking garage, on a hike in the woods, or traveling abroad with spotty internet - when you can’t afford to lose a valuable thought. It turns a quick voice memo into a perfectly formatted Vikunja task, waiting for you the next morning.

This is a fairly advanced, self-hosted setup that requires some technical knowledge. I’m sharing the concept and architecture here to spark discussion, get feedback, and see how others might be solving similar problems.

If you’re not so worried about losing your ability to quickly record when you’re offline, here are better and easier online-only options for you:

  1. https://www.reddit.com/r/Vikunja/comments/1ny05ny/vikunja_voice_assistant_for_home_assistant/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button;
  2. TBD.

TL;DR & Prerequisites

The High-Level Flow:

  1. Capture: Record an offline voice note on an iPhone/Apple Watch.
  2. Sync: The recording is saved locally and automatically uploaded to cloud storage when a connection is available.
  3. Process: A scheduled script on a server pulls the audio file.
  4. Transcribe: The audio is sent to a Speech-to-Text (STT) service.
  5. Structure: The raw text is sent to a Large Language Model (LLM) to be summarized, formatted, and structured as a JSON object with a title, description, and other metadata.
  6. Create: The script uses the structured JSON to create a new task in Vikunja via its API, attaching the original audio file.

Core Components & Alternatives:

  • Capture Device: iPhone / Apple Watch (can be adapted to Android);
  • Automation Apps: iOS Shortcuts, Scriptable;
  • Cloud Storage: self-hosted Nextcloud (can be replaced with hosted Nextcloud options, or any cloud storage that has an API, like Dropbox, Google Drive, etc.);
  • Sync Trigger: Home Assistant (used for a specific offline-sync mechanism via email sending; could be replaced with other automation tools like n8n);
  • Processing Server: Any self-hosted server (e.g., a VPS or home server) running a scheduled Python script (could be replaced with other automation tools like n8n);
  • Speech-to-Text (STT): OpenAI Whisper API (alternatives: self-hosted models, or other cloud services);
  • Language Model (LLM): Self-hosted Gemma 3n E4B (alternatives: OpenAI’s API (GPT models), or services like OpenRouter, which provide access to many models). An LLM is not strictly required, but it’s the key part that makes the whole thing feel like magic.

Prerequisites:
This is a DIY project. You should be comfortable with:

  • Self-hosting applications.
  • Basic scripting (the core logic is a Python script).
  • Working with APIs.

The Workflow in Detail

Step 1: Capture on your phone/watch (e.g. iOS)

I use an iOS Shortcut on my iPhone and Apple Watch to start recording instantly. On my watch, a double-clench gesture triggers it, which is incredibly convenient.

  1. When I trigger the shortcut, it starts recording audio.
  2. When I stop, the recording is saved as an .m4a file directly to the local iPhone file system. This is the crucial offline-first part.
  3. If an internet connection is present, the shortcut tries to immediately upload the file to a specific folder in my Nextcloud storage.

Step 2: Syncing the Recordings When Finally Online

What if I recorded several notes while offline? They need to be uploaded once I’m back online. Manually triggering this is a pain, so I automated it.

  1. I have a Home Assistant automation that runs on a schedule (e.g., every 10 minutes).
  2. This automation sends a specific email to an address linked to my iPhone’s native Mail app (this can be done by automation systems other than HASS, e.g. n8n);
  3. An iOS automation is configured to watch for an email with a specific subject (e.g., “Upload Vikunja Voice Recordings”);
  4. When this email is received, it automatically triggers a different iOS Shortcut. This shortcut’s only job is to scan the designated local folder for audio files and send them all to Nextcloud (done by calling a Scriptable app script written in JavaScript);
  5. After each successful upload, the recording is deleted from the iPhone’s local folder, so it doesn’t get uploaded again the next time the automation is triggered by email.

Yes, I know, email… But after all these years on iOS, this is the only way I’ve found to remotely trigger a shortcut.
It usually triggers within 10-15 seconds of the email being sent.

Step 3: Server-Side Processing

Now the audio files are sitting in Nextcloud storage. A Python script hosted on my server takes over from here.

  1. A scheduled job (cron) inside the script runs every minute (the more frequently it runs, the sooner you’ll see the task in Vikunja);
  2. The script connects to Nextcloud storage, checks the designated folder for new audio files, and downloads any it finds;
  3. It then sends each audio file to my Speech-to-Text (STT) service of choice, OpenAI Whisper. The API is fast, accurate, and cheap.
    • Pro-Tip: You can pass a list of keywords or technical terms to the Whisper API to improve its recognition accuracy for specific jargon.
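Here is a minimal Python sketch of this step, to make the moving parts concrete. The Nextcloud URL, folder path, and credential handling are placeholders; `requests` is used for the HTTP calls, and the Whisper `prompt` parameter is how the vocabulary hint from the pro-tip above is passed:

```python
import os

NEXTCLOUD_URL = "https://cloud.example.com"               # placeholder instance
NEXTCLOUD_AUTH = ("bot-user", os.environ.get("NC_APP_PASSWORD", ""))
INBOX_PATH = "remote.php/dav/files/bot-user/VoiceInbox"   # placeholder WebDAV folder

def build_whisper_prompt(keywords):
    """Whisper biases recognition toward terms it sees in the `prompt` field."""
    return "Vocabulary: " + ", ".join(keywords)

def download_recording(name):
    """Fetch one audio file from the Nextcloud WebDAV folder."""
    import requests  # third-party; imported lazily so the module loads without it
    r = requests.get(f"{NEXTCLOUD_URL}/{INBOX_PATH}/{name}", auth=NEXTCLOUD_AUTH)
    r.raise_for_status()
    return r.content

def transcribe(audio_bytes, filename, keywords=()):
    """Send the audio to OpenAI's transcription endpoint and return plain text."""
    import requests
    r = requests.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"},
        files={"file": (filename, audio_bytes, "audio/m4a")},
        data={"model": "whisper-1",
              "response_format": "text",
              "prompt": build_whisper_prompt(keywords)},
    )
    r.raise_for_status()
    return r.text.strip()
```

Listing the folder’s contents is a WebDAV PROPFIND request, omitted here for brevity.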

Step 4: Structuring the Brain Dump with an LLM

A raw transcription of my thoughts is often unstructured. In fact, this very topic was written the same way… I just talked to my watch for a whole 20-minute bus ride. And here is where the magic happens.

  1. The raw text from Whisper is passed to an LLM. I use a self-hosted Gemma 3n E4B model because it’s lightweight, fast, and keeps my data private.

  2. I use a carefully crafted prompt to ask the LLM to process the text. The prompt looks something like this:

    You are a helpful assistant that converts raw, dictated text into a structured task. Take the following text and:

    1. Create a concise and clear title for the task.
    2. Write a well-formatted description, summarizing the key points without losing any important details.
    3. Keep the original transcribed text at the very end of the description under a “### Raw Transcription” heading.

    Return your response ONLY as a single JSON object with the keys “title” and “description”. Do not include any other text or markdown formatting around the JSON.

    Here is the text:
    “[INSERT RAW TEXT FROM WHISPER HERE]”

  3. The LLM returns a clean JSON object, like {"title": "Book Weekly Barber Appointment", "description": "..."}.
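As a sketch of what this step could look like in code: here I assume the self-hosted Gemma model sits behind an OpenAI-compatible chat endpoint (as Ollama or llama.cpp’s server expose); the URL and model tag are assumptions, and the prompt is abbreviated from the one above:

```python
import json

LLM_URL = "http://localhost:11434/v1/chat/completions"  # assumed Ollama-style endpoint
LLM_MODEL = "gemma3n:e4b"                               # assumed model tag

PROMPT = (
    "You are a helpful assistant that converts raw, dictated text into a "
    "structured task. Create a concise title and a well-formatted description, "
    "keep the raw text under a '### Raw Transcription' heading, and return ONLY "
    "a single JSON object with the keys 'title' and 'description'.\n\n"
    "Here is the text:\n\"{raw_text}\""
)

def extract_json(reply):
    """Models sometimes wrap the JSON in markdown fences despite instructions;
    strip them before parsing."""
    text = reply.strip()
    if text.startswith("```"):
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)

def structure_transcript(raw_text):
    import requests  # third-party; imported lazily
    r = requests.post(LLM_URL, json={
        "model": LLM_MODEL,
        "messages": [{"role": "user", "content": PROMPT.format(raw_text=raw_text)}],
        "temperature": 0.2,
    })
    r.raise_for_status()
    return extract_json(r.json()["choices"][0]["message"]["content"])
```

The fence-stripping in `extract_json` is a pragmatic safety net: even well-prompted small models occasionally return ```json blocks.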

Advanced LLM Capabilities:
You can push this even further by asking the LLM to:

  • Extrapolate Metadata: Ask it to identify due dates, start dates, or recurring schedules (e.g., “remind me to do this every Tuesday at 4pm”) and return them as specific JSON fields.
  • Identify the Project: You could ask it to parse a project name from the text. The script would then need to look up the corresponding project ID in Vikunja before creating the task.
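If you extend the prompt with optional fields like `due_date` and `project`, the script needs a small mapping step before the Vikunja call. A sketch, under the assumptions that Vikunja’s task model takes `due_date` as an ISO 8601 timestamp and that you keep a name-to-ID lookup of your projects (`DEFAULT_PROJECT_ID` is a placeholder):

```python
DEFAULT_PROJECT_ID = 1  # placeholder: the inbox project to fall back to

def llm_fields_to_vikunja(llm_obj, project_lookup):
    """Map optional LLM-extracted metadata onto a Vikunja task payload.

    `llm_obj` is the parsed JSON from the LLM; `project_lookup` maps
    lowercased project names to Vikunja project IDs.
    """
    task = {"title": llm_obj["title"], "description": llm_obj["description"]}
    if llm_obj.get("due_date"):
        task["due_date"] = llm_obj["due_date"]  # e.g. "2025-01-07T16:00:00Z"
    project_id = project_lookup.get((llm_obj.get("project") or "").lower())
    return task, project_id or DEFAULT_PROJECT_ID
```

Falling back to a default “inbox” project keeps the pipeline from failing when the LLM hallucinates a project name that doesn’t exist.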

Step 5: Creating the Task in Vikunja

The final step is straightforward:

  1. The Python script parses the JSON response from the LLM.
  2. It constructs a new JSON payload that matches the format required by the Vikunja API’s create task endpoint.
  3. It populates the title and description fields with the data from the LLM.
  4. Crucially, it also attaches the original audio file to the task. This is invaluable. If the LLM ever misunderstands something, or if I want to recall the tone and emotion of my original thought, I can just play the recording directly from the task.
  5. The script makes the API call, and the task appears in Vikunja.

If there were multiple recordings, the script loops through them, creating a batch of tasks all at once.
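The Vikunja half of this, sketched in Python. The endpoint paths follow the Vikunja API as I understand it (task creation via `PUT /projects/{id}/tasks`, attachments via `PUT /tasks/{id}/attachments` with a multipart `files` field), but double-check them against your instance’s API docs; the instance URL and token handling are placeholders:

```python
import os

VIKUNJA_URL = "https://vikunja.example.com/api/v1"   # placeholder instance
VIKUNJA_TOKEN = os.environ.get("VIKUNJA_API_TOKEN", "")

def task_payload(llm_json):
    """Shape the LLM output into the body of the create-task call."""
    return {"title": llm_json["title"], "description": llm_json["description"]}

def create_task_with_audio(project_id, llm_json, audio_path):
    import requests  # third-party; imported lazily
    headers = {"Authorization": f"Bearer {VIKUNJA_TOKEN}"}

    # 1. Create the task (Vikunja uses PUT for creation).
    r = requests.put(f"{VIKUNJA_URL}/projects/{project_id}/tasks",
                     headers=headers, json=task_payload(llm_json))
    r.raise_for_status()
    task_id = r.json()["id"]

    # 2. Attach the original recording so it can be replayed from the task.
    with open(audio_path, "rb") as f:
        r = requests.put(f"{VIKUNJA_URL}/tasks/{task_id}/attachments",
                         headers=headers,
                         files={"files": (os.path.basename(audio_path), f, "audio/m4a")})
    r.raise_for_status()
    return task_id
```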

Let’s Discuss!

While there might be simpler integrations, I love the power and control this setup gives me. It’s fully customizable, works flawlessly offline, and the privacy can be locked down with self-hosted models.

I’m curious to hear your thoughts.

  • How do you handle quick idea capture for Vikunja?
  • Do you see any potential improvements or simplifications for this workflow?
  • Has anyone else experimented with LLMs for task management?

If anyone is interested in the specific Python code or screenshots of the iOS Shortcuts, let me know. I’d be happy to share them.

P.S. I know I’m ignoring some of you out there in other topics, and I’m sincerely sorry. I hope I’ll get back to some of it sooner or later. I just decided to spend some time sharing what might be valuable for fellow automators who love to self-host.

Wow this is pretty advanced. Well done!

I’ve actually wanted to integrate something like this directly into Vikunja for a while now, I just didn’t get around to doing it. I think it could be a killer feature when well-integrated into the core Vikunja app, especially on mobile.

Did you think about creating a Telegram/other messenger bot to capture the voice messages? That would reduce a bunch of the complexity, but add some privacy concerns. Then again, you already seem to use OpenAI’s hosted Whisper, so there is less privacy involved anyway.


Vikunja mobile app

Yep, if there was a Vikunja app for iOS (is there?), everything would be much easier. I’m not sure about the battery usage though, as everything would need to be running and waiting for output from different 3rd parties like the Speech-to-Text/LLM services inside a background process. And I’ve heard Apple is, of course, restrictive about the time windows and resources allowed for background processes. It should be possible, maybe not in one go, but across a span of different background executions. Apple also seems to allow more when the user grants the “Always Location Access” permission, so that you can run background stuff any time there is a significant location change.
With Android, obviously, everything should be easier.

Messenger as middle service

Actually no, I haven’t thought about using a messenger as a service in the middle, but that’s interesting. Really interesting, I don’t know how I haven’t thought about it before.

After reviewing the options out there, I’m intrigued by the idea of using messengers like Telegram or WhatsApp as a middle layer for capturing voice notes, given the reduced setup complexity they offer. But I’m also frustrated by their limitations - especially around offline delivery, privacy, and the lack of deep OS integration needed for a truly seamless experience.

Telegram
In Telegram you grab your phone, open the app, tap the pinned bot, and record a voice msg. That’s it, right? Basically 4 actions - not as slick as the hand gesture/action button/2 taps in my design, but on the other hand this is supposed to reduce the setup complexity a lot.
Telegram also auto-delivers the audio when you’re back online and actually open the app.
It’s kind of risky though: when audio sits undelivered for a long time because of a poor internet connection, it may actually never be delivered (requiring you to retry) - at least you’re gonna see an error the next time you open Telegram.
In conclusion, the problem here is that the time it takes for an offline voice note to end up in Vikunja equals the time you go without opening the messenger app.
So this is the way to go if you use Telegram as a daily driver, with some caveats - and the bots have really fantastic SDKs. It eliminates the need for an automation tool (like n8n/HASS) and also for cloud storage (Nextcloud).
The privacy aspect of Telegram is debatable, right - it’s not end-to-end encrypted - but it should be fine for most folks out there. I personally trust them too much sometimes xD.
What would be even more fantastic is Telegram exposing an iOS Shortcut action to either record a voice message into a given chat, or at least send a message with a voice file recorded by the iOS Shortcut itself. The benefit would be the ability to record in basically one tap. For owners of an Apple Watch Ultra or the new iPhones with action buttons, it would be a killer feature. Unfortunately, Telegram doesn’t expose actions like this, and from what I see, none of the messengers do.
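For anyone wanting to try the bot route: the receiving side really is compact. A sketch using python-telegram-bot (v20+ style API); the filename convention and `BOT_TOKEN` are placeholders, and the downloaded file would be handed to the same transcription/LLM/Vikunja pipeline described above:

```python
import datetime

def inbox_filename(user_id, message_id):
    """Deterministic local name for a received voice note (placeholder convention).
    Telegram voice notes arrive as OGG/Opus."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"voice_{user_id}_{message_id}_{stamp}.ogg"

async def handle_voice(update, context):
    """Download an incoming voice message into the local inbox folder."""
    tg_file = await update.message.voice.get_file()
    path = inbox_filename(update.effective_user.id, update.message.message_id)
    await tg_file.download_to_drive(path)
    # ...from here, hand `path` to the transcription pipeline.

def main():
    from telegram.ext import ApplicationBuilder, MessageHandler, filters  # third-party
    app = ApplicationBuilder().token("BOT_TOKEN").build()  # placeholder token
    app.add_handler(MessageHandler(filters.VOICE, handle_voice))
    app.run_polling()
```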

WhatsApp
The setup seems to require creating a business account tied to a real phone number + credit card, and you get 1000 free messages per month that you can receive as webhooks. Not sure if you can handle voice messages there - probably yes.
A bonus is that it’s one tap less than Telegram, given that WhatsApp exposes a Siri Shortcut action to directly open a specific chat conversation.
It has the same downsides as Telegram - audio won’t be sent until you open the app again, and there’s no Siri action to record and send a voice message directly.
Privacy seems to be better given the end-to-end encryption, but it’s closed source, so who knows.
The situation with Signal is similar to WhatsApp, but with no exposed iOS actions at all.

OpenAI’s hosted Whisper

Yep, I also don’t like that it’s the only obvious “privacy hole” in my design.
It’s fine for me, given that I, and many serious businesses, tend to believe their claim about no data collection/training for API users. But yes, ideally it should be replaced with a self-hosted STT service for total Nirvana.

I’ll update the topic with the info I described above, to present folks with an alternative option that avoids the automation system & cloud storage, making the setup easier.

Thank you for reviewing my writeup, Konrad. As always very much appreciated!!
