Ollama & Gemma 4 26B on Mac Mini: Auto-Start & Preload
This article details setting up Ollama with Gemma 4 26B on an Apple Silicon Mac mini for an always-ready local LLM environment. It covers installation, model pulling, and advanced configurations like auto-starting Ollama, preloading the model using `launchd` agents, and keeping models loaded indefinitely with `OLLAMA_KEEP_ALIVE` to leverage fast inference on Apple Silicon. Practical takeaways emphasize the benefits for developer workflows and memory management considerations.

Running large language models (LLMs) locally on Apple Silicon Macs offers fast inference and offline capabilities. However, getting a specific model like Google's Gemma 4 26B ready for immediate use, with auto-start, preloading, and persistent memory residence, takes some deliberate setup. This guide, current as of April 2026, walks fellow developers through configuring Ollama to host Gemma 4 26B on a Mac mini with Apple Silicon, so that the model is always warm and ready.
The Challenge: Always-On Local LLMs
Having a powerful LLM like Gemma 4 26B available instantly is convenient for coding agents, rapid prototyping, or local data analysis. The main hurdles are making sure the model starts with your system, preloads into memory so the first request is not slow, and stays loaded rather than being unloaded after a period of inactivity. Ollama, combined with macOS's launchd system, addresses all three, building on Apple Silicon's performance and Ollama's MLX backend.
Prerequisites for Peak Performance
Before diving in, ensure your Mac mini meets these requirements:
- Hardware: An Apple Silicon Mac mini (M1, M2, M3, M4, or M5 chip) is essential for the performance benefits.
- Memory: At least 24GB of unified memory is critical for Gemma 4 26B, which alone consumes roughly 20GB when loaded.
- Software: macOS with Homebrew installed for streamlined package management.
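The hardware requirements above can be checked from a terminal before installing anything. The following is a minimal sketch, assuming the standard macOS `uname` and `sysctl -n hw.memsize` commands; the helper names `bytes_to_gib` and `has_enough_memory` are hypothetical, and the 24 GB default simply mirrors the figure recommended in this guide.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of Ollama or macOS).

# Convert a byte count to whole GiB.
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Succeed when the byte count in $1 is at least $2 GiB (default: 24).
has_enough_memory() {
  required_gib="${2:-24}"
  [ "$(bytes_to_gib "$1")" -ge "$required_gib" ]
}

# Run the live checks only on macOS, where the hw.memsize key exists.
if [ "$(uname -s)" = "Darwin" ]; then
  if [ "$(uname -m)" = "arm64" ]; then
    echo "Apple Silicon: yes"
  else
    echo "Apple Silicon: no (or this shell runs under Rosetta)"
  fi
  mem_bytes="$(sysctl -n hw.memsize)"
  if has_enough_memory "$mem_bytes"; then
    echo "Memory: $(bytes_to_gib "$mem_bytes") GiB (meets the 24 GB recommended here)"
  else
    echo "Memory: $(bytes_to_gib "$mem_bytes") GiB (below the 24 GB recommended here)"
  fi
fi
```

Passing a second argument to `has_enough_memory` changes the threshold, which may be handy when checking for a smaller model.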
Step-by-Step Setup
1. Install Ollama
Ollama is the gateway to running LLMs locally. Use Homebrew cask to install the Ollama macOS application, which includes automatic updates and the MLX backend optimized for Apple Silicon.
shell brew install --cask ollama-app
This command places Ollama.app in your /Applications/ folder and the ollama command-line interface (CLI) at /opt/homebrew/bin/ollama.
2. Start the Ollama Server
Once installed, launch the Ollama application. You'll see its icon appear in your menu bar.
shell open -a Ollama
Allow a few moments for the server to fully initialize. Verify its status by listing available models (initially none):
shell ollama list
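Because the server needs a few moments to initialize, a script that runs immediately after launching the app may reach it too early. One possible approach, sketched below, is to poll the local HTTP endpoint until it answers. The helper name `wait_for_ollama` and the 30-second default timeout are assumptions made for illustration; the address is Ollama's default local port.

```shell
#!/bin/sh
# wait_for_ollama is a hypothetical helper name, not an Ollama command.
# It polls BASE_URL once per second until it answers or TIMEOUT seconds pass.
wait_for_ollama() {
  base_url="${1:-http://localhost:11434}"
  timeout="${2:-30}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s hides progress output, -f makes HTTP errors return non-zero,
    # --max-time keeps a single probe from hanging.
    if curl -sf --max-time 2 "$base_url/" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

A typical use would be `open -a Ollama && wait_for_ollama && ollama list`, which only lists models once the server responds.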
3. Pull Gemma 4 26B
Now, download the Gemma 4 26B model from Ollama's model library. This is a substantial download, approximately 17GB.
shell ollama pull gemma4:26b
Confirm the download by listing models again:
shell ollama list
NAME ID SIZE MODIFIED
gemma4:26b 5571076f3d70 17 GB ...
4. Test Model Functionality and Acceleration
Engage with the model to ensure it's working correctly and leveraging your Mac's GPU acceleration.
shell ollama run gemma4:26b "Hello, what model are you?"
Check ollama ps to observe the CPU/GPU utilization split, confirming that GPU acceleration (e.g., 86% GPU) is active, a key benefit of Apple Silicon's MLX integration.
shell ollama ps
# Should show CPU/GPU split, e.g. 14%/86% CPU/GPU
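If you want to check the split from a script rather than by eye, the PROCESSOR value can be parsed. This is a small sketch under the assumption that the value looks like either "14%/86% CPU/GPU" (as in the example above) or "100% GPU"; `gpu_percent` is a hypothetical helper name, and any other format is reported as 0.

```shell
#!/bin/sh
# gpu_percent is a hypothetical helper: it extracts the GPU share from a
# PROCESSOR value such as "14%/86% CPU/GPU" or "100% GPU".
gpu_percent() {
  case "$1" in
    *CPU/GPU) echo "$1" | sed -E 's|^[0-9]+%/([0-9]+)% CPU/GPU$|\1|' ;;
    *GPU)     echo "$1" | sed -E 's|^([0-9]+)% GPU$|\1|' ;;
    *)        echo 0 ;;
  esac
}

gpu_percent "14%/86% CPU/GPU"   # prints 86
```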
Advanced Configuration: Auto-Start and Keep-Alive
To truly integrate Gemma 4 26B into your developer workflow, configure it to launch automatically and remain loaded.
5a. Auto-Launch Ollama App on Login
Ensure the Ollama application itself starts when you log in. This can be configured directly from the Ollama menu bar icon by selecting Launch at Login, or via System Settings > General > Login Items.
5b. Auto-Preload Gemma 4 on Startup with launchd
To preload Gemma 4 26B into memory immediately after Ollama starts and keep it responsive, create a launchd agent. This agent will send an empty prompt to the model every five minutes, effectively keeping it warm.
Create a .plist file for the launch agent:
shell cat << 'EOF' > ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.preload-gemma4</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>run</string>
        <string>gemma4:26b</string>
        <string></string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>StartInterval</key>
    <integer>300</integer>
    <key>StandardOutPath</key>
    <string>/tmp/ollama-preload.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama-preload.log</string>
</dict>
</plist>
EOF
Load the agent to activate it:
shell launchctl load ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
This configuration ensures that ollama run gemma4:26b is executed on login (RunAtLoad) and subsequently every 300 seconds (StartInterval), sending an empty prompt (<string></string>) to keep the model loaded and prevent it from being offloaded due to inactivity.
5c. Keep Models Loaded Indefinitely
By default, Ollama unloads models after five minutes of inactivity to free up memory. To override this and keep Gemma 4 26B loaded indefinitely, set the OLLAMA_KEEP_ALIVE environment variable to -1.
shell launchctl setenv OLLAMA_KEEP_ALIVE "-1"
For this change to take effect, you'll need to restart Ollama. Note that a value set with launchctl setenv does not survive a reboot, and the menu bar app does not read your shell configuration: adding export OLLAMA_KEEP_ALIVE="-1" to ~/.zshrc only affects an Ollama server started from a terminal (for example with ollama serve). To persist the setting for the app, a dedicated launch agent can re-run the launchctl setenv command at each login.
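A launch agent that reapplies the variable at login could look like the sketch below. It follows the same pattern as the preload agent from step 5b. The label `com.ollama.keepalive-env` and the function name `write_keepalive_agent` are hypothetical choices made for this example, not something Ollama ships.

```shell
#!/bin/sh
# Sketch only: the label "com.ollama.keepalive-env" and the function name are
# hypothetical. The function writes the plist into the given directory
# (default: ~/Library/LaunchAgents) and prints the resulting path.
write_keepalive_agent() {
  agent_dir="${1:-$HOME/Library/LaunchAgents}"
  plist="$agent_dir/com.ollama.keepalive-env.plist"
  mkdir -p "$agent_dir"
  cat > "$plist" << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.ollama.keepalive-env</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/launchctl</string>
    <string>setenv</string>
    <string>OLLAMA_KEEP_ALIVE</string>
    <string>-1</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
  echo "$plist"
}
```

Calling `write_keepalive_agent` with no argument writes into ~/Library/LaunchAgents and prints the path, which can then be passed to `launchctl load` as in step 5b. The order in which login items and launch agents start is not guaranteed, so if Ollama happens to start before the variable is set, restarting the app once after login should pick it up.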
Verify Your Setup
After configuring, perform a final check to ensure everything is working as expected:
shell
# Check Ollama server is running
ollama list
# Check model is loaded in memory
ollama ps
# Check launch agent is registered
launchctl list | grep ollama
Your ollama ps output should resemble this, indicating Gemma 4 26B is loaded and persistent:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 20 GB 14%/86% CPU/GPU 4096 Forever
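The same check can be automated, for example in a health-check script. The sketch below assumes the column layout shown above, with the model name in the first column and Forever as the last field of the row; `model_loaded_forever` is a hypothetical helper name.

```shell
#!/bin/sh
# model_loaded_forever is a hypothetical helper. It reads `ollama ps` output
# on stdin and succeeds only if the row whose first column equals MODEL has
# "Forever" as its last field (the UNTIL column).
model_loaded_forever() {
  model="$1"
  awk -v m="$model" '$1 == m && $NF == "Forever" { found = 1 } END { exit !found }'
}
```

For instance, `ollama ps | model_loaded_forever gemma4:26b && echo "resident"` prints a message only when the model is loaded with no expiry.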
API Access for Integration
Ollama exposes a convenient local API on http://localhost:11434. This allows easy integration with coding agents, custom applications, or any tool that can make HTTP requests. The API is OpenAI-compatible for chat completions:
shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
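The local API can also be used to preload the model without the CLI. Ollama's native `/api/generate` endpoint loads a model when it receives a request with no prompt, and its `keep_alive` field sets how long the model stays resident for that request (-1 meaning indefinitely). The sketch below only builds and prints the request so it is safe to run anywhere; `preload_payload` is a hypothetical helper name.

```shell
#!/bin/sh
# preload_payload is a hypothetical helper that builds the JSON body for
# POST /api/generate. With no "prompt" field the server only loads the model;
# "keep_alive": -1 asks it to keep the model loaded indefinitely.
preload_payload() {
  printf '{"model": "%s", "keep_alive": %s}' "$1" "${2:--1}"
}

# Print the request instead of sending it.
echo "curl http://localhost:11434/api/generate -d '$(preload_payload gemma4:26b)'"
```

Running the printed command against a live server should load the model much like the empty-prompt `ollama run` used by the preload agent, while setting the keep-alive per request instead of globally.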
Understanding Ollama's Recent Enhancements (v0.19+)
As of March 2026 (Ollama v0.19+), several key improvements enhance the local LLM experience:
- MLX Backend on Apple Silicon: Ollama automatically leverages Apple's MLX framework, providing substantial speedups without extra configuration. M5-series chips gain further acceleration from GPU Neural Accelerators, while M4 and earlier M-series chips still benefit from general MLX optimizations.
- NVFP4 Support (NVIDIA): Although this guide focuses on Apple Silicon, Ollama now also supports NVIDIA's NVFP4 format. NVFP4 reduces the memory bandwidth and storage needed for inference, gives results that are more consistent with production environments, and lets Ollama run models optimized by NVIDIA's tools.
- Improved Caching for Agentic Tasks: Significant advancements in caching mean lower memory utilization and more cache hits across conversations, especially beneficial for branching prompts and coding agents. Intelligent checkpoints and smarter eviction policies ensure shared prefixes persist longer, speeding up iterative tasks.
Practical Takeaways
By implementing this setup, you gain an always-ready, high-performance local LLM on your Mac mini. This can dramatically accelerate development workflows, enable offline AI capabilities, and reduce reliance on external API calls, keeping your data local and your costs down. Be mindful of the significant memory footprint of Gemma 4 26B (~20GB) and manage other memory-intensive applications accordingly on a 24GB system.
FAQ
Q: Why use a launchd agent for preloading if OLLAMA_KEEP_ALIVE="-1" already keeps the model loaded indefinitely?
A: The launchd agent primarily ensures the model is initially loaded right after login (RunAtLoad) and provides a continuous "warm-up" signal every 5 minutes (StartInterval). While OLLAMA_KEEP_ALIVE="-1" prevents automatic unloading due to inactivity, the launchd agent guarantees the model is brought into memory promptly at startup and remains actively engaged, even if OLLAMA_KEEP_ALIVE wasn't set or temporarily overridden.
Q: What's the true impact of OLLAMA_KEEP_ALIVE="-1" on my Mac mini's resources?
A: Setting OLLAMA_KEEP_ALIVE="-1" means the Gemma 4 26B model (or any other model you've run) will remain resident in your unified memory indefinitely. For Gemma 4 26B, this translates to approximately 20GB of memory being continuously occupied. While it offers instant response times, it significantly reduces the available memory for other applications. On a 24GB Mac mini, this leaves only about 4GB for macOS and other processes, so it's a trade-off between convenience and system memory overhead.
Q: Is 24GB unified memory strictly necessary for Gemma 4 26B, or can I get by with less?
A: For stable performance with Gemma 4 26B, 24GB of unified memory is strongly recommended, as the model itself consumes around 20GB when loaded. Ollama can attempt to run models that exceed physical memory by relying on swap space, but this significantly degrades performance and can make the whole system sluggish. To avoid constant swapping and keep inference smooth, the machine needs at least the unified memory specified above.





