Local AI on Your Device: Hands‑On with Ollama & LM Studio
Learn the core ideas behind running AI models locally, then follow step‑by‑step instructions on Windows, macOS, or Linux to run models, structure outputs, call local APIs, and query your own documents.
Local AI fundamentals
Model size (parameters: 7B, 13B, 70B)
Parameters are tiny “knobs” a model uses to represent patterns in language. More knobs (7B → 13B → 70B) often means better reasoning but higher memory and slower responses.
- 7–8B: Great on laptops (e.g., Llama 3 8B Instruct, Qwen2.5 7B Instruct).
- 13B: Better reasoning; needs more memory (e.g., Llama 2 13B Chat).
- 70B: Very capable; usually impractical locally.
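A rough rule of thumb: the memory needed for the weights alone is parameters × bits per weight ÷ 8. A quick sketch (the ~4.5 bits/weight figure is our rough assumption for Q4‑style quantization; KV cache and runtime overhead come on top):

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough RAM for the weights alone (KV cache and overhead are extra)."""
    return params_billions * bits_per_weight / 8  # billions of params ≈ GB scale

for size in (7, 13, 70):
    print(f"{size}B at ~4.5 bits (Q4-ish): {weight_footprint_gb(size, 4.5):.1f} GB")
```

The 70B row lands near 40 GB for weights alone, which is why it's usually impractical on a laptop.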
Tokens and context window
Models read/write in tokens (small text chunks). The context window is the model’s short‑term memory—your instructions + any pasted docs + the model’s reply. If you exceed it, older parts fall off.
Example: A 4k token window fits roughly 2.5–3k words combined (prompt + answer).
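The word estimate is simple arithmetic, assuming the common rough ratio of ~0.75 English words per token:

```python
def words_that_fit(context_tokens: int, words_per_token: float = 0.75) -> int:
    """Rough total words (prompt + reply) that fit in a context window."""
    return int(context_tokens * words_per_token)

print(words_that_fit(4096))  # 3072 — matches the ~2.5-3k words estimate
```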
Quantization (Q2, Q3, Q4, Q5, Q8)
Quantization compresses a model so it fits and runs faster—like saving a high‑res photo as a smaller JPEG. You lose a bit of fidelity but gain speed and lower memory use.
- Start with Q4 for balance on laptops.
- Tight memory? Try Q3.
- More headroom and quality? Try Q5/Q8.
Flavors like Q4_K_M are just different compression strategies.
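To see what quantization buys you, here's a sketch estimating on‑disk GGUF size for an 8B model. The bits‑per‑weight values are rough approximations we've assumed (K‑quants mix block sizes, so real files differ slightly):

```python
# Approximate bits per weight for common GGUF quant levels (rule of thumb,
# not exact spec values -- real files vary by a few percent).
BITS_PER_WEIGHT = {"Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def approx_file_gb(params_billions: float, quant: str) -> float:
    return round(params_billions * BITS_PER_WEIGHT[quant] / 8, 1)

for q in BITS_PER_WEIGHT:
    print(q, approx_file_gb(8, q), "GB")
```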
Precision types (FP16, BF16, INT8, INT4)
Precision describes how detailed the model’s numbers are. FP16/BF16 are higher‑detail (common in GPU stacks). INT8/INT4 are smaller/faster (common in local GGUF files).
Model files: GGUF vs safetensors
GGUF is the format used by llama.cpp‑based tools such as Ollama and LM Studio. safetensors is the PyTorch/Transformers format and can't be loaded directly by these tools. For this guide, choose GGUF.
Base vs Instruct/Chat models
Base models are book‑smart but not very obedient. Instruct/Chat models are tuned to follow directions—use these for everyday tasks and assistants.
Fine‑tuning, instruction‑tuning, distillation, adapters (LoRA/QLoRA)
- Fine‑tuning: Teach a specialty (e.g., legal summaries).
- Instruction‑tuning: Improves following directions.
- Distillation: Big teacher → small student; fast yet capable.
- Adapters (LoRA/QLoRA): Snap‑on skill modules; QLoRA works on quantized bases.
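The adapter idea fits in a few lines: instead of retraining the full weight matrix W, LoRA trains two small matrices A and B and adds their product, scaled by alpha/r, on top. A toy pure‑Python illustration (the numbers are made up; real adapters are trained tensors):

```python
# Toy LoRA update: W is d_out x d_in; A (r x d_in) and B (d_out x r) are the
# small trainable adapter matrices. W' = W + (alpha/r) * B @ A.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def apply_lora(W, A, B, alpha=1.0, r=1):
    delta = matmul(B, A)              # rank-r update, same shape as W
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weights (2x2)
A = [[0.1, 0.2]]                      # r=1 adapter: only ~4 numbers train,
B = [[1.0], [2.0]]                    # not the full 2x2 matrix
print(apply_lora(W, A, B))            # base plus low-rank delta
```

The payoff is that A and B are tiny compared to W, so an adapter file is megabytes instead of gigabytes.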
Embeddings, vector DBs, and RAG
Embeddings turn text into vectors; a vector DB (Chroma/FAISS) finds similar chunks. RAG retrieves those chunks and feeds them to the model, grounding answers in your sources for fewer hallucinations.
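A toy end‑to‑end sketch of the retrieve‑then‑generate loop, using word counts as stand‑in "embeddings" (a real setup would use an embedding model and a vector DB such as Chroma or FAISS; the chunks and question here are invented examples):

```python
import math
from collections import Counter

def embed(text):
    """Toy word-count 'embedding' -- a real pipeline uses an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is open Monday through Friday.",
]
question = "How many days until I get a refund?"

# Retrieve: rank chunks by similarity to the question, keep the best match
best = max(chunks, key=lambda c: cosine(embed(question), embed(c)))

# Augment: feed the retrieved chunk to the model as grounding context
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
print(best)
```

Swap in real embeddings and a local model call and this is the whole RAG loop: embed, retrieve, stuff the context, generate.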
Creativity knobs: temperature, top_p, top_k
Temperature controls creativity (lower = safer, higher = more adventurous). top_p and top_k limit candidate next words. For strict formats like JSON, lower temperature helps.
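To make the knobs concrete, here is a toy next‑token sampler showing where temperature, top_k, and top_p each act (a simplified sketch of how real runtimes sample, with invented logits):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Toy next-token sampler; logits is {token: raw score}."""
    # Temperature: <1 sharpens the distribution, >1 flattens it
    scaled = {t: s / max(temperature, 1e-6) for t, s in logits.items()}
    m = max(scaled.values())
    probs = {t: math.exp(s - m) for t, s in scaled.items()}   # softmax
    z = sum(probs.values())
    ranked = sorted(((t, p / z) for t, p in probs.items()), key=lambda kv: -kv[1])
    if top_k:                       # top_k: keep only the k most likely tokens
        ranked = ranked[:top_k]
    kept, total = [], 0.0           # top_p: smallest set with cumulative prob >= p
    for t, p in ranked:
        kept.append((t, p))
        total += p
        if total >= top_p:
            break
    z = sum(p for _, p in kept)     # renormalize and draw
    r = rng.random() * z
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]

logits = {"the": 3.0, "a": 2.0, "banana": 0.1}
print(sample(logits, temperature=0.1))  # low T: almost always "the"
```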
RAM and the KV cache
Models must fit in memory, and while generating they build a KV cache that grows with prompt and output length. Longer contexts and replies use more memory.
- 8 GB RAM: 7–8B at Q4, keep context 2–4k tokens.
- 16 GB RAM: 7–13B at Q4 is comfortable.
- 32 GB RAM: 13B at Q4/Q5; larger if you accept slower speeds.
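The KV cache grows linearly with context length. A rough estimate, assuming a Llama‑3‑8B‑like geometry (32 layers, 8 KV heads of dimension 128, 2 bytes per cached value; runtimes that quantize the cache will use less):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, one entry per token."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total / 1e9

# 8B-class model at an 8k-token context: about 1 GB on top of the weights
print(round(kv_cache_gb(32, 8, 128, 8192), 2))
```

Doubling the context doubles this figure, which is why the RAM guidance above pairs smaller memory with shorter contexts.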
Local vs cloud
Local is private, works offline, and has predictable costs. Cloud offers the biggest, most capable models. A common split: prototype and handle private data locally, then use the cloud for the toughest reasoning.
Setup (Per‑OS)
Install Ollama
Windows
- Download and run the installer from https://ollama.com/download.
- Open PowerShell or Command Prompt.
- Verify the install:

```bash
ollama --version
ollama ps
```

macOS
- Install Homebrew if you don’t have it (brew.sh).
- Install and verify:

```bash
brew install ollama
ollama --version
ollama ps
```

Linux
- Install via script:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

- Verify:

```bash
ollama --version
ollama ps
```
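Once Ollama is installed and running, it also serves a local HTTP API on port 11434. A minimal standard‑library sketch (assumes the server is running and you've pulled a model, e.g. `ollama pull llama3`; the helper names here are ours):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3") -> dict:
    # stream=False asks Ollama for one complete JSON response
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3",
               host: str = "http://localhost:11434") -> str:
    """POST to Ollama's /api/generate endpoint and return the reply text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# ask_ollama("Say hello in five words.")  # needs a running Ollama server
```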
Install LM Studio
Windows
Download the installer from https://lmstudio.ai and follow the prompts, then launch LM Studio from the Start Menu.
macOS
Download the app from https://lmstudio.ai, drag it to Applications, then open it.
Linux
Download the Linux build from https://lmstudio.ai and follow the distro‑specific instructions on their site.
Prerequisites & Tips
- Have 10–20 GB free disk space for multiple models.
- Close heavy apps to free RAM before loading models.
- If Python isn’t installed: Windows (Microsoft Store or python.org), macOS (Xcode tools or pyenv), Linux (distro package).