Local AI on Your Device: Hands‑On with Ollama & LM Studio
Learn the core ideas behind running AI models locally, then follow step‑by‑step instructions on Windows, macOS, or Linux to run models, structure outputs, call local APIs, and query your own documents.
Local AI fundamentals
Model size (parameters: 7B, 13B, 70B)
Parameters are tiny “knobs” a model uses to represent patterns in language. More knobs (7B → 13B → 70B) often means better reasoning but higher memory and slower responses.
- 7–8B: Great on laptops (e.g., Llama 3 8B Instruct, Qwen2.5 7B Instruct).
- 13B: Better reasoning; needs more memory (e.g., Llama 2 13B Chat).
- 70B: Very capable; usually impractical locally.
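A rough rule of thumb: the memory needed for the weights alone is parameters × bits per weight ÷ 8. A quick sketch (the ~4.5 bits/weight figure is our rough assumption for Q4‑style quantization; KV cache and runtime overhead come on top):

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough RAM for the weights alone (KV cache and overhead are extra)."""
    return params_billions * bits_per_weight / 8  # billions of params ≈ GB scale

for size in (7, 13, 70):
    print(f"{size}B at ~4.5 bits (Q4-ish): {weight_footprint_gb(size, 4.5):.1f} GB")
```

The 70B row lands near 40 GB for weights alone, which is why it's usually impractical on a laptop.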
Tokens and context window
Models read/write in tokens (small text chunks). The context window is the model’s short‑term memory—your instructions + any pasted docs + the model’s reply. If you exceed it, older parts fall off.
Example: A 4k token window fits roughly 2.5–3k words combined (prompt + answer).
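The word estimate is simple arithmetic, assuming the common rough ratio of ~0.75 English words per token:

```python
def words_that_fit(context_tokens: int, words_per_token: float = 0.75) -> int:
    """Rough total words (prompt + reply) that fit in a context window."""
    return int(context_tokens * words_per_token)

print(words_that_fit(4096))  # 3072 — matches the ~2.5-3k words estimate
```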
Quantization (Q2, Q3, Q4, Q5, Q8)
Quantization compresses a model so it fits and runs faster—like saving a high‑res photo as a smaller JPEG. You lose a bit of fidelity but gain speed and lower memory use.
- Start with Q4 for balance on laptops.
- Tight memory? Try Q3.
- More headroom and quality? Try Q5/Q8.
Flavors like Q4_K_M are just different compression strategies.
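To see what quantization buys you, here's a sketch estimating on‑disk GGUF size for an 8B model. The bits‑per‑weight values are rough approximations we've assumed (K‑quants mix block sizes, so real files differ slightly):

```python
# Approximate bits per weight for common GGUF quant levels (rule of thumb,
# not exact spec values -- real files vary by a few percent).
BITS_PER_WEIGHT = {"Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def approx_file_gb(params_billions: float, quant: str) -> float:
    return round(params_billions * BITS_PER_WEIGHT[quant] / 8, 1)

for q in BITS_PER_WEIGHT:
    print(q, approx_file_gb(8, q), "GB")
```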
Precision types (FP16, BF16, INT8, INT4)
Precision describes how detailed the model’s numbers are. FP16/BF16 are higher‑detail (common in GPU stacks). INT8/INT4 are smaller/faster (common in local GGUF files).
Model files: GGUF vs safetensors
GGUF is the format used by llama.cpp‑based tools such as Ollama and LM Studio. safetensors is the PyTorch/Transformers format and can't be loaded directly by these tools. For this guide, choose GGUF.
Base vs Instruct/Chat models
Base models are book‑smart but not very obedient. Instruct/Chat models are tuned to follow directions—use these for everyday tasks and assistants.
Fine‑tuning, instruction‑tuning, distillation, adapters (LoRA/QLoRA)
- Fine‑tuning: Teach a specialty (e.g., legal summaries).
- Instruction‑tuning: Improves following directions.
- Distillation: Big teacher → small student; fast yet capable.
- Adapters (LoRA/QLoRA): Snap‑on skill modules; QLoRA works on quantized bases.
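The adapter idea fits in a few lines: instead of retraining the full weight matrix W, LoRA trains two small matrices A and B and adds their product, scaled by alpha/r, on top. A toy pure‑Python illustration (the numbers are made up; real adapters are trained tensors):

```python
# Toy LoRA update: W is d_out x d_in; A (r x d_in) and B (d_out x r) are the
# small trainable adapter matrices. W' = W + (alpha/r) * B @ A.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def apply_lora(W, A, B, alpha=1.0, r=1):
    delta = matmul(B, A)              # rank-r update, same shape as W
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weights (2x2)
A = [[0.1, 0.2]]                      # r=1 adapter: only ~4 numbers train,
B = [[1.0], [2.0]]                    # not the full 2x2 matrix
print(apply_lora(W, A, B))            # base plus low-rank delta
```

The payoff is that A and B are tiny compared to W, so an adapter file is megabytes instead of gigabytes.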
Embeddings, vector DBs, and RAG
Embeddings turn text into vectors; a vector DB (Chroma/FAISS) finds similar chunks. RAG retrieves those chunks and feeds them to the model, grounding answers in your sources for fewer hallucinations.
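A toy end‑to‑end sketch of the retrieve‑then‑generate loop, using word counts as stand‑in "embeddings" (a real setup would use an embedding model and a vector DB such as Chroma or FAISS; the chunks and question here are invented examples):

```python
import math
from collections import Counter

def embed(text):
    """Toy word-count 'embedding' -- a real pipeline uses an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is open Monday through Friday.",
]
question = "How many days until I get a refund?"

# Retrieve: rank chunks by similarity to the question, keep the best match
best = max(chunks, key=lambda c: cosine(embed(question), embed(c)))

# Augment: feed the retrieved chunk to the model as grounding context
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
print(best)
```

Swap in real embeddings and a local model call and this is the whole RAG loop: embed, retrieve, stuff the context, generate.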
Creativity knobs: temperature, top_p, top_k
Temperature controls creativity (lower = safer, higher = more adventurous). top_p and top_k limit candidate next words. For strict formats like JSON, lower temperature helps.
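To make the knobs concrete, here is a toy next‑token sampler showing where temperature, top_k, and top_p each act (a simplified sketch of how real runtimes sample, with invented logits):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Toy next-token sampler; logits is {token: raw score}."""
    # Temperature: <1 sharpens the distribution, >1 flattens it
    scaled = {t: s / max(temperature, 1e-6) for t, s in logits.items()}
    m = max(scaled.values())
    probs = {t: math.exp(s - m) for t, s in scaled.items()}   # softmax
    z = sum(probs.values())
    ranked = sorted(((t, p / z) for t, p in probs.items()), key=lambda kv: -kv[1])
    if top_k:                       # top_k: keep only the k most likely tokens
        ranked = ranked[:top_k]
    kept, total = [], 0.0           # top_p: smallest set with cumulative prob >= p
    for t, p in ranked:
        kept.append((t, p))
        total += p
        if total >= top_p:
            break
    z = sum(p for _, p in kept)     # renormalize and draw
    r = rng.random() * z
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]

logits = {"the": 3.0, "a": 2.0, "banana": 0.1}
print(sample(logits, temperature=0.1))  # low T: almost always "the"
```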
RAM and the KV cache
Models must fit in memory, and while generating they build a KV cache that grows with prompt and output length. Longer contexts and replies use more memory.
- 8 GB RAM: 7–8B at Q4, keep context 2–4k tokens.
- 16 GB RAM: 7–13B at Q4 is comfortable.
- 32 GB RAM: 13B at Q4/Q5; larger if you accept slower speeds.
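The KV cache grows linearly with context length. A rough estimate, assuming a Llama‑3‑8B‑like geometry (32 layers, 8 KV heads of dimension 128, 2 bytes per cached value; runtimes that quantize the cache will use less):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, one entry per token."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total / 1e9

# 8B-class model at an 8k-token context: about 1 GB on top of the weights
print(round(kv_cache_gb(32, 8, 128, 8192), 2))
```

Doubling the context doubles this figure, which is why the RAM guidance above pairs smaller memory with shorter contexts.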
Local vs cloud
Local is private, works offline, and has predictable costs. Cloud offers the biggest, most capable models. A common split: prototype and handle private data locally, then use the cloud for the toughest reasoning.
Setup (Per‑OS)
Install Ollama
Windows
- Download and run the installer from https://ollama.com/download.
- Open PowerShell or Command Prompt.
- Verify the install:

```bash
ollama --version
ollama ps
```

macOS
- Install Homebrew if you don’t have it (brew.sh).
- Install and verify:

```bash
brew install ollama
ollama --version
ollama ps
```

Linux
- Install via script:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

- Verify:

```bash
ollama --version
ollama ps
```
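Once Ollama is installed and running, it also serves a local HTTP API on port 11434. A minimal standard‑library sketch (assumes the server is running and you've pulled a model, e.g. `ollama pull llama3`; the helper names here are ours):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3") -> dict:
    # stream=False asks Ollama for one complete JSON response
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "llama3",
               host: str = "http://localhost:11434") -> str:
    """POST to Ollama's /api/generate endpoint and return the reply text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# ask_ollama("Say hello in five words.")  # needs a running Ollama server
```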
Install LM Studio
Windows
Download the installer from https://lmstudio.ai and follow the prompts, then launch LM Studio from the Start Menu.
macOS
Download the app from https://lmstudio.ai, drag it to Applications, then open it.
Linux
Download the Linux build from https://lmstudio.ai and follow the distro‑specific instructions on their site.
Prerequisites & Tips
- Have 10–20 GB free disk space for multiple models.
- Close heavy apps to free RAM before loading models.
- If Python isn’t installed: Windows (Microsoft Store or python.org), macOS (Xcode tools or pyenv), Linux (distro package).