I. Installation

Yacana and Ollama installation

Installing Ollama


Yacana was initially designed to work with Ollama, but it now also supports any OpenAI-compatible endpoint, such as ChatGPT or vLLM.
This documentation serves as a comprehensive tutorial, guiding you through Yacana's features step by step. Since the programming API is identical across all inference servers, we'll primarily demonstrate examples using the Ollama agent. You can easily adapt any code snippet to your preferred inference server by simply swapping the agent type. Let's start by installing Ollama on your computer; it's one of the simplest inference servers to set up.

You can get the latest release from the Ollama website.


Ollama is:
  • Compatible with all operating systems: Windows, Mac and Linux ;
  • Installed in seconds with a single command ;
  • Shipped with a great CLI that even a 4-year-old can use to download models ;
  • Covered by tons of tutorials if you run into any trouble.

You can also connect Yacana to a remote Ollama instance, as sketched just below; more on that later in the tutorial.
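
As a quick preview only (agent creation is covered in detail in the next chapters), pointing Yacana at a remote Ollama server might look like the sketch below. The endpoint keyword argument and the host address shown here are assumptions used purely for illustration.

    from yacana import OllamaAgent, Task

    # Hypothetical remote setup: the endpoint keyword and the address are
    # assumptions; see the agent chapters for the exact constructor arguments.
    remote_agent = OllamaAgent(
        "AI assistant",
        "llama3.1:8b",
        endpoint="http://192.168.1.50:11434",  # Ollama listens on port 11434 by default
    )

    # Tasks are used the same way whether the Ollama server is local or remote.
    print(Task("Say hello in one short sentence.", remote_agent).solve().content)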


Choosing an LLM model

Choosing a model that fits

After Ollama is installed you can browse the list of available LLMs on the Ollama website and download any model you want (or rather, any model your computer can handle).
For reference, if you don't know which LLM to choose (we've all been there), here is a list of models you can try out on consumer hardware:


| Computer power | LLM models to try | LLM quality |
|---|---|---|
| Out of this world (RTX 4090 / 64 GB RAM) | llama3.3:70b, gemma3:27b, deepseek-r1:32b, mixtral:8x22b | Excellent reasoning and instruction following. |
| Epic (RTX 4090 / 32 GB RAM) | llama3.1:8b, gemma3:27b, dolphin-mixtral:8x7b, dolphin-mixtral:8x7b-v2.5-q6_K | Good reasoning and instruction following. (The q6_K model is lighter than the default Mixtral if you run into issues.) |
| Gamer (GTX 1080 Ti / 16 GB RAM) | llama3.1:8b, mistral:7b | Llama still works but is slower. Expect limited reasoning and no more than 2 complex instructions at a time. |
| Potato | phi:2.7b, phi3:3.8b, tinyllama:1.1b | Almost no reasoning, incapable of following more than 1 instruction at a time, English only. Dumb as a stone. |

If you are still unsure which LLM to pick, remember that loading one won't destroy your computer.
If the model is too large, the inference server will simply abort or crash, and you can then delete the LLM and install a smaller one.
To help you choose an LLM that fits your hardware, you can use this online calculator.


Understanding model names


▶️ Parameter size

Large Language Models come in different parameter sizes. For instance, Llama 3.1 has 3 different versions: "8B", "70B" and "405B".
"8B" means 8 billion parameters: the higher the count, the smarter the model, but also the more RAM it consumes.
Most consumer hardware can load "8B" models, but not much beyond that.

▶️ Quantization Overview

Next, let's talk about quantization. Every model also exists in quantized versions.
For example, the model llama3.1:8b-instruct-q4_K_S is quantized to 4 bits; that's what the -q4 part means.

The lower the number after the q, the more heavily quantized the model is, and the less RAM it uses on your machine.
An unquantized model uses full 16-bit precision, indicated by -fp16 in its name.
You can load a lighter version such as -q8, which has half the precision of fp16 and uses half the memory, but may also perform slightly worse.

The great thing about quantization is that the drop in performance isn’t directly proportional to the reduction in size.
This means it's often better to run a larger model in a quantized form than a smaller model in full precision.

For instance, llama3.1:70b-instruct-q4_0 (LLaMA 3.1 with 70 billion parameters, fine-tuned for instruction following and quantized to 4 bits) will generally outperform llama3.1:8b-instruct-fp16 (LLaMA 3.1 with only 8 billion parameters in full precision), while using roughly 40 GB of RAM instead of the roughly 140 GB its own full-precision version would require.

In short: it's usually better to use a larger, quantized model than a smaller, full-precision one, if it fits in your system’s RAM.

Note: On Ollama, the latest tag usually points to the 8B q4 model. You can confirm this by checking the hash under the model name, both latest and 8b-q4 will share the same hash.
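
If you want a rough idea of the numbers behind these comparisons, the back-of-the-envelope estimate below (plain Python, no external libraries) simply multiplies the parameter count by the bits per weight. Real models need somewhat more memory than this for overhead such as the context window, so treat it as an approximation only.

    def estimated_memory_gb(parameters: float, bits_per_weight: float) -> float:
        """Rough weight-memory estimate: parameters * bits per weight, in gigabytes."""
        return parameters * bits_per_weight / 8 / 1e9

    # Full precision (fp16) uses 16 bits per weight, q8 uses 8, q4 roughly 4.
    print(estimated_memory_gb(8e9, 16))   # llama3.1:8b-instruct-fp16  -> ~16 GB
    print(estimated_memory_gb(70e9, 4))   # llama3.1:70b-instruct-q4_0 -> ~35 GB (+ overhead)
    print(estimated_memory_gb(70e9, 16))  # full-precision 70B         -> ~140 GB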

Computer RAM vs. graphics card VRAM


▶️ A PC has its own native RAM: the classic system memory you're already familiar with.
But if you have a dedicated graphics card, it also comes with its own type of memory, called VRAM (Video RAM).
VRAM is faster than regular RAM because the GPU can access it directly, without going through the operating system's kernel.

A program that requires large amounts of memory and can use VRAM will typically run faster. This is why inference servers like Ollama try to load the LLM model directly into VRAM.

▶️ What happens if the model is too large to fit in VRAM?
That’s where shared GPU memory comes in!
This is a portion of your system’s regular RAM that the graphics card can use as if it were its own. From Ollama’s perspective, it will look like your GPU has more VRAM than it actually does.
However, since this memory still goes through the kernel, it's significantly slower than true VRAM.
As a result, using shared GPU memory can cause performance drops, and it’s generally best to stay within the limits of your available VRAM whenever possible.

▶️ On Macs, the memory is unified, meaning the CPU and GPU share the same pool of RAM. This architecture allows both the graphics processor and the main processor to access memory quickly and seamlessly, so there's no distinction between RAM and VRAM ; it's always fast.

Running the model

When you have chosen your model, it's time to use the Ollama CLI to pull it onto your computer.

  • To download the model, run ollama pull <model_name> ;
  • Then list the installed models with ollama list ;
  • When ready, test the model locally with ollama run <model_name>, which starts an interactive conversation with the LLM.
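
If you'd rather check from Python that the Ollama server is running and that your model is installed, you can query Ollama's REST API, which listens on port 11434 by default. The snippet below only uses the standard library; the model name is just an example, replace it with the one you pulled.

    import json
    import urllib.request

    OLLAMA_URL = "http://localhost:11434"  # point this to your remote instance if needed
    MODEL = "llama3.1:8b"                  # example model name

    # /api/tags is Ollama's REST route listing the locally installed models.
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as response:
        installed = [model["name"] for model in json.load(response)["models"]]

    print("Installed models:", installed)
    if MODEL in installed:
        print(f"{MODEL} is ready to use.")
    else:
        print(f"{MODEL} not found. Run: ollama pull {MODEL}")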

Installing Yacana


    pip install yacana
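
To confirm the installation worked, you can print the installed package version from Python using only the standard library (nothing Yacana-specific is assumed here beyond the package name):

    from importlib.metadata import version

    # Prints the installed Yacana version; raises PackageNotFoundError if the
    # pip install did not succeed.
    print(version("yacana"))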
					

Imports

When using other frameworks, 'import hell' quickly appears. To prevent this bothersome problem, we suggest always importing all of Yacana's modules and, once you're done developing, letting the IDE remove the unused ones (unused imports generally appear grayed out). In other words: prepend the imports below to all your files and clean them up later. This way the IDE will have auto-completion available and will help you develop 10 times faster.


    from yacana import OllamaAgent, OpenAiAgent, GenericAgent, Task, Tool, ToolType, Message, GenericMessage, MessageRole, GroupSolve, EndChat, EndChatMode, OllamaModelSettings, OpenAiModelSettings, LoggerManager, ToolError, MaxToolErrorIter

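To give you a first taste of how these classes fit together (agents and tasks are covered in detail in the next chapters), a minimal script might look like the sketch below. The constructor arguments shown here, in particular the api_token keyword of the OpenAI agent and the model names, are assumptions at this stage and are explained later.

    from yacana import OllamaAgent, OpenAiAgent, Task

    # Local agent backed by the Ollama model pulled earlier.
    agent = OllamaAgent("AI assistant", "llama3.1:8b")

    # Swapping inference servers only means swapping the agent type; the rest of
    # the code stays the same. (The api_token keyword is an assumption here.)
    # agent = OpenAiAgent("AI assistant", "gpt-4o-mini", api_token="sk-...")

    # A Task binds a prompt to an agent; solve() runs the LLM call and returns the answer.
    message = Task("Say hello in one short sentence.", agent).solve()
    print(message.content)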