
The YouTube video for this section is still under creation. Please be patient ^^
Yacana was initially designed to work only with Ollama. However, many projects require mixing private LLM providers with local open-source models to build a production-grade product. Private LLM providers like OpenAI or Anthropic are great for production because of their quality, but they cost a lot of money. On the other hand, local open-source models are much cheaper to run, but their quality is not always there. Hence, having the ability to mix both is a great asset.
The strength of Yacana is that it provides you with the same programming API whether you use Ollama or an OpenAI-compatible endpoint.
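As a quick sketch of that shared API (the local model name and the API token below are placeholders; this assumes you have already pulled a model such as llama3.1:8b in Ollama), the exact same Task can be solved by either backend, the only change being the agent class:
from yacana import OllamaAgent, OpenAiAgent, Task

# Same system prompt and same Task; only the agent class differs.
# "llama3.1:8b" and the api_token are placeholders: use a model you have pulled locally
# and your own OpenAI key.
local_agent = OllamaAgent("AI assistant", "llama3.1:8b", system_prompt="You are a helpful AI assistant")
cloud_agent = OpenAiAgent("AI assistant", "gpt-4o-mini", system_prompt="You are a helpful AI assistant", api_token="sk-proj-XXXXXXXXXXXXXXX")

for agent in (local_agent, cloud_agent):
    print(Task("What is the capital of France?", agent).solve().content)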
To be fair, there is one important difference between the OllamaAgent and the OpenAiAgent: the way tools are called.
For Ollama, tools are called using an "enhanced tool calling" system where Yacana iterates over the tools and calls the appropriate ones with its own internal method. This system was made specifically for local LLMs to achieve higher call success rates.
For OpenAI, tools are called following the OpenAI standard. So, when using ChatGPT you won't have any trouble calling tools, as ChatGPT is tailored for this. However, when using other inference servers like VLLM you will have a lower success rate at calling tools. This is a bit unfortunate and will be addressed in another update. Our aim is to make both Agents capable of using both tool calling systems.
Stay tuned for future updates!
Using ChatGPT requires the OpenAiAgent. It has the same constructor and functionalities as the OllamaAgent. It just has one more parameter: the api_token, required to authenticate to OpenAI's servers.
To connect Yacana to ChatGPT, simply use the OpenAiAgent like this:
from yacana import OpenAiAgent, Task, GenericMessage
openai_agent = OpenAiAgent("AI assistant", "gpt-4o-mini", system_prompt="You are a helpful AI assistant", api_token="sk-proj-XXXXXXXXXXXXXXX")
# Use the agent to solve a task
message: GenericMessage = Task("What is the capital of France?", openai_agent).solve()
print(message.content)
from yacana import OpenAiAgent, Task, ToolError, Tool

# Defining a fake temperature tool
def get_temperature(city: str):
    if type(city) is not str:
        raise ToolError("City name must be a string.")
    return f"The temperature in {city} is 18 degrees Celsius."

get_temperature_tool = Tool("get_temperature", "Takes a city name and returns the temperature in this city.", get_temperature)
openai_agent = OpenAiAgent("AI assistant", "gpt-4o-mini", system_prompt="You are a helpful AI assistant", api_token="sk-proj-XXXXXXXXXXXXXXXXX")
Task("What's the temperature in Paris?", openai_agent, tools=[get_temperature_tool]).solve()
INFO: [PROMPT][To: AI assistant]: What's the temperature in Paris?
INFO: [AI_RESPONSE][From: AI assistant]: [{"id": "call_aJvyCX0wamCaORoNfQcyE5O2", "type": "function", "function": {"name": "get_temperature", "arguments": "{\"city\": \"Paris\"}"}}]
INFO: [TOOL_RESPONSE][get_temperature]: The temperature in Paris is 18 degrees Celsius.
INFO: [PROMPT][To: AI assistant]: Retrying with original task and tools answer: 'What's the temperature in Paris?'
INFO: [AI_RESPONSE][From: AI assistant]: The temperature in Paris is currently 18 degrees Celsius.
Let's use the OpenAiModelSettings class to configure the LLM to show the logprobs.
Logprobs are the log probabilities of the candidate next tokens. They range from 0 to minus infinity: the closer to 0, the more probable the next token. Using the model settings, we'll ask for the best 3 candidates for each token.
Then we'll use the HistorySlot's .raw_llm_json member to access the raw JSON output of the LLM. In there we'll find all the information we need.
import json
from yacana import OpenAiAgent, Task, ToolError, Tool, OpenAiModelSettings, HistorySlot
# Defining parameters for our agent
model_settings = OpenAiModelSettings(temperature=0, logprobs=True, top_logprobs=3)
openai_agent = OpenAiAgent("AI assistant", "gpt-4o-mini", system_prompt="You are a helpful AI assistant", model_settings=model_settings, api_token="sk-proj-XXXXXXXXXXXXXXX")
Task("Tell me 1 facts about Canada.", openai_agent).solve()
# Getting the last slot added to the History
slot: HistorySlot = openai_agent.history.get_last_slot()
# Showing the output of the LLM to get the logprobs
print("Raw JSON output from LLM :")
print(json.dumps(json.loads(slot.raw_llm_json), indent=2))
INFO: [PROMPT][To: AI assistant]: Tell me 1 fact about Canada.
INFO: [AI_RESPONSE][From: AI assistant]: Canada is the second-largest country in the world by total area, covering approximately 9.98 million square kilometers (3.85 million square miles).
Raw JSON output from LLM:
{
  "id": "chatcmpl-BTrS4UC2GyJSNTf9AqOYDrklPd3Sx",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": {
        "content": [
          {
            "token": "Canada",
            "bytes": [
              67,
              97,
              110,
              97,
              100,
              97
            ],
            "logprob": -0.011068690568208694,
            "top_logprobs": [
              {
                "token": "Canada",
                "bytes": [
                  67,
                  97,
                  110,
                  97,
                  100,
                  97
                ],
                "logprob": -0.011068690568208694
              },
              {
                "token": "One",
                "bytes": [
                  79,
                  110,
                  101
                ],
                "logprob": -4.511068820953369
              },
              {
                "token": " Canada",
                "bytes": [
                  32,
                  67,
                  97,
                  110,
                  97,
                  100,
                  97
                ],
                "logprob": -11.386068344116211
              }
            ]
          },
          {
            "token": " is",
            "bytes": [
              32,
              105,
              115
            ],
            "logprob": -0.08894743025302887,
            "top_logprobs": [
              {
                "token": " is",
                "bytes": [
                  32,
                  105,
                  115
                ],
                "logprob": -0.08894743025302887
              },
              {
                "token": " has",
                "bytes": [
                  32,
                  104,
                  97,
                  115
                ],
                "logprob": -2.4639475345611572
              },
              {
                "token": " possesses",
                "bytes": [
                  32,
                  112,
                  111,
                  115,
                  115,
                  101,
                  115,
                  115,
                  101,
                  115
                ],
                "logprob": -12.588947296142578
              }
            ]
          },
          ...
Note that the logprobs data is not available inside the Message object itself. So, we must use the .raw_llm_json member from the surrounding slot to access the raw JSON LLM output and take a look at the logprobs.
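As a quick illustration, here is a minimal sketch (reusing the slot variable from the example above) that parses this JSON and turns each logprob back into a regular probability with exp(); for instance exp(-0.011) ≈ 0.99, meaning the model was almost certain about the token "Canada":
import json
import math

# Minimal sketch: walk the OpenAI-style logprobs structure shown above and convert each
# logprob back into a probability (exp(logprob), so 0 -> ~100% and very negative -> ~0%).
data = json.loads(slot.raw_llm_json)
for token_info in data["choices"][0]["logprobs"]["content"]:
    print(f"{token_info['token']!r}: p = {math.exp(token_info['logprob']):.3f}")
    for candidate in token_info["top_logprobs"]:
        print(f"    candidate {candidate['token']!r}: p = {math.exp(candidate['logprob']):.3f}")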
Using media with ChatGPT is similar to the OllamaAgent, with the key difference being that ChatGPT supports multiple media files in a single request.
from yacana import Task, OpenAiAgent
openai_agent = OpenAiAgent("AI assistant", "gpt-4o-mini", system_prompt="You are a helpful AI assistant", api_token="sk-proj-XXXXXXXXXXXXXXXXX")
Task("Describe this image", openai_agent, medias=["./tests/assets/burger.jpg", "./tests/assets/flower.png"]).solve()
ChatGPT has a feature allowing it to return multiple versions of the same message. It's used to provide you with alternative responses. For instance,
some could be more formal, some could be more creative, etc.
Yacana offers a way to get these alternative responses.
Let's say we want 3 alternative responses to the prompt "What is the main invention of Nicolas Tesla?" then select
the third one as the main message of the slot instead of the first one (default).
from typing import List
from yacana import Task, OpenAiAgent, GenericMessage, OpenAiModelSettings, HistorySlot
# Requesting 3 alternative responses using "n" parameter
model_settings = OpenAiModelSettings(n=3, temperature=1.0)
openai_agent = OpenAiAgent("AI assistant", "gpt-4o-mini", system_prompt="You are a helpful AI assistant", model_settings=model_settings, api_token="sk-proj-XXXXXXXXXXXXXXX")
Task("What is the main invention of Nicolas Tesla (short response) ?", openai_agent).solve()
message: GenericMessage = openai_agent.history.get_last_message()
print(f"\nCurrent main message is: {message.content}\n")
# Getting the last slot from the history
slot: HistorySlot = openai_agent.history.get_last_slot()
# Getting the messages from the slot
messages: List[GenericMessage] = slot.messages
# Printing the messages with a counter for readability
for i, message in enumerate(messages, start=1):
    print(f"\n{i}): {message.content}")
# Setting the main message index to 2 (3rd message)
slot.set_main_message_index(2)
# Getting the main message again
message: GenericMessage = openai_agent.history.get_last_message()
print(f"\nCurrent main message is: {message.content}\n")
INFO: [PROMPT][To: AI assistant]: What is the main invention of Nicolas Tesla (short response) ?
INFO: [AI_RESPONSE][From: AI assistant]: Nicolas Tesla is best known for his development of the alternating current (AC) electrical system, which became the standard for electrical power distribution. He also made significant contributions to wireless communication, induction motors, and numerous other innovations in electrical engineering.
Current main message is: Nicolas Tesla is best known for his development of the alternating current (AC) electrical system, which became the standard for electrical power distribution. He also made significant contributions to wireless communication, induction motors, and numerous other innovations in electrical engineering.
1): Nicolas Tesla is best known for his development of the alternating current (AC) electrical system, which became the standard for electrical power distribution. He also made significant contributions to wireless communication, induction motors, and numerous other innovations in electrical engineering.
2): Nicolas Tesla is best known for his development of the alternating current (AC) electrical system, which is the basis for modern electrical power distribution. Additionally, he made significant contributions to numerous innovations, including the Tesla coil, radio technology, and wireless transmission of energy.
3): One of Nikola Tesla's main inventions is the alternating current (AC) electrical system, which includes the AC motor and transformer. This system revolutionized the way electricity is generated and transmitted, enabling long-distance power distribution and laying the foundation for the modern electrical grid.
Current main message is: One of Nikola Tesla's main inventions is the alternating current (AC) electrical system, which includes the AC motor and transformer. This system revolutionized the way electricity is generated and transmitted, enabling long-distance power distribution and laying the foundation for the modern electrical grid.
The HistorySlot has been discussed above. To put it simply, it wraps each message in the history.
This means that the history is not a list of Messages but a list of slots (makes sense?).
Each slot holds one or more messages. In our case, we have 3. The first one is the main message and is the one presented to the LLM during inference; the other 2 are alternate messages.
The slot lets you switch which message should be considered the main message using .set_main_message_index(n).
In the output, the first 'Current main message' shows that the first message is selected (it ends with "engineering"). After setting the index to 2 (counting from 0), the second 'Current main message' shows that the third message is now selected (it ends with "grid").
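Here is a small complementary sketch (reusing the slot and openai_agent variables from the example above) showing the difference between simply reading an alternative through slot.messages and actually promoting it with .set_main_message_index():
# Peek at the 3rd alternative without changing anything: the history still serves the first message.
print(slot.messages[2].content)

# Promote it: from now on the history (and thus the LLM on the next inference) sees this message.
slot.set_main_message_index(2)
print(openai_agent.history.get_last_message().content)  # same content as the line above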
To help you go through the installation process on WSL you can follow this tutorial: Installing VLLM on WSL.
First, let's install VLLM.
You can find the detailed installation steps in the VLLM documentation.
We recommend using conda to install VLLM. Note that conda cannot be used in an enterprise environment without paying for a license; you can use 'uv' instead.
Once conda is installed you can simply pip install vllm.
conda create -n vllm python=3.12 -y
conda activate vllm
pip install vllm
Now, let's start the inference server with a model. If it's not already present, it will be downloaded.
We'll use the Llama-3.1-8B-Instruct model.
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192 --guided-decoding-backend outlines --enable-auto-tool-choice --tool-call-parser llama3_json
For the inference server to start, you will need a Hugging Face account to accept Meta's license agreement for the Llama models.
Read the VLLM tutorial if you need help with that step.
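If you are not authenticated yet, one option is to do it from Python with the huggingface_hub library (it should already be installed alongside vllm; the token value below is a placeholder you can create in your Hugging Face account settings). Running huggingface-cli login from the shell achieves the same thing.
# Minimal sketch: store your Hugging Face token locally so that `vllm serve` can
# download the gated Llama model. The token below is a placeholder.
from huggingface_hub import login

login(token="hf_XXXXXXXXXXXXXXX")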
About the vllm command line parameters: --max-model-len limits the context window size, --guided-decoding-backend outlines selects the Outlines library as the guided (structured) decoding backend, and --enable-auto-tool-choice together with --tool-call-parser llama3_json enables OpenAI-style tool calling for Llama 3 models.
Once the inference server is running, you can use the following code to create an OpenAI-compatible agent:
from yacana import OpenAiAgent, GenericMessage, Task
# Note the endpoint parameter is set to the VLLM server address
vllm_agent = OpenAiAgent("AI assistant", "meta-llama/Llama-3.1-8B-Instruct", system_prompt="You are a helpful AI assistant", endpoint="http://127.0.0.1:8000/v1", api_token="leave blank", runtime_config={"extra_body": {'guided_decoding_backend': 'outlines'}})
# Use the agent to solve a task
message: GenericMessage = Task("What is the capital of France?", vllm_agent).solve()
print(message.content)
Doing simple tool calling:
from yacana import OpenAiAgent, Tool, Task
vllm_agent = OpenAiAgent("AI assistant", "meta-llama/Llama-3.1-8B-Instruct", system_prompt="You are a helpful AI assistant", endpoint="http://127.0.0.1:8000/v1", api_token="leave blank", runtime_config={"extra_body": {'guided_decoding_backend': 'outlines'}})
# Defining a fake weather tool
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny with a high of 25°C."
# Defining the tool
get_weather_tool = Tool("Get_weather", "Calls a weather API and returns the current weather in the given city.", get_weather)
# Adding runtime configuration to the underlying OpenAi library so it works with VLLM
extra_body = {
    'guided_decoding_backend': 'outlines',
    'tool_choice': 'auto',
    'enable_auto_tool_choice': True,
    'tool_call_parser': 'auto'
}
Task("What's the weather in paris ?", vllm_agent, tools=[get_weather_tool], runtime_config={"extra_body": extra_body}).solve()
Note how we used the runtime_config
parameter to specify the guided decoding backend. You can use this parameter to specify other parameters as well.
This is a direct access to the underlying library.
For OpenAI we use the OpenAI python client. You can set any parameter supported by this library.
These settings can either be set at the Agent level or at the Task level. For more information
please refer to the Accessing the underlying client library section.
Using structured outputs is the same as with the OllamaAgent. This is the power of Yacana: it provides you with the same API for structured outputs on local LLMs as on OpenAI. However, you still need to provide the outlines guided-decoding backend (through runtime_config). In this example we set it at the Agent level because it will be useful for every future task requiring grammar enforcement.
from pydantic import BaseModel
from yacana import OpenAiAgent, GenericMessage, Task
class CountryFact(BaseModel):
    name: str
    fact: str

class Facts(BaseModel):
    countryFacts: list[CountryFact]
vllm_agent = OpenAiAgent("AI assistant", "meta-llama/Llama-3.1-8B-Instruct", system_prompt="You are a helpful AI assistant", endpoint="http://127.0.0.1:8000/v1", api_token="leave blank", runtime_config={"extra_body": {'guided_decoding_backend': 'outlines'}})
message: GenericMessage = Task("Tell me 3 facts about Canada.", vllm_agent, structured_output=Facts).solve()
# Print the content of the message as a JSON string
print(message.content)
# Print the structured output as a real class instance
print("Name = ", message.structured_output.countryFacts[0].name)
print("Fact = ", message.structured_output.countryFacts[0].fact)
All other features, like medias, streaming, etc. are also available with the OpenAiAgent and can be used in the exact same way. Please refer to the main documentation for more information.