In the previous chapters, we used online large models such as ChatGPT. Although API prices keep falling, the cost of frequent calls still adds up, not to mention the various usage restrictions. Deploying a smaller model locally can produce good results in many cases: it reduces cost, comes with fewer restrictions, and, most importantly, allows a degree of customization. This makes local deployment particularly attractive for companies and organizations.
This section will introduce how to install and deploy the latest Llama3 model using Ollama, and how to call it using Spring AI.
Installing Ollama#
The installation of Ollama is straightforward; simply download the installation package for your operating system from the official website and install it. After installation and startup, an icon will appear in the system tray.
Then open the terminal and enter ollama -v to verify the version.
At this point, Ollama is installed, but it only supports command-line interaction. If you want a graphical interface, you can install a web UI (this step can be skipped if you only call the model from code).
For Windows systems, if you have already installed WSL and Docker, you can run one of the following commands:
(1) Run on CPU only:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
(2) Run with GPU support:
docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
After installation, access it via the local address: http://127.0.0.1:3000
Known Issues
After installing this image inside WSL, it cannot download models from the internet and can only read locally installed models; even after selecting a model, the conversation still fails. It is unclear whether this is a port-mapping issue between WSL and the Windows host.
Installing Docker Desktop on the Windows host and running the image there works without issues. If you want to run it with a GPU, you may also need to set up a Conda environment.
Llama3 Model#
To install the model, you can first search on the Ollama official website.
The native Llama3 is trained primarily on English. You can tell it to respond in Chinese through prompts, but such instructions are often ignored, so I recommend using a Llama3 model that has been fine-tuned on Chinese.
You can search for llama3-chinese to find the Chinese fine-tuned versions.
Take the model I used, wangshenzhi/llama3-8b-chinese-chat-ollama-q8, as an example: copy the run command from the model's page and execute it in the terminal.
In most cases it is advisable to install only the 8B model; larger models are best run on dedicated compute cards.
ollama run wangshenzhi/llama3-8b-chinese-chat-ollama-q8
ollama run runs the model. If the model has not been downloaded, it will be downloaded first; if it has already been downloaded, it will run directly.
At this point, you can directly have a conversation in the terminal.
Framework Call#
Based on the learning from the previous section, we can directly reuse the code, only needing to modify the model configuration.
private static OllamaChatClient getClient(){
var ollamaApi = new OllamaApi();
return new OllamaChatClient(ollamaApi).withDefaultOptions(OllamaOptions.create()
.withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8")
.withTemperature(0.4f));
}
Here, OllamaApi has a default baseUrl of http://localhost:11434. If your model is not deployed locally, you need to change this address.
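For example, if Ollama is running on another machine, the address can be passed in when constructing OllamaApi. This is only a minimal sketch, assuming Spring AI's OllamaApi(String baseUrl) constructor and using a placeholder host:
private static OllamaChatClient getRemoteClient() {
    // Placeholder address; replace with the actual host and port of your Ollama instance
    var ollamaApi = new OllamaApi("http://192.168.1.100:11434");
    return new OllamaChatClient(ollamaApi).withDefaultOptions(OllamaOptions.create()
            .withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8")
            .withTemperature(0.4f));
}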
.withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8")
specifies the model; make sure to use the full name.
Other parts do not need to be changed; just run it directly.
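Before wiring everything into the SSE controller, a quick synchronous call is enough to verify that the model responds. This is just a sketch, assuming the getClient() method shown above and the blocking call(String) convenience method of the ChatClient interface:
public static void main(String[] args) {
    OllamaChatClient client = getClient();
    // Blocking call: returns the complete answer as a single string
    String answer = client.call("Briefly introduce yourself in Chinese");
    System.out.println(answer);
}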
Known issues with Llama3: first, the 8B model is small and often produces incorrect answers or hallucinations; second, it does not appear to support Function Calling.
The complete code is as follows:
/**
* @author lza
* @date 2024/04/22-10:31
**/
@RestController
@RequestMapping("ollama")
@RequiredArgsConstructor
@CrossOrigin
public class OllamaController {
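// Maximum number of messages allowed in a single conversation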
private static final Integer MAX_MESSAGE = 10;
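// In-memory conversation history, keyed by conversation id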
private static Map<String, List<Message>> chatMessage = new ConcurrentHashMap<>();
/**
* Create OllamaChatClient
* @return OllamaChatClient
*/
private static OllamaChatClient getClient(){
var ollamaApi = new OllamaApi();
return new OllamaChatClient(ollamaApi).withDefaultOptions(OllamaOptions.create()
.withModel("wangshenzhi/llama3-8b-chinese-chat-ollama-q8")
.withTemperature(0.4f));
}
/**
* Build the prompt message list for a conversation
* @param id Conversation id
* @param message User input message
* @return Message list
*/
private List<Message> getMessages(String id, String message) {
String systemPrompt = "{prompt}";
SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(systemPrompt);
Message userMessage = new UserMessage(message);
Message systemMessage = systemPromptTemplate.createMessage(MapUtil.of("prompt", "you are a helpful AI assistant"));
List<Message> messages = chatMessage.get(id);
// If no messages are retrieved, create new messages and add the system prompt and user input to the message list
if (messages == null){
messages = new ArrayList<>();
messages.add(systemMessage);
messages.add(userMessage);
} else {
messages.add(userMessage);
}
return messages;
}
/**
* Initialize function calls (carried over from the OpenAI chapter; not used in this controller, since Llama3 does not appear to support Function Calling)
* @return ChatOptions
*/
private ChatOptions initFunc(){
return OpenAiChatOptions.builder().withFunctionCallbacks(List.of(
FunctionCallbackWrapper.builder(new MockWeatherService()).withName("weather").withDescription("Get the weather in location").build(),
FunctionCallbackWrapper.builder(new WbHotService()).withName("wbHot").withDescription("Get the hot list of Weibo").build(),
FunctionCallbackWrapper.builder(new TodayNews()).withName("todayNews").withDescription("60s watch world news").build(),
FunctionCallbackWrapper.builder(new DailyEnglishFunc()).withName("dailyEnglish").withDescription("A daily inspirational sentence in English").build())).build();
}
/**
* Create connection
*/
@SneakyThrows
@GetMapping("/init/{message}")
public String init() {
return String.valueOf(UUID.randomUUID());
}
@GetMapping("chat/{id}/{message}")
public SseEmitter chat(@PathVariable String id, @PathVariable String message, HttpServletResponse response) {
response.setHeader("Content-type", "text/html;charset=UTF-8");
response.setCharacterEncoding("UTF-8");
OllamaChatClient client = getClient();
SseEmitter emitter = SseEmitterUtils.connect(id);
List<Message> messages = getMessages(id, message);
System.err.println("chatMessage size: " + messages.size());
System.err.println("chatMessage: " + chatMessage);
if (messages.size() > MAX_MESSAGE){
SseEmitterUtils.sendMessage(id, "Too many conversation attempts, please try again later 🤔");
}else {
// Get the model's output stream
Flux<ChatResponse> stream = client.stream(new Prompt(messages));
// Send messages in the stream using SSE
Mono<String> result = stream
.flatMap(it -> {
StringBuilder sb = new StringBuilder();
String content = it.getResult().getOutput().getContent();
Optional.ofNullable(content).ifPresent(r -> {
SseEmitterUtils.sendMessage(id, content);
sb.append(content);
});
return Mono.just(sb.toString());
})
// Concatenate messages into a string
.reduce((a, b) -> a + b)
.defaultIfEmpty("");
// Store the message in chatMessage as AssistantMessage
result.subscribe(finalContent -> messages.add(new AssistantMessage(finalContent)));
// Store the message in chatMessage
chatMessage.put(id, messages);
}
return emitter;
}
}
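Once the application is running, the two endpoints can be exercised from any SSE-capable client. Below is a minimal client-side sketch using Spring's WebClient; it is not part of the original code, and it assumes the default Spring Boot port 8080 plus the spring-webflux dependency on the classpath.
import org.springframework.web.reactive.function.client.WebClient;

public class OllamaSseClientDemo {
    public static void main(String[] args) {
        // Assumes the controller above runs on the default Spring Boot port
        WebClient webClient = WebClient.create("http://localhost:8080");

        // 1. Obtain a conversation id
        String id = webClient.get()
                .uri("/ollama/init/hello")
                .retrieve()
                .bodyToMono(String.class)
                .block();

        // 2. Consume the SSE stream and print each chunk as it arrives
        webClient.get()
                .uri("/ollama/chat/{id}/{message}", id, "Hello")
                .retrieve()
                .bodyToFlux(String.class)
                .doOnNext(System.out::print)
                .blockLast();
    }
}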
Free Resources#
Here are some free Llama3 models and APIs:
Groq (not to be confused with Elon Musk's Grok) reportedly runs on its own dedicated inference chips rather than NVIDIA GPUs and is very fast.
NVIDIA has many free models available for use.
Cloudflare also offers many models for deployment, with free quotas.