IBM Granite 4.0 Nano

You can use it on your everyday laptop 🤩

by Alain Airom (Ayrom), Oct 29, 2025

IBM has introduced the Granite 4.0 Nano model family, a strong commitment to building powerful, useful large language models (LLMs) optimized specifically for edge and on-device applications. The models range from approximately 350 million to 1.5 billion parameters and deliver significantly increased capabilities compared to similarly sized models from competitors, as validated on standard benchmarks covering General Knowledge, Math, Code, and Safety. The release comprises four main variants: models based on a new, efficient hybrid-SSM architecture (such as Granite 4.0 H 1B and H 350M) and traditional transformer versions that ensure compatibility with diverse runtimes (such as llama.cpp). Crucially, the Granite 4.0 Nano models are built on the same training pipelines and more than 15 trillion tokens of data used for the larger Granite 4.0 family. All Nano models are released under the Apache 2.0 license and come with IBM's ISO 42001 certification for responsible model development and governance, so users can deploy them with confidence in global standards compliance.

❓ How to try Granite 4.0 Nano?

The models are accessible through two primary channels: the streamlined deployment platform Ollama and the model repository on Hugging Face. Direct access links for both resources are consolidated in the dedicated "Links" section. To illustrate the simplicity and speed of integration, I have adapted and enhanced the foundational code samples, presented in the following section.

Prepare your environment:

```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```

Install the required packages 📦:

```
# requirements.txt
HuggingFace
torch
transformers
accelerate
torchvision
torchaudio
```

```bash
pip install -r requirements.txt
```

Once the environment is ready, just copy and run these two simple apps! The first script builds a chat prompt with the tokenizer's chat template to enable tool-calling behavior for a user question about the weather in Boston; after tokenizing the prepared input and moving it to the selected device, the model generates an output sequence. The second script runs a complete inference pipeline with the pre-trained IBM Granite 4.0 350M model: it loads the model and tokenizer, prepares a user prompt asking for the location of an IBM Research lab, and generates a response.
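Before running the two full scripts, it can help to confirm that the environment and the model download work. The snippet below is not part of the original samples; it is a minimal smoke test using the Transformers pipeline API with the same model ID used in the scripts, and the prompt is an arbitrary choice of mine.

```python
from transformers import pipeline

# Minimal smoke test: download the 350M model and generate a short completion.
# Runs on CPU by default; device_map="auto" (with accelerate installed) can be
# passed to pipeline() to place the model on an available GPU instead.
generator = pipeline("text-generation", model="ibm-granite/granite-4.0-350M")

result = generator("IBM Granite 4.0 Nano is", max_new_tokens=30)
print(result[0]["generated_text"])
```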
First script (tool calling):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
import os  # Import os for file system operations

# --- Device Detection and Selection ---
# Automatically determine the best device available (CUDA > MPS > CPU)
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    # MPS (Metal Performance Shaders) is the accelerator for Apple Silicon (M1/M2/M3)
    device = "mps"
else:
    device = "cpu"

print(f"Selected device: {device}")
# --- End Device Detection ---

model_path = "ibm-granite/granite-4.0-350M"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Pass the automatically determined device to device_map
# The model will load onto the CPU (or MPS/CUDA if available)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a specified city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "Name of the city"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

# change input text as desired
chat = [
    {"role": "user", "content": "What's the weather like in Boston right now?"},
]
chat = tokenizer.apply_chat_template(
    chat,
    tokenize=False,
    tools=tools,
    add_generation_prompt=True,
)

# tokenize the text and move to the selected device
input_tokens = tokenizer(chat, return_tensors="pt").to(device)

# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)

# decode output tokens into text
output = tokenizer.batch_decode(output)

# --- Save output to file in Markdown format ---
output_dir = "./output"
output_file = os.path.join(output_dir, "output.md")

try:
    # Create the output directory if it doesn't exist
    # (exist_ok=True prevents an error if it already exists).
    os.makedirs(output_dir, exist_ok=True)

    # Write the output to the Markdown file.
    # batch_decode returns a list; we take the first item (the generated text).
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(output[0])

    print(f"\nModel output saved successfully to {output_file}")

    # Optionally print the content to the console for immediate review
    print("\n--- Generated Content ---\n")
    print(output[0])
    print("\n-------------------------")

except Exception as e:
    print(f"\nAn error occurred while trying to save the output file: {e}")
# --- End File Saving Logic ---
```
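The first script only prints and saves the raw generation, which wraps the tool call in <tool_call> tags (the exact format is visible in the sample output further down). To act on it, you would parse the JSON and dispatch to a local function. The sketch below is not from the original article: the get_current_weather stub and the tag-matching regex are my own assumptions based on the output format shown in this post.

```python
import json
import re

# Hypothetical stub standing in for a real weather API call.
def get_current_weather(city: str) -> dict:
    return {"city": city, "condition": "sunny", "temperature_c": 21}

AVAILABLE_TOOLS = {"get_current_weather": get_current_weather}

def dispatch_tool_calls(generated_text: str) -> list:
    """Extract <tool_call>...</tool_call> blocks and run the named functions."""
    results = []
    for match in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", generated_text, re.DOTALL):
        call = json.loads(match)
        fn = AVAILABLE_TOOLS.get(call["name"])
        if fn is not None:
            results.append(fn(**call["arguments"]))
    return results

# Example using the assistant turn shown later in this post:
sample = '<tool_call>\n{"name": "get_current_weather", "arguments": {"city": "Boston"}}\n</tool_call>'
print(dispatch_tool_calls(sample))  # [{'city': 'Boston', 'condition': 'sunny', 'temperature_c': 21}]
```

In a real application the stub would call an actual weather service, and its result would be appended to the chat as a tool-response turn before generating again.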
Second script (a simple location query):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os  # Import os for file system operations

# --- Device Detection and Selection ---
# Automatically determine the best device available (CUDA > MPS > CPU)
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    # MPS (Metal Performance Shaders) is the accelerator for Apple Silicon (M1/M2/M3)
    device = "mps"
else:
    device = "cpu"

print(f"Selected device: {device}")
# --- End Device Detection ---

model_path = "ibm-granite/granite-4.0-350M"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Pass the automatically determined device to device_map
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

# change input text as desired
chat = [
    {
        "role": "user",
        "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location."
    },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# tokenize the text and move to the selected device
input_tokens = tokenizer(chat, return_tensors="pt").to(device)

# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)

# decode output tokens into text
output = tokenizer.batch_decode(output)

# --- Save output to file in Markdown format ---
output_dir = "./output"
output_file = os.path.join(output_dir, "output.md")

try:
    # Create the output directory if it doesn't exist.
    os.makedirs(output_dir, exist_ok=True)

    # Write the output to the Markdown file.
    # batch_decode returns a list; we take the first item (the generated text).
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(output[0])

    print(f"\nModel output saved successfully to {output_file}")

    # Optionally print the content to the console for immediate review
    print("\n--- Generated Content ---\n")
    print(output[0])
    print("\n-------------------------")

except Exception as e:
    print(f"\nAn error occurred while trying to save the output file: {e}")
# --- End File Saving Logic ---
```

You'll get these outputs 📄

```
<|start_of_role|>system<|end_of_role|>You are a helpful assistant with access to the following tools. You may call one or more tools to assist with the user query. You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_weather", "description": "Get the current weather for a specified city.", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "Name of the city"}}, "required": ["city"]}}}
</tools>
For each tool call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What's the weather like in Boston right now?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|><tool_call>
{"name": "get_current_weather", "arguments": {"city": "Boston"}}
</tool_call><|end_of_text|>
```

and, for the second script:

```
<|start_of_role|>system<|end_of_role|>You are a helpful assistant. Please ensure responses are professional, accurate, and safe.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory located in the United States. You should only output its name and location.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>IBM Research Laboratory: Cambridge Research Laboratory<|end_of_text|>
```
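Note that tokenizer.batch_decode returns the full sequence, including the prompt and the <|start_of_role|>/<|end_of_text|> special tokens visible above. If you only want the assistant's answer, you can decode just the newly generated tokens. This is a minimal continuation sketch of my own, reusing the model, tokenizer, and input_tokens variables from either script above; raw_output is a name I introduce here for the raw tensor returned by generate().

```python
# Continuation of either script above: keep the raw tensor returned by generate().
raw_output = model.generate(**input_tokens, max_new_tokens=100)

# Slice off the prompt tokens and skip special tokens when decoding,
# so only the assistant's newly generated text remains.
prompt_length = input_tokens["input_ids"].shape[1]
answer = tokenizer.decode(raw_output[0][prompt_length:], skip_special_tokens=True)
print(answer)
```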
Et voilà 🥇

Conclusion

The Granite 4.0 model family represents a pivotal shift toward highly efficient, enterprise-grade AI, redefining performance by focusing on accessibility rather than sheer scale. A key strength lies in its hybrid Mamba/Transformer architecture, which significantly reduces memory requirements (often by over 70%), enabling robust inference on modest, affordable hardware, including consumer-grade GPUs and edge devices. Crucially, as an open-source offering under the Apache 2.0 license, Granite 4.0 models give developers complete operational sovereignty, allowing deep customization, on-premise deployment for enhanced data privacy, and full transparency. This combination of efficiency and open governance lowers the barrier to entry, democratizing advanced AI for complex workflows like RAG and function calling, while ensuring the control and trust necessary for real-world business adoption.