gemma-7b-it-lora

Model ID: @cf/google/gemma-7b-it-lora

This is a Gemma-7B base model that Cloudflare dedicates for inference with LoRA adapters. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Properties

Task Type: Text Generation

Use the Playground

Try out this model with the Workers AI Model Playground. It requires no setup or authentication and offers an instant way to preview and test a model directly in the browser.

Launch the Model Playground

Worker

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    const response = await env.AI.run("@cf/google/gemma-7b-it-lora", {
      prompt: "tell me a story",
      raw: true, // skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000", // the finetune id OR name
    });
    return Response.json(response);
  },
} satisfies ExportedHandler<Env>;

cURL

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/google/gemma-7b-it-lora \
  -X POST \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -d '{
    "prompt": "tell me a story",
    "raw": true,
    "lora": "00000000-0000-0000-0000-000000000"
  }'

Prompting

Part of getting good results from text generation models is asking questions correctly. LLMs are usually trained on prompts that follow a specific predefined template, and using that same template with the model's tokenizer at inference time generally gives better results.

We recommend using unscoped prompts for inference with LoRA.

Unscoped prompts

You can use unscoped prompts to send a single question to the model without worrying about providing any context. Workers AI will automatically convert your prompt input to a reasonable default scoped prompt internally so that you get the best possible prediction.

{
  prompt: "tell me a joke about cloudflare",
}

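You can also use unscoped prompts to construct the model chat template manually. In this case, set the raw parameter to true so that Workers AI does not apply its default template. Here's an input example of a Mistral chat template prompt; substitute the template that your model or adapter was trained with: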

{
  prompt: "<s>[INST]comedian[/INST]</s>\n[INST]tell me a joke about cloudflare[/INST]",
  raw: true,
}
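
For a Gemma model, a raw prompt would more plausibly follow Gemma's own turn format than Mistral's. The sketch below assumes the turn markers from Google's published Gemma instruction-tuned format (`<start_of_turn>`, `<end_of_turn>`, with `user` and `model` roles). The `buildGemmaPrompt` helper and the `Turn` type are hypothetical names introduced here for illustration, and whether a particular LoRA adapter expects this format depends on how it was fine-tuned.

```typescript
// Hypothetical helper, not part of the Workers AI API.
// Assumes Gemma's published instruction-tuned turn markers.
type Turn = { role: "user" | "model"; content: string };

// Wrap each turn in Gemma's markers and leave a final model turn open,
// so that generation continues from the model's side of the conversation.
function buildGemmaPrompt(turns: Turn[]): string {
  const body = turns
    .map((t) => `<start_of_turn>${t.role}\n${t.content}<end_of_turn>\n`)
    .join("");
  return `${body}<start_of_turn>model\n`;
}

// Input object to pass as the second argument of env.AI.run(...).
const input = {
  prompt: buildGemmaPrompt([
    { role: "user", content: "tell me a joke about cloudflare" },
  ]),
  raw: true, // the prompt is already formatted, so skip the default template
};
```

Gemma's format has no separate system role, so any instructions would go at the start of the first user turn.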

Responses

Using streaming

The recommended method to handle text generation responses is streaming.

LLMs work internally by generating responses sequentially through repeated inference: the full output of an LLM is essentially a sequence of hundreds or thousands of individual prediction tasks. For this reason, while it only takes a few milliseconds to generate a single token, generating the full response takes longer, on the order of seconds.

You can use streaming to start displaying the response as soon as the first tokens are generated, and append each additional token until the response is complete. This yields a much better experience for the end user. Displaying text incrementally as it's generated not only provides instant responsiveness, but also gives the end-user time to read and interpret the text.

To enable streaming, set the stream parameter to true.

Using the Workers API:

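// Inside your Worker's fetch handler:
const messages = [
  { role: "system", content: "You are a friendly assistant" },
  { role: "user", content: "Why is pizza so good?" },
];
const stream = await env.AI.run("@cf/google/gemma-7b-it-lora", {
  stream: true,
  messages,
});
return new Response(stream, {
  headers: {
    "content-type": "text/event-stream",
  },
});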

Using the REST API:

curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/google/gemma-7b-it-lora \
-X POST \
-H "Authorization: Bearer {API_TOKEN}" \
-d '{ "stream": true, "messages": [{ "role": "system", "content": "You are a friendly assistant" }, { "role": "user", "content": "Why is pizza so good?" }]}'

Streaming responses use server-sent events; they are easy to use, simple to implement on the server side, standardized, and broadly available across many platforms natively or as a polyfill.

curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/google/gemma-7b-it-lora" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "where is new york?", "stream": true }'

data: {"response":"New"}
data: {"response":" York"}
data: {"response":" is"}
data: {"response":" located"}
data: {"response":" in"}
data: {"response":" the"}
...
data: [DONE]

Handling streaming responses in the client

Below is an example showing how to parse this response in JavaScript, from the browser:

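const source = new EventSource("/"); // Workers AI streaming endpoint
// `el` is the DOM element that displays the generated text
source.onmessage = (event) => {
  if (event.data === "[DONE]") {
    source.close();
    return;
  }
  const data = JSON.parse(event.data);
  // Append as plain text so model output is never interpreted as HTML
  el.textContent += data.response;
};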
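
EventSource always issues GET requests, so it cannot send a prompt in a POST body. One alternative is to call the endpoint with fetch and decode the event stream by hand. The sketch below is a minimal illustration under that assumption: `SseTextDecoder` and `readStream` are hypothetical names, the URL is whatever endpoint your Worker exposes, and the event format is taken from the sample output shown earlier (`data: {"response": ...}` lines ending with `data: [DONE]`).

```typescript
// Hypothetical helper: incrementally decodes the event stream from text chunks.
class SseTextDecoder {
  private buffer = "";
  done = false;

  // Feed one chunk; returns the text fragments completed by this chunk.
  push(chunk: string): string[] {
    const out: string[] = [];
    this.buffer += chunk;
    const lines = this.buffer.split(/\r?\n/);
    // The last element may be a partial line; keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (this.done || !line.startsWith("data:")) continue;
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true;
        continue;
      }
      const parsed = JSON.parse(payload) as { response?: string };
      if (typeof parsed.response === "string") out.push(parsed.response);
    }
    return out;
  }
}

// Hypothetical usage: POST a JSON body and pass each fragment to onText.
async function readStream(
  url: string,
  body: unknown,
  onText: (text: string) => void,
): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.body) throw new Error("response has no body");
  const reader = res.body.getReader();
  const utf8 = new TextDecoder();
  const sse = new SseTextDecoder();
  while (!sse.done) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    for (const text of sse.push(utf8.decode(value, { stream: true }))) {
      onText(text);
    }
  }
}
```

Buffering the trailing partial line matters because network chunks do not necessarily end on event boundaries; a single `data:` line may arrive split across two reads.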

Non-streaming response

Non-streaming responses may be helpful in some contexts, and they are possible; however, be aware that we limit the maximum number of output sequence tokens to avoid timeouts. Whenever possible, use streaming.

{
"response": "The origin of the phrase \"Hello, World\" is not well-documented, but it is believed to have originated in the early days of computing. In the 1970s, when personal computers were first becoming popular, many programming languages, including C, had a simple \"Hello, World\" program that was used to demonstrate the basics of programming.\nThe idea behind the program was to print the words \"Hello, World\" on the screen, and it was often used as a first program for beginners to learn the basics of programming. Over time, the phrase \"Hello, World\" became a common greeting among programmers and computer enthusiasts, and it is now widely recognized as a symbol of the computing industry.\nIt's worth noting that the phrase \"Hello, World\" is not a specific phrase that was coined by any one person or organization, but rather a catchphrase that evolved over time as a result of its widespread use in the computing industry."
}

API Schema

The following schemas are based on JSON Schema.

Input JSON Schema

{
"type": "object",
"oneOf": [
{
"properties": {
"prompt": {
"type": "string",
"minLength": 1,
"maxLength": 131072
},
"raw": {
"type": "boolean",
"default": false
},
"stream": {
"type": "boolean",
"default": false
},
"max_tokens": {
"type": "integer",
"default": 256
},
"temperature": {
"type": "number",
"minimum": 0,
"maximum": 5
},
"top_p": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"top_k": {
"type": "integer",
"minimum": 1,
"maximum": 50
},
"seed": {
"type": "integer",
"minimum": 1,
"maximum": 9999999999
},
"repetition_penalty": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"frequency_penalty": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"presence_penalty": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"lora": {
"type": "string"
}
},
"required": [
"prompt"
]
},
{
"properties": {
"messages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"role": {
"type": "string"
},
"content": {
"type": "string",
"maxLength": 131072
}
},
"required": [
"role",
"content"
]
}
},
"stream": {
"type": "boolean",
"default": false
},
"max_tokens": {
"type": "integer",
"default": 256
},
"temperature": {
"type": "number",
"minimum": 0,
"maximum": 5
},
"top_p": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"top_k": {
"type": "integer",
"minimum": 1,
"maximum": 50
},
"seed": {
"type": "integer",
"minimum": 1,
"maximum": 9999999999
},
"repetition_penalty": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"frequency_penalty": {
"type": "number",
"minimum": 0,
"maximum": 2
},
"presence_penalty": {
"type": "number",
"minimum": 0,
"maximum": 2
}
},
"required": [
"messages"
]
}
]
}
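
The schema above can be checked on the client before a request is sent, which surfaces mistakes such as a quoted boolean or an out-of-range sampling parameter early. The following is a rough sketch rather than a full JSON Schema validator: `validateInput` and the constant names are hypothetical, and the bounds are copied by hand from the schema, so they would need updating if the schema changes.

```typescript
// Numeric bounds copied from the input schema: [minimum, maximum].
const RANGES: Record<string, [number, number]> = {
  temperature: [0, 5],
  top_p: [0, 2],
  top_k: [1, 50],
  seed: [1, 9999999999],
  repetition_penalty: [0, 2],
  frequency_penalty: [0, 2],
  presence_penalty: [0, 2],
};
const INTEGER_FIELDS = ["max_tokens", "top_k", "seed"];
const BOOLEAN_FIELDS = ["raw", "stream"];
const MAX_TEXT = 131072;

// Hypothetical helper: returns a list of problems, empty when none are found.
function validateInput(input: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const hasPrompt = input.prompt !== undefined;
  const hasMessages = input.messages !== undefined;
  // oneOf: a request is either prompt-based or messages-based, not both.
  if (hasPrompt === hasMessages) {
    errors.push("provide exactly one of prompt or messages");
  }
  if (hasPrompt) {
    const p = input.prompt;
    // .length counts UTF-16 units, an approximation of the schema's limit.
    if (typeof p !== "string" || p.length < 1 || p.length > MAX_TEXT) {
      errors.push("prompt must be a string of 1 to 131072 characters");
    }
  }
  if (hasMessages) {
    const list = input.messages;
    const ok =
      Array.isArray(list) &&
      list.every(
        (m) =>
          typeof m === "object" &&
          m !== null &&
          typeof m.role === "string" &&
          typeof m.content === "string" &&
          m.content.length <= MAX_TEXT,
      );
    if (!ok) errors.push("messages must be an array of { role, content } strings");
  }
  for (const [name, [min, max]] of Object.entries(RANGES)) {
    const v = input[name];
    if (v === undefined) continue;
    if (typeof v !== "number" || v < min || v > max) {
      errors.push(`${name} must be a number between ${min} and ${max}`);
    }
  }
  for (const name of INTEGER_FIELDS) {
    const v = input[name];
    if (v !== undefined && !Number.isInteger(v)) errors.push(`${name} must be an integer`);
  }
  for (const name of BOOLEAN_FIELDS) {
    const v = input[name];
    if (v !== undefined && typeof v !== "boolean") errors.push(`${name} must be a boolean`);
  }
  return errors;
}
```

For example, `validateInput({ prompt: "hi", raw: "true", top_k: 80 })` reports two problems: `top_k` is above the schema maximum of 50, and `raw` is a string rather than a boolean.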

Output JSON Schema

{
"oneOf": [
{
"type": "object",
"contentType": "application/json",
"properties": {
"response": {
"type": "string"
}
}
},
{
"type": "string",
"contentType": "text/event-stream",
"format": "binary"
}
]
}
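
Because the output is either a JSON object with a `response` string or an event stream, calling code may want to tell the two apart before reading the text. The sketch below shows one way to do that with a small check; `extractText` is a hypothetical helper name, and it treats anything without a string `response` property (including a stream) as "no text available".

```typescript
// Hypothetical helper: returns the generated text for a JSON result,
// or null for anything else (for example a stream or an unexpected shape).
function extractText(output: unknown): string | null {
  if (typeof output !== "object" || output === null) return null;
  const candidate = (output as { response?: unknown }).response;
  return typeof candidate === "string" ? candidate : null;
}
```

A caller can then fall back to stream handling whenever `extractText` returns null.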