Resource: EvaluationRun
EvaluationRun is a resource that represents a single evaluation run, which includes a set of prompts, model responses, the evaluation configuration, and the resulting metrics.
name
string
Identifier. The resource name of the EvaluationRun. This is a unique identifier. Format: projects/{project}/locations/{location}/evaluationRuns/{evaluationRun}
displayName
string
Required. The display name of the Evaluation Run.
metadata
value (Value format)
Optional. Metadata about the evaluation run; it can be used by the caller to store additional tracking information about the evaluation run.
labels
map (key: string, value: string)
Optional. Labels for the evaluation run.
dataSource
object (DataSource)
Required. The data source for the evaluation run.
inferenceConfigs
map (key: string, value: object (InferenceConfig))
Optional. The candidate to inference config map for the evaluation run. The candidate can be up to 128 characters long and can consist of any UTF-8 characters.
evaluationConfig
object (EvaluationConfig)
Required. The configuration used for the evaluation.
state
enum (State)
Output only. The state of the evaluation run.
error
object (Status)
Output only. Only populated when the evaluation run's state is FAILED or CANCELLED.
evaluationResults
object (EvaluationResults)
Output only. The results of the evaluation run. Only populated when the evaluation run's state is SUCCEEDED.
createTime
string (Timestamp format)
Output only. Time when the evaluation run was created.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
completionTime
string (Timestamp format)
Output only. Time when the evaluation run was completed.
Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
evaluationSetSnapshot
string
Output only. The specific evaluation set of the evaluation run. For runs with an evaluation set input, this will be that same set. For runs with BigQuery input, it's the sampled BigQuery dataset.
| JSON representation |
|---|

{
  "name": string,
  "displayName": string,
  "metadata": value,
  "labels": {
    string: string,
    ...
  },
  "dataSource": {
    object (DataSource)
  },
  "inferenceConfigs": {
    string: {
      object (InferenceConfig)
    },
    ...
  },
  "evaluationConfig": {
    object (EvaluationConfig)
  },
  "state": enum (State),
  "error": {
    object (Status)
  },
  "evaluationResults": {
    object (EvaluationResults)
  },
  "createTime": string,
  "completionTime": string,
  "evaluationSetSnapshot": string
}
DataSource
The data source for the evaluation run.
source
Union type
source can be only one of the following:
evaluationSet
string
The EvaluationSet resource name. Format: projects/{project}/locations/{location}/evaluationSets/{evaluationSet}
bigqueryRequestSet
object (BigQueryRequestSet)
Evaluation data in BigQuery.
| JSON representation |
|---|

{
  // source
  "evaluationSet": string,
  "bigqueryRequestSet": {
    object (BigQueryRequestSet)
  }
  // Union type
}
BigQueryRequestSet
The request set for the evaluation run.
uri
string
Required. The URI of a BigQuery table. e.g. bq://projectId.bqDatasetId.bqTableId
promptColumn
string
Optional. The name of the column that contains the requests to evaluate. This will be in evaluationItem.EvalPrompt format.
rubricsColumn
string
Optional. The name of the column that contains the rubrics. This is in evaluation_rubric.RubricGroup format.
candidateResponseColumns
map (key: string, value: string)
Optional. Map of candidate name to candidate response column name. The column will be in evaluationItem.CandidateResponse format.
samplingConfig
object (SamplingConfig)
Optional. The sampling config for the BigQuery resource.
| JSON representation |
|---|

{
  "uri": string,
  "promptColumn": string,
  "rubricsColumn": string,
  "candidateResponseColumns": {
    string: string,
    ...
  },
  "samplingConfig": {
    object (SamplingConfig)
  }
}
SamplingConfig
The sampling config.
samplingCount
integer
Optional. The total number of logged records to import. If less data is available than the sampling count, all data will be imported. Default is 100.
samplingMethod
enum (SamplingMethod)
Optional. The sampling method to use.
samplingDuration
string (Duration format)
Optional. How long to wait before sampling data from the BigQuery table. If not specified, defaults to 0.
A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".
| JSON representation |
|---|

{
  "samplingCount": integer,
  "samplingMethod": enum (SamplingMethod),
  "samplingDuration": string
}
SamplingMethod
The sampling method to use.
| Enums | |
|---|---|
| SAMPLING_METHOD_UNSPECIFIED | Unspecified sampling method. |
| RANDOM | Random sampling. |
InferenceConfig
An inference config used for model inference during the evaluation run.
model
string
Optional. The fully qualified name of the publisher model or endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
model_config
Union type
model_config can be only one of the following:
generationConfig
object (GenerationConfig)
Optional. Generation config.
| JSON representation |
|---|

{
  "model": string,
  // model_config
  "generationConfig": {
    object (GenerationConfig)
  }
  // Union type
}
GenerationConfig
Configuration for content generation.
This message contains all the parameters that control how the model generates content. It allows you to influence the randomness, length, and structure of the output.
stopSequences[]
string
Optional. A list of character sequences that will stop the model from generating further tokens. If a stop sequence is generated, the output will end at that point. This is useful for controlling the length and structure of the output. For example, you can use ["\n", "###"] to stop generation at a new line or a specific marker.
responseMimeType
string
Optional. The IANA standard MIME type of the response. The model will generate output that conforms to this MIME type. Supported values include 'text/plain' (default) and 'application/json'. The model needs to be prompted to output the appropriate response type; otherwise the behavior is undefined. This is a preview feature.
responseModalities[]
enum (Modality)
Optional. The modalities of the response. The model will generate a response that includes all the specified modalities. For example, if this is set to [TEXT, IMAGE], the response will include both text and an image.
thinkingConfig
object (ThinkingConfig)
Optional. Configuration for thinking features. An error will be returned if this field is set for models that don't support thinking.
temperature
number
Optional. Controls the randomness of the output. A higher temperature results in more creative and diverse responses, while a lower temperature makes the output more predictable and focused. The valid range is (0.0, 2.0].
topP
number
Optional. Specifies the nucleus sampling threshold. The model considers only the smallest set of tokens whose cumulative probability is at least topP. This helps generate more diverse and less repetitive responses. For example, a topP of 0.9 means the model considers tokens until the cumulative probability of the tokens to select from reaches 0.9. It's recommended to adjust either temperature or topP, but not both.
topK
number
Optional. Specifies the top-k sampling threshold. The model considers only the top k most probable tokens for the next token. This can be useful for generating more coherent and less random text. For example, a topK of 40 means the model will choose the next word from the 40 most likely words.
candidateCount
integer
Optional. The number of candidate responses to generate.
A higher candidateCount can provide more options to choose from, but it also consumes more resources. This can be useful for generating a variety of responses and selecting the best one.
maxOutputTokens
integer
Optional. The maximum number of tokens to generate in the response.
A token is approximately four characters. The default value varies by model. This parameter can be used to control the length of the generated text and prevent overly long responses.
responseLogprobs
boolean
Optional. If set to true, the log probabilities of the output tokens are returned.
Log probabilities are the logarithm of the probability of a token appearing in the output. A higher log probability means the token is more likely to be generated. This can be useful for analyzing the model's confidence in its own output and for debugging.
logprobs
integer
Optional. The number of top log probabilities to return for each token.
This can be used to see which other tokens were considered likely candidates for a given position. A higher value will return more options, but it will also increase the size of the response.
presencePenalty
number
Optional. Penalizes tokens that have already appeared in the generated text. A positive value encourages the model to generate more diverse and less repetitive text. Valid values can range from [-2.0, 2.0].
frequencyPenalty
number
Optional. Penalizes tokens based on their frequency in the generated text. A positive value helps to reduce the repetition of words and phrases. Valid values can range from [-2.0, 2.0].
seed
integer
Optional. A seed for the random number generator.
By setting a seed, you can make the model's output mostly deterministic: for a given prompt and parameters (like temperature, topP, etc.), the model will produce the same response every time. However, absolute determinism is not guaranteed. This is different from parameters like temperature, which control the level of randomness; seed ensures that the "random" choices the model makes are the same on every run, making it essential for testing and ensuring reproducible results.
responseSchema
object (Schema)
Optional. Lets you specify a schema for the model's response, ensuring that the output conforms to a particular structure. This is useful for generating structured data such as JSON. The schema is a subset of the OpenAPI 3.0 schema object.
When this field is set, you must also set the responseMimeType to application/json.
responseJsonSchema
value (Value format)
Optional. When this field is set, responseSchema must be omitted and responseMimeType must be set to application/json.
routingConfig
object (RoutingConfig)
Optional. Routing configuration.
audioTimestamp
boolean
Optional. If enabled, audio timestamps will be included in the request to the model. This can be useful for synchronizing audio with other modalities in the response.
mediaResolution
enum (MediaResolution)
Optional. The token resolution at which input media content is sampled. This is used to control the trade-off between the quality of the response and the number of tokens used to represent the media. A higher resolution allows the model to perceive more detail, which can lead to a more nuanced response, but it will also use more tokens. This does not affect the image dimensions sent to the model.
speechConfig
object (SpeechConfig)
Optional. The speech generation config.
enableAffectiveDialog
boolean
Optional. If enabled, the model will detect emotions and adapt its responses accordingly. For example, if the model detects that the user is frustrated, it may provide a more empathetic response.
imageConfig
object (ImageConfig)
Optional. Config for image generation features.
| JSON representation |
|---|

{
  "stopSequences": [
    string
  ],
  "responseMimeType": string,
  "responseModalities": [
    enum (Modality)
  ],
  "thinkingConfig": {
    object (ThinkingConfig)
  },
  "temperature": number,
  "topP": number,
  "topK": number,
  "candidateCount": integer,
  "maxOutputTokens": integer,
  "responseLogprobs": boolean,
  "logprobs": integer,
  "presencePenalty": number,
  "frequencyPenalty": number,
  "seed": integer,
  "responseSchema": {
    object (Schema)
  },
  "responseJsonSchema": value,
  "routingConfig": {
    object (RoutingConfig)
  },
  "audioTimestamp": boolean,
  "mediaResolution": enum (MediaResolution),
  "speechConfig": {
    object (SpeechConfig)
  },
  "enableAffectiveDialog": boolean,
  "imageConfig": {
    object (ImageConfig)
  }
}
RoutingConfig
The configuration for routing the request to a specific model. This can be used to control which model is used for the generation, either automatically or by specifying a model name.
routing_config
Union type
routing_config can be only one of the following:
autoMode
object (AutoRoutingMode)
In this mode, the model is selected automatically based on the content of the request.
manualMode
object (ManualRoutingMode)
In this mode, the model is specified manually.
| JSON representation |
|---|

{
  // routing_config
  "autoMode": {
    object (AutoRoutingMode)
  },
  "manualMode": {
    object (ManualRoutingMode)
  }
  // Union type
}
AutoRoutingMode
The configuration for automated routing.
When automated routing is specified, the routing will be determined by the pretrained routing model and the customer-provided model routing preference.
modelRoutingPreference
enum (ModelRoutingPreference)
The model routing preference.
| JSON representation |
|---|

{
  "modelRoutingPreference": enum (ModelRoutingPreference)
}
ModelRoutingPreference
The model routing preference.
| Enums | |
|---|---|
| UNKNOWN | Unspecified model routing preference. |
| PRIORITIZE_QUALITY | The model will be selected to prioritize the quality of the response. |
| BALANCED | The model will be selected to balance quality and cost. |
| PRIORITIZE_COST | The model will be selected to prioritize the cost of the request. |
ManualRoutingMode
The configuration for manual routing.
When manual routing is specified, the model will be selected based on the model name provided.
modelName
string
The name of the model to use. Only public LLM models are accepted.
| JSON representation |
|---|
{ "modelName": string } |
Modality
The modalities of the response.
| Enums | |
|---|---|
| MODALITY_UNSPECIFIED | Unspecified modality. Will be processed as text. |
| TEXT | Text modality. |
| IMAGE | Image modality. |
| AUDIO | Audio modality. |
MediaResolution
Media resolution for the input media.
| Enums | |
|---|---|
| MEDIA_RESOLUTION_UNSPECIFIED | Media resolution has not been set. |
| MEDIA_RESOLUTION_LOW | Media resolution set to low (64 tokens). |
| MEDIA_RESOLUTION_MEDIUM | Media resolution set to medium (256 tokens). |
| MEDIA_RESOLUTION_HIGH | Media resolution set to high (zoomed reframing with 256 tokens). |
SpeechConfig
Configuration for speech generation.
voiceConfig
object (VoiceConfig)
The configuration for the voice to use.
languageCode
string
Optional. The language code (ISO 639-1) for the speech synthesis.
multiSpeakerVoiceConfig
object (MultiSpeakerVoiceConfig)
The configuration for a multi-speaker text-to-speech request. This field is mutually exclusive with voiceConfig.
| JSON representation |
|---|

{
  "voiceConfig": {
    object (VoiceConfig)
  },
  "languageCode": string,
  "multiSpeakerVoiceConfig": {
    object (MultiSpeakerVoiceConfig)
  }
}
VoiceConfig
Configuration for a voice.
voice_config
Union type
voice_config can be only one of the following:
prebuiltVoiceConfig
object (PrebuiltVoiceConfig)
The configuration for a prebuilt voice.
replicatedVoiceConfig
object (ReplicatedVoiceConfig)
Optional. The configuration for a replicated voice. This enables users to replicate a voice from an audio sample.
| JSON representation |
|---|

{
  // voice_config
  "prebuiltVoiceConfig": {
    object (PrebuiltVoiceConfig)
  },
  "replicatedVoiceConfig": {
    object (ReplicatedVoiceConfig)
  }
  // Union type
}
PrebuiltVoiceConfig
Configuration for a prebuilt voice.
voiceName
string
The name of the prebuilt voice to use.
| JSON representation |
|---|
{ "voiceName": string } |
ReplicatedVoiceConfig
The configuration for the replicated voice to use.
mimeType
string
Optional. The MIME type of the voice sample. The only currently supported value is audio/wav, which represents 16-bit signed little-endian WAV data with a 24kHz sampling rate. mimeType defaults to audio/wav if not set.
voiceSampleAudio
string (bytes format)
Optional. The sample of the custom voice.
A base64-encoded string.
| JSON representation |
|---|
{ "mimeType": string, "voiceSampleAudio": string } |
MultiSpeakerVoiceConfig
Configuration for a multi-speaker text-to-speech request.
speakerVoiceConfigs[]
object (SpeakerVoiceConfig)
Required. A list of configurations for the voices of the speakers. Exactly two speaker voice configurations must be provided.
| JSON representation |
|---|

{
  "speakerVoiceConfigs": [
    {
      object (SpeakerVoiceConfig)
    }
  ]
}
SpeakerVoiceConfig
Configuration for a single speaker in a multi-speaker setup.
speaker
string
Required. The name of the speaker. This should be the same as the speaker name used in the prompt.
voiceConfig
object (VoiceConfig)
Required. The configuration for the voice of this speaker.
| JSON representation |
|---|

{
  "speaker": string,
  "voiceConfig": {
    object (VoiceConfig)
  }
}
ThinkingConfig
Configuration for the model's thinking features.
"Thinking" is a process where the model breaks down a complex task into smaller, manageable steps. This allows the model to reason about the task, plan its approach, and execute the plan to generate a high-quality response.
includeThoughts
boolean
Optional. If true, the model will include its thoughts in the response. "Thoughts" are the intermediate steps the model takes to arrive at the final response. They can provide insights into the model's reasoning process and help with debugging. If this is true, thoughts are returned only when available.
thinkingBudget
integer
Optional. The token budget for the model's thinking process. The model will make a best effort to stay within this budget. This can be used to control the trade-off between response quality and latency.
thinkingLevel
enum (ThinkingLevel)
Optional. The thinking level, which controls the number of thought tokens the model should generate.
| JSON representation |
|---|

{
  "includeThoughts": boolean,
  "thinkingBudget": integer,
  "thinkingLevel": enum (ThinkingLevel)
}
ThinkingLevel
The thinking level for the model.
| Enums | |
|---|---|
| THINKING_LEVEL_UNSPECIFIED | Unspecified thinking level. |
| LOW | Low thinking level. |
| HIGH | High thinking level. |
ImageConfig
Configuration for image generation.
This message allows you to control various aspects of image generation, such as the output format, aspect ratio, and whether the model can generate images of people.
imageOutputOptions
object (ImageOutputOptions)
Optional. The image output format for generated images.
aspectRatio
string
Optional. The desired aspect ratio for the generated images. The following aspect ratios are supported:
"1:1" "2:3", "3:2" "3:4", "4:3" "4:5", "5:4" "9:16", "16:9" "21:9"
personGeneration
enum (PersonGeneration)
Optional. Controls whether the model can generate people.
imageSize
string
Optional. Specifies the size of generated images. Supported values are 1K, 2K, and 4K. If not specified, the model uses the default value 1K.
| JSON representation |
|---|

{
  "imageOutputOptions": {
    object (ImageOutputOptions)
  },
  "aspectRatio": string,
  "personGeneration": enum (PersonGeneration),
  "imageSize": string
}
ImageOutputOptions
The image output format for generated images.
mimeType
string
Optional. The image format that the output should be saved as.
compressionQuality
integer
Optional. The compression quality of the output image.
| JSON representation |
|---|
{ "mimeType": string, "compressionQuality": integer } |
PersonGeneration
Enum for controlling the generation of people in images.
| Enums | |
|---|---|
| PERSON_GENERATION_UNSPECIFIED | The default behavior is unspecified. The model will decide whether to generate images of people. |
| ALLOW_ALL | Allows the model to generate images of people, including adults and children. |
| ALLOW_ADULT | Allows the model to generate images of adults, but not children. |
| ALLOW_NONE | Prevents the model from generating images of people. |
EvaluationConfig
The evaluation configuration used for the evaluation run.
metrics[]
object (EvaluationRunMetric)
Required. The metrics to be calculated in the evaluation run.
rubricConfigs[]
object (EvaluationRubricConfig)
Optional. The rubric configs for the evaluation run. They are used to generate rubrics which can be used by rubric-based metrics. Multiple rubric configs can be specified for rubric generation but only one rubric config can be used for a rubric-based metric. If more than one rubric config is provided, the evaluation metric must specify a rubric group key. Note that if a generation spec is specified on both a rubric config and an evaluation metric, the rubrics generated for the metric will be used for evaluation.
outputConfig
object (OutputConfig)
Optional. The output config for the evaluation run.
autoraterConfig
object (AutoraterConfig)
Optional. The autorater config for the evaluation run.
promptTemplate
object (PromptTemplate)
The prompt template used for inference. The values for variables in the prompt template are defined in EvaluationItem.EvaluationPrompt.PromptTemplateData.values.
| JSON representation |
|---|

{
  "metrics": [
    {
      object (EvaluationRunMetric)
    }
  ],
  "rubricConfigs": [
    {
      object (EvaluationRubricConfig)
    }
  ],
  "outputConfig": {
    object (OutputConfig)
  },
  "autoraterConfig": {
    object (AutoraterConfig)
  },
  "promptTemplate": {
    object (PromptTemplate)
  }
}
EvaluationRunMetric
The metric used for evaluation runs.
metric
string
Required. The name of the metric.
metricConfig
object (Metric)
The metric config.
metric_spec
Union type
metric_spec can be only one of the following:
rubricBasedMetricSpec
object (RubricBasedMetricSpec)
Spec for a rubric-based metric.
predefinedMetricSpec
object (PredefinedMetricSpec)
Spec for a pre-defined metric.
llmBasedMetricSpec
object (LLMBasedMetricSpec)
Spec for an LLM-based metric.
| JSON representation |
|---|

{
  "metric": string,
  "metricConfig": {
    object (Metric)
  },
  // metric_spec
  "rubricBasedMetricSpec": {
    object (RubricBasedMetricSpec)
  },
  "predefinedMetricSpec": {
    object (PredefinedMetricSpec)
  },
  "llmBasedMetricSpec": {
    object (LLMBasedMetricSpec)
  }
  // Union type
}
RubricBasedMetricSpec
Specification for a metric that is based on rubrics.
metricPromptTemplate
string
Optional. Template for the prompt used by the judge model to evaluate against rubrics.
rubrics_source
Union type
rubrics_source can be only one of the following:
inlineRubrics
object (RepeatedRubrics)
Use rubrics provided directly in the spec.
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input content. This refers to a key in the rubricGroups map of RubricEnhancedContents.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics for evaluation using this specification.
judgeAutoraterConfig
object (AutoraterConfig)
Optional. Configuration for the judge LLM (autorater).
| JSON representation |
|---|

{
  "metricPromptTemplate": string,
  // rubrics_source
  "inlineRubrics": {
    object (RepeatedRubrics)
  },
  "rubricGroupKey": string,
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  // Union type
  "judgeAutoraterConfig": {
    object (AutoraterConfig)
  }
}
RepeatedRubrics
RubricGenerationSpec
Specification for how rubrics should be generated.
promptTemplate
string
Optional. Template for the prompt used to generate rubrics. The details should be updated based on the most-recent recipe requirements.
rubricContentType
enum (RubricContentType)
Optional. The type of rubric content to be generated.
rubricTypeOntology[]
string
Optional. An optional, pre-defined list of allowed types for generated rubrics. If this field is provided, it implies include_rubric_type should be true, and the generated rubric types should be chosen from this ontology.
modelConfig
object (AutoraterConfig)
Optional. Configuration for the model used in rubric generation. Configs including sampling count and base model can be specified here. Flipping is not supported for rubric generation.
| JSON representation |
|---|

{
  "promptTemplate": string,
  "rubricContentType": enum (RubricContentType),
  "rubricTypeOntology": [
    string
  ],
  "modelConfig": {
    object (AutoraterConfig)
  }
}
AutoraterConfig
The autorater config used for the evaluation run.
autoraterModel
string
Optional. The fully qualified name of the publisher model or tuned autorater endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Tuned model endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
generationConfig
object (GenerationConfig)
Optional. Configuration options for model generation and outputs.
sampleCount
integer
Optional. Number of samples for each instance in the dataset. If not specified, the default is 4. Minimum value is 1, maximum value is 32.
| JSON representation |
|---|

{
  "autoraterModel": string,
  "generationConfig": {
    object (GenerationConfig)
  },
  "sampleCount": integer
}
RubricContentType
Specifies the type of rubric content to generate.
| Enums | |
|---|---|
| RUBRIC_CONTENT_TYPE_UNSPECIFIED | The content type to generate is not specified. |
| PROPERTY | Generate rubrics based on properties. |
| NL_QUESTION_ANSWER | Generate rubrics in an NL question answer format. |
| PYTHON_CODE_ASSERTION | Generate rubrics in a unit test format. |
PredefinedMetricSpec
Specification for a pre-defined metric.
metricSpecName
string
Required. The name of a pre-defined metric, such as "instruction_following_v1" or "text_quality_v1".
parameters
object (Struct format)
Optional. The parameters needed to run the pre-defined metric.
| JSON representation |
|---|
{ "metricSpecName": string, "parameters": { object } } |
LLMBasedMetricSpec
Specification for an LLM-based metric.
rubrics_source
Union type
rubrics_source can be only one of the following:
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input. Refers to a key in the rubricGroups map of EvaluationInstance.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
predefinedRubricGenerationSpec
object (PredefinedMetricSpec)
Dynamically generate rubrics using a predefined spec.
metricPromptTemplate
string
Required. Template for the prompt sent to the judge model.
systemInstruction
string
Optional. System instructions for the judge model.
judgeAutoraterConfig
object (AutoraterConfig)
Optional. Configuration for the judge LLM (autorater).
additionalConfig
object (Struct format)
Optional. Additional configuration for the metric.
| JSON representation |
|---|

{
  // rubrics_source
  "rubricGroupKey": string,
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  "predefinedRubricGenerationSpec": {
    object (PredefinedMetricSpec)
  },
  // Union type
  "metricPromptTemplate": string,
  "systemInstruction": string,
  "judgeAutoraterConfig": {
    object (AutoraterConfig)
  },
  "additionalConfig": {
    object
  }
}
Metric
The metric used for running evaluations.
aggregationMetrics[]
enum (AggregationMetric)
Optional. The aggregation metrics to use.
metric_spec
Union type
metric_spec can be only one of the following:
predefinedMetricSpec
object (PredefinedMetricSpec)
The spec for a pre-defined metric.
llmBasedMetricSpec
object (LLMBasedMetricSpec)
Spec for an LLM-based metric.
customCodeExecutionSpec
object (CustomCodeExecutionSpec)
Spec for the custom code execution metric.
pointwiseMetricSpec
object (PointwiseMetricSpec)
Spec for pointwise metric.
pairwiseMetricSpec
object (PairwiseMetricSpec)
Spec for pairwise metric.
exactMatchSpec
object (ExactMatchSpec)
Spec for exact match metric.
bleuSpec
object (BleuSpec)
Spec for BLEU metric.
rougeSpec
object (RougeSpec)
Spec for ROUGE metric.
| JSON representation |
|---|

{
  "aggregationMetrics": [
    enum (AggregationMetric)
  ],
  // metric_spec
  "predefinedMetricSpec": {
    object (PredefinedMetricSpec)
  },
  "llmBasedMetricSpec": {
    object (LLMBasedMetricSpec)
  },
  "customCodeExecutionSpec": {
    object (CustomCodeExecutionSpec)
  },
  "pointwiseMetricSpec": {
    object (PointwiseMetricSpec)
  },
  "pairwiseMetricSpec": {
    object (PairwiseMetricSpec)
  },
  "exactMatchSpec": {
    object (ExactMatchSpec)
  },
  "bleuSpec": {
    object (BleuSpec)
  },
  "rougeSpec": {
    object (RougeSpec)
  }
  // Union type
}
PredefinedMetricSpec
The spec for a pre-defined metric.
metricSpecName
string
Required. The name of a pre-defined metric, such as "instruction_following_v1" or "text_quality_v1".
metricSpecParameters
object (Struct format)
Optional. The parameters needed to run the pre-defined metric.
| JSON representation |
|---|
{ "metricSpecName": string, "metricSpecParameters": { object } } |
LLMBasedMetricSpec
Specification for an LLM-based metric.
rubrics_source
Union type
rubrics_source can be only one of the following:
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input. Refers to a key in the rubricGroups map of EvaluationInstance.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
predefinedRubricGenerationSpec
object (PredefinedMetricSpec)
Dynamically generate rubrics using a predefined spec.
metricPromptTemplate
string
Required. Template for the prompt sent to the judge model.
systemInstruction
string
Optional. System instructions for the judge model.
judgeAutoraterConfig
object (AutoraterConfig)
Optional. Configuration for the judge LLM (autorater).
additionalConfig
object (Struct format)
Optional. Additional configuration for the metric.
| JSON representation |
|---|

{
  // rubrics_source
  "rubricGroupKey": string,
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  "predefinedRubricGenerationSpec": {
    object (PredefinedMetricSpec)
  },
  // Union type
  "metricPromptTemplate": string,
  "systemInstruction": string,
  "judgeAutoraterConfig": {
    object (AutoraterConfig)
  },
  "additionalConfig": {
    object
  }
}
RubricGenerationSpec
Specification for how rubrics should be generated.
promptTemplate
string
Template for the prompt used to generate rubrics. The details should be updated based on the most-recent recipe requirements.
rubricContentType
enum (RubricContentType)
The type of rubric content to be generated.
rubricTypeOntology[]
string
Optional. An optional, pre-defined list of allowed types for generated rubrics. If this field is provided, it implies include_rubric_type should be true, and the generated rubric types should be chosen from this ontology.
modelConfig
object (AutoraterConfig)
Configuration for the model used in rubric generation. Configs including sampling count and base model can be specified here. Flipping is not supported for rubric generation.
| JSON representation |
|---|

{
  "promptTemplate": string,
  "rubricContentType": enum (RubricContentType),
  "rubricTypeOntology": [
    string
  ],
  "modelConfig": {
    object (AutoraterConfig)
  }
}
AutoraterConfig
The configs for the autorater. This is applicable to both EvaluateInstances and EvaluateDataset.
autoraterModel
string
Optional. The fully qualified name of the publisher model or tuned autorater endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Tuned model endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
generationConfig
object (GenerationConfig)
Optional. Configuration options for model generation and outputs.
samplingCount
integer
Optional. Number of samples for each instance in the dataset. If not specified, the default is 4. Minimum value is 1, maximum value is 32.
flipEnabled
boolean
Optional. Default is true. Whether to flip the candidate and baseline responses. This is only applicable to the pairwise metric. If enabled, also provide PairwiseMetricSpec.candidate_response_field_name and PairwiseMetricSpec.baseline_response_field_name. When rendering PairwiseMetricSpec.metric_prompt_template, the candidate and baseline fields will be flipped for half of the samples to reduce bias.
| JSON representation |
|---|

{
  "autoraterModel": string,
  "generationConfig": {
    object (GenerationConfig)
  },
  "samplingCount": integer,
  "flipEnabled": boolean
}
RubricContentType
Specifies the type of rubric content to generate.
| Enums | |
|---|---|
| RUBRIC_CONTENT_TYPE_UNSPECIFIED | The content type to generate is not specified. |
| PROPERTY | Generate rubrics based on properties. |
| NL_QUESTION_ANSWER | Generate rubrics in an NL question answer format. |
| PYTHON_CODE_ASSERTION | Generate rubrics in a unit test format. |
CustomCodeExecutionSpec
Specifies a metric that is populated by evaluating user-defined Python code.
evaluationFunction
string
Required. A Python function. The user is expected to define a function with the following signature, and the signature must be included in the code snippet:

def evaluate(instance: dict[str, Any]) -> float:

instance is the evaluation instance; any fields populated in the instance are available to the function as instance[fieldName].

Example input:

instance = EvaluationInstance(
    response=EvaluationInstance.InstanceData(text="The answer is 4."),
    reference=EvaluationInstance.InstanceData(text="4")
)

Example converted input:

{
    'response': {'text': 'The answer is 4.'},
    'reference': {'text': '4'}
}

Example Python function:

def evaluate(instance: dict[str, Any]) -> float:
    if instance['response']['text'] == instance['reference']['text']:
        return 1.0
    return 0.0
CustomCodeExecutionSpec is also supported in Batch Evaluation (EvalDataset RPC) and Tuning Evaluation. Each line in the input jsonl file will be converted to dict[str, Any] and passed to the evaluation function.
| JSON representation |
|---|
{ "evaluationFunction": string } |
PointwiseMetricSpec
Spec for pointwise metric.
customOutputFormatConfig
object (CustomOutputFormatConfig)
Optional. CustomOutputFormatConfig allows customization of metric output. By default, metrics return a score and explanation. When this config is set, the default output is replaced with either the raw output string or a parsed output based on a user-defined schema. If a custom format is chosen, the score and explanation fields in the corresponding metric result will be empty.
metricPromptTemplate
string
Required. Metric prompt template for pointwise metric.
systemInstruction
string
Optional. System instructions for pointwise metric.
| JSON representation |
|---|

{
  "customOutputFormatConfig": {
    object (CustomOutputFormatConfig)
  },
  "metricPromptTemplate": string,
  "systemInstruction": string
}
CustomOutputFormatConfig
Spec for custom output format configuration.
custom_output_format_config
Union type
custom_output_format_config can be only one of the following:
returnRawOutput
boolean
Optional. Whether to return raw output.
| JSON representation |
|---|
{
  // custom_output_format_config
  "returnRawOutput": boolean
  // Union type
}
PairwiseMetricSpec
Spec for pairwise metric.
candidateResponseFieldName
string
Optional. The field name of the candidate response.
baselineResponseFieldName
string
Optional. The field name of the baseline response.
customOutputFormatConfig
object (CustomOutputFormatConfig)
Optional. CustomOutputFormatConfig allows customization of metric output. When this config is set, the default output is replaced with the raw output string. If a custom format is chosen, the pairwiseChoice and explanation fields in the corresponding metric result will be empty.
metricPromptTemplate
string
Required. Metric prompt template for pairwise metric.
systemInstruction
string
Optional. System instructions for pairwise metric.
| JSON representation |
|---|

{
  "candidateResponseFieldName": string,
  "baselineResponseFieldName": string,
  "customOutputFormatConfig": {
    object (CustomOutputFormatConfig)
  },
  "metricPromptTemplate": string,
  "systemInstruction": string
}
ExactMatchSpec
This type has no fields.
Spec for exact match metric - returns 1 if prediction and reference match exactly, and 0 otherwise.
BleuSpec
Spec for BLEU score metric - calculates the precision of n-grams in the prediction as compared to the reference - returns a score ranging between 0 and 1.
useEffectiveOrder
boolean
Optional. Whether to use effective order to compute the BLEU score.
| JSON representation |
|---|
{ "useEffectiveOrder": boolean } |
RougeSpec
Spec for ROUGE score metric - calculates the recall of n-grams in the prediction as compared to the reference - returns a score ranging between 0 and 1.
rougeType
string
Optional. Supported rouge types are rougen[1-9], rougeL, and rougeLsum.
useStemmer
boolean
Optional. Whether to use a stemmer to compute the ROUGE score.
splitSummaries
boolean
Optional. Whether to split summaries while using rougeLsum.
| JSON representation |
|---|
{ "rougeType": string, "useStemmer": boolean, "splitSummaries": boolean } |
AggregationMetric
The aggregation metrics supported by EvaluationService.EvaluateDataset.
| Enums | |
|---|---|
| AGGREGATION_METRIC_UNSPECIFIED | Unspecified aggregation metric. |
| AVERAGE | Average aggregation metric. Not supported for pairwise metric. |
| MODE | Mode aggregation metric. |
| STANDARD_DEVIATION | Standard deviation aggregation metric. Not supported for pairwise metric. |
| VARIANCE | Variance aggregation metric. Not supported for pairwise metric. |
| MINIMUM | Minimum aggregation metric. Not supported for pairwise metric. |
| MAXIMUM | Maximum aggregation metric. Not supported for pairwise metric. |
| MEDIAN | Median aggregation metric. Not supported for pairwise metric. |
| PERCENTILE_P90 | 90th percentile aggregation metric. Not supported for pairwise metric. |
| PERCENTILE_P95 | 95th percentile aggregation metric. Not supported for pairwise metric. |
| PERCENTILE_P99 | 99th percentile aggregation metric. Not supported for pairwise metric. |
EvaluationRubricConfig
Configuration for a rubric group to be generated/saved for evaluation.
rubricGroupKey
string
Required. The key used to save the generated rubrics. If a generation spec is provided, this key will be used for the name of the generated rubric group. Otherwise, this key will be used to look up the existing rubric group on the evaluation item. Note that if a rubric group key is specified on both a rubric config and an evaluation metric, the key from the metric will be used to select the rubrics for evaluation.
generation_config
Union type
generation_config can be only one of the following:
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
predefinedRubricGenerationSpec
object (PredefinedMetricSpec)
Dynamically generate rubrics using a predefined spec.
| JSON representation |
|---|

{
  "rubricGroupKey": string,
  // generation_config
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  "predefinedRubricGenerationSpec": {
    object (PredefinedMetricSpec)
  }
  // Union type
}
OutputConfig
The output config for the evaluation run.
bigqueryDestination
object (BigQueryDestination)
BigQuery destination for evaluation output.
gcsDestination
object (GcsDestination)
Cloud Storage destination for evaluation output.
| JSON representation |
|---|

{
  "bigqueryDestination": {
    object (BigQueryDestination)
  },
  "gcsDestination": {
    object (GcsDestination)
  }
}
PromptTemplate
Prompt template used for inference.
source
Union type
source can be only one of the following:
promptTemplate
string
Inline prompt template. Template variables should be in the format "{var_name}". Example: "Translate the following from {source_lang} to {target_lang}: {text}"
gcsUri
string
Prompt template stored in Cloud Storage. Format: "gs://my-bucket/file-name.txt".
| JSON representation |
|---|
{
  // source
  "promptTemplate": string,
  "gcsUri": string
  // Union type
}
State
The state of the evaluation run.
| Enums | |
|---|---|
| STATE_UNSPECIFIED | Unspecified state. |
| PENDING | The evaluation run is pending. |
| RUNNING | The evaluation run is running. |
| SUCCEEDED | The evaluation run has succeeded. |
| FAILED | The evaluation run has failed. |
| CANCELLED | The evaluation run has been cancelled. |
| INFERENCE | The evaluation run is performing inference. |
| GENERATING_RUBRICS | The evaluation run is performing rubric generation. |
EvaluationResults
The results of the evaluation run.
summaryMetrics
object (SummaryMetrics)
Optional. The summary metrics for the evaluation run.
evaluationSet
string
The evaluation set where item level results are stored.
| JSON representation |
|---|

{
  "summaryMetrics": {
    object (SummaryMetrics)
  },
  "evaluationSet": string
}
SummaryMetrics
The summary metrics for the evaluation run.
metrics
map (key: string, value: value (Value format))
Optional. Map of metric name to metric value.
totalItems
integer
Optional. The total number of items that were evaluated.
failedItems
integer
Optional. The number of items that failed to be evaluated.
| JSON representation |
|---|
{ "metrics": { string: value, ... }, "totalItems": integer, "failedItems": integer } |
| Methods | |
|---|---|
| cancel | Cancels an Evaluation Run. |
| create | Creates an Evaluation Run. |
| delete | Deletes an Evaluation Run. |
| get | Gets an Evaluation Run. |
| list | Lists Evaluation Runs. |