Build an evaluation


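Evaluations help you iterate on and improve your application by testing it against a set of examples after you make changes. Weave provides first-class support for tracking evaluations through the Model and Evaluation classes. The APIs make minimal assumptions, so they can flexibly support a wide range of use cases.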

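What you'll learn:

This guide shows you how to:

Set up a Model
Create a dataset to test LLM responses against
Define a scoring function that compares model output to the expected output
Run an evaluation that tests the model against the dataset, using the scoring function and an additional built-in scorer
View the evaluation results in the Weave UI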

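Prerequisites

A W&B account
Python 3.8+ or Node.js 18+
Required packages installed:
- Python: pip install weave openai
- TypeScript: npm install weave openai
Your OpenAI API key set as an environment variable (the OpenAI clients read OPENAI_API_KEY by default)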

Import the necessary libraries and functions

Import the following libraries into your script:

Python
TypeScript

import json
import openai
import asyncio
import weave
from weave.scorers import MultiTaskBinaryClassificationF1

import * as weave from 'weave';
import OpenAI from 'openai';

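Build a `Model`

In Weave, a Model is an object that captures both the behavior of your model or agent (logic, prompts, parameters) and its versioned metadata (parameters, code, micro-config), so you can reliably track, compare, evaluate, and iterate on it. When you instantiate a Model, Weave automatically captures its configuration and behavior and updates the version whenever something changes, which lets you track performance over time as you iterate. You declare a Model by subclassing Model and implementing a predict method that takes one example and returns a response. The following example model uses OpenAI to extract the name, color, and flavor of an alien fruit from an input sentence.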

Python
TypeScript

class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()

        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
            ],
        )
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        parsed = json.loads(result)
        return parsed

// Note: weave.Model is not yet supported in TypeScript.
// Instead, wrap your model-like function with weave.op.

import * as weave from 'weave';
import OpenAI from 'openai';

const openaiClient = new OpenAI();

const model = weave.op(async function myModel({datasetRow}) {
  const prompt = `Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: ${datasetRow.sentence}`;
  const response = await openaiClient.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  return JSON.parse(response.choices[0].message.content);
});

The ExtractFruitsModel class inherits from (subclasses) weave.Model so that Weave can track the instantiated object. The @weave.op decorator on predict tracks its inputs and outputs. You can instantiate a Model object like this:

Python
TypeScript

# Set your team and project name.
weave.init('<team-name>/eval_pipeline_quickstart')

model = ExtractFruitsModel(
    model_name='gpt-3.5-turbo-1106',
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
)

sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."

print(asyncio.run(model.predict(sentence)))
# In a Jupyter notebook, run this instead:
# await model.predict(sentence)

await weave.init('eval_pipeline_quickstart');

const sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.";

const result = await model({ datasetRow: { sentence } });

console.log(result);
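
The Python predict method above hands the raw completion straight to json.loads, and unlike the TypeScript version it does not request JSON output through response_format. If the model wraps its answer in a Markdown code fence, parsing fails. The following is a minimal sketch of a more forgiving parser. parse_json_reply is a hypothetical helper name, not part of Weave or the OpenAI SDK, and the fence-stripping rule is an assumption about how such replies tend to look.

```python
import json
import re

# Three backticks, built programmatically so this snippet stays easy to embed.
FENCE = "`" * 3


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a Markdown code fence.

    Hypothetical helper for illustration; not part of Weave or the OpenAI SDK.
    """
    cleaned = text.strip()
    # If the whole reply is one fenced block, keep only what is inside it.
    fenced = re.fullmatch(rf"{FENCE}(?:json)?\s*(.*?)\s*{FENCE}", cleaned, flags=re.DOTALL)
    if fenced:
        cleaned = fenced.group(1)
    parsed = json.loads(cleaned)
    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a JSON object, got {type(parsed).__name__}")
    return parsed


# A reply wrapped in a fenced block, as some chat models produce.
reply = FENCE + 'json\n{"fruit": "neoskizzles", "color": "purple", "flavor": "candy"}\n' + FENCE
print(parse_json_reply(reply))
```

Inside predict, the json.loads(result) call could be swapped for parse_json_reply(result) if fenced replies turn out to be a problem in practice.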

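Create a dataset

Next, you need a dataset to evaluate your model on. A Dataset is a collection of examples stored as a Weave object. The following example dataset defines three input sentences and their corresponding correct answers (labels), then arranges them into a JSON-style table of rows that the scoring function can read. This example builds the list of examples in code, but you can also log them one at a time from a running application.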

Python
TypeScript

sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
"Pounits are a bright green color and are more savory than sweet.",
"Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]

const sentences = [
  "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
  "Pounits are a bright green color and are more savory than sweet.",
  "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
];
const labels = [
  { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' },
  { fruit: 'pounits', color: 'bright green', flavor: 'savory' },
  { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' }
];
const examples = sentences.map((sentence, i) => ({
  id: i.toString(),
  sentence,
  target: labels[i]
}));
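
In this guide the dataset column names line up with function parameter names: predict takes sentence, and the scoring function takes target plus the model's output. A mismatched name is an easy mistake to make, so a quick check before running an evaluation can help. The sketch below is plain Python and independent of Weave. missing_columns is a hypothetical helper, the predict stand-in does not call OpenAI, and treating output as supplied by the evaluation is an assumption drawn from how this guide's scorer is written.

```python
import inspect

examples = [
    {"id": "0", "sentence": "Pounits are a bright green color.", "target": {"fruit": "pounits"}},
]


def predict(sentence: str) -> dict:
    """Stand-in for ExtractFruitsModel.predict; returns a fixed answer."""
    return {"fruit": "pounits"}


def fruit_name_score(target: dict, output: dict) -> dict:
    return {"correct": target["fruit"] == output["fruit"]}


def missing_columns(fn, rows, provided=()):
    """Return parameter names of fn that are neither dataset columns nor in provided."""
    params = set(inspect.signature(fn).parameters)
    columns = set(rows[0]) if rows else set()
    return sorted(params - columns - set(provided))


# An empty list means every parameter can be filled from the dataset row.
print(missing_columns(predict, examples))
# "output" is assumed to come from the model's prediction, not the dataset.
print(missing_columns(fruit_name_score, examples, provided=("output",)))
```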

Then create the dataset with the weave.Dataset class and publish it:

Python
TypeScript

weave.init('eval_pipeline_quickstart')
dataset = weave.Dataset(name='fruits', rows=examples)
weave.publish(dataset)

import * as weave from 'weave';
await weave.init('eval_pipeline_quickstart');
const dataset = new weave.Dataset({
  name: 'fruits',
  rows: examples
});
await dataset.save();

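Define a custom scoring function

When you use Weave evaluations, Weave expects a target to compare the output against. The following scoring function takes two dictionaries (target and output) and returns a dictionary of boolean values indicating whether the output matches the target. The @weave.op() decorator lets Weave track the scoring function's execution.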

Python
TypeScript

@weave.op()
def fruit_name_score(target: dict, output: dict) -> dict:
    return {'correct': target['fruit'] == output['fruit']}

import * as weave from 'weave';

const fruitNameScorer = weave.op(
  function fruitNameScore({target, output}) {
    return { correct: target.fruit === output.fruit };
  }
);
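
The scoring function above checks only the fruit field, and it raises a KeyError if the model output has no fruit key. As a hedged variation, the sketch below compares every labelled field and treats a missing field as incorrect. field_match_score is a hypothetical name, and the function is shown as plain Python so it runs without Weave; in a real pipeline it could be decorated with @weave.op() like the function above.

```python
def field_match_score(target: dict, output: dict) -> dict:
    """Compare each labelled field; a field missing from output counts as incorrect."""
    return {
        f"{field}_correct": output.get(field) == expected
        for field, expected in target.items()
    }


target = {"fruit": "glowls", "color": "pale orange", "flavor": "sour and bitter"}
output = {"fruit": "glowls", "color": "orange"}  # wrong color, no flavor

print(field_match_score(target, output))
# {'fruit_correct': True, 'color_correct': False, 'flavor_correct': False}
```

Exact string equality is strict: "orange" does not match "pale orange". Whether to normalize or loosen the comparison depends on what the evaluation is meant to measure.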

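To build your own scoring functions, see the Scorers guide. In some applications you may want to create a custom Scorer class. For example, you might create a standardized LLMJudge class with specific parameters (such as a chat model or prompt), a specific way of scoring each row, and specific logic for computing an aggregate score. For more on defining a Scorer class, see the next chapter's tutorial on model-based evaluation of RAG applications.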

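Run an evaluation with a built-in scorer

Alongside custom scoring functions, you can also use Weave's built-in scorers. In the following Python example, weave.Evaluation() uses the fruit_name_score function defined in the previous section together with the built-in MultiTaskBinaryClassificationF1 scorer, which computes F1 scores. The example evaluates ExtractFruitsModel on the fruits dataset using these two scorers and logs the results to Weave. The TypeScript example uses only the custom scorer.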

Python
TypeScript

weave.init('eval_pipeline_quickstart')

evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset, 
    scorers=[
        MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), 
        fruit_name_score
    ],
)
print(asyncio.run(evaluation.evaluate(model)))
# In a Jupyter notebook, run this instead:
# await evaluation.evaluate(model)

import * as weave from 'weave';

await weave.init('eval_pipeline_quickstart');

const evaluation = new weave.Evaluation({
  name: 'fruit_eval',
  dataset: dataset,
  scorers: [fruitNameScorer],
});
const results = await evaluation.evaluate(model);
console.log(results);
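
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single binary class in plain Python. It illustrates the metric only and makes no claim about how MultiTaskBinaryClassificationF1 is implemented internally; binary_f1 and the sample data are hypothetical.

```python
from typing import Dict, Sequence


def binary_f1(targets: Sequence[bool], outputs: Sequence[bool]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for one binary class."""
    pairs = list(zip(targets, outputs))
    tp = sum(1 for t, o in pairs if t and o)        # true positives
    fp = sum(1 for t, o in pairs if not t and o)    # false positives
    fn = sum(1 for t, o in pairs if t and not o)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Two true positives, one false negative, no false positives.
targets = [True, True, False, True]
outputs = [True, False, False, True]
print(binary_f1(targets, outputs))
```

With this sample, precision is 1.0, recall is 2/3, and F1 works out to roughly 0.8.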

If you are running from a Python script, you need to use asyncio.run. In a Jupyter notebook, you can use await directly.

Complete example

Run the whole evaluation pipeline in a single script:

Python
TypeScript

import json
import asyncio
import openai
import weave
from weave.scorers import MultiTaskBinaryClassificationF1

# Initialize Weave (once)
weave.init('eval_pipeline_quickstart')

# 1. Define the model
class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()
        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": self.prompt_template.format(sentence=sentence)}],
        )
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        return json.loads(result)

# 2. Instantiate the model
model = ExtractFruitsModel(
    model_name='gpt-3.5-turbo-1106',
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
)

# 3. Create the dataset
sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
"Pounits are a bright green color and are more savory than sweet.",
"Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]

dataset = weave.Dataset(name='fruits', rows=examples)
weave.publish(dataset)

# 4. Define the scoring function
@weave.op()
def fruit_name_score(target: dict, output: dict) -> dict:
    return {'correct': target['fruit'] == output['fruit']}

# 5. Run the evaluation
evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset,
    scorers=[
        MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]),
        fruit_name_score
    ],
)
print(asyncio.run(evaluation.evaluate(model)))

import * as weave from 'weave';
import OpenAI from 'openai';

// Initialize Weave once
await weave.init('eval_pipeline_quickstart');

// 1. Define the model
// Note: weave.Model is not yet supported in TypeScript.
// Instead, wrap your model-like function with weave.op.
const openaiClient = new OpenAI();

const model = weave.op(async function myModel({datasetRow}) {
  const prompt = `Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: ${datasetRow.sentence}`;
  const response = await openaiClient.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  return JSON.parse(response.choices[0].message.content);
});

// 2. Create the dataset
const sentences = [
  "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
  "Pounits are a bright green color and are more savory than sweet.",
  "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
];
const labels = [
  { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' },
  { fruit: 'pounits', color: 'bright green', flavor: 'savory' },
  { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' }
];
const examples = sentences.map((sentence, i) => ({
  id: i.toString(),
  sentence,
  target: labels[i]
}));

const dataset = new weave.Dataset({
  name: 'fruits',
  rows: examples
});
await dataset.save();

// 3. Define the scoring function
const fruitNameScorer = weave.op(
  function fruitNameScore({target, output}) {
    return { correct: target.fruit === output.fruit };
  }
);

// 4. Run the evaluation
const evaluation = new weave.Evaluation({
  name: 'fruit_eval',
  dataset: dataset,
  scorers: [fruitNameScorer],
});
const results = await evaluation.evaluate(model);
console.log(results);

View your evaluation results

Weave automatically captures traces of every prediction and score. Click the link printed by the evaluation to view the results in the Weave UI.

Learn more about Weave evaluations

Learn more about how to build and use Scorers.
Check out Weave's built-in scoring functions.
Learn about Model-Based Evaluation, which uses an LLM as a judge.

Next steps

Build a RAG application to learn how to evaluate retrieval-augmented generation.
