> ## Documentation Index > Fetch the complete documentation index at: https://wb-21fd5541-sdk-testing-latest.mintlify.site/llms.txt > Use this file to discover all available pages before exploring further. # 평가 만들기 > Weave Models 및 평가를 사용해 평가 파이프라인을 구축하는 방법을 알아보세요 export const GitHubLink = ({url}) => GitHub 소스 코드 ; export const ColabLink = ({url}) => Colab에서 사용해 보기 ;

이 튜토리얼에서는 Weave에서 엔드투엔드 평가 파이프라인을 구축하는 방법을 안내합니다. 이를 통해 LLM 애플리케이션을 반복적으로 개선하는 과정에서 품질을 측정하고 추적할 수 있습니다. 평가는 일관된 예시 세트를 기준으로 변경 사항을 비교하고, 문제가 사용자에게 영향을 미치기 전에 회귀를 감지하는 데 도움이 됩니다. 이 튜토리얼은 LLM 기반 애플리케이션을 구축하며, 이를 반복적으로 테스트할 수 있는 재현 가능한 방법을 원하는 개발자를 대상으로 합니다. Weave는 [`Model`](/ko/weave/guides/core-types/models) 및 [`평가`](/ko/weave/guides/core-types/evaluations) 클래스를 통해 평가 추적을 기본적으로 지원합니다. API는 전제를 최소화하도록 설계되어 있어 다양한 사용 사례에 적합합니다. Evals hero

## 학습할 내용

이 가이드에서는 다음을 수행하는 방법을 알아봅니다. * `Model`을 설정합니다. * LLM의 응답을 테스트할 데이터셋을 만듭니다. * 모델 출력을 예상 출력과 비교하는 점수화 함수를 정의합니다. * 점수화 함수와 추가 기본 제공 Scorer를 사용해 데이터셋으로 모델을 테스트하는 평가를 실행합니다. * Weave UI에서 평가 결과를 확인합니다. 이 가이드를 마치면 예시 모델을 데이터셋으로 평가하고 결과를 Weave에 기록하는 작동하는 평가 파이프라인을 갖추게 됩니다.

## 사전 요구 사항

* [W\&B 계정](https://wandb.ai/signup) * Python 3.10+ 또는 Node.js 18+ * 필수 패키지가 설치되어 있어야 합니다: * **Python**: `pip install weave openai` * **TypeScript**: `npm install weave openai` * [OpenAI API 키](https://platform.openai.com/api-keys)를 환경 변수로 설정해야 합니다.

## 필요한 라이브러리와 함수를 임포트하기

다음 라이브러리를 스크립트에 임포트합니다. ```python lines theme={null} import json import openai import asyncio import weave from weave.scorers import MultiTaskBinaryClassificationF1 ``` ```typescript twoslash lines theme={null} // @noErrors import * as weave from 'weave'; import OpenAI from 'openai'; ```

## `Model` 구축하기

라이브러리를 준비했으면, 다음 단계는 평가할 모델을 정의하는 것입니다. Weave에서 [`Models`는 객체](/ko/weave/guides/core-types/models)이며, 모델 또는 에이전트의 동작(로직, prompt, parameters)과 버전 관리되는 메타데이터(parameters, 코드, 마이크로 설정)를 함께 캡처하므로 안정적으로 추적, 비교, 평가하고 반복적으로 개선할 수 있습니다. `Model`을 인스턴스화하면 Weave가 해당 설정과 동작을 자동으로 캡처하고, 변경이 발생하면 버전을 업데이트합니다. 따라서 반복적으로 개선하는 과정에서 시간에 따른 성능 변화를 추적할 수 있습니다. `Model`을 선언하려면 `Model`을 서브클래싱하고, 하나의 예시를 입력받아 응답을 반환하는 `predict` 함수를 구현하세요. 다음 예시 모델은 OpenAI를 사용해 입력 문장에서 외계 과일의 이름, 색상, 맛을 추출합니다. ```python lines {1,5} theme={null} class ExtractFruitsModel(weave.Model): model_name: str prompt_template: str @weave.op() async def predict(self, sentence: str) -> dict: client = openai.AsyncClient() response = await client.chat.completions.create( model=self.model_name, messages=[ {"role": "user", "content": self.prompt_template.format(sentence=sentence)} ], ) result = response.choices[0].message.content if result is None: raise ValueError("No response from model") parsed = json.loads(result) return parsed ``` ```typescript twoslash lines {9} theme={null} // @noErrors // 참고: TypeScript에서는 아직 weave.Model이 지원되지 않습니다. // 대신 모델과 유사한 함수를 weave.op로 래핑하세요 import * as weave from 'weave'; import OpenAI from 'openai'; const openaiClient = new OpenAI(); const model = weave.op(async function myModel({datasetRow}) { const prompt = `Extract fields ("fruit": , "color": , "flavor") from the following text, as json: ${datasetRow.sentence}`; const response = await openaiClient.chat.completions.create({ model: 'gpt-3.5-turbo', messages: [{ role: 'user', content: prompt }], response_format: { type: 'json_object' } }); return JSON.parse(response.choices[0].message.content); }); ``` `ExtractFruitsModel` 클래스는 Weave가 인스턴스화된 객체를 추적할 수 있도록 `weave.Model`을 상속합니다. `@weave.op`는 `predict` 함수를 데코레이트하여 입력과 출력을 추적합니다. 다음과 같이 `Model` 객체를 인스턴스화할 수 있습니다. ```python lines theme={null} # 팀과 프로젝트 이름을 설정하세요 weave.init('[YOUR-TEAM]/eval_pipeline_quickstart') model = ExtractFruitsModel( model_name='gpt-3.5-turbo-1106', prompt_template='Extract fields ("fruit": , "color": , "flavor": ) from the following text, as json: {sentence}' ) sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy." print(asyncio.run(model.predict(sentence))) # Jupyter Notebook에서는 다음을 실행하세요: # await model.predict(sentence) ``` ```typescript twoslash theme={null} // @noErrors await weave.init('eval_pipeline_quickstart'); const sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."; const result = await model({ datasetRow: { sentence } }); console.log(result); ```

## 데이터셋 만들기

`Model`을 정의했으면, 이제 이를 평가할 데이터셋이 필요합니다. [`Dataset`](/ko/weave/guides/core-types/datasets)은 Weave 객체로 저장된 예시 모음입니다. 데이터셋을 Weave에 게시하면 버전이 지정되며 여러 평가 run에서 재사용할 수 있습니다. 다음 예제 데이터셋은 세 개의 입력 예시 문장과 해당 정답(`labels`)을 정의한 뒤, 스코어링 함수s가 읽을 수 있는 JSON 테이블 형식으로 구성합니다. 이 예시에서는 코드에서 예시 목록을 만들지만, 실행 중인 애플리케이션에서 예시를 하나씩 로깅할 수도 있습니다. ```python lines theme={null} sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.", "Pounits are a bright green color and are more savory than sweet.", "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."] labels = [ {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'}, {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'}, {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'} ] examples = [ {'id': '0', 'sentence': sentences[0], 'target': labels[0]}, {'id': '1', 'sentence': sentences[1], 'target': labels[1]}, {'id': '2', 'sentence': sentences[2], 'target': labels[2]} ] ``` ```typescript twoslash theme={null} // @noErrors const sentences = [ "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.", "Pounits are a bright green color and are more savory than sweet.", "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them." ]; const labels = [ { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' }, { fruit: 'pounits', color: 'bright green', flavor: 'savory' }, { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' } ]; const examples = sentences.map((sentence, i) => ({ id: i.toString(), sentence, target: labels[i] })); ``` 그런 다음 `weave.Dataset()` 클래스를 사용해 데이터셋을 만들고 게시합니다. ```python lines {2} theme={null} weave.init('eval_pipeline_quickstart') dataset = weave.Dataset(name='fruits', rows=examples) weave.publish(dataset) ``` ```typescript twoslash lines {3-6} theme={null} // @noErrors import * as weave from 'weave'; await weave.init('eval_pipeline_quickstart'); const dataset = new weave.Dataset({ name: 'fruits', rows: examples }); await dataset.save(); ```

## 맞춤형 점수화 함수 정의하기

이제 모델과 데이터셋이 있으므로, 각 예시에서 모델이 얼마나 잘 수행하는지 측정할 방법이 필요합니다. 점수화 함수는 모델의 `output`을 예상 `target`과 비교하고 평가에서 보고할 메트릭을 생성합니다. Weave 평가를 사용할 때 Weave는 `output`과 비교할 `target`이 있어야 합니다. 다음 점수화 함수는 두 개의 딕셔너리(`target` 및 `output`)를 받아, output이 target과 일치하는지를 나타내는 불리언 값 딕셔너리를 반환합니다. `@weave.op()` 데코레이터를 사용하면 Weave가 점수화 함수의 실행을 추적할 수 있습니다. ```python lines theme={null} @weave.op() def fruit_name_score(target: dict, output: dict) -> dict: return {'correct': target['fruit'] == output['fruit']} ``` ```typescript twoslash theme={null} // @noErrors import * as weave from 'weave'; const fruitNameScorer = weave.op( function fruitNameScore({target, output}) { return { correct: target.fruit === output.fruit }; } ); ``` 직접 점수화 함수를 만드는 방법에 대한 자세한 내용은 [Scorer](/ko/weave/guides/evaluation/scorers) 가이드를 참조하세요. 일부 애플리케이션에서는 맞춤형 `Scorer` 클래스를 만들고 싶을 수 있습니다. 예를 들어, 특정 매개변수(예: 채팅 모델 또는 프롬프트), 특정 행 스코어링, 집계 점수 계산을 포함하는 표준화된 `LLMJudge` 클래스를 만들 수 있습니다. 자세한 내용은 [RAG 애플리케이션의 모델 기반 평가](/ko/weave/tutorial-rag#optional-defining-a-scorer-class)에서 `Scorer` 클래스를 정의하는 튜토리얼을 참조하세요.

## 기본 제공 Scorer를 사용해 평가 실행하기

모델, 데이터셋, 맞춤형 Scorer가 준비되었으면, 이제 모든 요소를 연결해 평가 run으로 구성할 수 있습니다. 맞춤형 점수화 함수와 함께 [Weave의 기본 제공 Scorer](/ko/weave/guides/evaluation/builtin_scorers)도 사용할 수 있습니다. 다음 평가에서 `weave.Evaluation()`은 이전 섹션에서 정의한 `fruit_name_score` 점수화 함수와 [F1 점수](https://en.wikipedia.org/wiki/F-score)를 계산하는 기본 제공 `MultiTaskBinaryClassificationF1` Scorer를 사용합니다. 다음 예제에서는 두 함수로 점수를 계산해 `fruits` 데이터셋에서 `ExtractFruitsModel`을 평가하고, 결과를 Weave에 기록합니다. ```python lines {3-10} theme={null} weave.init('eval_pipeline_quickstart') evaluation = weave.Evaluation( name='fruit_eval', dataset=dataset, scorers=[ MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), fruit_name_score ], ) print(asyncio.run(evaluation.evaluate(model))) # Jupyter Notebook에서 실행 중이라면 다음을 실행하세요: # await evaluation.evaluate(model) ``` ```typescript twoslash lines {5-9} theme={null} // @noErrors import * as weave from 'weave'; await weave.init('eval_pipeline_quickstart'); const evaluation = new weave.Evaluation({ name: 'fruit_eval', dataset: dataset, scorers: [fruitNameScorer], }); const results = await evaluation.evaluate(model); console.log(results); ``` Python 스크립트에서 실행하는 경우 `asyncio.run`을 사용해야 합니다. 하지만 Jupyter Notebook에서 실행하는 경우에는 `await`를 직접 사용할 수 있습니다.

### 전체 예제

```python lines theme={null} import json import asyncio import openai import weave from weave.scorers import MultiTaskBinaryClassificationF1 # Weave 초기화 (한 번만) weave.init('eval_pipeline_quickstart') # 1. 모델 정의 class ExtractFruitsModel(weave.Model): model_name: str prompt_template: str @weave.op() async def predict(self, sentence: str) -> dict: client = openai.AsyncClient() response = await client.chat.completions.create( model=self.model_name, messages=[{"role": "user", "content": self.prompt_template.format(sentence=sentence)}], ) result = response.choices[0].message.content if result is None: raise ValueError("No response from model") return json.loads(result) # 2. 모델 인스턴스화 model = ExtractFruitsModel( model_name='gpt-3.5-turbo-1106', prompt_template='Extract fields ("fruit": , "color": , "flavor": ) from the following text, as json: {sentence}' ) # 3. 데이터셋 생성 sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.", "Pounits are a bright green color and are more savory than sweet.", "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."] labels = [ {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'}, {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'}, {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'} ] examples = [ {'id': '0', 'sentence': sentences[0], 'target': labels[0]}, {'id': '1', 'sentence': sentences[1], 'target': labels[1]}, {'id': '2', 'sentence': sentences[2], 'target': labels[2]} ] dataset = weave.Dataset(name='fruits', rows=examples) weave.publish(dataset) # 4. 점수화 함수 정의 @weave.op() def fruit_name_score(target: dict, output: dict) -> dict: return {'correct': target['fruit'] == output['fruit']} # 5. 평가 실행 evaluation = weave.Evaluation( name='fruit_eval', dataset=dataset, scorers=[ MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), fruit_name_score ], ) print(asyncio.run(evaluation.evaluate(model))) ``` ```typescript twoslash lines theme={null} // @noErrors import * as weave from 'weave'; import OpenAI from 'openai'; // Weave를 한 번 초기화합니다 await weave.init('eval_pipeline_quickstart'); // 1. 모델 정의 // 참고: weave.Model은 아직 TypeScript에서 지원되지 않습니다. // 대신, 모델과 유사한 함수를 weave.op으로 래핑하세요 const openaiClient = new OpenAI(); const model = weave.op(async function myModel({datasetRow}) { const prompt = `Extract fields ("fruit": , "color": , "flavor": ) from the following text, as json: ${datasetRow.sentence}`; const response = await openaiClient.chat.completions.create({ model: 'gpt-3.5-turbo', messages: [{ role: 'user', content: prompt }], response_format: { type: 'json_object' } }); return JSON.parse(response.choices[0].message.content); }); // 2. 데이터셋 생성 const sentences = [ "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.", "Pounits are a bright green color and are more savory than sweet.", "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them." ]; const labels = [ { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' }, { fruit: 'pounits', color: 'bright green', flavor: 'savory' }, { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' } ]; const examples = sentences.map((sentence, i) => ({ id: i.toString(), sentence, target: labels[i] })); const dataset = new weave.Dataset({ name: 'fruits', rows: examples }); await dataset.save(); // 3. 점수화 함수 정의 const fruitNameScorer = weave.op( function fruitNameScore({target, output}) { return { correct: target.fruit === output.fruit }; } ); // 4. 평가 실행 const evaluation = new weave.Evaluation({ name: 'fruit_eval', dataset: dataset, scorers: [fruitNameScorer], }); const results = await evaluation.evaluate(model); console.log(results); ```

## 평가 결과 보기

평가가 완료되면 Weave UI에서 각 예측과 Scorer 결과를 살펴볼 수 있습니다. Weave는 각 예측과 점수에 대한 트레이스를 자동으로 캡처합니다. 평가에서 출력된 링크를 클릭해 Weave UI에서 결과를 확인하세요. 평가 결과

## Weave 평가에 대해 더 알아보기

이제 완전한 평가 파이프라인이 준비되었습니다. Weave의 평가 특성을 더 자세히 살펴보려면 다음 리소스를 참조하세요. * [Scorer를 빌드하고 사용하는 방법](/ko/weave/guides/evaluation/scorers)을 자세히 알아보세요. * Weave의 [기본 제공 스코어링 함수](/ko/weave/guides/evaluation/builtin_scorers)를 확인해 보세요. * LLM을 평가자로 사용하는 [모델 기반 평가](/ko/weave/guides/evaluation/scorers#model-based-evaluation)에 대해 알아보세요.

## 다음 단계

[RAG 애플리케이션 구축하기](/ko/weave/tutorial-rag)를 통해 검색 증강 생성 평가를 알아보세요.