# NVIDIA NeMo Inference Microservice Deploy Job

> Deploy a W&B model artifact to NVIDIA NeMo Inference Microservice using W&B Launch for scalable model serving.

Use W\&B Launch to deploy a model artifact from W\&B to an NVIDIA NeMo Inference Microservice (NIM). W\&B Launch converts the model artifact to an NVIDIA NeMo model and deploys it to a running NIM/Triton server.

W\&B Launch currently supports the following model types:

1. [Llama2](https://llama.meta.com/llama2/)
2. [StarCoder](https://github.com/bigcode-project/starcoder)
3. NV-GPT (coming soon)

<Note>
  Deployment time varies by model and machine type. The base Llama2-7b config takes about 1 minute on Google Cloud's `a2-ultragpu-1g`.
</Note>

## Quickstart

1. [Create a launch queue](/platform/launch/add-job-to-queue/) if you don't already have one. The following is an example queue config:

   ```yaml theme={null}
   net: host
   gpus: all # can be a specific set of GPUs or `all` to use everything
   runtime: nvidia # also requires nvidia container runtime
   volume:
     - model-store:/model-store/
   ```

   <Frame>
     <img src="https://mintcdn.com/wb-21fd5541-sdk-testing-latest/5BwwFpNAnQO_33rW/images/integrations/nim1.png?fit=max&auto=format&n=5BwwFpNAnQO_33rW&q=85&s=805887ac45471269ea3fa4a89038c583" alt="image" width="972" height="570" data-path="images/integrations/nim1.png" />
   </Frame>

2. Create this job in your project:

   ```bash theme={null}
   wandb job create -n "deploy-to-nvidia-nemo-inference-microservice" \
      -e $ENTITY \
      -p $PROJECT \
      -E jobs/deploy_to_nvidia_nemo_inference_microservice/job.py \
      -g andrew/nim-updates \
      git https://github.com/wandb/launch-jobs
   ```

3. Launch an agent on your GPU machine:
   ```bash theme={null}
   wandb launch-agent -e $ENTITY -p $PROJECT -q $QUEUE
   ```

4. Submit the deployment launch job with your desired config from the [Launch UI](https://wandb.ai/launch).
   1. You can also submit the job with the CLI:
      ```bash theme={null}
      wandb launch -d gcr.io/playground-111/deploy-to-nemo:latest \
        -e $ENTITY \
        -p $PROJECT \
        -q $QUEUE \
        -c $CONFIG_JSON_FNAME
      ```
      <Frame>
        <img src="https://mintcdn.com/wb-21fd5541-sdk-testing-latest/5BwwFpNAnQO_33rW/images/integrations/nim2.png?fit=max&auto=format&n=5BwwFpNAnQO_33rW&q=85&s=0513e25043f6ae75f1ab04ee7fd29ce3" alt="image" width="903" height="1263" data-path="images/integrations/nim2.png" />
      </Frame>

5. Track the deployment progress in the Launch UI.
   <Frame>
     <img src="https://mintcdn.com/wb-21fd5541-sdk-testing-latest/5BwwFpNAnQO_33rW/images/integrations/nim3.png?fit=max&auto=format&n=5BwwFpNAnQO_33rW&q=85&s=f48a1f0afecdbbad868e524138302569" alt="image" width="928" height="692" data-path="images/integrations/nim3.png" />
   </Frame>

6. Once the deployment completes, you can immediately send a request to the endpoint with `curl` to test the model. The model name is always `ensemble`.
   ```bash theme={null}
   #!/bin/bash
   curl -X POST "http://0.0.0.0:9999/v1/completions" \
       -H "accept: application/json" \
       -H "Content-Type: application/json" \
       -d '{
           "model": "ensemble",
           "prompt": "Tell me a joke",
           "max_tokens": 256,
           "temperature": 0.5,
           "n": 1,
           "stream": false,
           "stop": "string",
           "frequency_penalty": 0.0
           }'
   ```
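
If you test the endpoint repeatedly, it may help to wrap the request from the last step in a small script. The following is a minimal sketch, not part of the deploy job: the names `NIM_URL`, `build_payload`, and `nim_complete` are hypothetical, and the URL and the `ensemble` model name are taken from the example above. The escaping handles only backslashes, double quotes, and newlines in the prompt, so treat it as a starting point.

```bash
#!/bin/bash
# Hypothetical helpers for sending test prompts to the deployed endpoint.
# The URL and model name match the curl example above; override NIM_URL if
# your server listens elsewhere.
NIM_URL="${NIM_URL:-http://0.0.0.0:9999/v1/completions}"

# Build the JSON request body for a prompt.
#   $1: prompt text
#   $2: max_tokens (optional, defaults to 256)
build_payload() {
  local prompt="$1"
  local max_tokens="${2:-256}"
  # Minimal JSON string escaping: backslash, double quote, newline.
  prompt=${prompt//\\/\\\\}
  prompt=${prompt//\"/\\\"}
  prompt=${prompt//$'\n'/\\n}
  printf '{"model": "ensemble", "prompt": "%s", "max_tokens": %d, "temperature": 0.5, "n": 1, "stream": false}\n' \
    "$prompt" "$max_tokens"
}

# Send the request and print the raw response body.
# --fail makes curl exit non-zero on an HTTP error status.
nim_complete() {
  curl --silent --show-error --fail -X POST "$NIM_URL" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$@")"
}
```

After sourcing the script, `nim_complete "Tell me a joke" 64` sends a prompt with a 64-token limit and prints the server's JSON response as-is. Running `build_payload "Tell me a joke"` on its own prints the request body without contacting the server, which is useful for checking the payload first.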
