> Integrate W&B with Metaflow to track experiments and manage ML workflows with automatic metric and artifact logging.

# Metaflow

## Overview

[Metaflow](https://docs.metaflow.org) is a framework created by Netflix for building and running ML workflows.

This integration lets users apply decorators to Metaflow [steps and flows](https://docs.metaflow.org/metaflow/basics) to automatically log parameters and artifacts to W\&B.

* Decorating a step will turn logging off or on for certain types within that step.
* Decorating the flow will turn logging off or on for every step in the flow.

## Quickstart

### Sign up and create an API key

An API key authenticates your machine to W\&B. You can generate an API key from your user profile.

<Note>
  For a more streamlined approach, go to [User Settings](https://wandb.ai/settings) and create an API key. Copy the API key immediately and save it in a secure location such as a password manager.
</Note>

1. Click your user profile icon in the upper right corner.
2. Select **User Settings**, then scroll to the **API Keys** section.

### Install the `wandb` library and log in

To install the `wandb` library locally and log in:

<Note>
  For `wandb` version 0.19.8 or below, install a `fastcore` version below 1.8.0 (`fastcore<1.8.0`) instead of `plum-dispatch`.
</Note>

<Tabs>
  <Tab title="Command Line">
    1. Set the `WANDB_API_KEY` [environment variable](/models/track/environment-variables/) to your API key.

       ```bash theme={null}
       export WANDB_API_KEY=<your_api_key>
       ```

    2. Install the `wandb` library and log in.

       ```shell theme={null}
       pip install -Uqqq metaflow "plum-dispatch<3.0.0" wandb

       wandb login
       ```
  </Tab>

  <Tab title="Python">
    ```bash theme={null}
    pip install -Uqqq metaflow "plum-dispatch<3.0.0" wandb
    ```

    ```python theme={null}
    import wandb
    wandb.login()
    ```
  </Tab>

  <Tab title="Python notebook">
    ```notebook theme={null}
    !pip install -Uqqq metaflow "plum-dispatch<3.0.0" wandb

    import wandb
    wandb.login()
    ```
  </Tab>
</Tabs>

### Decorate your flows and steps

<Tabs>
  <Tab title="Step">
    Decorating a step turns logging off or on for certain types within that step.

    In this example, all datasets and models in `start` are logged.

    ```python theme={null}
    import pandas as pd
    import torch
    import wandb
    from metaflow import FlowSpec, step
    from wandb.integration.metaflow import wandb_log

    class WandbExampleFlow(FlowSpec):
        @wandb_log(datasets=True, models=True, settings=wandb.Settings(...))
        @step
        def start(self):
            self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
            self.model_file = torch.load(...)  # nn.Module    -> upload as model
            self.next(self.transform)
    ```
  </Tab>

  <Tab title="Flow">
    Decorating a flow is equivalent to decorating every step in it with the same default.

    In this case, every step in `WandbExampleFlow` logs datasets and models by default, just like decorating each step with `@wandb_log(datasets=True, models=True)`.

    ```python theme={null}
    import pandas as pd
    import torch
    from metaflow import FlowSpec, step
    from wandb.integration.metaflow import wandb_log

    @wandb_log(datasets=True, models=True)  # decorate all @step
    class WandbExampleFlow(FlowSpec):
        @step
        def start(self):
            self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
            self.model_file = torch.load(...)  # nn.Module    -> upload as model
            self.next(self.transform)
    ```
  </Tab>

  <Tab title="Flow and Steps">
    Decorating the flow is equivalent to decorating all steps with a default. That means if you also decorate a step with its own `@wandb_log`, the step-level decoration overrides the flow-level one.

    In this example:

    * `start` and `mid` log both datasets and models.
    * `end` logs neither datasets nor models.

    ```python theme={null}
    import pandas as pd
    import torch
    from metaflow import FlowSpec, step
    from wandb.integration.metaflow import wandb_log

    @wandb_log(datasets=True, models=True)  # same as decorating start and mid
    class WandbExampleFlow(FlowSpec):
      # this step will log datasets and models
      @step
      def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.mid)

      # this step will also log datasets and models
      @step
      def mid(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.end)

      # this step is overridden and will NOT log datasets OR models
      @wandb_log(datasets=False, models=False)
      @step
      def end(self):
        self.raw_df = pd.read_csv(...)
        self.model_file = torch.load(...)
    ```
  </Tab>
</Tabs>
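The override rule in the **Flow and Steps** tab can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the integration's code: `LogOptions` and `effective_options` are hypothetical names, and the sketch assumes that a step-level decorator replaces the flow-level options wholesale rather than merging them kwarg by kwarg.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogOptions:
    """Hypothetical stand-in for the flags passed to @wandb_log."""
    datasets: bool = False
    models: bool = False
    others: bool = False


def effective_options(
    flow_level: Optional[LogOptions], step_level: Optional[LogOptions]
) -> LogOptions:
    """Return the options that apply to one step.

    Assumption: a step-level decoration wins outright; otherwise the
    flow-level decoration applies. No decoration at all is modeled here
    as every flag switched off.
    """
    if step_level is not None:
        return step_level
    if flow_level is not None:
        return flow_level
    return LogOptions()


# Mirror the example: the flow is decorated, and only `end` is re-decorated.
flow_level = LogOptions(datasets=True, models=True)
steps = {
    "start": None,
    "mid": None,
    "end": LogOptions(datasets=False, models=False),
}

resolved = {name: effective_options(flow_level, opts) for name, opts in steps.items()}
for name, opts in resolved.items():
    print(name, opts)
```

Under these assumptions, `start` and `mid` resolve to the flow-level options, and `end` resolves to its own.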

## Access your data programmatically

You can access the captured information in three ways: inside the original Python process being logged using the [`wandb` client library](/models/ref/python/), in the [web app UI](/models/track/workspaces/), or programmatically using the [Public API](/models/ref/python/public-api/).

* `Parameter`s are saved to W\&B's [`config`](/models/) and appear in the [Overview tab](/models/runs/#overview-tab).
* `datasets`, `models`, and `others` are saved to [W\&B Artifacts](/models/artifacts/) and appear in the [Artifacts tab](/models/runs/#artifacts-tab).
* Base Python types are saved to W\&B's [`summary`](/models/) dict and appear in the Overview tab.

See the [guide to the Public API](/models/track/public-api-guide/) for details on retrieving this information programmatically.

### Quick reference

| Data                                            | Client library                                | UI                    |
| ----------------------------------------------- | --------------------------------------------- | --------------------- |
| `Parameter(...)`                                | `wandb.Run.config`                            | Overview tab, Config  |
| `datasets`, `models`, `others`                  | `wandb.Run.use_artifact("{var_name}:latest")` | Artifacts tab         |
| Base Python types (`dict`, `list`, `str`, etc.) | `wandb.Run.summary`                           | Overview tab, Summary |
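
As a rough sketch of the programmatic route, the helper below gathers the three kinds of data from a run object returned by the Public API. `describe_run` and `fetch_run` are hypothetical helper names, and the `fake_run` object with its values is a made-up stand-in so the snippet runs offline. With a real run path, `fetch_run` would call `wandb.Api().run(...)` instead.

```python
from types import SimpleNamespace


def describe_run(run):
    """Collect config, summary values, and logged artifact names from a run."""
    return {
        # Metaflow Parameters end up in the run config.
        "config": dict(run.config),
        # Base Python types end up in the run summary.
        "summary": {key: run.summary[key] for key in run.summary.keys()},
        # datasets / models / others end up as logged artifacts.
        "artifacts": [artifact.name for artifact in run.logged_artifacts()],
    }


def fetch_run(run_path):
    """Fetch a real run, for example "entity/project/run_id" (needs network access)."""
    import wandb  # imported lazily so the offline demo below works without it

    return describe_run(wandb.Api().run(run_path))


# Offline stand-in with made-up values, shaped like a Public API run object.
fake_run = SimpleNamespace(
    config={"learning_rate": 0.01},
    summary={"accuracy": 0.93},
    logged_artifacts=lambda: [SimpleNamespace(name="raw_df:v0")],
)

report = describe_run(fake_run)
print(report)
```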

### `wandb_log` kwargs

| kwarg      | Options                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `datasets` | <ul><li><code>True</code>: Log instance variables that are a dataset</li><li><code>False</code></li></ul>                                                                                                                                                                                                                                                                                                                                                                                     |
| `models`   | <ul><li><code>True</code>: Log instance variables that are a model</li><li><code>False</code></li></ul>                                                                                                                                                                                                                                                                                                                                                                                       |
| `others`   | <ul><li><code>True</code>: Log anything else that is serializable as a pickle</li><li><code>False</code></li></ul>                                                                                                                                                                                                                                                                                                                                                                            |
| `settings` | <ul><li><code>wandb.Settings(...)</code>: Specify your own <code>wandb</code> settings for this step or flow</li><li><code>None</code>: Equivalent to passing <code>wandb.Settings()</code></li></ul><p>By default, if:</p><ul><li><code>settings.run\_group</code> is <code>None</code>, it will be set to <code>\{flow\_name}/\{run\_id}</code></li><li><code>settings.run\_job\_type</code> is <code>None</code>, it will be set to <code>\{run\_job\_type}/\{step\_name}</code></li></ul> |

## Frequently asked questions

### What exactly do you log? Do you log all instance and local variables?

`wandb_log` only logs instance variables. Local variables are NEVER logged. This is useful to avoid logging unnecessary data.

### Which data types get logged?

We currently support these types:

| Logging Setting     | Type                                                                                                                        |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| default (always on) | <ul><li><code>dict, list, set, str, int, float, bool</code></li></ul>                                                       |
| `datasets`          | <ul><li><code>pd.DataFrame</code></li><li><code>pathlib.Path</code></li></ul>                                               |
| `models`            | <ul><li><code>nn.Module</code></li><li><code>sklearn.base.BaseEstimator</code></li></ul>                                    |
| `others`            | <ul><li>Anything that is <a href="https://wiki.python.org/moin/UsingPickle">pickle-able</a> and JSON serializable</li></ul> |

### How can I configure logging behavior?

| Kind of Variable | Behavior                       | Example         | Data Type      |
| ---------------- | ------------------------------ | --------------- | -------------- |
| Instance         | Auto-logged                    | `self.accuracy` | `float`        |
| Instance         | Logged if `datasets=True`      | `self.df`       | `pd.DataFrame` |
| Instance         | Not logged if `datasets=False` | `self.df`       | `pd.DataFrame` |
| Local            | Never logged                   | `accuracy`      | `float`        |
| Local            | Never logged                   | `df`            | `pd.DataFrame` |
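
The instance-versus-local distinction in the table is ordinary Python scoping. The sketch below uses no W\&B or Metaflow code at all; `ExampleStep` is a hypothetical class that only shows which values remain on the object after a method returns, which are the only values a decorator could inspect.

```python
class ExampleStep:
    """Hypothetical stand-in for a flow class with one step method."""

    def run(self):
        accuracy = 0.5                   # local: gone once run() returns
        scratch = [1, 2, 3]              # local: also gone
        self.accuracy = accuracy + 0.25  # instance: stays on the object


example = ExampleStep()
example.run()

# Only instance attributes are still reachable after the method finishes.
visible = vars(example)
print(visible)  # {'accuracy': 0.75}
```

To have a value logged, assign it to `self`. To keep it out of W\&B, keep it in a local variable.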

### Is artifact lineage tracked?

Yes. If you have an artifact that is an output of step A and an input to step B, we automatically construct the lineage DAG for you.

For an example of this behavior, see this [notebook](https://colab.research.google.com/drive/1wZG-jYzPelk8Rs2gIM3a71uEoG46u_nG#scrollTo=DQQVaKS0TmDU) and its corresponding [W\&B Artifacts page](https://wandb.ai/megatruong/metaflow_integration/artifacts/dataset/raw_df/7d14e6578d3f1cfc72fe/graph).
