

[Databricks][Documentation] Generative AI & LLMs


 

 

0. Generative AI & LLMs

https://docs.databricks.com/en/generative-ai/generative-ai.html

This article provides an overview of generative AI on Databricks and includes links to example notebooks and demos.

What is generative AI?

Generative AI is a type of artificial intelligence focused on the ability of computers to use models to create content like images, text, code, and synthetic data.

Generative AI applications are built on top of large language models (LLMs) and foundation models.

  • LLMs are deep learning models that consume and train on massive datasets to excel in language processing tasks. They create new combinations of text that mimic natural language based on their training data.
  • Foundation models are large ML models pre-trained with the intention of being fine-tuned for more specific language understanding and generation tasks. These models are utilized to discern patterns within the input data.

After these models have completed their learning processes, together they generate statistically probable outputs when prompted, and they can be employed to accomplish various tasks, including:

  • Image generation, such as creating new images based on existing ones or applying the style of one image to modify or create another.
  • Speech tasks such as transcription, translation, question/answer generation, and interpretation of the intent or meaning of text.

Important

While many LLMs or other generative AI models have safeguards, they can still generate harmful or inaccurate information.

Generative AI has the following design patterns:

  • Prompt Engineering: Crafting specialized prompts to guide LLM behavior
  • Retrieval Augmented Generation (RAG): Combining an LLM with external knowledge retrieval
  • Fine-tuning: Adapting a pre-trained LLM to specific data sets or domains
  • Pre-training: Training an LLM from scratch

Develop generative AI and LLMs on Databricks

Databricks unifies the AI lifecycle from data collection and preparation, to model development and LLMOps, to serving and monitoring. The following features are specifically optimized to facilitate the development of generative AI applications:

Additional resources


 

 

 

1. AI Playground

https://docs.databricks.com/en/large-language-models/ai-playground.html

Chat with supported LLMs using AI Playground

March 19, 2024

You can interact with supported large language models using the AI Playground. The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs. This functionality is available in your Databricks workspace.

Requirements

Use AI Playground

To use the AI Playground:

  1. Select Playground from the left navigation pane under Machine Learning.
  2. Select the model you want to interact with using the dropdown list on the top left.
  3. You can do either of the following:
    1. Type in your question or prompt.
    2. Select a sample AI instruction from those listed in the window.
  4. You can select the + to add an endpoint. Doing so allows you to compare multiple model responses side-by-side.

 

 

 

2. Get Started querying LLMs on Databricks

https://docs.databricks.com/en/large-language-models/llm-serving-intro.html

 

This article describes how to get started using Foundation Model APIs to serve and query LLMs on Databricks.

The easiest way to get started with serving and querying LLMs on Databricks is using Foundation Model APIs on a pay-per-token basis. The APIs provide access to popular foundation models from pay-per-token endpoints that are automatically available in the Serving UI of your Databricks workspace. See Supported models for pay-per-token.

 

You can also test out and chat with pay-per-token models using the AI Playground. See Chat with supported LLMs using AI Playground.

 

For production workloads, particularly if you have a fine-tuned model or a workload that requires performance guarantees, Databricks recommends you upgrade to using Foundation Model APIs on a provisioned throughput endpoint.

Requirements

  • Databricks workspace in a supported region for Foundation Model APIs pay-per-token.
  • Databricks personal access token to query and access Databricks model serving endpoints using the OpenAI client.

Important

As a security best practice for production scenarios, Databricks recommends that you use machine-to-machine OAuth tokens for authentication.

For testing and development, Databricks recommends using a personal access token belonging to a service principal instead of a workspace user. To create tokens for service principals, see Manage tokens for a service principal.

Get started using Foundation Model APIs

The following example queries the databricks-dbrx-instruct model that’s served on the pay-per-token endpoint, databricks-dbrx-instruct. Learn more about the DBRX Instruct model.

In this example, you use the OpenAI client to query the model by populating the model field with the name of the model serving endpoint that hosts the model you want to query. Use your personal access token to populate the DATABRICKS_TOKEN and your Databricks workspace instance to connect the OpenAI client to Databricks.

from openai import OpenAI
import os

DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")

client = OpenAI(
  api_key=DATABRICKS_TOKEN, # your personal access token
  base_url='https://<workspace_id>.databricks.com/serving-endpoints', # your Databricks workspace instance
)

chat_completion = client.chat.completions.create(
  messages=[
    {
      "role": "system",
      "content": "You are an AI assistant",
    },
    {
      "role": "user",
      "content": "What is a mixture of experts model?",
    }
  ],
  model="databricks-dbrx-instruct",
  max_tokens=256
)

print(chat_completion.choices[0].message.content)

Expected output:

{
  "id": "xxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": "xxxxxxxxx",
  "model": "databricks-dbrx-instruct",
  "choices": [
    {
      "index": 0,
      "message":
        {
          "role": "assistant",
          "content": "A Mixture of Experts (MoE) model is a machine learning technique that combines the predictions of multiple expert models to improve overall performance. Each expert model specializes in a specific subset of the data, and the MoE model uses a gating network to determine which expert to use for a given input."
        },
      "finish_reason": "stop"
    }
  ],
  "usage":
    {
      "prompt_tokens": 123,
      "completion_tokens": 23,
      "total_tokens": 146
    }
}

Next steps

 

 

 

 

3. Using LLMs

https://docs.databricks.com/en/large-language-models/index.html

 

Large language models (LLMs) on Databricks

February 21, 2024

Databricks makes it simple to access and build off of publicly available large language models.

Databricks Runtime for Machine Learning includes libraries like Hugging Face Transformers and LangChain that allow you to integrate existing pre-trained models or other open-source libraries into your workflow. From here, you can leverage Databricks platform capabilities to fine-tune LLMs using your own data for better domain performance.

In addition, Databricks offers built-in functionality for SQL users to access and experiment with LLMs like Azure OpenAI and OpenAI using AI functions.

Hugging Face Transformers

With Hugging Face Transformers on Databricks you can scale out your natural language processing (NLP) batch applications and fine-tune models for large-language model applications.

The Hugging Face transformers library comes preinstalled on Databricks Runtime 10.4 LTS ML and above. Many of the popular NLP models work best on GPU hardware, so you might get the best performance using recent GPU hardware unless you use a model specifically optimized for use on CPUs.
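
As a quick orientation, the following is a minimal sketch of running a pre-trained Transformers pipeline in a Databricks notebook; the model name and device index are illustrative assumptions rather than Databricks defaults.

from transformers import pipeline

# A small sentiment-analysis pipeline; the model name is an example checkpoint.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,  # first GPU; use device=-1 to run on CPU
)

print(classifier(["Databricks makes LLM development simple."]))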

LangChain

LangChain is available as an experimental MLflow flavor, which allows LangChain users to leverage the robust tools and experiment tracking capabilities of MLflow directly from the Databricks environment.

LangChain is a software framework designed to help create applications that utilize large language models (LLMs) and combine them with external data to bring more training context for your LLMs.

Databricks Runtime ML includes langchain in Databricks Runtime 13.1 ML and above.

Learn about Databricks specific LangChain integrations.
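
As a rough sketch of what this integration can look like, the following builds a trivial LangChain chain around a Databricks model serving endpoint and logs it with the experimental MLflow LangChain flavor. The endpoint name, prompt, and artifact path are assumptions for illustration, and the exact LangChain import paths depend on the installed version.

import mlflow
from langchain.llms import Databricks
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Wrap a Databricks model serving endpoint as a LangChain LLM (example endpoint name).
llm = Databricks(endpoint_name="databricks-dbrx-instruct")

prompt = PromptTemplate.from_template("Summarize the following text in one sentence: {text}")
chain = LLMChain(llm=llm, prompt=prompt)

# Log the chain with the experimental MLflow LangChain flavor.
with mlflow.start_run():
    model_info = mlflow.langchain.log_model(chain, artifact_path="summarizer_chain")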

AI functions

Preview

This feature is in Public Preview.

AI functions are built-in SQL functions that allow SQL users to:

  • Use Databricks Foundation Model APIs to complete various tasks on your company’s data.
  • Access external models like GPT-4 from OpenAI and experiment with them.
  • Query models hosted by Databricks model serving endpoints from SQL queries (see the sketch following this list).
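
For example, a SQL query that uses the built-in ai_query function can be issued from a notebook through Spark SQL. This is a hedged sketch; the endpoint name is an example and must exist in your workspace.

# spark and display are available in Databricks notebooks.
result = spark.sql(
    """
    SELECT ai_query(
      'databricks-dbrx-instruct',
      'Summarize the benefits of Delta Lake in one sentence.'
    ) AS summary
    """
)
display(result)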

 

 

 

 

 

3.1. What are Hugging Face Transformers?

https://docs.databricks.com/en/machine-learning/train-model/huggingface/index.html

 

 

 

 

 

3.1.1. Prepare data for fine tuning Hugging Face models

https://docs.databricks.com/en/machine-learning/train-model/huggingface/load-data.html

 

Prepare data for fine tuning Hugging Face models

December 20, 2023

This article demonstrates how to prepare your data for fine-tuning open source large language models with Hugging Face Transformers and Hugging Face Datasets.

Requirements

Load data from Hugging Face

Hugging Face Datasets is a Hugging Face library for accessing and sharing datasets for audio, computer vision, and natural language processing (NLP) tasks. With Hugging Face datasets you can load data from various places. The datasets library has utilities for reading datasets from the Hugging Face Hub. There are many datasets downloadable and readable from the Hugging Face Hub by using the load_dataset function. Learn more about loading data with Hugging Face Datasets in the Hugging Face documentation.

from datasets import load_dataset
dataset = load_dataset("imdb")

Some datasets in the Hugging Face Hub provide the sizes of data that is downloaded and generated when load_dataset is called. You can use load_dataset_builder to know the sizes before downloading the dataset with load_dataset.

from datasets import load_dataset_builder
from psutil._common import bytes2human

def print_dataset_size_if_provided(*args, **kwargs):
  dataset_builder = load_dataset_builder(*args, **kwargs)

  if dataset_builder.info.download_size and dataset_builder.info.dataset_size:
    print(f'download_size={bytes2human(dataset_builder.info.download_size)}, dataset_size={bytes2human(dataset_builder.info.dataset_size)}')
  else:
    print('Dataset size is not provided by uploader')

print_dataset_size_if_provided("imdb")

See the Download datasets from Hugging Face best practices notebook for guidance on how to download and prepare datasets on Databricks for different sizes of data.

Format your training and evaluation data

To use your own data for model fine-tuning, you must first format your training and evaluation data into Spark DataFrames. Then, load the DataFrames using the Hugging Face datasets library.

Start by formatting your training data into a table meeting the expectations of the trainer. For text classification, this is a table with two columns: a text column and a column of labels.
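
For illustration only, a tiny DataFrame with the expected two columns might look like the following; in practice, df would come from your own tables or files.

# Illustrative text-classification data with string labels.
df = spark.createDataFrame(
    [
        ("The movie was fantastic!", "positive"),
        ("I would not recommend this product.", "negative"),
    ],
    schema="text string, label string",
)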

To perform fine-tuning, you need to provide a model. The Hugging Face Transformer AutoClasses library makes it easy to load models and configuration settings, including a wide range of Auto Models for natural language processing.

For example, Hugging Face transformers provides AutoModelForSequenceClassification as a model loader for text classification, which expects integer IDs as the category labels. However, if you have a DataFrame with string labels, you must also specify mappings between the integer labels and string labels when creating the model. You can collect this information as follows:

labels = df.select(df.label).groupBy(df.label).count().collect()
id2label = {index: row.label for (index, row) in enumerate(labels)}
label2id = {row.label: index for (index, row) in enumerate(labels)}

Then, create the integer IDs as a label column with a Pandas UDF:

from pyspark.sql.functions import pandas_udf
import pandas as pd
@pandas_udf('integer')
def replace_labels_with_ids(labels: pd.Series) -> pd.Series:
  return labels.apply(lambda x: label2id[x])

df_id_labels = df.select(replace_labels_with_ids(df.label).alias('label'), df.text)

Load a Hugging Face dataset from a Spark DataFrame

Hugging Face datasets supports loading from Spark DataFrames using datasets.Dataset.from_spark. See the Hugging Face documentation to learn more about the from_spark() method.

For example, if you have train_df and test_df DataFrames, you can create datasets for each with the following code:

import datasets
train_dataset = datasets.Dataset.from_spark(train_df, cache_dir="/dbfs/cache/train")
test_dataset = datasets.Dataset.from_spark(test_df, cache_dir="/dbfs/cache/test")

Dataset.from_spark caches the dataset. This example describes model training on the driver, so data must be made available to it. Additionally, since cache materialization is parallelized using Spark, the provided cache_dir must be accessible to all workers. To satisfy these constraints, cache_dir should be a Databricks File System (DBFS) root volume or mount point.

The DBFS root volume is accessible to all users of the workspace and should only be used for data without access restrictions. If your data requires access controls, use a mount point instead of DBFS root.

If your dataset is large, writing it to DBFS can take a long time. To speed up the process, you can use the working_dir parameter to have Hugging Face datasets write the dataset to a temporary location on disk, then move it to DBFS. For example, to use the SSD as a temporary location:

import datasets
dataset = datasets.Dataset.from_spark(
  train_df,
  cache_dir="/dbfs/cache/train",
  working_dir="/local_disk0/tmp/train",
)

Caching for datasets

The cache is one of the ways datasets improves efficiency. It stores all downloaded and processed datasets so when the user needs to use the intermediate datasets, they are reloaded directly from the cache.

The default cache directory of datasets is ~/.cache/huggingface/datasets. When a cluster is terminated, the cache data is lost too. To persist the cache file on cluster termination, Databricks recommends changing the cache location to DBFS by setting the environment variable HF_DATASETS_CACHE:

import os
os.environ["HF_DATASETS_CACHE"] = "/dbfs/place/you/want/to/save"

Fine-tune a model

When your data is ready, you can use it to fine-tune a Hugging Face model.

Notebook: Download datasets from Hugging Face

This example notebook provides recommended best practices of using the Hugging Face load_dataset function to download and prepare datasets on Databricks for different sizes of data.

Download datasets from Hugging Face best practices notebook

 

 

3.1.2. Fine-tune Hugging Face models for a single GPU

https://docs.databricks.com/en/machine-learning/train-model/huggingface/fine-tune-model.html

 

Fine-tune Hugging Face models for a single GPU

December 05, 2023

This article describes how to fine-tune a Hugging Face model with the Hugging Face transformers library on a single GPU. It also includes Databricks-specific recommendations for loading data from the lakehouse and logging models to MLflow, which enables you to use and govern your models on Databricks.

The Hugging Face transformers library provides the Trainer utility and Auto Model classes that enable loading and fine-tuning Transformers models.

These tools are available for the following tasks with simple modifications:

  • Loading models to fine-tune.
  • Constructing the configuration for the Hugging Face Transformers Trainer utility.
  • Performing training on a single GPU.

See What are Hugging Face Transformers?

Requirements

Tokenize a Hugging Face dataset

Hugging Face Transformers models expect tokenized input, rather than the text in the downloaded data. To ensure compatibility with the base model, use an AutoTokenizer loaded from the base model. Hugging Face datasets allows you to directly apply the tokenizer consistently to both the training and testing data.

For example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=False, truncation=True)

train_test_tokenized = train_test_dataset.map(tokenize_function, batched=True)

Set up the training configuration

Hugging Face training configuration tools can be used to configure a Trainer. The Trainer classes require the user to provide:

  • Metrics
  • A base model
  • A training configuration

You can configure evaluation metrics in addition to the default loss metric that the Trainer computes. The following example demonstrates adding accuracy as a metric:

import numpy as np
import evaluate
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Use the Auto Model classes for NLP to load the appropriate model for your task.

For text classification, use AutoModelForSequenceClassification to load a base model for text classification. When creating the model, provide the number of classes and the label mappings created during dataset preparation.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label
        )

Next, create the training configuration. The TrainingArguments class allows you to specify the output directory, evaluation strategy, learning rate, and other parameters.

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir=training_output_dir, evaluation_strategy="epoch")

Using a data collator batches input in training and evaluation datasets. DataCollatorWithPadding gives good baseline performance for text classification.

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer)

With all of these parameters constructed, you can now create a Trainer.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_test_dataset["train"],
    eval_dataset=train_test_dataset["test"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

Train and log to MLflow

Hugging Face interfaces well with MLflow and automatically logs metrics during model training using the MLflowCallback. However, you must log the trained model yourself.

Wrap training in an MLflow run. This constructs a Transformers pipeline from the tokenizer and the trained model, and writes it to local disk. Finally, log the model to MLflow with mlflow.transformers.log_model.

import mlflow
from transformers import pipeline

with mlflow.start_run() as run:
  trainer.train()
  trainer.save_model(model_output_dir)
  pipe = pipeline("text-classification", model=AutoModelForSequenceClassification.from_pretrained(model_output_dir), batch_size=1, tokenizer=tokenizer)
  model_info = mlflow.transformers.log_model(
        transformers_model=pipe,
        artifact_path="classification",
        input_example="Hi there!",
    )

If you don’t need to create a pipeline, you can submit the components that are used in training into a dictionary:

model_info = mlflow.transformers.log_model(
  transformers_model={"model": trainer.model, "tokenizer": tokenizer},
  task="text-classification",
  artifact_path="text_classifier",
  input_example=["MLflow is great!", "MLflow on Databricks is awesome!"],
)

Load the model for inference

When your model is logged and ready, loading the model for inference is the same as loading the MLflow wrapped pre-trained model.

logged_model = "runs:/{run_id}/{model_artifact_path}".format(run_id=run.info.run_id, model_artifact_path=model_artifact_path)

# Load model as a Spark UDF. Override result_type if the model does not return double values.
loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, result_type='string')

test = test.select(test.text, test.label, loaded_model_udf(test.text).alias("prediction"))
display(test)

See Model serving with Databricks for more information.

Troubleshoot common CUDA errors

This section describes common CUDA errors and guidance on how to resolve them.

OutOfMemoryError: CUDA out of memory

When training large models, a common error you may encounter is the CUDA out of memory error.

Example:

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 666.34 MiB already allocated; 17.75 MiB free; 720.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

Try the following recommendations to resolve this error; an illustrative TrainingArguments example follows this list:

  • Reduce the batch size for training. You can reduce the per_device_train_batch_size value in TrainingArguments.
  • Use lower precision training. You can set fp16=True in TrainingArguments.
  • Use gradient_accumulation_steps in TrainingArguments to effectively increase overall batch size.
  • Use 8-bit Adam optimizer.
  • Clean up the GPU memory before training. Sometimes, GPU memory may still be occupied by unused code from an earlier run. For example:
    from numba import cuda
    device = cuda.get_current_device()
    device.reset()
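
A minimal sketch combining several of these recommendations, reusing the training_output_dir variable from earlier in this article; the numeric values are arbitrary assumptions and should be tuned for your model and GPU.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=training_output_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,   # smaller batches reduce GPU memory usage
    gradient_accumulation_steps=4,   # effective batch size of 32 per device
    fp16=True,                       # mixed-precision training
    optim="adamw_bnb_8bit",          # 8-bit Adam; requires the bitsandbytes package
)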
    

CUDA kernel errors

When running the training, you may get CUDA kernel errors.

Example:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging, consider passing CUDA_LAUNCH_BLOCKING=1.

To troubleshoot:

  • Try running the code on CPU to see if the error is reproducible.
  • Another option is to get a better traceback by setting CUDA_LAUNCH_BLOCKING=1:
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    

Notebook: Fine-tune text classification on a single GPU

To get started quickly with example code, this example notebook provides an end-to-end example for fine-tuning a model for text classification. The preceding sections of this article go into more detail about using Hugging Face for fine-tuning on Databricks.

 

Fine-tuning Hugging Face text classification models notebook

 

 

 

3.1.3. Model inference using Hugging Face Transformers for natural language processing (NLP)

https://docs.databricks.com/en/machine-learning/train-model/huggingface/model-inference-nlp.html

 

3.2. AI Functions on Databricks

https://docs.databricks.com/en/large-language-models/ai-functions.html

 

3.2.1. Query a served model with ai_query()

https://docs.databricks.com/en/large-language-models/how-to-ai-query.html

 

3.2.2. Query an external model with ai_query()

https://docs.databricks.com/en/large-language-models/ai-query-external-model.html

 

3.2.3. Analyze customer reviews using AI Functions

https://docs.databricks.com/en/large-language-models/ai-functions-example.html

 

3.3. LangChain on Databricks for LLM Development

https://docs.databricks.com/en/large-language-models/langchain.html

 

4. Evaluate LLMs with MLflow

https://docs.databricks.com/en/mlflow/llm-evaluate.html

 

5. Vector Search

https://docs.databricks.com/en/generative-ai/vector-search.html

 

5.1. How to Create and Query a Vector Search Index

https://docs.databricks.com/en/generative-ai/create-query-vector-search.html

This article describes how to create and query a vector search index using Databricks Vector Search.

You can create and manage Vector Search components, like a vector search endpoint and vector search indices, using the UI, the Python SDK, or the REST API.

 

Requirements

  • Unity Catalog enabled workspace.
  • Serverless compute enabled.
  • Source table must have Change Data Feed enabled.
  • To create an index, you must have CREATE TABLE privileges on the catalog schema(s) where the index is created. To query an index that is owned by another user, you must have additional privileges. See Query a Vector Search endpoint.
  • If you want to use personal access tokens (not recommended for production workloads), check that personal access tokens are enabled. To use a service principal token instead, pass it explicitly using the SDK or API calls.

To use the SDK, you must install it in your notebook. Use the following code:

%pip install databricks-vectorsearch

dbutils.library.restartPython()

from databricks.vector_search.client import VectorSearchClient
 

 

Create a vector search endpoint 

You can create a vector search endpoint using the Databricks UI, the Python SDK, or the REST API.

 

Create a vector search endpoint using the UI

Follow these steps to create a vector search endpoint using the UI:

1. In the left sidebar, click Compute.

2. Click the Vector Search tab and click Create.

3. The Create endpoint form opens. Enter a name for this endpoint.

4. Click Confirm.

 

Create a vector search endpoint using the Python SDK 

The following example uses the create_endpoint() SDK function to create a Vector Search endpoint.

# The following line automatically generates a PAT Token for authentication
client = VectorSearchClient()

# The following line uses a service principal token for authentication
# client = VectorSearchClient(service_principal_client_id=<CLIENT_ID>, service_principal_client_secret=<CLIENT_SECRET>)

client.create_endpoint(
    name="vector_search_endpoint_name",
    endpoint_type="STANDARD"
)

 

 

Create a vector search endpoint using the REST API 

See POST /api/2.0/vector-search/endpoints.

 

(Optional) Create and configure an endpoint to serve the embedding model

If you choose to have Databricks compute the embeddings, you must set up a model serving endpoint to serve the embedding model. See Create foundation model serving endpoints for instructions. For example notebooks, see Notebook examples for calling an embeddings model.

 

When you configure an embedding endpoint, Databricks recommends that you remove the default selection of Scale to zero. Serving endpoints can take a couple of minutes to warm up, and the initial query on an index with a scaled-down endpoint might time out.

 

Note

The vector search index initialization might time out if the embedding endpoint isn’t configured appropriately for the dataset. You should only use CPU endpoints for small datasets and tests. For larger datasets, use a GPU endpoint for optimal performance.
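
To confirm that the embedding endpoint is serving before you create an index, you can issue a warm-up query. The following is a minimal sketch that reuses the OpenAI client pattern shown earlier; the endpoint name databricks-bge-large-en and the workspace URL placeholder are assumptions.

from openai import OpenAI
import os

openai_client = OpenAI(
    api_key=os.environ.get("DATABRICKS_TOKEN"),
    base_url="https://<workspace_id>.databricks.com/serving-endpoints",
)

response = openai_client.embeddings.create(
    model="databricks-bge-large-en",  # example embedding endpoint name
    input=["A short warm-up query for the embedding endpoint."],
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector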

 

Create a vector search index

You can create a vector search index using the UI, the Python SDK, or the REST API. The UI is the simplest approach. 

There are two types of indexes:

  • Delta Sync Index automatically syncs with a source Delta table, incrementally updating the index as the underlying data in the Delta table changes (see the sketch below).
  • Direct Vector Access Index supports direct read and write of vectors and metadata. The user is responsible for updating this table using the REST API or the Python SDK. This type of index cannot be created using the UI; you must use the REST API or the SDK.
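
The following is a hedged sketch of creating a Delta Sync Index with the Python SDK, using the client created earlier; the endpoint, table, index, and embedding endpoint names are placeholders for illustration.

index = client.create_delta_sync_index(
    endpoint_name="vector_search_endpoint_name",
    index_name="main.default.docs_index",
    source_table_name="main.default.docs",  # source Delta table with Change Data Feed enabled
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",  # Databricks computes embeddings from this column
    embedding_model_endpoint_name="databricks-bge-large-en",
)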

 

 

 

 

 

 

6. Retrieval Augmented Generation (RAG) on Databricks

https://docs.databricks.com/en/generative-ai/retrieval-augmented-generation.html

 

This article provides an overview of retrieval augmented generation (RAG) and describes RAG application support in Databricks.

 

What is Retrieval Augmented Generation? 

RAG is a generative AI design pattern that involves combining a large language model (LLM) with external knowledge retrieval. See https://docs.databricks.com/en/generative-ai/generative-ai.html.

 

RAG is required to connect real-time data to your generative AI applications. 

 

Doing so improves the accuracy and quality of the application by providing your data as context to the LLM at inference time.

 

The Databricks platform provides an integrated set of tools that supports the following RAG scenarios. 

 

  • Unstructured data: Use of documents such as PDFs, wikis, website contents, and Google or Microsoft Office documents. Example use case: a chatbot over product documentation.
  • Structured data: Use of tabular data such as Delta tables or data from existing application APIs. Example use case: a chatbot to check order status.
  • Tools & function calling: Call third-party or internal APIs to perform specific tasks or update statuses, for example performing calculations or triggering a business workflow. Example use case: a chatbot to place an order.
  • Agents: Dynamically decide how to respond to a user's query by using an LLM to choose a sequence of actions. Example use case: a chatbot that replaces a customer service agent.

 

RAG application architecture 

The following illustrates the components that make up a RAG application

 

RAG applications require a pipeline and a chain component to perform the following: 

  • Indexing: A pipeline that ingests data from a source and indexes it. This data can be structured or unstructured.
  • Retrieval and generation: This is the actual RAG chain. It takes the user query, retrieves similar data from the index, and then passes the data, along with the query, to the LLM.

 

The below diagram demonstrates these core components: 

 

Unstructured data RAG example. 

The following sections describe the details of the indexing pipeline and RAG chain in the context of an unstructured data RAG example. 

 

Indexing pipeline in a RAG app

The following steps describe the indexing pipeline: 

1. Ingest data from your proprietary data source.

2. Split the data into chunks that can fit into the context window of the foundational LLM (a rough chunking sketch follows these steps). This step also includes parsing the data and extracting metadata. This data is commonly referred to as a knowledge base.

3. Use an embedding model to create vector embeddings for the data chunks. 

4. Store the embeddings and metadata in a vector database to make them accessible for querying by the RAG chain. 
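
As a rough illustration of step 2, the sketch below splits a parsed document into fixed-size, overlapping character chunks; the sizes are arbitrary assumptions, and a production pipeline would typically use a tokenizer-aware splitter.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "..."  # text parsed from a PDF, wiki page, or other source
chunks = chunk_text(document)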

 

Retrieval using the RAG chain 

After the index is prepared, the RAG chain of the application can be served to respond to questions.

The following steps and diagram describe how the RAG application responds to an incoming request.

 

1. Embed the request using the same embedding model that was used to embed the data in the knowledge base.

2. Query the vector database to do a similarity search between the embedded request and the embedded data chunks in the vector database.

3. Retrieve the data chunks that are most relevant to the request. 

4. Feed the relevant data chunks and the request to a customized LLM. The data chunks provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response. 

5. Generate a response. 

 

The following diagram illustrates this process: 

 

Develop RAG applications with Databricks

Databricks provides the following capabilities to help you develop RAG applications.

 

RAG Architecture with Databricks

The following architecture diagrams demonstrate where each Databricks feature fits in the RAG workflow.

For an example, see the Deploy Your LLM Chatbot With Retrieval Augmented Generation Demo.

 

Process unstructured data and Databricks-managed embeddings

For processing unstructured data and Databricks-managed embeddings, the following steps and diagram show:

1. Data ingestion from your proprietary data source. You can store this data in a Delta Table or Unity Catalog Volume.

2. The data is then split into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks, and Delta Live Tables to perform these tasks. This data is commonly referred to as a knowledge base.

3. The parsed and chunked data is then consumed by an embedding model to create vector embeddings. In this scenario, Databricks computes the embeddings for you as part of the Vector Search functionality, which uses Model Serving to provide an embedding model.

4. After Vector Search computes embeddings, Databricks stores them in a Delta Table.

5. Also as part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically computes embeddings for new data that is added to the source data table and updates the vector search index.

 

Process unstructured data and customer-managed embeddings 

For processing unstructured data and customer-managed embeddings, the following steps and diagram show: 

1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.

2. You can then split the data into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks, and Delta Live Tables to perform these tasks. This data is commonly referred to as a knowledge base.

3. Next, the parsed and chunked data can be consumed by an embedding model to create vector embeddings. In this scenario, you compute the embeddings yourself and can use Model Serving to serve an embedding model.

4. After you compute embeddings, you can store them in a Delta table that can be synced with Vector Search. A hedged sketch of steps 3 and 4 follows this list.

5. As part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically syncs new embeddings that are added to your Delta table and updates the vector search index.
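
The following is a hedged sketch of steps 3 and 4: embedding each chunk by calling a served embedding model and writing the results to a Delta table. The endpoint name, table name, and the chunks variable (assumed to be a list of (id, text) pairs from your parsing step) are placeholders for illustration.

from openai import OpenAI
import os

openai_client = OpenAI(
    api_key=os.environ.get("DATABRICKS_TOKEN"),
    base_url="https://<workspace_id>.databricks.com/serving-endpoints",
)

def embed(text: str) -> list[float]:
    # Calls a served embedding model; the endpoint name is an example.
    response = openai_client.embeddings.create(model="databricks-bge-large-en", input=[text])
    return response.data[0].embedding

# chunks is assumed to be a list of (id, text) pairs produced by your chunking step.
rows = [(chunk_id, text, embed(text)) for chunk_id, text in chunks]

embeddings_df = spark.createDataFrame(rows, schema="id long, text string, embedding array<double>")
embeddings_df.write.mode("append").saveAsTable("main.default.doc_embeddings")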

Process structured data

For processing structured data, the following steps and diagram show:

1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.

2. For feature engineering you can use Databricks notebooks, Databricks workflows, and Delta Live Tables.

3. Create a feature table. A feature table is a Delta table in Unity Catalog that has a primary key (a minimal example follows these steps).

4. Create an online table and host it on a feature serving endpoint. The endpoint automatically stays synced with the feature table.
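
A minimal sketch of step 3, creating a feature table with a primary key in a Unity Catalog schema; the catalog, schema, table, and column names are placeholders.

spark.sql(
    """
    CREATE TABLE IF NOT EXISTS main.default.order_features (
      order_id BIGINT NOT NULL,
      order_status STRING,
      expected_delivery_date DATE,
      CONSTRAINT order_features_pk PRIMARY KEY (order_id)
    )
    """
)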

 

For an example notebook illustrating the use of online tables and feature serving for RAG applications, see the Databricks online tables and feature serving endpoints for RAG example notebook. 

https://docs.databricks.com/en/machine-learning/feature-store/online-tables.html#notebook-examples

 

RAG chain

After the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG chain operates in response to an incoming question; a hedged end-to-end sketch follows these steps.

1. The incoming question gets embedded using the same embedding model that was used to embed the data in the knowledge base. Model Serving is used to serve the embedding model.

2. After the question is embedded, you can use Vector Search to do a similarity search between the embedded question and the embedded data chunks in the vector database.

3. After Vector Search retrieves the data chunks that are most relevant to the request, those data chunks, along with relevant features from Feature Serving and the embedded question, are passed to a customized LLM before a response is generated.

4. The data chunks and features provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response. Once again, Model Serving is used to serve the LLM. You can also use Unity Catalog and Lakehouse Monitoring to store logs and monitor the chain workflow, respectively.

5. Generate a response. 
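
The following is a hedged end-to-end sketch of this chain for a Delta Sync Index with Databricks-managed embeddings: retrieve relevant chunks with Vector Search and pass them, along with the question, to a served LLM. All endpoint, index, and column names are placeholders, and the structure of the similarity_search response is assumed to expose the requested columns in order.

from databricks.vector_search.client import VectorSearchClient
from openai import OpenAI
import os

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="vector_search_endpoint_name",
    index_name="main.default.docs_index",
)

question = "How do I check the status of my order?"

# Steps 1-3: Vector Search embeds the question and returns the most relevant chunks.
results = index.similarity_search(query_text=question, columns=["text"], num_results=3)
context = "\n\n".join(row[0] for row in results["result"]["data_array"])

# Step 4: pass the retrieved context and the question to a served LLM.
llm_client = OpenAI(
    api_key=os.environ.get("DATABRICKS_TOKEN"),
    base_url="https://<workspace_id>.databricks.com/serving-endpoints",
)
completion = llm_client.chat.completions.create(
    model="databricks-dbrx-instruct",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n\n{context}"},
        {"role": "user", "content": question},
    ],
    max_tokens=256,
)

# Step 5: the generated response.
print(completion.choices[0].message.content)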

 

Region availability 

The features that support RAG application development on Databricks are available in the same regions as model serving.

If you plan on using Foundation Model APIs as part of your RAG application development, you are limited to the supported regions for Foundation Model APIs.

 

 
