Stacey on Software

Diversion #2: LangChain & Pydantic

June 11, 2023

I’ve been doing a lot of my ML coding in Python. Honestly, for a long time I resisted it, but the tooling moves so fast and is so simple to assemble I just gave in so that I can concentrate more on my actual subject of interest.

In this post I’m going to walk through the quickest way to set up a small project for your own experiments, one where your code interacts programmatically not only with the LLM but also with parts of the conversation itself.

We’re going to make us a to-do app! (I had to, I couldn’t resist!)

So let’s start with Pydantic. Pydantic adds runtime validation driven by Python type hints, and also a decent means of describing your models with metadata. You might already be familiar with this approach from OpenAPI integrations that auto-generate YAML, JSON Schema, etc.

Let’s set up our project:

mkdir todo-gpt && cd todo-gpt
python3 -m venv .venv && source .venv/bin/activate
echo "langchain" >> requirements.txt
echo "openai" >> requirements.txt
echo "pydantic" >> requirements.txt
pip install --upgrade pip
pip install -r requirements.txt

Now, let’s create our minimal main.py entry-point:

from textwrap import dedent

from langchain import PromptTemplate


def pretty_prompt(text: str) -> str:
    # Remove the common leading indentation and surrounding blank lines
    return dedent(text).strip()


def main():
    prompt = pretty_prompt("""
        Tell me the steps I need to take in order to:
        {request}
    """)
    tpl = PromptTemplate(
        input_variables=["request"],
        template=prompt
    )
    print(
        tpl.format(
            request="write a book"
        )
    )


if __name__ == '__main__':
    main()

When you run this, it will emit the following output:

Tell me the steps I need to take in order to:
write a book

Notice the variable substitution that happens via the {request} portion of the string, and how the input_variables parameter of the PromptTemplate constructor declares request as a required variable when formatting the prompt.
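If you want to see the idea without LangChain in the way, here’s a minimal sketch using only the standard library. It is not how PromptTemplate is implemented; template_variables and format_prompt are hypothetical helper names I made up to show named placeholders being discovered and then treated as required.

```python
from string import Formatter


def template_variables(template: str) -> list[str]:
    """Return the named placeholders found in a format-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def format_prompt(template: str, **values: str) -> str:
    """Fill the template, refusing to run if a placeholder has no value."""
    missing = [name for name in template_variables(template) if name not in values]
    if missing:
        raise KeyError(f"missing prompt variables: {missing}")
    return template.format(**values)


template = "Tell me the steps I need to take in order to:\n{request}"

# The placeholder names play the role of input_variables in this sketch
print(template_variables(template))
print(format_prompt(template, request="write a book"))
```

Calling format_prompt(template) with no request raises a KeyError in this sketch, which is the "required variable" behaviour in miniature.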

OK let’s see what happens when we run this prompt through an LLM. Here’s the updated main.py:

import os
from textwrap import dedent

from langchain import PromptTemplate, LLMChain
from langchain.chat_models import ChatOpenAI

# LangChain's OpenAI integration reads OPENAI_API_KEY; copy it from my own variable name
os.environ['OPENAI_API_KEY'] = os.environ['OPENAI_ACCESS_TOKEN']


def pretty_prompt(text: str) -> str:
    # Remove the common leading indentation and surrounding blank lines
    return dedent(text).strip()


def main():
    prompt = pretty_prompt("""
        Tell me the steps I need to take in order to:
        {request}
    """)
    prompt_template = PromptTemplate(
        input_variables=["request"],
        template=prompt
    )
    llm = ChatOpenAI(model_name="gpt-4", temperature=0.7)
    chain = LLMChain(llm=llm, prompt=prompt_template, verbose=True)
    output = chain.run(request="write a book")
    print(output)


if __name__ == '__main__':
    main()

This will emit the following output - note that we’ve asked the LLMChain to be verbose.

> Entering new LLMChain chain...
Prompt after formatting:
Tell me the steps I need to take in order to:
write a book
> Finished chain.
1. Choose a topic or genre: Decide what you want to write about, whether it's fiction, non-fiction, or a specific genre such as romance, science fiction, or mystery.
2. Develop an idea or concept: Brainstorm ideas for your book, including the main theme, plot, and characters. Make sure your idea is unique and engaging to capture the interest of readers.
3. Create an outline: Plan the structure of your book by creating an outline. This will help you organize your thoughts, set the pace of the story, and determine the progression of events or chapters.
4. Set writing goals: Establish realistic goals for yourself, such as a daily or weekly word count. This will help you stay disciplined and focused on completing your book.
5. Write a first draft: Start writing the first draft of your book, focusing on getting your ideas down on paper rather than perfecting your prose. Don't worry about grammar or punctuation at this stage – you'll have time to edit and polish your work later.
6. Revise and edit: Once you've completed your first draft, go through it multiple times to revise and edit your work. This may involve reorganizing chapters, deleting or adding content, and refining your writing to improve clarity and flow.
7. Seek feedback: Share your manuscript with trusted friends, family, or beta readers to get feedback on your work. This can help you identify any inconsistencies, plot holes, or areas that need improvement.
8. Incorporate feedback: Use the feedback you receive to make necessary changes and improvements to your manuscript. This may involve multiple rounds of revisions and edits until you're satisfied with the final product.
9. Proofread: Carefully proofread your manuscript for any grammatical, spelling, or punctuation errors. You may want to consider hiring a professional editor or proofreader to ensure your book is polished and error-free.
10. Prepare for publication: Research your publishing options, such as traditional publishing or self-publishing. If you choose traditional publishing, you'll need to write a query letter and book proposal to pitch your manuscript to agents or publishers. If you opt for self-publishing, you'll need to format your manuscript, design a cover, and choose a platform to publish your book.
11. Promote your book: Develop a marketing plan to promote your book, which may include building an author website, engaging with readers on social media, writing blog posts or articles, and attending book fairs or other events to connect with readers and industry professionals.
12. Publish your book: Once you've completed the necessary steps, publish your book and celebrate your accomplishment! Remember that writing a book is a significant achievement, and be proud of the hard work and dedication you've put into it.

That’s nice and human-readable, but what if we wanted the to-do list as a computer-readable data-structure rather than a chunk of natural language?

That’s where another key class in LangChain comes into play: the PydanticOutputParser.

This one class gives you both a means to create the prompt text that convinces the LLM to output a data format rather than plain English, and a means to parse that output and hydrate the data into your program’s model.

Let’s set up a basic model for our to-do list in a new file, todo.py:

from pydantic import BaseModel


class ToDo(BaseModel):
    title: str
    description: str
    done: bool


class ToDoList(BaseModel):
    todos: list[ToDo]
    title: str

This is minimal for now; later we’ll explore some useful ways to better control the content that the LLM fills the model with.
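Before wiring the model into LangChain, it can help to check that it hydrates from JSON on its own. Here’s a small sketch, assuming only that pydantic is installed; the raw string is a hand-written stand-in for an LLM response, not real model output.

```python
import json

from pydantic import BaseModel


class ToDo(BaseModel):
    title: str
    description: str
    done: bool


class ToDoList(BaseModel):
    todos: list[ToDo]
    title: str


# Hand-written stand-in for what we hope the LLM will eventually return
raw = """
{
  "title": "Write a Book",
  "todos": [
    {"title": "Choose a topic", "description": "Select a subject or theme for your book.", "done": false}
  ]
}
"""

# Nested dicts are converted into ToDo instances during validation
todo_list = ToDoList(**json.loads(raw))

# From here on it's ordinary Python objects, not text
remaining = [todo.title for todo in todo_list.todos if not todo.done]
print(remaining)
```

That last line is the whole point of the exercise: once the data is in the model, the rest of your program can work with attributes and lists instead of scraping natural language.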

To use this model with our main code, we create a PydanticOutputParser for ToDoList, which is the root of our object graph. Then we use the partial_variables feature of PromptTemplate together with the parser’s get_format_instructions() method to inject the format instructions into the prompt, and we call the parser’s parse() method on the text that the chain returns.

import os
from textwrap import dedent

from langchain import PromptTemplate, LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser

from todo import ToDoList

# LangChain's OpenAI integration reads OPENAI_API_KEY; copy it from my own variable name
os.environ['OPENAI_API_KEY'] = os.environ['OPENAI_ACCESS_TOKEN']


def pretty_prompt(text: str) -> str:
    # Remove the common leading indentation and surrounding blank lines
    return dedent(text).strip()


def main():
    prompt = pretty_prompt("""
        When responding, always use the following format:
        {format_instructions}
        Tell me the steps I need to take in order to:
        {request}
    """)
    parser = PydanticOutputParser(pydantic_object=ToDoList)
    prompt_template = PromptTemplate(
        input_variables=["request"],
        template=prompt,
        # Filled in once, up front; only {request} is supplied per call
        partial_variables={"format_instructions": parser.get_format_instructions()}
    )
    llm = ChatOpenAI(model_name="gpt-4", temperature=0.7)
    chain = LLMChain(llm=llm, prompt=prompt_template, verbose=True)
    # As the output below shows, run() still returns the raw string even with
    # output_parser passed here, so it is parsed explicitly afterwards
    output = chain.run(request="write a book", output_parser=parser)
    print(f"\n\nRaw Output of LLM:\n\n{output}")
    todo_list = parser.parse(output)
    print(f"\n\nParsed Output from PydanticOutputParser:\n\n{todo_list}")


if __name__ == '__main__':
    main()

And this will produce the following output:

> Entering new LLMChain chain...
Prompt after formatting:
When responding, always use the following format:
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{"properties": {"todos": {"title": "Todos", "type": "array", "items": {"$ref": "#/definitions/ToDo"}}, "title": {"title": "Title", "type": "string"}}, "required": ["todos", "title"], "definitions": {"ToDo": {"title": "ToDo", "type": "object", "properties": {"title": {"title": "Title", "type": "string"}, "description": {"title": "Description", "type": "string"}, "done": {"title": "Done", "type": "boolean"}}, "required": ["title", "description", "done"]}}}
Tell me the steps I need to take in order to:
write a book
> Finished chain.
Raw Output of LLM:
{
  "title": "Write a Book",
  "todos": [
    {
      "title": "Choose a topic",
      "description": "Select a subject or theme for your book.",
      "done": false
    },
    {
      "title": "Research",
      "description": "Gather information and conduct research on your chosen topic.",
      "done": false
    },
    {
      "title": "Create an outline",
      "description": "Organize your ideas and create a structure for your book.",
      "done": false
    },
    {
      "title": "Develop characters and setting",
      "description": "If writing fiction, create detailed characters and a vivid setting.",
      "done": false
    },
    {
      "title": "Write a draft",
      "description": "Begin writing your book, following the outline and structure.",
      "done": false
    },
    {
      "title": "Revise and edit",
      "description": "Review your draft, make improvements and correct any errors.",
      "done": false
    },
    {
      "title": "Get feedback",
      "description": "Share your draft with beta readers or a writing group for feedback.",
      "done": false
    },
    {
      "title": "Finalize manuscript",
      "description": "Incorporate feedback and finalize your book manuscript.",
      "done": false
    },
    {
      "title": "Find a publisher or self-publish",
      "description": "Submit your manuscript to publishers or choose to self-publish your book.",
      "done": false
    },
    {
      "title": "Promote your book",
      "description": "Market and promote your book through social media, book signings, and other channels.",
      "done": false
    }
  ]
}
Parsed Output from PydanticOutputParser:
todos=[ToDo(title='Choose a topic', description='Select a subject or theme for your book.', done=False), ToDo(title='Research', description='Gather information and conduct research on your chosen topic.', done=False), ToDo(title='Create an outline', description='Organize your ideas and create a structure for your book.', done=False), ToDo(title='Develop characters and setting', description='If writing fiction, create detailed characters and a vivid setting.', done=False), ToDo(title='Write a draft', description='Begin writing your book, following the outline and structure.', done=False), ToDo(title='Revise and edit', description='Review your draft, make improvements and correct any errors.', done=False), ToDo(title='Get feedback', description='Share your draft with beta readers or a writing group for feedback.', done=False), ToDo(title='Finalize manuscript', description='Incorporate feedback and finalize your book manuscript.', done=False), ToDo(title='Find a publisher or self-publish', description='Submit your manuscript to publishers or choose to self-publish your book.', done=False), ToDo(title='Promote your book', description='Market and promote your book through social media, book signings, and other channels.', done=False)] title='Write a Book'

Examine the fully-resolved prompt that is sent to the LLM in the above output: everything from “Prompt after formatting:” down to “write a book”. Notice how it provides a fair bit of guidance on how the LLM should format its output.
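If the two-stage substitution feels a bit magical, here’s a minimal sketch of the idea using only the standard library. It is not how LangChain implements partial_variables; render and render_todo are hypothetical names, and the format-instructions string is a shortened stand-in for the real one shown above.

```python
from functools import partial

TEMPLATE = (
    "When responding, always use the following format:\n"
    "{format_instructions}\n"
    "Tell me the steps I need to take in order to:\n"
    "{request}"
)


def render(template: str, **variables: str) -> str:
    """Fill every placeholder in the template in one go."""
    return template.format(**variables)


# Stage 1: pin the format instructions once. Braces inside the *value*
# are safe, because str.format does not re-parse substituted text.
render_todo = partial(
    render,
    TEMPLATE,
    format_instructions='Reply with JSON such as {"title": "...", "todos": []}',
)

# Stage 2: supply only the per-call variable
prompt = render_todo(request="write a book")
print(prompt)
```

The design choice is the same one the real code makes: the schema text never changes between calls, so it gets bound once, and only the request varies.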

Let’s add some extra metadata to the Pydantic model and see what it looks like on the next run.

Here is the updated Pydantic model:

from pydantic import BaseModel, Field


class ToDo(BaseModel):
    title: str = Field(description="A short title for the ToDo", min_length=3, max_length=50)
    description: str = Field(description="Details or notes about the ToDo")
    done: bool = Field(default=False, description="Whether the ToDo is done or not")


class ToDoList(BaseModel):
    todos: list[ToDo] = Field(default_factory=list, description="A list of ToDos")
    title: str = Field(default="My ToDos", description="A title for the ToDo list", min_length=3, max_length=50)

Now let’s look at the impact on the generated prompt… I expanded the output schema so that it’s easier to read for you.

Old Prompt:

When responding, always use the following format:
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{
  "properties": {
    "todos": {
      "title": "Todos",
      "type": "array",
      "items": {
        "$ref": "#/definitions/ToDo"
      }
    },
    "title": {
      "title": "Title",
      "type": "string"
    }
  },
  "required": [
    "todos",
    "title"
  ],
  "definitions": {
    "ToDo": {
      "title": "ToDo",
      "type": "object",
      "properties": {
        "title": {
          "title": "Title",
          "type": "string"
        },
        "description": {
          "title": "Description",
          "type": "string"
        },
        "done": {
          "title": "Done",
          "type": "boolean"
        }
      },
      "required": [
        "title",
        "description",
        "done"
      ]
    }
  }
}
Tell me the steps I need to take in order to:
write a book

New Prompt with Metadata:

When responding, always use the following format:
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{
  "properties": {
    "todos": {
      "title": "Todos",
      "description": "A list of ToDos",
      "type": "array",
      "items": {
        "$ref": "#/definitions/ToDo"
      }
    },
    "title": {
      "title": "Title",
      "description": "A title for the ToDo list",
      "default": "My ToDos",
      "maxLength": 50,
      "minLength": 3,
      "type": "string"
    }
  },
  "definitions": {
    "ToDo": {
      "title": "ToDo",
      "type": "object",
      "properties": {
        "title": {
          "title": "Title",
          "description": "A short title for the ToDo",
          "maxLength": 50,
          "minLength": 3,
          "type": "string"
        },
        "description": {
          "title": "Description",
          "description": "Details or notes about the ToDo",
          "type": "string"
        },
        "done": {
          "title": "Done",
          "description": "Whether the ToDo is done or not",
          "default": false,
          "type": "boolean"
        }
      },
      "required": [
        "title",
        "description"
      ]
    }
  }
}
Tell me the steps I need to take in order to:
write a book

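Notice how the natural language in the Pydantic Field metadata enriches one’s understanding of the purpose of each field. This helps the LLM decide what to generate in each field. Notice too that the fields we gave defaults (done, and both fields of ToDoList) have dropped out of the "required" lists, and that the length limits show up as minLength and maxLength.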
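The same metadata does double duty: the constraints are also enforced when data is loaded into the model. Here’s a small sketch, assuming only that pydantic is installed, that feeds the updated ToDo model one good item and one with a too-short title. The sample values are made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class ToDo(BaseModel):
    title: str = Field(description="A short title for the ToDo", min_length=3, max_length=50)
    description: str = Field(description="Details or notes about the ToDo")
    done: bool = Field(default=False, description="Whether the ToDo is done or not")


# "done" is omitted, so the default of False is applied
ok = ToDo(title="Create an outline", description="Organize your ideas.")

# A two-character title violates min_length=3 and is rejected
error_fields = []
try:
    ToDo(title="Go", description="Title is too short")
    rejected = False
except ValidationError as err:
    rejected = True
    # Each error records the location of the offending field
    error_fields = [error["loc"][0] for error in err.errors()]

print(ok.done, rejected, error_fields)
```

So if the LLM ignores the hints in the prompt, the parse step is where you would expect to find out.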

Up Next

OK, I think this is finally enough background now to continue our MLCoder journey. Stay tuned for the 6th instalment next week!

