Writing great instructions
The instruction document is the primary document shared with annotators to inform their work. It should be the one-stop shop where annotators get questions answered when onboarding, working through their tasks, or reviewing your feedback on their work. Expect it to be a living document.
What needs to be in the instructions document can vary by project, but common components include:
- Useful links
- Context and background
- Task guidelines
- FAQs
- Style Guide
- Grading rubric
- Examples
We’ll describe each of these below, followed by some examples of what it could look like when it’s all put together.
Workflows around instructions
Before diving into what makes a set of instructions great, we wanted to touch on some key workflows around how instructions get used.
Clarification Tracking
Especially in the early stages of a project, annotators and researchers will likely find a lot of issues and edge cases. We recommend having a process in place to collect feedback from annotators and researchers and update the annotation instructions.
What we see most often is a “clarification tracker” spreadsheet that records clarifications and their outcomes. Annotators can have direct edit access to the spreadsheet, or fill out a form to raise a clarification for the vendor and customer leads to discuss and respond to.
It is usually very tempting to let all clarifications and instruction changes just live in a Slack channel, but this can make it incredibly challenging for new people to come up to speed.
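As a minimal sketch of such a tracker, the spreadsheet could be approximated with a CSV file and a small helper that appends open clarifications. The column layout and field names here are assumptions for illustration, not a prescribed schema:

```python
import csv
from datetime import date

def log_clarification(path, question, raised_by):
    """Append an open clarification to the tracker (a CSV stand-in for a spreadsheet).

    Columns: date raised, raised by, question, status, resolution.
    The resolution column stays empty until the leads decide and come back.
    """
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [date.today().isoformat(), raised_by, question, "open", ""]
        )
```

In practice most teams just use a shared spreadsheet with these columns directly; the point is that every clarification gets a row, a status, and eventually a resolution that can be folded back into the instructions.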
Instruction Versioning
Because the work to be done on a project is constantly evolving, it can be tempting to simply keep making edits to the same Google doc to track all of these changes. We highly recommend versioning your instructions with both major and minor changes, e.g. v1.0, 1.1, 2.0, 2.1, 2.2, 2.3.
Once your instructions are versioned, ideally you record, on each task, which version of the instructions was in effect when the task was labeled. This lets you be much more honest about what the expectations were when that task was completed. Without doing this, you run the risk of incorporating off-policy, old data into the model, or of penalizing mistakes that weren’t mistakes when the data was produced.
To do this practically, some have made a change log at the bottom of their document to track what changed and when, while others make changes in a separate copy of the instructions for each version.
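The task-to-version linkage above could be sketched as a field on each task record plus a filter for on-policy data. The record fields and function names are hypothetical, for illustration only:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    task_id: str
    annotator_id: str
    instructions_version: str  # version in effect when the task was labeled

def on_policy(tasks, current_version):
    """Keep only tasks labeled under the current instruction version."""
    return [t for t in tasks if t.instructions_version == current_version]
```

With this in place, a breaking (major-version) instruction change gives you a clean rule for which older tasks to re-review or exclude from training data.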
Instruction Components
Useful links
At the top of the document, we recommend maintaining a list of useful references for annotators. This can include links to the style guide, rubric, task queue, Slack channel, or anything else.
Context and background
It’s easy to forget this when you’re the one setting up the project, but annotators don’t have the same context you do on goals and criteria. Annotators can contribute much more effectively when they understand the context and outcomes of their work.
This doesn’t have to be too long, but a brief paragraph on the objectives of your project, how the data they’re creating will be used, and a rough timeline are all good things to cover.
Workflow overview
The workflow is a step-by-step breakdown of the process annotators should follow when working on your project. It’s best to be in-depth when describing each step, but try to avoid creating many dozens of steps — the longer it gets, the harder it is for annotators to get started.
It can be helpful to have a “TL;DR” section, followed by a more detailed workflow to get people quickly up to speed with what they will need to do at a high level.
We recommend running through the annotation process yourself for a couple of tasks, and writing the workflow steps out as you go.
Almost always, the workflow will be modified as the project proceeds, and often even after doing the first few tasks, you’ll find inconsistencies you didn’t realize existed when you first wrote the instructions.
Task guidelines
This should be a crisp description of your expectations for the annotators’ work output.
One simple option is just to list out all of your expectations for final tasks, and continue to add to the list as you collect initial annotator submissions.
Another way to structure it is to reflect your rubric (see below), where each set of criteria comes directly from how you expect to evaluate tasks.
For very specific criteria/guidelines, it can be useful to provide a table of examples of dos/don’ts.
FAQs
Your annotator team will likely have lots of questions — besides basic logistics questions, they’ll find ambiguities and edge cases you couldn’t have thought of when starting.
It’s useful to include an FAQ section where you can add all the questions you’re seeing regularly to save yourself and the annotators time. This can just be the clarifications tracker to start with, but often the readability of a more curated and organized FAQ section is worth the effort.
Style guide
The style guide is your definition of best practices for formatting, tone, and other stylistic elements of annotators’ work output.
Depending on the length of the instruction doc and style guide, this can simply be included as a section within the instructions. For more complex projects, it can make sense to break out the style guide into a separate document to ensure it’s closely observed.
Some potential components of a style guide:
- Formatting, structure
- Tone (formality, humor, enthusiasm)
- Language (natural language or programming language, selection of locality for natural languages)
- Coding standard (for code projects)
This should be very custom to your project and use case, so add categories as needed. It can also be useful to provide a table of good and bad examples for each category. The model’s output does pick up the stylistic preferences people introduce in the ranking and fine-tuning stages of RLHF.
An example of a best-in-class style guide is OpenAI’s Model Spec, their approach to shaping desired model behavior, introduced in their blog post “Introducing the Model Spec.”
Grading rubric
The rubric is what you’ll use to review and grade work done by annotators. If you create a peer review system where annotators review each other’s work, they’ll be using it too.
It’s important for the rubric to be extremely comprehensive and as objective as possible. Naturally, you won’t be able to make the first draft perfect, but regularly update it as needed as you start to review annotators’ submissions.
The rubric criteria can vary quite a bit based on your use case. The following criteria can hold pretty generally for non-coding tasks:
- Quality
  - Factuality/accuracy
  - Relevance
  - Completeness
- Linguistic quality
  - Spelling, grammar, punctuation
  - Clarity
  - Conciseness
- Adherence to guidelines
  - Formatting
  - Tone
  - Content boundaries (bias, sensitive or offensive content, etc.)
You can think about scoring each criterion on a numeric scale or with other grading categories:
- Numeric scale
  - 1-3, 1-4, 1-7, and 1-10 are all options.
  - Consider whether you want to allow a “median” rating or force a split between bottom-half and top-half ratings.
- Error-based categories
  - An example of this is “Major error(s)”, “Minor error(s)”, “No errors/Good/Perfect”.
  - This can help force consistency in how task errors determine ratings.
- Error tagging
  - As opposed to grading the entire task, you allow portions of the human output to be tagged with errors. Errors can be categorized by type (e.g. hallucination, typo) as well as by severity (e.g. major, minor). This results in very granular feedback and allows multiple tags per task, but requires more investment in tooling.
- Binary
  - Only two options: “Accept” and “Reject”.
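As a sketch under stated assumptions, error tags on a task could be collapsed into error-based category grades with a small helper. The tag structure (a `severity` field per error) is hypothetical, not from any particular annotation tool:

```python
def grade_criterion(errors):
    """Collapse a criterion's tagged errors into an error-based grade.

    `errors` is a list of dicts like {"severity": "major", "category": "hallucination"};
    the worst severity present determines the grade.
    """
    severities = {e["severity"] for e in errors}
    if "major" in severities:
        return "Major error(s)"
    if "minor" in severities:
        return "Minor error(s)"
    return "No errors"
```

This kind of mapping is one way to get the granularity of error tagging while still reporting the simpler error-based categories.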
You should also consider how the overall task is graded based on scores across criteria. Think about this in conjunction with your preferred QA process; if you just want to pass or fail tasks, binary task grading may be simplest.
- The simplest approach is to “Accept” or “Reject” each task based on a rule (e.g., any major errors → reject, or any score below X → reject).
- You can also calculate an overall numerical grade from the criterion scores.
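A rule-based Accept/Reject decision of the kind described above could be sketched like this; the grade labels, score mapping, and threshold are illustrative assumptions, not a prescribed scheme:

```python
# Hypothetical mapping from error-based grades to numeric scores.
GRADE_SCORES = {"Major error(s)": 1, "Minor error(s)": 2, "Perfect": 3}

def overall_grade(criterion_grades, min_score=2):
    """Binary task grading from per-criterion grades.

    Reject on any major error, or on any criterion scoring below `min_score`;
    otherwise accept.
    """
    scores = [GRADE_SCORES[g] for g in criterion_grades.values()]
    if min(scores) <= GRADE_SCORES["Major error(s)"]:
        return "Reject"
    if min(scores) < min_score:
        return "Reject"
    return "Accept"
```

Whatever rule you pick, writing it down explicitly (rather than leaving it to reviewer judgment) is what keeps grading consistent across reviewers.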
Examples
General chatbot fine-tuning
We are creating a high-quality SFT (Supervised Fine-Tuning) dataset consisting of prompt-response pairs for general chatbot use cases. This dataset will be used to train a large language model (LLM) to improve its conversational and instruction-following abilities. Your contributions are crucial to ensuring the model is helpful, accurate, and harmless. The project is expected to span over the next three months, with regular check-ins and feedback sessions to ensure high-quality data collection.
Workflow
- Understanding the Task:
- Read through the provided guidelines and examples to familiarize yourself with the desired output.
- If you have any questions, refer to the FAQ section or reach out via the designated communication channel (e.g., Slack).
- Selecting Prompts:
- Select a topic or scenario that is relevant to general chatbot use cases. This could range from casual conversation, technical support, customer service, to educational assistance.
- Creating Responses:
- Craft a thoughtful, relevant, and coherent response to the selected prompt.
- Ensure the response is natural, informative, and appropriately addresses the prompt.
- Review and Refinement:
- Review your prompt-response pair for clarity, coherence, and grammatical accuracy.
- Make any necessary adjustments to improve the quality of the response.
- Submission:
- Submit the prompt-response pair through the designated submission form or platform.
- Fill in any required metadata or annotations as specified.
- Feedback and Iteration:
- Review feedback provided on your submissions and make necessary improvements.
- Continuously refine your approach based on feedback to enhance the quality of your future submissions.
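The submission step above depends on your platform, but as a rough sketch, a submitted prompt-response pair with its metadata might be serialized as JSON like this. All field names here are hypothetical; use whatever schema your submission form or platform requires:

```python
import json

submission = {
    "prompt": "Can you explain the process of photosynthesis?",
    "response": "Photosynthesis is the process by which green plants use "
                "sunlight to synthesize foods with the help of chlorophyll.",
    "metadata": {
        "annotator_id": "annotator-042",       # illustrative IDs
        "instructions_version": "1.1",         # ties the task to its instructions
        "topic": "education",
    },
}

# Serialize for submission or storage.
serialized = json.dumps(submission, indent=2)
```

Including the instructions version in the metadata is what makes the versioning workflow described earlier possible.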
Task criteria
- Relevance: The response should directly address the prompt and be relevant to the topic.
- Coherence: The response should be logically structured and easy to understand.
- Informative: Provide sufficient information in your response to be helpful or engaging.
- Naturalness: The response should mimic natural human conversation.
- Correctness: Ensure the response is free of factual, grammatical, and spelling errors.
- Appropriateness: Maintain a respectful and appropriate tone throughout the response.
Examples of Dos and Don’ts:
| Do | Don't |
|---|---|
| Provide detailed, thoughtful responses | Give short, unhelpful answers |
| Ensure the response is relevant to the prompt | Go off-topic or provide irrelevant information |
| Use proper grammar and spelling | Submit responses with grammatical or spelling errors |
| Maintain a polite and respectful tone | Use inappropriate or disrespectful language |
FAQ
- What types of prompts should I create?
- Prompts can vary widely, including casual conversation starters, customer service queries, technical questions, or educational topics. Aim for diversity to train a versatile chatbot.
- How detailed should responses be?
- Responses should be detailed enough to provide value but concise enough to remain engaging. Aim for a balance between thoroughness and brevity.
- Can I use external sources to craft responses?
- Yes, but ensure the response is original and not directly copied. Paraphrase and add your own insights to maintain originality.
- What if I’m unsure about the relevance of a prompt?
- Refer to the provided examples and guidelines. If still in doubt, seek clarification from the project coordinator.
- How often will I receive feedback?
- Feedback will be provided regularly, typically on a weekly basis, to help you improve and align with project expectations.
Prompt-response pair examples
Prompt: "Can you explain the process of photosynthesis?"
Response: "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll. It involves the transformation of carbon dioxide and water into glucose and oxygen. The overall chemical reaction is represented as 6CO2 + 6H2O → C6H12O6 + 6O2. This process is crucial for the production of oxygen and forms the basis of the food chain."
Prompt: "What are some effective ways to manage stress?"
Response: "Managing stress can involve various techniques such as regular physical exercise, mindfulness meditation, proper time management, and maintaining a healthy lifestyle. It's also important to establish a strong support network, whether through friends, family, or professional help. Engaging in hobbies, ensuring adequate sleep, and practicing relaxation techniques can also significantly reduce stress levels."
Example style guide
Formatting and Structure
Prompt: Clear, concise, direct question, statement, or scenario.
Response: Logical structure with a clear beginning, middle, and end. Use paragraphs for readability. Ensure grammatical correctness.
Good Example:
Prompt: "Can you explain the process of photosynthesis?"
Response: "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll. It involves the transformation of carbon dioxide and water into glucose and oxygen. This process is crucial for the production of oxygen and forms the basis of the food chain."
Bad Example:
Prompt: "photosynthesis?"
Response: "it's how plants make food from sunlight."
Tone
- Formality: Polite, professional, neither too formal nor too casual.
- Humor: Use sparingly and appropriately.
- Enthusiasm: Balanced, friendly, not overly enthusiastic.
Good Example:
Prompt: "What are some effective ways to manage stress?"
Response: "Managing stress involves regular exercise, mindfulness meditation, and proper time management. Establishing a healthy lifestyle and support network can also help."
Bad Example:
Prompt: "How to manage stress?"
Response: "Oh, it's super easy! Just chill out, have some fun, and don't worry about it! You'll be fine."
Narrative Perspective
- Use: Second-person perspective (you, your).
- Avoid: First-person perspective (I, we) unless specified.
Good Example:
Prompt: "Can you explain the process of photosynthesis?"
Response: "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll."
Bad Example:
Prompt: "Can you explain the process of photosynthesis?"
Response: "I think photosynthesis is really interesting. We use sunlight to make food."
Language
- Clarity: Use clear, natural language.
- Inclusivity: Avoid offensive or exclusionary terms.
- Consistency: Use American English for spelling and grammar.
Good Example:
Prompt: "What are some effective ways to manage stress?"
Response: "Managing stress involves regular physical exercise, mindfulness meditation, and proper time management. Establishing a support network and engaging in hobbies can also be helpful."
Bad Example:
Prompt: "What are some effective ways to manage stress?"
Response: "Managing stress involves regular physical exercise, mindfulness meditation, and proper time management. Organising your schedule and prioritising hobbies can also be helpful."
Example rubric
| Category | Component | Description | Grade | Grade criteria |
|---|---|---|---|---|
| Content quality | Factuality/accuracy | Is the response factually accurate? | Major error(s) | The response contains significant factual inaccuracies or false information. |
| | | | Minor error(s) | The response contains minor factual inaccuracies that do not significantly affect the overall message. The response does not correct factual errors found in the prompt. |
| | | | Perfect | No factual inaccuracies in the response. If applicable, the response corrects any factual errors found in the prompt. |
| | Relevance | Is the prompt relevant to the topic? Is the response relevant to the prompt? | Major error(s) | The response is completely irrelevant to the prompt or the prompt is entirely off-topic. |
| | | | Minor error(s) | The response partially addresses the prompt, or the prompt is only somewhat relevant to the topic. |
| | | | Perfect | The response is fully relevant to the prompt and addresses it appropriately, and the prompt is completely relevant to the topic. |
| | Completeness | Does the response comprehensively address all parts of the prompt? | Major error(s) | The response does not address several major parts of the prompt, either by omission or incompleteness. |
| | | | Minor error(s) | The response addresses all major parts of the prompt, but may miss small details through incompleteness. |
| | | | Perfect | The response comprehensively addresses every part of the prompt, with no omissions. |
| Linguistic quality | Spelling, grammar, and punctuation | Are the prompt and response in the correct language? Do both the prompt and response follow the spelling, grammar, and punctuation rules of that language? | Major error(s) | Incorrect language in the prompt or response. Major spelling, grammatical, or punctuation errors in either the prompt or response. |
| | | | Minor error(s) | Minor errors in spelling, grammar, or punctuation; for example, using British English vs. American English (colour vs. color) or excluding the Oxford comma. Error(s) do not significantly hinder understanding. |
| | | | Perfect | Both prompt and response are in the correct language, with no spelling, grammatical, or punctuation errors. |
| | Clarity | Is the response easy to understand? | Major error(s) | The response is unclear and difficult to understand due to poor language use. |
| | | | Minor error(s) | The response is mostly clear but may contain some awkward phrasing or minor clarity issues. |
| | | | Perfect | The response is very clear and easy to understand for a general audience. |
| | Conciseness | Are the prompt and response both presented concisely without unnecessary details? | Major error(s) | The response contains many unnecessary details or is overly verbose. |
| | | | Minor error(s) | The response is somewhat concise but could be more succinct. |
| | | | Perfect | The response is concise and to the point, without unnecessary details. |
| Adherence to guidelines | Formatting | Do the prompt and response both follow the formatting rules specified in the guidelines? | Major error(s) | The prompt or response does not follow the specified formatting rules at all. |
| | | | Minor error(s) | The prompt or response mostly follows the formatting rules but contains some minor deviations. |
| | | | Perfect | The prompt and response strictly follow all specified formatting rules. |
| | Style/tone | Is the language of both the prompt and response consistent with the writing style and tone defined in the guidelines/style guide? | Major error(s) | The language of the prompt or response is completely inconsistent with the defined writing style and tone. |
| | | | Minor error(s) | The language of the prompt or response is mostly consistent with the defined writing style and tone but contains some minor deviations. |
| | | | Perfect | The language of both the prompt and response is fully consistent with the defined writing style and tone. |
| | Content boundaries | Are the prompt and response free from biased or inappropriate content? | Major error(s) | The prompt or response contains biased, inappropriate, or offensive content. |
| | | | Minor error(s) | The prompt and response do not contain clearly biased, inappropriate, or offensive content, but could be read as biased, inappropriate, or offensive. |
| | | | Perfect | The prompt and response are entirely free from biased, inappropriate, or offensive content. |
RLHF sample documents
Example RLHF guidelines