Understanding and Implementing Instructor Embeddings by HKU NLP

Juan Angel Suarez

As someone involved in implementing AI-powered software, especially within healthcare, I've observed how embedding models dramatically enhance AI applications. Embeddings convert complex data like text or images into numerical vectors that computers can easily handle. If you've used OpenAI's embedding APIs, you'll find Instructor embeddings by HKU NLP familiar and equally powerful—without ongoing API costs.

Why Embeddings Are Essential

Embedding models like those provided by OpenAI translate language or concepts into numerical vectors that encapsulate semantic relationships, making it easier for algorithms to interpret text and other data types meaningfully. Similar concepts land close together in vector space, letting systems identify patterns effectively; the toy sketch after the list below makes this concrete. Common embedding types include:

  • Text and Word Embeddings: Word2Vec and GloVe at the word level; OpenAI’s text-embedding-ada-002 at the sentence and passage level.
  • Image Embeddings: Derived using convolutional neural networks.
  • Graph Embeddings: Useful for relationship-rich data.
  • Domain-specific Embeddings: Customized embeddings optimized for particular fields, such as healthcare.
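
To make "numerically close" concrete, here is a toy sketch with made-up 3-dimensional vectors; real models use hundreds of dimensions, and the concept names and values below are purely illustrative:

import numpy as np

# Made-up 3-dimensional "embeddings" for three concepts
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.85, 0.15, 0.05])
car = np.array([0.1, 0.9, 0.2])

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, near 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, kitten))  # high: related concepts sit close together
print(cosine(cat, car))     # lower: unrelated concepts sit farther apart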

Instructor: An OpenAI-like Embedding Alternative

Instructor embeddings function much like OpenAI's, with one added benefit: instruction-driven customization. By pairing each input with a natural-language instruction, Instructor adapts the embedding to your specific task (see the sketch after this list), while staying as versatile and contextual as OpenAI's embeddings:

  • Multi-task Capability: Handles classification, retrieval, clustering, and more.
  • Adaptable Domains: Works seamlessly across healthcare, finance, science, etc.
  • Compact and Efficient: Strong benchmark performance from a comparatively small model (335M parameters in the large variant).

For teams seeking OpenAI-level embedding quality without the API overhead, Instructor embeddings are an excellent option.
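
Here is a minimal sketch of that customization, assuming the package is already installed; the example sentence and instruction strings are illustrative, not canonical prompts:

from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR('hkunlp/instructor-large')
text = "Patient reports persistent headaches after starting the new medication."

# The same sentence embedded under two different task instructions
emb_retrieval = model.encode([["Represent the medical document for retrieval:", text]])
emb_classify = model.encode([["Represent the medical note for classification:", text]])

# The vectors are similar but not identical: the instruction steers
# the representation toward the stated task
print(cosine_similarity(emb_retrieval, emb_classify)[0][0])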

Quickstart with Instructor

Choose the Right Model

Instructor is available in several sizes:

  • Base (137M parameters): Lightweight and efficient.
  • Large (335M parameters): Ideal balance—recommended for most projects.
  • XL (1.5B parameters): Best suited for very complex tasks.

The "large" variant typically delivers the best trade-off between performance and efficiency.

Easy Environment Setup

Set up a Python virtual environment and install the package:

python -m venv instructor-env
source instructor-env/bin/activate
pip install InstructorEmbedding

The library builds on sentence-transformers; if the import in the next step fails, install that dependency explicitly with pip install sentence_transformers.

Simple Embedding Generation

Here's how simple it is to start creating embeddings:

from InstructorEmbedding import INSTRUCTOR

# Downloads the checkpoint from the Hugging Face Hub on first use, then caches it
model = INSTRUCTOR('hkunlp/instructor-large')

text = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the science title:"

# encode() takes a list of [instruction, text] pairs and returns a NumPy array
embedding = model.encode([[instruction, text]])
print(f"Embedding shape: {embedding.shape}")

Real-World Use Cases

Text Classification

Build a sentiment classifier in a few lines:

from sklearn.linear_model import LogisticRegression

# Toy training set: two labeled reviews (1 = positive, 0 = negative)
texts = ["Great product!", "Not worth buying."]
labels = [1, 0]

# Encode each review with a task-specific instruction
instruction = "Represent the review for sentiment classification:"
embeddings = model.encode([[instruction, t] for t in texts])

# Train a lightweight classifier on top of the frozen embeddings
classifier = LogisticRegression().fit(embeddings, labels)

new_text = "Absolutely fantastic!"
new_embedding = model.encode([[instruction, new_text]])
prediction = classifier.predict(new_embedding)
print(f"Predicted sentiment: {prediction[0]}")

Semantic Search

Implement semantic search quickly:

from sklearn.metrics.pairwise import cosine_similarity

# Corpus to search over, embedded once up front
documents = ["AI is revolutionizing medicine.", "Advances in quantum computing."]
instruction = "Represent the document for retrieval:"
doc_embeddings = model.encode([[instruction, doc] for doc in documents])

# Embed the query with the same instruction, then rank by cosine similarity
# (a query-specific instruction such as "Represent the question for
# retrieving supporting documents:" can work even better here)
query = "Technology in healthcare"
query_embedding = model.encode([[instruction, query]])

similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
most_relevant = documents[similarities.argmax()]
print(f"Most relevant document: {most_relevant}")

Best Practices

  • Clear Instructions: As with OpenAI embeddings, specific prompts produce better results than generic ones.
  • Efficient Processing: Batch embeddings to save resources (see the sketch below).
  • Caching: Store embeddings on disk so repeated tasks skip re-encoding (also shown below).

Common pitfalls include overly generic instructions and ignoring domain context, both of which can noticeably reduce embedding quality.
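
Here is a minimal sketch combining batching and caching, assuming the encode() signature inherited from sentence-transformers; the corpus, file name, and batch size are all illustrative:

import os
import numpy as np

# Hypothetical corpus; replace with your own documents
docs = [f"Clinical note {i}" for i in range(1000)]
instruction = "Represent the clinical note for retrieval:"

cache_path = "doc_embeddings.npy"
if os.path.exists(cache_path):
    # Reuse previously computed embeddings
    embeddings = np.load(cache_path)
else:
    # encode() processes inputs in batches; batch_size trades throughput for memory
    pairs = [[instruction, d] for d in docs]
    embeddings = model.encode(pairs, batch_size=32, show_progress_bar=True)
    np.save(cache_path, embeddings)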

Explore more through the Instructor repository or their research paper.