Job Description:
• Build the evaluation layer in the ThirdLaw platform for LLM prompts and responses
• Design and tune real-time guardrails, classifiers, and semantic judgment systems
• Implement evaluation strategies with semantic similarity, foundation model scoring, and rule-based systems
• Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking)
• Prototype, tune, and productize small language models for classification, labeling, or scoring
• Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage
• Build tools to observe, debug, and improve evaluator performance across data distributions
• Define abstractions for reusable evaluation components that can scale across use cases
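The evaluation-plus-enforcement flow described above can be illustrated with a minimal sketch. The embedding vectors, blocked terms, and similarity threshold below are hypothetical placeholders, not details of the ThirdLaw platform:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def evaluate_response(response_emb, reference_emb, response_text,
                      blocked_terms=("ssn", "password"), threshold=0.8):
    """Combine a rule-based guardrail with semantic similarity scoring.

    Returns an enforcement action: 'pass', 'escalate', or 'block'.
    Thresholds and blocked terms are illustrative only.
    """
    # Rule-based guardrail fires first and maps to a blocking action
    if any(term in response_text.lower() for term in blocked_terms):
        return "block"
    # Semantic similarity against a reference maps to pass/escalate
    score = cosine_similarity(response_emb, reference_emb)
    return "pass" if score >= threshold else "escalate"
```

For example, a response whose embedding closely matches the reference passes, an off-topic one is escalated for review, and one containing a blocked term is blocked outright.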
Requirements:
• 7+ years of experience in ML systems or AI engineering roles
• 1–2 years working directly with LLMs, NLP pipelines, or semantic search
• Deep understanding of foundation models and their APIs (e.g. OpenAI GPT, Claude, Mistral, Llama)
• Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embeddings pipelines
• Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
• Strong Python skills, with experience using libraries such as Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
• Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production
Benefits:
• Market-rate cash compensation
• Above-market equity
• Generous, well-designed benefits