Index
A
- accelerators, AI Accelerators-Power consumption
- computational capabilities, Computational capabilities
- defined, What’s an accelerator?-What’s an accelerator?
- memory size and bandwidth, Memory size and bandwidth-Memory size and bandwidth
- power consumption, Power consumption-Power consumption
- active injection, Indirect prompt injection
- adapter-based methods, PEFT techniques
- adapters
- finetuning, Finetuning methods
- LoRA, LoRA-Quantized LoRA
- merging with concatenation, Concatenation
- PEFT techniques, PEFT techniques-PEFT techniques
- agents, Agents-Efficiency
- agent failure modes and evaluation, Agent Failure Modes and Evaluation-Efficiency
- efficiency, Efficiency
- planning failures, Planning failures
- tool failures, Tool failures
- overview, Agent Overview-Agent Overview
- planning agents, Planning-Tool selection
- foundation models as planners, Foundation models as planners-Foundation models as planners
- overview, Planning overview-Planning overview
- plan generation, Plan generation-Complex plans
- reflection and error correction, Reflection and error correction-Reflection and error correction
- tool selection, Tool selection-Tool selection
- tools, Tools-Write actions
- capability extension, Capability extension
- knowledge augmentation, Knowledge augmentation
- write actions, Write actions
- AI accelerators (see accelerators)
- AI application building (see application building)
- AI application planning (see application planning)
- AI engineering (AIE)
- defined, From Foundation Models to AI Engineering
- ML engineering versus, AI Engineering Versus ML Engineering-AI interface
- rise of AI engineering, The Rise of AI Engineering-From Foundation Models to AI Engineering
- AI engineering architecture (see engineering architecture)
- AI engineering stack (see engineering stack)
- AI judge, AI as a Judge
- (see also AI-as-a-judge)
- AI pipeline orchestration (see pipeline orchestration)
- AI systems evaluation (see systems evaluation)
- AI-as-a-judge, AI as a Judge-What Models Can Act as Judges?
- limitations, Limitations of AI as a Judge-Biases of AI as a judge
- biases, Biases of AI as a judge
- criteria ambiguity, Criteria ambiguity-Criteria ambiguity
- inconsistency, Inconsistency
- increased costs and latency, Increased costs and latency
- models, What Models Can Act as Judges?-What Models Can Act as Judges?
- reasons, Why AI as a Judge?
- reference-based, What Models Can Act as Judges?
- uses, How to Use AI as a Judge-How to Use AI as a Judge
- AI-powered data synthesis (see data synthesis, AI-powered)
- AMP (automatic mixed precision), Training quantization
- ANN (approximate nearest neighbor), Embedding-based retrieval
- Annoy (approximate nearest neighbors oh yeah), Embedding-based retrieval
- anomaly detection, Similarity Measurements Against Reference Data
- Anthropic
- contextual retrieval, Contextual retrieval
- inverse scaling and alignment training, Model Size
- prompt caching, Prompt caching
- RAG and, RAG
- APIs (see open source models, model APIs versus)
- application building, Introduction to Building AI Applications with Foundation Models-Summary
- application planning, Planning AI Applications-Maintenance
- maintenance, Maintenance
- milestone planning, Milestone Planning
- set expectations, Setting Expectations
- use case evaluation, Use Case Evaluation-AI product defensibility
- engineering stack, The AI Engineering Stack-AI Engineering Versus Full-Stack Engineering
- AI engineering versus ML engineering, AI Engineering Versus ML Engineering-AI interface
- application development, Application development-AI interface
- full-stack engineering versus, AI Engineering Versus Full-Stack Engineering
- three layers of AI stack, Three Layers of the AI Stack-Three Layers of the AI Stack
- foundation model use cases, Foundation Model Use Cases-Workflow Automation
- coding, Coding-Coding
- conversational bots, Conversational Bots
- data organization, Data Organization
- education, Education
- image and video production, Image and Video Production
- information aggregation, Information Aggregation
- workflow automation, Workflow Automation
- writing, Writing-Writing
- rise of AI engineering, The Rise of AI Engineering-From Foundation Models to AI Engineering
- foundation models to AI engineering, From Foundation Models to AI Engineering-From Foundation Models to AI Engineering
- application development, Three Layers of the AI Stack, Application development-AI interface
- AI interface, AI interface
- evaluation, Evaluation
- prompt engineering and context construction, Prompt engineering and context construction
- application planning, Planning AI Applications-Maintenance
- maintenance, Maintenance
- milestone planning, Milestone Planning
- set expectations, Setting Expectations
- use case evaluation, Use Case Evaluation-AI product defensibility
- approximate nearest neighbor (ANN), Embedding-based retrieval
- approximate string matching, Lexical similarity
- ARC-C, Public leaderboards
- attention mechanisms, Attention mechanism-Attention mechanism
- attention modules, Transformer block
- MLP modules, Transformer block
- optimization, Attention mechanism optimization-Writing kernels for attention computation
- attention mechanism redesign, Redesigning the attention mechanism
- writing kernels for attention computation, Writing kernels for attention computation
- redesign, Redesigning the attention mechanism
- attention modules, Transformer block
- augmentation of data
- defined, Data Augmentation and Synthesis
- automated attacks, Automated attacks
- automatic mixed precision (AMP), Training quantization
- autoregressive decoding bottleneck, Overcoming the autoregressive decoding bottleneck-Parallel decoding
- inference with reference, Inference with reference
- parallel decoding, Parallel decoding
- speculative decoding, Speculative decoding-Speculative decoding
- autoregressive language model, Language models
B
- backpropagation, Backpropagation and Trainable Parameters-Backpropagation and Trainable Parameters
- batch inference APIs, Online and batch inference APIs-Online and batch inference APIs
- batch size, Batch size
- batching
- batch inference APIs, Online and batch inference APIs-Online and batch inference APIs
- batch size, Batch size
- continuous, Batching
- dynamic, Batching
- static, Batching
- benchmarks
- for comparative evaluation, The Future of Comparative Evaluation
- data contamination detection, Perplexity Interpretation and Use Cases
- domain distribution and, Domain-Specific Models
- domain-specific, Domain-Specific Capability-Domain-Specific Capability
- instruction-following criteria, Instruction-following criteria-Instruction-following criteria
- model-centric versus data-centric, Dataset Engineering
- navigating public benchmarks, Navigate Public Benchmarks-Custom leaderboards with public benchmarks
- biases, Biases of AI as a judge, Biases
- bits-per-byte (BPB), Bits-per-Character and Bits-per-Byte
- bits-per-character (BPC), Bits-per-Character and Bits-per-Byte
- bottlenecks
- autoregressive decoding, Overcoming the autoregressive decoding bottleneck-Parallel decoding
- computational, Computational bottlenecks-Computational bottlenecks
- compute-bound, Computational bottlenecks
- memory, Memory Bottlenecks-Training quantization, Computational bottlenecks
- scaling, Scaling bottlenecks-Scaling bottlenecks, Scalability bottlenecks
- BPB (bits-per-byte), Bits-per-Character and Bits-per-Byte
- BPC (bits-per-character), Bits-per-Character and Bits-per-Byte
- build time, Comparing retrieval algorithms
C
- canonical responses, Similarity Measurements Against Reference Data
- capability extension, Capability extension
- chain-of-thought (CoT), Give the Model Time to Think-Give the Model Time to Think, Data Curation
- chaining, AI Pipeline Orchestration
- change failure rate (CFR), Monitoring and Observability
- CharacterEval, Roleplaying
- ChatGPT
- comparative evaluation, Ranking Models with Comparative Evaluation
- data privacy issues, Data privacy
- effect on AI investment, From Foundation Models to AI Engineering
- Gemini versus, Evaluation
- hallucinations, Hallucination
- and human writing quality, Writing
- introduction of, Preface
- and languages other than English, Multilingual Models
- query rewriting, Query rewriting
- reverse prompt engineering attacks, Proprietary Prompts and Reverse Prompt Engineering
- in schools, Education
- Chinchilla scaling law, Scaling law: Building compute-optimal models
- chunking, RAG Architecture, Chunking strategy-Chunking strategy
- Claude, RAG and, RAG
- CLIP, From Large Language Models to Foundation Models, Domain-Specific Models, Introduction to Embedding
- clustering, Similarity Measurements Against Reference Data
- Common Crawl dataset, Training Data-Multilingual Models
- comparative evaluation, Ranking Models with Comparative Evaluation-The Future of Comparative Evaluation
- comparison data, Reward model
- compilers, Kernels and compilers
- components definition, AI Pipeline Orchestration
- computational bottlenecks, Computational bottlenecks-Computational bottlenecks
- computational capabilities, of AI accelerators, Computational capabilities
- compute-bound bottlenecks, Computational bottlenecks
- compute-optimal models, Scaling law: Building compute-optimal models-Scaling law: Building compute-optimal models
- compute-optimal training, Scaling law: Building compute-optimal models
- concatenation, Concatenation
- constrained sampling, Constrained sampling
- context construction, Prompt engineering and context construction, Provide Sufficient Context, Step 1. Enhance Context
- context efficiency, Context Length and Context Efficiency-Context Length and Context Efficiency
- context length, Context Length and Context Efficiency-Context Length and Context Efficiency
- context parallelism, Parallelism
- context precision, Comparing retrieval algorithms
- context recall, Comparing retrieval algorithms
- contextual retrieval, Contextual retrieval-Contextual retrieval
- continuous batching, Batching
- control flow, Complex plans
- conversational bots, Conversational Bots
- conversational feedback
- conversation length, Conversation length
- conversation organization, Conversation organization
- extracting, Extracting Conversational Feedback-Dialogue diversity
- language diversity, Dialogue diversity
- natural language feedback, Natural language feedback-Sentiment
- complaints, Complaints
- early termination, Early termination
- error correction, Error correction
- sentiment, Sentiment
- regeneration, Regeneration
- copyright regurgitation, Information Extraction
- copyright, model training and, Data lineage and copyright
- CoT (chain-of-thought), Give the Model Time to Think-Give the Model Time to Think
- CPU memory (DRAM), Memory size and bandwidth
- criteria ambiguity, Criteria ambiguity-Criteria ambiguity
- cross entropy, Cross Entropy
- cross-layer attention, Redesigning the attention mechanism
D
- data annotation, Data Acquisition and Annotation-Data Acquisition and Annotation
- and data curation, Data Curation-Data Acquisition and Annotation
- and data inspection, Inspect Data
- dataset engineering and, Dataset engineering
- data augmentation, Data Augmentation and Synthesis-Model Distillation
- defined, Data Augmentation and Synthesis
- data cleaning/filtering, Clean and Filter Data
- data contamination, Data contamination with public benchmarks-Handling data contamination
- data coverage, Data Coverage-Data Coverage
- data curation, Data Curation-Data Acquisition and Annotation
- data deduplication, Similarity Measurements Against Reference Data, Deduplicate Data-Deduplicate Data
- data flywheels, Data Acquisition and Annotation
- data formatting, Format Data-Format Data
- data inspection, Inspect Data-Inspect Data
- data lineage, Data lineage and copyright
- data organization, Data Organization
- data privacy, Data privacy
- data processing, Data Processing-Format Data
- data cleaning/filtering, Clean and Filter Data
- data formatting, Format Data-Format Data
- deduplicating data, Deduplicate Data-Deduplicate Data
- inspecting data, Inspect Data-Inspect Data
- data synthesis, Data Augmentation and Synthesis-Model Distillation
- AI-powered, AI-Powered Data Synthesis-Obscure data lineage
- data verification, Data verification-Data verification
- instruction data synthesis, Instruction data synthesis-Instruction data synthesis
- limitations, Limitations to AI-generated data-Obscure data lineage
- obscure data lineage problems, Obscure data lineage
- potential model collapse, Potential model collapse
- quality control problems, Quality control
- reasons for synthesizing data, Why Data Synthesis-Why Data Synthesis
- superficial imitation problems, Superficial imitation
- model distillation, Model Distillation
- traditional techniques, Traditional Data Synthesis Techniques-Simulation
- rule-based, Rule-based data synthesis-Rule-based data synthesis
- simulation, Simulation
- data verification, Data verification-Data verification
- dataset engineering, Dataset engineering, Dataset Engineering-Summary
- data augmentation/synthesis, Data Augmentation and Synthesis-Model Distillation
- data curation, Data Curation-Data Acquisition and Annotation
- data acquisition/annotation, Data Acquisition and Annotation-Data Acquisition and Annotation
- data coverage, Data Coverage-Data Coverage
- data quality, Data Quality-Data Quality
- data quantity, Data Quantity-Data Quantity
- data processing, Data Processing-Format Data
- data cleaning and filtering, Clean and Filter Data
- data formatting, Format Data-Format Data
- deduplicating data, Deduplicate Data-Deduplicate Data
- inspecting data, Inspect Data-Inspect Data
- data-centric view of AI, Dataset Engineering
- DDR SDRAM (double data rate synchronous dynamic random-access memory), Memory size and bandwidth
- debugging, Break Complex Tasks into Simpler Subtasks
- decoding
- autoregressive decoding bottleneck, Overcoming the autoregressive decoding bottleneck-Parallel decoding
- decoupling from prefilling, Decoupling prefill and decode
- in transformer architecture, Transformer architecture
- defensive prompt engineering
- jailbreaking and prompt injection, Jailbreaking and Prompt Injection-Indirect prompt injection
- automated attacks, Automated attacks
- direct manual prompt hacking, Direct manual prompt hacking-Direct manual prompt hacking
- indirect prompt injection, Indirect prompt injection-Indirect prompt injection
- prompt attack defense, Defenses Against Prompt Attacks-System-level defense
- model-level defense, Model-level defense
- prompt-level defense, Prompt-level defense
- system-level defense, System-level defense
- degenerate feedback loops, Degenerate feedback loop
- demonstration data, Supervised Finetuning
- dense retrievers, Retrieval Algorithms
- dimensionality reduction, Deduplicate Data
- direct manual prompt hacking, Direct manual prompt hacking-Direct manual prompt hacking
- Direct Preference Optimization (DPO), Preference Finetuning
- distillation, Reasons to Finetune
- base, Base models
- model distillation, Open source, open weight, and model licenses, Model Distillation, Model compression
- synthetic data and, Why Data Synthesis
- domain-specific capability, Domain-Specific Capability-Domain-Specific Capability
- domain-specific task finetuning, Reasons Not to Finetune
- domain-specific training data models, Domain-Specific Models-Domain-Specific Models
- dot products, Attention mechanism
- double data rate synchronous dynamic random-access memory (DDR SDRAM), Memory size and bandwidth
- DPO (Direct Preference Optimization), Preference Finetuning
- DRAM (CPU memory), Memory size and bandwidth
- drift detection, Drift detection
- dynamic batching, Batching
- dynamic features, The role of AI and humans in the application
E
- edit distance, Lexical similarity
- Elo, Ranking Models with Comparative Evaluation, Scalability bottlenecks, Quantized LoRA
- embedding, Introduction to Embedding-Introduction to Embedding
- embedding algorithm, Semantic similarity, Introduction to Embedding
- embedding model, From Large Language Models to Foundation Models
- embedding-based retrieval, Embedding-based retrieval-Embedding-based retrieval
- multimodal RAG and, Multimodal RAG
- embedding models, Introduction to Embedding
- engineering architecture, AI Engineering Architecture-AI Pipeline Orchestration
- AI pipeline orchestration, AI Pipeline Orchestration-AI Pipeline Orchestration
- monitoring and observability, Monitoring and Observability-Drift detection
- drift detection, Drift detection
- logs and traces, Logs and traces-Logs and traces
- metrics, Metrics-Metrics
- monitoring versus observability, Monitoring and Observability
- step 1: enhancing context, Step 1. Enhance Context
- step 2: putting in guardrails, Step 2. Put in Guardrails-Guardrail implementation
- guardrail implementation, Guardrail implementation
- input guardrails, Input guardrails-Input guardrails
- output guardrails, Output guardrails-Output guardrails
- step 3: adding model router and gateway, Step 3. Add Model Router and Gateway-Gateway
- step 4: reducing latency with caches, Step 4. Reduce Latency with Caches-Semantic caching
- exact caching, Exact caching
- semantic caching, Semantic caching
- step 5: adding agent patterns, Step 5. Add Agent Patterns
- engineering stack, Three Layers of the AI Stack-Three Layers of the AI Stack
- application development, Three Layers of the AI Stack
- AI interface, AI interface
- evaluation, Evaluation
- prompt engineering and context construction, Prompt engineering and context construction
- infrastructure, Three Layers of the AI Stack
- ML engineering versus, Model development-Inference optimization
- model development, Three Layers of the AI Stack
- entropy, Entropy
- epochs, Number of epochs
- error correction, Reflection and error correction-Reflection and error correction
- evaluation, Evaluation
- evaluation harnesses, Navigate Public Benchmarks
- evaluation methodology, Evaluation Methodology-Summary
- AI as a judge, AI as a Judge-What Models Can Act as Judges?
- AI systems evaluation (see systems evaluation)
- challenges, Challenges of Comparative Evaluation-From comparative performance to absolute performance
- challenges of foundation model evaluation, Challenges of Evaluating Foundation Models-Challenges of Evaluating Foundation Models
- comparative performance to absolute performance, From comparative performance to absolute performance
- lack of standardization and quality control, Lack of standardization and quality control-Lack of standardization and quality control
- scalability bottlenecks, Scalability bottlenecks
- exact evaluation, Exact Evaluation-Introduction to Embedding
- future, The Future of Comparative Evaluation
- language model for computing text perplexity, Perplexity Interpretation and Use Cases
- language modeling metrics, Understanding Language Modeling Metrics-Perplexity Interpretation and Use Cases
- rank models with comparative evaluation, Ranking Models with Comparative Evaluation-The Future of Comparative Evaluation
- evaluation pipeline design, Design Your Evaluation Pipeline-Iterate
- step 1: evaluating all components in a system, Step 1. Evaluate All Components in a System-Step 1. Evaluate All Components in a System
- step 2: creating an evaluation guideline, Step 2. Create an Evaluation Guideline-Tie evaluation metrics to business metrics
- creating scoring rubrics with examples, Create scoring rubrics with examples
- defining evaluation criteria, Define evaluation criteria
- tying evaluation metrics to business metrics, Tie evaluation metrics to business metrics
- step 3: defining evaluation methods and data, Step 3. Define Evaluation Methods and Data-Iterate
- annotating evaluation data, Annotate evaluation data-Annotate evaluation data
- evaluating evaluation pipeline, Evaluate your evaluation pipeline
- iteration, Iterate
- selecting evaluation methods, Select evaluation methods
- evaluation-driven development, Evaluation Criteria-Evaluation Criteria
- eviction policies, Exact caching
- exact caching, Exact caching
- exact evaluation, Exact Evaluation-Introduction to Embedding
- functional correctness, Functional Correctness-Functional Correctness
- similarity measurements against reference data, Similarity Measurements Against Reference Data-Semantic similarity
- exact matches, Exact match
- expectation setting, Setting Expectations
- explicit feedback, Extracting Conversational Feedback-Dialogue diversity
F
- factual consistency, Factual consistency-Factual consistency, Create scoring rubrics with examples
- faithfulness, Generation Capability
- feature-based transfers, Finetuning, Finetuning Overview
- feature-free transfers, Finetuning
- federated learning, Model Merging and Multi-Task Finetuning
- feedback design
- how to collect feedback, How to collect feedback-How to collect feedback
- when to collect feedback
- in the beginning, In the beginning
- when something bad happens, When something bad happens
- when the model has low confidence, When the model has low confidence-When the model has low confidence
- feedforward computation, Parallelism
- feedforward layer, Transformer block, LoRA configurations
- few-shot learning, In-Context Learning: Zero-Shot and Few-Shot-In-Context Learning: Zero-Shot and Few-Shot
- finetuning, Finetuning-Summary
- defined, Modeling and training
- domain-specific tasks, Reasons Not to Finetune
- finetuning and RAG, Finetuning and RAG-Finetuning and RAG
- hyperparameters, Finetuning hyperparameters-Prompt loss weight
- batch size, Batch size
- learning rate, Learning rate
- number of epochs, Number of epochs
- prompt loss weight, Prompt loss weight
- memory bottlenecks, Memory Bottlenecks-Training quantization
- backpropagation and trainable parameters, Backpropagation and Trainable Parameters-Backpropagation and Trainable Parameters
- memory math, Memory Math-Memory needed for training
- numerical representations, Numerical Representations-Numerical Representations
- quantization, Quantization-Training quantization
- overview, Finetuning Overview-Finetuning Overview
- structured outputs, Finetuning
- tactics, Finetuning Tactics-Prompt loss weight
- techniques, Finetuning Techniques-Prompt loss weight
- LoRA, LoRA-Quantized LoRA
- model merging and multi-task finetuning, Model Merging and Multi-Task Finetuning-Concatenation
- parameter-efficient finetuning, Parameter-Efficient Finetuning-Quantized LoRA
- PEFT techniques, PEFT techniques-PEFT techniques
- when to finetune, When to Finetune-Finetuning and RAG
- reasons not to finetune, Reasons Not to Finetune-Reasons Not to Finetune
- reasons to finetune, Reasons to Finetune
- FLOP (floating point operation), Model Size
- foundation models, From Foundation Models to AI Engineering, Understanding Foundation Models-Summary
- evaluation challenges, Challenges of Evaluating Foundation Models-Challenges of Evaluating Foundation Models
- comparative performance to absolute performance, From comparative performance to absolute performance
- lack of standardization and quality control, Lack of standardization and quality control-Lack of standardization and quality control
- scalability bottlenecks, Scalability bottlenecks
- inverse scaling, Model Size
- modeling, Modeling-Scaling bottlenecks
- model architecture, Model Architecture-Other model architectures
- model size, Model Size-Scaling bottlenecks
- parameter versus hyperparameter, Scaling extrapolation
- post-training, Post-Training-Finetuning using the reward model
- preference finetuning, Preference Finetuning-Finetuning using the reward model
- supervised finetuning, Supervised Finetuning-Supervised Finetuning
- sampling, Sampling-Hallucination
- probabilistic nature of AI, The Probabilistic Nature of AI-Hallucination
- sampling fundamentals, Sampling Fundamentals-Sampling Fundamentals
- sampling strategies, Sampling Strategies-Stopping condition
- structured outputs, Structured Outputs-Finetuning
- test time compute, Test Time Compute-Test Time Compute
- training data, Training Data-Domain-Specific Models
- domain-specific models, Domain-Specific Models-Domain-Specific Models
- multilingual models, Multilingual Models-Multilingual Models
- use cases, Foundation Model Use Cases-Workflow Automation
- coding, Coding-Coding
- conversational bots, Conversational Bots
- data organization, Data Organization
- education, Education
- image and video production, Image and Video Production
- information aggregation, Information Aggregation
- workflow automation, Workflow Automation
- writing, Writing-Writing
- full finetuning, Parameter-Efficient Finetuning-Quantized LoRA
- function calling, Function calling-Function calling
- fuzzy matching, Lexical similarity
G
- gateways, Gateway-Gateway
- Gemini, Evaluation, Test Time Compute, Prompt caching, When the model has low confidence
- generation capability, Generation Capability-Safety
- global factual consistency, Factual consistency
- goodput, Throughput and goodput-Throughput and goodput
- GPU on-chip SRAM, Memory size and bandwidth
- ground truths, Similarity Measurements Against Reference Data
- grouped-query attention, Redesigning the attention mechanism
- guardrail implementation, Guardrail implementation
- guardrails, Control, access, and transparency, System-level defense, Step 2. Put in Guardrails-Guardrail implementation
H
- H3 architecture, Other model architectures
- hallucinations
- causes of, Hallucination-Hallucination
- defined, The Probabilistic Nature of AI
- and finetuning, Finetuning and RAG
- measurement, Factual consistency
- metrics for, Metrics
- superficial imitation and, Superficial imitation
- hard attributes, Model Selection Workflow
- hashing, Deduplicate Data
- HellaSwag, Public leaderboards
- hierarchical navigable small world (HNSW), Embedding-based retrieval
- high-bandwidth memory (HBM), Memory size and bandwidth
- hyperparameters, Scaling extrapolation, Finetuning hyperparameters-Prompt loss weight
I
- IDF (inverse document frequency), Term-based retrieval
- IFEval, Instruction-following criteria
- implicit feedback, Extracting Conversational Feedback
- in-context learning, In-Context Learning: Zero-Shot and Few-Shot-In-Context Learning: Zero-Shot and Few-Shot
- inconsistency, Inconsistency-Inconsistency, Inconsistency
- indexing
- chunking strategy and, Chunking strategy-Chunking strategy
- defined, RAG Architecture
- with embedding-based retrieval, Embedding-based retrieval
- retrieval systems and, Comparing retrieval algorithms
- indirect prompt injection, Indirect prompt injection-Indirect prompt injection
- inference APIs, Online and batch inference APIs-Online and batch inference APIs
- inference optimization, Inference optimization, Inference Optimization-Summary
- AI accelerators
- computational capabilities, Computational capabilities
- defined, What’s an accelerator?-What’s an accelerator?
- memory size and bandwidth, Memory size and bandwidth-Memory size and bandwidth
- power consumption, Power consumption-Power consumption
- case study from PyTorch, Kernels and compilers
- inference overview
- computational bottlenecks, Computational bottlenecks-Computational bottlenecks
- online and batch inference APIs, Online and batch inference APIs-Online and batch inference APIs
- inference performance metrics, Inference Performance Metrics-Utilization, MFU, and MBU
- latency, TTFT, and TPOT, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- throughput/goodput, Throughput and goodput-Throughput and goodput
- utilization, MFU, and MBU, Utilization, MFU, and MBU-Utilization, MFU, and MBU
- inference service optimization, Inference Service Optimization-Parallelism
- batching, Batching
- decoupling prefill and decode, Decoupling prefill and decode
- parallelism, Parallelism-Parallelism
- prompt caching, Prompt caching-Prompt caching
- KV cache size calculation, Attention mechanism optimization
- memory-bound versus bandwidth-bound inference, Computational bottlenecks
- at model/hardware/service levels, Inference Optimization
- model optimization, Model Optimization-Kernels and compilers
- attention mechanism optimization, Attention mechanism optimization-Writing kernels for attention computation
- autoregressive decoding bottleneck, Overcoming the autoregressive decoding bottleneck-Parallel decoding
- kernels and compilers, Kernels and compilers-Kernels and compilers
- model compression, Model compression
- understanding, Understanding Inference Optimization-Power consumption
- AI accelerators, AI Accelerators-Power consumption
- inference overview, Inference Overview-Online and batch inference APIs
- inference performance metrics, Inference Performance Metrics-Utilization, MFU, and MBU
- inference performance metrics, Inference Performance Metrics-Utilization, MFU, and MBU
- latency, TTFT, and TPOT, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- throughput/goodput, Throughput and goodput-Throughput and goodput
- utilization, MFU, and MBU, Utilization, MFU, and MBU-Utilization, MFU, and MBU
- inference quantization, Inference quantization-Inference quantization
- inference service
- defined, Open source models versus model APIs
- and inference optimization, Inference Overview
- throughput/goodput, Throughput and goodput-Throughput and goodput
- inference service optimization, Inference Service Optimization-Parallelism
- batching, Batching
- decoupling prefill and decode, Decoupling prefill and decode
- parallelism, Parallelism-Parallelism
- prompt caching, Prompt caching-Prompt caching
- inference with reference, Inference with reference
- INFOBench, Instruction-following criteria
- information aggregation, Information Aggregation
- information extraction, Information Extraction-Information Extraction
- information retrieval optimization, Retrieval Optimization-Contextual retrieval
- chunking strategy, Chunking strategy-Chunking strategy
- contextual retrieval, Contextual retrieval-Contextual retrieval
- query rewriting, Query rewriting
- reranking, Reranking
- instruction data synthesis, Instruction data synthesis-Instruction data synthesis
- instruction-following capability, Instruction-Following Capability-Roleplaying
- instruction-following criteria, Instruction-following criteria-Instruction-following criteria
- intent classifiers, Router
- inter-token latency (ITL), Latency, TTFT, and TPOT
- interface, AI, AI interface
- internal knowledge, Memory
- inverse document frequency (IDF), Term-based retrieval
- inverted file index (IVF), Embedding-based retrieval
- iteration, Iterate
J
- jailbreaking, Jailbreaking and Prompt Injection-Indirect prompt injection
- automated attacks, Automated attacks
- direct manual prompt hacking, Direct manual prompt hacking-Direct manual prompt hacking
- indirect prompt injection, Indirect prompt injection-Indirect prompt injection
- Jamba architecture, Other model architectures
- judges (see AI judge)
K
- k-nearest neighbors (k-NN), Embedding-based retrieval
- kernels, Writing kernels for attention computation, Kernels and compilers-Kernels and compilers
- key vector (K), Attention mechanism
- key-value (KV) cache, Attention mechanism optimization-Optimizing the KV cache size
- key-value vectors, Memory needed for inference
- knowledge augmentation, Knowledge augmentation
- knowledge-augmented verification, Factual consistency
- KV cache (see key-value cache)
L
- LangChain, Evaluate Prompt Engineering Tools, Prompt-level defense, Memory
- language modeling metrics, Understanding Language Modeling Metrics-Perplexity Interpretation and Use Cases
- bits-per-byte, Bits-per-Character and Bits-per-Byte
- bits-per-character, Bits-per-Character and Bits-per-Byte
- cross entropy, Cross Entropy
- entropy, Entropy
- perplexity, Perplexity
- perplexity interpretation and use cases, Perplexity Interpretation and Use Cases-Perplexity Interpretation and Use Cases
- language models, Language models-Language models, Perplexity Interpretation and Use Cases
- large language models, From Large Language Models to Foundation Models-From Large Language Models to Foundation Models
- large multimodal model (LMM), From Large Language Models to Foundation Models
- latency
- AI judges and, Increased costs and latency
- inference performance and, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- metrics, Setting Expectations
- reliability versus, Guardrail implementation
- layer stacking, Layer stacking-Layer stacking
- leaderboards, Scalability bottlenecks-Lack of standardization and quality control, Benchmark selection and aggregation-Custom leaderboards with public benchmarks
- learning rate, Learning rate
- leniency bias, Biases
- lexical similarity, Lexical similarity-Lexical similarity
- linear combination summing, Linear combination-Linear combination
- Llama
- attention function, Attention mechanism
- data coverage, Data Coverage
- data quality, Data Quality
- data quantity, Data Quantity
- data synthesis, AI-Powered Data Synthesis, Instruction data synthesis
- finetuning, Finetuning Overview
- inference optimization, Kernels and compilers
- inference quantization, Inference quantization
- model distillation, Model Distillation
- open source models, Open source, open weight, and model licenses
- preference finetuning, Post-Training, Preference Finetuning
- prompt template, System Prompt and User Prompt
- scaling law and, Scaling law: Building compute-optimal models
- LLM-as-a-judge, AI as a Judge
- (see also AI-as-a-judge)
- LMM (large multimodal model), From Large Language Models to Foundation Models
- local factual consistency, Factual consistency
- locality-sensitive hashing (LSH), Embedding-based retrieval
- logit vectors, Sampling Fundamentals
- logprobs, Temperature, Select evaluation methods
- logs, Logs and traces-Logs and traces
- long-term memory, Memory
- loop tiling, Kernels and compilers
- LoRA (low-rank adaptation), LoRA-Quantized LoRA
- configurations, LoRA configurations-LoRA configurations
- LoRA adapters service, Serving LoRA adapters-Serving LoRA adapters
- mechanism of operation, Why does LoRA work?
- quantized LoRA (QLoRA), Quantized LoRA-Quantized LoRA
- low-rank factorization, LoRA
- LSH (locality-sensitive hashing), Embedding-based retrieval
M
- Mamba architecture, Other model architectures
- manual generation, Traditional Data Synthesis Techniques-Simulation
- masked language models, Language models
- Massive Multitask Language Understanding (MMLU), Maintenance, Public leaderboards
- matches, Ranking Models with Comparative Evaluation
- MBU (model bandwidth utilization), Utilization, MFU, and MBU-Utilization, MFU, and MBU
- MCQs (multiple-choice questions), Domain-Specific Capability
- mean time to detection (MTTD), Monitoring and Observability
- mean time to response (MTTR), Monitoring and Observability
- memory, Memory-Memory
- memory bottlenecks, Memory Bottlenecks-Training quantization
- bandwidth-bound, Computational bottlenecks
- memory math, Memory Math-Memory needed for training
- memory needed for inference, Memory needed for inference
- memory needed for training, Memory needed for training-Memory needed for training
- quantization, Quantization-Training quantization
- inference quantization, Inference quantization-Inference quantization
- training quantization, Training quantization-Training quantization
- size and bandwidth, Memory size and bandwidth-Memory size and bandwidth
- memory math, Memory Math-Memory needed for training
- metrics, Metrics-Metrics
- correlations between, Evaluate your evaluation pipeline
- for AI as a judge, Criteria ambiguity-Criteria ambiguity
- for generation capability, Generation Capability
- for hallucination measurement, Factual consistency
- inference performance metrics, Inference Performance Metrics-Utilization, MFU, and MBU
- language modeling (see language modeling metrics)
- observability metrics, Monitoring and Observability
- reference-based versus reference-free, Similarity Measurements Against Reference Data
- tying evaluation metrics to business metrics, Tie evaluation metrics to business metrics
- usefulness thresholds, Setting Expectations
- MFU (model FLOPs utilization), Utilization, MFU, and MBU-Utilization, MFU, and MBU
- milestone planning, Milestone Planning
- mixture-of-experts (MoE) models, Model Size, Layer stacking
- ML engineering, AI engineering versus, AI Engineering Versus ML Engineering-AI interface
- MLP modules, Transformer block
- MMLU (Massive Multitask Language Understanding), Maintenance, Public leaderboards
- model APIs, open source models versus (see open source models, model APIs versus)
- model architecture, Model Architecture-Other model architectures
- (see also specific architectures, e.g.: transformer architecture)
- model bandwidth utilization (MBU), Utilization, MFU, and MBU-Utilization, MFU, and MBU
- model compression, Model compression
- model development, Three Layers of the AI Stack, Model development-Inference optimization
- dataset engineering, Dataset engineering
- inference optimization, Inference optimization-Inference optimization
- modeling and training, Modeling and training-Modeling and training
- model distillation, Model Distillation
- model FLOPs utilization (MFU), Utilization, MFU, and MBU-Utilization, MFU, and MBU
- model inference, Maintenance
- model merging, Model Merging and Multi-Task Finetuning-Concatenation
- concatenation, Concatenation
- layer stacking, Layer stacking-Layer stacking
- summing, Summing-Pruning redundant task-specific parameters
- model optimization, Model Optimization-Kernels and compilers
- attention mechanism optimization, Attention mechanism optimization-Writing kernels for attention computation
- attention mechanism redesign, Redesigning the attention mechanism
- KV cache size optimization, Optimizing the KV cache size
- writing kernels for attention computation, Writing kernels for attention computation
- autoregressive decoding bottleneck, Overcoming the autoregressive decoding bottleneck-Parallel decoding
- inference with reference, Inference with reference
- parallel decoding, Parallel decoding
- speculative decoding, Speculative decoding-Speculative decoding
- kernels and compilers, Kernels and compilers-Kernels and compilers
- model compression, Model compression
- model ranking, Ranking Models with Comparative Evaluation-The Future of Comparative Evaluation
- model router, Step 3. Add Model Router and Gateway-Gateway
- model selection, Model Selection-Handling data contamination
- model build versus buy, Model Build Versus Buy-On-device deployment
- open source models versus model APIs, Open source models versus model APIs-On-device deployment
- open source, open weight, and model licenses, Open source, open weight, and model licenses-Open source, open weight, and model licenses
- model selection workflow, Model Selection Workflow-Model Selection Workflow
- navigating public benchmarks, Navigate Public Benchmarks-Custom leaderboards with public benchmarks
- benchmark selection and aggregation, Benchmark selection and aggregation
- public leaderboards, Public leaderboards
- model size, Model Size-Scaling bottlenecks
- scaling bottlenecks, Scaling bottlenecks-Scaling bottlenecks
- scaling extrapolation, Scaling extrapolation
- scaling law: building compute-optimal models, Scaling law: Building compute-optimal models-Scaling law: Building compute-optimal models
- model-centric AI, Dataset Engineering
- model-level defense, Model-level defense
- modeling, Modeling-Scaling bottlenecks
- model architecture, Model Architecture-Other model architectures
- model size, Model Size-Scaling bottlenecks
- MoE (mixture-of-experts) models, Layer stacking
- monitoring, Break Complex Tasks into Simpler Subtasks, Monitoring and Observability-Drift detection
- MTTD (mean time to detection), Monitoring and Observability
- MTTR (mean time to response), Monitoring and Observability
- multi-query attention, Redesigning the attention mechanism
- multi-task finetuning, Model Merging and Multi-Task Finetuning
- multilingual training data models, Multilingual Models-Multilingual Models
- multimodal models, From Large Language Models to Foundation Models
- multiple-choice questions (MCQs), Domain-Specific Capability
N
- n-gram similarity, Lexical similarity
- natural language feedback, Natural language feedback-Sentiment
- complaints, Complaints
- early termination, Early termination
- error correction, Error correction
- sentiment, Sentiment
- natural language generation (NLG), Generation Capability-Safety
- natural language processing (NLP), Generation Capability-Safety
- needle in a haystack (NIAH) test, Context Length and Context Efficiency
O
- obscure data lineage, Obscure data lineage
- observability, Monitoring and Observability-Drift detection
- on-device deployment, On-device deployment
- online inference APIs, Online and batch inference APIs-Online and batch inference APIs
- Open CLIP, Domain-Specific Models
- open source licenses, Open source, open weight, and model licenses-Open source, open weight, and model licenses
- open source models, model APIs versus, Open source models versus model APIs-On-device deployment
- API cost versus engineering cost, API cost versus engineering cost
- control, access, and transparency, Control, access, and transparency
- data lineage and copyright, Data lineage and copyright
- data privacy, Data privacy
- functionality, Functionality
- on-device deployment, On-device deployment
- performance, Performance
- open weight models, Open source, open weight, and model licenses
- OpenAI
- batch APIs, Online and batch inference APIs
- evaluation harnesses, Navigate Public Benchmarks
- first GPT model, Self-supervision
- instruction hierarchy for model-level defense, Model-level defense
- model as a service, From Foundation Models to AI Engineering
- natural language supervision, From Large Language Models to Foundation Models
- open source APIs, Open source models versus model APIs
- progression/distillation paths, Base models
- quality of updated models, Custom leaderboards with public benchmarks
- test time compute, Test Time Compute
- operator fusion, Kernels and compilers
- optimization
- inference optimization (see inference optimization)
- of retrieval systems, Retrieval Optimization-Contextual retrieval
P
- pairwise comparison, Deduplicate Data
- parallel decoding, Parallel decoding
- parallelism, Parallelism-Parallelism
- parallelization, Break Complex Tasks into Simpler Subtasks, Kernels and compilers
- parameter-efficient finetuning, Parameter-Efficient Finetuning-Quantized LoRA
- adapter-based/soft-prompt techniques, PEFT techniques-PEFT techniques
- LoRA, LoRA-Quantized LoRA
- configurations, LoRA configurations-LoRA configurations
- how it works, Why does LoRA work?
- LoRA adapters service, Serving LoRA adapters-Serving LoRA adapters
- quantized LoRA, Quantized LoRA-Quantized LoRA
- Pareto optimization, Cost and Latency
- partial finetuning, Parameter-Efficient Finetuning
- passive phishing, Indirect prompt injection
- PEFT (see parameter-efficient finetuning)
- perplexity, Perplexity-Perplexity Interpretation and Use Cases
- perturbation, Rule-based data synthesis
- pipeline orchestration, AI Pipeline Orchestration-AI Pipeline Orchestration
- monitoring and observability, Monitoring and Observability-Drift detection
- drift detection, Drift detection
- logs and traces, Logs and traces-Logs and traces
- metrics, Metrics-Metrics
- planning
- plan generation, Plan generation-Complex plans
- complex plans, Complex plans
- function calling, Function calling-Function calling
- granularity, Planning granularity
- reflection and error correction, Reflection and error correction-Reflection and error correction
- pointwise evaluation, Reward model, Ranking Models with Comparative Evaluation
- position bias, Biases
- post-processing, Prompting
- post-training, Modeling and training, Post-Training-Finetuning using the reward model
- preference finetuning, Preference Finetuning-Finetuning using the reward model
- supervised finetuning, Supervised Finetuning-Supervised Finetuning
- potential model collapse, Potential model collapse
- power consumption, Power consumption-Power consumption
- PPO (proximal policy optimization), Finetuning using the reward model
- pre-training, Modeling and training
- precision bits, Numerical Representations
- preference bias, Biases
- preference finetuning, Preference Finetuning-Finetuning using the reward model, Finetuning Overview
- preference models, What Models Can Act as Judges?
- prefilling, Transformer architecture
- prefilling, decoupling from decoding, Decoupling prefill and decode
- proactive features, The role of AI and humans in the application
- probabilistic nature of AI, The Probabilistic Nature of AI-Hallucination
- hallucination, Hallucination-Hallucination
- inconsistency, Inconsistency-Inconsistency
- probabilistic definition, The Probabilistic Nature of AI-Hallucination
- procedural generation, Traditional Data Synthesis Techniques-Simulation
- product quantization, Embedding-based retrieval
- prompt attacks, Defensive Prompt Engineering, Jailbreaking and Prompt Injection-Indirect prompt injection
- automated attacks, Automated attacks
- defense against, Defenses Against Prompt Attacks-System-level defense
- direct manual prompt hacking, Direct manual prompt hacking-Direct manual prompt hacking
- indirect prompt injection, Indirect prompt injection-Indirect prompt injection
- prompt caching, Prompt caching-Prompt caching
- prompt catalogs, Organize and Version Prompts
- prompt engineering, Prompt Engineering-Summary
- basics, Introduction to Prompting-Context Length and Context Efficiency
- context length and context efficiency, Context Length and Context Efficiency-Context Length and Context Efficiency
- in-context learning: zero-shot and few-shot, In-Context Learning: Zero-Shot and Few-Shot-In-Context Learning: Zero-Shot and Few-Shot
- best practices, Prompt Engineering Best Practices-Organize and Version Prompts
- break complex tasks into simpler subtasks, Break Complex Tasks into Simpler Subtasks-Break Complex Tasks into Simpler Subtasks
- evaluating prompt engineering tools, Evaluate Prompt Engineering Tools-Evaluate Prompt Engineering Tools
- give the model time to think, Give the Model Time to Think-Give the Model Time to Think
- iterating on your prompts, Iterate on Your Prompts
- organize and version prompts, Organize and Version Prompts-Organize and Version Prompts
- provide sufficient context, Provide Sufficient Context
- write clear and explicit instructions, Write Clear and Explicit Instructions
- defensive engineering, Defensive Prompt Engineering-System-level defense
- information extraction, Information Extraction-Information Extraction
- jailbreaking and prompt injection, Jailbreaking and Prompt Injection-Indirect prompt injection
- prompt attacks defense, Defenses Against Prompt Attacks-System-level defense
- proprietary prompts and reverse prompt engineering, Proprietary Prompts and Reverse Prompt Engineering-Proprietary Prompts and Reverse Prompt Engineering
- defined, Prompt engineering and context construction
- restricting model knowledge to its context, Provide Sufficient Context
- terminology ambiguity: prompt versus context, In-Context Learning: Zero-Shot and Few-Shot
- prompt loss weight, Prompt loss weight
- prompt optimization, Evaluate Prompt Engineering Tools
- prompt versioning, Organize and Version Prompts-Organize and Version Prompts
- prompt-level defense, Prompt-level defense
- proprietary prompts, Proprietary Prompts and Reverse Prompt Engineering-Proprietary Prompts and Reverse Prompt Engineering
- proximal policy optimization (PPO), Finetuning using the reward model
- public leaderboards, Public leaderboards
Q
- QAT (quantization-aware training), Training quantization
- QLoRA (quantized LoRA), Quantized LoRA-Quantized LoRA
- QPS (queries per second), Comparing retrieval algorithms
- quality control, Quality control
- quantization, Quantization-Training quantization
- inference quantization, Inference quantization-Inference quantization
- training quantization, Training quantization-Training quantization
- quantization-aware training (QAT), Training quantization
- quantized LoRA (QLoRA), Quantized LoRA-Quantized LoRA
- queries per second (QPS), Comparing retrieval algorithms
- query rewriting, Query rewriting
- query vector (Q), Attention mechanism
R
- RAG (retrieval-augmented generation), RAG-RAG with tabular data
- finetuning and, Finetuning and RAG-Finetuning and RAG
- RAG architecture, RAG Architecture
- RAG beyond texts, RAG Beyond Texts-RAG with tabular data
- multimodal RAG, Multimodal RAG
- RAG with tabular data, RAG with tabular data-RAG with tabular data
- retrieval algorithms, Retrieval Algorithms-Combining retrieval algorithms
- combining, Combining retrieval algorithms
- comparing, Comparing retrieval algorithms-Comparing retrieval algorithms
- embedding-based retrieval, Embedding-based retrieval-Embedding-based retrieval
- term-based retrieval, Term-based retrieval-Term-based retrieval
- retrieval optimization, Retrieval Optimization-Contextual retrieval
- chunking strategy, Chunking strategy-Chunking strategy
- contextual retrieval, Contextual retrieval-Contextual retrieval
- query rewriting, Query rewriting
- reranking, Reranking
- random feedback, Biases
- range bits, Numerical Representations
- ranking, Similarity Measurements Against Reference Data
- rating algorithms, Ranking Models with Comparative Evaluation
- reactive features, The role of AI and humans in the application
- recall, Comparing retrieval algorithms
- recurrent neural networks (RNNs), Transformer architecture
- reference-based judges, What Models Can Act as Judges?
- reference-based metrics, Similarity Measurements Against Reference Data
- reference-free metrics, Similarity Measurements Against Reference Data
- reflection, Reflection and error correction-Reflection and error correction
- regeneration, Regeneration
- reinforcement learning from human feedback (RLHF), Preference Finetuning-Finetuning using the reward model
- relevance, Generation Capability
- reliability, latency versus, Guardrail implementation
- replica parallelism, Parallelism
- reranking, Reranking
- restricted weight, Open source, open weight, and model licenses
- retrieval algorithms, Retrieval Algorithms-Combining retrieval algorithms
- combining, Combining retrieval algorithms
- comparing, Comparing retrieval algorithms-Comparing retrieval algorithms
- embedding-based retrieval, Embedding-based retrieval-Embedding-based retrieval
- term-based retrieval, Term-based retrieval-Term-based retrieval
- retrieval optimization
- chunking strategy, Chunking strategy-Chunking strategy
- contextual retrieval, Contextual retrieval-Contextual retrieval
- query rewriting, Query rewriting
- reranking, Reranking
- retrieval-augmented generation (see RAG)
- retrievers
- combining retrieval algorithms, Combining retrieval algorithms
- main functions, RAG Architecture
- multimodal RAG and, Multimodal RAG
- quality evaluation, Comparing retrieval algorithms
- sparse versus dense, Retrieval Algorithms
- reverse prompt engineering, Proprietary Prompts and Reverse Prompt Engineering-Proprietary Prompts and Reverse Prompt Engineering
- reward models, Reward model-Reward model, What Models Can Act as Judges?
- RLHF (reinforcement learning from human feedback), Preference Finetuning-Finetuning using the reward model
- RNNs (recurrent neural networks), Transformer architecture
- RoleLLM, Roleplaying
- roleplaying, Roleplaying-Roleplaying
- routers, Router-Router
- rule-based data synthesis, Rule-based data synthesis-Rule-based data synthesis
S
- S4 architecture, Other model architectures
- safety, Safety-Safety
- safety, as evaluation criteria, Safety-Safety
- sampling, Sampling-Hallucination
- probabilistic nature of AI, The Probabilistic Nature of AI-Hallucination
- sampling fundamentals, Sampling Fundamentals-Sampling Fundamentals
- sampling strategies, Sampling Strategies-Stopping condition
- stopping condition, Stopping condition
- temperature, Temperature-Temperature
- top-k, Top-k
- top-p, Top-p
- structured outputs, Structured Outputs-Finetuning
- test time compute, Test Time Compute-Test Time Compute
- scaling bottlenecks, Scaling bottlenecks-Scaling bottlenecks, Scalability bottlenecks
- scaling extrapolation, Scaling extrapolation
- scaling law, Scaling law: Building compute-optimal models-Scaling law: Building compute-optimal models
- scoring rubrics, Create scoring rubrics with examples
- self-evaluation, What Models Can Act as Judges?
- self-supervision language models, Self-supervision-Self-supervision
- self-verification, Factual consistency
- semantic caching, Semantic caching
- semantic similarity, Semantic similarity-Semantic similarity
- sequence parallelism, Parallelism
- sequential finetuning, Model Merging and Multi-Task Finetuning
- SFT (supervised finetuning), Post-Training, Supervised Finetuning-Supervised Finetuning, Finetuning Overview
- short-term memory, Memory
- simulation, Simulation
- simultaneous finetuning, Model Merging and Multi-Task Finetuning
- SLERP (spherical linear interpolation), Spherical linear interpolation (SLERP)
- slicing, Annotate evaluation data
- soft attributes, Model Selection Workflow
- soft prompt-based PEFT methods, PEFT techniques-PEFT techniques
- sparse models, Model Size, Model compression
- sparse retrievers, Retrieval Algorithms
- speculative decoding, Speculative decoding-Speculative decoding
- spherical linear interpolation (SLERP), Spherical linear interpolation (SLERP)
- SQL queries, Agent Overview
- static batching, Batching
- static features, The role of AI and humans in the application
- stopping condition, Stopping condition
- structured data, Perplexity Interpretation and Use Cases, Memory
- structured outputs, Structured Outputs-Finetuning
- constrained sampling, Constrained sampling
- finetuning, Finetuning
- post-processing, Prompting
- summing, Summing-Pruning redundant task-specific parameters
- linear combination, Linear combination-Linear combination
- pruning redundant task-specific parameters, Pruning redundant task-specific parameters
- spherical linear interpolation (SLERP), Spherical linear interpolation (SLERP)
- superficial imitation, Superficial imitation
- supervised finetuning (SFT), Post-Training, Supervised Finetuning-Supervised Finetuning, Finetuning Overview
- supervision, Self-supervision
- synthesis of data (see data synthesis)
- system components evaluation, Step 1. Evaluate All Components in a System-Step 1. Evaluate All Components in a System
- creating scoring rubrics with examples, Create scoring rubrics with examples
- defining evaluation criteria, Define evaluation criteria
- tying evaluation metrics to business metrics, Tie evaluation metrics to business metrics
- system prompts, System Prompt and User Prompt-System Prompt and User Prompt
- system-level defense, System-level defense
- systems evaluation, Evaluate AI Systems-Summary
- evaluation criteria, Evaluation Criteria-Cost and Latency
- cost and latency, Cost and Latency-Cost and Latency
- domain-specific capability, Domain-Specific Capability-Domain-Specific Capability
- evaluation-driven development, Evaluation Criteria-Evaluation Criteria
- generation capability, Generation Capability-Safety
- instruction-following capability, Instruction-Following Capability-Roleplaying
- evaluation pipeline design, Design Your Evaluation Pipeline-Iterate
- step 1: evaluating all components in a system, Step 1. Evaluate All Components in a System-Step 1. Evaluate All Components in a System
- step 2: creating an evaluation guideline, Step 2. Create an Evaluation Guideline-Tie evaluation metrics to business metrics
- step 3: defining evaluation methods and data, Step 3. Define Evaluation Methods and Data-Iterate
- evaluation-driven development, Evaluation Criteria-Evaluation Criteria
- model selection, Model Selection-Handling data contamination
- data contamination with public benchmarks, Data contamination with public benchmarks-Handling data contamination
- model build versus buy, Model Build Versus Buy-On-device deployment
- model selection workflow, Model Selection Workflow-Model Selection Workflow
- navigating public benchmarks, Navigate Public Benchmarks-Custom leaderboards with public benchmarks
- OpenAI model quality, Custom leaderboards with public benchmarks
T
- task-based evaluation, Step 1. Evaluate All Components in a System
- temperature, Temperature-Temperature
- term frequency (TF), Term-based retrieval
- text-to-SQL, Structured Outputs, Functional Correctness, RAG with tabular data
- throughput, Throughput and goodput-Throughput and goodput
- time between tokens (TBT), Latency, TTFT, and TPOT
- time per output token (TPOT), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- time to first token (TTFT), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- tokenization, Multilingual Models, Model Size, Bits-per-Character and Bits-per-Byte, Term-based retrieval, Chunking strategy
- defined, Language models
- tokenizer, Chunking strategy
- tokens, Language models, Model Size
- tool use, Tool selection
- top-k, Top-k
- top-p, Top-p
- TPOT (time per output token), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- traces, Logs and traces
- trainable parameters, Backpropagation and Trainable Parameters-Backpropagation and Trainable Parameters
- training, Modeling and training-Modeling and training
- training data, Training Data-Domain-Specific Models
- domain-specific models, Domain-Specific Models-Domain-Specific Models
- multilingual models, Multilingual Models-Multilingual Models
- training quantization, Training quantization-Training quantization
- transfer learning, Finetuning Overview
- transformer architecture, Transformer architecture-Transformer block
- attention mechanism, Attention mechanism-Attention mechanism
- attention modules, Transformer block
- MLP modules, Transformer block
- transformer blocks, Transformer block-Transformer block
- attention modules, Transformer block
- embedding modules, Transformer block
- MLP modules, Transformer block
- output layers, Transformer block
- TruthfulQA, Public leaderboards
- TTFT (time to first token), Setting Expectations, Latency, TTFT, and TPOT-Latency, TTFT, and TPOT
- turn-based evaluation, Step 1. Evaluate All Components in a System
U
- unstructured data, Data Organization, Memory
- use case evaluation, Use Case Evaluation-AI product defensibility
- AI product defensibility, AI product defensibility
- role of AI and humans in the application, The role of AI and humans in the application-The role of AI and humans in the application
- usefulness threshold, Setting Expectations
- user feedback, User Feedback-Degenerate feedback loop
- extracting conversational feedback, Extracting Conversational Feedback-Dialogue diversity
- natural language feedback, Natural language feedback-Sentiment
- other conversational feedback, Other conversational feedback-Dialogue diversity
- feedback design, Feedback Design-How to collect feedback
- when to collect feedback, When to collect feedback
- feedback limitations, Feedback Limitations-Degenerate feedback loop
- biases, Biases
- degenerate feedback loops, Degenerate feedback loop
V
- value vector (V), Attention mechanism
- vector database, Embedding-based retrieval-Embedding-based retrieval
- vectorization, Kernels and compilers
- vocabulary, Perplexity Interpretation and Use Cases
- defined, Language models
W
- WinoGrande, Public leaderboards
- workflow automation, Workflow Automation
- write actions, Write actions