

Chapter 1 - Course Overview

1.5 KEY TERMINOLOGY & DEFINITIONS

Purpose: This comprehensive glossary defines all technical terms, acronyms, and concepts used throughout the course. Reference this section whenever you encounter unfamiliar terminology.

(Don't try to memorize all of this now. That would be a waste of time. Skim it once to see what's here, then come back when you need it. Think of this as a dictionary, not a textbook chapter. Nobody reads dictionaries cover-to-cover for a reason.)

Core Terms, Acronyms & Abbreviations

A-E

ANN (Approximate Nearest Neighbor): An algorithm for finding points in a dataset that are closest to a query point, with some tolerance for error in exchange for speed. Used in vector databases for efficient similarity search.

API (Application Programming Interface): A set of protocols and tools that allow different software applications to communicate with each other.

BERT (Bidirectional Encoder Representations from Transformers): A transformer-based language model developed by Google that processes text bidirectionally (looking at both left and right context simultaneously).

BFS (Breadth-First Search): A graph traversal algorithm that explores all neighbors at the current depth before moving to nodes at the next depth level.

BM25 (Best Matching 25): A ranking function used in information retrieval to estimate the relevance of documents to a given search query, based on term frequency and document length.
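
The BM25 formula can be sketched in a few lines of Python. This is an illustrative implementation, not a production one: documents are plain token lists, and the function name and the default `k1`/`b` values (common choices, not mandated by the definition) are my own.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs   # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                       # term frequency in this doc
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
        score += idf * tf * (k1 + 1) / norm
    return score
```

Note how the `norm` term penalizes documents longer than average - the document-length balancing the definition above mentions.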

CBOW (Continuous Bag of Words): A word embedding model that predicts a target word from its surrounding context words.

Cypher: A declarative graph query language created for Neo4j, using ASCII-art syntax to represent graph patterns.

DAG (Directed Acyclic Graph): A directed graph with no cycles - you cannot start at a node and follow directed edges back to that same node.

DFS (Depth-First Search): A graph traversal algorithm that explores as far as possible along each branch before backtracking.
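
The two traversals above differ only in how they queue unvisited nodes. A minimal Python sketch, assuming a graph stored as an adjacency dict (function and variable names are illustrative):

```python
from collections import deque

def bfs(graph, start):
    """Visit nodes level by level, returning them in discovery order."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()          # FIFO: oldest discovery first
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

def dfs(graph, start, visited=None):
    """Explore each branch as far as possible before backtracking."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph.get(start, []):
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
```

On this graph, BFS from "A" visits B and C before D (level by level), while DFS from "A" runs A, B, D to the end of the branch before backtracking to C.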

DPR (Dense Passage Retrieval): A neural retrieval method that encodes queries and documents as dense vectors for similarity-based retrieval.

EHR (Electronic Health Records): A digital version of a patient's paper chart, containing medical history, diagnoses, medications, and treatment plans.

Embedding: A learned, dense vector representation of data (text, images, graphs) in a continuous vector space where semantically similar items are close together.
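
"Close together" for embeddings is most often measured with cosine similarity - the cosine of the angle between two vectors, which is 1.0 for vectors pointing the same way and 0.0 for unrelated (orthogonal) ones. A minimal sketch (function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))  # dot / (|a| * |b|)
```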

F-M

FAISS (Facebook AI Similarity Search): A library developed by Meta for efficient similarity search and clustering of dense vectors, optimized for billion-scale datasets.

FFN (Feed-Forward Network): A neural network layer where information moves in only one direction, from input through hidden layers to output, with no cycles.

GCN (Graph Convolutional Network): A type of neural network that operates on graph-structured data by aggregating information from node neighborhoods.

GPT (Generative Pre-trained Transformer): A series of large language models developed by OpenAI that use decoder-only transformer architecture for text generation.

Hallucination: When an LLM generates information that sounds plausible but is factually incorrect or not grounded in the provided context.

IDF (Inverse Document Frequency): A measure of how much information a word provides - rare words have high IDF, common words have low IDF.

KG (Knowledge Graph): A structured representation of knowledge as entities (nodes) and relationships (edges), often with properties attached to both.

kNN (k-Nearest Neighbors): An algorithm that finds the k closest points to a query point in a dataset, used for classification, regression, or retrieval.
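
A brute-force kNN lookup is a one-liner: compute every distance, sort, keep the first k. This exhaustive version is exact but slow on large datasets - which is exactly the cost that ANN methods trade away (function name is illustrative):

```python
import math

def knn(points, query, k):
    """Return the k points nearest to query by Euclidean distance."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]
```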

LLM (Large Language Model): A neural network with billions of parameters trained on vast amounts of text data to understand and generate human-like text.

LSA (Latent Semantic Analysis): A technique for analyzing relationships between documents and terms using singular value decomposition of term-document matrices.

N-Z

NER (Named Entity Recognition): The task of identifying and classifying named entities (people, organizations, locations, etc.) in text.

NLP (Natural Language Processing): A field of AI focused on enabling computers to understand, interpret, and generate human language.

Ontology: A formal specification of concepts and relationships within a domain, defining what things exist and how they relate.

PageRank: An algorithm that measures the importance of nodes in a graph based on the structure of incoming links, originally developed for ranking web pages.
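
PageRank is usually computed by power iteration: start every node with equal rank, then repeatedly let each node split its rank among its outgoing links. A minimal sketch, assuming an adjacency-dict graph (the damping factor 0.85 is the conventional default; function and variable names are my own):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a graph given as {node: [out-links]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}          # start uniform
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share          # pass rank along edges
            else:
                for target in nodes:                   # dangling node: spread evenly
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank
```

Ranks always sum to 1, so a node's score reads as the fraction of total "importance" it holds.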

RAG (Retrieval-Augmented Generation): A technique that combines information retrieval with text generation - retrieve relevant documents, then generate answers based on them.

RDF (Resource Description Framework): A framework for representing information about resources on the web, using subject-predicate-object triples.

Reranking: A second-stage ranking process that reorders initially retrieved results using more sophisticated (and computationally expensive) relevance signals.

RNN (Recurrent Neural Network): A neural network architecture designed for sequential data, where outputs from previous steps feed back as inputs.

SPARQL: A query language for RDF databases, similar to SQL but designed for graph-structured data.

TF (Term Frequency): A measure of how frequently a term appears in a document.

TF-IDF (Term Frequency-Inverse Document Frequency): A numerical statistic that reflects how important a word is to a document in a collection, balancing term frequency against rarity.
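
The three entries above combine directly: multiply a term's frequency in a document by the log-scaled rarity of the term across the corpus. A minimal sketch, assuming tokenized documents as lists of strings (there are several common IDF variants; this uses the plain log(N/df) form, and the function name is illustrative):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF for a term in one tokenized document of a corpus."""
    tf = doc.count(term) / len(doc)           # term frequency
    df = sum(1 for d in corpus if term in d)  # document frequency
    if df == 0:
        return 0.0                            # term absent from the corpus
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf
```

A word appearing in every document gets idf = log(1) = 0, so ubiquitous words score zero no matter how often they occur - the "balancing frequency against rarity" in the definition above.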

Transformer: A neural network architecture based on self-attention mechanisms that processes all positions of a sequence simultaneously, enabling parallelization.

Vector Database: A specialized database optimized for storing and querying high-dimensional vector embeddings, supporting operations like similarity search.

