HCC Extraction Project
An automated system for extracting HCC-relevant conditions from clinical progress notes using AI.
Overview
This project implements an AI-based solution that automates the extraction of medical conditions and their associated codes from clinical progress notes, determining which are relevant for HCC (Hierarchical Condition Category). Using Vertex AI Gemini 1.5 Flash and LangGraph, the system transforms a tedious and error-prone process into an efficient workflow, allowing healthcare professionals to focus more on patient care.
System Architecture
The project follows a layered architecture that clearly separates:
- Condition extraction: Identifies medical conditions and their codes from progress notes
- HCC relevance evaluation: Determines which of the extracted conditions are relevant for HCC
- LangGraph orchestration: Manages the workflow between different components
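The separation between these layers can be sketched in plain Python. This is only an illustration of the data flow, not the project's LangGraph code: the state fields, function names, and placeholder return values are all hypothetical, and a simple loop stands in for the graph orchestration.

```python
from typing import Callable, TypedDict


class NoteState(TypedDict, total=False):
    """Shared state passed between steps (hypothetical field names)."""
    note_text: str
    conditions: list[dict]   # e.g. {"name": ..., "code": ...}
    evaluated: list[dict]    # conditions plus an "hcc_relevant" flag


Step = Callable[[NoteState], NoteState]


def run_pipeline(state: NoteState, steps: list[Step]) -> NoteState:
    """Run each step in order, merging its partial update into the state."""
    for step in steps:
        state = {**state, **step(state)}
    return state


def extract_conditions(state: NoteState) -> NoteState:
    # Placeholder: the real extraction layer calls the model on note_text.
    return {"conditions": [{"name": "Essential hypertension", "code": "I10"}]}


def evaluate_hcc(state: NoteState) -> NoteState:
    # Placeholder: the real evaluation layer consults the HCC reference data.
    relevant_codes = {"E11.65"}  # illustrative only
    evaluated = [
        {**cond, "hcc_relevant": cond["code"] in relevant_codes}
        for cond in state["conditions"]
    ]
    return {"evaluated": evaluated}


result = run_pipeline({"note_text": "..."}, [extract_conditions, evaluate_hcc])
```

Each step returns only the keys it owns, so the extraction and evaluation layers can be tested in isolation.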
Main components:
- src/main.py: Main entry point of the application
- src/agent.py: Defines the agent configuration
- src/extraction/extractor.py: Handles condition extraction from notes
- src/evaluation/evaluator.py: Evaluates the HCC relevance of extracted conditions
- src/utils/: Contains utilities for data handling, logging, and constants
Prerequisites
- Poetry for development (optional)
- Docker and Docker Compose
- Google Cloud service account with access to Vertex AI
- Google Cloud credentials file (JSON)
Configuration
Google Cloud Credentials
To access Vertex AI Gemini 1.5 Flash, you need to configure a credentials file:
- Create a service account in Google Cloud Console with access to Vertex AI
- Generate and download a JSON key file for the service account
- Place the credentials file in the project root as credentials.json
Set Required Google Cloud Credentials Environment Variable on Windows
Follow these steps:
- Open the Control Panel.
- Search for Environment Variables.
- Click the Edit Environment Variables button.
- In the System variables section, click New.
- In the Variable name field, enter GOOGLE_APPLICATION_CREDENTIALS.
- In the Variable value field, enter the full path to your credentials file, for example:
C:\path\to\your\credentials.json
- Click OK and close the settings windows.
- Verify that the environment variable is set correctly. Open a new terminal so the change is picked up, then run the command for your shell:
PowerShell
$env:GOOGLE_APPLICATION_CREDENTIALS
CMD
echo %GOOGLE_APPLICATION_CREDENTIALS%
Set Required Google Cloud Credentials Environment Variable on Linux
Follow these steps:
- Open your terminal.
- Edit your shell’s profile configuration file:
nano ~/.bashrc
- Add the following line:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials.json
- Save the file and exit the editor (in nano, press Ctrl + X, then Y, then Enter).
- Reload the profile to apply the changes:
source ~/.bashrc
- Verify that the environment variable is set correctly:
echo $GOOGLE_APPLICATION_CREDENTIALS
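The same check can be done from Python before the application starts, which gives a clearer error than a failed Vertex AI call. The helper below is a hypothetical sketch and is not part of the project; it only confirms that the variable is set and points to a parseable JSON file, not that the key is valid.

```python
import json
import os
from pathlib import Path


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> Path:
    """Return the credentials path, or raise RuntimeError with a readable message."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set")
    path = Path(value)
    if not path.is_file():
        raise RuntimeError(f"{env_var} points to a missing file: {path}")
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{path} is not valid JSON: {exc}") from exc
    return path
```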
Installation and Execution
Using Docker (recommended)
- Build and run the Docker container:
docker-compose up --build
This automatically processes all progress notes in data/input/ and saves the results in data/output/.
- To access the processed results:
ls -la data/output/
Manual Installation (development)
- Set up the environment using Poetry:
# Install Poetry if not already installed
pip install poetry
# Install dependencies
poetry install
- Run the application:
# Using Poetry
poetry run python ./src/main.py
Usage
Processing Progress Notes
- Place clinical progress notes in text format in the data/input/ directory
- Run the application following the instructions above
- Review the processed results in data/output/
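The batch behaviour described in these steps can be pictured as a loop over the input directory. The sketch below is an assumption about the shape of that loop rather than the code in src/main.py: the .txt extension, the matching output file names, and the process_note callback are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def process_directory(
    input_dir: Path,
    output_dir: Path,
    process_note: Callable[[str], str],
) -> list[Path]:
    """Run process_note on every .txt file and write one result file per note."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for note_path in sorted(input_dir.glob("*.txt")):
        result = process_note(note_path.read_text(encoding="utf-8"))
        out_path = output_dir / note_path.name
        out_path.write_text(result, encoding="utf-8")
        written.append(out_path)
    return written
```

A call such as process_directory(Path("data/input"), Path("data/output"), my_pipeline) would then mirror the steps above, with my_pipeline standing in for the extraction and evaluation workflow.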
Input/Output Format
- Input: Text files with clinical progress notes
- Output: Files with extracted conditions, codes, and their HCC relevance
Example output:
Patient Information:
Name: ROOB, NATHANIAL
Age: 84
DOB: 06/17/1940
Insurance #: 123456789
Medical Conditions:
HCC Relevant:
- Hyperglycemia due to type 2 diabetes mellitus (Code: E11.65)
- Chronic obstructive lung disease (Code: J44.9)
- Chronic systolic heart failure (Code: I50.22)
- Chronic kidney disease stage 4 (Code: N18.4)
- Morbid obesity (Code: E66.01)
HCC Not Relevant:
- Gastroesophageal reflux disease (Code: K21.9)
- Essential hypertension (Code: I10)
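For downstream use, a report in this layout can be read back into structured data. The parser below is a minimal sketch that assumes the exact headings and the "- name (Code: X)" line shape shown in the example; the function name is hypothetical and the project may expose its results differently.

```python
import re

# Matches lines such as "- Essential hypertension (Code: I10)".
CONDITION_LINE = re.compile(r"^-\s*(?P<name>.+?)\s*\(Code:\s*(?P<code>[^)]+)\)\s*$")


def parse_report(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group (name, code) pairs under the heading they appear beneath."""
    groups: dict[str, list[tuple[str, str]]] = {
        "HCC Relevant": [],
        "HCC Not Relevant": [],
    }
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        heading = line.rstrip(":")
        if heading in groups:
            current = heading
            continue
        match = CONDITION_LINE.match(line)
        if match and current is not None:
            groups[current].append((match["name"], match["code"].strip()))
    return groups
```

Lines that are neither a known heading nor a condition entry, such as the patient information block, are ignored.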
Running the LangGraph Development Web App
To start the LangGraph development interface:
# Using Poetry
poetry run langgraph dev
This will start the web application where you can:
- Visualize the workflow graph
- Test different inputs
- Inspect the state at each step of the process
- Debug the behavior of nodes
Running Tests
To run the unit tests:
# Using Poetry
poetry run pytest
Project Structure
├── 📄 .gitignore # Files and directories ignored by Git
├── 📄 credentials.json # Google Cloud credentials (do not include in repo)
├── 📁 data # Input, output, and reference data
│ ├── 📁 input # Progress notes to process
│ ├── 📁 output # Processed results
│ └── 📁 reference # Reference data (HCC codes)
├── 📁 src # Main application directory
│ ├── 📄 agent.py # AI agent and Vertex AI configuration
│ ├── 📄 main.py # Main entry point
│ ├── 📁 evaluation # HCC relevance evaluation
│ ├── 📁 extraction # Condition extraction
│ ├── 📁 langgraph # LangGraph implementation
│ ├── 📁 models # Models and data structures
│ └── 📁 utils # Utilities and tools
├── 📁 tests # Unit tests
├── 📄 docker-compose.yaml # Docker Compose configuration
├── 📄 dockerfile # Docker container definition
├── 📄 langgraph.json # LangGraph configuration
├── 📄 poetry.lock # Dependency version lock
└── 📄 pyproject.toml # Poetry and project configuration
Dockerfile
The Dockerfile is configured to:
- Build an environment with all necessary dependencies
- Automatically process all notes in the input directory
- Store the results in a mounted volume
Methodology and Approach
Condition Extraction
The solution specifically looks for the “assessment/plan” section in progress notes, where medical conditions are listed. We use Gemini 1.5 Flash to:
- Identify and extract this specific section
- Recognize medical conditions and their associated codes
- Structure the information
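To make the target of these steps concrete, here is a regex-based stand-in for locating the section and spotting code-like tokens. It is not the Gemini-based extraction the project uses: the heading variants, the assumption that the section runs to the end of the note, and the rough ICD-10-style pattern are all simplifications chosen for illustration, and the pattern will also match unrelated tokens of the same shape.

```python
import re

# Heading variants such as "Assessment/Plan", "Assessment / Plan:", "ASSESSMENT AND PLAN".
SECTION_START = re.compile(
    r"^\s*assessment\s*(?:/|&|and)\s*plan\b.*$",
    re.IGNORECASE | re.MULTILINE,
)
# Rough ICD-10-style shape: letter, digit, alphanumeric, optional ".xxxx" suffix.
CODE_LIKE = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def assessment_plan_section(note: str) -> str:
    """Return the text after the assessment/plan heading, or "" if absent."""
    match = SECTION_START.search(note)
    return note[match.end():].strip() if match else ""


def candidate_codes(section: str) -> list[str]:
    """Return code-like tokens in order of first appearance, without duplicates."""
    return list(dict.fromkeys(CODE_LIKE.findall(section)))
```

A deterministic pass like this can also serve as a sanity check on model output, for example by flagging extracted codes that never appear in the section text.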
HCC Relevance Evaluation
The evaluation uses:
- Reference database of HCC-relevant codes
- Comparison logic to determine relevance
- Calculation of confidence scores for each condition
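The comparison step can be sketched as a set lookup against the reference codes. The CSV layout with a "code" column, the function names, and the normalization rule are assumptions made for illustration; confidence scoring is left out because the scoring rule is not described here.

```python
import csv
import io


def normalize(code: str) -> str:
    """Uppercase and drop dots and surrounding whitespace, so "e11.65" matches "E1165"."""
    return code.replace(".", "").strip().upper()


def load_reference_codes(csv_file, column: str = "code") -> set[str]:
    """Read HCC-relevant codes from a CSV with a header row (assumed layout)."""
    return {
        normalize(row[column])
        for row in csv.DictReader(csv_file)
        if row.get(column)
    }


def split_by_relevance(conditions, reference):
    """Partition (name, code) pairs into HCC-relevant and not relevant."""
    relevant, not_relevant = [], []
    for name, code in conditions:
        target = relevant if normalize(code) in reference else not_relevant
        target.append((name, code))
    return relevant, not_relevant


# In-memory CSV standing in for a file under data/reference/ (contents illustrative).
sample = io.StringIO("code,description\nE11.65,example row\nN184,example row\n")
reference = load_reference_codes(sample)
relevant, not_relevant = split_by_relevance(
    [("Chronic kidney disease stage 4", "N18.4"), ("Essential hypertension", "I10")],
    reference,
)
```

Normalizing both sides before comparing avoids misses caused by dotted versus undotted code formats.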