HCC Extraction Project | Félix Suárez

Overview

This project automates the extraction of medical conditions and their associated codes from clinical progress notes and determines which of them are relevant for HCC (Hierarchical Condition Category). It uses Vertex AI Gemini 1.5 Flash and LangGraph to turn a tedious, error-prone manual review into an automated workflow, so healthcare professionals can spend more time on patient care.

System Architecture

The project follows a layered architecture that clearly separates:

Condition extraction: Identifies medical conditions and their codes from progress notes
HCC relevance evaluation: Determines which of the extracted conditions are relevant for HCC
LangGraph orchestration: Manages the workflow between different components

Main components:

src/main.py: Main entry point of the application
src/agent.py: Defines the agent configuration
src/extraction/extractor.py: Handles condition extraction from notes
src/evaluation/evaluator.py: Evaluates the HCC relevance of extracted conditions
src/utils/: Contains utilities for data handling, logging, and constants

Prerequisites

Poetry for development (optional)
Docker and Docker Compose
Google Cloud service account with access to Vertex AI
Google Cloud credentials file (JSON)

Configuration

Google Cloud Credentials

To access Vertex AI Gemini 1.5 Flash, you need to configure a credentials file:

Create a service account in Google Cloud Console with access to Vertex AI
Generate and download a JSON key file for the service account
Place the credentials file in the project root as credentials.json

Set Required Google Cloud Credentials Environment Variable on Windows

Follow these steps:

Open the Control Panel.
Search for Environment Variables.
Click the Edit Environment Variables button.
In the System variables section, click New.
In the Variable name field, enter GOOGLE_APPLICATION_CREDENTIALS.
In the Variable value field, enter the full path to your credentials file, for example:
```
 C:\path\to\your\credentials.json
```
Click OK and close the settings windows.

Verify the Environment Variable is Set Correctly

PowerShell

 $env:GOOGLE_APPLICATION_CREDENTIALS

CMD

 echo %GOOGLE_APPLICATION_CREDENTIALS%

Set Required Google Cloud Credentials Environment Variable on Linux

Follow these steps:

Open your terminal.

Edit your shell’s profile configuration file:

 nano ~/.bashrc

Add the following line at the end of the file:

 export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials.json

Save the file and exit the editor (in nano, press Ctrl + X, then press Y, and then Enter).
Reload the profile to apply the changes:
```
 source ~/.bashrc
```
Verify that the environment variable is set correctly:
```
 echo $GOOGLE_APPLICATION_CREDENTIALS
```
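
The same check can be done from Python on either Windows or Linux before starting the application. The following is a minimal sketch, not part of the project code: the helper name `check_credentials` is hypothetical, and it only confirms that the variable is set and that the file parses as a JSON object, not that the key is valid for Vertex AI.

```python
import json
import os
from pathlib import Path


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> Path:
    """Return the credentials path, or raise if it is unset or unusable."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set")

    path = Path(value)
    if not path.is_file():
        raise RuntimeError(f"{env_var} points to a missing file: {path}")

    # A service account key is a JSON object; this only checks that it parses.
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise RuntimeError(f"{path} does not contain a JSON object")
    return path
```

Calling `check_credentials()` at startup returns the path when everything is in place and raises a `RuntimeError` with a readable message otherwise.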

Installation and Execution

Using Docker (recommended)

Build and run the Docker container:

docker-compose up --build

This will automatically process all progress notes in data/input/ and save the results in data/output/.

To access the processed results:

ls -la data/output/

Manual Installation (development)

Set up the environment using Poetry:

# Install Poetry if not already installed
pip install poetry

# Install dependencies
poetry install

Run the application:

# Using Poetry
poetry run python ./src/main.py

Usage

Processing Progress Notes

Place clinical progress notes in text format in the data/input/ directory
Run the application following the instructions above
Review the processed results in data/output/
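
The three steps above amount to a simple batch loop over the input directory. The sketch below illustrates that loop under stated assumptions: notes are `.txt` files, each result is written under the same file name, and `process_note` is a hypothetical placeholder for the extraction and evaluation pipeline. The actual logic in src/main.py may differ.

```python
from pathlib import Path
from typing import Callable, List


def process_directory(
    input_dir: Path,
    output_dir: Path,
    process_note: Callable[[str], str],
) -> List[Path]:
    """Run process_note on every .txt file in input_dir and write the results."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sorted so that runs are reproducible regardless of filesystem order.
    for note_path in sorted(input_dir.glob("*.txt")):
        report = process_note(note_path.read_text(encoding="utf-8"))
        target = output_dir / note_path.name
        target.write_text(report, encoding="utf-8")
        written.append(target)
    return written
```

For example, `process_directory(Path("data/input"), Path("data/output"), my_pipeline)` would process every note once, where `my_pipeline` is any function that takes the note text and returns the report text.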

Input/Output Format

Input: Text files with clinical progress notes
Output: Files with extracted conditions, codes, and their HCC relevance

Example output:

Patient Information:
	Name: ROOB, NATHANIAL
	Age: 84
	DOB: 06/17/1940
	Insurance #: 123456789

Medical Conditions:
	HCC Relevant:
	  - Hyperglycemia due to type 2 diabetes mellitus (Code: E11.65)
	  - Chronic obstructive lung disease (Code: J44.9)
	  - Chronic systolic heart failure (Code: I50.22)
	  - Chronic kidney disease stage 4 (Code: N18.4)
	  - Morbid obesity (Code: E66.01)
	HCC Not Relevant:
	  - Gastroesophageal reflux disease (Code: K21.9)
	  - Essential hypertension (Code: I10)
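
If the results need to be consumed by another tool, the report can be read back into a structure. This is a rough sketch that assumes the exact layout shown in the example above (the two headings and the `- Name (Code: X)` line shape); `parse_conditions` is a hypothetical helper, not part of the project.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as "- Essential hypertension (Code: I10)".
CONDITION_LINE = re.compile(r"^-\s*(?P<name>.+?)\s*\(Code:\s*(?P<code>[^)]+)\)$")


def parse_conditions(report: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group (name, code) pairs by the heading they appear under."""
    groups: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if line in ("HCC Relevant:", "HCC Not Relevant:"):
            current = line.rstrip(":")
            groups[current] = []
            continue
        match = CONDITION_LINE.match(line)
        # Condition lines before the first heading are ignored.
        if match and current is not None:
            groups[current].append((match["name"], match["code"].strip()))
    return groups
```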

Running the LangGraph Development Web App

To start the LangGraph development interface:

# Using Poetry
poetry run langgraph dev

This will start the web application where you can:

Visualize the workflow graph
Test different inputs
Inspect the state at each step of the process
Debug the behavior of nodes

Running Tests

To run the unit tests:

# Using Poetry
poetry run pytest

Project Structure

├── 📄 .gitignore                  # Files and directories ignored by Git
├── 📄 credentials.json            # Google Cloud credentials (do not include in repo)
├── 📁 data                        # Input, output, and reference data
│   ├── 📁 input                   # Progress notes to process
│   ├── 📁 output                  # Processed results
│   └── 📁 reference               # Reference data (HCC codes)
├── 📁 src                         # Main application directory
│   ├── 📄 agent.py                # AI agent and Vertex AI configuration
│   ├── 📄 main.py                 # Main entry point
│   ├── 📁 evaluation              # HCC relevance evaluation
│   ├── 📁 extraction              # Condition extraction
│   ├── 📁 langgraph               # LangGraph implementation
│   ├── 📁 models                  # Models and data structures
│   └── 📁 utils                   # Utilities and tools
├── 📁 tests                       # Unit tests
├── 📄 docker-compose.yaml         # Docker Compose configuration
├── 📄 dockerfile                  # Docker container definition
├── 📄 langgraph.json              # LangGraph configuration
├── 📄 poetry.lock                 # Dependency version lock
└── 📄 pyproject.toml              # Poetry and project configuration

Dockerfile

The Dockerfile is configured to:

Build an environment with all necessary dependencies
Automatically process all notes in the input directory
Store the results in a mounted volume

Methodology and Approach

Condition Extraction

The solution specifically looks for the “assessment/plan” section in progress notes, where medical conditions are listed. We use Gemini 1.5 Flash to:

Identify and extract this specific section
Recognize medical conditions and their associated codes
Structure the information
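
To make the three steps concrete, here is a rule-based stand-in that uses regular expressions instead of Gemini 1.5 Flash. It is only an illustration of the input and output shape, not the project's method. The heading variants, the assumption that each condition sits on one line with its code, and both function names are hypothetical.

```python
import re
from typing import List, Tuple

# Heading variants such as "Assessment/Plan" or "Assessment and Plan".
SECTION_HEADING = re.compile(
    r"^\s*assessment\s*(?:/|&|and)\s*plan\b.*$", re.IGNORECASE | re.MULTILINE
)
# Shape of an ICD-10-CM code: letter, digit, alphanumeric, optional dotted suffix.
# This checks the shape only, so unrelated tokens like "B12" would also match.
ICD10_CODE = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_assessment_plan(note: str) -> str:
    """Return the text after the assessment/plan heading, or an empty string."""
    match = SECTION_HEADING.search(note)
    return note[match.end():].strip() if match else ""


def extract_conditions(section: str) -> List[Tuple[str, str]]:
    """Return (condition, code) pairs for lines that contain a code."""
    conditions = []
    for line in section.splitlines():
        match = ICD10_CODE.search(line)
        if not match:
            continue
        # Treat the text before the code as the condition name,
        # dropping a leading list number such as "1." or "2)".
        name = re.sub(r"^\s*\d+[.)]\s*", "", line[: match.start()])
        name = name.strip(" \t-:(")
        if name:
            conditions.append((name, match.group()))
    return conditions
```

A language model handles free-form phrasing, missing codes, and unusual section headings far better than patterns like these, which is the reason the project delegates this step to Gemini.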

HCC Relevance Evaluation

The evaluation uses:

Reference database of HCC-relevant codes
Comparison logic to determine relevance
Calculation of confidence scores for each condition
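
A minimal sketch of how these three pieces could fit together is shown below. The scoring rule is an assumption made purely for illustration, since the project's actual confidence calculation is not described here: an exact match scores 1.0, a code that only shares its three-character category with a listed code is reported as not relevant with a confidence of 0.5, and anything else is not relevant with a confidence of 1.0. The names `normalize_code`, `Evaluation`, and `evaluate_code` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


def normalize_code(code: str) -> str:
    """Uppercase a code and drop dots and spaces so 'e11.65' equals 'E1165'."""
    return code.replace(".", "").replace(" ", "").upper()


@dataclass(frozen=True)
class Evaluation:
    code: str
    hcc_relevant: bool
    confidence: float


def evaluate_code(code: str, reference_codes: Iterable[str]) -> Evaluation:
    """Classify one code against the reference list (illustrative scoring)."""
    reference: Set[str] = {normalize_code(item) for item in reference_codes}
    normalized = normalize_code(code)
    if normalized in reference:
        return Evaluation(code, True, 1.0)
    # Same three-character category as a listed code: possibly a coding
    # variant, so report "not relevant" with reduced confidence.
    if any(item[:3] == normalized[:3] for item in reference):
        return Evaluation(code, False, 0.5)
    return Evaluation(code, False, 1.0)
```

With a sample list built from the example output, `["E11.65", "J44.9", "I50.22", "N18.4", "E66.01"]`, the code `N18.4` is classified as relevant and `I10` as not relevant. In the project, the list would come from the reference data in data/reference/.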