1001 Freelance Projects -- Extract blood test data from PDF documents that have been OCR'd

Latest Projects from Freelance Marketplaces

Today is: 27-Mar-2026 23:19 GMT

View this project in detail (Note: you will be redirected to external marketplace)
Project title:	Extract blood test data from PDF documents that have been OCR'd
Posted by:	External project from PeoplePerHour
Started:	03-Mar-2026 14:36 GMT
Description:	Expected duration: 1 day or less

The objective is to build a structured blood test database that allows pathology results to be viewed, edited, filtered, and exported to Excel via a web-based HTML interface. The system stores results in a clean, standardised format so trends can be analysed accurately over time.

Using AI-assisted OCR, I have built a local Python extraction pipeline that converts PDF pathology reports into machine-readable text and inserts structured data into a SQLite database. The majority of blood tests extract correctly, including canonical test name, result value, unit, and reference range.

However, I have reached a specific technical issue with three markers:
• CRP (C-reactive protein)
• ESR
• GLU (Glucose)

The OCR output clearly contains the correct lines, and debug logs confirm they are processed. Yet no rows are inserted for these markers. The failure appears to occur somewhere in canonical matching, numeric extraction, or validation logic.

Current System Architecture

The system runs locally and consists of:
• extraction_core_2.py (main engine)
• Supporting modules for OCR preprocessing, lab dictionary building, regex matching, and validation
• SQLite backend
• Schema-driven canonical lab dictionary
• Controlled fuzzy fallback logic
• HTML viewer for results display and Excel export

Pipeline flow:
1. Convert PDF to image (pdf2image)
2. Preprocess
3. Run Tesseract OCR
4. Clean and normalise text
5. Match against canonical lab dictionary
6. Extract: canonical test name, numeric result, unit, reference range
7. Validate
8. Insert into SQLite

The engine is deterministic and rule-based.

The Specific Problem

Example OCR line:
CRP H 5.2 mg/L 0-5

OCR text is correct. NUMBER_PATTERN matches. The canonical dictionary contains the test.
Yet:
Inserted 0 rows from 0126251OrderReport_23B00006604_CRP.pdf

Likely failure points include:
• Canonical containment match failing due to normalisation
• Flag tokens (“H”, “L”) interfering with numeric capture
• Numeric extraction anchored incorrectly
• Validation rejecting due to strict range formatting
• Unit pattern mismatch (e.g. mmol/L)
• Dictionary indexing issue
• Match overridden by another lab name
• Guard conditions too strict

If validation fails, the row is rejected silently. All other panels extract correctly. The issue appears isolated.

What Is Required

This is not a rebuild. We do not want:
• Re-architecture
• Experimental AI guessing logic
• Large-scale changes
• Expanded fuzzy matching

We need:

1. Precise Diagnosis
Identify exactly where CRP, ESR, and GLU are failing insertion and which rule is causing rejection.

2. Minimal Safe Fix
Implement a targeted correction that:
• Adjusts canonical matching if required
• Anchors numeric extraction correctly
• Allows flag tokens without blocking capture
• Relaxes only necessary validation checks
• Preserves deterministic behaviour

3. Zero Regression
• No impact to currently working panels
• No performance degradation
• No uncontrolled fuzzy expansion

4. Modular Implementation
If appropriate:
• Implement as small isolated module or
• Cleanly adjust matching block
The existing architecture should remain intact.

Constraints

The system is designed to be:
• Deterministic
• Schema-driven
• Reproducible
• Forensic-grade

We cannot introduce probabilistic or unpredictable behaviour.

Longer-Term Goal

After stabilising extraction:
• Migrate to web deployment
• Enable structured uploads
• Add trend analysis
• Later incorporate AI-assisted interpretation

Immediate priority: Stabilise deterministic extraction for CRP, ESR, and GLU without breaking the existing engine.
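
To illustrate the "flag tokens interfering with numeric capture" failure point, here is a minimal sketch of a line pattern that tolerates an optional H/L flag between the test name and the result. It is an assumption-laden example, not the code in extraction_core_2.py: LINE_PATTERN, ParsedLine, and parse_result_line are hypothetical names, and the real NUMBER_PATTERN may be structured differently. It also assumes the flag, when present, sits before the value, as in the example line.

```python
import re
from typing import NamedTuple, Optional


class ParsedLine(NamedTuple):
    name: str
    flag: Optional[str]
    value: float
    unit: str
    ref_low: Optional[float]
    ref_high: Optional[float]


# Hypothetical pattern for illustration; not the engine's NUMBER_PATTERN.
LINE_PATTERN = re.compile(
    r"""
    ^\s*(?P<name>[A-Za-z][A-Za-z0-9 ()-]*?)      # test name, matched lazily
    \s+(?:(?P<flag>[HL])\s+)?                    # optional standalone H or L flag
    (?P<value>\d+(?:\.\d+)?)                     # numeric result
    (?:\s*(?P<unit>[A-Za-zµ%][A-Za-z0-9µ%/]*))?  # optional unit such as mg/L
    (?:\s+(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?))?  # optional range
    \s*$
    """,
    re.VERBOSE,
)


def parse_result_line(line: str) -> Optional[ParsedLine]:
    """Return the parsed fields, or None if the line does not fit the pattern."""
    m = LINE_PATTERN.match(line)
    if m is None:
        return None
    low, high = m.group("low"), m.group("high")
    return ParsedLine(
        name=m.group("name").strip(),
        flag=m.group("flag"),
        value=float(m.group("value")),
        unit=m.group("unit") or "",
        ref_low=float(low) if low is not None else None,
        ref_high=float(high) if high is not None else None,
    )


print(parse_result_line("CRP H 5.2 mg/L 0-5"))
```

Because the flag is captured in its own optional group, the same deterministic pattern handles both flagged and unflagged lines, and the integer range 0-5 is accepted alongside decimal ranges.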
Materials Provided

Uploaded:
• Full extraction_core_2.py (text format)
• Screenshot of HTML viewer
• Sample PDF files
• Export showing required output

Additional materials available on request:
• Sample OCR blocks
• Canonical dictionary entries
• Regex patterns
• Validation logic
• Database schema
• Debug logs

This is a focused debugging and refinement request. I have spent many hours attempting to isolate the issue and now require an experienced developer to identify the blocking condition and implement a practical fix. I have been advised this should take 1–2 hours for a senior developer. Looking for a swift turnaround.
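
Since rows are currently rejected silently, one way to find the blocking condition is to run each stage as a named check and record the first one that fails. The sketch below is illustrative only. The stage functions, the toy CANONICAL dictionary, and the deliberately strict range rule are all assumptions made for the example, not the project's actual logic or the confirmed cause of the bug.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Toy dictionary standing in for the schema-driven canonical lab dictionary.
CANONICAL = {"crp": "C-reactive protein", "esr": "ESR", "glu": "Glucose"}


@dataclass
class Trace:
    line: str
    steps: List[Tuple[str, bool, str]] = field(default_factory=list)

    @property
    def failed_stage(self) -> Optional[str]:
        """Name of the first stage that rejected the line, or None."""
        for name, ok, _detail in self.steps:
            if not ok:
                return name
        return None


def check_canonical(line: str) -> Tuple[bool, str]:
    hit = next((t for t in line.lower().split() if t in CANONICAL), None)
    return (hit is not None, f"matched {hit!r}" if hit else "no canonical name")


def check_numeric(line: str) -> Tuple[bool, str]:
    m = re.search(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])", line)
    return (m is not None, f"value {m.group(0)}" if m else "no standalone number")


def check_range_strict(line: str) -> Tuple[bool, str]:
    # Hypothetical over-strict rule: both range bounds must contain a decimal point.
    m = re.search(r"\b\d+\.\d+\s*-\s*\d+\.\d+\b", line)
    return (m is not None, f"range {m.group(0)}" if m else "range not in d.d-d.d form")


STAGES: List[Tuple[str, Callable[[str], Tuple[bool, str]]]] = [
    ("canonical_match", check_canonical),
    ("numeric_extraction", check_numeric),
    ("range_validation", check_range_strict),
]


def trace_line(line: str, stages=STAGES) -> Trace:
    """Run the stages in order and stop at the first rejection."""
    trace = Trace(line)
    for name, check in stages:
        ok, detail = check(line)
        trace.steps.append((name, ok, detail))
        if not ok:
            break
    return trace


result = trace_line("CRP H 5.2 mg/L 0-5")
print(result.failed_stage, result.steps)
```

With these toy rules the example line passes matching and numeric capture but is rejected at range_validation, because 0-5 has no decimal point. Wiring the same kind of trace around the real stages would show which rule rejects CRP, ESR, and GLU without changing extraction behaviour.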
Project ID:	3472184
Project category:
Project budget:

