1. Description
We need to build a template-based document flow that automates creating and applying templates.
The process the first time a document arrives:
1. Read the file and extract all text (you)
2. Find the values in the table and compare them with the extracted text (you)
3. If not found, send the document to template creation by setting a flag value to 0 instead of 1 (you)
5. Manual labeling stores the coordinates for every label. This is where we manually draw the bounding boxes.
6. Based on the coordinates, extract the text strings inside the boxes (with regex, for example) and store the values in the template table (you)
7. Read the document again, extract based on the coordinates, compare with the media_template table, and store the results in the media table (you)
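To give an idea of what step 6 could look like, here is a minimal sketch in Python. It assumes the OCR step already produced a list of words with pixel positions (the kind of word boxes an OCR engine such as tesseract can output). The `Word` class, the box format `(left, top, right, bottom)`, the sample words, and the regex patterns are all made up for illustration and are not part of our system.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its pixel position (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def words_in_box(words, box):
    """Return the words whose centre point lies inside box.

    box is (left, top, right, bottom) in the same pixel units as the words.
    """
    left, top, right, bottom = box
    inside = []
    for w in words:
        cx = w.left + w.width / 2
        cy = w.top + w.height / 2
        if left <= cx <= right and top <= cy <= bottom:
            inside.append(w)
    # Simple reading order: top to bottom, then left to right.
    # Real OCR output may need line grouping if tops differ slightly.
    inside.sort(key=lambda w: (w.top, w.left))
    return inside


def extract_field(words, box, pattern):
    """Join the words inside box and return the first regex match, or None."""
    text = " ".join(w.text for w in words_in_box(words, box))
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Made-up sample data standing in for OCR output of one page.
words = [
    Word("Invoice", 40, 30, 80, 20),
    Word("No:", 125, 30, 35, 20),
    Word("INV-2041", 165, 30, 90, 20),
    Word("Total", 40, 400, 50, 20),
    Word("1,250.00", 100, 400, 80, 20),
]

invoice_box = (30, 20, 300, 60)    # hypothetical manually labeled box
total_box = (30, 390, 300, 430)    # hypothetical manually labeled box

invoice_no = extract_field(words, invoice_box, r"INV-\d+")
total = extract_field(words, total_box, r"[\d,]+\.\d{2}")
print(invoice_no, total)
```

Using the word centre point rather than requiring the whole word to fit in the box is one possible choice; it tolerates boxes that were drawn a little too tight during manual labeling.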
The second and later times a document arrives:
1. Upload the document (already done)
2. Detect whether a template already exists. Extract the text strings with regex using the coordinates, and store the strings in the media table.
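The template detection and the 0/1 flag could be sketched roughly as follows. This is only an illustration under several assumptions: it uses Python's built-in sqlite3 in place of MySQL so that it runs without a server, and the column names (`anchor_text`, `template_id`, `has_template`) and the idea of matching on one anchor string per template are hypothetical. Only the table names media_template and media come from our setup.

```python
import sqlite3


def find_template(conn, document_text):
    """Return the id of the first template whose anchor text occurs in the
    document text (case-insensitive), or None if no template matches."""
    rows = conn.execute("SELECT id, anchor_text FROM media_template").fetchall()
    lowered = document_text.lower()
    for template_id, anchor in rows:
        if anchor.lower() in lowered:
            return template_id
    return None


def register_document(conn, name, document_text):
    """Store the document in media with has_template = 1 if a template was
    found, else 0 (meaning it should go to template creation)."""
    template_id = find_template(conn, document_text)
    has_template = 1 if template_id is not None else 0
    conn.execute(
        "INSERT INTO media (name, template_id, has_template) VALUES (?, ?, ?)",
        (name, template_id, has_template),
    )
    return has_template


# In-memory database standing in for MySQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media_template (id INTEGER PRIMARY KEY, anchor_text TEXT)"
)
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, "
    "template_id INTEGER, has_template INTEGER)"
)
conn.execute("INSERT INTO media_template (anchor_text) VALUES ('Acme Corp Invoice')")

known = register_document(conn, "a.pdf", "ACME CORP INVOICE No: INV-2041")
unknown = register_document(conn, "b.pdf", "Globex purchase order 77")
rows = conn.execute(
    "SELECT name, template_id, has_template FROM media ORDER BY id"
).fetchall()
print(rows)
```

With MySQL the queries would stay much the same, but the placeholder style depends on the driver (many MySQL drivers use `%s` instead of `?`).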
Your job is to do the above work quickly.
2. Skills
Python, MySQL, 5+ years of experience, bounding boxes, OpenCV, NLP, spaCy, regex, Tesseract, OCR, PDFXML, TableNet, DeepDeSRT, graph neural networks, GANs, and genetic algorithms
You must have done something similar before, know what regex and Tesseract are, and have used them several times. You have worked with computer vision, ML, DL, or NN.