
Automation Project: Extract Table Data from Images to Excel (OCR)

3D visualization of a scanner converting a paper table into floating Excel cells, representing OCR table extraction.

This project is a perfect example of “Automating the Boring Stuff.” In this tutorial, we’ll use Python to get table data out of a scanned image and into Excel by combining two libraries you’ve already learned:

  1. Pytesseract: To read text from a scanned image (OCR).
  2. Openpyxl: To write that text into a new Excel spreadsheet.

The Goal

Imagine you have a scanned image of a simple table, like invoice.png:

Item    Qty    Price
Widget  2      10.00
Gadget  5      5.50

We want to extract this text and put it into invoice.xlsx.

Step 1: Extract Text with OCR

First, let’s just get the raw text from the image.

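import sys

import pytesseract
from PIL import Image

# pytesseract is only a wrapper: the tesseract program must be installed separately.
img_path = 'invoice.png'
try:
    # Open the scanned image and run OCR on it
    img = Image.open(img_path)
    raw_text = pytesseract.image_to_string(img)
except FileNotFoundError:
    print(f"Error: could not find {img_path}.")
    sys.exit(1)
except pytesseract.TesseractNotFoundError:
    print("Error: tesseract is not installed or not in your PATH.")
    sys.exit(1)

print("--- Raw Text Extracted ---")
print(raw_text)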
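
The Step 1 script depends on Tesseract being installed. If you want to check for it before running OCR, one option is to look for the executable on your PATH with the standard library's shutil.which. This is only a minimal sketch: the helper name tesseract_available is hypothetical, and it assumes the program is named tesseract on your system.

```python
import shutil


def tesseract_available(command="tesseract"):
    """Return True if an executable named `command` is found on the PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found on PATH")
    else:
        print("tesseract not found; install it or add it to your PATH")
```

A check like this only confirms that the program can be found, not that it has the language data it needs to read your image.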

Step 2: Process the Text and Write to Excel

Now we take that raw_text, split it into lines, and write it to an Excel file.

from openpyxl import Workbook

# 1. Create a new Excel workbook
wb = Workbook()
sheet = wb.active
sheet.title = "Invoice Data"

# 2. Process the raw text
for line in raw_text.splitlines():
    if line.strip():  # Skip blank lines
        # Split on whitespace. This assumes no cell contains a space
        # (a real-world script might use regex).
        columns = line.split()

        # 3. Append the list of columns as a new row in Excel
        # (every value is written as text, including the numbers)
        sheet.append(columns)

# 4. Save the file
wb.save("invoice.xlsx")
print("\nData successfully written to invoice.xlsx!")
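
Splitting on every space breaks as soon as an item name contains a space (for example "Blue Widget"), and it stores quantities and prices as text. The comment in the loop hints at regex, so here is a minimal sketch of that idea. It assumes the text keeps column gaps as runs of two or more spaces, as in the sample table in The Goal. Raw OCR output may collapse those gaps to single spaces, in which case this approach does not apply. The helper names split_row and to_number are hypothetical.

```python
import re


def to_number(cell):
    """Convert a cell to int or float when possible; otherwise return it unchanged."""
    for cast in (int, float):
        try:
            return cast(cell)
        except ValueError:
            continue
    return cell


def split_row(line):
    """Split a line on runs of two or more whitespace characters."""
    cells = re.split(r"\s{2,}", line.strip())
    return [to_number(cell) for cell in cells]


sample = "Blue Widget  2      10.00"
print(split_row(sample))  # ['Blue Widget', 2, 10.0]
```

Each list returned by split_row could be passed to sheet.append in place of columns, so that numeric cells land in Excel as numbers rather than text.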

You’ve just built a bot that does manual data entry!
