Python OCR Handbook: Image to Text, Super Easy to Get Started

OCR x Pytesseract

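Introduction

In Python, using OCR (Optical Character Recognition) to convert the contents of an image into plain text is very simple.

Once the required software and Python packages are installed, the program is ready to run.

This document records the pitfalls I ran into along the way, so that developers who want to look into OCR later can get started quickly.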

Installation Notes and Sample Code

[Installation notes]

https://gitlab.com/GammaRayStudio/DevDoc/-/blob/master/Python/004.PythonOCR.md

[Sample code]

https://gitlab.com/GammaRayStudio/Program/PythonStudio/SE/PythonOCR

Sample Images

Conversion targets

English

Image

Text

English
Gamma Ray Studio
English Text
Text Text Text ~ !!!

Traditional Chinese

Image

Text

繁體中文
Gamma Ray 軟體工作室
中文 文字
文字 文字 文字 ~ !!!

Simplified Chinese

Image

Text

简体中文
Gamma Ray 软体工作室
中文 文字
文字 文字 文字 ~ !!!

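Installing Tesseract

Windows

https://github.com/UB-Mannheim/tesseract/wiki

Environment variables (add the Tesseract installation folder to PATH so the tesseract command can be found)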

Mac

brew install tesseract

Linux

apt-get install tesseract-ocr

Verify

tesseract -v
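
The same check can be done from Python before any OCR call is made. This is a minimal sketch that uses only the standard library; the helper names find_tesseract and tesseract_version are made up for illustration and are not part of Tesseract or pytesseract.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH")
else:
    print(path, "->", tesseract_version(path))
```

If the first message is printed, the PATH setup from the installation step is the first thing to recheck.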

Python Environment

Python version

python -V

Python 3.8.5

PyPI

Pillow
pytesseract

pip3 install Pillow
pip3 install pytesseract
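
To confirm that both packages landed in the interpreter you are actually running, a quick check like the one below may help. It is only a sketch; missing_packages is a hypothetical helper name, and the one fact it relies on is that Pillow is imported under the name PIL.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Pillow is imported as "PIL"; pytesseract keeps its own name.
missing = missing_packages(["PIL", "pytesseract"])
print("missing:", missing if missing else "none")
```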

Python Example

from PIL import Image
import pytesseract
img_name = './001.en-us.png'
img = Image.open(img_name)
text = pytesseract.image_to_string(img, lang='eng')
print(text)

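PIL : image handling, provided by Pillow
pytesseract : the OCR module, Pytesseract
img_name = './001.en-us.png' : path to the image
img = Image.open(img_name) : load the image
text = pytesseract.image_to_string(img, lang='eng') : convert the image to text using the English language data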

Output

English

Gamma Ray Studio
English Text

Text Text Text ~ 11!

The exclamation marks were recognized as 1, but most of the text was recognized correctly.
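
Misreads like this can sometimes be reduced by cleaning up the image before running OCR. The sketch below is not part of the original workflow: it uses Pillow to convert the image to grayscale, enlarge it, and binarize it. The helper name preprocess, the scale factor of 2, and the threshold of 160 are all assumptions to tune per image, and there is no guarantee this fixes the exclamation marks in this particular sample.

```python
from PIL import Image


def preprocess(img, scale=2, threshold=160):
    """Return a grayscale, enlarged, black-and-white copy of img."""
    gray = img.convert("L")  # 8-bit grayscale
    width, height = gray.size
    bigger = gray.resize((width * scale, height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white, the rest black.
    return bigger.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed to pytesseract.image_to_string() in place of the original image, for example preprocess(Image.open('./001.en-us.png')).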

Chinese Recognition

from PIL import Image
import pytesseract
img_name = './002.zh-cht.png'
img = Image.open(img_name)
text = pytesseract.image_to_string(img, lang='eng')
print(text)

img_name = './002.zh-cht.png' : switch the loaded image to the Chinese one

Output

SRE
Gamma Ray BREA TER

FX XF

XF XF XF~

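At this point, the Chinese text comes out as unrecognizable characters, because only the English language data is in use.
Installing additional language data lets Tesseract recognize more languages.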

Downloading Language Data

GitHub - Tessdata

https://github.com/tesseract-ocr/tessdata_best

English

eng.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata

Traditional Chinese

chi_tra.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_tra.traineddata

Simplified Chinese

chi_sim.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata
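
The files can also be fetched from a script instead of through the browser. The sketch below assumes that GitHub serves the file itself when /blob/ in the links above is replaced by /raw/; the helper names traineddata_url and download_traineddata are hypothetical, and the target folder is whatever tessdata folder you use.

```python
import urllib.request
from pathlib import Path

# Assumption: the "/raw/" form of the repository URL returns the file contents.
RAW_BASE = "https://github.com/tesseract-ocr/tessdata_best/raw/main"


def traineddata_url(lang):
    """Build the raw download URL for a language code such as 'chi_tra'."""
    return f"{RAW_BASE}/{lang}.traineddata"


def download_traineddata(lang, target_dir):
    """Download <lang>.traineddata into target_dir and return the saved path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    destination = target / f"{lang}.traineddata"
    urllib.request.urlretrieve(traineddata_url(lang), str(destination))
    return destination
```

Calling download_traineddata('chi_tra', 'tessdata') would then save chi_tra.traineddata into a local tessdata folder.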

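Default Path

Using Mac as an example

Tesseract program path

/usr/local/Cellar/tesseract

Language data path

/usr/local/Cellar/tesseract/4.1.3/share/tessdata

After downloading the language data, place the files in the share/tessdata folder under the program folder and they take effect.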
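
To double-check which language files actually ended up in that folder, a small listing helper can be used. This is a sketch under the assumption that every language is stored as a <code>.traineddata file directly inside the folder; installed_languages is a hypothetical name.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes found as *.traineddata files."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))
```

For the Mac path above, installed_languages('/usr/local/Cellar/tesseract/4.1.3/share/tessdata') should include chi_tra once the file is in place. On the command line, tesseract --list-langs reports what Tesseract itself can see.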

Configuring the Environment Variable

Separate folder

Windows

C:\DevTools\tessdata

Environment variable: set TESSDATA_PREFIX to this folder

Mac

/Users/Enoxs/DevTools/tessdata

.zprofile

# TESSDATA
export TESSDATA_PREFIX=/Users/Enoxs/DevTools/tessdata

Linux

.bash_profile

# TESSDATA
export TESSDATA_PREFIX=/Users/enoxs/DevTools/tessdata
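
If editing the shell profile is not convenient, the folder can also be set from inside the Python program. The sketch below shows two options under one hypothetical helper, use_tessdata_dir: it sets TESSDATA_PREFIX for the current process (child processes started afterwards inherit it) and returns a --tessdata-dir option string that can be passed through the config argument of pytesseract.

```python
import os


def use_tessdata_dir(tessdata_dir):
    """Point Tesseract at tessdata_dir for this Python process only."""
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
    # The same folder as a command-line option, for pytesseract's `config` argument.
    return f'--tessdata-dir "{tessdata_dir}"'
```

A call might look like cfg = use_tessdata_dir('/Users/Enoxs/DevTools/tessdata') followed by pytesseract.image_to_string(img, lang='chi_tra', config=cfg). Either mechanism alone is normally enough; both are shown only for comparison.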

Adjusting the Language Parameter

from PIL import Image
import pytesseract
img_name = './002.zh-cht.png'
img = Image.open(img_name)
text = pytesseract.image_to_string(img, lang='chi_tra+eng')
print(text)

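img_name = './002.zh-cht.png' : switch the loaded image to the Traditional Chinese one
text = pytesseract.image_to_string(img, lang='chi_tra+eng') : convert the image to text using Traditional Chinese and English
- English : eng
- Traditional Chinese : chi_tra
- Simplified Chinese : chi_sim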
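
Since the lang value is just language codes joined with a plus sign, it can be built from a list. The following is a small sketch; build_lang and KNOWN_LANGS are hypothetical names, and the set only covers the three codes used in this article.

```python
KNOWN_LANGS = {"eng", "chi_tra", "chi_sim"}  # the three codes used in this article


def build_lang(codes):
    """Join language codes with '+', rejecting codes outside KNOWN_LANGS."""
    unknown = [c for c in codes if c not in KNOWN_LANGS]
    if unknown:
        raise ValueError(f"unknown language code(s): {unknown}")
    return "+".join(codes)


print(build_lang(["chi_tra", "eng"]))  # prints chi_tra+eng
```

Rejecting unknown codes up front catches typos such as chi-tra before Tesseract is ever started.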

The Code I Use Day to Day

No code review, free style.

Running ocr.py converts the images in the image folder to text
and saves each result in the text folder under the original file name.

import os

from PIL import Image
import pytesseract


def ocrText(fileName):
    # Open the image and run OCR with English plus both Chinese language packs.
    with Image.open(fileName) as img:
        # text = pytesseract.image_to_string(img, lang='eng')
        # text = pytesseract.image_to_string(img, lang='eng+chi_tra')
        text = pytesseract.image_to_string(img, lang='eng+chi_tra+chi_sim')
    return text


def replaceText(text):
    # Swap ASCII commas for full-width commas, then strip all spaces.
    text = text.replace(",", "，")
    return text.replace(" ", "")


def save(fileName, text):
    print("text.length => ", len(text))
    # The with block closes the file automatically.
    with open(fileName, 'w', encoding='UTF-8') as f:
        f.write(text)


def main():
    image_dir = os.path.join('.', 'image')
    text_dir = os.path.join('.', 'text')
    os.makedirs(text_dir, exist_ok=True)  # make sure the output folder exists

    for f in os.listdir(image_dir):
        src = os.path.join(image_dir, f)
        # Only regular files whose name ends in .png are converted.
        if not os.path.isfile(src) or not f.endswith('.png'):
            continue
        out_name = os.path.splitext(f)[0]  # file name without the extension
        print(out_name)
        text = ocrText(src)
        # text = replaceText(text)
        save(os.path.join(text_dir, out_name + '.txt'), text)


if __name__ == "__main__":
    main()

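ocrText() : converts the target image to text
replaceText() : replaces specified text as needed
save() : saves the text to the given path
main() : the overall flow
1. Get the names of all files in the image folder
2. Use for ... in to convert each file to text and work out its output name
3. After conversion, save the text to the text folder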
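
The three steps above can also be sketched with pathlib. This is an alternative illustration rather than the script from this article: convert_folder is a hypothetical helper, and the OCR step is passed in as a callable so the flow can be tried out without Tesseract installed.

```python
from pathlib import Path


def convert_folder(image_dir, text_dir, ocr):
    """Run ocr(path) on every .png in image_dir and write <name>.txt files.

    `ocr` is any callable that takes an image path and returns a string.
    """
    out_dir = Path(text_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        # Step 2: derive the output name from the image name, then convert.
        target = out_dir / (image_path.stem + ".txt")
        # Step 3: save the recognized text as UTF-8.
        target.write_text(ocr(image_path), encoding="utf-8")
        written.append(target)
    return written
```

With the script above, the call would be convert_folder('image', 'text', ocr=ocrText).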

