id
type
0 (not classified)
status
21 (imported old-v2, waiting for another import)
review version
0
cleanup version
0
pending deletion
0 (-)
created at
2025-11-14 20:40:20
updated at
2025-11-14 20:40:22
url
http://aty.sdsu.edu/bibliog/latex/debian/tess.html
url length
50
url crc
4289
url crc32
712446145
location type
1 (url matches target location, page_location is empty)
canonical status
2 (missing canonical tag in html)
canonical page id
-
domain id
domain tld
2295
domain parts
0
originating warc id
-
originating url
https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151280019.53/warc/CC-MAIN-20250808034803-20250808064803-00605.warc.gz
source type
11 (CommonCrawl)
server ip
Publication date
2025-08-08 04:13:42
Fetch attempts
0
Original html size
32968
Normalized and saved size
32968
title
Tesseract
excerpt
content
Using Tesseract

Introduction

The tesseract OCR system is very complicated, with more than 600 adjustable parameters. It can perform very well, but you often have to tweak some of those parameters. Just remember that a system that's infinitely adjustable is always out of adjustment. Unfortunately, the documentation for tesseract isn't very clear, so it's difficult for beginners to learn what needs to be tweaked, or how to do it. This page explains some ways to improve its performance.

Overview of the OCR process

To help you understand what's involved, here's an outline of how the system turns a picture of text into machine-readable text. First, the image is converted to a standard format, tiff. If the image is in color or grayscale, it's converted to black-and-white; this process is called thresholding. The thresholded image usually must be cleaned of “noise” like dust specks, if it was scanned in from a printed page. The outlines of the printed characters are extracted,...
author
updated
1767175161
block type
0
extracted fields
104
extracted bits
title
full content
content was extracted heuristically
detected location
0
detected language
1 (English)
category id
Other [en] (231)
index version
2025123101
paywall score
0
spam phrases
0
text nonlatin
4
text cyrillic
0
text characters
23284
text words
4883
text unique words
1115
text lines
1
text sentences
207
text paragraphs
1
text words per sentence
23
text matched phrases
0
text matched dictionaries
0
links self subdomains
0
links other subdomains
0
links other domains
1
links spam adult
0
links spam random
0
links spam expired
0
links ext activities
0
links ext ecommerce
0
links ext finance
0
links ext crypto
0
links ext booking
0
links ext news
0
links ext leaks
0
links ext ugc
2
links ext klim
0
links ext generic
0
image author
featured image