Skip to main content

Convert Scanned PDF Documents to Text without having to wait for google bots

Working supporting old scientific hardware sometimes brings me some challenges. Usually, the manuals are only on paper and when there is a digital version it was digitized (scanned) .

Googling a little, came to me an article which relies on waiting for the google to index your files and OCR them. But there is an open source alternative.

Looking a little further, I did find another two articles at linuxquestions.org and at linux.com, on which I've found the tesseract-ocr. So to solve my issue, I had to first convert my PDF file to a bunch of TIF images, and so OCR them with tesseract. This way:

gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf

ls -1 *.tif | cut -d e -f 2 | while read line ; do tesseract "page"$line "page"$line -l eng; done

Hope this helps somebody...

Comments

  1. This comment has been removed by the author.

    ReplyDelete
  2. It keeps the line breaks, page breaks and the blank lines between the paragraphs.

    ReplyDelete
  3. Now, there's an app for that... ;-)
    https://github.com/jbarlow83/OCRmyPDF

    ReplyDelete

Post a Comment

Popular posts from this blog

More trickery with gnuplot dumb terminal

In my post "Plotting memory usage on console" the chart doesn't pan the data.
Now, using a named pipe, the effect got a little bit nicer.
First, we have to run the memUsage.sh script to get a file filled with memory usage info:
./memUsage.sh > memUsage.dat &
Then we have to create a named pipe:
mkfifo pipe
Now we have to run another process to tail only the last 64 lines from the memUsage.dat
while [ 1 ]; do tail -64 memUsage.dat> pipe; done &
And now we just have to plot the data from the pipe:
watch -n 1 'gnuplot -e "set terminal dumb;p \"pipe\" with lines"'
And that is it!

uSleep on windows (win32)

I am facing a terrible issue regarding timing on windows.

Googling arround, I've found those infos:
Using QueryPerformanceCounter and QueryPerformanceFrequency APIs in Dev-C++
(http://yeohhs.blogspot.com/2005/08/using-queryperformancecounter-and_13.html)
QueryPerformanceCounter() vs. GetTickCount()
http://www.delphifaq.com/faq/delphi_windows_API/f345.shtml
How to time a block of code
http://www.cryer.co.uk/brian/delphi/howto_time_code.htm
And Results of some quick research on timing in Win32 http://www.geisswerks.com/ryan/FAQS/timing.html
With that I'm trying to write something like a uSleep function for windows:


#include<windows.h>

voiduSleep(int waitTime){
__int64 time1 = 0, time2 = 0, sysFreq = 0;

QueryPerformanceCounter((LARGE_INTEGER *)&time1);
QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
do{
QueryPerformanceCounter((LARGE_INTEGER *)&time2);

// }while((((time2-time1)*1.0)/sysFreq)<waitTime);
}while( (time2-time1) <waitTime);
}

There is also already a nanosleep…

Checking auth.log for ssh brute force attacks

As I am letting my personal computer always on, as a homelinux server, I decided to check if someone is trying to breaking in with SSH brute force attacks.

First I did a grep for fail at the /var/log/auth.log. (grep -i /var/log/auth.log)

And I got lots of lines with the string "fail". With [grep -i /var/log/auth.log | wc -l] I figured out that were 1164 fail entries at auth.log

With an [grep -i fail auth.log | cut -d " " -f 6 | sort | uniq] I checked that were two kind of failed attempts:
Failed
pam_unix(sshd:auth):

So I wrote the following line to check with which users they were attempting to log:
grep Failed auth.log | cut -d " " -f 11 | sort | uniq | while read line ; do echo -n $line" "; grep $line auth.log | wc -l; done | sort -n -k 2

Here, the field position (the number 11 at the above command lines [-f 11]) may change in some systems. At my desktop at work, the username came at the position 9.

Here are the "top ten":
root 2922
user 2884