Nuance pdf conversion batch ocr
Batch convert PDFs to searchable PDFs
I'm looking for a way to convert thousands of PDFs to searchable PDFs. I've used a program called "PDF Create Assistant" that came with Nuance's eCopy software. However, you can't select a folder: you have to go into each subfolder, select the files to convert, and then go on to the next folder.
What is another way to convert a large number of PDFs to searchable PDFs?
Haven't had any suggestions. Surely there must be a way to batch convert PDFs?
migrated from stackoverflow.com Oct 6 '12 at 15:00
This question came from our site for professional and enthusiast programmers.
Batch OCR for many PDF files (not already OCRed)? [closed]
I use Google Desktop Search (I am on Vista), and not all the PDF files in my archive folder are recognized. That is expected, since "PDF files that contain scanned images" are not indexed ( http://desktop.google.com/support/bin/answer.py?hl=en&answer=90651 )
So I would like to OCR the many PDF files of mine that are not already OCRed. My goal: I give the program a folder, and it searches the subfolders on its own for the PDF files that need to be converted into OCRed PDF files.
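The "find the PDFs that still need OCR" part of that goal can be sketched in Python. This is only an illustration under assumptions that are not in the question: it presumes Poppler's `pdftotext` command-line tool is installed and on the PATH, and it treats a PDF as un-OCRed when fewer than `min_chars` characters of text come out of it. The names `pdftotext_extract`, `needs_ocr` and `find_pdfs_needing_ocr` are made up for the example.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterator


def pdftotext_extract(pdf: Path) -> str:
    """Return the text layer of a PDF via Poppler's pdftotext (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        text=True,
        errors="replace",
        check=False,  # a broken PDF just yields empty text here
    )
    return result.stdout


def needs_ocr(pdf: Path,
              extract: Callable[[Path], str] = pdftotext_extract,
              min_chars: int = 20) -> bool:
    """Assumption: a PDF with (almost) no extractable text has not been OCRed."""
    return len(extract(pdf).strip()) < min_chars


def find_pdfs_needing_ocr(root: Path,
                          extract: Callable[[Path], str] = pdftotext_extract
                          ) -> Iterator[Path]:
    """Walk root and all its subfolders, yielding PDFs without a usable text layer."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf" and needs_ocr(path, extract):
            yield path
```

Something like `for pdf in find_pdfs_needing_ocr(Path("archive")): print(pdf)` would then list the candidates. The threshold is a guess; a scanned document with a few characters of embedded metadata text, or a genuinely text-free PDF, can be misclassified.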
Note: In the past, if a PDF file was password protected, I removed the password with another (paid) batch tool: verypdf.com "pwdremover" http://www.verypdf.com/pwdremover/
Any (not too expensive) ideas?
I have already tried FineReader 6 Pro (on XP at the time), but no batch processor was included. I also tried Paperfile (paperfile.net), which uses Tesseract http://code.google.com/p/tesseract-ocr/ . But its OCR is only PDF to text, not PDF to PDF! There is also another project: http://code.google.com/p/ocropus/
Thanks in advance ;)
closed as off-topic by DavidPostill ♦ , bwDraco, Kevin Panko, LawrenceC, mdpc Jan 10 '15 at 4:18
4 Answers
tl;dr? Start with Nuance PowerPDF Advanced.
I evaluated OCR software in Dec 2014 in prep for a big project - OCR on millions of English-language pages done in batches. If you're willing to spend a few hundred dollars you have many options; trial versions can get you thru if you only need to convert a few hundred pages.
Many software packages want to load all the input files, do OCR and coalesce the mess into a single output. IMHO this is dead wrong, I have no idea who would want that. I was looking for true batch: one output file for each input file, unattended operation, don't stop for anything, give me a detailed report at the end. Spoiler alert: I didn't find that.
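For what it's worth, the "true batch" behaviour described above (one output per input, mirrored folders, never stop, report at the end) can be sketched in Python around any command-line OCR tool. This is only an illustration, not something any of the reviewed packages ship: `run_batch`, `build_cmd` and the `batch_report.log` name are made up for the example, and the actual OCR command is left for the caller to supply.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence


def run_batch(src_root: Path, dst_root: Path,
              build_cmd: Callable[[Path, Path], Sequence[str]],
              log_name: str = "batch_report.log",
              timeout_s: float = 3600.0) -> List[str]:
    """OCR every PDF under src_root into a mirrored tree under dst_root.

    build_cmd(src, dst) returns the command line for whatever OCR tool is used
    (hypothetical hook). Failures are recorded and the loop carries on.
    dst_root should live outside src_root so outputs are not picked up as inputs.
    """
    report: List[str] = []
    for src in sorted(src_root.rglob("*")):
        if not (src.is_file() and src.suffix.lower() == ".pdf"):
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel                      # same subfolder, same file name
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            proc = subprocess.run(list(build_cmd(src, dst)),
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL,
                                  timeout=timeout_s)
            if proc.returncode == 0 and dst.exists():
                status = "OK"
            else:
                status = f"FAILED rc={proc.returncode}"
        except (OSError, subprocess.TimeoutExpired) as exc:
            status = f"ERROR {type(exc).__name__}"  # tool missing, hang, etc.
        report.append(f"{status}\t{rel.as_posix()}")
    # The detailed report at the end: one line per input file.
    dst_root.mkdir(parents=True, exist_ok=True)
    (dst_root / log_name).write_text("\n".join(report) + "\n", encoding="utf-8")
    return report
```

The per-file timeout is there so that one hanging document cannot stall the whole run; the one-hour default is arbitrary.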
Packages in alphabetic order follow. Prices shown below are list but discounts abound. Take my comments about accuracy with a grain of salt; your inputs will not be the same as my inputs so your mileage will certainly vary.
ABBYY Finereader 12 Corporate: $400. Batch feature is called the "Task Manager" and it's on the Tools menu. It will process files from a folder, including subfolders; it will happily create a separate output file for each input file. It does not seem capable of preserving the input folder hierarchy; all output files went to the same output folder. The accuracy was high in my tests, yet still the lowest of the packages I've listed here.
Adobe Acrobat XI: $300. Batch feature is called "Text Recognition/In Multiple Files" which can be found by clicking on Tools (third toolbar, top right side of the main screen). Processes subfolders, one output for each input. Stops and puts up a prompt if it finds a password-protected file. Does not preserve input directory tree by default; can do so by writing output to same folder as input. Accuracy was quite good in my tests.
Nuance PowerPDF Advanced v1.1 (successor to OmniPage Ultimate): $150. Batch feature is called "Batch Converter" and it's reachable from the main program under the Advanced Processing tab. It will process folders and subfolders, preserving the input structure in the output. One output for each input. Will use multiple cores, but not aggressively; what that means is I could not get it to saturate a multi-core host. Accuracy is excellent, as good as or better than OmniPage. Bad or fuzzy files did not cause it to hang. The batch processor writes (shock) a plain-text log file to the output directory.
ReadIris Corporate 14: $600. Batch feature is invoked by the "Batch OCR" item which is revealed by clicking on the "From Files" button on the main screen. It will process folders and subfolders, one output for each input, and by default the output directory structure matches the input directory structure. Stops and demands user input on an invalid file; processes without further complaint all protected documents apparently by OCR-ing the image. The accuracy was very good, on par with Acrobat.
On my desktop machine (only dual core), with my chosen inputs, every package required at least 3 seconds to process a page; some took more. Might be able to drive this down on a machine with more cores.
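To put that per-page figure in perspective, here is a back-of-the-envelope estimate in Python. The 3 seconds per page comes from my tests above; the one-million-page volume is just an illustrative number, and the `workers` parameter assumes perfectly linear scaling across parallel workers, which is optimistic. `estimate_days` is a made-up helper name.

```python
def estimate_days(pages: int, seconds_per_page: float = 3.0, workers: int = 1) -> float:
    """Wall-clock days to OCR `pages`, assuming ideal scaling over `workers`."""
    seconds_per_day = 86_400
    return pages * seconds_per_page / workers / seconds_per_day


# One million pages at 3 s/page on a single worker:
print(round(estimate_days(1_000_000), 1))             # 34.7 days
# The same job spread over four ideally scaling workers:
print(round(estimate_days(1_000_000, workers=4), 1))  # 8.7 days
```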
Gotchas abound, be sure to plan for them: invalid PDFs (some packages halt), password-protected PDFs (some packages halt, others convert anyhow!), and rotated pages (landscape instead of portrait). If you want the batch to run thru to completion, you have to prep the input area for these packages Very, Very Carefully. Look into the GhostScript package's print-to-PDF feature for a way of removing protection from PDFs.
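As a rough illustration of that prep step, a Python pre-screen can sort files into "ok", "protected" and "invalid" piles before the OCR package ever sees them. The checks below are crude byte-level heuristics, not a real PDF parser: a missing `%PDF-` header or `%%EOF` marker suggests a damaged file, and an `/Encrypt` entry suggests a protected one. They can misjudge unusual files, and they do not detect rotated pages. `prescreen` and its return labels are made up for the example.

```python
from pathlib import Path


def prescreen(pdf: Path, tail_bytes: int = 2048) -> str:
    """Classify a file as 'ok', 'protected', 'invalid-header' or 'truncated'."""
    data = pdf.read_bytes()
    if not data.startswith(b"%PDF-"):
        return "invalid-header"   # not a PDF, or junk before the header
    if b"%%EOF" not in data[-tail_bytes:]:
        return "truncated"        # end-of-file marker missing near the end
    if b"/Encrypt" in data:
        return "protected"        # heuristic: an encryption dictionary is referenced
    return "ok"
```

Files that come back as anything other than "ok" can be moved to a separate folder and dealt with by hand, so the unattended run only sees inputs it is likely to survive.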
Running large batches can lead to memory-exhaustion and hanging problems, even tho it should not (argh - probably memory leaks). If you're doing any kind of automation at all, a big problem is discovering after the fact what really happened - which documents could not be processed, which failed during processing, etc. It's like the desktop software people never heard of something called a "log file".
Finally, getting support, even as a paying customer, is pretty difficult for these mass-market packages. For example, I complained to one esteemed customer support rep about a package (which shall remain nameless) hanging for some large inputs. I waited 36 hours before giving up :). They sweetly suggested limiting the batch size to 300 documents. That was just completely unacceptable to me, but hey it got that support ticket closed dang quick, right? And that's all that matters, right? Sigh.
Adobe Acrobat will process a folder of PDFs and like most Adobe products there's a 30 day trial.
The function is located in the 'Document' menu, from where you can add your folder.
In Acrobat X the function is available as follows:
Actually, pdfsandwich has been updated within the last year and was not at all difficult for me to install in Linux Mint. The results it gives are inferior to Adobe Acrobat, but it's the only workable solution I've found in Linux so far.
Try WatchOCR. It is a free, open source software package that converts scanned images into text-searchable PDFs, and it has a nice web interface for remote administration. With the right configuration it can be used to create a batch PDF/OCR service for an entire network via SMB shares. Unfortunately it is Linux only. But you could install it on an old server and then your entire organisation could use it.
Convert PDF to Word
With Nuance's Power PDF software, never retype a Word document again!
Do you use Microsoft Word? Do you regularly handle administrative tasks?
You need to make a few minor changes to a PDF file, but you don't have a PDF converter. You have tried a few online conversion tools, but none of them preserved the document's original formatting. You realize you may have to retype the entire document from scratch.
With Nuance Power PDF, converting a PDF document to Word format is easy and very fast, and every converted document keeps its original formatting.
Convert your PDF documents to Word format in three steps
Power PDF is intuitive, easy-to-use software aimed at improving business productivity. It converts any PDF document to Word format in just three steps.
Accurately reproduce complex layouts
Nuance uses the most advanced OCR (optical character recognition) technology on the market, letting you reproduce complex layouts (tables, columns, graphics) faithfully to the source document.
An intuitive user experience
Power PDF was designed to fit into your workflow. You can convert your PDF documents to Word format from within Power PDF, or simply right-click your PDF document and select the option to convert to Word format.
Software that is easy to pick up
With an interface close to that of Microsoft products, Power PDF will integrate easily into your workflow if you are considering rolling the software out across your whole company.
How to convert your PDF documents to Microsoft Word
Open your PDF document
Under Advanced Processing, click Home and select the MS Word icon
Save the document and click OK. Your document will be converted automatically.
No more retyping your static PDFs or scanned documents. Convert them to Microsoft Office documents and edit them as you see fit with Power PDF. Watch our video to learn more.
For teams and businesses
Looking for a better PDF solution for your business? Contact us for more information about our volume licensing programs.