Does your work involve going through PDFs for data? It can be really annoying and time-consuming to check PDF files one by one in search of specific data. You can use the search option, but that doesn't make it much more convenient.
In this article, I will cover an online service where you can run SQL queries on PDF files. By running an SQL query, you can not only search but also extract specific data from a collection of multiple PDF files. This makes extracting data from PDFs a lot easier and more convenient.
Rockset is a freemium online service that lets you run real-time SQL on raw data. With the free plan, you can process up to 500 KB of ingested documents (measured after processing) per month with 1 concurrent query slot. The free limit is fairly generous because the service extracts and saves only the text, which takes up very little disk space.
How to Run SQL Queries on PDF Files?
To run SQL queries, you should have some experience with SQL. If you are already familiar with SQL, you can learn more about the syntax and SQL commands you can run on Rockset here. Otherwise, I recommend finding an online course on SQL and getting familiar with the basics, syntax, and commands.
To run SQL queries on PDFs, you first have to upload the PDF files to Rockset. The service then creates an ingested document by extracting data from your source files, and shows you all the data fields that it extracted.
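To get a feel for the kind of query involved, here is a minimal local sketch that uses Python's built-in sqlite3 module as a stand-in for Rockset. It does not use Rockset's API, and the table name pdf_docs and the fields file_name, page, and text are assumptions made up for illustration; the fields Rockset actually extracts from your files may look different.

```python
import sqlite3

# Hypothetical rows standing in for text extracted from uploaded PDFs.
# The field names (file_name, page, text) are assumptions for illustration only.
rows = [
    ("invoice_001.pdf", 1, "Invoice total: 120.50 USD"),
    ("invoice_002.pdf", 1, "Invoice total: 89.00 USD"),
    ("report_q1.pdf", 3, "Quarterly summary of sales"),
]

# An in-memory SQLite database plays the role of a collection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdf_docs (file_name TEXT, page INTEGER, text TEXT)")
conn.executemany("INSERT INTO pdf_docs VALUES (?, ?, ?)", rows)


def find_documents(keyword):
    """Return (file_name, page) pairs whose extracted text contains the keyword."""
    cursor = conn.execute(
        "SELECT file_name, page FROM pdf_docs "
        "WHERE text LIKE ? ORDER BY file_name",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()


matches = find_documents("Invoice total")
print(matches)
```

The point of the sketch is that one query searches every document at once, instead of opening each PDF and using the search option by hand.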
Collections
On Rockset, you can create a collection from any of the following source types:
- Amazon S3
- Amazon Kinesis
- Amazon DynamoDB
- Google Cloud Storage
- File Upload
- Sample Database (for testing)
This service is not limited to PDFs; it also supports semi-structured data in the following formats:
- JSON
- CSV/TSV
- XML
- Parquet
- XLS/XLSX
Query
Once you have all your collections on Rockset, you can run SQL queries on any of them. After executing a query, you can export the query for
- Python
- Jupyter Notebook
- Go
- Java
- NodeJS
and download the query results in the following file formats:
- JSON
- CSV
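If you download results as JSON but need them in a spreadsheet, a small script can do the conversion. The following is a minimal sketch that assumes the downloaded file contains a JSON array of row objects; the sample data and the function name json_rows_to_csv are hypothetical, and the actual layout of Rockset's export may differ.

```python
import csv
import io
import json

# Hypothetical contents of a downloaded JSON result file.
# Assumption: the file holds a JSON array with one object per result row.
downloaded_json = """
[
  {"file_name": "invoice_001.pdf", "total": 120.5},
  {"file_name": "invoice_002.pdf", "total": 89.0}
]
"""


def json_rows_to_csv(json_text):
    """Convert a JSON array of row objects into CSV text."""
    rows = json.loads(json_text)
    if not rows:
        return ""
    # Collect column names in first-seen order so rows with missing keys still fit.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = json_rows_to_csv(downloaded_json)
print(csv_text)
```

In practice you would read the JSON text from the downloaded file and write csv_text to a new file rather than printing it.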
Run SQL queries on PDF files here.
Verdict
Rockset is a handy service for extracting specific data from semi-structured file formats. It does require a basic understanding of SQL, but it can save you a lot of time. Give it a try and share your thoughts with us in the comments.