With neural search seeing rapid adoption, more people are looking at using it for indexing and searching through their unstructured data. I know several folks already building PDF search engines powered by AI, so I figured I’d give it a stab too.

This will be part 1 of n posts that walk you through creating a PDF neural search engine using Python:

- In this post we’ll cover how to extract the images and text from PDFs, process them, and store them in a sane way.
- For the next post we’ll look at feeding these into CLIP, a deep learning model that “understands” text and images. After extracting our PDF’s text and images, CLIP will generate a semantically useful index that we can search by giving it an image or text as input (and it’ll understand the input semantically, not just match keywords or pixels).
- Next we’ll look at how to search through that index using a client and Streamlit frontend.
- Finally we’ll look at some other useful tasks, like extracting metadata.

This is just a rough and ready roadmap - so stay tuned to see how things really pan out.

As anyone who’s spent any time in data science knows, wrangling the data into a usable state is 90% of the job. So that’s what we’ll focus on in this post. In future posts we’ll look at how to actually search through that data.

If you want to follow along at home (and maybe fix a few of my bugs!), check the repo.

Since this task is a whole search pipeline that deals with different kinds of data, we’ll use some specialist tools to get this done:

- DocArray - a data structure for unstructured data.
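
To picture what storing a PDF’s text and images “in a sane way” could look like, here is a minimal sketch using only the Python standard library. It is a stand-in for illustration, not DocArray’s API: the `Chunk` and `PdfDocument` classes, their fields, and the `example.pdf` path are all hypothetical names. The idea is simply that one PDF becomes a parent record holding a list of text and image pieces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """One piece extracted from a PDF: either a text passage or an image."""
    modality: str                 # "text" or "image"
    text: Optional[str] = None    # set for text chunks
    blob: Optional[bytes] = None  # raw image bytes, set for image chunks
    page: Optional[int] = None    # page the chunk came from, if known


@dataclass
class PdfDocument:
    """Hypothetical container: one PDF plus everything extracted from it."""
    uri: str
    chunks: List[Chunk] = field(default_factory=list)

    def add_text(self, text: str, page: Optional[int] = None) -> None:
        # Collapse line breaks and repeated spaces left over from extraction.
        cleaned = " ".join(text.split())
        if cleaned:  # skip passages that were only whitespace
            self.chunks.append(Chunk("text", text=cleaned, page=page))

    def add_image(self, blob: bytes, page: Optional[int] = None) -> None:
        self.chunks.append(Chunk("image", blob=blob, page=page))

    def by_modality(self, modality: str) -> List[Chunk]:
        # Return only the chunks of one kind, e.g. all images.
        return [c for c in self.chunks if c.modality == modality]


# Assumed example data; no real file is read here.
doc = PdfDocument(uri="example.pdf")
doc.add_text("Neural search\n  for PDFs", page=1)
doc.add_text("   ", page=1)  # whitespace-only, so it is dropped
doc.add_image(b"\x89PNG placeholder bytes", page=1)
```

Keeping text and images as sibling pieces under one parent means a later step can pull out just one kind, for example `doc.by_modality("image")`, without losing track of which PDF and page each piece came from.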