Populating the Azure AI Search Index
1. Upload the PDF document
Before populating the Azure AI index, we first need to upload a PDF document to the Azure Storage Account.
The Git repository includes a small sample PDF document, an excerpt about Azure Kubernetes Service (AKS). You can locate and view the PDF at:
./sample-documents/Azure-Kubernetes-Service.pdf. Later, you’ll learn how to index your own PDF documents using this solution. For now, let’s start with this sample PDF.
Run the command below to upload the sample PDF document to Azure Blob Storage:
bash ./helper.sh upload-pdf
Let’s validate that the PDF document was successfully uploaded to the storage account. Follow these steps:
- Click Storage browser
- Click Blob containers, then click the data container
Observe that there is a raw_data directory. Now click the directory and validate that the Azure-Kubernetes-Service.pdf file is present.
Get help!
If you’re unsure which storage account is being used for the PDF document upload, you can retrieve the information by running the command below:
echo "${storage_account_name}"
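The portal check above can also be sketched from the command line. This is a minimal sketch, not part of helper.sh: the function names storage_key and pdf_uploaded are hypothetical, and it assumes the PDF lands under raw_data/ in the data container and that storage_account_name and resource_group_name are set in your shell.

```shell
# Hypothetical helper: print the first access key of the storage account.
storage_key() {
  az storage account keys list \
    --account-name "${storage_account_name}" \
    --resource-group "${resource_group_name}" \
    --output tsv \
    --query "[0].value"
}

# Hypothetical helper: succeed only if the sample PDF is listed under raw_data/.
# Assumption: the upload places the file in the "data" container under raw_data/.
pdf_uploaded() {
  az storage blob list \
    --account-name "${storage_account_name}" \
    --account-key "$(storage_key)" \
    --container-name data \
    --prefix "raw_data/" \
    --output tsv \
    --query "[].name" \
  | grep -qF "Azure-Kubernetes-Service.pdf"
}
```

Running pdf_uploaded && echo "PDF found" reports success only when the blob name appears in the listing.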
At this point the data source should already be prepared, and you most likely also saw the prepaired_data directory.
Go back to the data container and open the prepaired_data directory. Observe:
- There is an image directory
- There is a text directory
Click the text directory and observe that it contains the Azure-Kubernetes-Service.json file. Let’s download it and look at the JSON object structure. Run the command below to download the file.
storage_account_key=$(az storage account keys list \
  --account-name "${storage_account_name}" \
  --resource-group "${resource_group_name}" \
  --output tsv \
  --query "[0].value")
az storage blob download \
  --account-name "${storage_account_name}" \
  --container-name data \
  --name "prepaired_data/text/Azure-Kubernetes-Service.json" \
  --account-key "${storage_account_key}" | json_pp
Observe that the file contains an array of objects, and each object holds a content chunk and an array of image URLs. This means zero or more images can be mapped to a chunk of text, and each object will be mapped to a document in the Azure AI Search index. We’ll look at it more closely once we populate the index.
Before we run the indexer, let’s validate that the images are also present in the image directory.
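If the pretty-printed output is long, a short summary can be easier to read. The sketch below assumes jq is installed (the guide itself only uses json_pp) and that the file is a top-level array as described above. The function name summarize_chunks is hypothetical. It prints the field names of the first object rather than assuming what they are called.

```shell
# Hypothetical helper: read the prepared JSON on stdin and print
# the number of chunks and the field names of the first chunk.
summarize_chunks() {
  jq -r '"chunks: \(length)", "fields: \(.[0] // {} | keys | join(", "))"'
}
```

To use it, pipe the az storage blob download command into summarize_chunks instead of json_pp.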
- Go back to the prepaired_data directory and open the image directory.
- Click the Azure-Kubernetes-Service directory.
- Observe that there are *.png image files.
Feel free to download any of them and open them on your local machine.
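The image check can also be sketched from the command line. This assumes the images sit under prepaired_data/image/Azure-Kubernetes-Service/ in the data container, matching the directories seen in the portal, and that storage_account_key is still set from the download step. The function name count_images is hypothetical.

```shell
# Hypothetical helper: count the .png blobs extracted from the sample PDF.
# Assumption: the blob prefix mirrors the directory layout shown in the portal.
count_images() {
  az storage blob list \
    --account-name "${storage_account_name}" \
    --account-key "${storage_account_key}" \
    --container-name data \
    --prefix "prepaired_data/image/Azure-Kubernetes-Service/" \
    --output tsv \
    --query "[].name" \
  | grep -c '\.png$'
}
```

A result greater than zero suggests the image extraction produced output for this document.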
2. Run indexer
Now that we know the prepared data is ready, we’re finally able to run the indexer and populate the index!
Run the command below to run the indexer.
bash ./helper.sh run-indexer
Let’s validate that the indexer ran successfully. Open the Azure AI Search resource and
- Click Search management and then click Indexers
- Observe that the Status is Success
- Also observe that Docs succeeded is no longer zero but 8/8
- Lastly, observe that Error/Warning shows 0/0, meaning all chunks were indexed successfully.
Let’s explore further details about the indexer process. Click search-aoai-emb-indexer to see more details.
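The same status can be read without the portal through the Azure AI Search REST API. The following is a sketch under stated assumptions: search_service_name and search_admin_key are hypothetical variables that are not defined elsewhere in this guide, jq is installed, and API version 2023-11-01 is available to your service. One way to obtain the key is az search admin-key show with --query primaryKey.

```shell
# Hypothetical helper: print the status and item counts of the last indexer run.
# Assumptions: search_service_name and search_admin_key are set by you;
# the indexer name is the one shown in the portal.
indexer_status() {
  curl --silent --fail \
    --header "api-key: ${search_admin_key}" \
    "https://${search_service_name}.search.windows.net/indexers/search-aoai-emb-indexer/status?api-version=2023-11-01" \
  | jq -r '.lastResult | "\(.status): \(.itemsProcessed) processed, \(.itemsFailed) failed"'
}
```

Calling indexer_status prints one line summarizing the most recent run, which should agree with the Status and Docs succeeded columns in the portal.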
In this view you can see further helpful details, such as:
- Indexer runs over time and the status of each run
- The duration of each run and the number of docs that succeeded or had errors or warnings.
- If there are errors or warnings, you can click the status of the individual run (Success in our case) and see further details about them.
3. Query the Azure AI Search index using the built-in Search feature
Azure AI Search comes with a built-in search feature. Let’s try it out. From the Azure AI Search resource:
- Click Search management and then click Indexes
- Click the search-aoai-emb index name
- In Search explorer, enter kubernetes and click the Search button.
- Observe that Results shows search results relevant to the search query (kubernetes).
You can also write advanced queries using the Search explorer. To do so:
- Click View and select JSON view
- Observe that the simple input field changed to a JSON query editor, where you can write more complex queries using the Lucene query syntax.
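The query from Search explorer can also be sent through the REST API. This is a sketch under assumptions: search_service_name and search_admin_key are hypothetical variables you set yourself (a query key is sufficient here), jq is installed, and API version 2023-11-01 is available to your service. Setting queryType to full selects the Lucene query syntax.

```shell
# Hypothetical helper: search the index for the text given as the first argument.
# jq builds the request body so the search text is JSON-escaped correctly.
search_index() {
  jq -n --arg q "$1" '{search: $q, queryType: "full", top: 3, count: true}' \
  | curl --silent --fail \
      --request POST \
      --header "Content-Type: application/json" \
      --header "api-key: ${search_admin_key}" \
      --data @- \
      "https://${search_service_name}.search.windows.net/indexes/search-aoai-emb/docs/search?api-version=2023-11-01" \
  | jq '{count: ."@odata.count", first_hit: .value[0]}'
}
```

For example, search_index kubernetes prints the total number of matches and the first matching document, whatever fields the index defines.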
Congratulations! You’ve successfully populated the Azure AI Search index and completed the Document Data Management section! Now we’re ready to move on to the next step: configuring and running the demo application!