A confusing generative AI term: ‘domain aware generative AI’


What follows is a warning to anyone looking to buy a new system, built on the same technology that powers ChatGPT, from a startup that may claim to have something brand new. These companies take something you can get today for $360/year/user and charge orders of magnitude more by confusing their customers.

One term that seemed weird to me is “domain aware generative AI.” I dug in and discovered that this phrase is just used so that a customer can’t look up the real name: Open Domain Question Answering (ODQA). If you have Microsoft Office with Copilot enabled, you have this today. It’s the thing that happens when you put a bunch of files in SharePoint and then use Copilot to ask questions about those documents, nothing more. One of the companies using the term claims to have the patent on ODQA: https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/20240062019. Note that one can find papers on ODQA which predate this patent application, so I am of the opinion that Microsoft does not need to worry about a patent troll any time soon. One example which predates the application, and which links to even older articles, is here: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00530/114590/Improving-the-Domain-Adaptation-of-Retrieval.

What is ODQA? It’s a technique for creating answers with a generative AI while significantly reducing the opportunity to hallucinate. ODQA requires a 2-stage pipeline: a retrieval process which selects paragraphs and other text relevant to a given question, combined with a tool that generates an answer from the selected passages. The retrieval stage assumes a search mechanism of some sort. I’ve built such a system. To prepare the content for search, it did the following:

  1. Read the corpus of content and cut things up into paragraphs up to 1K tokens long.
  2. For each paragraph, create an embedding using a model like text-embedding-ada-002 from OpenAI or another embedding model which understands the language of the document.
  3. Store the embedding alongside the document path and page number in a vector database such as Weaviate.
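
The three indexing steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for a real embedding model, whitespace-separated words stand in for tokens, and a plain Python list stands in for the vector database. The names `Record`, `chunk_page`, and `index_page` are made up for this example and do not come from any library.

```python
import hashlib
import math
from dataclasses import dataclass

MAX_TOKENS = 1000  # the "up to 1K tokens" limit from step 1
DIM = 64           # toy embedding size; real models use many more dimensions


@dataclass
class Record:
    doc_path: str
    page: int
    text: str
    embedding: list


def chunk_page(text, max_tokens=MAX_TOKENS):
    """Step 1: split on blank lines, then cap each paragraph at max_tokens words.

    Assumption: whitespace-separated words approximate tokens.
    """
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks


def toy_embed(text, dim=DIM):
    """Step 2: hypothetical stand-in for a real embedding model.

    Hashes each word into a bucket and returns a unit-length vector.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def index_page(store, doc_path, page, text):
    """Step 3: keep each embedding next to its document path and page number."""
    for chunk in chunk_page(text):
        store.append(Record(doc_path, page, chunk, toy_embed(chunk)))


store = []  # stand-in for the vector database
index_page(store, "qc_manual.pdf", 3,
           "Widgets are inspected twice.\n\nRejects are logged.")
for rec in store:
    print(rec.doc_path, rec.page, rec.text)
```

In a real system, `toy_embed` would be replaced by a call to the embedding model and `store` by the vector database; the shape of the record (text, embedding, path, page) is the part that carries over.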

Using that system, one could then take a question like “How does our quality control process provide widgets which satisfy the requirement to <do a task>?” and create an embedding for it with the same model used on the paragraphs. The vector database uses a cosine similarity search to identify the paragraphs most closely aligned with the question. One then submits those paragraphs to an LLM like Mistral, Llama, GPT-4, etc., along with a prompt which states something like:

Answer the question 'How does our quality control process provide widgets which satisfy the requirement to <do a task>?' using the information below. The information is formatted as [DOC NAME]; [Pg Number]. In the response, cite the [DOC NAME] and [Pg Number]. If you can't find the answer in the attached content, respond that you do not know.

That prompt will also contain the relevant content, formatted as described. The LLM should then respond “I do not know the answer” if no answer can be found in that content.
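
The query side can be sketched the same way. The example below is a minimal illustration, not the system I built: the three-dimensional vectors are invented by hand in place of real embeddings, the linear scan in `top_k` stands in for the vector database's search, and the function names are hypothetical. The prompt wording is the one quoted above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=3):
    """Return the k records whose embeddings are most similar to the query.

    Each record is (doc_name, page, text, embedding). A vector database
    does this with an index; a linear scan is enough for a sketch.
    """
    scored = [(cosine_similarity(query_vec, rec[3]), rec) for rec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:k]]


def build_prompt(question, passages):
    """Assemble the instruction plus the passages, each tagged with its source."""
    header = (
        f"Answer the question '{question}' using the information below. "
        "The information is formatted as [DOC NAME]; [Pg Number]. "
        "In the response, cite the [DOC NAME] and [Pg Number]. "
        "If you can't find the answer in the attached content, "
        "respond that you do not know."
    )
    blocks = [f"[{doc}]; [{page}]\n{text}" for doc, page, text, _ in passages]
    return header + "\n\n" + "\n\n".join(blocks)


# Hand-made vectors for illustration only; real ones come from the embedding model.
records = [
    ("qc_manual.pdf", 3, "Widgets are inspected twice.", [0.9, 0.1, 0.0]),
    ("hr_policy.pdf", 7, "Vacation requests go to HR.", [0.0, 0.2, 0.9]),
    ("qc_manual.pdf", 4, "Rejects are logged.", [0.7, 0.6, 0.1]),
]
question = "How does our quality control process catch defective widgets?"
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question

passages = top_k(query_vec, records, k=2)
prompt = build_prompt(question, passages)
print(prompt)
```

The resulting string is what gets sent to the LLM. Because the unrelated HR passage never makes it into the prompt, the model has less material to wander off into, which is where the reduction in hallucination comes from.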

So now, if you hear the term “Domain Aware Generative AI”, know that this is ODQA and is something you have in SharePoint today with Copilot.