Docugami AI Analyzes NASA Archives - New Document Understanding Model

April 12, 2021
The Challenge of Unstructured Data

The prevalence of data is widely discussed, yet a significant portion of global information remains locked within documents. These diverse files and formats represent substantial value, but often lack compatibility with modern, structured database systems.

Docugami is developing a system designed to bridge this gap. It aims to interpret any collection of documents and catalog the information they contain, and NASA has already begun using the technology.

Transforming Documents into Usable Data

Docugami’s technology promises to rapidly convert accumulated document archives into readily accessible data. This capability could revolutionize how organizations leverage their existing information assets.

The generation of documents is inherent to nearly all business operations. Legal professionals deal with contracts and briefs, real estate involves leases and agreements, marketing relies on proposals and releases, and healthcare generates medical charts – the list continues.

These documents exist in a multitude of formats, including Word documents, PDFs, scanned paper copies, and PDFs exported from Word. This variety further complicates data extraction efforts.

Existing Solutions and Their Limitations

Recent efforts to address this challenge have primarily focused on organizational strategies. These strategies emphasize centralizing document storage and enabling collaborative editing.

However, truly understanding the content within the documents has largely been left to human review. This is because deciphering documents presents significant complexities.

The Nuances of Document Comprehension

Consider a standard rental agreement. Humans readily recognize that when a contract identifies “Jill Jackson” as the renter, subsequent references to “the renter” also pertain to the same individual.

Moreover, we intuitively understand that “the renter” in different contracts represents the same category of entity – a person obligated by a lease – but not necessarily the same specific person.

These concepts, while simple for humans, pose substantial challenges for machine learning and natural language understanding systems. Mastering these nuances is crucial for unlocking the wealth of information contained within the world’s vast document repositories.

Successfully extracting this information would provide access to valuable insights currently hidden within millions of documents worldwide.

The Challenge with Modern Documents

Jean Paoli, founder of Docugami, asserts a significant breakthrough in document processing. Given his extensive background at Microsoft, including pivotal work on XML, the format underlying file types such as .docx and .xlsx, his claim carries considerable weight.

Paoli emphasizes a fundamental distinction: “Data and documents are not identical.” He explains that while humans perceive documents, computers operate on data. His initial work at Microsoft centered on bridging this gap by creating a format capable of representing documents as data, leading to the development of XML alongside industry colleagues, with the approval of Bill Gates.

Despite the widespread adoption of these formats, the core problem remains, amplified by increasing digitization. Paoli’s proposed solution mirrors the original approach. He believes documents should be structured similarly to webpages – utilizing nested elements, each defined by metadata – creating a hierarchical structure that computers can readily interpret.
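To make the idea of nested, named elements concrete, here is a minimal sketch in Python using only the standard library. The tag names (Lease, Parties, Renter and so on) and the sample values are assumptions invented for illustration; they are not a schema published by Docugami.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a lease. Tag names and values are illustrative
# assumptions, not Docugami's actual output.
LEASE_XML = """
<Lease>
  <Parties>
    <Renter>Jill Jackson</Renter>
    <Landlord>Acme Properties</Landlord>
  </Parties>
  <Terms>
    <MonthlyRent currency="USD">1500</MonthlyRent>
    <Duration unit="months">12</Duration>
  </Terms>
</Lease>
"""

def flatten(element, path=""):
    """Walk the tree and return (path, text) pairs for every leaf element."""
    current = f"{path}/{element.tag}"
    children = list(element)
    if not children:
        return [(current, (element.text or "").strip())]
    pairs = []
    for child in children:
        pairs.extend(flatten(child, current))
    return pairs

root = ET.fromstring(LEASE_XML)
fields = flatten(root)
for field_path, value in fields:
    print(field_path, "=", value)
```

Because every value sits at a named position in the hierarchy, a program can ask for "/Lease/Parties/Renter" directly instead of searching free text for a name.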

Paoli explored applying AI to this problem, but encountered a significant obstacle. “I needed an algorithm capable of navigating this hierarchical model, and I was told such an algorithm doesn’t exist,” he stated. Today’s AI models have not been combined with the XML model, in which every component is nested inside another and each carries a name describing the data it contains.

This incompatibility isn’t unexpected; new technologies invariably come with inherent assumptions and limitations. Current AI research has prioritized areas like speech recognition and computer vision. However, these approaches don’t align with the requirements for systematically analyzing documents.

“Many perceive documents as analogous to cats. You train AI to identify eyes, tails… but documents aren’t like cats,” Paoli clarifies.

This distinction is crucial. Contemporary AI techniques, such as segmentation and scene understanding, represent advanced forms of object detection – extending beyond cats to encompass dogs, vehicles, facial expressions, and locations. However, documents exhibit too much variability, or conversely, too much similarity, for these methods to achieve meaningful results beyond basic categorization.

While natural language processing (NLP) has made strides, it falls short of Paoli’s needs. “NLP operates primarily at the linguistic level,” he observes. “It analyzes text without considering its context within the document.” He acknowledges the value of NLP specialists, who make up half his team, but stresses the need to pair them with experts in XML and computer vision to gain a fuller understanding of a document’s structure.

Docugami: A New Approach to Document Understanding

Paoli’s objective proved unattainable through the modification of current technologies – even with established methods like optical character recognition – leading him to establish a dedicated AI research team. This team has been engaged in development for approximately two years.

He stated that the project involved fundamental scientific research, funded independently and conducted discreetly, culminating in the submission of numerous patent applications. Subsequently, they approached venture capitalists, and SignalFire readily agreed to lead the seed funding round with a $10 million investment.

Coverage of the funding round gave little sense of what using Docugami is actually like, so Paoli demonstrated the platform with live documents. Direct access was not granted, and the company declined to share screenshots or videos, citing ongoing work on integrations and the user interface. Picturing it therefore takes some imagination, though it closely resembles a typical enterprise SaaS platform.

Users upload a variable number of documents to Docugami, ranging from a few dozen to several thousand. These documents then undergo a machine understanding process, which analyzes their structure – whether they are scanned PDFs, Word files, or other formats – and organizes them into a hierarchical structure akin to XML, tailored to the content itself.

“When presented with, for instance, 500 documents, the system attempts to categorize them into logical sets. Thirty documents may appear similar, twenty others may share characteristics, and five might form a distinct group. This grouping is achieved through a combination of visual cues, content analysis, and an assessment of intended usage,” Paoli explained. While other tools can differentiate between a lease and an NDA, the diversity of documents prevents successful categorization using pre-defined templates. Each document set is potentially unique, prompting Docugami to retrain itself with each upload, even for a single document. “Following grouping, we gain insight into the overall structure and hierarchy of the specific document set, as this is crucial for effective document utilization.”
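The grouping step can be illustrated with a deliberately simple stand-in. The sketch below clusters documents by word overlap (Jaccard similarity) alone; Docugami's system, as Paoli describes it, also weighs visual cues and intended usage, none of which is modeled here. The function names, the 0.5 threshold, and the toy documents are all assumptions made for the example.

```python
def tokens(text):
    """Lowercase word set used as a crude content fingerprint."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_documents(docs, threshold=0.5):
    """Greedily place each document in the first group whose first member is similar enough."""
    groups = []  # each group is a list of document names
    for name, text in docs.items():
        for group in groups:
            representative = docs[group[0]]
            if jaccard(tokens(text), tokens(representative)) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([name])
    return groups

# Toy documents invented for illustration.
DOCS = {
    "lease_a": "this lease agreement is made between the landlord and the renter",
    "lease_b": "this lease agreement is made between the owner and the renter",
    "nda_a": "this mutual nondisclosure agreement protects confidential information",
}

groups = group_documents(DOCS)
print(groups)  # [['lease_a', 'lease_b'], ['nda_a']]
```

No template for "lease" or "NDA" is supplied in advance; the sets emerge from the uploaded documents themselves, which is the property the article emphasizes.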

This capability extends beyond simple indexing or keyword searching. The data contained within the documents – such as payer details, amounts, dates, and conditions – becomes structured and editable within the context of comparable documents. (A small amount of user verification is requested to confirm the system’s deductions.)

This can be hard to picture, so consider the task of compiling a report on your company’s outstanding loans. You highlight the relevant information in a sample document (for example, “Jane Roe,” “$20,000,” and “five years”) and then select the other documents from which to extract the corresponding data. Within moments, the system generates an organized spreadsheet containing names, amounts, dates, and any other information you asked for.
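Once documents have been reduced to named fields, producing such a spreadsheet is straightforward. The following sketch assumes the structuring step has already happened and that each loan document is available as a dictionary of fields; the field names, the second record, and the build_report helper are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical output of the structuring step: one dict of named fields per loan document.
# Only "Jane Roe", "$20,000" and "five years" come from the article; the rest is invented.
LOANS = [
    {"borrower": "Jane Roe", "amount": "$20,000", "term": "five years", "branch": "North"},
    {"borrower": "John Doe", "amount": "$8,500", "term": "two years", "branch": "South"},
]

def build_report(documents, selected_fields):
    """Return CSV text with one row per document and one column per selected field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=selected_fields, lineterminator="\n")
    writer.writeheader()
    for doc in documents:
        # Missing fields are left blank rather than guessed.
        writer.writerow({field: doc.get(field, "") for field in selected_fields})
    return buffer.getvalue()

# The fields a user "highlighted" in the sample document.
report = build_report(LOANS, ["borrower", "amount", "term"])
print(report)
```

The hard part, which this sketch skips entirely, is getting from raw Word files and scans to the dictionaries at the top; that is the step Docugami claims to automate.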

This data is designed to be transferable, with planned integrations with various business systems. This will enable automated report generation, alerts based on specific criteria, and the automated creation of templates and standardized documents, eliminating the need to maintain outdated versions with placeholders.

Importantly, this entire process occurs within approximately thirty minutes of the initial upload, without requiring any labeling, pre-processing, or cleaning. The AI does not rely on preconceived notions of document formats; it learns directly from the uploaded documents – their structure, the relative positioning of elements like names and dates, and so on. This functionality is accessible across various industries and features an intuitive interface that can be mastered in minutes, whether in healthcare data management or construction contract administration.

The platform offers two primary interfaces: a web-based tool for document ingestion and creation, and an integration within Microsoft Word. Within Word, Docugami functions as an intelligent assistant, possessing comprehensive knowledge of all documents of a given type, facilitating the creation of new documents, pre-filling standard information, and ensuring regulatory compliance.

While processing legal documents may not represent the most groundbreaking application of machine learning, its significance warrants attention. Similar deep understanding of document types is currently limited to established industries with standardized documentation, such as law enforcement or healthcare. Developing custom models for niche businesses, like a kayak rental service, remains a distant prospect. However, small businesses possess just as much valuable information within their documents as large enterprises – and often lack the resources to employ data scientists. Even large organizations find complete manual processing impractical.

A Wealth of Information at NASA

A task that appears remarkably simple to humans presents a significant challenge for artificial intelligence. A person could readily review a collection of twenty comparable documents and produce a corresponding list of names and figures, potentially even faster than an AI system can ingest and learn from them.

However, the core purpose of AI is to replicate and surpass human capabilities. While an account manager might routinely compile monthly reports on twenty contracts, generating a daily report on a thousand represents a vastly different scale of operation. Docugami effectively handles both scenarios, making it suitable for large enterprises requiring scalability and for NASA, which faces a substantial backlog of documentation.

NASA possesses an extensive collection of documents. Its archives, generally well-maintained, date back to its inception, with numerous important records publicly accessible. Exploring these historical documents can be a rewarding experience.

Currently, NASA isn’t focused on uncovering new details about Apollo 11. Through its diverse programs, solicitations, grant initiatives, budgetary allocations, and engineering endeavors, the agency accumulates a considerable volume of documentation – a natural consequence of its position within the federal government. This vast archive holds considerable, yet unrealized, potential.

Valuable insights, research foundations, engineering resolutions, and numerous other critical pieces of information reside within files that may be searchable using basic keyword matching, but lack comprehensive structure. Imagine a JPL engineer quickly accessing a complete, up-to-date compilation of documents concerning the evolution of nozzle design, categorized by type, date, author, and status. Consider a patent advisor assisting a NIAC grant recipient; shouldn't they be able to locate relevant prior art with greater precision than a simple keyword search allows?

The NASA SBIR grant, awarded last summer, isn’t tied to a specific project, such as gathering all documents of a particular type from the Johnson Space Center. It’s an exploratory agreement, common among these grants, and Docugami is collaborating with NASA scientists to determine the optimal application of the technology to their archives. (A particularly promising application could be within the SBIR program and other small business funding initiatives.)

A separate SBIR grant from the NSF differs in its focus. While NASA is exploring methods to better organize a large variety of documents with some shared information, the NSF project aims to more accurately identify “small data.” Paoli explained, “We are concentrating on the minute details. For example, when a name appears, does it refer to the lender or the borrower? The physician or the patient? If a patient record mentions penicillin, is it a prescription or a contraindication? If sections are labeled ‘allergies’ and ‘prescriptions,’ we can establish that connection.”
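Paoli's penicillin example comes down to reading a term's role from the section that contains it. A minimal sketch of that idea follows, assuming the record has already been split into labeled sections; the record contents, the SECTION_ROLES mapping, and the role_of function are assumptions for illustration, not part of any Docugami or NSF deliverable.

```python
# Hypothetical structured patient record: section labels map to the items listed under them.
RECORD = {
    "allergies": ["penicillin"],
    "prescriptions": ["ibuprofen"],
}

# Assumed mapping from section label to the role a mention plays there.
SECTION_ROLES = {
    "allergies": "contraindication",
    "prescriptions": "prescription",
}

def role_of(term, record):
    """Return the role implied by the section in which the term appears, or None."""
    for section, items in record.items():
        if term.lower() in (item.lower() for item in items):
            return SECTION_ROLES.get(section, "unknown")
    return None

print(role_of("penicillin", RECORD))  # contraindication
```

A plain keyword search would find "penicillin" in either case; only the surrounding structure says whether the drug should be given or avoided.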

  • NASA maintains extensive archives dating back to its founding.
  • Docugami aims to unlock valuable insights from NASA’s documentation backlog.
  • SBIR grants are funding exploratory research into applying the technology.

Effectively managing and analyzing this data is crucial for advancing future space exploration and research initiatives.

A Novel Approach to Scientific Collaboration

When it was suggested that SBIR grants, with their limited funding, could not sustain the business on their own, Paoli responded with amusement.

He clarified that grant funding wasn't the primary objective. Instead, it served as a means to collaborate with leading scientists and prestigious laboratories globally. He also indicated that numerous additional grant projects were anticipated.

He described science itself as a driving force, a source of energy for the company. The core business model, he explained, is subscription-based, akin to established platforms like Docusign or Dropbox.

The company is transitioning into its core operational phase, having recently established partnerships with integrators and initiated testing procedures. Expansion of the private beta program is planned over the next twelve months, with a public launch date yet to be determined.

“We are still in our early stages,” Paoli said. “Just a year ago, our team consisted of only five or six individuals. Following a successful $10 million seed funding round, we experienced rapid growth.”

Paoli expressed confidence that the venture would not only achieve financial success but also fundamentally alter conventional business practices.

“There’s a clear affinity for documentation,” he observed, adding with a touch of humor, “Perhaps it’s a cultural trait stemming from my French heritage.” He further elaborated that written communication, books, and the act of writing are integral to human cognition.

The company believes in a synergistic relationship, where human insight enhances machine intelligence, and conversely, machines augment human thought processes.
