Gen2: The Next Step Forward for AI Document Review

A text & vision AI system that understands corporate documents as words and images.

Generation 2: 2022

Exploring bigger architectures and introducing multilingual analysis

  • Training transformer-based architectures (e.g., RoBERTa [9], XLM-RoBERTa [10]) as language models on the legal domain
  • Introducing multilingual analysis support
  • Applying distillation techniques to retain the legal knowledge while reducing the size of deployments (see our paper "Legal-Tech Open Diaries: Lesson learned on how to develop and deploy light-weight models in the era of humongous Language Models")
  • Leveraging large machine translation models (e.g., Facebook's M2M [11]) to translate and augment our existing datasets
  • Building a custom automated alignment procedure that maps the annotations from the English source documents to the translated ones
  • Fine-tuning the above architectures on our downstream tasks by adding classification head(s) on top, improving our performance and capabilities
  • Semi-supervised/zero-shot cross-lingual transfer learning techniques [12] using a teacher/student architecture, where the teacher is a high-capacity model trained offline and then used to predict labels for non-annotated instances on which the student is trained
  • Integration with HuggingFace

Language Modeling

Language modeling is a crucial aspect of natural language processing (NLP) and artificial intelligence (AI) that focuses on developing algorithms and models capable of understanding, generating, and predicting human language. At its core, a language model learns the structure, patterns, and relationships within a given language, enabling it to perform tasks such as speech recognition, machine translation, text generation, and sentiment analysis. The primary goal of language modeling is to capture human language's inherent complexity and diversity. This involves understanding the contextual nuances, grammatical rules, semantics, and syntactic structures that govern language use. In our case, language modeling becomes even more crucial since legal documents often have a complex and specialized language structure, and it helps to interpret and generate text that adheres to legal terminology, ensuring accuracy and precision in understanding the content.

Therefore, by applying language modeling, we moved from Hierarchical Neural Networks with self-attention mechanisms to Transformer models [3]. A key advantage of Transformers is that they excel at capturing contextual information, considering the entire context of a document rather than relying solely on hierarchical structures. This contextual understanding is crucial in legal texts, where the meaning of terms and phrases can be highly dependent on the broader context. Legal documents also often involve long-range dependencies and complex relationships between different sections or clauses. Transformers are designed to handle such dependencies efficiently, making them well-suited for tasks like contract analysis, where understanding connections between distant parts of a document is essential. Finally, Transformers use an attention mechanism to focus on specific parts of the input sequence, which is advantageous in legal language modeling because it enables the model to prioritize relevant information, making it more accurate in understanding and generating legal text.
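
For illustration, here is a minimal sketch of how such domain-adaptive language-model pre-training can be set up with the HuggingFace transformers library; the checkpoint name, corpus path, and hyper-parameters are illustrative assumptions rather than our production configuration.

    # Sketch: continued masked-language-model pre-training of XLM-RoBERTa on legal text.
    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

    # Plain-text legal corpus, one document per line (placeholder path).
    corpus = load_dataset("text", data_files={"train": "legal_corpus.txt"})
    tokenized = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    # Randomly mask 15% of tokens: the standard MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="legal-xlm-r", per_device_train_batch_size=8,
                               num_train_epochs=1),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()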

Multilingual

The above architectures were trained on the multilingual datasets below, enabling them to handle multilingual legal documents in those 10 languages more easily:

  • EUR-LEX (10 languages, 10 × 65K = 650K EU laws)
  • UK-LEX (1 language, 35K UK laws)
  • US CODES (1 language, 250 US Code books)
  • LEDGAR (10 languages, 10 × 900K = 9M contractual sections)
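
As a rough sketch, the public counterparts of two of these corpora can be pulled from the HuggingFace Hub as shown below; the dataset identifiers are those of the public Hub releases (MultiEURLEX and the English LEDGAR corpus from LexGLUE), not our internal multilingual versions.

    from datasets import load_dataset

    # MultiEURLEX: EU laws with parallel text across languages.
    eurlex = load_dataset("multi_eurlex", "all_languages", split="train")
    # LEDGAR (via LexGLUE): English contractual sections labelled by provision type.
    ledgar = load_dataset("lex_glue", "ledgar", split="train")

    print(eurlex[0].keys(), ledgar[0].keys())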

Distillation

In the paper we published, Maroudas et al. (2022) [13], we experimented with a full-scale pipeline for model compression that includes:

  1. Parameter Pruning,
  2. Knowledge Distillation, and
  3. Quantization

and produced much more efficient (smaller and faster) models that can be effectively deployed in production environments.
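
As an illustration of the knowledge-distillation step, here is a minimal sketch of a distillation loss in which a compact student is trained to match a large teacher's softened output distribution; the temperature and weighting values are illustrative, not those used in the paper.

    # Sketch of a distillation loss: the student mimics the teacher's softened
    # logits (KL term) while still fitting the gold labels (cross-entropy term).
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

Pruning removes the least important weights or attention heads before this step, and post-training quantization (e.g., torch.quantization.quantize_dynamic over the linear layers) can shrink the distilled model further.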

Machine Translation - Alignment

Since we had developed those powerful multilingual models with legal domain knowledge, we had to find a way to leverage the English-annotated data (custom insights) we already had. We used large machine translation models such as:

  1. Fairseq [14] & M2M [11]: both developed by Facebook AI Research. Fairseq is an open-source sequence-to-sequence learning toolkit that supports various NLP tasks, including machine translation; M2M-100 is the first multilingual machine translation (MMT) model that can translate between any pair of 100 languages without relying on English data.
  2. MarianMT: developed by the team behind the Marian Neural Machine Translation framework, MarianMT is known for its efficiency and scalability.

We translated our documents and, in parallel, developed a custom alignment procedure based on embedding similarities, mapping the annotations of the English source documents to the corresponding translated ones.
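
For illustration, the sketch below shows a translation step with the publicly released M2M-100 checkpoint and a simplified embedding-similarity alignment pass; the sentence-embedding model and the nearest-neighbour matching are simplified assumptions, not our exact production procedure.

    # Sketch: translate English text with M2M-100, then align annotated source
    # sentences to translated sentences by cosine similarity of embeddings.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
    from sentence_transformers import SentenceTransformer, util

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    def translate(text, src="en", tgt="de"):
        tokenizer.src_lang = src
        encoded = tokenizer(text, return_tensors="pt")
        generated = model.generate(**encoded,
                                   forced_bos_token_id=tokenizer.get_lang_id(tgt))
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    # Map each annotated English sentence to its closest translated sentence.
    embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def align(source_sentences, translated_sentences):
        src_emb = embedder.encode(source_sentences, convert_to_tensor=True)
        tgt_emb = embedder.encode(translated_sentences, convert_to_tensor=True)
        sims = util.cos_sim(src_emb, tgt_emb)   # pairwise cosine similarities
        return sims.argmax(dim=1).tolist()      # best-matching target index per source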

Fine-tuning on Downstream Tasks

Having pre-trained the above multilingual architectures, we fine-tuned them on our downstream tasks by adding classification head(s) on top:

  1. ClassificationHead(): in the cases of document classification and sentence classification, where the [CLS] representation of the document/sentence passes through this head.

    import torch
    from torch import nn

    class ClassificationHead(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)
            self.out_proj = nn.Linear(config.hidden_size, config.num_labels)  # one logit per label

        def forward(self, features, **kwargs):
            x = features[:, 0, :]  # take token <s> (equiv. to [CLS])
            x = self.dropout(x)
            x = self.dense(x)
            x = torch.tanh(x)
            x = self.dropout(x)
            x = self.out_proj(x)
            return x

  2. TokenClassificationHead(): in the cases of sequence tagging tasks such as ContractParties, EffectiveDates, etc.

    from torch import nn

    class TokenClassificationHead(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.dropout = nn.Dropout(config.hidden_dropout_prob)
            # one logit per label for every token in the sequence
            self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        def forward(self, outputs):
            sequence_output = outputs[0]                  # per-token hidden states
            sequence_output = self.dropout(sequence_output)
            logits = self.classifier(sequence_output)     # (batch, seq_len, num_labels)
            return logits
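
For context, here is a minimal sketch of how such a head sits on top of the pre-trained encoder during fine-tuning; the checkpoint name and example text are illustrative assumptions.

    # Sketch: the encoder produces hidden states; the head turns them into logits.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    encoder = AutoModel.from_pretrained("xlm-roberta-base")
    head = ClassificationHead(encoder.config)             # head defined above

    inputs = tokenizer("This Agreement is made on 1 January 2022.",
                       return_tensors="pt")
    hidden_states = encoder(**inputs).last_hidden_state   # (batch, seq_len, hidden)
    logits = head(hidden_states)                          # (batch, num_labels)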


Zero-shot cross-lingual transfer learning

We continued our research with work on realistic zero-shot cross-lingual transfer in legal topic classification [12], where we show that translation-based methods vastly outperform cross-lingual fine-tuning of multilingually pre-trained models (the best previous zero-shot transfer method for MultiEURLEX [15]). We also developed a bilingual teacher-student zero-shot transfer approach, which exploits additional unlabeled documents of the target language and performs better than a model fine-tuned directly on labeled target-language documents.
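
A schematic sketch of that teacher-student loop is shown below; the confidence threshold and the batch handling are illustrative simplifications of the recipe described in the paper [12].

    # Sketch: a high-capacity teacher pseudo-labels unlabelled target-language
    # documents; a smaller student is then trained on the confident predictions.
    import torch

    def pseudo_label(teacher, unlabeled_batches, threshold=0.9):
        teacher.eval()
        pseudo_data = []
        with torch.no_grad():
            for batch in unlabeled_batches:                     # tokenized target-language docs
                probs = torch.sigmoid(teacher(**batch).logits)  # multi-label topic probabilities
                keep = probs.max(dim=-1).values > threshold     # confidence filter per document
                pseudo_data.append((batch, (probs > 0.5).float(), keep))
        return pseudo_data

    # The student is then fine-tuned on these pseudo-labels exactly as it would
    # be on gold annotations.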

Integration with HuggingFace

Both models and datasets are stored in a HuggingFace organization account, with versioning handled through git.
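
A minimal sketch of that workflow with the Hub API is shown below; the organization, repository, and file names are placeholders.

    # Sketch: publishing a fine-tuned model, its tokenizer, and a dataset to a
    # (private) HuggingFace organization; "our-org" and the paths are placeholders.
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base")
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

    model.push_to_hub("our-org/legal-xlm-r", private=True)
    tokenizer.push_to_hub("our-org/legal-xlm-r", private=True)

    dataset = load_dataset("csv", data_files="annotations.csv")
    dataset.push_to_hub("our-org/legal-annotations", private=True)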