
Leveraging GPT-J for Information Extraction: Insights from Our Experiments

Overview of GPT-J

In the rapidly evolving field of natural language processing (NLP), the adoption of powerful transformer models like GPT-J represents a significant leap forward in tasks such as sentence correction and attribute extraction. This blog post delves into a series of experiments conducted with GPT-J to tackle two specific NLP challenges on retail product data: entity extraction and sentence correction.

The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.

GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
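A quick back-of-the-envelope check shows how these dimensions add up to roughly 6B parameters. The sketch below ignores biases and LayerNorm weights, so it is an approximation rather than the exact published count:

```python
# Rough parameter count for GPT-J-6B from its published dimensions
# (ignores biases and LayerNorm weights).
d_model, d_ff, n_layers, vocab = 4096, 16384, 28, 50257

embed = vocab * d_model                  # token embedding matrix
attn_per_layer = 4 * d_model * d_model   # Q, K, V, and output projections
mlp_per_layer = 2 * d_model * d_ff       # feedforward up- and down-projections
layers = n_layers * (attn_per_layer + mlp_per_layer)
lm_head = vocab * d_model                # GPT-J's LM head is untied from the embedding

total = embed + layers + lm_head
print(f"~{total / 1e9:.2f}B parameters")
```

The result lands close to the advertised 6B, with the transformer layers dominating the count.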

Information Retrieval Using LLMs

Our use cases involved working with retail product data to:

Entity Extraction

Given a blob of text, extract the domain-specific keywords or entities. Our specific use case is extracting product attribute key-value pairs. Examples:

Source Text:
Information Retrieval (IR) is the process of obtaining resources relevant to the information need. For instance, a search query on a web search engine can be an information need. The search engine can return web pages that represent relevant resources.

Extracted Entities:
information retrieval, search query, relevant resources

Source Text:
Great Value Light Roast 100% Arabica Coffee K-Cup Packs, French Vanilla, 48 Ct (2 Pack)

Extracted Entities:
"great value", "light roast", "100", "arabica coffee", "k", "cup packs", "french vanilla", "48 ct", "2 pack"

Source Text:
Fruit is our 1ST Ingredient! Welch's FAMILY FARMER OWNED MIXED FRUIT NATURAL & ARTIFICIAL FLAVORS Fruit Snacks Now With More REAL Fruit 100% Vitamin C DV Per Serving 25% Vitamins A & E FAT FREE Gluten Free No Preservatives PER SERVING 90 CALORIE 0g SAT FAT 0% DV 10mg SODIUM 1% DV 12g SUGARS NET WT 4 oz. (113g)

Extracted Entities:
"fruit is", "1st ingredient", "welch", "family farmer owned", "mixed fruit", "natural", "artificial", "flavors fruit snacks", "now with more", "real fruit", "100", "vitamin c", "dv", "per serving", "25", "vitamins", "e", "fat free", "gluten free", "no preservatives", "per serving", "90 calorie", "0g", "sat fat", "0", "dv", "10mg", "sodium 1", "dv", "12g sugars", "net wt", "4 oz", "113g"

Sentence Correction

Correcting product text content with the right words or information. Specifically, when text extraction or OCR produces noisy text, can the model correct the sentence to create robust content? Examples:



Source Sentence:
I love goin to the beach.

Corrected Sentence:
I love going to the beach.

Source Sentence:
Lipton Soup SecretsabHEESENoodle SoupSoup Mix with Real Chicken BrothVow With Bigger NVith Bigger NoodlesServingSuggestionbeeshePER 2 TBSP DRY MIX50 | Og || 650mgCALORIESSAT FAT SODIUM0%DV 27%DVSUGARS2 POUCHESNET WT 4.5 OZ (1279)0%DV

Corrected Sentence:
Lipton Soup SecretsCHEESENoodle SoupSoup Mix with Real Chicken BrothWith Bigger NWith Bigger NoodlesServingPER 3 TBSP DRY MIXOg 650mg70 CALORIES0g SAT FAT 0%DV660 mg SODIUM 27%DV0g SUGARS 0%DV2 POUCHESNET WT 4.5 OZ (139g)

Source Sentence:
LiptonRecipe SecretsOnion MushroomRecipe Soup & Dip Mix28PER 123 TBSP DRY MIX35 0 690mgO,SUGARSCALORIESSAT FAT0% DVSODIUM29% DVSERVING SUGGESTION2 ENVELOPES NET WT 1.8 OZ (519)

Corrected Sentence:
LiptonRecipe SecretsOnion MushroomRecipe Soup & Dip MixPER 1 2/3 TBSP DRY MIX35 CALORIES0 SAT FAT, 0% DV690mg SODIUM 29% DV0g SUGARSSERVING SUGGESTION2 ENVELOPES NET WT 1.8 OZ (51g)
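For fine-tuning a causal LM on this task, each noisy/clean pair can be serialized into a single training string with a separator the model learns to complete past. A hypothetical data-preparation sketch (the JSONL layout, `text` field name, and `<|sep|>` separator are assumptions for illustration, not our exact format):

```python
import json

# Serialize (noisy, corrected) pairs into JSONL records for causal-LM
# fine-tuning. At inference, the model is prompted with "noisy<|sep|>"
# and generates the corrected continuation.
pairs = [
    ("I love goin to the beach.", "I love going to the beach."),
]

def to_record(noisy, clean, sep="<|sep|>"):
    # One flat text field per example; the separator marks where
    # correction begins.
    return {"text": f"{noisy}{sep}{clean}"}

jsonl = "\n".join(json.dumps(to_record(n, c)) for n, c in pairs)
print(jsonl)
```

If a custom separator token is used, it should also be added to the tokenizer so it is not split into subwords.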


Overcoming Resource Challenges

One of the primary challenges encountered was the model's high resource demands, particularly in terms of GPU and CPU RAM. This was addressed through a series of trials, each building on the lessons of the previous:

  • Trial 1: Confronted with disk space issues, an 80GB instance was created to accommodate the model's size.

  • Trial 2: Encountered CPU RAM limitations, leading to an upgrade to larger CPU RAM options.

  • Trial 3: Faced issues with data types and memory efficiency, which were resolved by switching from FP32 to FP16 to reduce RAM usage, albeit at the cost of increased computational demand.
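The FP32-to-FP16 switch in Trial 3 roughly halves the memory needed just to hold the weights, which is easy to sanity-check with a quick estimate (weights only; activations, optimizer state, and framework overhead come on top):

```python
# Approximate RAM needed to hold GPT-J's ~6B weights alone.
params = 6e9
fp32_gb = params * 4 / 1024**3   # 4 bytes per FP32 parameter
fp16_gb = params * 2 / 1024**3   # 2 bytes per FP16 parameter
print(f"FP32: ~{fp32_gb:.0f} GB, FP16: ~{fp16_gb:.0f} GB")
```

This is why the FP32 checkpoint would not fit comfortably on the smaller instances, while FP16 brought the footprint within reach.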

Testing and Fine-Tuning

With GPT-J successfully running, the focus shifted to testing and fine-tuning for specific use cases:

  • Sentence Correction: Initial tests revealed subpar performance, attributed to insufficient fine-tuning and the need for more extensive training data.

  • Attribute Extraction: Experiments with synthetic data for extracting net weight from product descriptions highlighted the model's potential but also its sensitivity to training data quality and variety.

Leveraging DeepSpeed for Enhanced Performance

Recognizing the need for more efficient training given the model's resource demands, DeepSpeed was introduced into the workflow. This powerful tool enabled more effective training and fine-tuning of GPT-J by allowing for:

  • Batch Sizes and Gradient Accumulation: Managing memory usage more effectively by adjusting batch sizes and employing gradient accumulation strategies.

  • Gradient Checkpointing: Reducing memory overhead by recomputing activations on demand during the backward pass, rather than saving all activations from the forward pass.

  • Optimization Techniques: Exploring memory-efficient optimizers like Adafactor, as opposed to the traditional Adam optimizer, to further enhance performance.
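These levers map directly onto a DeepSpeed configuration. A minimal illustrative sketch (the values are placeholder starting points to tune, not our production settings):

```python
# Minimal DeepSpeed-style configuration covering the levers above.
# Values are illustrative starting points, not tuned recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # keep per-step memory small
    "gradient_accumulation_steps": 16,    # effective batch size = 1 * 16
    "fp16": {"enabled": True},            # half-precision training
    "zero_optimization": {"stage": 2},    # partition optimizer state and gradients
}
# Gradient checkpointing is enabled separately on the model, e.g. via
# model.gradient_checkpointing_enable() in Hugging Face Transformers.
```

With this setup, each optimizer step accumulates gradients over 16 micro-batches, trading extra compute for a much smaller peak memory footprint.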

Key Takeaways and Best Practices

Through a series of trials, errors, and successes, several key takeaways emerged:

  1. Resource Management: GPT-J's resource demands are significant, necessitating careful planning and optimization of disk and RAM usage.

  2. Data Quality and Quantity: The quality and diversity of training data are paramount, especially for tasks like sentence correction and attribute extraction, where context and nuance are crucial.

  3. Efficient Training Techniques: Techniques like gradient accumulation, checkpointing, and the use of memory-efficient optimizers are vital for managing the computational demands of large models like GPT-J.

  4. Continuous Testing and Fine-Tuning: Achieving appreciable performance requires an iterative process of testing, fine-tuning, and, when necessary, augmenting the training data to address specific use case challenges.


