Finetuning GPT-2 for scientific text generation
 
    
Suggesting that deep learning models are capable of generating realistic text from a prompt would be an understatement. Ever since the advent of Transformer models, natural language processing has been undergoing a revolution. Large language models (LLMs), and generative models in general, have received public attention with the releases of text-to-image models (Stable Diffusion) and of course the ChatGPT chatbot. While LLMs have impressive general capabilities for text generation, they can be challenging to use due to their size (hundreds of millions or even billions of trainable parameters). This post presents an experiment I carried out to assess the capabilities of the small version of GPT-2 (124 million trainable parameters) on the more constrained task of generating scientific text.
Creating a text corpus
First, a text corpus is needed for training. The task is causal language modeling, which essentially means training the model to predict the next word (token) based on the preceding context. Text is extracted from scientific research papers using GROBID, a machine-learning-based tool for parsing PDF files. GROBID can be deployed as a REST API, and parsing can be done using one of the clients. I ran the GROBID service locally and used the Python client to parse the PDF files.
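To make the training objective concrete, here is a toy sketch of how a piece of text turns into (context, next token) pairs. It is only an illustration: splitting on whitespace stands in for GPT-2's byte-pair tokenizer, and the function name `next_token_pairs` is made up for this example.

```python
# Toy illustration of causal language modeling targets.
# Assumption: whitespace splitting replaces GPT-2's real byte-pair tokenizer.
def next_token_pairs(text):
    tokens = text.split()
    # Each prefix tokens[:i] is the context; tokens[i] is the token to predict.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("lakes emit carbon dioxide")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

During training, the model sees all of these prefixes at once and is penalized for every next token it gets wrong.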
The following steps are required to build the corpus:
- Install GROBID according to the documentation 
- Install the Python client library 
- Navigate to the GROBID directory and start the GROBID service: 
cd grobid-0.7.2
./gradlew run
- Use the command line tool from the Python client to parse all PDF files in a folder using the processFulltextDocument service. Note that the segmentSentences flag is also used, which allows citations to be removed easily later on:
grobid_client --input directory_with_pdfs/ \
              --segmentSentences \
              --force processFulltextDocument
- The PDF files are parsed to XML. All text from the document body can be merged and dumped to a text file using the BeautifulSoup library in a simple Python script:
from bs4 import BeautifulSoup as bs
from pathlib import Path
txt_list = []
pdf_dir = Path("directory_with_pdfs/")
#Iterate through all TEI XML files written by GROBID
for file in pdf_dir.glob("*.tei.xml"):
    with open(file, 'r', encoding='utf-8') as tei:
        soup = bs(tei, 'lxml')
    #Drop every sentence that contains a citation
    for x in soup.find_all('s'):
        if x.find('ref'):
            x.decompose()
    #Merge the remaining sentences into one string per paper
    x = " ".join([i.getText() for i in soup.find_all('s')])
    
    #Perform some cleaning of rare symbols, double spaces, etc.
    #This step can be developed to further boost performance
    x = x.replace("•", "")
    x = x.replace(" .", ".")
    x = x.replace(" ,", ",")
    x = " ".join(x.split())
    
    txt_list.append(x)
#Write to file, one paper per line
with open("corpus.txt", 'w', encoding='utf-8') as f:
    for i in txt_list:
        f.write("%s\n" % i)
That is it! I used the described approach to extract text from ~500 scientific research papers. The papers mainly investigate carbon cycling and emissions in lakes and streams.
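To see why sentence segmentation makes citation removal easy, the sketch below applies the same idea to a tiny hand-written fragment. It deliberately swaps BeautifulSoup for the standard library's ElementTree so it runs without extra packages. The XML fragment is hypothetical and only imitates sentence-segmented GROBID output, and the helper name `sentences_without_citations` is made up for this example.

```python
import xml.etree.ElementTree as ET

# TEI namespace in ElementTree's {uri}tag notation (assumed to match GROBID output).
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment imitating sentence-segmented TEI XML.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div><p>
<s>Lakes emit CO2 to the atmosphere.</s>
<s>This was shown earlier <ref type="bibr">(Smith, 2010)</ref>.</s>
<s>Small lakes are often supersaturated.</s>
</p></div></body></text></TEI>"""

def sentences_without_citations(xml_string):
    root = ET.fromstring(xml_string)
    kept = []
    for s in root.iter(TEI_NS + "s"):
        # Skip any sentence that contains a <ref> element anywhere inside it.
        if s.find(".//" + TEI_NS + "ref") is not None:
            continue
        # Join the sentence text and normalize whitespace.
        kept.append(" ".join("".join(s.itertext()).split()))
    return kept

clean = sentences_without_citations(sample)
print(" ".join(clean))
```

Dropping whole sentences is a blunt approach, since it also discards the claim being cited, but it avoids leaving dangling fragments such as "(Smith, 2010)" in the training text.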
Training a model
Fortunately, the transformers library by Hugging Face has made the process of training NLP models very easy. I used the run_clm.py script (in the examples/pytorch/language-modeling/ directory of the transformers repository) to finetune a pretrained version of the small GPT-2 model:
python run_clm.py \
       --model_name_or_path gpt2 \
       --train_file corpus.txt \
       --validation_split_percentage 5 \
       --per_device_train_batch_size 4 \
       --per_device_eval_batch_size 4 \
       --do_train \
       --do_eval \
       --output_dir gpt2_finetuned \
       --fp16 \
       --overwrite_output_dir \
       --gradient_checkpointing \
       --num_train_epochs 10 \
       --warmup_steps 200 \
       --gradient_accumulation_steps 2 \
       --no_keep_linebreaks
Training this model can be done on consumer hardware, in my case a GPU with 8 GB of memory. This is thanks to half-precision training (the fp16 flag), gradient accumulation, and gradient checkpointing, which all reduce the GPU memory footprint of training. On the validation set, the model obtains a next-token accuracy of 42% and a perplexity of ~23. But how good is the model at generating scientific text?
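As a side note on these numbers, the short sketch below works through two pieces of arithmetic. It assumes perplexity is the exponential of the mean per-token cross-entropy loss (the usual definition) and that training runs on a single GPU. The variable names are only illustrative.

```python
import math

# Approximate validation perplexity reported in this post.
perplexity = 23.0
# Assumption: perplexity = exp(mean cross-entropy loss), so loss = ln(perplexity).
eval_loss = math.log(perplexity)
print(round(eval_loss, 2))  # cross-entropy in nats per token

# Effective batch size on one GPU: sequences per step times accumulation steps.
per_device_batch = 4
accumulation_steps = 2
effective_batch = per_device_batch * accumulation_steps
print(effective_batch)
```

Gradient accumulation thus gives the optimizer the same batch of 8 sequences per update as a larger batch would, while only 4 sequences need to fit in GPU memory at a time.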
Text generation examples
Overall, the text generation capabilities of the model are impressive given the small size of the model and the size of the text corpus. It is straightforward to generate text from a prompt:
from transformers import pipeline
#Load model and move to GPU
model = pipeline('text-generation', model='./gpt2_finetuned', device="cuda:0")
#Generate text and show results
#(penalty_alpha combined with top_k selects contrastive search decoding)
result = model("Input prompt", penalty_alpha=0.7, top_k=5, max_new_tokens=300)
print(result[0]["generated_text"])
Below are three unedited examples generated by the model, each beginning with its input prompt:
We found that CO 2 emissions from small lakes are influenced by the lake size, the water temperature and the relative contributions of the three major drivers to lake size: (i) water temperature and the water column depth; (ii) lake area (area divided by lake area); and (iii) the ratio of lake area and the lake size. The relationship of surface area to total water volume (TP) was significant for the lakes with the highest TP. The relationship between surface area and TP (R 0.98, p = 0.001) was also significant (R 0.98, p = 0.001). In contrast, for lakes with the smallest TP, TP was significantly correlated with the water column depth (R 0.98, p = 0.001), with the largest TP in the smallest lake having the smallest TP. For the small lakes, there was no significant relationship between the TP and TP, but TP was significant (r = 0.98, p < 0.001) for the small lakes, with TP significantly correlated with water temperature (R 0.98, p = 0.001), and for the small lake with the largest TP. For both TP and surface area, TP was positively related to TP and TP was positively related to surface area in all the lakes except for the Lake Mendota (r = −0.90, p = 0.0001)…
Stratification affects lake metabolism, but the magnitude of the effect varies among sites. In this study we investigated how lake metabolism changes over time with lake surface area and lake morphometry. We hypothesized that the magnitude of lake metabolism would be influenced by the size of the lake and morphometry, which would influence the magnitude of lake metabolism. To test the hypothesis, we conducted field and laboratory tests to determine if lake metabolism changes over time with lake surface area (i.e., surface area × lake area) and lake morphometry, which would affect the magnitude of metabolism…
Catchment land use affects the carbon cycle in the watershed, and it has implications for the management of the watershed and the ecosystem. In particular, the impact of watershed land use on the carbon cycle in lakes and reservoirs is likely to be more pronounced for larger watersheds. In the case of lakes, the carbon cycle of lakes may be dominated by the carbon dioxide (CO 2 ) cycle (Sec. 2.1), which is dominated by the carbonate dissolution of organic matter and carbonate precipitation. However, lakes and reservoirs are also affected by other processes (e.g., erosion, sedimentation, and water quality management), which are likely to have a direct impact on the carbon cycle. For example, in a large watershed like the Great Lakes, the carbonate precipitation in the lake is likely to be more important than the organic matter content of the lake, which may have a direct effect on the carbon cycle in the lake…
Concluding remarks
The results do not suggest that I should be worried about GPT-2 taking over my scientific writing just yet. However, notice the phrasing of sentences and the use of common abbreviations, statistics, etc., which all sound very scientific even though some of it is simply wrong. I am sure that such a model can provide value during the process of writing scientific research papers. The model does not write factual truth but can suggest wording, keywords, and topics from very limited input. Remarkably, this can be done using a small model that can be trained on an average gaming desktop computer. I believe that many would benefit from having a personalized writing assistant as illustrated here. Such a model might also be easier to get into the hands of many, as serving the bigger and more prominent LLMs of GPT-3-like size at scale requires dedicated hardware and a great deal of power. So far, everything indicates that increasing the size of datasets and models results in better and better performance, but much can be achieved by the targeted use of smaller and more efficient models.
