Many companies are interested in developing efficient ways to gather strategic information from their document repositories. This is especially relevant for the oil and gas industry, which has large repositories of geoscientific reports from several decades of production. However, most natural language processing resources were built with general-domain texts extracted from the Internet and written primarily in English.
This thesis presents a methodology for extracting geoscientific entities and relations from technical documents and using them to populate a knowledge graph, the Petro KGraph. We discuss the natural language processing and information extraction resources, the process used to train the machine learning models, and the relevant literature. Finally, we evaluate each model and the overall methodology.
The Entity Linking approach is innovative in that it can identify new entities beyond those already known. Another crucial contribution is a new comprehensive set of resources and evaluation procedures for training and comparing a complete information extraction and ontology population pipeline. These resources and evaluation procedures constitute a new benchmark for the Portuguese language and the geoscience domain. We also evaluated an information retrieval system that uses the Petro KGraph to expand its queries; it performed slightly better than the same system without query expansion. Plans for future work include building an improved information retrieval test set, comparing results across different graph embedding algorithms, and testing language models released after BERT.
When: 13 August 2024
Zoom link: https://fgv-br.zoom.us/j/94238776385?pwd=etZ9QC6sOE8ULlylSSa2SvTfQjOQeF.1
Time: 13:00.