Expanding the Open Wordnets for English and Portuguese to Geology Domain: Inclusion of Lithology and Geological Time Concepts

Student: 

  • Alexandre Tessarollo

When: 

30/09/2020 - 20:00

Where: 

Via Zoom

Abstract: 

Human knowledge has been stored, transferred and built upon by written means. The human ability to tap into this source is by far the main reason why we have been able to advance our collective understanding. Over a quarter century ago, our technologies for collecting, storing, and disseminating vast amounts of information had gotten ahead of our technologies for collating and analyzing it. Natural Language Processing (NLP) tackles this issue. Everyday life already benefits from NLP, with applications ranging from spam filtering to (limited) support chatbots and artificial intelligence assistants that interact through voice commands. When it comes to technical language, however, NLP falls short. This is particularly true for the Oil&Gas domain, where information is the most precious resource, one that supports decisions worth billions of dollars. Even though there are numerous reports, papers, documents and the like, such knowledge remains untapped due to the domain limitations of NLP. It is our hypothesis that expanding a lexical resource, namely the WordNet, will have a scalable effect, particularly on Word Sense Disambiguation (WSD) and on NLP overall for Oil&Gas domain documents. To verify this, we extended the WordNet with 377 new concepts (synsets), 558 new lexical forms (words) and 948 new relations (pointers) between these words and/or synsets. This extension focuses on two of the most common references mentioned in Oil&Gas documents: Geological Time and Lithology (the branch of geology devoted to rocks). We perform this extension both "vertically", from the original Princeton WordNet in English into the Open WordNet for English (OWN-EN), and "horizontally", by translating and adapting this effort to the Open WordNet for Portuguese (OWN-PT). We then compare the outputs of the WSD algorithm UKB before and after the extension. Both WordNet extensions (English and Portuguese) are available through online open-source initiatives.
This work also made it possible to contribute to other online initiatives that are either WordNet-related or domain-related, such as testing and improving a text-file approach to WordNet editing (instead of the usual lexicographer files, which are not human-friendly) and helping to detect minor errors in the geological timescale.

*Abstract provided by the student.

Thesis Committee: 

  • Alexandre Rademaker (advisor) - FGV EMAp
  • Mara Abel - UFRGS
  • Francis Charles Bond - Nanyang Technological University