Keyword extraction based on structural properties of language complex networks

Beliga, Slobodan

Access restricted to students and staff of home institution

doctoral thesis

2019. urn:nbn:hr:195:847609

University of Rijeka
Faculty of Informatics and Digital Technologies

Cite this document

APA 6th Edition

Beliga, S. (2019). Keyword extraction based on structural properties of language complex networks (Doctoral thesis). Rijeka: University of Rijeka. Retrieved from https://urn.nsk.hr/urn:nbn:hr:195:847609

MLA 8th Edition

Beliga, Slobodan. "Keyword extraction based on structural properties of language complex networks." Doctoral thesis, University of Rijeka, 2019. https://urn.nsk.hr/urn:nbn:hr:195:847609

Chicago 17th Edition

Beliga, Slobodan. "Keyword extraction based on structural properties of language complex networks." Doctoral thesis, University of Rijeka, 2019. https://urn.nsk.hr/urn:nbn:hr:195:847609

Harvard

Beliga, S. (2019). 'Keyword extraction based on structural properties of language complex networks', Doctoral thesis, University of Rijeka, accessed 22 April 2025, https://urn.nsk.hr/urn:nbn:hr:195:847609

Vancouver

Beliga S. Keyword extraction based on structural properties of language complex networks [Doctoral thesis]. Rijeka: University of Rijeka; 2019 [cited 2025 April 22] Available at: https://urn.nsk.hr/urn:nbn:hr:195:847609

IEEE

S. Beliga, "Keyword extraction based on structural properties of language complex networks", Doctoral thesis, University of Rijeka, Rijeka, 2019. Available at: https://urn.nsk.hr/urn:nbn:hr:195:847609

Cite this item: https://urn.nsk.hr/urn:nbn:hr:195:847609

Metadata

Title	Keyword extraction based on structural properties of language complex networks
Title (croatian)	Izlučivanje ključnih riječi iz teksta zasnovano na strukturnim svojstvima jezičnih kompleksnih mreža
Author	Slobodan Beliga
Mentor	Sanda Martinčić-Ipšić (mentor)
Committee member	Ana Meštrović (committee chair)
Committee member	Marina Ivašić-Kos (committee member)
Granter	University of Rijeka (Faculty of Informatics and Digital Technologies) Rijeka
Defense date and country	2019-09-07, Croatia
Scientific / art field, discipline and subdiscipline	SOCIAL SCIENCES Information and Communication Sciences
Universal decimal classification (UDC)	004 - Computer science and technology. Computing. Data processing
Abstract	Automatic keyword extraction is the initial step in a number of systems for natural language processing (NLP), text mining (TM), and information retrieval (IR). Keywords concisely and compactly describe the subject of a text. This doctoral thesis examines the problem of automatic keyword extraction and proposes a new method for it. The proposed method is an unsupervised, graph-based method built on the structural properties of language complex networks. The thesis employs the standard methodology from the fields of IR and NLP in both the development and evaluation phases of the research. Within the method, two new centrality measures for the keyword extraction task are proposed and tested: selectivity and generalized selectivity. The selectivity of a node is calculated from a weighted network as the average weight distributed on the links of that node. The selectivity-based keyword extraction (SBKE) method does not require external linguistic knowledge, since it is derived purely from the network structure, which makes it suitable for different natural languages and for multilingual scenarios. The SBKE method consists of two steps: keyword candidate extraction (based on selectivity values) and keyword expansion to longer sequences of keyword candidates. The method is tested on different natural languages (Croatian, English, Serbian and Italian) and on various domains (scientific publications in mining and geology, essays and critiques in architecture and design, news from politics, sports, culture and economy, and technical texts from Wikipedia in computer science). For the purposes of the thesis, new multilingual datasets are created. The datasets contain comparable texts that are suitable for keyword extraction in general, allowing evaluation under fully controlled conditions.
Specifically, a bilingual Serbian-English dataset and a trilingual Croatian-English-Italian dataset are created. The performance of the SBKE method is assessed empirically in terms of precision, recall, F1 and F2 scores, and the area under the precision-recall curve. Evaluation with the inter-indexer consistency (IIC) measure and adjusted Kappa statistics (Fleiss’ and Gwet’s coefficients) assesses the consistency of the method with human annotators. The area under the precision-recall curve and the Kappa statistics (Fleiss’ and Gwet’s coefficients) are novel evaluation measures for the keyword extraction task. It is experimentally confirmed that the method, using only knowledge from the network structure and no additional external (linguistic or semantic) knowledge, can successfully extract keywords from text and comes close to the level of human keyword annotation. Additionally, it is confirmed that the novel selectivity measure is appropriate for extracting and ranking keywords. The SBKE method demonstrates its potential for keyword extraction from texts in different domains, from individual documents or document collections, and for portability to new languages. Its portability and low cost characterize SBKE as a highly desirable candidate for unsupervised automatic keyword extraction, especially in the absence of human-annotated resources, for under-resourced languages (lacking natural language processing resources and tools), or for multilingual keyword extraction.
Abstract (translated from Croatian)	Automatic keyword extraction from text is the initial step in numerous systems for natural language processing, text mining and information retrieval. Keywords describe the topic of a text concisely and compactly. The doctoral dissertation studies the problem of automatic keyword extraction from text and proposes a new method for the task. The developed method belongs to the group of unsupervised graph-based methods, that is, methods based on language complex networks. The standard methodology from the fields of information retrieval and natural language processing is used in development and evaluation. Within the method, new centrality measures that had not previously been used for keyword extraction from text are proposed: node selectivity and generalized node selectivity. Node selectivity is defined on a directed weighted network as the average weight distributed on the links of an individual node. The selectivity-based keyword extraction (SBKE) method does not require additional linguistic knowledge but is derived from the defined network structure, which makes it suitable for texts written in different natural languages, and therefore also for multilingual application scenarios. The proposed SBKE method is tested on datasets (1) in different natural languages (Croatian, English, Serbian and Italian), (2) from different domains (abstracts from mining and geology, critiques and essays from architecture and design, informative news articles on culture, sports, politics and similar topics, and technical texts from Wikipedia in the field of computing), and (3) for extraction from individual documents and from collections of texts.
As part of the dissertation, new datasets of comparable texts in several languages were created, with which the performance of the method on multilingual keyword extraction tasks can be compared under controlled conditions. Bilingual Serbian-English and trilingual Croatian-English-Italian datasets were prepared, which are also the first bilingual and trilingual datasets intended for the keyword extraction task. The performance of the method is measured empirically using precision, recall, F1 and F2, and the area under the precision-recall curve. The IIC (inter-indexer consistency) measure and Kappa statistics, namely Fleiss' and Gwet's coefficients, are used to compare the consistency of the method with the annotations of human experts. The area under the precision-recall curve and Fleiss' and Gwet's coefficients are newly proposed measures for evaluating keyword extraction. It is experimentally confirmed that the SBKE method, using knowledge from the network structure and no additional external sources of knowledge (semantic or additional linguistic), can successfully extract keywords from text and approaches human performance on the task. It is also shown that the proposed selectivity measure is suitable for extracting, that is, proposing and ranking, keywords. The developed SBKE method shows its potential through its adaptability to texts written in different languages and to collections of texts from different domains. It has a simple architecture, is portable across languages and text domains, and has low computational cost.
This places the SBKE method among the good candidates for unsupervised automatic keyword extraction, particularly (1) when no human experts are available to assign keywords, (2) for languages that are weakly supported by computational-linguistic resources and tools, and (3) in multilingual keyword extraction settings.
Keywords
Keywords (english)
Language	english
URN:NBN	urn:nbn:hr:195:847609
Promotion	2021
Study programme	Title: Informatics; Study programme type: university; Study level: postgraduate; Academic / professional title: Doctor of Science (doktor znanosti) in the scientific area of Social Sciences, field of Information and Communication Sciences
Type of resource	Text
File origin	Born digital
Access conditions	Access restricted to students and staff of home institution
Terms of use
Created on	2021-04-26 07:40:22
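
Illustrative sketch (not part of the repository record). The abstract defines node selectivity as the average weight on the links of a node and reports evaluation with precision, recall, F1 and F2. The Python sketch below shows one way these two quantities could be computed. It is not the thesis implementation: the network construction (a directed edge from each word to the word that directly follows it, weighted by co-occurrence count), the function names and the toy sentence are assumptions made only for illustration, and the keyword expansion step and generalized selectivity are not covered.

```python
from collections import Counter, defaultdict


def build_cooccurrence_network(tokens):
    """Directed weighted edges: (u, v) -> number of times v directly follows u.

    Assumption: adjacent-word co-occurrence; the thesis may build networks differently.
    """
    return Counter(zip(tokens, tokens[1:]))


def node_selectivity(edges):
    """Map each node to (in_selectivity, out_selectivity).

    Selectivity is strength / degree, i.e. the average weight on a node's links.
    A node with no links in a given direction gets 0.0 for that direction.
    """
    in_strength, out_strength = defaultdict(int), defaultdict(int)
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for (u, v), weight in edges.items():
        out_strength[u] += weight
        out_degree[u] += 1
        in_strength[v] += weight
        in_degree[v] += 1
    nodes = sorted(set(in_degree) | set(out_degree))
    return {
        n: (
            in_strength[n] / in_degree[n] if in_degree[n] else 0.0,
            out_strength[n] / out_degree[n] if out_degree[n] else 0.0,
        )
        for n in nodes
    }


def f_beta(predicted, gold, beta=1.0):
    """F-beta score of a predicted keyword set against a gold keyword set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if not predicted or not gold or true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Toy input (hypothetical, not taken from the thesis datasets).
tokens = "complex networks model language and complex networks rank words".split()
selectivity = node_selectivity(build_cooccurrence_network(tokens))

# Rank candidate words by in-selectivity, highest first.
ranked = sorted(selectivity, key=lambda n: selectivity[n][0], reverse=True)

# Hypothetical predicted and human-annotated keyword sets.
predicted = {"complex", "networks", "words"}
gold = {"complex", "networks", "language", "keyword"}
f1 = f_beta(predicted, gold, beta=1.0)
f2 = f_beta(predicted, gold, beta=2.0)
```

In the toy sentence, "networks" is reached only from "complex", twice, so its in-selectivity is 2.0 and it ranks first. F2 weights recall more heavily than precision, which is why it comes out lower than F1 here, where recall (0.5) is below precision (2/3).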