Art Analysis Information Retrieval
MIEIC 2020/2021 Descrição, Armazenamento e Pesquisa de Informação
Ana Silva (up201604105), Gonçalo Santos (up201603265), Fábio Araújo (up201607944), Susana Lima (up201603634)

1 Information Retrieval tool
Apache Solr
• Wide support and documentation available
• REST API
• Faceting
• Boosting
• Wide variety of filters
• Full-text search
• Fuzzy search
• Proximity search
Figure 1 – Solr logo
2 Collections
• One single collection: SemArt
• 19163 entries
Figure 2 - Collection size
3 Documents
• Two relevant document types: Artwork and Artist
• One schema compatible with the different document types: each attribute is filled in or left undefined according to the type
• Only the artwork document is being considered at the moment.
4 Indexing Process
• Define custom_text
• Define fields schema
Figure 3 – Custom text schema
Figure 4 – Fields schema

5 Retrieval Process
• Using the query field alone returns too many results, many of them not relevant
• Filtering improves the results, and adding boost weights achieves the best ones
• In general, a user searches for the content they want to find in an artwork; therefore, the TITLE field should receive the largest boost
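The filter-plus-boost setup described above maps directly onto Solr request parameters. A hypothetical sketch using the weights from this experiment (the exact request the group issued is not shown in the slides, and the negated-field filter syntax here is an assumption):

```python
from urllib.parse import urlencode

# Hypothetical edismax request for the "Fisherman" information need:
# boost TITLE heavily, effectively ignore DESCRIPTION, and filter out
# the French school (written with Solr's "-" negation here).
params = {
    "q": "Fisherman",
    "defType": "edismax",
    "qf": "TITLE^3.0 DESCRIPTION^0.0",
    "fq": "-SCHOOL:French",
}
print(urlencode(params))
```

The encoded string can be appended to the collection's `/select` endpoint.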
Information need: Relevant items are paintings not from the French school that have a fisherman depicted, and not only fishing elements.

Query ranking formula | Number of results | First 10 results relevancy
Base: Fisherman | 23 | R R R N N R N R N N
Filter: !SCHOOL:French | 17 | R R R R N N N N N N
Weights: TITLE^3.0 DESCRIPTION^0.0 | 17 | R R R R R N N N N N
Table 1 – Results analysis for the stated information need

6 Evaluation: sacred monuments
What paintings have sacred monuments depicted?
Relevant items include paintings whose focus is a sacred monument, for example, a church, mosque, chapel, cathedral, among others. Irrelevant items are artworks that were made to be in one of those monuments, that were made by an artist whose surname is Church or that depict people connected to a religious institution. Relevant details include the artist’s name, the date when the artwork was started, the artwork’s technique, material, image and description.
“church” “painting” “cathedral” “chapel” “mosque” “synagogue” “sanctuary”
7 Evaluation: sacred monuments
Query 1: q: church OR mosque OR chapel OR cathedral
Query 2: q: church OR mosque OR chapel OR cathedral, defType: edismax, qf: TITLE^10 DESCRIPTION^5

       | Query 1             | Query 2
Top 10 | R N R R R N N N R N | R R R R R R R R R R
AvP    | 0.75                | 1.0

Table 2 – Evaluation of different tune weights
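The AvP figures reported above can be reproduced from the relevance judgements with a small helper; a minimal sketch (the function is illustrative, not the group's evaluation script):

```python
def average_precision(marks, k=10):
    """Average precision over the first k results.

    marks: string of 'R'/'N' judgements in rank order, e.g. "RNRRRNNNRN".
    """
    relevant_seen = 0
    precisions = []
    for rank, mark in enumerate(marks[:k], start=1):
        if mark == "R":
            relevant_seen += 1
            # Precision at this rank, sampled at each relevant document.
            precisions.append(relevant_seen / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Judgements from Table 2:
print(round(average_precision("RNRRRNNNRN"), 2))  # → 0.75 (Query 1)
print(round(average_precision("RRRRRRRRRR"), 2))  # → 1.0  (Query 2)
```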
Figure 5 – Precision@k for different tune weights (Query 1, Query 2)

8 Evaluation: influenced by Rubens
Which paintings were influenced by Rubens?
Relevant items include artworks that were influenced by the Flemish artist Peter Paul Rubens. Documents that were made by Rubens or a friend, or that are about him or a family member, are considered irrelevant. Relevant details include the artist’s name, the date when the artwork was started, and the artwork’s technique, material, image and description.
“influence” “artworks” “Rubens” “Peter Paul Rubens”
9 Evaluation: influenced by Rubens
Query 1: q: influence (Rubens OR “Peter Paul Rubens”)
Query 2: q: influence (Rubens OR “Peter Paul Rubens”), defType: edismax, qf: DESCRIPTION^10
Query 3: q: influence (Rubens OR “Peter Paul Rubens”), defType: edismax, pf: DESCRIPTION^10, ps: 5

       | Query 1             | Query 2             | Query 3
Top 10 | N N N N N N N N N N | N N N R N N R N R R | R R R R R R R R N R
AvP    | 0.75                | 0.32                | 0.99

Table 3 – Evaluation of different tune weights
Figure 6 – Precision@k for different tune weights (Query 1, Query 2, Query 3)

STEAM GAMES
Milestone #2 - Information Retrieval
DAPI - November 2020
Ângelo Teixeira (up201606516), Duarte Frazão (up201605658), Mariana Aguiar (up201605904), Pedro Costa (up201605339)

Problem domain
1. Games: represents Steam games, with the game name and description, categories and genres, metrics related to game usage, developer and publisher (organizations), the price and website.
2. Reviews: represents Steam reviews, with the review text and sentiment, the number of people who’ve found it helpful and the game reviewed.
3. Orgs: represents the organizations related to Steam games, with the name and a brief description.

Tool Selection
We compared the two most popular search platforms, Apache Solr and Elasticsearch.
● Elasticsearch is focused on scaling, data analytics and processing time-series data in order to extract meaningful insights and patterns
● Solr is best suited for search applications that use significant amounts of static data

The problem at hand falls within Solr's use case: a search application on Steam games data that performs advanced information retrieval tasks.

Collections and documents
Each game has several reviews. In order to accommodate both in a single collection, we used Solr's nested child documents. Since each review already has the game ID, the correlation is direct.

Example:

{
  "appid": "g10",
  "name": "Counter-Strike",
  ... // remaining fields
  "_childDocuments_": [
    {
      "appid": "g10",
      "review": "...review text",
      "sentiment": 1,
      "number_helpful": 1,
      "id": "0"
    },
    ... // remaining reviews
  ]
}

Indexing process
● 1 collection -> 2 different types of documents (games and reviews) ● Review documents indexed as nested documents of the games
● 3 custom field types -> name_text, tag_text and custom_text
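As a sketch, a custom_text field type combining lowercasing, synonyms, possessive removal, Porter stemming and stop-word removal could be declared in Solr's managed-schema roughly as follows; the attribute values, file names and filter order here are assumptions, not the group's actual schema:

```xml
<!-- Hypothetical declaration; file names and ordering are assumptions. -->
<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- FlattenGraphFilter is required after SynonymGraph at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```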
● Filters used in custom_text
○ Lowercase filter
○ Synonym filter (with game title acronyms)
○ English possessive filter
○ Stemming filter (Porter algorithm)
○ Stop words filter

Retrieval process - information needs
To test our system, we created 6 Information Needs (IN):
1. Online games with server issues
2. Family games
3. Free games with in-app purchases
4. Specific game
5. Games with toxic community
6. Fast paced games
However, in this presentation we chose to detail only the first 4 INs due to time constraints. The full study can be found in the report.

Retrieval process - systems
For this experiment, 3 Information Systems were developed:
1. Baseline - serves as a control for the experiment and a root for comparison
2. Custom Indexing - applies filters to fields, such as lower-casing, game-title synonyms, stop-word removal and stemming
3. Custom Querying - weighs the fields differently, to mark some of them as more relevant than others

Information need I - Online Games with server issues
Information Need: Online games with server problems
Information Need Type: Informational
Requirements (relevance criteria): mention of lag or server problems in the reviews
Query: (lag OR "server down") AND online
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1 | 1 | 1 | 0
P@10 | 0.80 | 1 | 0
AvP | 0.95 | 1 | 0

Information need II - Family games
Information Need: Games suited for family play
Information Need Type: Informational
Requirements (relevance criteria):
● Required age < 12
● Multiplayer
● Suitable for couch play
Query: (family OR "fun for all" OR kid) AND multiplayer
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1 | 0 | 1 | 0
P@10 | 0.60 | 0.90 | 0.50
AvP | 0.66 | 1 | 0.50

Information need III - Free games with in-app purchases
Information Need: Free games that have in-app purchases, sometimes deemed “pay-to-win” games
Information Need Type: Informational
Requirements (relevance criteria):
● Free
● In-game purchases / transactions
Query: ("pay to win" OR "in-app purchase"~10) AND free
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1 | 1 | 0 | 0
P@10 | 0.80 | 0.70 | 0.40
AvP | 0.75 | 0.67 | 0.28

Information need IV - Specific game
Information Need: Page of the Counter-Strike game
Information Need Type: Navigational
Requirements (relevance criteria): must be the Counter-Strike game
Query: “Counter-Strike”
Query Fields: name

System | Baseline | Custom Indexing | Custom Querying
P@1 | 1 | 1 | 1

Thanks to the synonyms list, the system can find the game even by its aliases, such as “CS”!

Retrieval process - results
System | Baseline | Custom Indexing | Custom Querying
MAP | 0.814 | 0.861 | 0.357

Tool evaluation and final remarks
● Most of our trouble boils down to the lack of documentation on how to use Solr
● After fiddling with the dataset, we found that dealing with nested documents is not trivial for retrieval operations
● The results were rather interesting: index-time polishing of the dataset is very relevant to the accuracy of the Information System
● When trying out different query weights, we decided to reduce the weight of the reviews, as they were the predominant source for un-weighted queries, which hurt accuracy

Goodreads Books and Reviews
DAPI 2020/21 - Group 3
Presentation 2 - Information Retrieval System

System’s Datasets

The dataset preparation resulted in three datasets:
● Books: CSV format, about 10,000 entries
● Authors: CSV format, about 3,100 entries
● Reviews: JSON format, about 500,000 entries
Group 3 | DAPI 2020/2021 | Presentation 2 - Information Retrieval System

System’s Datasets - Book
● GoodReads ID: number
● Title: text
● ISBN code: text
● Language code: text
● Publication year: number
● Rating: number
● Authors: text array
System’s Datasets - Author
● Name: text
● Gender: text
● Date of birth: text
● Place of birth: text
● Country(ies) of citizenship: text
System’s Datasets - Review
● ID: text
● GoodReads book id: number
● Text: text
● Date: text
● Rating: number
System’s Datasets
The showcased work was developed using Apache Solr. The IR system comprises 3 types of documents in the same core:
● Books
● Authors
● Reviews

The three datasets (Books, Authors and Reviews) were merged into a single goodreads.json file and imported into Solr using the post utility tool:
$ post -c goodreads -format solr goodreads.json
Indexing the Datasets - Schema fields
Book
● title: text_general, indexed, stored
● id: string, stored
● isbn: string, indexed, stored
● language_code: string, indexed, stored
● publication_year: plongs, indexed, stored
● book_rating: string, indexed, stored
● authors: string, stored
Indexing the Datasets - Schema fields
Author
● author_name: text_general, indexed, stored
● sex_or_gender: string, indexed, stored
● date_of_birth: string, indexed, stored
● place_of_birth: text_general, indexed, stored
● country_of_citizenship: string, indexed, stored
Indexing the Datasets - Schema fields
Review
● review_text: text_general, indexed, stored
● id: string, stored
● date: string, indexed, stored
● review_rating: plongs, indexed, stored
● book_id: string, stored
● book_name: string, stored
Indexing the Datasets - Configurations
Three IR system configurations were taken into account:
Configuration | Stop words / Synonyms | Analyzer Filters
IR1 | No | No
IR2 | No | Yes
IR3 | Yes | Yes
Indexing the Datasets - Analyzer Filters
● Stop Filter: removes stop words from a given stop words list
● Synonym Graph Filter: considers terms’ synonyms (querying only)
● Lowercase Filter: converts any uppercase letters in a token to the equivalent lowercase token
● English Possessive Filter: removes singular possessives (trailing ’s) from words
● Porter Stem Filter: applies the Porter Stemming Algorithm for English
● Hyphenated Words Filter: reconstructs hyphenated words that were tokenized as two tokens because of a line break or other intervening whitespace
Querying and Evaluation - Methodology
Parameters used to evaluate the three different IR system configurations:
● Total of 8 information need / query pairs
● For each information need, 3 sets of field weights
● DisMax querying mode
The first 20 results were classified as Relevant or Non-Relevant based on their fulfilment of the information need.
Querying and Evaluation - Methodology
Metrics used to evaluate the system configuration:
● Precision@k
● Recall@k
● Interpolated Precision-Recall
● Average Precision (AvP)
● Mean Average Precision (MAP), for each configuration / field weight pair
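Among the metrics listed above, interpolated precision-recall is the least obvious to compute; a minimal sketch of the standard 11-point interpolation (helper name and toy inputs are illustrative, not the group's evaluation code):

```python
def interpolated_precision_recall(marks, total_relevant, points=11):
    """11-point interpolated precision-recall curve.

    marks: string of 'R'/'N' relevance judgements in rank order.
    total_relevant: relevant documents in the collection for the query.
    """
    recalls, precisions = [], []
    relevant_seen = 0
    for rank, mark in enumerate(marks, start=1):
        if mark == "R":
            relevant_seen += 1
        recalls.append(relevant_seen / total_relevant)
        precisions.append(relevant_seen / rank)
    curve = []
    for i in range(points):
        r = i / (points - 1)
        # Interpolated precision at recall r: the highest precision
        # observed at any recall level >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Toy ranking "R N R N" with 2 relevant documents in total:
print(interpolated_precision_recall("RNRN", total_relevant=2))
```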
Querying and Evaluation - Methodology
Three sets of field weights were taken into account:
Field Weights | Review Text | Book Title | Author Name | Author Country
WF1 | 1.1 | 0.9 | 0.9 | 0.9
WF2 | 0.75 | 2 | 2 | 1
WF3 | 0.825 | 2.75 | 2.45 | 1.375
Querying and Evaluation - Example no. 1
System: IR2 (analyzer filters applied, no stop words / synonyms used)
Query: [religion faith]
Information need: Search for religion-related content and opinions on books
 | WF1 | WF3
AvP (first 20 results) | 85% | 57%
Explanation: The WF1 weighting system is better suited for information needs satisfied by book Review documents!
Querying and Evaluation - Methodology
Reasoning:
● WF1 should deliver better results for information needs that are satisfied by book Reviews; ● WF2 and WF3 should deliver better results for information needs that are satisfied by Books and Authors.
Field Weights | Review Text | Book Title | Author Name | Author Country
WF1 | 1.1 | 0.9 | 0.9 | 0.9
WF2 | 0.75 | 2 | 2 | 1
WF3 | 0.825 | 2.75 | 2.45 | 1.375
Querying and Evaluation - Example no. 2
System: IR2 (analyzer filters applied, no stop words / synonyms used)
Query: [movie film cinema]
Information need: Find books with movie adaptations

 | WF1 | WF3
AvP (first 20 results) | 87% | 95%
Explanation: The WF3 weighting system is better suited for information needs satisfied by Book documents!
Results and Conclusions - System IR1
● Configuration IR1: no stop words / synonyms, no analyzer filters
● Similar values for the different field weight configurations
● Inconclusive!
Results and Conclusions - System IR2
● Configuration IR2: no stop words / synonyms, uses analyzer filters
● High precision and interpolated precision-recall values in the first documents retrieved
● Adding analyzer filters significantly improved the results!
Results and Conclusions - System IR3
● Configuration IR3: uses stop words / synonyms, uses analyzer filters
● Similar results when compared to configuration IR2, although WF1 achieved better results with this system configuration!
Results - Mean Average Precision
Results for each IR system / field-weighting configuration pair:
MAP for IR/WF pair | WF1 | WF2 | WF3
IR1 | 60% | 53% | 53%
IR2 | 87% | 65% | 59%
IR3 | 88% | 59% | 56%
Conclusions
● The addition of analyzer filters (lowercase conversion, stemming, singular possessive removal, …) significantly improved result relevancy
● Overall, the WF1 configurations achieved better results: this is the most flexible weighting configuration, and most information needs in the test suite were satisfied by any type of document
● The addition of stop words / synonyms analyzers was only meaningful when applied together with other filters
Future Work
● Testing different system configurations, using other analyzer filters and/or stop words and synonyms lists
● Tweaking the field weights for the different text fields
● Using a more complex and robust test set
● Using different Solr querying methods
Thank you for your attention
Any questions?
Information Description, Storage and Retrieval
Popular movies and streaming
Milestone 2 - Group 4
Carlos Gomes (up201603404), Eduardo Silva (up201603135), Joana Silva (up201208979), Joana Ramos (up201605017)

Datasets
● Streaming Dataset: structured dataset in .csv format with information regarding the streaming platform in which a movie is available.
● IMDb Scraped Dataset: structured dataset with movie information retrieved through scraping of IMDb’s website.
● IMDb Official Dataset: structured dataset in .tsv format with IMDb’s website information regarding movies.
● IMDb Dataset: unstructured data (movie synopses) obtained by scraping of IMDb’s movie pages.

Document Types
Number of Indexed Documents: 105 975
● Movies: 15 531
● People: 90 444

Used strategy to index the dataset

Document Fields - Movies
Both document types (Movies and People) were indexed in the same core, with a single schema containing the following fields:
Field | Description
imdb_id | ID of the movie within the IMDb database
popularTitle | Title by which the movie is best known
synopsis | Brief summary of the movie
runtimeMinutes | Duration of a movie in minutes
genres | Various genres of a movie (e.g. action)
netflix / primevideo / disney / hulu | Streaming platforms at which the movie is available

Other fields: startYear, originalTitle, isAdult

Document Fields - People
Field | Description
imdb_id | ID of the movie within the IMDb database
imdb_name_id | ID of the person within the IMDb database
category | Type of job carried out (e.g. actor, composer, etc.)
characters | Character(s) played by the actor/actress in the movie (if applicable)
name | The person’s name
bio | The person’s biography

Other fields: date_of_birth, date_of _death, birth_name, reason_of_death, death_details, birth_details, children, height, divorces, place_of_birth, place_of_death, spouses, ...

Used Filters - Movie Title & Character Names
Two new field types were created: text_title (used in movie titles and character names) and custom_text (used in the synopsis and bio). The following table shows the filters used in each field type, at index and query time.

Movie Title & Character Names | Index | Query
White Space Tokenizer | ✓ | ✓
Standard Tokenizer | ⨯ | ⨯
Lowercase Filter | ✓ | ✓
Porter Stemming | ✓ | ✓
Synonym Graph Filter | ⨯ | ✓
Duplicate Removal | ✓ | ⨯
English Possessive Filter | ⨯ | ⨯
Stop Word Filter | ⨯ | ⨯

Used Filters - Synopsis & Bio
Synopsis & Bio | Index | Query
White Space Tokenizer | ⨯ | ⨯
Standard Tokenizer | ✓ | ✓
Lowercase Filter | ✓ | ✓
Porter Stemming | ✓ | ✓
Synonym Graph Filter | ⨯ | ✓
Duplicate Removal | ✓ | ✓
English Possessive Filter | ✓ | ✓
Stop Word Filter | ✓ | ✓

Alternative strategy

Another indexing strategy was experimented with: indexing the two different types of documents in two cores.
This made it possible to retrieve the different types of documents in a query without explicitly mentioning fields from both of them. For instance, when searching for a movie (e.g. Inception), the results list included the actors that played in this movie (e.g. Leonardo DiCaprio).
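In Solr terms, this cross-core lookup uses the join query parser; a hypothetical request built in Python (the core names `people` and `movies` and the host are assumptions):

```python
from urllib.parse import urlencode

# Hypothetical request for the two-core setup: run against the people
# core and join against the movies core on imdb_id.
params = {
    "q": "{!join from=imdb_id fromIndex=movies to=imdb_id}"
         "popularTitle:Inception",
}
url = "http://localhost:8983/solr/people/select?" + urlencode(params)
print(url)
```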
Query:
● [{!join from=imdb_id fromIndex=movies to=imdb_id}popularTitle:Inception]

Weight Schemes
For the query parser, we used Extended DisMax with the following weight schemes, depending on the type of document to return:
Movies:
● Popular Title: x2
● Synopsis: x3
● Genres: x1

People:
● Name: x2
● Bio: x4
● Category: x3
● Characters: x2

Representative Queries

Representative Query - Movies
Information need:
● Gather movies about World War 2 that are featured on Netflix
Query:
● [netflix:true AND world war 2]
Field weights used:
● synopsis^3 popularTitle^2 ...

Representative Query - Movies
Information need:
● Gather movies about WW2 that are featured on Netflix
Query:
● [netflix:true AND ww2]
Field weights used:
● synopsis^3 popularTitle^2

Representative Query - People
Information Need:
● Retrieve the British actors that have been nominated for or have won an Oscar award
Query:
● [category:act* AND oscar AND place_of_birth:engl* AND (winning nomina*)]
Field weights used:
● bio^2 ...

Technology Evaluation

Evaluating Query Results
Query #1:
● Information Need: Retrieve the British actors that have been nominated for or have won an Oscar award
● Query: [category:act* AND oscar AND place_of_birth:engl* AND (winning nomina*)]
● Field weights used: bio^2

Query #2:
● Information Need: Retrieve the movies that are set during World War 2
● Query: [world war 2 nazi holocaust]
● Field weights used: synopsis^3 popularTitle^2 genres

Evaluation Results
AvP #1: 0.582
AvP #2: 0.769
MAP: 0.675 Solr as an IR tool
Advantages:
● Many features;
● Plentiful documentation;
● GUI available;
● Good performance when indexing and fetching results.

Disadvantages:
● Lack of practical examples in documentation;
● No dedicated re-indexing mechanism;
● Little GUI documentation.
MEMBERS OF THE EUROPEAN PARLIAMENT
RESOLUTIONS

Data Import Handler
Custom Field Type | Tokenizer | Filters
gender_field | Standard | SynonymGraph, LowerCase
comma_list | Pattern | -
Synonyms text file:
F,Female,Woman,Women,Feminine,Girl
M,Male,Man,Men,Masculine,Boy,Guy

Pattern applied in the comma_list field type: \s*,\s*

MEP field | Type
id | string
name | string
gender | gender_field
country | text_general
birth_date | pdate
national_party | text_general

Resolution field | Type
id | string
title | text_general
content | text_general
doc | string
committee | text_general
mep_favor | text_general
mep_against | text_general
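The comma_list pattern tokenizes on a comma with optional surrounding whitespace; the same behaviour can be checked in Python (this snippet is illustrative, not part of the Solr configuration):

```python
import re

# The comma_list field type tokenizes on \s*,\s* :
# a comma with optional surrounding whitespace.
pattern = re.compile(r"\s*,\s*")

tokens = pattern.split("F, Female ,Woman,  Women")
print(tokens)  # → ['F', 'Female', 'Woman', 'Women']
```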
Query: female meps
Results (indexed): [screenshot omitted]
Results (filtered): [screenshot omitted]

Query: oral question fish
Results (filtered): [screenshot omitted]
Results (weighted): [screenshot omitted]

Rapid, easy to use search tool
Uses less disk space
Considering the size of data after indexing
Coding languages
Solr supports fewer coding languages than, e.g., Elasticsearch.

2020/2021
Diseases, Symptoms and Treatments
Information Description, Storage and Retrieval
Group 6:
▪ André Esteves - up201606673
▪ Francisco Filipe - up201604601
▪ Helena Montenegro - up201604184
▪ Juliana Marques - up201605568

Information Needs
Information Need 1:
Title: Drugs used for a symptom
Description: What drugs are used to cure a cough?
Queries: [drugs treat cough] and [drugs cough]

Information Need 2:
Title: Medical specialties associated with a symptom
Description: What medical specialty should I visit when I have a cough?
Queries: [medical specialty related cough] and [medical specialty cough]
Information Retrieval Systems
Tool used: Apache Solr
We developed three systems:
▪ System A: simple version.
▪ System B: improvements to the indexing process.
▪ System C: improvements to the querying process.
System A
3 documents indexed in one core: Disease, Symptom and Treatment.
Schema:
System A
Results:
▪ The simpler queries [drugs cough] and [medical specialties cough] resulted in higher recall and average precision than the other queries.
▪ In the query [drugs cough], the relevant documents are ranked higher.

Conclusions:
▪ The relevance of the results depends on the capacity of the query to express the information need.
System B
Improvements to the indexing process.
Filters:
▪ Remove stop words.
▪ Turn tokens into lowercase.
▪ Remove possessives.
▪ Stemming.
System B
Results:
▪ First information need: average precision values improved → more relevant documents appear higher in the ranking.

▪ Second information need: lower average precision values and fewer relevant documents retrieved → relevant documents appear lower in the ranking.
System C
Improvements to the querying process:
▪ Apply different weights to different fields.

First set of weights:
▪ More weight on the names of the diseases, symptoms and treatments.

Second set of weights:
▪ More weight on the names of diseases and symptoms, but not on treatments.
System C
Results:
MAP (Weight set 1) = 0.51
MAP (Weight set 2) = 0.63
▪ Second set of weights achieved higher mean average precision values.
▪ Overall, the results improved when compared with the previous systems.
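MAP itself is just the mean of the per-query average precision values; a minimal sketch (the per-query AvP inputs below are hypothetical, since the slide reports only the final MAP figures):

```python
def mean_average_precision(per_query_avp):
    """MAP: the mean of the per-query average precision values."""
    return sum(per_query_avp) / len(per_query_avp)

# Hypothetical per-query AvP values averaging to the reported 0.51.
print(round(mean_average_precision([0.40, 0.62]), 2))  # → 0.51
```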
Comparison between Systems
For the first information need, the results improved as the systems evolved. System C gave the best results.

For the second information need, System A provided better results than System B, which had the improved indexing process.
CS:GO Professional Matches and News Information Retrieval

Tools used
● The two tools considered as viable options for this project were Solr and Elasticsearch
● Solr
○ Well established
○ Geared towards information retrieval
○ Documentation lacking in examples and clarity
● Elasticsearch
○ Newer
○ Bigger focus on data analytics
○ Updated documentation and overall stronger web presence
● We opted for Solr, mostly because of its focus on information retrieval

Collection and Documents
● Initially, we had 5 different document types: players, matches, picks, economy and news
● Some modifications were made before the implementation of the search system:
○ Removal of the economy document
○ Addition of an article field to the matches (whenever an article directly mentions the match)
○ Addition of a hierarchical structure to the documents, where picks and players are now children of matches
● After these modifications, we are left with 4 document types: players, matches, picks and news (articles which don’t mention any matches)

Indexing Strategy
● The matches, picks and players were grouped in a single JSON file
● An “article” field was added to the matches where appropriate
● Players and picks were nested in the matches
● As mentioned previously, we also have independent articles which are not connected to any matches (kept in the original CSV format)
● Some articles mention multiple matches; however, the structure of these articles is different and, as such, this connection was dismissed
● The indexed fields represent what we believe to be the most useful for the planned information needs

Indexed Fields
Document | Fields
Match | team_1 (text_general; csgo_name_general), team_2 (text_general; csgo_name_general), date (pdate), article (text_general; csgo_text_general)
Player | player_name (text_general; csgo_text_general), rating (pfloat)
Picks | N/A
News | title (text_general; csgo_text_general), text (text_general; csgo_text_general), date (pdate)

Field Types and Filters
Field Type | Index Filters | Query Filters
csgo_name_general | Stop, EnglishPossessive, PorterStem, LowerCase | Stop, EnglishPossessive, PorterStem, LowerCase, SynonymGraph
csgo_text_general | LowerCase | LowerCase, SynonymGraph

Retrieval Tasks
● Three systems were put in place to test the performance of the search tool
○ System 1: default index (fixing only minor details)
○ System 2: improved index using the custom field types
○ System 3: improved index and use of weights at query time
● For all three systems, precision@10 and average precision were calculated, followed by the MAP for all three systems

Representative Queries
● Information need: Grand finals played by Astralis
● Query: +astralis grand final champions title “win edition”~10
● A lot of terms, but they are all commonly used in articles which report a grand final
Rank | 1 2 3 4 5 6 7 8 9 10 | P@10 | AP
System 1 | R N R R N R N R N R | 0.6 | 0.718
System 2 | R N R R N R N R N R | 0.6 | 0.718
System 3 | R R R R N N R N R R | 0.7 | 0.869

Representative Queries
● Information need: Transfers into/out of Cloud9 during 2018
● Query: +cloud9 transfer sign add join confirm exit
● Boost results when terms occur in the title
Rank | 1 2 3 4 5 6 7 8 9 10 | P@10 | AP
System 1 | R R R N N N R R N R | 0.6 | 0.799
System 2 | R R R N N N R R R N | 0.6 | 0.811
System 3 | R R R R R R R R R N | 0.9 | 1

Representative Queries
● Information need: Matches won by Natus Vincere in 2019
● Query: natus vincere AND (win OR victory)
● Not very successful: the order of the terms is important and the query does not specify it; there are boosts based on date, so older articles may be dismissed
Rank | 1 2 3 4 5 6 7 8 9 10 | P@10 | AP
System 1 | N R R R N N N R N R | 0.5 | 0.489
System 2 | N R R R R N R N N R | 0.6 | 0.588
System 3 | R N N R N R N N R N | 0.4 | 0.488

Precision-Recall

Technology Evaluation
● It is difficult to measure a system with just five information needs, but it does give some insight into how different configurations affect the search results
● MAP values between 0.75 and 0.8; AvPs fluctuate more, registering values between 0.4 and 0.9
● Unexpected results for queries 4 and 5 (poor performance of Systems 2 and 3)
● Solr proved to be an adequate tool for information retrieval contexts, given its high potential for customization and the ability to perform complex queries
● However, the integration of nested documents was overly complicated (mostly due to unclear documentation)

Luís Silva (up201503730), Mariana Costa (up201604414), Pedro Fernandes (up201603846) (Group 7)

BILLBOARD 200: POPULAR ALBUMS AND ARTISTS INFORMATION RETRIEVAL
Group 8
João Miguel ([email protected]), José Azevedo ([email protected]), Ricardo Ferreira ([email protected])

UPDATED CONCEPTUAL MODEL
SCHEMA USED (GENERAL FIELD TYPES)
SCHEMA USED (CUSTOM FIELD TYPES)
CUSTOM FIELD TYPES (ARTIST-NAME)
CUSTOM FIELD TYPES (TAG-TEXT)
CUSTOM FIELD TYPES (DESCRIPTIVE-TEXT)

QUERIES AND WEIGHTS
• Information Need: Find out information about the album “maad city” and its artist Kendrick Lamar
• Query: kendrick lamar maad city
• Information Need: Find out information about the artist Pink • Query: pink
• Information Need: Find artists (solo or bands) that were born or have been active in the 80s and have won or been nominated for a Grammy award
• Query: (born_date:198? OR years_active:198?) AND biography:”grammies”
Weights that proved to give the best results, in order:

Field  | artist | album_artist | track_artist | album | track_album | rank.album | song | playlist
Weight | 3.2 | 2.8 | 2.6 | 2.4 | 2.2 | 2.0 | 1.8 | 1.6

QUERY: KENDRICK LAMAR MAAD CITY
GENERIC FIELDS | CUSTOM FIELDS | CUSTOM FIELDS W/ WEIGHT

Generic fields: "numFound": 10690; the top results are other tracks from the album (e.g. “Sherane a.k.a Master Splinter’s Daughter”).
Custom fields: "numFound": 5114; the top result is the track “m.A.A.d city” from “good kid, m.A.A.d city”.
Custom fields with weights: "numFound": 5114; the top results are the “good kid, m.A.A.d city” album documents by Kendrick Lamar.
[full JSON responses omitted]

QUERY: PINK
GENERIC FIELDS: numFound 3241 | CUSTOM FIELDS: numFound 1443 | CUSTOM FIELDS W/ WEIGHT: numFound 1443

[Side-by-side Solr responses. With generic fields, the top hit is the soundtrack album "Panther" (The Pink Panther); with custom fields, it is the song "Pink" by Julia Michaels; with field weights, the artist P!nk ranks first.]

QUERY: (BORN_DATE:198? OR YEARS_ACTIVE:198?) AND BIOGRAPHY:"GRAMMIES"
GENERIC FIELDS: numFound 1 | CUSTOM FIELDS: numFound 155

[Side-by-side Solr responses. With generic fields, the only hit is Jens Lekman (born 6 February 1981), whose biography literally contains "Grammies"; with custom fields, 155 artists match, led by Anoushka Shankar (born 9 June 1981) and Tasha Cobbs Leonard (born 7 July 1981), whose biographies mention "Grammy" — the analyzed custom text fields also match that variant.]

ALL QUERIES
      No Filters   W/ Filters
MAP   0.647        0.976

Information Retrieval: Animation Produced In Japan
MIEIC FEUP 2018-2019
Luís Fernando Mouta - up201808916

Dataset