Art Analysis Information Retrieval

MIEIC 2020/2021 Descrição, Armazenamento e Pesquisa de Informação

Ana Silva, up201604105; Gonçalo Santos, up201603265; Fábio Araújo, up201607944; Susana Lima, up201603634

1 Information Retrieval tool

Apache Solr

• Wide support and documentation available
• REST API
• Faceting
• Boosting
• Wide variety of filters
• Full-text search
• Fuzzy search
• Proximity search
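Most of these features are driven by query syntax and request parameters. A minimal sketch against a local Solr instance (the core name artworks is an assumption; the TITLE, DESCRIPTION and SCHOOL fields appear later in this deck):

import requests

SOLR = "http://localhost:8983/solr/artworks/select"  # core name is an assumption

params = {
    "q": 'fisherman~1 "fishing boat"~5',  # fuzzy match (~1 edit) and proximity search (terms within 5 positions)
    "defType": "edismax",
    "qf": "TITLE^3 DESCRIPTION",          # boosting: TITLE matches weigh 3x more
    "facet": "true",
    "facet.field": "SCHOOL",              # faceting on the painting's school
    "rows": 10,
}
response = requests.get(SOLR, params=params).json()
print(response["response"]["numFound"])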

Figure 1 – Solr logo

2 Collections

• One single collection
• SemArt
• 19163 entries

Figure 2 - Collection size

3 Documents

• Two relevant document types: Artwork and Artist

• One schema compatible with the different document types; it contains attributes that may or may not be filled in, depending on the document type

• Only the artwork document is being considered at the moment.

4 Indexing Process

• Define the custom_text field type
• Define the fields schema

Figure 3 – Custom text schema

Figure 4 – Fields schema

5 Retrieval Process

• When only the query field is used, too many results are returned and many are not relevant
• Filtering improves the results, and adding boost weights achieves the best ones
• In general, a user searches for the content they want to find in an artwork; therefore, the TITLE field should have the biggest boost (see the sketch below)
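A sketch of how the filtered and boosted run from Table 1 below could be issued (the endpoint and core name are assumptions; the query, filter and weights come from the table):

import requests

SOLR = "http://localhost:8983/solr/artworks/select"  # assumed local core

params = {
    "q": "Fisherman",
    "defType": "edismax",
    "qf": "TITLE^3.0 DESCRIPTION^0.0",  # boost weights from Table 1
    "fq": "-SCHOOL:French",             # the slide's "!SCHOOL: French" filter: exclude the French school
    "rows": 10,
}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]
for doc in docs:
    print(doc.get("TITLE"))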

Information need: Relevant items are paintings not from the French school that have a fisherman depicted, and not only fishing elements.

Query ranking formula              | Number of results | First 10 results relevancy
Base: Fisherman                    | 23                | R R R N N R N R N N
Filter: !SCHOOL:French             | 17                | R R R R N N N N N N
Weights: TITLE^3.0 DESCRIPTION^0.0 | 17                | R R R R R N N N N N

Table 1 – Results analysis for the stated information need

6 Evaluation: sacred monuments

What paintings have sacred monuments depicted?

Relevant items include paintings whose focus is a sacred monument, for example, a church, mosque, chapel, cathedral, among others. Irrelevant items are artworks that were made to be in one of those monuments, that were made by an artist whose surname is Church or that depict people connected to a religious institution. Relevant details include the artist’s name, the date when the artwork was started, the artwork’s technique, material, image and description.

“church” “painting” “cathedral” “chapel” “mosque” “synagogue” “sanctuary”

7 Evaluation: sacred monuments

Query 1: q: church OR mosque OR chapel OR cathedral
Query 2: q: church OR mosque OR chapel OR cathedral, defType: edismax, qf: TITLE^10 DESCRIPTION^5

       | Query 1             | Query 2
Top 10 | R N R R R N N N R N | R R R R R R R R R R
AvP    | 0.75                | 1.0

Table 2 – Evaluation of different tune weights

Figure 5 – Precision@k for different tune weights (Query 1, Query 2)

8 Evaluation: influenced by Rubens

Which paintings were influenced by Rubens?

Relevant items include artworks that were influenced by the Flemish artist Peter Paul Rubens. Documents made by Rubens or a friend of his, or about him or a family member, are considered irrelevant. Relevant details include the artist’s name, the date when the artwork was started, the artwork’s technique, material, image and description.

“influence” “artworks” “Rubens” “Peter Paul Rubens”

9 Evaluation: influenced by Rubens

Query 1: q: influence (Rubens OR "Peter Paul Rubens")
Query 2: q: influence (Rubens OR "Peter Paul Rubens"), defType: edismax, qf: DESCRIPTION^10
Query 3: q: influence (Rubens OR "Peter Paul Rubens"), defType: edismax, pf: DESCRIPTION^10, ps: 5

       | Query 1             | Query 2             | Query 3
Top 10 | N N N N N N N N N N | N N N R N N R N R R | R R R R R R R R N R
AvP    | 0.75                | 0.32                | 0.99

Table 3 – Evaluation of different tune weights
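Query 3's gain comes from edismax phrase boosting (pf with a phrase slop ps); a sketch of the corresponding request, under the same endpoint assumptions as before:

import requests

SOLR = "http://localhost:8983/solr/artworks/select"  # assumed local core

params = {
    "q": 'influence (Rubens OR "Peter Paul Rubens")',
    "defType": "edismax",
    "pf": "DESCRIPTION^10",  # boost documents where the query terms occur close together in DESCRIPTION
    "ps": 5,                 # phrase slop: the terms may be up to 5 positions apart
    "rows": 10,
}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]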

Figure 6 – Precision@k for different tune weights (Query 1, Query 2, Query 3)

STEAM GAMES

Milestone #2 - Information Retrieval

DAPI - November 2020

Ângelo Teixeira up201606516, Duarte Frazão up201605658, Mariana Aguiar up201605904, Pedro Costa up201605339

Problem domain

1. games: represents Steam games, the game name and description, categories and genres, metrics related to the game usage, developer and publisher (organizations), the price and website.
2. reviews: represents Steam reviews, the review text and sentiment, the number of people who’ve found it helpful and the game reviewed.
3. orgs: represents the organizations related to Steam games, the name and a brief description.

Tool Selection

We compared the two most popular search platforms, Apache Solr and Elasticsearch.

● Elasticsearch is focused on scaling, data analytics and processing of time-series data, in order to extract meaningful insights and patterns
● Solr is best suited for search applications that use significant amounts of static data.

The problem at hand falls in Solr's use case: a search application on Steam games data to perform advanced information retrieval tasks.

Collections and documents

Example:

Each game has several reviews. In order to accommodate both in a single collection we used Solr's nested child documents. Since each review already has the game ID, the correlation is direct.

{
  "appid": "g10",
  "name": "Counter-Strike",
  ... // remaining fields
  "_childDocuments_": [
    {
      "appid": "g10",
      "review": "...review text",
      "sentiment": 1,
      "number_helpful": 1,
      "id": "0"
    },
    ... // remaining reviews
  ]
}
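The slides do not show how the nested reviews are queried back; one common option is Solr's block-join parent parser. A minimal sketch, assuming a local core named games and a hypothetical doc_type marker field on the parent documents (neither appears on the slide):

import requests

SOLR = "http://localhost:8983/solr/games/select"  # assumed core name

# Return parent game documents that have at least one child review matching "crash",
# using the block-join parent parser. The which= clause must match all parent documents;
# doc_type is a hypothetical marker field, not part of the slide's schema.
params = {
    "q": '{!parent which="doc_type:game"}review:crash',
    "fl": "appid,name",
    "rows": 10,
}
print(requests.get(SOLR, params=params).json()["response"]["docs"])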

Indexing process

● 1 collection -> 2 different types of documents (games and reviews)
● Review documents indexed as nested documents of the games

● 3 custom field types -> name_text, tag_text and custom_text

● Filters used in custom_text (a schema sketch follows)
  ○ Lowercase filter
  ○ Synonym filter (with game title acronyms)
  ○ English possessive filter
  ○ Stemming filter (Porter Algorithm)
  ○ Stop words filter
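A sketch of how custom_text could be declared through Solr's Schema API with the filters above; the core name, the tokenizer choice and the synonyms.txt / stopwords.txt file names are assumptions:

import requests

SCHEMA = "http://localhost:8983/solr/games/schema"  # assumed core name

field_type = {
    "add-field-type": {
        "name": "custom_text",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},  # assumed tokenizer
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                {"class": "solr.SynonymGraphFilterFactory", "synonyms": "synonyms.txt"},  # game title acronyms
                {"class": "solr.FlattenGraphFilterFactory"},  # recommended after a graph filter at index time
                {"class": "solr.EnglishPossessiveFilterFactory"},
                {"class": "solr.PorterStemFilterFactory"},
                {"class": "solr.StopFilterFactory", "words": "stopwords.txt"},
            ],
        },
    }
}
requests.post(SCHEMA, json=field_type).raise_for_status()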

Retrieval process - information needs

To test our system, we created 6 Information Needs (IN):
1. Online Games with server issues
2. Family games
3. Free games with in-app purchases
4. Specific game
5. Games with toxic community
6. Fast paced games

However, in this presentation we chose only to detail the first 4 INs due to time constraints. The full study can be found in the report.

Retrieval process - systems

For this experiment, 3 Information Systems were developed:

1. Baseline - Serves as a control for the experiment, and a root for comparison
2. Custom Indexing - Applies filters to the fields, such as lower-casing, game title synonyms, stop-word removal and stemming
3. Custom Querying - Weighs the fields in a different way, to mark some of them as more relevant than others

Information need I - Online Games with server issues

Information Need: Online Games with server problems

Information Need Type: Informational
Requirements (Relevance criteria): Mention of lag or server problems in the reviews
Query: (lag OR "server down") AND online
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1    | 1        | 1               | 0
P@10   | 0.80     | 1               | 0
AvP    | 0.95     | 1               | 0

Information need II - Family games

Information Need: Games suited for family play

Information Need Type: Informational

Requirements (Relevance criteria):
● Required Age < 12
● Multiplayer
● Suitable for couch play
Query: (family OR "fun for all" OR kid) AND multiplayer
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1    | 0        | 1               | 0
P@10   | 0.60     | 0.90            | 0.50
AvP    | 0.66     | 1               | 0.50

Information need III - Free games with in-app purchases

Information Need: Free games that have in-app purchases, sometimes deemed as “pay-to-win” games

Information Need Type: Informational
Requirements (Relevance criteria):
● Free
● In-game purchases / transactions
Query: ("pay to win" OR "in-app purchase"~10) AND free
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1    | 1        | 0               | 0
P@10   | 0.80     | 0.70            | 0.40
AvP    | 0.75     | 0.67            | 0.28

Information need IV - Specific game

Information Need: Page of the Counter-Strike game
Information Need Type: Navigational
Requirements (Relevance criteria): Must be the Counter-Strike game

Query: “Counter-Strike”

Query Fields: name

System | Baseline | Custom Indexing | Custom Querying
P@1    | 1        | 1               | 1

Due to the usage of the synonyms list, it has the advantage of finding the game even by its aliases like “CS”, for example!

Retrieval process - results

System | Baseline | Custom Indexing | Custom Querying
MAP    | 0.814    | 0.861           | 0.357

Tool evaluation and final remarks

● Most of our trouble boiled down to the lack of documentation on how to use Solr
● After fiddling with the dataset, we found out that dealing with nested documents is not trivial for retrieval operations
● The results were rather interesting: we found out that the indexing-time polishing of the dataset is very relevant to the accuracy of the Information System
● When trying out different query weights, we decided to reduce the weight of the reviews, as they were the predominant source for un-weighted queries, which made the accuracy worse

Goodreads Books and Reviews

DAPI 2020/21 - Group 3
Presentation 2 - Information Retrieval System

System’s Datasets

The system datasets’ preparation resulted in three datasets:

● Books
  ○ CSV format
  ○ About 10,000 entries
● Authors
  ○ CSV format
  ○ About 3,100 entries
● Reviews
  ○ JSON format
  ○ About 500,000 entries

System’s Datasets - Book

● GoodReads ID: number ● Title: text ● ISBN code: text ● Language code: text ● Publication year: number ● Rating: number ● Authors: text array

System’s Datasets - Author

● Name: text ● Gender: text ● Date of birth: text ● Place of birth: text ● Country(ies) of citizenship: text

System’s Datasets - Review

● ID: text ● GoodReads book id: number ● Text: text ● Date: text ● Rating: number

System’s Datasets

The showcased work was developed using Apache Solr. The IR system comprises 3 types of documents in the same core: ● Books ● Authors ● Reviews

The three datasets (Books, Authors and Reviews) were merged into a single goodreads.json file, and imported to solr using the post utility tool:

$ post -c goodreads -format solr goodreads.json
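A quick sanity check of the import can be done against the select handler (host assumed local):

import requests

resp = requests.get(
    "http://localhost:8983/solr/goodreads/select",
    params={"q": "*:*", "rows": 0},  # only ask for the total document count
)
print(resp.json()["response"]["numFound"])  # should be roughly 10,000 + 3,100 + 500,000 documents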

Indexing the Datasets - Schema fields

Book

● title: text_general, indexed, stored ● id: string, stored ● isbn: string, indexed, stored ● language_code: string, indexed, stored ● publication_year: plongs, indexed, stored ● book_rating: string, indexed, stored ● authors: string, stored

Indexing the Datasets - Schema fields

Author

● author_name: text_general, indexed, stored ● sex_or_gender: string, indexed, stored ● date_of_birth: string, indexed, stored ● place_of_birth: text_general, indexed, stored ● country_of_citizenship: string, indexed, stored

Indexing the Datasets - Schema fields

Review

● review_text: text_general, indexed, stored ● id: string, stored ● date: string, indexed, stored ● review_rating: plongs , indexed, stored ● book_id: string, stored ● book_name: string, stored

Indexing the Datasets - Configurations

Three IR system configurations were taken into account:

Configuration | Stop words / Synonyms | Analyzer Filters
IR1           | No                    | No
IR2           | No                    | Yes
IR3           | Yes                   | Yes

Indexing the Datasets - Analyzer Filters

● Stop Filter: removes stop words from a given stop words list
● Synonyms Graph Filter: considers terms’ synonyms (querying only)
● Lowercase Filter: converts any uppercase letters in a token to the equivalent lowercase token
● English Possessive Filter: removes singular possessives (trailing 's) from words
● Porter’s Stem Filter: applies the Porter Stemming Algorithm for English
● Hyphenated Words Filter: reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other intervening whitespace
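Because the Synonyms Graph Filter runs at query time only, the field type needs separate index and query analyzers. A hedged Schema API sketch; the field type name text_reviews and the synonyms.txt / stopwords.txt file names are assumptions, and the filter set follows the list above:

import requests

SCHEMA = "http://localhost:8983/solr/goodreads/schema"

common_filters = [
    {"class": "solr.LowerCaseFilterFactory"},
    {"class": "solr.StopFilterFactory", "words": "stopwords.txt"},
    {"class": "solr.EnglishPossessiveFilterFactory"},
    {"class": "solr.PorterStemFilterFactory"},
]

field_type = {
    "add-field-type": {
        "name": "text_reviews",  # hypothetical name, not from the slides
        "class": "solr.TextField",
        "indexAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [{"class": "solr.HyphenatedWordsFilterFactory"}] + common_filters,
        },
        "queryAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            # synonyms are expanded only at query time, as stated on the slide
            "filters": common_filters
            + [{"class": "solr.SynonymGraphFilterFactory", "synonyms": "synonyms.txt", "ignoreCase": "true"}],
        },
    }
}
requests.post(SCHEMA, json=field_type).raise_for_status()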

Querying and Evaluation - Methodology

Parameters used to evaluate the three different IR system configurations:

● Total of 8 information need / query pairs ● For each information need, 3 sets of field weights ● DisMax querying mode

The first 20 results were classified as Relevant or Non-Relevant based on their fulfilment of the information need.

Querying and Evaluation - Methodology

Metrics used to evaluate the system configuration:

● Precision@k
● Recall@k
● Interpolated Precision-Recall
● Average Precision (AvP)
● Mean Average Precision (MaP), for each configuration / field weight pair
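These metrics can be computed directly from the ordered relevance judgments (R / N); a small sketch with made-up judgments:

def precision_at(judgments, k):
    """Fraction of the first k results judged relevant ('R')."""
    top = judgments[:k]
    return sum(1 for j in top if j == "R") / k

def average_precision(judgments):
    """Mean of precision@k taken at every rank k where a relevant result appears."""
    precisions = [precision_at(judgments, k)
                  for k, j in enumerate(judgments, start=1) if j == "R"]
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example with made-up judgments for the first 10 results of one query.
judged = ["R", "N", "R", "R", "R", "N", "N", "N", "R", "N"]
print(precision_at(judged, 10))   # 0.5
print(average_precision(judged))  # ~0.75

# MaP for a configuration / field-weight pair is the mean of the AvP values
# over all information need / query pairs evaluated with it.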

Querying and Evaluation - Methodology

Three sets of field weights were taken into account:

Field Weights | Review Text | Book Title | Author Name | Author Country
WF1           | 1.1         | 0.9        | 0.9         | 0.9
WF2           | 0.75        | 2          | 2           | 1
WF3           | 0.825       | 2.75       | 2.45        | 1.375

Querying and Evaluation - Example no. 1

System: IR2 (analyzer filters applied, no stop words / synonyms used) Query: [religion faith] Information need: Search for religion-related content and opinions on books

                       | WF1 | WF3
AvP (20 first results) | 85% | 57%

Explanation: The WF1 weighting system is better suited for information needs satisfied by book Review documents!

Querying and Evaluation - Methodology

Reasoning:

● WF1 should deliver better results for information needs that are satisfied by book Reviews; ● WF2 and WF3 should deliver better results for information needs that are satisfied by Books and Authors.

Field Weights | Review Text | Book Title | Author Name | Author Country
WF1           | 1.1         | 0.9        | 0.9         | 0.9
WF2           | 0.75        | 2          | 2           | 1
WF3           | 0.825       | 2.75       | 2.45        | 1.375

Querying and Evaluation - Example no. 2

System: IR2 (analyzer filters applied, no stop words / synonyms used) Query: [movie film cinema] Information need: Find books with movie adaptations

                       | WF1 | WF3
AvP (20 first results) | 87% | 95%

Explanation: The WF3 weighting system is better suited for information needs satisfied by Book documents!

Results and Conclusions - System IR1

● Configuration IR1: No stop words / synonyms, No analyzer filters ● Similar values for different field weight configurations ● Inconclusive!

Results and Conclusions - System IR2

● Configuration IR2: No stop words / synonyms, uses analyzer filters ● High precision and interpolated precision-recall values in the first documents retrieved ● Adding analyzer filters significantly improved the results!

Results and Conclusions - System IR3

● Configuration IR3: Uses stop words / synonyms, Uses analyzer filters ● Similar results when compared to configuration IR2, although WF1 achieved better results with this system configuration!

Results - Mean Average Precision

Results for each IR system / field-weighting configuration pair:

MaP for IR/WF pair | WF1 | WF2 | WF3
IR1                | 60% | 53% | 53%
IR2                | 87% | 65% | 59%
IR3                | 88% | 59% | 56%

Conclusions

● The addition of analyzer filters (lowercase conversion, stemming, singular possessive removal, …) significantly improved result relevancy
● Overall, the WF1 configurations achieved better results - this is the most flexible weighting configuration and most information needs in the test suite were satisfied by any type of document
● The addition of stop words / synonyms analyzers was only meaningful when applied together with other filters

MaP for IR/WF pair | WF1 | WF2 | WF3
IR1                | 60% | 53% | 53%
IR2                | 87% | 65% | 59%
IR3                | 88% | 59% | 56%

Future Work

● Testing different system configurations, using other analyzer filters and / or stop words and synonyms lists ● Tweaking the field weights for the different text fields ● Using a more complex and robust test set ● Using different Solr querying methods

Thank you for your attention

Any questions?

Information Description, Storage and Retrieval
Popular movies and streaming
Milestone 2 - Group 4
Carlos Gomes - up201603404, Eduardo Silva - up201603135, Joana Silva - up201208979, Joana Ramos - up201605017

Used Dataset

Datasets

Streaming Dataset: structured dataset in .csv format with information regarding the streaming platform in which a movie is available.
IMDb Scraped Dataset: structured dataset with movie information retrieved through scraping of IMDb’s website.

IMDb Official Dataset: structured dataset in .tsv format with IMDb’s website information regarding movies.
IMDb Dataset: unstructured data (movie synopsis) obtained by scraping IMDb’s movie pages.

Document Types

Number of Indexed Documents: 105 975

● Movies: 15 531
● People: 90 444

Used strategy to index the dataset

Document Fields - Movies

Both document types (Movies and People) were indexed in the same core, and a schema was created with the following fields:

Field | Description
imdb_id | ID of the movie within the IMDb database
popularTitle | Title by which the movie is best known
synopsis | Brief summary of the movie
runtimeMinutes | Duration of a movie in minutes
genres | Various genres of a movie (e.g. action)
netflix / primevideo / disney / hulu | Streaming platforms at which the movie is available

Other fields: startYear, originalTitle, isAdult

Document Fields - People

Field | Description
imdb_id | ID of the movie within the IMDb database
imdb_name_id | ID of the person within the IMDb database
category | Type of job carried out (e.g. actor, composer, etc.)
characters | Character(s) played by the actor/actress in the movie (if applicable)
name | The person’s name
bio | The person’s biography

Other fields: date_of_birth, date_of_death, birth_name, reason_of_death, death_details, birth_details, children, height, divorces, place_of_birth, place_of_death, spouses, ...

Used Filters - Movie Title & Character Names

Two new field types were created: text_title (used in movie titles and character names) and custom_text (used in the synopsis and bio). The following tables show the filters used in each field type at index and at query time.

Movie Title & Character Names | Index | Query
White Space Tokenizer         | ✓ | ✓
Standard Tokenizer            | ⨯ | ⨯
Lowercase Filter              | ✓ | ✓
Porter Stemming               | ✓ | ✓
Synonym Graph Filter          | ⨯ | ✓
Duplicate Removal             | ✓ | ⨯
English Possessive Filter     | ⨯ | ⨯
Stop Word Filter              | ⨯ | ⨯

Used Filters - Synopsis & Bio

Synopsis & Bio            | Index | Query
White Space Tokenizer     | ⨯ | ⨯
Standard Tokenizer        | ✓ | ✓
Lowercase Filter          | ✓ | ✓
Porter Stemming           | ✓ | ✓
Synonym Graph Filter      | ⨯ | ✓
Duplicate Removal         | ✓ | ✓
English Possessive Filter | ✓ | ✓
Stop Word Filter          | ✓ | ✓

Alternative strategy

Another indexing strategy was experimented with: indexing the two different types of documents in two cores.

This made it possible to retrieve the different types of documents in a query without explicitly mentioning fields from both of them. For instance, when searching for a movie (e.g. Inception), the results list included the actors that played in this movie (e.g. Leonardo DiCaprio).

Query:

● [{!join from=imdb_id fromIndex=movies to=imdb_id}popularTitle:Inception]

Weight Schemes

For the query parser, we used Extended DisMax with the following weight schemes, depending on the type of document to return:

Movies:
● Popular Title: x2
● Synopsis: x3
● Genres: x1
● Characters: x2

People:
● Name: x2
● Bio: x4
● Category: x3

Representative Queries

Representative Query - Movies

Information need:

● Gather movies about World War 2 that are featured on Netflix

Query:

● [netflix:true AND world war 2]

Field weights used:

● synopsis^3 popularTitle^2 ...

Representative Query - Movies

Information need:

● Gather movies about WW2 that are featured on Netflix

Query: ... ● [netflix: true AND ww2]

Field weights used:

● synopsis^3 popularTitle^2

Representative Query - People

Information Need:

● Retrieve the British actors that have been nominated for or have won an Oscar award

Query:

● [category:act* AND oscar AND place_of_birth:engl* AND (winning nomina*)]

Field weights used:

● bio^2 ...

Technology Evaluation

Evaluating Query Results

Query #1
Information Need: Retrieve the British actors that have been nominated for or have won an Oscar award
Query: [category:act* AND oscar AND place_of_birth:engl* AND (winning nomina*)]
Field weights used: bio^2

Query #2
Information Need: Retrieve the movies that are set during World War 2
Query: [world war 2 nazi holocaust]
Field weights used: synopsis^3 popularTitle^2 genres

Evaluation Results

AvP #1: 0.582

AvP #2: 0.769

MAP: 0.675

Solr as an IR tool

Advantages:
● Many features;
● Plentiful documentation;
● GUI available;
● Good performance when indexing and fetching results.

Disadvantages:
● Lack of practical examples in documentation;
● No dedicated re-indexing mechanism;
● Little GUI documentation.

MEMBERS OF THE EUROPEAN PARLIAMENT

RESOLUTIONS

Data Import Handler

Custom Field Types | Tokenizer | Filters
gender_field       | Standard  | SynonymGraph, LowerCase
comma_list         | Pattern   | -

Synonyms text file:
F,Female,Woman,Women,Feminine,Girl
M,Male,Man,Men,Masculine,Boy,Guy

Pattern applied in the comma_list field type: \s*,\s*

Fields (resolution documents) | Type
id          | string
title       | text_general
content     | text_general
doc         | string
committee   | text_general
mep_favor   | text_general
mep_against | text_general

Fields (MEP documents) | Type
id             | string
name           | string
gender         | gender_field
country        | text_general
birth_date     | pdate
national_party | text_general
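A hedged sketch of how these two custom field types could be declared through the Schema API; the core name and synonyms file name are assumptions, while the tokenizers, filters and pattern come from the table and the slide above:

import requests

SCHEMA = "http://localhost:8983/solr/meps/schema"  # assumed core name

gender_field = {
    "add-field-type": {
        "name": "gender_field",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                # maps e.g. Female/Woman/Girl to F and Male/Man/Guy to M, per the synonyms file above
                {"class": "solr.SynonymGraphFilterFactory", "synonyms": "synonyms.txt", "ignoreCase": "true"},
                {"class": "solr.LowerCaseFilterFactory"},
            ],
        },
    }
}

comma_list = {
    "add-field-type": {
        "name": "comma_list",
        "class": "solr.TextField",
        "analyzer": {
            # split comma-separated values, trimming surrounding whitespace (pattern from the slide)
            "tokenizer": {"class": "solr.PatternTokenizerFactory", "pattern": r"\s*,\s*"},
        },
    }
}

for command in (gender_field, comma_list):
    requests.post(SCHEMA, json=command).raise_for_status()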

Query: female meps (results shown indexed and filtered; screenshots omitted)

Query: oral question fish (results shown filtered and weighted; screenshots omitted)

Rapid, easy to use search tool

Uses less disk space, considering the size of the data after indexing

Coding languages: Solr supports fewer coding languages than, e.g., ElasticSearch

2020/2021

Diseases, Symptoms and Treatments Information Description, Storage and Retrieval

Group 6:

▪ André Esteves - up201606673 ▪ Francisco Filipe - up201604601 ▪ Helena Montenegro - up201604184 ▪ Juliana Marques - up201605568

Information Needs

Information Need 1:
Title: Drugs used for a symptom
Description: What drugs are used to cure a cough?
Queries: [drugs treat cough] and [drugs cough]

Information Need 2:
Title: Medical specialties associated with a symptom
Description: What medical specialty should I visit when I have a cough?
Queries: [medical specialty related cough] and [medical specialty cough]

1 Information Retrieval Systems

Tool used: Apache Solr

We developed three systems:

▪ System A: simple version. ▪ System B: improvements to the indexing process. ▪ System C: improvements to the querying process.

2 System A

3 documents indexed in one core: Disease, Symptom and Treatment.

Schema:

Symptom and Treatment schemas (screenshots omitted)

3 System A

Results:

▪ Simpler queries [drugs cough] and [medical specialties cough] resulted in higher recall and average precision than the other queries. ▪ In the query [drugs cough], the relevant documents are ranked higher.

Conclusions: ▪ The relevance of the results depends on the capacity of the query to express the information need.

4 System B

Improvements to the indexing process.

Filters: ▪ Remove stop words. ▪ Turn tokens into lowercase. ▪ Remove possessives. ▪ Stemming.

5 System B

Results:

▪ First information need: average precision values have improved → more relevant documents appear in a higher ranking.

▪ Second information need: lower average precision values and less relevant documents retrieved → relevant documents appear in lower ranking.

6 System C

Improvements to the querying process: ▪ Apply different weights to different fields.

First set of weights: ▪ More weight in the names of the diseases, symptoms and treatments.

Second set of weights: ▪ More weight in the names of diseases and symptoms, but not on treatments.

7 System C

Results:

MAP (Weight 1) = 0.51  MAP (Weight 2) = 0.63

▪ Second set of weights achieved higher mean average precision values.

▪ Overall, the results improved when compared with the previous systems.

8 Comparison between Systems

For the first information need, the results improved as the systems evolved.
For the second information need, System A provided better results than System B with the improved indexing process. System C gave the best results.

CS:GO Professional Matches and News Information Retrieval

Tools used

● The two tools considered as viable options for this project were Solr and ElasticSearch
● Solr
  ○ Well established
  ○ Geared towards information retrieval
  ○ Documentation lacking in examples and clarity
● ElasticSearch
  ○ Newer
  ○ Bigger focus on data analytics
  ○ Updated documentation and overall stronger web presence
● We opted for Solr mostly because of its focus on information retrieval

Collection and Documents

● Initially, we had 5 different documents: players, matches, picks, economy and news
● Some modifications were made before the implementation of the search system:
  ○ Removal of the economy document
  ○ Addition of an article field to the matches (whenever an article directly mentions the match)
  ○ Addition of a hierarchical structure to the documents, where picks and players are now children of matches
● After these modifications, we are left with 4 documents: players, matches, picks and news (which don’t mention any matches)

Indexing Strategy

● The matches, picks and players were grouped in a single JSON file
● An “article” field was added to the matches when appropriate
● Players and picks were nested in the matches
● As mentioned previously, we also have independent articles which are not connected to any matches (in the original CSV format)
● It should be noted that some articles mention multiple matches; however, the structure of these articles is different and, as such, this connection was dismissed
● The indexed fields represent what we believe to be the most useful for the information needs planned

Indexed Fields

Document Fields

Match team_1 (text_general; csgo_name_general) team_2 (text_general; csgo_name_general) date (pdate) article (text_general;csgo_text_general)

Player player_name (text_general;csgo_text_general) rating (pfloat)

Picks N/A

News title (text_general;csgo_text_general) text (text_general;csgo_text_general) date (pdate)

Field Types and Filters

Field             | Index Filters                                  | Query Filters
csgo_name_general | Stop, EnglishPossessive, PorterStem, LowerCase | Stop, EnglishPossessive, PorterStem, LowerCase, SynonymGraph
csgo_text_general | LowerCase                                      | LowerCase, SynonymGraph

Retrieval Tasks

● Three systems were put in place to test the performance of the search tool
  ○ System 1: default index (fixing only minor details)
  ○ System 2: improved index using the custom field types
  ○ System 3: improved index and use of weights at query time
● For all three systems, the precision@10 and average precision were calculated, followed by the MAP for all three systems

Representative Queries

● Information need: Grand finals played by Astralis ● Query: +astralis grand final champions title “win edition”~10 ● A lot of terms, but they are all commonly used in articles which report a grand final

1 2 3 4 5 6 7 8 9 10 P@10 AP

System 1 R N R R N R N R N R 0.6 0.718

System 2 R N R R N R N R N R 0.6 0.718

System 3 R R R R N N R N R R 0.7 0.869

Representative Queries

● Information need: Transfers into/out of Cloud9 during 2018 ● Query: +cloud9 transfer sign add join confirm exit ● Boost results when terms occur in the title

1 2 3 4 5 6 7 8 9 10 P@10 AP

System 1 R R R N N N R R N R 0.6 0.799

System 2 R R R N N N R R R N 0.6 0.811

System 3 R R R R R R R R R N 0.9 1

Representative Queries

● Information need: Matches won by Natus Vincere in 2019 ● Query: natus vincere AND (win OR victory) ● Not very successful; the order of the terms is important and the query does not specify it; boosts based on date and older articles may be dismissed

1 2 3 4 5 6 7 8 9 10 P@10 AP

System 1 N R R R N N N R N R 0.5 0.489

System 2 N R R R R N R N N R 0.6 0.588

System 3 R N N R N R N N R N 0.4 0.488

Precision-Recall (figure omitted)

Technology Evaluation

● It is difficult to measure a system with just five information needs, but it does give some insight as to how different configurations affect the search results
● MAP values between 0.75 and 0.8; AvPs fluctuate more, registering values between 0.4 and 0.9
● Unexpected results for queries 4 and 5 (poor performance of Systems 2 and 3)
● Solr proved to be an adequate tool for information retrieval contexts, given its high potential for customization and the ability to perform complex queries
● However, the integration of nested documents was overly complicated (mostly due to unclear documentation)

Luís Silva (up201503730), Mariana Costa (up201604414), Pedro Fernandes (up201603846) (Group 7)

BILLBOARD 200: POPULAR ALBUMS AND ARTISTS INFORMATION RETRIEVAL

Group 8: João Miguel ([email protected]), José Azevedo ([email protected]), Ricardo Ferreira ([email protected])

UPDATED CONCEPTUAL MODEL

SCHEMA USED (GENERAL FIELD TYPES)

SCHEMA USED (CUSTOM FIELD TYPES)

CUSTOM FIELD TYPES (ARTIST-NAME)

CUSTOM FIELD TYPES (TAG-TEXT)

CUSTOM FIELD TYPES (DESCRIPTIVE-TEXT)

QUERIES AND WEIGHTS

• Information Need: Find out information about the album “maad city” and its artist Kendrick Lamar • Query: kendrick lamar maad city

• Information Need: Find out information about the artist Pink • Query: pink

• Information Need: Find artists (solo or bands) that were born or have been active in the 80s and have won or been nominated for a Grammy award • Query: (born_date:198? OR years_active:198?) AND biography:”grammies”

Weights that proved to give the best results and ordering:

Field  | artist | album_artist | track_artist | album | track_album | rank.album | song | playlist
Weight | 3.2    | 2.8          | 2.6          | 2.4   | 2.2         | 2.0        | 1.8  | 1.6
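One way to apply these weights is through the (e)dismax qf parameter; a sketch, with the endpoint and core name assumed:

import requests

SOLR = "http://localhost:8983/solr/billboard/select"  # assumed core name

params = {
    "q": "kendrick lamar maad city",
    "defType": "edismax",
    # field weights from the table above
    "qf": "artist^3.2 album_artist^2.8 track_artist^2.6 album^2.4 "
          "track_album^2.2 rank.album^2.0 song^1.8 playlist^1.6",
    "rows": 10,
}
print(requests.get(SOLR, params=params).json()["response"]["numFound"])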

QUERY: KENDRICK LAMAR MAAD CITY

GENERIC FIELDS | CUSTOM FIELDS | CUSTOM FIELDS W/ WEIGHT

"response":{"numFound":10690,"start":0,"numFoundExa "response":{"numFound":5114,"start":0,"numFoundExac "response":{"numFound":5114,"start":0,"numFoundExac ct":true,"docs":[ t":true,"docs":[ t":true,"docs":[ { { { "song":"Sherane a.k.a Master Splinter’s "song":"m.A.A.d city", "album_artist":"Kendrick Lamar", Daughter", "track_album":"good kid, m.A.A.d city", "album":"good kid, m.A.A.d city", "track_album":"good kid, m.A.A.d city", "track_artist":"Kendrick Lamar", "num_listeners":566067, "track_artist":"Kendrick Lamar", "date":"2012", "release_date":"1 January 2012", "date":"2012", "length":"5:50", "tags":["hip-hop", "num_listeners":277069, "num_listeners":364499, "rap", "tags":["hip-hop", "tags":["rap", "hip hop", "rap", "hip-hop", "2012", "go drunk eosin cabs", "2012", "conscious hip-hop", "kendrick lamar"], "gangsta rap"], "kendrick lamar", "id":"d33c9893-7a24-46ac-bf8e-159f8520b7df", "id":"a6800343-ee11-46a8-afa7-fa284a295da3", "west coast rap", "_version_":1684428227576070152}, "_version_":1684424183851778049}, "drake", { { "conscious hip hop"], "song":"Black Boy Fly - Bonus Track", "album_artist":"Kendrick Lamar", "playlist":["Sherane a.k.a Master Splinter’s "track_album":"good kid, m.A.A.d city", "album":"good kid, m.A.A.d city", Daughter", "track_artist":"Kendrick Lamar", "num_listeners":566067, "Bitch, Don’t Kill My Vibe", "date":"2012", "release_date":"1 January 2012", "Backseat Freestyle", "length":"4:39", "tags":["hip-hop", "The Art of Peer Pressure", "num_listeners":86342, "rap", "Money Trees", "tags":["hip hop", "songs i can actually listen to on repeat", "Poetic Justice", "west coast hip hop", "west coast rap", "good kid", "explanation", "drake", "m.A.A.d city", "kendrick lamar"], "conscious hip hop"], "id":"74d8110e-a060-4776-828e-4644fc8fdc81", "_version_":1684428227582361601}, { QUERY: KENDRICK LAMAR MAAD CITY QUERY: PINK

GENERIC FIELDS CUSTOM FIELDS CUSTOM FIELDS W/ WEIGHT

"response":{"numFound":3241,"start":0,"numFoundExac "response":{"numFound":1443,"start":0,"numFoundExac "response":{"numFound":1443,"start":0,"numFoundExac t":true,"docs":[ t":true,"docs":[ t":true,"docs":[ { { { "album_artist":"Soundtrack", "song":"Pink", "artist":"P!nk", "album":"Panther", "track_album":"Nervous System (EP)", "num_listeners":2671748, "num_listeners":48, "track_artist":"Julia Michaels", "tags":["pop", "playlist":["The Pink Panther Theme - From the "date":"2017-07-28", "pop rock", Mirisch-G & E Production \"The Pink Panther\"", "length":"2:48", "female vocalists", "It Had Better Be Tonight (Meglio stasera) - From "num_listeners":15323, "rock", the Mirisch-G & E Production \"The Pink Panther\" "tags":["asmr", "pink", [Instrumental]", "pop", "p!nk", "Royal Blue - 1995 Remastered", ""], "female", "Champagne And Quail", "lyrics":"He's got a thing for fitness, seven "american"], "The Village Inn - From the Mirisch-G & E days a week\nBut I don't really care unless he's "born_date":"8 September 1979 (age 41)", Production \"The Pink Panther\"", working out with me\nHe's got a thing for flowers, "born_in":"Doylestown, Bucks County, "The Tiber Twist", but only certain kinds\nAnd by certain kinds I Pennsylvania, United States", "It Had Better Be Tonight (Vocal) - From the mean, only if it's mine\n\nDon't get enough, he "id":"a7f26d1d-1458-4463-ba52-be09e2fd3995", Mirisch-G & E Production \"The Pink Panther\"", don't get enough\nI don't get enough, he don't get "biography":"Alecia Beth Moore ...", "Cortina - From the Mirisch-G & E Production \"The enough of me\nDon't get enough, I don't get "_version_":1684424144295297026}, Pink Panther\"", enough\nHe don't get enough, I don't get { "The Lonely Princess - From the Mirisch-G & E enough\n\nThere's no innuendos, it's exactly what "album_artist":"P!nk", Production \"The Pink Panther\"", you think\nBelieve me …", "album":"", "Something for Sellers - From the Mirisch-G & E "id":"d8ff6857-1366-4d72-b426-8f2f9d339e04", "num_listeners":176254, Production \"The Pink Panther\"", "_version_":1684424186116702209}, "release_date":"12 October 2017", "Piano And Strings - 1995 Remastered", { "tags":["pop", "Shades Of Sennett"], "best of 2017", "id":"78766d97-644f-4e08-9a26-c91702d24d0e", "2017", "_version_":1684428218417807367}, "the best song ever", { "2010s", "pop rock", "jack antonoff"] QUERY: PINK QUERY: (BORN_DATE:198? OR YEARS_ACTIVE:198?) AND BIOGRAPHY:”GRAMMIES”

GENERIC FIELDS CUSTOM FIELDS

"response":{"numFound":1,"start":0,"numFoundExact": "response":{"numFound":155,"start":0,"numFoundExact true,"docs":[ ":true,"docs":[ { { "artist":"Jens Lekman", "artist":"Anoushka Shankar", "num_listeners":511190, "num_listeners":152177, "tags":["swedish", "tags":["indian", "singer-songwriter", "sitar", "indie pop", "world", "indie", "born_date":"9 June 1981 (age 39)", "pop", "born_in":"London, England, United Kingdom", "jens lekman-2007-night falls over kortedala", "biography":”… was nominated for a Grammy in the "chamber pop", Best Contemporary World Music category; this was "twee", her second Grammy nomination…", "00s"], "id":"b721585b-4d81-4814-b80a-96836f4d7bd9", "born_date":"6 February 1981 (age 39)", "_version_":1684424145327095810}, "born_in":"Angered, Gothenburg, Västra Götaland, { Sweden", "artist":"Tasha Cobbs Leonard", "biography":”… Lekman was nominated for three "num_listeners":8323, Swedish Grammies, three P3 Guld and three Manifest "tags":["gospel", awards, as well as dubbed album of the year by "christian", Nöjesguiden.\n\nA concert film shot from Lekman's "worship"], sold-out show with José González at Göteborg's "born_date":"7 July 1981 (age 39)", concert hall in December 2003 was broadcast by "born_in":"Jesup, Wayne County, Georgia, United Swedish national television two times in 2005. In States", June …", "biography":”… won the Grammy for Best "_version_":1684428207793635330}] Gospel/Contemporary Christian Music Performance.", "id":"eb03f123-e7b1-4ad6-b366-b734fc4f92ab", "_version_":1684424144619307011}, { QUERY: (BORN_DATE:198? OR YEARS_ACTIVE:198?) AND BIOGRAPHY:”GRAMMIES” ALL QUERIES

MAP | No Filters | W/ Filters
    | 0.647      | 0.976

Information Retrieval: Animation Produced In Japan

MIEIC FEUP 2018-2019

Luís Fernando Mouta - up201808916

Dataset

All the information regarding Animes is in the datasets obtained from Milestone 1. Collection: documents regarding general information of Animes. The datasets are different CSV files.

Information Retrieval Tool: Information Retrieval was one of the most important parts of the work, but before we could query the information it was required that the datasets be presented correctly.

Data Refinement: OpenRefine was used to refine the data, enhance its quality to obtain better results, and improve readability to present the data to Apache Solr.

Schema

A schema was created in order to define field names for the attributes in the CSV file.

Filters Applied – Title field

Filters Applied:

● EnglishMinimalStemFilterFactory: conversion of plural words to their singular form
● StopFilterFactory: stops the analysis of common tokens
● EnglishPossessiveFilterFactory: removal of singular possessives

The file was imported to Apache Solr using the command shown on the slide.

Data processing

The core “Anime” was created and is where all the information is going to be stored.

Queries - Searching for Animes of Type Movie

There are different types to deal with: an Anime can be either TV, Movie, Music, ONA, OVA or Special. The query shown on the slide can be used to search for Animes whose type is only Movie (a query sketch follows).
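A sketch of such a type-restricted query against the Anime core; the host is assumed local and the field name type is an assumption, since the schema screenshot is not reproduced here:

import requests

SOLR = "http://localhost:8983/solr/Anime/select"  # core name from the slide; host assumed

# Keep only documents whose type is Movie; "type" is an assumed field name.
params = {"q": "*:*", "fq": 'type:"Movie"', "rows": 10}
print(requests.get(SOLR, params=params).json()["response"]["numFound"])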

Difficulties:
● Figuring out how to work with Apache Solr and how to introduce the correct queries
● How to combine the results of two different Solr cores
● Generating a data file (CSV) that would be correctly interpreted by Apache Solr