Art Analysis Information Retrieval
MIEIC 2020/2021 Descrição, Armazenamento e Pesquisa de Informação
Ana Silva (up201604105), Gonçalo Santos (up201603265), Fábio Araújo (up201607944), Susana Lima (up201603634)

1 Information Retrieval tool
Apache Solr
• Wide support and documentation available
• REST API
• Faceting
• Boosting
• Wide variety of filters
• Full-text search
• Fuzzy search
• Proximity search
Figure 1 – Solr logo
2 Collections
• One single collection: SemArt
• 19163 entries
Figure 2 - Collection size
3 Documents
• Two relevant document types: Artwork and Artist
• One schema compatible with the different document types: each attribute is filled in or left undefined according to the type
• Only the artwork document is being considered at the moment.
4 Indexing Process
• Define custom_text
• Define fields schema
Figure 3 – Custom text schema
Figure 4 – Fields schema

5 Retrieval Process
• Using the query field alone returns too many results, many of them not relevant
• Filtering improves the results, and adding boost weights achieves the best ones
• In general, a user searches for the content they want to find in an artwork; therefore, the TITLE field should receive the largest boost
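The filter-plus-boost setup described above maps directly onto Solr request parameters. A hypothetical sketch using the weights from this experiment (the exact request the group issued is not shown in the slides, and the negated-field filter syntax here is an assumption):

```python
from urllib.parse import urlencode

# Hypothetical edismax request for the "Fisherman" information need:
# boost TITLE heavily, effectively ignore DESCRIPTION, and filter out
# the French school (written with Solr's "-" negation here).
params = {
    "q": "Fisherman",
    "defType": "edismax",
    "qf": "TITLE^3.0 DESCRIPTION^0.0",
    "fq": "-SCHOOL:French",
}
print(urlencode(params))
```

The encoded string can be appended to the collection's `/select` endpoint.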
Information need: Relevant items are paintings not from the French school that have a fisherman depicted, and not only fishing elements.

Query ranking formula | Number of results | First 10 results relevancy
Base: Fisherman | 23 | R R R N N R N R N N
Filter: !SCHOOL:French | 17 | R R R R N N N N N N
Weights: TITLE^3.0 DESCRIPTION^0.0 | 17 | R R R R R N N N N N
Table 1 – Results analysis for the stated information need

6 Evaluation: sacred monuments
What paintings have sacred monuments depicted?
Relevant items include paintings whose focus is a sacred monument, for example, a church, mosque, chapel, cathedral, among others. Irrelevant items are artworks that were made to be in one of those monuments, that were made by an artist whose surname is Church or that depict people connected to a religious institution. Relevant details include the artist’s name, the date when the artwork was started, the artwork’s technique, material, image and description.
“church” “painting” “cathedral” “chapel” “mosque” “synagogue” “sanctuary”
7 Evaluation: sacred monuments
Query 1: q: church OR mosque OR chapel OR cathedral
Query 2: q: church OR mosque OR chapel OR cathedral, defType: edismax, qf: TITLE^10 DESCRIPTION^5

       | Query 1             | Query 2
Top 10 | R N R R R N N N R N | R R R R R R R R R R
AvP    | 0.75                | 1.0

Table 2 – Evaluation of different tune weights
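The AvP figures reported above can be reproduced from the relevance judgements with a small helper; a minimal sketch (the function is illustrative, not the group's evaluation script):

```python
def average_precision(marks, k=10):
    """Average precision over the first k results.

    marks: string of 'R'/'N' judgements in rank order, e.g. "RNRRRNNNRN".
    """
    relevant_seen = 0
    precisions = []
    for rank, mark in enumerate(marks[:k], start=1):
        if mark == "R":
            relevant_seen += 1
            # Precision at this rank, sampled at each relevant document.
            precisions.append(relevant_seen / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Judgements from Table 2:
print(round(average_precision("RNRRRNNNRN"), 2))  # → 0.75 (Query 1)
print(round(average_precision("RRRRRRRRRR"), 2))  # → 1.0  (Query 2)
```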
Figure 5 – Precision@k for different tune weights (Query 1, Query 2)

8 Evaluation: influenced by Rubens
Which paintings were influenced by Rubens?
Relevant items include artworks that were influenced by the Flemish artist Peter Paul Rubens. Documents that were made by Rubens or a friend, or that are about him or a family member, are considered irrelevant. Relevant details include the artist’s name, the date when the artwork was started, and the artwork’s technique, material, image and description.
“influence” “artworks” “Rubens” “Peter Paul Rubens”
9 Evaluation: influenced by Rubens
Query 1: q: influence (Rubens OR “Peter Paul Rubens”)
Query 2: q: influence (Rubens OR “Peter Paul Rubens”), defType: edismax, qf: DESCRIPTION^10
Query 3: q: influence (Rubens OR “Peter Paul Rubens”), defType: edismax, pf: DESCRIPTION^10, ps: 5

       | Query 1             | Query 2             | Query 3
Top 10 | N N N N N N N N N N | N N N R N N R N R R | R R R R R R R R N R
AvP    | 0.75                | 0.32                | 0.99

Table 3 – Evaluation of different tune weights
Figure 6 – Precision@k for different tune weights (Query 1, Query 2, Query 3)

STEAM GAMES
Milestone #2 - Information Retrieval
DAPI - November 2020
Ângelo Teixeira (up201606516), Duarte Frazão (up201605658), Mariana Aguiar (up201605904), Pedro Costa (up201605339)

Problem domain
1. Games: represents Steam games, with the game name and description, categories and genres, metrics related to game usage, developer and publisher (organizations), the price and website.
2. Reviews: represents Steam reviews, with the review text and sentiment, the number of people who’ve found it helpful and the game reviewed.
3. Orgs: represents the organizations related to Steam games, with the name and a brief description.

Tool Selection
We compared the two most popular search platforms, Apache Solr and Elasticsearch.
● Elasticsearch is focused on scaling, data analytics and processing time-series data in order to extract meaningful insights and patterns
● Solr is best suited for search applications that use significant amounts of static data

The problem at hand falls within Solr's use case: a search application on Steam games data that performs advanced information retrieval tasks.

Collections and documents
Each game has several reviews. In order to accommodate both in a single collection, we used Solr's nested child documents. Since each review already has the game ID, the correlation is direct.

Example:

{
  "appid": "g10",
  "name": "Counter-Strike",
  ... // remaining fields
  "_childDocuments_": [
    {
      "appid": "g10",
      "review": "...review text",
      "sentiment": 1,
      "number_helpful": 1,
      "id": "0"
    },
    ... // remaining reviews
  ]
}

Indexing process
● 1 collection -> 2 different types of documents (games and reviews) ● Review documents indexed as nested documents of the games
● 3 custom field types -> name_text, tag_text and custom_text
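As a sketch, a custom_text field type combining lowercasing, synonyms, possessive removal, Porter stemming and stop-word removal could be declared in Solr's managed-schema roughly as follows; the attribute values, file names and filter order here are assumptions, not the group's actual schema:

```xml
<!-- Hypothetical declaration; file names and ordering are assumptions. -->
<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- FlattenGraphFilter is required after SynonymGraph at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```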
● Filters used in custom_text
○ Lowercase filter
○ Synonym filter (with game title acronyms)
○ English possessive filter
○ Stemming filter (Porter algorithm)
○ Stop words filter

Retrieval process - information needs
To test our system, we created 6 Information Needs (IN):
1. Online games with server issues
2. Family games
3. Free games with in-app purchases
4. Specific game
5. Games with toxic community
6. Fast paced games
However, in this presentation we chose to detail only the first 4 INs due to time constraints. The full study can be found in the report.

Retrieval process - systems
For this experiment, 3 Information Systems were developed:
1. Baseline - serves as a control for the experiment and a root for comparison
2. Custom Indexing - applies filters to fields, such as lower-casing, game-title synonyms, stop-word removal and stemming
3. Custom Querying - weighs the fields differently, to mark some of them as more relevant than others

Information need I - Online Games with server issues
Information Need: Online games with server problems
Information Need Type: Informational
Requirements (relevance criteria): mention of lag or server problems in the reviews
Query: (lag OR "server down") AND online
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1 | 1 | 1 | 0
P@10 | 0.80 | 1 | 0
AvP | 0.95 | 1 | 0

Information need II - Family games
Information Need: Games suited for family play
Information Need Type: Informational
Requirements (relevance criteria):
● Required age < 12
● Multiplayer
● Suitable for couch play
Query: (family OR "fun for all" OR kid) AND multiplayer
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1 | 0 | 1 | 0
P@10 | 0.60 | 0.90 | 0.50
AvP | 0.66 | 1 | 0.50

Information need III - Free games with in-app purchases
Information Need: Free games that have in-app purchases, sometimes deemed “pay-to-win” games
Information Need Type: Informational
Requirements (relevance criteria):
● Free
● In-game purchases / transactions
Query: ("pay to win" OR "in-app purchase"~10) AND free
Query Fields: detailed_description review categories
Custom Querying Weights: detailed_description^1.5 review^0.5 categories^0.1

System | Baseline | Custom Indexing | Custom Querying
P@1 | 1 | 0 | 0
P@10 | 0.80 | 0.70 | 0.40
AvP | 0.75 | 0.67 | 0.28

Information need IV - Specific game
Information Need: Page of the Counter-Strike game
Information Need Type: Navigational
Requirements (relevance criteria): must be the Counter-Strike game
Query: “Counter-Strike”
Query Fields: name

System | Baseline | Custom Indexing | Custom Querying
P@1 | 1 | 1 | 1

Thanks to the synonyms list, the system can find the game even by its aliases, such as “CS”!

Retrieval process - results
System | Baseline | Custom Indexing | Custom Querying
MAP | 0.814 | 0.861 | 0.357

Tool evaluation and final remarks
● Most of our trouble boils down to the lack of documentation on how to use Solr
● After fiddling with the dataset, we found that dealing with nested documents is not trivial for retrieval operations
● The results were rather interesting: index-time polishing of the dataset is very relevant to the accuracy of the Information System
● When trying out different query weights, we decided to reduce the weight of the reviews, as they were the predominant source for un-weighted queries, which hurt accuracy

Goodreads Books and Reviews
DAPI 2020/21 - Group 3
Presentation 2 - Information Retrieval System

System’s Datasets

The dataset preparation resulted in three datasets:
● Books: CSV format, about 10,000 entries
● Authors: CSV format, about 3,100 entries
● Reviews: JSON format, about 500,000 entries
Group 3 | DAPI 2020/2021 | Presentation 2 - Information Retrieval System

System’s Datasets - Book
● GoodReads ID: number
● Title: text
● ISBN code: text
● Language code: text
● Publication year: number
● Rating: number
● Authors: text array
System’s Datasets - Author
● Name: text
● Gender: text
● Date of birth: text
● Place of birth: text
● Country(ies) of citizenship: text
System’s Datasets - Review
● ID: text
● GoodReads book id: number
● Text: text
● Date: text
● Rating: number
System’s Datasets
The showcased work was developed using Apache Solr. The IR system comprises 3 types of documents in the same core:
● Books
● Authors
● Reviews

The three datasets (Books, Authors and Reviews) were merged into a single goodreads.json file and imported into Solr using the post utility tool:
$ post -c goodreads -format solr goodreads.json
Indexing the Datasets - Schema fields
Book
● title: text_general, indexed, stored
● id: string, stored
● isbn: string, indexed, stored
● language_code: string, indexed, stored
● publication_year: plongs, indexed, stored
● book_rating: string, indexed, stored
● authors: string, stored
Indexing the Datasets - Schema fields
Author
● author_name: text_general, indexed, stored
● sex_or_gender: string, indexed, stored
● date_of_birth: string, indexed, stored
● place_of_birth: text_general, indexed, stored
● country_of_citizenship: string, indexed, stored
Indexing the Datasets - Schema fields
Review
● review_text: text_general, indexed, stored
● id: string, stored
● date: string, indexed, stored
● review_rating: plongs, indexed, stored
● book_id: string, stored
● book_name: string, stored
Indexing the Datasets - Configurations
Three IR system configurations were taken into account:
Configuration | Stop words / Synonyms | Analyzer Filters
IR1 | No | No
IR2 | No | Yes
IR3 | Yes | Yes
Indexing the Datasets - Analyzer Filters
● Stop Filter: removes stop words from a given stop words list
● Synonym Graph Filter: considers terms’ synonyms (querying only)
● Lowercase Filter: converts any uppercase letters in a token to the equivalent lowercase token
● English Possessive Filter: removes singular possessives (trailing ’s) from words
● Porter Stem Filter: applies the Porter Stemming Algorithm for English
● Hyphenated Words Filter: reconstructs hyphenated words that were tokenized as two tokens because of a line break or other intervening whitespace
Querying and Evaluation - Methodology
Parameters used to evaluate the three different IR system configurations:
● Total of 8 information need / query pairs
● For each information need, 3 sets of field weights
● DisMax querying mode
The first 20 results were classified as Relevant or Non-Relevant based on their fulfilment of the information need.
Querying and Evaluation - Methodology
Metrics used to evaluate the system configuration:
● Precision@k
● Recall@k
● Interpolated Precision-Recall
● Average Precision (AvP)
● Mean Average Precision (MAP), for each configuration / field weight pair
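Among the metrics listed above, interpolated precision-recall is the least obvious to compute; a minimal sketch of the standard 11-point interpolation (helper name and toy inputs are illustrative, not the group's evaluation code):

```python
def interpolated_precision_recall(marks, total_relevant, points=11):
    """11-point interpolated precision-recall curve.

    marks: string of 'R'/'N' relevance judgements in rank order.
    total_relevant: relevant documents in the collection for the query.
    """
    recalls, precisions = [], []
    relevant_seen = 0
    for rank, mark in enumerate(marks, start=1):
        if mark == "R":
            relevant_seen += 1
        recalls.append(relevant_seen / total_relevant)
        precisions.append(relevant_seen / rank)
    curve = []
    for i in range(points):
        r = i / (points - 1)
        # Interpolated precision at recall r: the highest precision
        # observed at any recall level >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Toy ranking "R N R N" with 2 relevant documents in total:
print(interpolated_precision_recall("RNRN", total_relevant=2))
```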
Querying and Evaluation - Methodology
Three sets of field weights were taken into account:
Field Weights | Review Text | Book Title | Author Name | Author Country
WF1 | 1.1 | 0.9 | 0.9 | 0.9
WF2 | 0.75 | 2 | 2 | 1
WF3 | 0.825 | 2.75 | 2.45 | 1.375
Querying and Evaluation - Example no. 1
System: IR2 (analyzer filters applied, no stop words / synonyms used)
Query: [religion faith]
Information need: Search for religion-related content and opinions on books
 | WF1 | WF3
AvP (first 20 results) | 85% | 57%
Explanation: The WF1 weighting system is better suited for information needs satisfied by book Review documents!
Querying and Evaluation - Methodology
Reasoning:
● WF1 should deliver better results for information needs that are satisfied by book Reviews; ● WF2 and WF3 should deliver better results for information needs that are satisfied by Books and Authors.
Field Weights | Review Text | Book Title | Author Name | Author Country
WF1 | 1.1 | 0.9 | 0.9 | 0.9
WF2 | 0.75 | 2 | 2 | 1
WF3 | 0.825 | 2.75 | 2.45 | 1.375
Querying and Evaluation - Example no. 2
System: IR2 (analyzer filters applied, no stop words / synonyms used)
Query: [movie film cinema]
Information need: Find books with movie adaptations

 | WF1 | WF3
AvP (first 20 results) | 87% | 95%
Explanation: The WF3 weighting system is better suited for information needs satisfied by Book documents!
Results and Conclusions - System IR1
● Configuration IR1: no stop words / synonyms, no analyzer filters
● Similar values for the different field weight configurations
● Inconclusive!
Results and Conclusions - System IR2
● Configuration IR2: no stop words / synonyms, uses analyzer filters
● High precision and interpolated precision-recall values in the first documents retrieved
● Adding analyzer filters significantly improved the results!
Results and Conclusions - System IR3
● Configuration IR3: uses stop words / synonyms, uses analyzer filters
● Similar results when compared to configuration IR2, although WF1 achieved better results with this system configuration!
Results - Mean Average Precision
Results for each IR system / field-weighting configuration pair:
MAP for IR/WF pair | WF1 | WF2 | WF3
IR1 | 60% | 53% | 53%
IR2 | 87% | 65% | 59%
IR3 | 88% | 59% | 56%
Conclusions
● The addition of analyzer filters (lowercase conversion, stemming, singular possessive removal, …) significantly improved result relevancy
● Overall, the WF1 configurations achieved better results: this is the most flexible weighting configuration, and most information needs in the test suite were satisfied by any type of document
● The addition of stop words / synonyms analyzers was only meaningful when applied together with other filters
Future Work
● Testing different system configurations, using other analyzer filters and/or stop words and synonyms lists
● Tweaking the field weights for the different text fields
● Using a more complex and robust test set
● Using different Solr querying methods
Thank you for your attention
Any questions?
Information Description, Storage and Retrieval
Popular movies and streaming
Milestone 2 - Group 4
Carlos Gomes (up201603404), Eduardo Silva (up201603135), Joana Silva (up201208979), Joana Ramos (up201605017)

Datasets
● Streaming Dataset: structured dataset in .csv format with information regarding the streaming platform in which a movie is available.
● IMDb Scraped Dataset: structured dataset with movie information retrieved through scraping of IMDb’s website.
● IMDb Official Dataset: structured dataset in .tsv format with IMDb’s website information regarding movies.
● IMDb Dataset: unstructured data (movie synopses) obtained by scraping of IMDb’s movie pages.

Document Types
Number of Indexed Documents: 105 975
● Movies: 15 531
● People: 90 444

Used strategy to index the dataset

Document Fields - Movies
Both document types (Movies and People) were indexed in the same core, with a single schema containing the following fields:
Field | Description
imdb_id | ID of the movie within the IMDb database
popularTitle | Title by which the movie is best known
synopsis | Brief summary of the movie
runtimeMinutes | Duration of a movie in minutes
genres | Various genres of a movie (e.g. action)
netflix / primevideo / disney / hulu | Streaming platforms at which the movie is available

Other fields: startYear, originalTitle, isAdult

Document Fields - People
Field | Description
imdb_id | ID of the movie within the IMDb database
imdb_name_id | ID of the person within the IMDb database
category | Type of job carried out (e.g. actor, composer, etc.)
characters | Character(s) played by the actor/actress in the movie (if applicable)
name | The person’s name
bio | The person’s biography

Other fields: date_of_birth, date_of _death, birth_name, reason_of_death, death_details, birth_details, children, height, divorces, place_of_birth, place_of_death, spouses, ...

Used Filters - Movie Title & Character Names
Two new field types were created: text_title (used in movie titles and character names) and custom_text (used in the synopsis and bio). The following table shows the filters used in each field type, at index and query time.

Movie Title & Character Names | Index | Query
White Space Tokenizer | ✓ | ✓
Standard Tokenizer | ⨯ | ⨯
Lowercase Filter | ✓ | ✓
Porter Stemming | ✓ | ✓
Synonym Graph Filter | ⨯ | ✓
Duplicate Removal | ✓ | ⨯
English Possessive Filter | ⨯ | ⨯
Stop Word Filter | ⨯ | ⨯

Used Filters - Synopsis & Bio
Synopsis & Bio | Index | Query
White Space Tokenizer | ⨯ | ⨯
Standard Tokenizer | ✓ | ✓
Lowercase Filter | ✓ | ✓
Porter Stemming | ✓ | ✓
Synonym Graph Filter | ⨯ | ✓
Duplicate Removal | ✓ | ✓
English Possessive Filter | ✓ | ✓
Stop Word Filter | ✓ | ✓

Alternative strategy

Another indexing strategy was experimented with: indexing the two different types of documents in two cores.
This made it possible to retrieve the different types of documents in a query without explicitly mentioning fields from both of them. For instance, when searching for a movie (e.g. Inception), the results list included the actors that played in this movie (e.g. Leonardo DiCaprio).
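In Solr terms, this cross-core lookup uses the join query parser; a hypothetical request built in Python (the core names `people` and `movies` and the host are assumptions):

```python
from urllib.parse import urlencode

# Hypothetical request for the two-core setup: run against the people
# core and join against the movies core on imdb_id.
params = {
    "q": "{!join from=imdb_id fromIndex=movies to=imdb_id}"
         "popularTitle:Inception",
}
url = "http://localhost:8983/solr/people/select?" + urlencode(params)
print(url)
```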
Query:
● [{!join from=imdb_id fromIndex=movies to=imdb_id}popularTitle:Inception]

Weight Schemes
For the query parser, we used Extended DisMax with the following weight schemes, depending on the type of document to return:
Movies:
● Popular Title: x2
● Synopsis: x3
● Genres: x1

People:
● Name: x2
● Bio: x4
● Category: x3
● Characters: x2

Representative Queries

Representative Query - Movies
Information need:
● Gather movies about World War 2 that are featured on Netflix
Query:
● [netflix:true AND world war 2]
Field weights used:
● synopsis^3 popularTitle^2 ...

Representative Query - Movies
Information need:
● Gather movies about WW2 that are featured on Netflix
Query:
● [netflix:true AND ww2]
Field weights used:
● synopsis^3 popularTitle^2

Representative Query - People
Information Need:
● Retrieve the British actors that have been nominated for or have won an Oscar award
Query:
● [category:act* AND oscar AND place_of_birth:engl* AND (winning nomina*)]
Field weights used:
● bio^2 ...

Technology Evaluation

Evaluating Query Results
Query #1:
● Information Need: Retrieve the British actors that have been nominated for or have won an Oscar award
● Query: [category:act* AND oscar AND place_of_birth:engl* AND (winning nomina*)]
● Field weights used: bio^2

Query #2:
● Information Need: Retrieve the movies that are set during World War 2
● Query: [world war 2 nazi holocaust]
● Field weights used: synopsis^3 popularTitle^2 genres

Evaluation Results
AvP #1: 0.582
AvP #2: 0.769
MAP: 0.675 Solr as an IR tool
Advantages:
● Many features;
● Plentiful documentation;
● GUI available;
● Good performance when indexing and fetching results.

Disadvantages:
● Lack of practical examples in documentation;
● No dedicated re-indexing mechanism;
● Little GUI documentation.
MEMBERS OF THE EUROPEAN PARLIAMENT
RESOLUTIONS

Data Import Handler
Custom Field Type | Tokenizer | Filters
gender_field | Standard | SynonymGraph, LowerCase
comma_list | Pattern | -
Synonyms text file:
F,Female,Woman,Women,Feminine,Girl
M,Male,Man,Men,Masculine,Boy,Guy

Pattern applied in the comma_list field type: \s*,\s*

MEP field | Type
id | string
name | string
gender | gender_field
country | text_general
birth_date | pdate
national_party | text_general

Resolution field | Type
id | string
title | text_general
content | text_general
doc | string
committee | text_general
mep_favor | text_general
mep_against | text_general
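The comma_list pattern tokenizes on a comma with optional surrounding whitespace; the same behaviour can be checked in Python (this snippet is illustrative, not part of the Solr configuration):

```python
import re

# The comma_list field type tokenizes on \s*,\s* :
# a comma with optional surrounding whitespace.
pattern = re.compile(r"\s*,\s*")

tokens = pattern.split("F, Female ,Woman,  Women")
print(tokens)  # → ['F', 'Female', 'Woman', 'Women']
```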
Query: female meps
Results (indexed): [screenshot omitted]
Results (filtered): [screenshot omitted]

Query: oral question fish
Results (filtered): [screenshot omitted]
Results (weighted): [screenshot omitted]

Rapid, easy to use search tool
Uses less disk space
Considering the size of data after indexing
Coding languages
Solr supports fewer coding languages than, e.g., Elasticsearch.

2020/2021
Diseases, Symptoms and Treatments
Information Description, Storage and Retrieval
Group 6:
▪ André Esteves - up201606673
▪ Francisco Filipe - up201604601
▪ Helena Montenegro - up201604184
▪ Juliana Marques - up201605568

Information Needs
Information Need 1:
Title: Drugs used for a symptom
Description: What drugs are used to cure a cough?
Queries: [drugs treat cough] and [drugs cough]

Information Need 2:
Title: Medical specialties associated with a symptom
Description: What medical specialty should I visit when I have a cough?
Queries: [medical specialty related cough] and [medical specialty cough]
Information Retrieval Systems
Tool used: Apache Solr
We developed three systems:
▪ System A: simple version.
▪ System B: improvements to the indexing process.
▪ System C: improvements to the querying process.
System A
3 documents indexed in one core: Disease, Symptom and Treatment.
Schema:
System A
Results:
▪ The simpler queries [drugs cough] and [medical specialties cough] resulted in higher recall and average precision than the other queries.
▪ In the query [drugs cough], the relevant documents are ranked higher.

Conclusions:
▪ The relevance of the results depends on the capacity of the query to express the information need.
System B
Improvements to the indexing process.
Filters:
▪ Remove stop words.
▪ Turn tokens into lowercase.
▪ Remove possessives.
▪ Stemming.
System B
Results:
▪ First information need: average precision values improved → more relevant documents appear higher in the ranking.

▪ Second information need: lower average precision values and fewer relevant documents retrieved → relevant documents appear lower in the ranking.
System C
Improvements to the querying process:
▪ Apply different weights to different fields.

First set of weights:
▪ More weight on the names of the diseases, symptoms and treatments.

Second set of weights:
▪ More weight on the names of diseases and symptoms, but not on treatments.
System C
Results:
MAP (Weight set 1) = 0.51
MAP (Weight set 2) = 0.63
▪ Second set of weights achieved higher mean average precision values.
▪ Overall, the results improved when compared with the previous systems.
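MAP itself is just the mean of the per-query average precision values; a minimal sketch (the per-query AvP inputs below are hypothetical, since the slide reports only the final MAP figures):

```python
def mean_average_precision(per_query_avp):
    """MAP: the mean of the per-query average precision values."""
    return sum(per_query_avp) / len(per_query_avp)

# Hypothetical per-query AvP values averaging to the reported 0.51.
print(round(mean_average_precision([0.40, 0.62]), 2))  # → 0.51
```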
Comparison between Systems
For the first information need, the results improved as the systems evolved. System C gave the best results.

For the second information need, System A provided better results than System B, which had the improved indexing process.
CS:GO Professional Matches and News Information Retrieval

Tools used
● The two tools considered as viable options for this project were Solr and Elasticsearch
● Solr
○ Well established
○ Geared towards information retrieval
○ Documentation lacking in examples and clarity
● Elasticsearch
○ Newer
○ Bigger focus on data analytics
○ Updated documentation and overall stronger web presence
● We opted for Solr, mostly because of its focus on information retrieval

Collection and Documents
● Initially, we had 5 different document types: players, matches, picks, economy and news
● Some modifications were made before the implementation of the search system:
○ Removal of the economy document
○ Addition of an article field to the matches (whenever an article directly mentions the match)
○ Addition of a hierarchical structure to the documents, where picks and players are now children of matches
● After these modifications, we are left with 4 document types: players, matches, picks and news (articles which don’t mention any matches)

Indexing Strategy
● The matches, picks and players were grouped in a single JSON file
● An “article” field was added to the matches where appropriate
● Players and picks were nested in the matches
● As mentioned previously, we also have independent articles which are not connected to any matches (kept in the original CSV format)
● Some articles mention multiple matches; however, the structure of these articles is different and, as such, this connection was dismissed
● The indexed fields represent what we believe to be the most useful for the planned information needs

Indexed Fields
Document | Fields
Match | team_1 (text_general; csgo_name_general), team_2 (text_general; csgo_name_general), date (pdate), article (text_general; csgo_text_general)
Player | player_name (text_general; csgo_text_general), rating (pfloat)
Picks | N/A
News | title (text_general; csgo_text_general), text (text_general; csgo_text_general), date (pdate)

Field Types and Filters
Field Type | Index Filters | Query Filters
csgo_name_general | Stop, EnglishPossessive, PorterStem, LowerCase | Stop, EnglishPossessive, PorterStem, LowerCase, SynonymGraph
csgo_text_general | LowerCase | LowerCase, SynonymGraph

Retrieval Tasks
● Three systems were put in place to test the performance of the search tool
○ System 1: default index (fixing only minor details)
○ System 2: improved index using the custom field types
○ System 3: improved index and use of weights at query time
● For all three systems, precision@10 and average precision were calculated, followed by the MAP for all three systems

Representative Queries
● Information need: Grand finals played by Astralis
● Query: +astralis grand final champions title “win edition”~10
● A lot of terms, but they are all commonly used in articles which report a grand final
Rank | 1 2 3 4 5 6 7 8 9 10 | P@10 | AP
System 1 | R N R R N R N R N R | 0.6 | 0.718
System 2 | R N R R N R N R N R | 0.6 | 0.718
System 3 | R R R R N N R N R R | 0.7 | 0.869

Representative Queries
● Information need: Transfers into/out of Cloud9 during 2018
● Query: +cloud9 transfer sign add join confirm exit
● Boost results when terms occur in the title
Rank | 1 2 3 4 5 6 7 8 9 10 | P@10 | AP
System 1 | R R R N N N R R N R | 0.6 | 0.799
System 2 | R R R N N N R R R N | 0.6 | 0.811
System 3 | R R R R R R R R R N | 0.9 | 1

Representative Queries
● Information need: Matches won by Natus Vincere in 2019
● Query: natus vincere AND (win OR victory)
● Not very successful: the order of the terms is important and the query does not specify it; there are boosts based on date, so older articles may be dismissed
Rank | 1 2 3 4 5 6 7 8 9 10 | P@10 | AP
System 1 | N R R R N N N R N R | 0.5 | 0.489
System 2 | N R R R R N R N N R | 0.6 | 0.588
System 3 | R N N R N R N N R N | 0.4 | 0.488

Precision-Recall

Technology Evaluation
● It is difficult to measure a system with just five information needs, but it does give some insight into how different configurations affect the search results
● MAP values between 0.75 and 0.8; AvPs fluctuate more, registering values between 0.4 and 0.9
● Unexpected results for queries 4 and 5 (poor performance of Systems 2 and 3)
● Solr proved to be an adequate tool for information retrieval contexts, given its high potential for customization and the ability to perform complex queries
● However, the integration of nested documents was overly complicated (mostly due to unclear documentation)

Luís Silva (up201503730), Mariana Costa (up201604414), Pedro Fernandes (up201603846) (Group 7)

BILLBOARD 200: POPULAR ALBUMS AND ARTISTS INFORMATION RETRIEVAL
Group 8
João Miguel ([email protected]), José Azevedo ([email protected]), Ricardo Ferreira ([email protected])

UPDATED CONCEPTUAL MODEL
SCHEMA USED (GENERAL FIELD TYPES)
SCHEMA USED (CUSTOM FIELD TYPES)
CUSTOM FIELD TYPES (ARTIST-NAME)
CUSTOM FIELD TYPES (TAG-TEXT)
CUSTOM FIELD TYPES (DESCRIPTIVE-TEXT)

QUERIES AND WEIGHTS
• Information Need: Find out information about the album “maad city” and its artist Kendrick Lamar
• Query: kendrick lamar maad city
• Information Need: Find out information about the artist Pink • Query: pink
• Information Need: Find artists (solo or bands) that were born or have been active in the 80s and have won or been nominated for a Grammy award
• Query: (born_date:198? OR years_active:198?) AND biography:”grammies”
Weights that proved to give the best results, in order:

Field  | artist | album_artist | track_artist | album | track_album | rank.album | song | playlist
Weight | 3.2 | 2.8 | 2.6 | 2.4 | 2.2 | 2.0 | 1.8 | 1.6

QUERY: KENDRICK LAMAR MAAD CITY
GENERIC FIELDS | CUSTOM FIELDS | CUSTOM FIELDS W/ WEIGHT

Generic fields: "numFound": 10690; the top results are other tracks from the album (e.g. “Sherane a.k.a Master Splinter’s Daughter”).
Custom fields: "numFound": 5114; the top result is the track “m.A.A.d city” from “good kid, m.A.A.d city”.
Custom fields with weights: "numFound": 5114; the top results are the “good kid, m.A.A.d city” album documents by Kendrick Lamar.
[full JSON responses omitted]

QUERY: PINK
GENERIC FIELDS: numFound 3241 | CUSTOM FIELDS: numFound 1443 | CUSTOM FIELDS W/ WEIGHT: numFound 1443

[Side-by-side Solr responses. With generic fields, the top hit is the soundtrack album "Panther" (The Pink Panther); with custom fields, it is the song "Pink" by Julia Michaels; with field weights, the artist P!nk ranks first.]

QUERY: (BORN_DATE:198? OR YEARS_ACTIVE:198?) AND BIOGRAPHY:"GRAMMIES"
GENERIC FIELDS: numFound 1 | CUSTOM FIELDS: numFound 155

[Side-by-side Solr responses. With generic fields, the only hit is Jens Lekman (born 6 February 1981), whose biography literally contains "Grammies"; with custom fields, 155 artists match, led by Anoushka Shankar (born 9 June 1981) and Tasha Cobbs Leonard (born 7 July 1981), whose biographies mention "Grammy" — the analyzed custom text fields also match that variant.]

ALL QUERIES
      No Filters   W/ Filters
MAP   0.647        0.976

Information Retrieval: Animation Produced In Japan
MIEIC FEUP 2018-2019
Luís Fernando Mouta - up201808916

Dataset