Turn on and display Solr spellcheck

Open CGillen opened this issue 4 years ago • 3 comments

Descriptive summary

Search results should display closely matching phrases and spell-corrected phrases. Solr can do this for us, but I don't know whether Blacklight (BL) or Hyrax can read that information.

Expected behavior

Search results display helpful suggestions

Related work

Accessibility Concerns

Linked results are easily understood as a spell correction.

CGillen, Aug 19 '21 18:08

@kmthorn @petersec @shieldsb So this is going to require an update to our Solr config (adding copyField rules to schema.xml to copy into _text_), which will require a reindex 😱 So we want to get this right on the first try if we can, and maybe discuss the importance and needs of this issue a bit.

Spellchecking works by Solr treating a field as a list of correctly spelled words and measuring how far the search query is from each of those words. This means any field containing misspelled words (intentionally or unintentionally) will have those words treated as "correct" spellings, and they may come up as suggestions.
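
To make the "misspellings become suggestions" point concrete, here is a minimal plain-Ruby sketch. It is not Solr's implementation: it assumes plain Levenshtein edit distance, and the method names, the tiny term list, and the `max_distance` cutoff are all made up for illustration.

```ruby
# Illustration only: a toy dictionary-based suggester, NOT how Solr is implemented.
# Assumes plain Levenshtein distance; names and the cutoff are hypothetical.

# Classic dynamic-programming edit distance, kept to two rows at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Every indexed term counts as "correctly spelled", typos included.
def suggest(query, indexed_terms, max_distance: 2)
  indexed_terms
    .reject { |term| term == query }
    .map { |term| [term, levenshtein(query, term)] }
    .select { |_, distance| distance <= max_distance }
    .sort_by { |term, distance| [distance, term] }
    .map(&:first)
end

# "universi1y" and "uniwrsity" stand in for OCR noise that got indexed.
INDEXED_TERMS = %w[university universi1y uniwrsity library].freeze

puts suggest('unversity', INDEXED_TERMS).inspect
```

The OCR-style typos sit close enough to the query that they are offered right alongside the real word, which is the behavior described above.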

I've included a list of the fields that are checked by the general search bar at the bottom of this comment. It's VEEEEEEERY long, but tl;dr: it's pretty much all fields on all work types. Are there any fields we should exclude? Any others we should include?

I think we shouldn't include the all text/hocr text corpora (all_text_tsimv, hocr_text_tsimv), since that data is prone to errors and can cause odd spelling "corrections" to show. Fields that large could also mean another heavy performance hit. E.g., on a search for unversity:

Did you mean to type: university or univeristy or universi1y or uniwrsity or unviversity?

Here's another edge case: even correctly spelled words can come up with suggestions. This is from a query for paged:

Did you mean to type: pages or paced or pared or paved or pated?

pated can be attributed to that all text issue, but the suggestions here are also correctly spelled words; they may just be a little further from the original word than we want. I think we can fine-tune the allowed distance between words w/o reindexing, but it is a piece to consider.
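
As a rough sketch of what query-time tuning could look like: the keys below are standard Solr spellcheck request parameters, but the values are guesses that have not been tried against this index, and `SPELLCHECK_TUNING` and `with_spellcheck_tuning` are hypothetical names.

```ruby
# Sketch only: values are untested placeholders; the constant and helper are hypothetical.
SPELLCHECK_TUNING = {
  'spellcheck': 'on',
  'spellcheck.count': 3,                # offer at most 3 suggestions per term
  'spellcheck.accuracy': 0.8,           # 0..1, higher keeps only closer matches
  'spellcheck.maxResultsForSuggest': 5, # skip suggestions when the query already finds more hits than this
  'spellcheck.onlyMorePopular': 'true'  # only suggest terms more frequent than the query term
}.freeze

# Merge the tuning into an existing solr_parameters hash without mutating it.
def with_spellcheck_tuning(solr_parameters)
  solr_parameters.merge(SPELLCHECK_TUNING)
end

puts with_spellcheck_tuning(pf: 'title_tesim').inspect
```

Since these are request parameters, changing them should not need a reindex, unlike the copyField change.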

first_line_tesim
first_line_chorus_tesim
instrumentation_tesim
table_of_contents_tesim
contained_in_journal_tesim
alternative_tesim
tribal_title_tesim
creator_display_tesim
description_tesim
abstract_tesim
biographical_information_tesim
coverage_tesim
designer_inscription_tesim
former_owner_tesim
inscription_tesim
military_highest_rank_tesim
military_occupation_tesim
military_service_location_tesim
motif_tesim
tribal_notes_tesim
award_tesim
event_tesim
keyword_tesim
legal_name_tesim
sports_team_tesim
tribal_classes_tesim
tribal_terms_tesim
accepted_name_usage_tesim
original_name_usage_tesim
scientific_name_authorship_tesim
street_address_tesim
date_tesim
acquisition_date_tesim
award_date_tesim
collected_date_tesim
date_created_tesim
issued_tesim
view_date_tesim
accession_number_tesim
barcode_tesim
identifier_tesim
item_locator_tesim
copyright_claimant_tesim
rights_holder_tesim
rights_note_tesim
copy_location_tesim
location_copyshelf_location_tesim
box_number_tesim
current_repository_id_tesim
folder_name_tesim
folder_number_tesim
local_collection_id_tesim
provenance_tesim
series_name_tesim
series_number_tesim
source_tesim
has_finding_aid_tesim
has_version_tesim
is_part_of_tesim
relation_tesim
material_tesim
technique_tesim
exhibit_tesim
bulkrax_identifier_tesim
owner_label_tesim
creator_label_tesim
photographer_label_tesim
arranger_label_tesim
artist_label_tesim
author_label_tesim
cartographer_label_tesim
collector_label_tesim
composer_label_tesim
contributor_label_tesim
dedicatee_label_tesim
designer_label_tesim
donor_label_tesim
editor_label_tesim
illustrator_label_tesim
interviewee_label_tesim
interviewer_label_tesim
landscape_architect_label_tesim
lyricist_label_tesim
patron_label_tesim
print_maker_label_tesim
recipient_label_tesim
transcriber_label_tesim
translator_label_tesim
form_of_work_label_tesim
subject_label_tesim
cultural_context_label_tesim
ethnographic_term_label_tesim
military_branch_label_tesim
style_or_period_label_tesim
phylum_or_division_label_tesim
taxon_class_label_tesim
order_label_tesim
family_label_tesim
genus_label_tesim
species_label_tesim
common_name_label_tesim
location_label_tesim
ranger_district_label_tesim
tgn_label_tesim
water_basin_label_tesim
access_restrictions_label_tesim
repository_label_tesim
local_collection_name_label_tesim
publisher_label_tesim
place_of_production_label_tesim
publication_place_label_tesim
workType_label_tesim
institution_label_tesim
license_label_tesim
resource_type_label_tesim
language_label_tesim
non_user_collections_tesim
title_tesim
license_label_tesim
file_format_sim
all_text_tsimv
hocr_text_tsimv

CGillen, Jul 27 '23 23:07

@CGillen @shieldsb I would be fine with holding off on this one. I'm out tomorrow and all next week, so won't have any time to devote to this until late in the work cycle, and this isn't something that came up as a priority at the quarterly. Maybe something to discuss at a future POSM meeting?

petersec, Jul 27 '23 23:07

No problem, gonna leave a note here so I remember some stuff when we come back

Create a copyField into _text_. The source will likely be *_tesim:

<copyField source="*_tesim" dest="_text_"/>

Reindex so _text_ becomes populated. Turn spellcheck on in the all_fields catalog controller definition:

  field.solr_parameters = {
    qf: "#{all_names} #{title_name} license_label_tesim file_format_sim all_text_tsimv hocr_text_tsimv",
    pf: title_name.to_s,
    'spellcheck': 'on',
  }
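
For the display side (whether BL or Hyrax can read the suggestions), here is a hedged sketch of pulling suggestions out of a raw Solr response hash. It assumes the flat "term, info, term, info" layout Solr uses for spellcheck suggestions in JSON output; `spelling_suggestions` and `SAMPLE_RESPONSE` are hypothetical, and the sample is hand-written, not real output from our index. Blacklight may well expose this already, so check its response object first.

```ruby
# Sketch only: helper name and sample data are hypothetical, not real index output.

# Returns { misspelled_term => [suggested words] } from a raw Solr response hash.
def spelling_suggestions(solr_response)
  flat = solr_response.dig('spellcheck', 'suggestions') || []
  flat.each_slice(2).each_with_object({}) do |(term, info), out|
    next unless info.is_a?(Hash)

    # With extended results each suggestion is a hash with a 'word' key;
    # otherwise it is a plain string.
    out[term] = Array(info['suggestion']).map do |s|
      s.is_a?(Hash) ? s['word'] : s
    end
  end
end

SAMPLE_RESPONSE = {
  'spellcheck' => {
    'suggestions' => [
      'unversity',
      { 'numFound' => 2, 'startOffset' => 0, 'endOffset' => 9,
        'suggestion' => %w[university univeristy] }
    ]
  }
}.freeze

puts spelling_suggestions(SAMPLE_RESPONSE).inspect
```

A "Did you mean" partial could then link each suggested word to a new search, which would also cover the accessibility note about the links reading clearly as spelling corrections.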

CGillen, Jul 28 '23 15:07