opencast icon indicating copy to clipboard operation
opencast copied to clipboard

Replace Solr Search with OpenSearch

Open gregorydlogan opened this issue 2 years ago • 1 comments

This PR removes Solr (and all of its issues) and replaces it with Opensearch/Elasticsearch (yay, new issues!), unifying our indexing tools.

To accomplish this:

  • We had to remove the Feeds
  • We replaced Solr with Opensearch
    • Notably, I did not include URL signing support with this. If this is a deal breaker we can work on it later.
  • We updated Tobira's integration
  • We updated out Paella glue
  • We fixed other integrations with Search

I've hit a few questions as I've reviewed this internally a bunch, so I'm going to copy those questions up here as an initial code review.

If you have a real system with lots of workflows and series, and want to give this a spin, know that this does not need anything but the database, and the dublincore metadata files. You should be able to pull those out of production deployments and then test this in entirely separate systems.

Todo list: ~- Release notes note indicating that a reindex of the search service is required.~

Known, unfixable issues:

  • Paella tests rely on develop.opencast.org to serve the search/* endpoints in the format expected by this PR. This obviously will not happen until after this PR is merged. Paella tests can be verified locally by setting up Opencast to run on port 80, then npm run dev and npx playwright test after the npm script finishes booting.

Closes #3921

Your pull request should…

gregorydlogan avatar Feb 08 '24 21:02 gregorydlogan

Aware of the broken Paella 7 tests, didn't catch that there were additional tests not run by mvn. Will address (hopefully) tomorrow, but should not block other testing.

gregorydlogan avatar Feb 09 '24 01:02 gregorydlogan

This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.

github-actions[bot] avatar Feb 27 '24 16:02 github-actions[bot]

When deleting an event or series, they remain in the search index which is afaik intended. However, the GET /search/episode.json does not return deleted events, while the GET /search/series.json does return deleted series. This may be related to the deleted attribute of a series in the search index never being set (not when the series is deleted, nor when the event it was attached to is deleted). Seems inconsistent to me, is this intended?

This should be fixed. Or at least it looks like it's working for me?

gregorydlogan avatar Mar 20 '24 05:03 gregorydlogan

When deleting an event or series, they remain in the search index which is afaik intended. However, the GET /search/episode.json does not return deleted events, while the GET /search/series.json does return deleted series. This may be related to the deleted attribute of a series in the search index never being set (not when the series is deleted, nor when the event it was attached to is deleted). Seems inconsistent to me, is this intended?

This should be fixed. Or at least it looks like it's working for me?

Not working for me. The deleted property of a deleted series is still null. Here's some Screenshots of my local Opencast after deleting a series and all attached events: Bildschirmfoto vom 2024-03-20 10-10-42 Bildschirmfoto vom 2024-03-20 10-11-00 Bildschirmfoto vom 2024-03-20 10-11-23

Arnei avatar Mar 20 '24 09:03 Arnei

Gah. It didn't occur to me to test deleting the series without removing the recordings - I always went the other way 'round. I'm surprised this is allowed. I can reproduce this one, thankfully?

This may take a day or two to chase down. It looks like it should be working via a goat trail through at least 5 different classes, but clearly it's not.

gregorydlogan avatar Mar 20 '24 22:03 gregorydlogan

Gah. It didn't occur to me to test deleting the series without removing the recordings - I always went the other way 'round. I'm surprised this is allowed. I can reproduce this one, thankfully?

This may take a day or two to chase down. It looks like it should be working via a goat trail through at least 5 different classes, but clearly it's not.

If I remember correctly, the search service publishes series when an event with a series is being published and uses information from that event only. This means that deleting a series from the series service should have no effect unless the events are updated and republished.

That may explain the described behavior?

lkiesow avatar Apr 02 '24 22:04 lkiesow

That's exactly why it was happening. Well, that and the event handlers weren't hooked up quite right so there was no way to pass the message through :)

Removing a series without the events is unfortunately expensive, but since the rest of the system supports it we have to support it here.

gregorydlogan avatar Apr 03 '24 03:04 gregorydlogan

This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.

github-actions[bot] avatar Apr 23 '24 14:04 github-actions[bot]

Search is missing from the REST docs:

POST /rebuild/{service} service: The service to recreate index from. The available services are: Themes, Series, Scheduler, AssetManager, Comments and Workflow. The service order (see above) is very important! Make sure, you do not run index rebuild for more than one service at a time!

It should be one of the options.

lkiesow avatar Apr 23 '24 20:04 lkiesow

The re-index doesn't work:

2024-04-23 22:23:11,145 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 44/2000 finished, 2% complete.
2024-04-23 22:23:11,412 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 45/2000 finished, 2% complete.
2024-04-23 22:23:11,670 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 46/2000 finished, 2% complete.
2024-04-23 22:23:11,928 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 47/2000 finished, 2% complete.
2024-04-23 22:23:12,152 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 48/2000 finished, 2% complete.
2024-04-23 22:23:12,386 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 49/2000 finished, 2% complete.
2024-04-23 22:23:12,511 | INFO  | (IndexRebuildService:214) - Finished to rebuild the Elasticsearch index from service 'Search'

It shouldn't stop at 2%

lkiesow avatar Apr 23 '24 20:04 lkiesow

Sorry, @gregorydlogan, but this is still broken. Luckily, I think that I know why. But first, running the newly patched version will still result in:

2024-04-24 21:08:04,257 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 47/2000 finished, 2% complete.
2024-04-24 21:08:04,498 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 48/2000 finished, 2% complete.
2024-04-24 21:08:04,759 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 49/2000 finished, 2% complete.
2024-04-24 21:08:04,894 | INFO  | (IndexRebuildService:214) - Finished to rebuild the Elasticsearch index from service 'Search'

After adding a bit of logging, you can see why:

         offset += pageSize;
+        logger.info("offset: {}, total: {}", offset, total);
       } while (offset <= total);
2024-04-24 21:17:56,220 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 47/2000 finished, 2% complete.                                                                                        
2024-04-24 21:17:56,481 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 48/2000 finished, 2% complete.                                                                                        
2024-04-24 21:17:56,745 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 49/2000 finished, 2% complete.                                                                                        
2024-04-24 21:17:56,899 | INFO  | (SearchServiceIndex:432) - offset: 50, total: 2000                                                                                                                                                          
2024-04-24 21:17:56,901 | INFO  | (SearchServiceIndex:432) - offset: 100, total: 2000                                                                                                                                                         
2024-04-24 21:17:56,901 | INFO  | (SearchServiceIndex:432) - offset: 150, total: 2000  

So, we don't get any new data after the first round. Looking at getAllMediaPackages, it becomes immediately clear why:

  public Stream<Tuple<MediaPackage, String>> getAllMediaPackages(int pagesize, int offset)                               
          throws SearchServiceDatabaseException {                                                                        
    List<SearchEntity> searchEntities;                                                                                   
    try {                                                                                                                
      int firstResult = pagesize * offset;                                                                               
      searchEntities = db.exec(namedQuery.findSome("Search.findAll", firstResult, pagesize, SearchEntity.class));

The second parameter of this method is called offset and that's what we pass to this function. However, this method passes pagesize * offset to the database instead. That means that the parameter should be called int page and we should pass the page to this method starting with 0 and increasing by 1 every time we want the next page. Alternatively, removing the int firstResult = pagesize * offset; should probably also fix the issue.


This patch should be fully functional (tried the revuild using this=:

diff --git a/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java b/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java
index 8ed05ae2a4..987c6ce325 100644
--- a/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java
+++ b/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java
@@ -400,13 +400,13 @@ public final class SearchServiceIndex extends AbstractIndexProducer implements I
     try {
       int total = persistence.countMediaPackages();
       int pageSize = 50;
-      int offset = 0;
+      int pageOffset = 0;
       AtomicInteger current = new AtomicInteger(0);
       logIndexRebuildBegin(logger, esIndex.getIndexName(), total, "search");
       List<Tuple<MediaPackage, String>> page = null;
 
       do {
-        page = persistence.getAllMediaPackages(pageSize, offset).collect(Collectors.toList());
+        page = persistence.getAllMediaPackages(pageSize, pageOffset).collect(Collectors.toList());
         page.forEach(tuple -> {
           try {
             MediaPackage mediaPackage = tuple.getA();
@@ -428,8 +428,8 @@ public final class SearchServiceIndex extends AbstractIndexProducer implements I
             throw new RuntimeException("Internal Index Rebuild Failure", e);
           }
         });
-        offset += pageSize;
-      } while (offset <= total);
+        pageOffset += 1;
+      } while (pageOffset * pageSize <= total);
       //NB: Catching RuntimeException since it can be thrown inside the functional forEach here
     } catch (SearchServiceDatabaseException | RuntimeException e) {
       logIndexRebuildError(logger, "search", e);

You should be able to just git apply this.

lkiesow avatar Apr 24 '24 20:04 lkiesow

The log message for an index rebuild update you get after an event has been processed. But the first event says that 0 events have been processed and the whole thing ends at 99%:

142371 2024-04-24T21:48:01,944 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 0/2000 finished, 0% complete.
...
144409 2024-04-24T21:56:14,074 | INFO  | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 1999/2000 finished, 99% complete.

This is easily fixed by setting the initial value of current to 1:

-      AtomicInteger current = new AtomicInteger(0);
+      AtomicInteger current = new AtomicInteger(1);

lkiesow avatar Apr 24 '24 20:04 lkiesow

You can use OpenSearch or Elasticsearch, but the logging still says Updating the Elasticsearch index from service 'Search'. Since we now only have one type of search index, maybe we can just change that to Updating the search index from service 'Search'?

lkiesow avatar Apr 24 '24 20:04 lkiesow

Patches applied, still building the dataset to properly test a reindex at scale. You'll probably get to this first, so let me know!

gregorydlogan avatar Apr 29 '24 22:04 gregorydlogan

This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.

github-actions[bot] avatar May 01 '24 17:05 github-actions[bot]

I just realized that POST /index/rebuild rebuilds the search service index, but POST /index/clear does not clear the search service index. I think there needs to be a clear distinction between the two different indexes in all those endpoints. We may also need a POST /index/clear/{indexType}

lkiesow avatar May 06 '24 11:05 lkiesow

The docs for POST /resume/{service} is still missing “Search”:

The service to start recreating the index from. The available services are: Themes, Series, Scheduler, AssetManager, Comments and Workflow. All services that come after the specified service in the order above will also run.

“Search” should be after “Workflow” in the order.

lkiesow avatar May 06 '24 12:05 lkiesow

If an event has a isPartOf metadata, but the linked series does not exist, the rebuild process for this event fails. This is probably not the expected behavior. If the series does not exist, I would expect the series to not be linked, but I would still expect a successful rebuild since the publication worked after all.

2024-05-06 14:40:32,674 | INFO  | (AbstractIndexProducer:44) - Starting Search index rebuild with 301 search                                                                                                                                  
2024-05-06 14:40:33,117 | WARN  | (SearchServiceIndex:275) - Could not get series DublinCoreValue(e12d5290-0825-4868-a8e6-1d75d4af9d41,__,None) from series service. Skipping its publication

To test this, just ingest a new event with a non-existing series identifier in isPartOf:

   curl -u admin:opencast http://localhost:8080/ingest/addMediaPackage/fast \
      -F 'flavor=presenter/source' \
      -F mediaUri=https://data.lkiesow.io/opencast/test-media/goat.mp4 \
      -F title="I 🖤 Opencast" \
      -F acl='{"acl": {"ace": [{"allow": true,"role": "ROLE_USER","action": "read"},{"allow": true,"role": "ROLE_USER","action": "dance"}]}}' \
      -F creator="Opencats" \
      -F isPartOf="e12d5290-0825-4868-a8e6-1d75d4af9d41"

lkiesow avatar May 06 '24 12:05 lkiesow

This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.

github-actions[bot] avatar May 06 '24 21:05 github-actions[bot]

Building this now fails with:

[ERROR] /home/lars/dev/opencast/opencast/modules/tobira/src/main/java/org/opencastproject/tobira/impl/Harvest.java:44:8: Unused import - java.util.Objects. [UnusedImports]

lkiesow avatar May 07 '24 08:05 lkiesow

I've tested this PR with extended metadata for series and episode. It works fine. But…

The LTI series tool is broken. Loading the page stuck with the loading… message. The search query works as expected but the frontend needs to be updated for the new structure. Maybe this work can be done separately but needs to be fixed before OC 16.0 release.

wsmirnow avatar May 07 '24 08:05 wsmirnow

@wsmirnow, how did you build this? Just removing the unused import from above results for me in:

[ERROR] /home/lars/dev/opencast/opencast/modules/tobira/src/main/java/org/opencastproject/tobira/impl/Item.java:[160,91] incompatible types: org.opencastproject.security.api.AccessControlList cannot be converted to java.util.List<org.open
castproject.security.api.AccessControlEntry>

lkiesow avatar May 07 '24 08:05 lkiesow

@wsmirnow, how did you build this? Just removing the unused import from above results for me in:

[ERROR] /home/lars/dev/opencast/opencast/modules/tobira/src/main/java/org/opencastproject/tobira/impl/Item.java:[160,91] incompatible types: org.opencastproject.security.api.AccessControlList cannot be converted to java.util.List<org.open
castproject.security.api.AccessControlEntry>

I've pulled Gregs branch into develop branch (plus rebased ETH patches) yesterday at 8:35PM (german time). The last patch from today is missing.

wsmirnow avatar May 07 '24 08:05 wsmirnow

This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.

github-actions[bot] avatar May 07 '24 10:05 github-actions[bot]

I just realized that POST /index/rebuild rebuilds the search service index, but POST /index/clear does not clear the search service index. I think there needs to be a clear distinction between the two different indexes in all those endpoints. We may also need a POST /index/clear/{indexType}

Ugh, missed this. Looked into the code. The issue here is that the SearchServiceIndex class created in this PR creates its index in different way, which is why it (and it alone) does not get cleared. Hooking that up is going to require a bunch of extra code, and would delay things further. It also might be a silly thing to do, especially in light of several institutions showing interest in removing the Index Service entirely, breaking things up into smaller and less-duplicative pieces. IMO this is something that should be tracked as an issue, and either fixed eventually or disappeared when we refactor the Index itself. I'm also not entirely sure how often one would want to clear their index without immediately rebuilding it, but whatever :)

gregorydlogan avatar May 07 '24 17:05 gregorydlogan