Replace Solr Search with OpenSearch
This PR removes Solr (and all of its issues) and replaces it with Opensearch/Elasticsearch (yay, new issues!), unifying our indexing tools.
To accomplish this:
- We had to remove the Feeds
- We replaced Solr with Opensearch
- Notably, I did not include URL signing support with this. If this is a deal breaker we can work on it later.
- We updated Tobira's integration
- We updated our Paella glue
- We fixed other integrations with Search
I've hit a few questions as I've reviewed this internally a bunch, so I'm going to copy those questions up here as an initial code review.
If you have a real system with lots of workflows and series, and want to give this a spin, know that this does not need anything but the database, and the dublincore metadata files. You should be able to pull those out of production deployments and then test this in entirely separate systems.
Todo list:
- ~~Release notes note indicating that a reindex of the search service is required.~~
Known, unfixable issues:
- Paella tests rely on `develop.opencast.org` to serve the `search/*` endpoints in the format expected by this PR. This obviously will not happen until after this PR is merged. Paella tests can be verified locally by setting up Opencast to run on port 80, then `npm run dev` and `npx playwright test` after the npm script finishes booting.
Closes #3921
Your pull request should…
- [x] have a concise title
- [ ] ~close an accompanying issue if one exists~
- [x] be against the correct branch
- [ ] include migration scripts and documentation, if appropriate
- [x] pass automated tests
- [x] have a clean commit history
- [x] have proper commit messages (title and body) for all commits
Aware of the broken Paella 7 tests; didn't catch that there were additional tests not run by `mvn`. Will address (hopefully) tomorrow, but this should not block other testing.
This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.
When deleting an event or series, they remain in the search index, which is afaik intended. However, `GET /search/episode.json` does not return deleted events, while `GET /search/series.json` does return deleted series. This may be related to the `deleted` attribute of a series in the search index never being set (not when the series is deleted, nor when the event it was attached to is deleted). Seems inconsistent to me, is this intended?
This should be fixed. Or at least it looks like it's working for me?
> When deleting an event or series, they remain in the search index, which is afaik intended. However, `GET /search/episode.json` does not return deleted events, while `GET /search/series.json` does return deleted series. This may be related to the `deleted` attribute of a series in the search index never being set (not when the series is deleted, nor when the event it was attached to is deleted). Seems inconsistent to me, is this intended?
>
> This should be fixed. Or at least it looks like it's working for me?
Not working for me. The `deleted` property of a deleted series is still null. Here are some screenshots of my local Opencast after deleting a series and all attached events:
Gah. It didn't occur to me to test deleting the series without removing the recordings - I always went the other way 'round. I'm surprised this is allowed. I can reproduce this one, thankfully?
This may take a day or two to chase down. It looks like it should be working via a goat trail through at least 5 different classes, but clearly it's not.
> Gah. It didn't occur to me to test deleting the series without removing the recordings - I always went the other way 'round. I'm surprised this is allowed. I can reproduce this one, thankfully?
>
> This may take a day or two to chase down. It looks like it should be working via a goat trail through at least 5 different classes, but clearly it's not.
If I remember correctly, the search service publishes series when an event with a series is being published and uses information from that event only. This means that deleting a series from the series service should have no effect unless the events are updated and republished.
That may explain the described behavior?
That's exactly why it was happening. Well, that and the event handlers weren't hooked up quite right so there was no way to pass the message through :)
Removing a series without the events is unfortunately expensive, but since the rest of the system supports it we have to support it here.
This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.
Search is missing from the REST docs:
> `POST /rebuild/{service}`
> service: The service to recreate index from. The available services are: Themes, Series, Scheduler, AssetManager, Comments and Workflow. The service order (see above) is very important! Make sure, you do not run index rebuild for more than one service at a time!
It should be one of the options.
The re-index doesn't work:
```
2024-04-23 22:23:11,145 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 44/2000 finished, 2% complete.
2024-04-23 22:23:11,412 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 45/2000 finished, 2% complete.
2024-04-23 22:23:11,670 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 46/2000 finished, 2% complete.
2024-04-23 22:23:11,928 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 47/2000 finished, 2% complete.
2024-04-23 22:23:12,152 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 48/2000 finished, 2% complete.
2024-04-23 22:23:12,386 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 49/2000 finished, 2% complete.
2024-04-23 22:23:12,511 | INFO | (IndexRebuildService:214) - Finished to rebuild the Elasticsearch index from service 'Search'
```
It shouldn't stop at 2%
Sorry, @gregorydlogan, but this is still broken. Luckily, I think that I know why. But first, running the newly patched version will still result in:
```
2024-04-24 21:08:04,257 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 47/2000 finished, 2% complete.
2024-04-24 21:08:04,498 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 48/2000 finished, 2% complete.
2024-04-24 21:08:04,759 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 49/2000 finished, 2% complete.
2024-04-24 21:08:04,894 | INFO | (IndexRebuildService:214) - Finished to rebuild the Elasticsearch index from service 'Search'
```
After adding a bit of logging, you can see why:
```diff
   offset += pageSize;
+  logger.info("offset: {}, total: {}", offset, total);
 } while (offset <= total);
```
```
2024-04-24 21:17:56,220 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 47/2000 finished, 2% complete.
2024-04-24 21:17:56,481 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 48/2000 finished, 2% complete.
2024-04-24 21:17:56,745 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 49/2000 finished, 2% complete.
2024-04-24 21:17:56,899 | INFO | (SearchServiceIndex:432) - offset: 50, total: 2000
2024-04-24 21:17:56,901 | INFO | (SearchServiceIndex:432) - offset: 100, total: 2000
2024-04-24 21:17:56,901 | INFO | (SearchServiceIndex:432) - offset: 150, total: 2000
```
So, we don't get any new data after the first round.
Looking at `getAllMediaPackages`, it becomes immediately clear why:

```java
public Stream<Tuple<MediaPackage, String>> getAllMediaPackages(int pagesize, int offset)
        throws SearchServiceDatabaseException {
  List<SearchEntity> searchEntities;
  try {
    int firstResult = pagesize * offset;
    searchEntities = db.exec(namedQuery.findSome("Search.findAll", firstResult, pagesize, SearchEntity.class));
```
The second parameter of this method is called `offset`, and that's what we pass to this function. However, this method passes `pagesize * offset` to the database instead. That means the parameter should be called `int page`, and we should pass the page to this method, starting with 0 and increasing by 1 every time we want the next page. Alternatively, removing the `int firstResult = pagesize * offset;` line should probably also fix the issue.
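To make the mismatch concrete, here is a standalone sketch (names and numbers are mine, not taken from the codebase) that mimics the described semantics: the caller advances `offset` by `pageSize`, but the fetch method multiplies again, so every page after the first starts far beyond the data.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PagingDemo {
  static final int TOTAL = 200; // pretend the database holds 200 rows

  // Mirrors getAllMediaPackages: treats its second argument as a PAGE number
  static List<Integer> fetch(int pageSize, int offset) {
    int firstResult = pageSize * offset; // the double multiplication
    return IntStream.range(firstResult, Math.min(firstResult + pageSize, TOTAL))
        .boxed().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    int pageSize = 50;

    // Caller loop as written in the PR: offset += pageSize
    int offset = 0, seenBuggy = 0;
    do {
      seenBuggy += fetch(pageSize, offset).size();
      offset += pageSize; // offset becomes 50 -> firstResult = 2500: past the data
    } while (offset <= TOTAL);
    System.out.println("buggy loop fetched " + seenBuggy);  // only the first 50 rows

    // Fixed loop: advance by whole pages instead
    int page = 0, seenFixed = 0;
    List<Integer> batch;
    do {
      batch = fetch(pageSize, page);
      seenFixed += batch.size();
      page += 1;
    } while (!batch.isEmpty());
    System.out.println("fixed loop fetched " + seenFixed);  // all 200 rows
  }
}
```

This is why the log above shows `offset: 50, total: 2000` returning nothing: `firstResult` jumps to 2500 on the second round.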
This patch should be fully functional (I tried the rebuild using it):
```diff
diff --git a/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java b/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java
index 8ed05ae2a4..987c6ce325 100644
--- a/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java
+++ b/modules/search-service-impl/src/main/java/org/opencastproject/search/impl/SearchServiceIndex.java
@@ -400,13 +400,13 @@ public final class SearchServiceIndex extends AbstractIndexProducer implements I
     try {
       int total = persistence.countMediaPackages();
       int pageSize = 50;
-      int offset = 0;
+      int pageOffset = 0;
       AtomicInteger current = new AtomicInteger(0);
       logIndexRebuildBegin(logger, esIndex.getIndexName(), total, "search");
       List<Tuple<MediaPackage, String>> page = null;
       do {
-        page = persistence.getAllMediaPackages(pageSize, offset).collect(Collectors.toList());
+        page = persistence.getAllMediaPackages(pageSize, pageOffset).collect(Collectors.toList());
         page.forEach(tuple -> {
           try {
             MediaPackage mediaPackage = tuple.getA();
@@ -428,8 +428,8 @@ public final class SearchServiceIndex extends AbstractIndexProducer implements I
             throw new RuntimeException("Internal Index Rebuild Failure", e);
           }
         });
-        offset += pageSize;
-      } while (offset <= total);
+        pageOffset += 1;
+      } while (pageOffset * pageSize <= total);
       //NB: Catching RuntimeException since it can be thrown inside the functional forEach here
     } catch (SearchServiceDatabaseException | RuntimeException e) {
       logIndexRebuildError(logger, "search", e);
```
You should be able to just `git apply` this.
The log message for an index rebuild update is printed after an event has been processed. But the first message says that 0 events have been processed, and the whole thing ends at 99%:
```
142371 2024-04-24T21:48:01,944 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 0/2000 finished, 0% complete.
...
144409 2024-04-24T21:56:14,074 | INFO | (AbstractIndexProducer:145) - Updating the Elasticsearch index from service 'Search': 1999/2000 finished, 99% complete.
```
This is easily fixed by setting the initial value of `current` to 1:

```diff
-      AtomicInteger current = new AtomicInteger(0);
+      AtomicInteger current = new AtomicInteger(1);
```
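The arithmetic behind the off-by-one is easy to check in isolation. This toy sketch (not the actual `AbstractIndexProducer` code, just integer percentage math as the log output suggests) shows how starting `current` at 0 and logging before incrementing caps the reported range at 99%:

```java
public class ProgressDemo {
  // Integer-truncated percentage, matching the "2% complete" style of the log
  static int percent(int current, int total) {
    return (int) (100.0 * current / total);
  }

  public static void main(String[] args) {
    int total = 2000;
    // current initialised to 0, logged before being advanced:
    // first message reports 0/2000 (0%), last one 1999/2000 (99%)
    System.out.println(percent(0, total) + "% .. " + percent(total - 1, total) + "%");
    // starting at 1 shifts the reported range to 1/2000 .. 2000/2000 (ends at 100%)
    System.out.println(percent(1, total) + "% .. " + percent(total, total) + "%");
  }
}
```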
You can use OpenSearch or Elasticsearch, but the logging still says `Updating the Elasticsearch index from service 'Search'`. Since we now only have one type of search index, maybe we can just change that to `Updating the search index from service 'Search'`?
Summary of what is still missing here:
Patches applied, still building the dataset to properly test a reindex at scale. You'll probably get to this first, so let me know!
This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.
I just realized that `POST /index/rebuild` rebuilds the search service index, but `POST /index/clear` does not clear the search service index. I think there needs to be a clear distinction between the two different indexes in all those endpoints. We may also need a `POST /index/clear/{indexType}`
The docs for `POST /resume/{service}` are still missing “Search”:

> The service to start recreating the index from. The available services are: Themes, Series, Scheduler, AssetManager, Comments and Workflow. All services that come after the specified service in the order above will also run.
“Search” should be after “Workflow” in the order.
If an event has `isPartOf` metadata, but the linked series does not exist, the rebuild process for this event fails. This is probably not the expected behavior. If the series does not exist, I would expect the series to not be linked, but I would still expect a successful rebuild, since the publication worked after all.
```
2024-05-06 14:40:32,674 | INFO | (AbstractIndexProducer:44) - Starting Search index rebuild with 301 search
2024-05-06 14:40:33,117 | WARN | (SearchServiceIndex:275) - Could not get series DublinCoreValue(e12d5290-0825-4868-a8e6-1d75d4af9d41,__,None) from series service. Skipping its publication
```
To test this, just ingest a new event with a non-existing series identifier in `isPartOf`:

```sh
curl -u admin:opencast http://localhost:8080/ingest/addMediaPackage/fast \
  -F 'flavor=presenter/source' \
  -F mediaUri=https://data.lkiesow.io/opencast/test-media/goat.mp4 \
  -F title="I 🖤 Opencast" \
  -F acl='{"acl": {"ace": [{"allow": true,"role": "ROLE_USER","action": "read"},{"allow": true,"role": "ROLE_USER","action": "dance"}]}}' \
  -F creator="Opencats" \
  -F isPartOf="e12d5290-0825-4868-a8e6-1d75d4af9d41"
```
This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.
Building this now fails with:

```
[ERROR] /home/lars/dev/opencast/opencast/modules/tobira/src/main/java/org/opencastproject/tobira/impl/Harvest.java:44:8: Unused import - java.util.Objects. [UnusedImports]
```
I've tested this PR with extended metadata for series and episode. It works fine. But…
The LTI series tool is broken. Loading the page gets stuck with the “loading…” message. The search query works as expected, but the frontend needs to be updated for the new structure. Maybe this work can be done separately, but it needs to be fixed before the OC 16.0 release.
@wsmirnow, how did you build this? Just removing the unused import from above results for me in:

```
[ERROR] /home/lars/dev/opencast/opencast/modules/tobira/src/main/java/org/opencastproject/tobira/impl/Item.java:[160,91] incompatible types: org.opencastproject.security.api.AccessControlList cannot be converted to java.util.List<org.opencastproject.security.api.AccessControlEntry>
```
> @wsmirnow, how did you build this? Just removing the unused import from above results for me in:
>
> `[ERROR] /home/lars/dev/opencast/opencast/modules/tobira/src/main/java/org/opencastproject/tobira/impl/Item.java:[160,91] incompatible types: org.opencastproject.security.api.AccessControlList cannot be converted to java.util.List<org.opencastproject.security.api.AccessControlEntry>`
I pulled Greg's branch into the develop branch (plus rebased ETH patches) yesterday at 8:35 PM (German time). The last patch from today is missing.
This pull request has conflicts ☹ Please resolve those so we can review the pull request. Thanks.
> I just realized that `POST /index/rebuild` rebuilds the search service index, but `POST /index/clear` does not clear the search service index. I think there needs to be a clear distinction between the two different indexes in all those endpoints. We may also need a `POST /index/clear/{indexType}`
Ugh, missed this. Looked into the code. The issue here is that the `SearchServiceIndex` class created in this PR creates its index in a different way, which is why it (and it alone) does not get cleared. Hooking that up is going to require a bunch of extra code, and would delay things further. It also might be a silly thing to do, especially in light of several institutions showing interest in removing the Index Service entirely, breaking things up into smaller and less-duplicative pieces. IMO this is something that should be tracked as an issue, and either fixed eventually or disappeared when we refactor the Index itself. I'm also not entirely sure how often one would want to clear their index without immediately rebuilding it, but whatever :)