
Optimize API Performance: Remove redundant per-item serialization in `feeds_response`

Open · opbot-xd opened this issue 4 months ago

Description

The feeds_response function in api/views/utils.py currently iterates over the IOC queryset and, for each item, instantiates a FeedsResponseSerializer to validate and serialize the data.

# api/views/utils.py

# ... inside the loop iterating over iocs ...
serializer_item = FeedsResponseSerializer(
    data=data_,
    context={"valid_feed_types": valid_feed_types},
)
serializer_item.is_valid(raise_exception=True)
json_list.append(serializer_item.data)

Problem

When the default feed_size is 5000, this loop creates 5000 individual serializer instances and performs full validation on data that was just constructed internally from our database. This adds significant overhead (CPU and time) to the response generation, making the API slower than necessary.

Proposed Solution

Since we strictly control the data_ dictionary construction immediately before this block, the data is inherently trusted. We should:

  1. Verify the data_ dictionary structure aligns with the API contract.
  2. Append data_ directly to json_list, skipping the serializer instantiation and validation entirely.

opbot-xd avatar Dec 20 '25 08:12 opbot-xd

@regulartim I'd like to work on this optimization. I have analyzed the code and can implement the proposed solution to remove the redundant serialization overhead.

opbot-xd avatar Dec 20 '25 08:12 opbot-xd

I had a similar idea a year ago and I still agree that it might be better to just skip validation by default. However, @mlodic objected to this back then and I don't remember why. I guess we discussed it in the context of #397 and introduced a setting in the env file that allows skipping validation. Do you remember the concerns you had, @mlodic?

regulartim avatar Dec 20 '25 15:12 regulartim

Thanks for the explanation. Since this seems to have some historical context, I was wondering if there’s a discussion channel (Slack or Discord) where this was talked through previously? I’d love to review or ask there if appropriate.

I did check Discord but could only find the IntelOwl channel, not a GreedyBear-specific one, and on Slack I noticed the GreedyBear channel looks quite new, so I wasn’t sure if there’s an older or different place where these discussions happened.

opbot-xd avatar Dec 20 '25 17:12 opbot-xd

I suggest keeping things in GitHub as much as possible for easier tracking, but sure, if you want to have a chat, I think it is better if we use the official Slack channel of GSoC for Honeynet. The channel is new because we never had a GreedyBear GSoCer before (they all worked on IntelOwl), but we'll probably have one next year because GreedyBear is a more active project at the time of writing. I personally don't check Discord much. If @regulartim has another channel to propose for communication, feel free to propose it.

On one hand, I agree that in this context the serializer is not strictly necessary. On the other hand, it is a security best practice to validate not only data coming from external sources but also data from different parts of your internal infrastructure: say this DB has been tampered with, how do you catch it? In more complex projects with external databases and more complex flows, that's definitely something that can't be avoided. Also, what's the real benefit in terms of CPU/resources? Is there any evidence that it is too slow as it is now? If we can extract data proving that sacrificing the extra security measure would greatly improve a degraded performance and make the service more usable and reliable, then yes, I think the change is worth it. Otherwise, security should beat usability/performance. I think we need numbers to be able to sacrifice one for the other.

What do you think? Anyway, I appreciate your proactivity in trying to find solutions.

mlodic avatar Dec 22 '25 05:12 mlodic

Thanks everyone for the detailed discussion and historical context.

After reading the explanations, especially the security considerations raised by @mlodic I agree that in this case keeping the serializer validation is the safer and more appropriate choice. Even though the data is constructed internally, validating it adds an important layer of defense against unexpected or compromised states, and without concrete performance evidence it’s not justified to trade that off.

Given this, I don’t think it makes sense to pursue this optimization further at the moment. I’m happy to close this issue and revisit it in the future only if there is clear data showing a real performance problem.

Thanks again for the insights and for taking the time to explain the reasoning, I learned a lot from the discussion.

opbot-xd avatar Dec 22 '25 06:12 opbot-xd

I suggest keeping things in GitHub as much as possible for easier tracking, but sure, if you want to have a chat, I think it is better if we use the official Slack channel of GSoC for Honeynet. The channel is new because we never had a GreedyBear GSoCer before (they all worked on IntelOwl), but we'll probably have one next year because GreedyBear is a more active project at the time of writing. I personally don't check Discord much. If @regulartim has another channel to propose for communication, feel free to propose it.

I agree: GitHub as much as possible, and Slack for chat. I'm also not very active on Discord.

Also, what's the real benefit in terms of CPU/resources? Is there any evidence that it is too slow as it is now? If we can extract data proving that sacrificing the extra security measure would greatly improve a degraded performance and make the service more usable and reliable, then yes, I think the change is worth it. Otherwise, security should beat usability/performance. I think we need numbers to be able to sacrifice one for the other.

I think I once measured the impact, but that was on an old version of GreedyBear. @opbot-xd, if you have time and feel like it, I would be happy to see a performance evaluation of the validation step over a small set of API calls. Maybe the impact is actually large!

regulartim avatar Dec 22 '25 07:12 regulartim

Hi @regulartim @mlodic, I did the benchmarking for a feed size of 5000. Is my benchmarking method correct?

1. Benchmark Script

The following script was used to evaluate the performance:

import os
import time
import django
from datetime import datetime

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "benchmark_settings")
django.setup()

from api.serializers import FeedsResponseSerializer
from api.enums import Honeypots

def run_benchmark():
    feed_size = 5000
    mock_iocs = []
    
    valid_feed_types = frozenset([Honeypots.LOG4J.value, Honeypots.COWRIE.value, "all"])

    print(f"Generating {feed_size} mock IOCs...")
    now = datetime.now()
    for i in range(feed_size):
        mock_iocs.append({
            "value": f"192.168.1.{i}",
            "first_seen": now,
            "last_seen": now,
            "attack_count": 10,
            "interaction_count": 5,
            "log4j": True,
            "cowrie": False,
            "scanner": True,
            "payload_request": False,
            "ip_reputation": "malicious",
            "asn": 12345,
            "destination_ports": [22, 80],
            "login_attempts": 2,
            "honeypots": ["log4j"],
            "days_seen": 1,
            "recurrence_probability": 0.5,
            "expected_interactions": 1.0,
        })
    
    formatted_iocs = mock_iocs
    
    # --- Baseline: Current Implementation ---
    print("Running baseline benchmark (with serialization)...")
    start_time = time.time()
    
    json_list = []
    
    for ioc in formatted_iocs:
        ioc_feed_type = []
        if ioc[Honeypots.LOG4J.value]:
            ioc_feed_type.append(Honeypots.LOG4J.value)
        if ioc[Honeypots.COWRIE.value]:
            ioc_feed_type.append(Honeypots.COWRIE.value)
        if len(ioc["honeypots"]):
            ioc_feed_type.extend([hp.lower() for hp in ioc["honeypots"] if hp is not None])

        data_ = ioc | {
            "first_seen": ioc["first_seen"].strftime("%Y-%m-%d"),
            "last_seen": ioc["last_seen"].strftime("%Y-%m-%d"),
            "feed_type": ioc_feed_type,
            "destination_port_count": len(ioc["destination_ports"]),
        }

        serializer_item = FeedsResponseSerializer(
            data=data_,
            context={"valid_feed_types": valid_feed_types},
        )
        serializer_item.is_valid(raise_exception=True)
        json_list.append(serializer_item.data)
            
    end_time = time.time()
    baseline_duration = end_time - start_time
    print(f"Baseline duration: {baseline_duration:.4f} seconds")


    # --- Optimization: Proposed Solution ---
    print("Running optimization benchmark (skipping serialization)...")
    start_time = time.time()
    
    json_list_opt = []
    
    for ioc in formatted_iocs:
        ioc_feed_type = []
        if ioc[Honeypots.LOG4J.value]:
            ioc_feed_type.append(Honeypots.LOG4J.value)
        if ioc[Honeypots.COWRIE.value]:
            ioc_feed_type.append(Honeypots.COWRIE.value)
        if len(ioc["honeypots"]):
            ioc_feed_type.extend([hp.lower() for hp in ioc["honeypots"] if hp is not None])

        data_ = ioc | {
            "first_seen": ioc["first_seen"].strftime("%Y-%m-%d"),
            "last_seen": ioc["last_seen"].strftime("%Y-%m-%d"),
            "feed_type": ioc_feed_type,
            "destination_port_count": len(ioc["destination_ports"]),
        }

        json_list_opt.append(data_)
            
    end_time = time.time()
    opt_duration = end_time - start_time
    print(f"Optimization duration: {opt_duration:.4f} seconds")
    
    if opt_duration > 0:
        print(f"Speedup: {baseline_duration / opt_duration:.2f}x")


if __name__ == "__main__":
    run_benchmark()

2. Results

The benchmark was run with feed_size = 5000 (the default maximum feed size).

| Metric | Duration (seconds) |
| --- | --- |
| Baseline (with Serializer) | ~1.8 |
| Optimized (Manual Dict) | ~0.02 |

3. Verification

On the optimized implementation, I verified that the manual dictionary construction strictly matches the fields defined in FeedsResponseSerializer. I also ran the existing test suite (tests.test_views) with the optimization applied; all 78 tests passed, confirming no regression in API behavior.

opbot-xd avatar Dec 22 '25 15:12 opbot-xd

Thanks for your work! I will check your method tomorrow.

regulartim avatar Dec 22 '25 15:12 regulartim

Your method seems solid! Could you please try one more thing: Switch your code sections such that your proposed solution runs BEFORE the current implementation. Then try to reproduce your results.

I just want to make sure that the improvement is not due to some kind of caching happening in the background.
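One generic way to rule out warm-up or caching effects, independent of the script above, is to interleave repeated runs of both variants and compare the best time of each. A small sketch using the standard library's `timeit` (the two lambdas in the demo are placeholders, not GreedyBear code):

```python
import timeit

def bench_interleaved(baseline, optimized, rounds=5, number=3):
    """Time two zero-argument callables in alternating order so that any
    warm-up, caching, or JIT-like effect hits both variants equally.

    Returns the best (minimum) wall time of each callable across rounds;
    the minimum is the least noisy estimate of intrinsic cost.
    """
    base_times, opt_times = [], []
    for _ in range(rounds):
        base_times.append(timeit.timeit(baseline, number=number))
        opt_times.append(timeit.timeit(optimized, number=number))
    return min(base_times), min(opt_times)

if __name__ == "__main__":
    slow = lambda: sum(i * i for i in range(50_000))
    fast = lambda: sum(range(50_000))
    b, o = bench_interleaved(slow, fast)
    print(f"baseline best: {b:.4f}s, optimized best: {o:.4f}s")
```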

regulartim avatar Dec 23 '25 08:12 regulartim

I reopened the issue until we come to a final conclusion regarding the validation.

regulartim avatar Dec 23 '25 08:12 regulartim

1. Benchmark Script

The following script was used to evaluate the performance:

import os
import time
import django
from datetime import datetime

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "benchmark_settings")
django.setup()

from api.serializers import FeedsResponseSerializer
from api.enums import Honeypots

def run_benchmark():
    feed_size = 5000
    mock_iocs = []
    
    valid_feed_types = frozenset([Honeypots.LOG4J.value, Honeypots.COWRIE.value, "all"])

    print(f"Generating {feed_size} mock IOCs...")
    now = datetime.now()
    for i in range(feed_size):
        mock_iocs.append({
            "value": f"192.168.1.{i}",
            "first_seen": now,
            "last_seen": now,
            "attack_count": 10,
            "interaction_count": 5,
            "log4j": True,
            "cowrie": False,
            "scanner": True,
            "payload_request": False,
            "ip_reputation": "malicious",
            "asn": 12345,
            "destination_ports": [22, 80],
            "login_attempts": 2,
            "honeypots": ["log4j"],
            "days_seen": 1,
            "recurrence_probability": 0.5,
            "expected_interactions": 1.0,
        })
    
    formatted_iocs = mock_iocs
    
    # --- Optimization: Proposed Solution ---
    print("Running optimization benchmark (skipping serialization)...")
    start_time = time.time()
    
    json_list_opt = []
    
    for ioc in formatted_iocs:
        ioc_feed_type = []
        if ioc[Honeypots.LOG4J.value]:
            ioc_feed_type.append(Honeypots.LOG4J.value)
        if ioc[Honeypots.COWRIE.value]:
            ioc_feed_type.append(Honeypots.COWRIE.value)
        if len(ioc["honeypots"]):
            ioc_feed_type.extend([hp.lower() for hp in ioc["honeypots"] if hp is not None])

        data_ = ioc | {
            "first_seen": ioc["first_seen"].strftime("%Y-%m-%d"),
            "last_seen": ioc["last_seen"].strftime("%Y-%m-%d"),
            "feed_type": ioc_feed_type,
            "destination_port_count": len(ioc["destination_ports"]),
        }

        json_list_opt.append(data_)
            
    end_time = time.time()
    opt_duration = end_time - start_time
    print(f"Optimization duration: {opt_duration:.4f} seconds")


    # --- Baseline: Current Implementation ---
    print("Running baseline benchmark (with serialization)...")
    start_time = time.time()
    
    json_list = []
    
    for ioc in formatted_iocs:
        ioc_feed_type = []
        if ioc[Honeypots.LOG4J.value]:
            ioc_feed_type.append(Honeypots.LOG4J.value)
        if ioc[Honeypots.COWRIE.value]:
            ioc_feed_type.append(Honeypots.COWRIE.value)
        if len(ioc["honeypots"]):
            ioc_feed_type.extend([hp.lower() for hp in ioc["honeypots"] if hp is not None])

        data_ = ioc | {
            "first_seen": ioc["first_seen"].strftime("%Y-%m-%d"),
            "last_seen": ioc["last_seen"].strftime("%Y-%m-%d"),
            "feed_type": ioc_feed_type,
            "destination_port_count": len(ioc["destination_ports"]),
        }

        serializer_item = FeedsResponseSerializer(
            data=data_,
            context={"valid_feed_types": valid_feed_types},
        )
        serializer_item.is_valid(raise_exception=True)
        json_list.append(serializer_item.data)
            
    end_time = time.time()
    baseline_duration = end_time - start_time
    print(f"Baseline duration: {baseline_duration:.4f} seconds")
    
    if opt_duration > 0:
        print(f"Speedup: {baseline_duration / opt_duration:.2f}x")


if __name__ == "__main__":
    run_benchmark()

Results

New Benchmark (Optimization → Baseline Order)

| Metric | Value |
| --- | --- |
| Optimized (Manual Dict) | ~0.0334 s |
| Baseline (with Serializer) | ~1.7537 s |
| Speedup | ~52.52× |

Previous Benchmark (Baseline → Optimization Order)

| Metric | Value |
| --- | --- |
| Baseline (with Serializer) | ~1.8 s |
| Optimized (Manual Dict) | ~0.02 s |
| Speedup | ~90× |

opbot-xd avatar Dec 23 '25 09:12 opbot-xd

Nice, thank you! I am in favor of omitting the validation, or at least introducing a setting to do so. What do you think @mlodic ?
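The setting-based variant could be sketched like this (the `SKIP_FEED_VALIDATION` flag name and the `append_entry` helper are hypothetical, not the project's actual code; `validate` stands in for the serializer instantiation plus `is_valid()` step):

```python
import os

# Hypothetical env flag: when "true", the trusted dict is appended
# directly; otherwise the validation callable still runs (safe default).
SKIP_FEED_VALIDATION = os.environ.get("SKIP_FEED_VALIDATION", "false").lower() == "true"

def append_entry(json_list, data_, validate):
    """Append one feed entry, validating only when the flag is off."""
    if not SKIP_FEED_VALIDATION:
        validate(data_)
    json_list.append(data_)
```

This keeps validation on by default, so operators opt in to the faster path explicitly.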

regulartim avatar Dec 23 '25 10:12 regulartim

yes I guess it is fine, thanks for testing it @opbot-xd

mlodic avatar Dec 23 '25 17:12 mlodic

Would you like to get assigned to this issue @opbot-xd ?

regulartim avatar Dec 23 '25 18:12 regulartim

Thanks for reaching out! I’m currently wrapping up another PR, which is in draft right now. I’d prefer to keep this issue open for the moment and request assignment once that PR is finalized. Happy to take it up right after that.

opbot-xd avatar Dec 23 '25 20:12 opbot-xd