tiktoken-php
Large memory usage?
Take the following test:
$usage = memory()[1];
$provider = new \Yethee\Tiktoken\EncoderProvider;
$provider->setVocabCache(storage_path('app'));
$encoder = $provider->getForModel('gpt-4o-mini');
dd(memory()[1]-$usage); // 26mb!
26 MB seems a bit much, no? Especially considering the cached vocab is only 3.6 MB.
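The `memory()` helper used in the test is not shown. As a rough equivalent, here is a minimal sketch that measures the same kind of delta with PHP's built-in `memory_get_usage()`. The `memoryDelta()` function is a hypothetical helper, not part of tiktoken-php, and the `range()` call is only a stand-in workload for loading an encoder.

```php
<?php
declare(strict_types=1);

/**
 * Run $factory and report how many bytes stay allocated while its
 * return value is still referenced.
 * Hypothetical helper for illustration; not part of tiktoken-php.
 *
 * @return array{0: mixed, 1: int} the produced value and the byte delta
 */
function memoryDelta(callable $factory): array
{
    gc_collect_cycles();              // drop collectable garbage first
    $before = memory_get_usage();     // bytes currently allocated by the engine
    $value = $factory();
    $after = memory_get_usage();

    return [$value, $after - $before];
}

// Stand-in workload: a list of 100,000 integers.
[$data, $bytes] = memoryDelta(static fn (): array => range(1, 100000));

printf("%d items retained %.1f MB\n", count($data), $bytes / 1048576);
```

The value has to be returned and kept in a variable, otherwise it is freed before the second reading and the delta says nothing about retained memory.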
The token dictionary accounts for most of the allocated memory. The entire dictionary has to stay in memory so that encoding text into tokens, and decoding tokens back into text, is efficient. It is currently stored in PHP's built-in array type, and I don't know of a way to reduce the memory consumed here.
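To illustrate why a small file can expand this much once it is held in an array, here is a minimal sketch that fills a plain PHP array with synthetic string keys and integer values, then compares the raw bytes of key data with what PHP actually allocates. The `measureArrayOverhead()` function and the `t0`, `t1`, ... tokens are made up for illustration; this is not how tiktoken-php loads its vocab, and the exact numbers depend on the PHP version.

```php
<?php
declare(strict_types=1);

/**
 * Build a synthetic token => rank map and measure its memory cost.
 * Hypothetical helper; the tokens are placeholders, not real vocab entries.
 *
 * @return array{entries: int, payload: int, allocated: int}
 */
function measureArrayOverhead(int $n): array
{
    $before = memory_get_usage();

    $vocab = [];
    $payloadBytes = 0;
    for ($rank = 0; $rank < $n; $rank++) {
        // The "t" prefix keeps the key a string; purely numeric strings
        // would be converted to integer keys by PHP.
        $token = 't' . $rank;
        $vocab[$token] = $rank;
        $payloadBytes += strlen($token);
    }

    $after = memory_get_usage();

    return [
        'entries'   => count($vocab),
        'payload'   => $payloadBytes,      // bytes of key data only
        'allocated' => $after - $before,   // bytes PHP allocated for the map
    ];
}

$result = measureArrayOverhead(100000);

printf(
    "%d entries: %.2f MB of key data, %.2f MB allocated\n",
    $result['entries'],
    $result['payload'] / 1048576,
    $result['allocated'] / 1048576
);
```

Every entry carries a hash bucket and a string header on top of the token bytes, so the allocated figure should come out well above the payload.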
Profile
<?php
use Yethee\Tiktoken\EncoderProvider;
require_once 'vendor/autoload.php';
$provider = new EncoderProvider();
$encoder = $provider->get('<encoding>');
The largest memory consumer is Vocab::fromStream().
Encoding: cl100k_base
*** SPX Report ***
Global stats:
Called functions : 81
Distinct functions : 50
Wall time : 161.9ms
ZE memory usage : 11.8MB
Flat profile:
Wall time | ZE memory usage |
Inc. | *Exc. | Inc. | Exc. | Called | Function
----------+----------+----------+----------+----------+----------
70.2ms | 59.0ms | 432.2KB | 418.5KB | 12 | {closure}
42.1ms | 38.2ms | 10.8MB | 8.8MB | 1 | Yethee\Tiktoken\Vocab\Vocab::fromStream
78.5ms | 5.9ms | 839.8KB | 363.7KB | 1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::getLoader
5.0ms | 5.0ms | 120B | 120B | 1 | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
4.0ms | 4.0ms | 2.0MB | 2.0MB | 1 | Yethee\Tiktoken\Vocab\Vocab::__construct
2.4ms | 2.4ms | 43.0KB | 43.0KB | 1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
29.9us | 29.9us | 0B | 0B | 1 | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
42.1ms | 19.4us | 10.8MB | -8.0KB | 1 | Yethee\Tiktoken\Vocab\Vocab::fromFile
15.4us | 15.4us | 424B | 424B | 1 | Composer\Autoload\ClassLoader::initializeIncludeClosure
5.7ms | 11.7us | 592B | 0B | 6 | Composer\Autoload\ClassLoader::findFile
Encoding: o200k_base
*** SPX Report ***
Global stats:
Called functions : 81
Distinct functions : 50
Wall time : 202.1ms
ZE memory usage : 22.7MB
Flat profile:
Wall time | ZE memory usage |
Inc. | *Exc. | Inc. | Exc. | Called | Function
----------+----------+----------+----------+----------+----------
84.6ms | 76.1ms | 21.8MB | 17.8MB | 1 | Yethee\Tiktoken\Vocab\Vocab::fromStream
16.4ms | 14.6ms | 64.9KB | 65.1KB | 6 | 1@Composer\Autoload\{closure}
10.8ms | 10.8ms | 120B | 120B | 1 | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
8.5ms | 8.5ms | 4.0MB | 4.0MB | 1 | Yethee\Tiktoken\Vocab\Vocab::__construct
2.0ms | 2.0ms | 43.0KB | 43.0KB | 1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
31.9us | 31.9us | 0B | 0B | 1 | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
84.7ms | 23.8us | 21.8MB | -8.0KB | 1 | Yethee\Tiktoken\Vocab\Vocab::fromFile
5.5ms | 10.6us | 592B | 0B | 6 | Composer\Autoload\ClassLoader::findFile
6.8us | 6.8us | 48B | 48B | 1 | Yethee\Tiktoken\EncoderProvider::__construct
106.4ms | 6.1us | 21.8MB | 432B | 1 | Yethee\Tiktoken\EncoderProvider::getVocab