tiktoken-php

Large memory usage?

Open · Vimiso opened this issue 1 year ago · 1 comment

Take the given test:

$usage = memory()[1];

$provider = new \Yethee\Tiktoken\EncoderProvider;
$provider->setVocabCache(storage_path('app'));
$encoder = $provider->getForModel('gpt-4o-mini');

dd(memory()[1] - $usage); // 26 MB!

26 MB seems a bit much, no? Especially considering the cached vocab is only 3.6 MB.

Vimiso · Oct 20 '24 17:10

The token dictionary accounts for most of the allocated memory. The entire dictionary has to stay in memory so that encoding text into tokens, and decoding tokens back into text, is efficient. It is currently stored in PHP's built-in array type. I don't see a way to reduce the memory consumed here.
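To illustrate why an array-backed dictionary ends up much larger than the file it was loaded from, here is a rough, self-contained sketch. It does not use the real vocab file or any tiktoken-php class: `buildSyntheticVocab()`, the token format, and the entry count are made up for illustration. It only relies on the fact that a PHP array is an ordered hash table, so each entry carries a bucket and a string header on top of its payload bytes.

```php
<?php

declare(strict_types=1);

/**
 * Build a synthetic token => rank map and estimate its raw payload size.
 * Hypothetical helper for illustration; not part of tiktoken-php.
 *
 * @return array{0: array<string, int>, 1: int} the map and the raw byte count
 */
function buildSyntheticVocab(int $entries): array
{
    $vocab = [];
    $rawBytes = 0;

    for ($rank = 0; $rank < $entries; $rank++) {
        // Short unique string standing in for a token; the 't' prefix
        // keeps PHP from casting the key to an integer.
        $token = 't' . dechex($rank);
        $vocab[$token] = $rank;

        // Rough on-disk cost of one line: token, separator, decimal rank, newline.
        $rawBytes += strlen($token) + 1 + strlen((string) $rank) + 1;
    }

    return [$vocab, $rawBytes];
}

$before = memory_get_usage();
[$vocab, $rawBytes] = buildSyntheticVocab(100000);
$after = memory_get_usage();

// Memory attributable to the array and its key strings.
$arrayBytes = $after - $before;

printf("raw payload : %.1f KB\n", $rawBytes / 1024);
printf("array memory: %.1f KB\n", $arrayBytes / 1024);
printf("overhead    : %.1fx\n", $arrayBytes / $rawBytes);
```

The exact ratio depends on the PHP version and on key lengths, so the printed numbers are not comparable to the SPX figures below. The sketch only shows that per-entry overhead, rather than the payload itself, makes up most of the footprint.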

Profile
<?php

use Yethee\Tiktoken\EncoderProvider;

require_once 'vendor/autoload.php';

$provider = new EncoderProvider();
$encoder = $provider->get('<encoding>');

Largest memory consumer: Vocab::fromStream()

Encoding: cl100k_base

*** SPX Report ***

Global stats:

  Called functions    :       81
  Distinct functions  :       50

  Wall time           :  161.9ms
  ZE memory usage     :   11.8MB

Flat profile:

 Wall time           | ZE memory usage     |
 Inc.     | *Exc.    | Inc.     | Exc.     | Called   | Function
----------+----------+----------+----------+----------+----------
   70.2ms |   59.0ms |  432.2KB |  418.5KB |       12 | {closure}
   42.1ms |   38.2ms |   10.8MB |    8.8MB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromStream
   78.5ms |    5.9ms |  839.8KB |  363.7KB |        1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::getLoader
    5.0ms |    5.0ms |     120B |     120B |        1 | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
    4.0ms |    4.0ms |    2.0MB |    2.0MB |        1 | Yethee\Tiktoken\Vocab\Vocab::__construct
    2.4ms |    2.4ms |   43.0KB |   43.0KB |        1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
   29.9us |   29.9us |       0B |       0B |        1 | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
   42.1ms |   19.4us |   10.8MB |   -8.0KB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromFile
   15.4us |   15.4us |     424B |     424B |        1 | Composer\Autoload\ClassLoader::initializeIncludeClosure
    5.7ms |   11.7us |     592B |       0B |        6 | Composer\Autoload\ClassLoader::findFile

Encoding: o200k_base

*** SPX Report ***

Global stats:

  Called functions    :       81
  Distinct functions  :       50

  Wall time           :  202.1ms
  ZE memory usage     :   22.7MB

Flat profile:

 Wall time           | ZE memory usage     |
 Inc.     | *Exc.    | Inc.     | Exc.     | Called   | Function
----------+----------+----------+----------+----------+----------
   84.6ms |   76.1ms |   21.8MB |   17.8MB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromStream
   16.4ms |   14.6ms |   64.9KB |   65.1KB |        6 | 1@Composer\Autoload\{closure}
   10.8ms |   10.8ms |     120B |     120B |        1 | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
    8.5ms |    8.5ms |    4.0MB |    4.0MB |        1 | Yethee\Tiktoken\Vocab\Vocab::__construct
    2.0ms |    2.0ms |   43.0KB |   43.0KB |        1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
   31.9us |   31.9us |       0B |       0B |        1 | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
   84.7ms |   23.8us |   21.8MB |   -8.0KB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromFile
    5.5ms |   10.6us |     592B |       0B |        6 | Composer\Autoload\ClassLoader::findFile
    6.8us |    6.8us |      48B |      48B |        1 | Yethee\Tiktoken\EncoderProvider::__construct
  106.4ms |    6.1us |   21.8MB |     432B |        1 | Yethee\Tiktoken\EncoderProvider::getVocab

yethee · Nov 10 '24 21:11