Can I add a new knowledge base via the API? The web version of the knowledge base is very powerful, but I'd like the knowledge base features to also be exposed as an API so that secondary development is easier.
Routine check
- [ ] I have confirmed that there are currently no similar feature requests
- [x] I have confirmed that I have upgraded to the latest version
- [x] I have read the project README in full and confirmed that the current version cannot meet my needs
- [x] I understand and am willing to follow up on this feature, assist with testing, and provide feedback
- [x] I understand and agree with the above, and I understand that the maintainers' time is limited; feature requests that do not follow the rules may be ignored or closed directly
Feature description
Application scenario
Related examples
Here is an approach that works well for uploading a large number of files programmatically in one pass.
import json
import os

import requests
import tqdm

# File upload endpoint: POST the file with the knowledge base id (dataset id); returns a file id.
upload_file_url = 'http://xxx:8222/api/common/file/upload'
# Collection creation endpoint (one collection corresponds to one source file): pass the dataset id, file name, and file id; returns a collection_id.
create_collection_url = 'http://xxx:8222/api/core/dataset/collection/create'
# Text push endpoint: pass the collection_id and the text; the vector index is built automatically.
push_data_url = 'http://xxx:8222/api/core/dataset/data/pushData'

# Call the three endpoints above in sequence. The example below uses local txt files.
# post_list = ['/data/1.txt', '/data/2.txt', '/data/3.txt', ...]  # list of files to upload
# token = xxx.yyy.zzz  # your own token; find it in the browser under F12 / Application / Storage / Local Storage (format looks like xxx.yyy.zzz)
# dataset_id = '659676e7b99ec5f26f7b5336'  # create the knowledge base manually and copy the id from the FastGPT knowledge base settings page

for txt_path in tqdm.tqdm(post_list):
    try:
        # 1. Upload the file
        with open(txt_path, 'rb') as f:
            upload_response = requests.post(
                upload_file_url,
                data={
                    # note: requests form-encodes nested dicts poorly; depending on the
                    # server, json.dumps({"datasetId": dataset_id}) may be needed here
                    'metadata': {"datasetId": dataset_id},
                    'bucketName': 'dataset'
                },
                files={'file': f},
                headers={'Token': token},
                timeout=30
            )
        # Extract file_id from the response
        upload_resp = json.loads(upload_response.content)
        file_id = upload_resp['data'][0]
        name = os.path.basename(txt_path)

        # 2. Create a collection for this file
        create_collection_resp = requests.post(
            create_collection_url,
            json={
                "datasetId": dataset_id,  # knowledge base id
                "parentId": "",
                "name": name,             # file name; shown as the citation source during chat retrieval
                "type": "file",
                "metadata": {
                    "fileId": file_id     # file id returned by step 1
                }
            },
            headers={'Token': token}
        )
        # Extract collection_id
        create_resp = json.loads(create_collection_resp.content)
        collection_id = create_resp['data']

        ####################################################################
        # Split the txt text manually; default chunk size is 500 characters
        chunk_size = 500
        with open(txt_path, "r", encoding='utf-8') as f:
            txt_lines = f.read()
        line_list = txt_lines.split('\n')
        part_list = [[]]
        chunk_count = 0
        for line in line_list:
            line = line.strip()
            if line != '':
                if chunk_count < chunk_size:
                    part_list[-1].append(line)
                    chunk_count += len(line)
                else:
                    part_list.append([])
                    part_list[-1].append(line)
                    chunk_count = len(line)
        ####################################################################

        # 3. Push each chunk of text into the collection
        for part in part_list:
            part_q = '\n'.join(part)
            push_data_resp = requests.post(
                push_data_url,
                json={
                    "collectionId": collection_id,
                    "mode": "chunk",
                    "data": [{
                        "q": part_q
                    }]
                },
                headers={'Token': token}
            )
            push_resp = json.loads(push_data_resp.content)
            resp_code = push_resp['code']
            # print(resp_code)
    except Exception as e:
        print(e)
        print(txt_path)
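For completeness, one way to build the post_list used above is a simple glob over a local folder. This is just a sketch; the /data path and the .txt pattern are hypothetical placeholders, not part of the script above.

import glob

# Collect all .txt files under /data, including subfolders (adjust to your layout)
post_list = sorted(glob.glob('/data/**/*.txt', recursive=True))
print(f'{len(post_list)} files queued for upload')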
Where are these endpoints documented? The file limits are quite annoying; a standalone script doing exactly this would be handy.
The official documentation only describes the pushData endpoint; the other endpoints were captured from the browser's F12 network panel on the web UI.
It would be great if this were supported directly, haha.
This approach seems to no longer work in the newer versions.
I used this method to upload to the knowledge base but got garbled fragments. Has anyone gotten it to work?
My mistake: plain text files are fine, but PDF files come out garbled.
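That matches the script above, which reads every file as raw UTF-8 text, so binary PDFs end up as garbage. A possible workaround (a sketch, assuming the pypdf package; the function name and paths are illustrative) is to extract the text from each PDF first and feed the extracted text into the same upload loop:

from pypdf import PdfReader

def pdf_to_text(pdf_path):
    # Extract plain text page by page; complex layouts and tables may not survive perfectly.
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or '' for page in reader.pages]
    return '\n'.join(pages)

# Example: write the extracted text to a .txt file and add it to post_list above.
# text = pdf_to_text('/data/report.pdf')
# with open('/data/report.txt', 'w', encoding='utf-8') as f:
#     f.write(text)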
Supported since 4.6.7.