language-models
pre-trained Language Models
Hi @piegu! Thanks for processing the DocLayNet dataset into smaller portions. It really helps with fast experimentation! It was especially useful to have the byte stream of the PDFs in...
Hi, I want to ask you about the title. I'm working on a similar problem, using LayoutXLM and my own tokenizer, but I can't understand **seems not to be NE tag.'.format(chunk)) UserWarning:...
Hi Pierre! And thanks for the amazing work you are doing! Could you kindly consider adding the License to the repository? "without a license, the default copyright laws apply, meaning...
Hi @piegu, thank you for creating the DocLayNet datasets (small, base and large). They save a lot of time when fine-tuning a model for a downstream task. I have a question about bounding boxes. I checked...
Hello Pierre Guillou (@piegu) ! Thank you for your work on piegu/language-models. This GitHub project is interesting, and we think that it would be a great addition to make this...
Hello, first of all, thank you very much for your work and for the time you spent training these models. I would like to draw on your notebook lm3-french-classifier-amazon.ipynb to fine-tune a...
Hello, thank you for sharing your code and the various pre-trained models. I downloaded the Wikipedia corpus and the first model in order to run the notebook **lm-french-generation.ipynb**. Is it possible...
I'm working on fine-tuning a LiLT model on a custom dataset where the labels aren't exactly in IOB format. When using seqeval I get an error telling me the tags aren't NER...
Hi Piegu, while looking into training LiLT for sequence classification (document classification on datasets like RVL-CDIP), I have not found any relevant notebook or script, though it has shown...
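For readers hitting the same seqeval message, one possible workaround is to prefix plain class labels with `B-`/`I-` before scoring, since seqeval expects tags in a chunking scheme such as IOB2. The snippet below is only a minimal sketch under stated assumptions: the helper name `to_iob2`, the `"O"` outside label, and the sample labels are hypothetical and not taken from the issue, and it treats consecutive identical labels as a single entity.

```python
from typing import List


def to_iob2(labels: List[str], outside: str = "O") -> List[str]:
    """Prefix plain class labels with B-/I- so they follow the IOB2 scheme.

    Assumption: a run of identical consecutive labels is one entity, so two
    adjacent entities of the same class will be merged into one.
    """
    tags: List[str] = []
    previous = outside
    for label in labels:
        if label == outside:
            tags.append(outside)          # outside tokens stay unprefixed
        elif label == previous:
            tags.append(f"I-{label}")     # continuation of the current entity
        else:
            tags.append(f"B-{label}")     # first token of a new entity
        previous = label
    return tags


# Hypothetical token-level labels without IOB prefixes.
plain = ["HEADER", "HEADER", "O", "QUESTION", "ANSWER", "ANSWER"]
print(to_iob2(plain))
# ['B-HEADER', 'I-HEADER', 'O', 'B-QUESTION', 'B-ANSWER', 'I-ANSWER']
```

The converted lists can then be passed to the metric in place of the raw labels, for both references and predictions. If entity boundaries matter in the dataset, they would need to come from the annotation itself rather than from this run-based heuristic.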
I have my setup as
```
elements = partition_pdf(
    filename=pdf_path,
    strategy="hi_res",
    chunking_strategy="by_title",
    include_orig_elements=True,
    extract_images_in_pdf=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir=str(self.dirs["images"]),  # Save images to disk
    extract_image_block_to_payload=False,  # Ensure base64 is not used
    include_page_breaks=True,
    ...
```