I focused mainly on implementing Korean inference on the GPT-3 model, including Korean tokenizing and embedding, and I have secured the know-how to do it.
I am Seokweon Jang, CEO of the solo AI startup Deep Network.
GPT-3 LLM AI One-Person startup Deep Network / sayhi7@daum.net
Even back when GPT-3 was announced in June 2020, you already knew that the fundamental requirement for implementing an LLM is securing a large amount of training data, right? In GPT-3's case, the large majority (close to 90%) of the roughly 500B training tokens was collected and processed through web crawling. That is why significant know-how in web backend design also matters when developing a foundation model like GPT-3.
I lack web crawling skills myself. So if I were to implement RAG search, my plan was a limited search function that targets specific sites such as arXiv and uses the API that arXiv provides to obtain the information RAG needs, namely the metadata of PDF papers.
I know that a major company in Korea has been developing a document parser based on deep learning OCR models for nearly 10 years. For my part, I intend to parse PDF documents directly. There are several open-source libraries for parsing PDF documents, and I also hold key information of my own on how to parse them. Nowadays global companies are focusing on specific inference technology issues as part of LLM commercialization. I believe the core issue among them is parsing PDF documents, and I understand its key issues.
My main focus has been Korean inference on the GPT-3 model, including Korean tokenizing and embedding, and I have secured that know-how. I went through some hardships along the way, but I now understand the key steps and methods for implementing tokenizing and embedding at the morpheme level so that Korean can be applied to the GPT-3 model.
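To make the arXiv metadata lookup mentioned above more concrete, here is a minimal sketch. It assumes the public arXiv query endpoint at export.arxiv.org, which returns an Atom feed. The helper names (build_query_url, fetch_feed, parse_feed) and the SAMPLE_FEED entry are made up for illustration, and this is not my actual implementation.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv query endpoint
ATOM = "{http://www.w3.org/2005/Atom}"           # Atom XML namespace prefix

def build_query_url(terms, max_results=5):
    """Build a query URL that searches all fields for the given terms."""
    params = {"search_query": f"all:{terms}", "start": 0, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)

def fetch_feed(url):
    """Download the raw Atom feed as bytes (network call, not run here)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def parse_feed(xml_data):
    """Extract per-paper metadata from an Atom feed (str or bytes)."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.iter(ATOM + "entry"):
        pdf_url = None
        for link in entry.findall(ATOM + "link"):
            if link.get("title") == "pdf":  # arXiv marks the PDF link this way
                pdf_url = link.get("href")
        papers.append({
            "id": entry.findtext(ATOM + "id", default=""),
            # Collapse line breaks and repeated spaces inside the title.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "authors": [a.findtext(ATOM + "name", default="")
                        for a in entry.findall(ATOM + "author")],
            "pdf_url": pdf_url,
        })
    return papers

# Hypothetical placeholder entry, shaped like an arXiv Atom response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Example Paper
      Title</title>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_query_url("korean tokenization"))
    for paper in parse_feed(SAMPLE_FEED):
        print(paper["title"], paper["pdf_url"])
```

For a live query, pass the result of fetch_feed(build_query_url(...)) to parse_feed. The pdf_url field is what a PDF parsing step would then download.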
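As a rough illustration of what tokenizing and embedding at the morpheme level means, here is a toy sketch. It is not my actual method. It assumes a tiny hand-written lexicon and swaps in a simple greedy longest-match split in place of a real morphological analyzer, and every name in it (LEXICON, segment_word, build_embeddings, and so on) is hypothetical.

```python
import random

# Hypothetical toy lexicon of surface morphemes. A real analyzer would use a
# large dictionary and would also restore contracted forms (for example it
# would analyze "갔" as 가 + 았); this sketch only splits surface strings.
LEXICON = ["나", "는", "학교", "에", "갔", "다"]
UNK = "<unk>"
VOCAB = {tok: idx for idx, tok in enumerate([UNK] + LEXICON)}

def segment_word(word, lexicon):
    """Greedy longest-match split of one space-delimited word (eojeol)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in lexicon:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(UNK)  # no lexicon entry starts here: emit unknown
            i += 1
    return pieces

def tokenize(text, lexicon=frozenset(LEXICON)):
    """Split text on spaces, then split each word into morpheme pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(segment_word(word, lexicon))
    return tokens

def encode(tokens, vocab=VOCAB):
    """Map morpheme tokens to integer ids, with unknown tokens mapped to 0."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

def build_embeddings(vocab_size, dim, seed=0):
    """Randomly initialised embedding table; training would update it."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(ids, table):
    """Look up one embedding vector per token id."""
    return [table[i] for i in ids]

if __name__ == "__main__":
    tokens = tokenize("나는 학교에 갔다")
    ids = encode(tokens)
    vectors = embed(ids, build_embeddings(len(VOCAB), dim=4))
    print(tokens, ids, len(vectors))
```

The point of splitting at the morpheme level is that particles such as 는 and 에 get their own ids and vectors, so the model can share them across many nouns instead of learning every noun plus particle combination separately.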
I also understand the core techniques for training and inference across multiple (tens of) benchmark datasets within the GPT-3 model structure, with each task implemented on its own specific dataset in a multitask setup. There is much more I could tell you, but I will leave it at this. The details are confidential to my solo AI startup Deep Network and cannot be disclosed.