diff --git a/getting-started/index.html b/getting-started/index.html
index 503f129..796af47 100644
--- a/getting-started/index.html
+++ b/getting-started/index.html
@@ -117,7 +117,7 @@
Set your OPENAI_API_KEY as an environment variable.
export OPENAI_API_KEY="..."
diff --git a/index.html b/index.html
index 299c869..00a3a98 100644
--- a/index.html
+++ b/index.html
@@ -242,5 +242,5 @@ Keyboard Shortcuts
diff --git a/search/search_index.json b/search/search_index.json
index 6f8553e..c9679a3 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome to TAIL! Automatic, Easy and Realistic tool for LLM Evaluation Getting Started User Guide Features Easy to customize TAIL helps you generate benchmarks on your own documents (Patents, Papers, Financial Reports, anything you are interested in). It allows you to create test examples of any context length and questions at any depth you desire. Realistic and natural Unlike the needle-in-a-haystack test, TAIL generate questions based on infomations from your own document, instead of inserting a piece of new infomation, making the benchmark more realistic and natural. Quality assured TAIL utilizes multiple quality assurance measures, including RAG-based filtering and rigorous quality checks, to eliminate subpar QAs and deliver a high-caliber benchmark. Ready-to-use TAIL integrates an out-of-the-box evaluation module that enables users to easily evaluate commercial LLMs via API calls and open-source LLMs via vLLM on the generated benchmarks.","title":"Home"},{"location":"#welcome-to-tail","text":"Automatic, Easy and Realistic tool for LLM Evaluation Getting Started User Guide","title":"Welcome to TAIL!"},{"location":"about/","text":"About","title":"About"},{"location":"getting-started/","text":"Getting Started Installation Install the package from PyPi: # (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tailtest pip install tailtest Set your OPENAI_API_KEY as an environment variable. export OPENAI_API_KEY=\"...\" For more details, see Installation Guide . Prepare a long document TAIL generates QAs for benchmark generation based on the document users inputs. Users need to prepare the input document in a JSON file, in the format of [{\"text: YOUR_LONG_TEXT}] (YOUR_LONG_TEXT is a long string). We prepare a example input document file in /data/example_input.json , if you don't have time to collect your own document, you can use it to generate benchmarks. Generate your own benchmark The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence locates within the test document. For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide path for your long document and path to save your benchmark, specify document_length and depth ,and then run tail-cli.build to start benchmark generation! Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75 Test LLMs on your benchmark After generation your benchmark, it's time to evaluate LLMs on it. Input the test model's name and path to the saved benchmark, provide document_length and depth you want to test, TAIL will automatically run the evaluation and store visualizations in test_result_save_dir . 
tail-cli.eval --QA_save_path \"/data/QA.json\" --test_model_name \"gpt-4o\" --test_depth_list 25 75 --test_doc_length 8000 32000 --test_result_save_dir /data/result/","title":"Getting Started"},{"location":"getting-started/#getting-started","text":"","title":"Getting Started"},{"location":"getting-started/#installation","text":"Install the package from PyPi: # (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tailtest pip install tailtest Set your OPENAI_API_KEY as an environment variable. export OPENAI_API_KEY=\"...\" For more details, see Installation Guide .","title":"Installation"},{"location":"getting-started/#prepare-a-long-document","text":"TAIL generates QAs for benchmark generation based on the document users inputs. Users need to prepare the input document in a JSON file, in the format of [{\"text: YOUR_LONG_TEXT}] (YOUR_LONG_TEXT is a long string). We prepare a example input document file in /data/example_input.json , if you don't have time to collect your own document, you can use it to generate benchmarks.","title":"Prepare a long document"},{"location":"getting-started/#generate-your-own-benchmark","text":"The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence locates within the test document. For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide path for your long document and path to save your benchmark, specify document_length and depth ,and then run tail-cli.build to start benchmark generation! Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75","title":"Generate your own benchmark"},{"location":"getting-started/#test-llms-on-your-benchmark","text":"After generation your benchmark, it's time to evaluate LLMs on it. Input the test model's name and path to the saved benchmark, provide document_length and depth you want to test, TAIL will automatically run the evaluation and store visualizations in test_result_save_dir . tail-cli.eval --QA_save_path \"/data/QA.json\" --test_model_name \"gpt-4o\" --test_depth_list 25 75 --test_doc_length 8000 32000 --test_result_save_dir /data/result/","title":"Test LLMs on your benchmark"},{"location":"userguide/","text":"User Guide Installation Install the package from PyPi: # (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tailtest pip install tail-test Build from source git clone https://github.com/yale-nlp/TAIL.git pip install -e . Since TAIL relies on vLLM for off-line inference, which only runs on Linux and needs to build from source if you are not using CUDA 11.8 or 12.1. So if you encounter problems with vLLM and only want to use TAIL for benchmark generation or test LLMs using API calls, you can clone the repository and modify it accordingly. Set your OPENAI_API_KEY as an environment variable. If you want to costimize base_url , clone the repository and add it in tail-cli.build and tail-cli.eval . export OPENAI_API_KEY=\"...\" Long-context Document Preparation To build a benchmark, firstly users need to prepare a long-context document as input. 
TAIL allows users to build benchmarks on any given document, such as patents, papers or financial reports. The input text should be constructed to meet the maximum length requirement of the models being evaluated. For instance, if users want to generate a benchmark with 128k tokens to evaluate LLMs, input texts that are at least 128k tokens long are needed. If the texts users have prepared aren\u2019t long enough to meet the above requirement, we suggest combining multiple shorter texts that are similar to each other. For example Users should prepare the input document in JSON file, in the format of [{\"text: YOUR_LONG_TEXT_HERE}] Benchmark Generation usage: tail-cli.build [--raw_document_path ] [--document_length ] [--depth_list ] [--QA_save_path ] [--gen_QA_model_name ] Options: --raw_document_path Path to the document your prepared. --document_length Expected token lengths for benchmarking. Multiple values can be provided. --depth_list Expected depths for your questions. Multiple values can be provided. --QA_save_path Path to save the generated QA dataset. --gen_QA_model_name Model name for generating the QA (default: gpt-4o). The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence locates within the test document. For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide path for your long document and path to save your benchmark, specify document_length and depth ,and then run tail-cli.build to start benchmark generation. Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75 Model Evaluation & Visualization usage: tail-cli.eval [--QA_save_path ] [--test_model_name ] [--test_depth_list ] [--test_doc_length ] [--test_result_save_dir ] Options: --QA_save_path Path to the saved QA dataset. --test_model_name Test model name (default: gpt-4o). --test_depth_list Depths you want to test. Multiple values can be provided (default: 30 70). --test_doc_length Token lengths you want to test. Multiple values can be provided (default: 8000). --test_result_save_dir Path to save the test results and visualizations. TAIL supports evaluation for both commercial LLMs using OpenAI API interface and off-line inference using vLLM. For commericial LLMs that compatibale with OpenAI interface, first set your \u2018OPENAI_API_KEY\u2019 Environment Variable, you can find a guide here and set up your base_url if needed. Then simply pass your model name using --test_model_name \"gpt-4o\" . For open-source LLMs, pass the your model name in command line using its name or local dir. The list of supported models can be found at supported models . e.g. --test_model_name \"meta-llama/Meta-Llama-3.1-70B\" We currently set the temperature to 0.8 and generate 5 responses for each question, then provide the average results. You may customize it in tail_test/test_llm_performance.py . Specific the context lengths and depths your want to test. For example, --test_depth_list 30 80 --test_doc_length 64000 128000 means you want to test the question targetting at depth 30% and 70% of 64k tokens and 128k tokens documents. 
Be aware that you need to first generate QA for this specific depth and context length, otherwise TAIL will raise an error warning you that it can't find this QA in the QA file you provided. Input the test model's name and path to the saved benchmark, provide document_length and depth you want to test, TAIL will automatically run the evaluation and save results in JSON format in test_result_save_dir . Results visualization TAIL will automatically visualize the results after the evaluation is done. A line plot and a heatmap showing model's performance across context lengths and depths will be stored in the result_save_dir users provide. Here are example visuilaztions for GLM-4-9B-128K:","title":"User Guide"},{"location":"userguide/#user-guide","text":"","title":"User Guide"},{"location":"userguide/#installation","text":"","title":"Installation"},{"location":"userguide/#install-the-package-from-pypi","text":"# (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tailtest pip install tail-test","title":"Install the package from PyPi:"},{"location":"userguide/#build-from-source","text":"git clone https://github.com/yale-nlp/TAIL.git pip install -e . Since TAIL relies on vLLM for off-line inference, which only runs on Linux and needs to build from source if you are not using CUDA 11.8 or 12.1. So if you encounter problems with vLLM and only want to use TAIL for benchmark generation or test LLMs using API calls, you can clone the repository and modify it accordingly. Set your OPENAI_API_KEY as an environment variable. If you want to costimize base_url , clone the repository and add it in tail-cli.build and tail-cli.eval . export OPENAI_API_KEY=\"...\"","title":"Build from source"},{"location":"userguide/#long-context-document-preparation","text":"To build a benchmark, firstly users need to prepare a long-context document as input. TAIL allows users to build benchmarks on any given document, such as patents, papers or financial reports. The input text should be constructed to meet the maximum length requirement of the models being evaluated. For instance, if users want to generate a benchmark with 128k tokens to evaluate LLMs, input texts that are at least 128k tokens long are needed. If the texts users have prepared aren\u2019t long enough to meet the above requirement, we suggest combining multiple shorter texts that are similar to each other. For example Users should prepare the input document in JSON file, in the format of [{\"text: YOUR_LONG_TEXT_HERE}]","title":"Long-context Document Preparation"},{"location":"userguide/#benchmark-generation","text":"usage: tail-cli.build [--raw_document_path ] [--document_length ] [--depth_list ] [--QA_save_path ] [--gen_QA_model_name ] Options: --raw_document_path Path to the document your prepared. --document_length Expected token lengths for benchmarking. Multiple values can be provided. --depth_list Expected depths for your questions. Multiple values can be provided. --QA_save_path Path to save the generated QA dataset. --gen_QA_model_name Model name for generating the QA (default: gpt-4o). The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence locates within the test document. 
For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide path for your long document and path to save your benchmark, specify document_length and depth ,and then run tail-cli.build to start benchmark generation. Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75","title":"Benchmark Generation"},{"location":"userguide/#model-evaluation-visualization","text":"usage: tail-cli.eval [--QA_save_path ] [--test_model_name ] [--test_depth_list ] [--test_doc_length ] [--test_result_save_dir ] Options: --QA_save_path Path to the saved QA dataset. --test_model_name Test model name (default: gpt-4o). --test_depth_list Depths you want to test. Multiple values can be provided (default: 30 70). --test_doc_length Token lengths you want to test. Multiple values can be provided (default: 8000). --test_result_save_dir Path to save the test results and visualizations. TAIL supports evaluation for both commercial LLMs using OpenAI API interface and off-line inference using vLLM. For commericial LLMs that compatibale with OpenAI interface, first set your \u2018OPENAI_API_KEY\u2019 Environment Variable, you can find a guide here and set up your base_url if needed. Then simply pass your model name using --test_model_name \"gpt-4o\" . For open-source LLMs, pass the your model name in command line using its name or local dir. The list of supported models can be found at supported models . e.g. --test_model_name \"meta-llama/Meta-Llama-3.1-70B\" We currently set the temperature to 0.8 and generate 5 responses for each question, then provide the average results. You may customize it in tail_test/test_llm_performance.py . Specific the context lengths and depths your want to test. For example, --test_depth_list 30 80 --test_doc_length 64000 128000 means you want to test the question targetting at depth 30% and 70% of 64k tokens and 128k tokens documents. Be aware that you need to first generate QA for this specific depth and context length, otherwise TAIL will raise an error warning you that it can't find this QA in the QA file you provided. Input the test model's name and path to the saved benchmark, provide document_length and depth you want to test, TAIL will automatically run the evaluation and save results in JSON format in test_result_save_dir .","title":"Model Evaluation & Visualization"},{"location":"userguide/#results-visualization","text":"TAIL will automatically visualize the results after the evaluation is done. A line plot and a heatmap showing model's performance across context lengths and depths will be stored in the result_save_dir users provide. Here are example visuilaztions for GLM-4-9B-128K:","title":"Results visualization"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome to TAIL! Automatic, Easy and Realistic tool for LLM Evaluation Getting Started User Guide Features Easy to customize TAIL helps you generate benchmarks on your own documents (Patents, Papers, Financial Reports, anything you are interested in). It allows you to create test examples of any context length and questions at any depth you desire. Realistic and natural Unlike the needle-in-a-haystack test, TAIL generate questions based on infomations from your own document, instead of inserting a piece of new infomation, making the benchmark more realistic and natural. Quality assured TAIL utilizes multiple quality assurance measures, including RAG-based filtering and rigorous quality checks, to eliminate subpar QAs and deliver a high-caliber benchmark. Ready-to-use TAIL integrates an out-of-the-box evaluation module that enables users to easily evaluate commercial LLMs via API calls and open-source LLMs via vLLM on the generated benchmarks.","title":"Home"},{"location":"#welcome-to-tail","text":"Automatic, Easy and Realistic tool for LLM Evaluation Getting Started User Guide","title":"Welcome to TAIL!"},{"location":"about/","text":"About","title":"About"},{"location":"getting-started/","text":"Getting Started Installation Install the package from PyPi: # (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tailtest pip install tail-test Set your OPENAI_API_KEY as an environment variable. export OPENAI_API_KEY=\"...\" For more details, see Installation Guide . Prepare a long document TAIL generates QAs for benchmark generation based on the document users inputs. Users need to prepare the input document in a JSON file, in the format of [{\"text: YOUR_LONG_TEXT}] (YOUR_LONG_TEXT is a long string). We prepare a example input document file in /data/example_input.json , if you don't have time to collect your own document, you can use it to generate benchmarks. Generate your own benchmark The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence locates within the test document. For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide path for your long document and path to save your benchmark, specify document_length and depth ,and then run tail-cli.build to start benchmark generation! Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75 Test LLMs on your benchmark After generation your benchmark, it's time to evaluate LLMs on it. Input the test model's name and path to the saved benchmark, provide document_length and depth you want to test, TAIL will automatically run the evaluation and store visualizations in test_result_save_dir . 
tail-cli.eval --QA_save_path \"/data/QA.json\" --test_model_name \"gpt-4o\" --test_depth_list 25 75 --test_doc_length 8000 32000 --test_result_save_dir /data/result/","title":"Getting Started"},{"location":"getting-started/#getting-started","text":"","title":"Getting Started"},{"location":"getting-started/#installation","text":"Install the package from PyPI: # (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tail-test pip install tail-test Set your OPENAI_API_KEY as an environment variable. export OPENAI_API_KEY=\"...\" For more details, see Installation Guide .","title":"Installation"},{"location":"getting-started/#prepare-a-long-document","text":"TAIL generates QAs for the benchmark based on the document users input. Users need to prepare the input document in a JSON file, in the format of [{\"text\": YOUR_LONG_TEXT}] (YOUR_LONG_TEXT is a long string). We provide an example input document file in /data/example_input.json ; if you don't have time to collect your own document, you can use it to generate benchmarks.","title":"Prepare a long document"},{"location":"getting-started/#generate-your-own-benchmark","text":"The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence is located within the test document. For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide the path to your long document and a path to save your benchmark, specify document_length and depth , and then run tail-cli.build to start benchmark generation! Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75","title":"Generate your own benchmark"},{"location":"getting-started/#test-llms-on-your-benchmark","text":"After generating your benchmark, it's time to evaluate LLMs on it. Input the test model's name and the path to the saved benchmark, provide the document_length and depth you want to test, and TAIL will automatically run the evaluation and store visualizations in test_result_save_dir . tail-cli.eval --QA_save_path \"/data/QA.json\" --test_model_name \"gpt-4o\" --test_depth_list 25 75 --test_doc_length 8000 32000 --test_result_save_dir /data/result/","title":"Test LLMs on your benchmark"},{"location":"userguide/","text":"User Guide Installation Install the package from PyPI: # (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tail-test pip install tail-test Build from source git clone https://github.com/yale-nlp/TAIL.git pip install -e . TAIL relies on vLLM for offline inference, which only runs on Linux and needs to be built from source if you are not using CUDA 11.8 or 12.1. If you encounter problems with vLLM and only want to use TAIL for benchmark generation or to test LLMs using API calls, you can clone the repository and modify it accordingly. Set your OPENAI_API_KEY as an environment variable. If you want to customize base_url , clone the repository and add it in tail-cli.build and tail-cli.eval . export OPENAI_API_KEY=\"...\" Long-context Document Preparation To build a benchmark, users first need to prepare a long-context document as input. 
TAIL allows users to build benchmarks on any given document, such as patents, papers or financial reports. The input text should be constructed to meet the maximum length requirement of the models being evaluated. For instance, if users want to generate a benchmark with 128k tokens to evaluate LLMs, input texts that are at least 128k tokens long are needed. If the texts users have prepared aren\u2019t long enough to meet the above requirement, we suggest combining multiple shorter texts that are similar to each other. Users should prepare the input document in a JSON file, in the format of [{\"text\": YOUR_LONG_TEXT_HERE}] Benchmark Generation usage: tail-cli.build [--raw_document_path ] [--document_length ] [--depth_list ] [--QA_save_path ] [--gen_QA_model_name ] Options: --raw_document_path Path to the document you prepared. --document_length Expected token lengths for benchmarking. Multiple values can be provided. --depth_list Expected depths for your questions. Multiple values can be provided. --QA_save_path Path to save the generated QA dataset. --gen_QA_model_name Model name for generating the QA (default: gpt-4o). The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence is located within the test document. For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide the path to your long document and a path to save your benchmark, specify document_length and depth , and then run tail-cli.build to start benchmark generation. Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75 Model Evaluation & Visualization usage: tail-cli.eval [--QA_save_path ] [--test_model_name ] [--test_depth_list ] [--test_doc_length ] [--test_result_save_dir ] Options: --QA_save_path Path to the saved QA dataset. --test_model_name Test model name (default: gpt-4o). --test_depth_list Depths you want to test. Multiple values can be provided (default: 30 70). --test_doc_length Token lengths you want to test. Multiple values can be provided (default: 8000). --test_result_save_dir Path to save the test results and visualizations. TAIL supports evaluation of both commercial LLMs through the OpenAI API interface and open-source LLMs through offline inference with vLLM. For commercial LLMs that are compatible with the OpenAI interface, first set your \u2018OPENAI_API_KEY\u2019 environment variable (you can find a guide here ) and set up your base_url if needed. Then simply pass your model name using --test_model_name \"gpt-4o\" . For open-source LLMs, pass your model name on the command line using its name or local directory. The list of supported models can be found at supported models . e.g. --test_model_name \"meta-llama/Meta-Llama-3.1-70B\" We currently set the temperature to 0.8 and generate 5 responses for each question, then report the averaged results. You may customize this in tail_test/test_llm_performance.py . Specify the context lengths and depths you want to test. For example, --test_depth_list 30 80 --test_doc_length 64000 128000 means you want to test questions targeting depths of 30% and 80% in 64k-token and 128k-token documents. 
Be aware that you need to first generate QAs for this specific depth and context length, otherwise TAIL will raise an error warning you that it can't find this QA in the QA file you provided. Input the test model's name and the path to the saved benchmark, provide the document_length and depth you want to test, and TAIL will automatically run the evaluation and save results in JSON format in test_result_save_dir . Results visualization TAIL will automatically visualize the results after the evaluation is done. A line plot and a heatmap showing the model's performance across context lengths and depths will be stored in the result_save_dir users provide. Here are example visualizations for GLM-4-9B-128K:","title":"User Guide"},{"location":"userguide/#user-guide","text":"","title":"User Guide"},{"location":"userguide/#installation","text":"","title":"Installation"},{"location":"userguide/#install-the-package-from-pypi","text":"# (Recommended) Create a new conda environment. conda create -n tail python=3.10 -y conda activate tail # Install tail-test pip install tail-test","title":"Install the package from PyPI:"},{"location":"userguide/#build-from-source","text":"git clone https://github.com/yale-nlp/TAIL.git pip install -e . TAIL relies on vLLM for offline inference, which only runs on Linux and needs to be built from source if you are not using CUDA 11.8 or 12.1. If you encounter problems with vLLM and only want to use TAIL for benchmark generation or to test LLMs using API calls, you can clone the repository and modify it accordingly. Set your OPENAI_API_KEY as an environment variable. If you want to customize base_url , clone the repository and add it in tail-cli.build and tail-cli.eval . export OPENAI_API_KEY=\"...\"","title":"Build from source"},{"location":"userguide/#long-context-document-preparation","text":"To build a benchmark, users first need to prepare a long-context document as input. TAIL allows users to build benchmarks on any given document, such as patents, papers or financial reports. The input text should be constructed to meet the maximum length requirement of the models being evaluated. For instance, if users want to generate a benchmark with 128k tokens to evaluate LLMs, input texts that are at least 128k tokens long are needed. If the texts users have prepared aren\u2019t long enough to meet the above requirement, we suggest combining multiple shorter texts that are similar to each other. Users should prepare the input document in a JSON file, in the format of [{\"text\": YOUR_LONG_TEXT_HERE}]","title":"Long-context Document Preparation"},{"location":"userguide/#benchmark-generation","text":"usage: tail-cli.build [--raw_document_path ] [--document_length ] [--depth_list ] [--QA_save_path ] [--gen_QA_model_name ] Options: --raw_document_path Path to the document you prepared. --document_length Expected token lengths for benchmarking. Multiple values can be provided. --depth_list Expected depths for your questions. Multiple values can be provided. --QA_save_path Path to save the generated QA dataset. --gen_QA_model_name Model name for generating the QA (default: gpt-4o). The next step is to set the document_length and depth for your benchmark. document_length means how long the test document in the benchmark will be, while depth indicates how deep a question's evidence is located within the test document. 
For example, setting document_length to 8000 and depth to 50 means generating a QA and test document of 8000 tokens, where the evidence for the question is located around the middle of the test document. Provide the path to your long document and a path to save your benchmark, specify document_length and depth , and then run tail-cli.build to start benchmark generation. Here's an example: tail-cli.build --raw_document_path \"/data/raw.json\" --QA_save_path \"/data/QA.json\" --document_length 8000 32000 64000 --depth_list 25 50 75","title":"Benchmark Generation"},{"location":"userguide/#model-evaluation-visualization","text":"usage: tail-cli.eval [--QA_save_path ] [--test_model_name ] [--test_depth_list ] [--test_doc_length ] [--test_result_save_dir ] Options: --QA_save_path Path to the saved QA dataset. --test_model_name Test model name (default: gpt-4o). --test_depth_list Depths you want to test. Multiple values can be provided (default: 30 70). --test_doc_length Token lengths you want to test. Multiple values can be provided (default: 8000). --test_result_save_dir Path to save the test results and visualizations. TAIL supports evaluation of both commercial LLMs through the OpenAI API interface and open-source LLMs through offline inference with vLLM. For commercial LLMs that are compatible with the OpenAI interface, first set your \u2018OPENAI_API_KEY\u2019 environment variable (you can find a guide here ) and set up your base_url if needed. Then simply pass your model name using --test_model_name \"gpt-4o\" . For open-source LLMs, pass your model name on the command line using its name or local directory. The list of supported models can be found at supported models . e.g. --test_model_name \"meta-llama/Meta-Llama-3.1-70B\" We currently set the temperature to 0.8 and generate 5 responses for each question, then report the averaged results. You may customize this in tail_test/test_llm_performance.py . Specify the context lengths and depths you want to test. For example, --test_depth_list 30 80 --test_doc_length 64000 128000 means you want to test questions targeting depths of 30% and 80% in 64k-token and 128k-token documents. Be aware that you need to first generate QAs for this specific depth and context length, otherwise TAIL will raise an error warning you that it can't find this QA in the QA file you provided. Input the test model's name and the path to the saved benchmark, provide the document_length and depth you want to test, and TAIL will automatically run the evaluation and save results in JSON format in test_result_save_dir .","title":"Model Evaluation & Visualization"},{"location":"userguide/#results-visualization","text":"TAIL will automatically visualize the results after the evaluation is done. A line plot and a heatmap showing the model's performance across context lengths and depths will be stored in the result_save_dir users provide. Here are example visualizations for GLM-4-9B-128K:","title":"Results visualization"}]}
\ No newline at end of file
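
For reference, a minimal sketch in Python of preparing the [{"text": YOUR_LONG_TEXT}] input file that tail-cli.build reads via --raw_document_path, following the documentation's suggestion to combine multiple shorter, related texts. The source file paths are placeholders, and the whitespace word count is only a rough proxy for token length; TAIL's actual tokenizer may count differently.

# Illustrative sketch (placeholder paths): combine several shorter, related
# plain-text files into the single [{"text": ...}] JSON document that
# tail-cli.build expects as --raw_document_path.
import json
from pathlib import Path

source_files = ["report_2021.txt", "report_2022.txt", "report_2023.txt"]  # placeholders
combined = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in source_files)

# Rough length check only; whitespace-separated words are not model tokens.
print(f"approximate length: {len(combined.split())} words")

with open("data/raw.json", "w", encoding="utf-8") as f:
    json.dump([{"text": combined}], f, ensure_ascii=False)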
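
The depth parameter can also be read arithmetically: assuming depth is a percentage of the test document's token length, as the documentation's examples describe, the evidence for a question sits roughly at document_length * depth / 100 tokens into the test document. A small sketch under that assumption:

def approx_evidence_position(document_length: int, depth: int) -> int:
    # Assumes depth is a percentage of the test document's token length.
    return document_length * depth // 100

print(approx_evidence_position(8000, 50))   # ~4000 tokens in, i.e. around the middle
print(approx_evidence_position(64000, 30))  # ~19200 tokens into a 64k-token test document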