Search results for "evaluation": 1 - 40 of 49

Because the tag search returned only a few matches, title search results are shown.

There are 49 entries about evaluation. Related tags include LLM and 組織マネジメント. Popular entries include 『エンジニア組織30人の壁を超えるための 評価システムとマネジメントのスケール / Scaling evaluation system and management』.
  • エンジニア組織30人の壁を超えるための 評価システムとマネジメントのスケール / Scaling evaluation system and management
    2024 Summer Jinjineer Meetup! "Let's learn together: evaluation systems and their operation in development organizations" https://jinjineer.connpass.com/event/323746/
  • 新米マネージャーの初めての目標設定と評価 / New manager's first goal setting and evaluation
    2024/03/01: EM Yuru Meetup vol. 6 (lightning talks) https://em-yuru-meetup.connpass.com/event/308552/ "A new manager's first goal setting and evaluation," 倉澤 直弘, EM
  • 定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
    CNDS2024 https://event.cloudnativedays.jp/cnds2024/
  • How to turn an LLM that is easy to merely build into something "better": key points for setting up "Pretraining," "Fine-Tuning," and "Evaluation & Analysis" | ログミーBusiness
    What it takes to build a better LLM. Takuya Akiba: So we have managed to do Fine-Tuning as well. This may come as a surprise: you cannot quite get away with zero code, but you can in fact build an LLM while writing almost none. You might think "surely that alone only yields a garbage model," but as long as you avoid doing anything unnecessary, even this much will probably give you a reasonably plausible LLM. So, at the risk of slightly contradicting Professor Suzuki's (Jun Suzuki's) earlier talk, my stance is that merely building an LLM is easier than people think. That was the first half. That said, does doing this get you GPT-4? Of course not; there is a gap. "What is that gap?" is what we consider next. The list is nearly endless, but…
  • Best Practices for LLM Evaluation of RAG Applications
  • The basics of Off-Policy Evaluation, plus an introduction to ZOZOTOWN's large-scale public real-world dataset and package - ZOZO TECH BLOG
    Note: formulas do not render correctly in the AMP view; please use the standard view to check them. On November 7, 2020, the section "How to use Open Bandit Pipeline" was revised to bring the example code up to date with a new package version; see the corresponding release note for details. Future updates to the dataset, package, and papers will be announced on a Google Group, so feel free to follow it. A new section, "Reactions at international conference workshops," has also been added. I am Yuta Saito (齋藤優太) of Tokyo Institute of Technology, working on joint research with ZOZO Research; my research connects the theory and applications of counterfactual machine learning (see my survey article for an overview). This article explains how to evaluate the performance of machine-learning-based decision making offline… (A minimal inverse-propensity-scoring sketch of this idea appears after this list.)
  • GitHub - yahoojapan/JGLUE: JGLUE: Japanese General Language Understanding Evaluation
  • GitHub - Stability-AI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.
  • GitHub - Arize-ai/phoenix: AI Observability & Evaluation
    Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting. It provides: Tracing - Trace your LLM application's runtime using OpenTelemetry-based instrumentation. Evaluation - Leverage LLMs to benchmark your application's performance using response and retrieval evals. Datasets - Create versioned datasets of examples for experimentation, evaluation… (A minimal launch sketch appears after this list.)
  • GitHub - confident-ai/deepeval: The LLM Evaluation Framework
    DeepEval is a simple-to-use, open-source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which uses LLMs and various other NLP models that runs locally… (A minimal test sketch appears after this list.)
  • COVID-19 vaccine efficacy summary | Institute for Health Metrics and Evaluation
  • PMスキル・評価制度を導入し、アウトカムを生み出すプロダクトマネジメント集団へ進化する道のりの共有 / How we introduced the PM skills and evaluation system and evolved into a product management group that produces outcomes
    Slides from a pmconf2021 talk on how Retty grew from an organization devoted solely to project management into one where product management is firmly rooted and outcome-driven development is possible. As one concrete example of these efforts, in order to develop capable product managers we introduced a PM skill…
  • Estimation of total and excess mortality due to COVID-19 | Institute for Health Metrics and Evaluation
    Estimation of total and excess mortality due to COVID-19. Published October 15, 2021. This page was updated on October 15, 2021 to reflect changes in our modeling strategy. View our previous methods published May 13, 2021 here. In our October 15 release, we introduced three major changes. First, we have very substantially updated the data and methods used to estimate excess mortality related to the…
  • Evaluation of science advice during the COVID-19 pandemic in Sweden - Humanities and Social Sciences Communications
    Sweden was well equipped to prevent the pandemic of COVID-19 from becoming serious. Over 280 years of collaboration between political bodies, authorities, and the scientific community had yielded many successes in preventive medicine. Sweden’s population is literate and has a high level of trust in authorities and those in power. During 2020, however, Sweden had ten times higher COVID-19 death rate…
  • GitHub - st-tech/zr-obp: Open Bandit Pipeline: a python library for bandit algorithms and off-policy evaluation
  • GitHub - pfnet-research/japanese-lm-fin-harness: Japanese Language Model Financial Evaluation Harness
  • Top Evaluation Metrics for RAG Failures
    Figure 1: Root Cause Workflows for LLM RAG Applications (flowchart created by author). If you have been experimenting with large language models (LLMs) for search and retrieval tasks, you have likely come across retrieval augmented generation (RAG) as a technique to add relevant contextual information to LLM generated responses. By connecting an LLM to private data, RAG can enable a better response…
  • KDD 2024 LLM Evaluation Tutorial
    Grounding and Evaluation for Large Language Models (Tutorial). With the ongoing rapid adoption of Artificial Intelligence (AI) based systems in high-stakes domains such as financial services, healthcare and life sciences, hiring and human resources, education, societal infrastructure, and national security, it is crucial to develop and deploy the underlying AI models and systems in a responsible manner…
  • Paper introduction: Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems
    Slides from an in-house paper reading group. Mehrotra, Rishabh, et al. "Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfac…
  • Evaluation method of UX “The User Experience Honeycomb” | blog / bookslope
    Evaluating or reviewing a website calls for many different perspectives, and one view, given where the market is heading, is that a "UX" perspective is required. For research firms that have long included the user's point of view among their evaluation methods this is only natural, and in such cases UX evaluation usually means running user tests and having actual participants use the site. That kind of thinking mostly goes into writing user-test scenarios, but when treating UX as an evaluation method, "The User Experience Honeycomb" seems to be the natural foundation. The article "User Experience Design" at Semantic Studios introduces "The User Experience Honeycomb" (rendered here as the "UX honeycomb"), and the elements that make up UX include Useful, Usable…
  • GitHub - FreedomIntelligence/LLMZoo: ⚡LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.⚡
  • Tricks for getting even more mileage out of Windows 10/11 Enterprise Evaluation, ideal for long-term evaluation
    Tricks for getting even more mileage out of Windows 10/11 Enterprise Evaluation, ideal for long-term evaluation (山市良のうぃんどうず日記, no. 277). Windows 10/11 Enterprise comes in an "Evaluation" edition that can be evaluated free of charge for 90 days. Because it lets you test and evaluate the enterprise editions of Windows 10/11 without buying a license, I use it frequently. This article introduces a few tricks for keeping the Evaluation edition usable for as long as possible.
  • Misplaced trust: When trust in science fosters belief in pseudoscience and the benefits of critical evaluation
    At a time when pseudoscience threatens the survival of communities, understanding this vulnerability, and how to reduce it, is paramount. Four preregistered experiments (N = 532, N = 472, N = 605, N = 382) with online U.S. samples introduced false claims concerning a (fictional) virus created as a bioweapon, mirroring conspiracy theories about COVID-19, and carcinogenic effects of GMOs (Genetically…
  • OpenTofu 1.8.0 is out with Early Evaluation, Provider Mocking, and a Coder-Friendly Future | OpenTofu
    July 29, 2024. OpenTofu 1.8.0 is out with Early Evaluation, Provider Mocking, and a Coder-Friendly Future. Since the 1.7 release, the OpenTofu community and core team have been hard at work on much-requested features, making .tf code easier to write, reducing unnecessary boilerplate, improving performance, and more. We are happy to announce the immediate availability of OpenTofu 1.8 with the following…
  • [ML Tech RPT.] Part 11: Learning evaluation metrics for machine learning models (2) - Sansan Tech Blog
    I am Yoshimura, a researcher at DSOC. Our company has internal club-like groups called "Yoiko," and I belong to the tennis club, which meets roughly once a month; lately newly hired members have joined and brought a fresh breeze. Now, continuing from last time, this article again focuses on evaluation metrics for machine learning models (as before, "model" here means a machine learning model). Last time we reviewed the viewpoints and caveats involved in evaluating a model. From this installment onward we look at which evaluation metrics exist for each problem setting and what they mean, starting with binary classification. At the end of the previous article I wrote that multi-class classification and regression would also be covered here, but the volume grew too large, so…
  • The Generative AI Evaluation Company - Galileo
    Evaluate, observe, and protect your GenAI applications. Go beyond ‘vibe checks’ and asking GPT with the first end-to-end GenAI Stack, powered by Evaluation Foundation Models.
  • GitHub - huggingface/evaluation-guidebook: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
  • Windows Server 2022 | Microsoft Evaluation Center
    In addition to your trial experience of Windows Server 2022, you can more easily add and manage languages and Features on Demand with the new Languages and Optional Features ISO. Download this ISO. This ISO is only available on Windows Server 2022 and combines the previously separate Features on Demand and Language Packs ISOs, and can be used as a FOD and Language pack repository. To learn about F…
  • "Microsoft Evaluation Center" outage leaves evaluation software unavailable for download; download links are being provided on a community site
  • Evaluation of Retrieval-Augmented Generation: A Survey
    Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand these…
  • Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation
    Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its huge potential impact in practice, there has been growing research interest in this field. There is, however, no real-world public dataset that enables the evaluation of OPE, making its experimental studies unrealistic and irreproducible. With the goal of…
  • Mandoline: Model Evaluation under Distribution Shift
    Machine learning models are often deployed in different settings than they were trained and validated on, posing a challenge to practitioners who wish to predict how well the deployed model will perform on a target distribution. If an unlabeled sample from the target distribution is available, along with a labeled sample from a possibly different source distribution, standard approaches such as im…
  • Terms of Evaluation
    Terms of Evaluation for HashiCorp Software. Before you download and/or use our enterprise software for evaluation purposes, you will need to agree to a special set of terms (“Agreement”), which will be applicable for your use of the HashiCorp, Inc.’s (“HashiCorp”, “we”, or “us”) enterprise software. PLEASE READ THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR USING THE SOFTWARE. THESE TERMS AND CONDITIONS…
  • Paper introduction: Can ChatGPT solve information extraction tasks? Is information extraction solved by ChatGPT? An analysis of performance, evaluation criteria, robustness and errors
  • GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models.
  • U.S. AI Safety Institute Signs Agreements Regarding AI Safety Research, Testing and Evaluation With Anthropic and OpenAI
    GAITHERSBURG, Md. — Today, the U.S. Artificial Intelligence Safety Institute at the U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) announced agreements that enable formal collaboration on AI safety research, testing and evaluation with both Anthropic and OpenAI. Each company’s Memorandum of Understanding establishes the framework for the U.S. AI Safety Institute…
  • CAE (Continuous Access Evaluation: 継続的アクセス評価)
    Hello, this is Kanamori (金森) from the Azure Identity team. Are you familiar with the feature called CAE (Continuous Access Evaluation)? As of November 2021 the following announcements have gone out, so many readers have probably seen them: information published in the Microsoft 365 admin portal Message Center as MC255540 (Continuous access evaluation on by default), and an email from Microsoft Azure [email protected] with TRACKING ID: 5T93-LTG and the subject "Continuous access evaluation will be enabled in premium Azure…"
  • Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related
    • Students do not learn more from professors with higher student evaluation of teaching (SET) ratings. • Previous meta-analyses of SET/learning correlations in multisection studies are not interpretable. • Re-analyses of previous meta-analyses of multisection studies indicate that SET ratings explain at most 1% of variability in measures of student learning. • New meta-analyses of multisection studies…
  • Amazon Bedrockで行うモデル評価入門 / Introduction to Model Evaluation in Amazon Bedrock
    Lightning-talk slides from Bedrock Claude Night 2 (a JAWS-UG AI/ML branch × Tokyo branch collaboration). https://jawsug-ai.connpass.com/event/319748/
  • Evaluation of Suicides Among US Adolescents During the COVID-19 Pandemic
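For the off-policy evaluation entries above (the ZOZO TECH BLOG post and the Open Bandit Dataset paper), here is a minimal sketch of the core idea: an inverse-propensity-scoring (IPS) estimate of a new policy's value computed from logs collected by a different policy. It uses plain NumPy on toy data; it is not code from the bookmarked articles or from the Open Bandit Pipeline package, and every name and number below is illustrative.

    import numpy as np

    def ips_estimate(rewards, behavior_probs, eval_probs):
        # IPS value estimate: average of (pi_e(a|x) / pi_b(a|x)) * reward over the log.
        weights = eval_probs / behavior_probs
        return float(np.mean(weights * rewards))

    # Toy log from a uniform-random behavior policy over two actions; action 1 pays off
    # with probability 0.6 and action 0 with probability 0.4.
    rng = np.random.default_rng(0)
    n = 100_000
    actions = rng.integers(0, 2, size=n)
    rewards = rng.binomial(1, np.where(actions == 1, 0.6, 0.4))
    behavior_probs = np.full(n, 0.5)
    # Hypothetical evaluation policy that would choose action 1 with probability 0.9.
    eval_probs = np.where(actions == 1, 0.9, 0.1)

    print(ips_estimate(rewards, behavior_probs, eval_probs))  # close to 0.9*0.6 + 0.1*0.4 = 0.58

As those entries describe, the Open Bandit Pipeline packages estimators of this kind (plus more robust ones) together with the public ZOZOTOWN logs, so the same calculation can be run on real data.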
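For the Arize Phoenix entry, the smallest way to try the tool described there is to start its local UI. A sketch, assuming the open-source arize-phoenix package is installed; instrumenting an actual LLM application is a separate step and is omitted here.

    # Start the local Phoenix server and UI; the call prints a local URL to open.
    # Traces sent via OpenTelemetry-based instrumentation then appear there
    # for inspection and evaluation.
    import phoenix as px

    px.launch_app()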
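For the DeepEval entry, a minimal sketch of the Pytest-style usage its description alludes to. It assumes the deepeval package is installed and an LLM judge is configured (by default an OpenAI API key); the file name, strings, and threshold are illustrative rather than taken from the project's documentation.

    # test_llm_outputs.py: run with `pytest` or `deepeval test run test_llm_outputs.py`.
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    def test_answer_relevancy():
        # LLM-as-judge metric: scores how relevant the actual output is to the input.
        metric = AnswerRelevancyMetric(threshold=0.7)
        test_case = LLMTestCase(
            input="What is DeepEval used for?",
            actual_output="It unit-tests LLM outputs with metrics such as answer relevancy.",
        )
        # Fails the test, like a normal assert, if the metric score falls below the threshold.
        assert_test(test_case, [metric])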
