diff --git a/_toc.yml b/_toc.yml index 28d9ea7..d39fe20 100644 --- a/_toc.yml +++ b/_toc.yml @@ -7,6 +7,14 @@ parts: - caption: Meetings chapters: - file: meetings/about + - file: meetings/2024-12-10 + - file: meetings/2024-11-12 + - file: meetings/2024-10-08 + - file: meetings/2024-09-10 + - file: meetings/2024-08-13 + - file: meetings/2024-06-11 + - file: meetings/2024-05-07 + - file: meetings/2024-04-09 - file: meetings/2024-03-12 - file: meetings/2024-02-13 - file: meetings/2024-01-09 diff --git a/meetings/2021-09-14.md b/meetings/2021-09-14.md index 0bb8e15..4645c7a 100644 --- a/meetings/2021-09-14.md +++ b/meetings/2021-09-14.md @@ -4,60 +4,8 @@ title: ai4lam Metadata/Discovery WG Monthly Meeting 9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris -**Connection Information** - - - Topic: AI-LAM Metadata Working Group - - - Time: This is a recurring meeting. Meet anytime - - - Join from PC, Mac, Linux, iOS or Android: [https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09) - - - Password: 306295 - - - Or iPhone one-tap (US Toll): +18333021536,,91421044393# or +16507249799,,91421044393# - - - Or Telephone: - - - Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 (US, Canada, Caribbean Toll Free) - - - - - - Meeting ID: 914 2104 4393 - - - Password: 306295 - - - International numbers available: https://stanford.zoom.us/u/aeoeCDrpd - - - Meeting ID: 914 2104 4393 - - - Password: 306295 - - - SIP: 91421044393@zoomcrc.com - - - Password: 306295 - - - Zoom recording: https://stanford.zoom.us/rec/share/QaghrbGoKoStuPm36LciC8tXv_41vQQWD8ZnfsqMAbPo3mkzjICl02KM8tqjC-5l.AfmJLKlWbkyPksgP - **Attending** - - * Tim Thompson (Yale) * Jeremy Nelson (Stanford) * Erik Radio (CU Boulder) diff --git a/meetings/2021-10-12.md b/meetings/2021-10-12.md index fd63706..9320561 100644 --- a/meetings/2021-10-12.md +++ b/meetings/2021-10-12.md @@ -4,42 +4,6 
@@ title: ai4lam Metadata/Discovery WG Monthly Meeting 9 AM California \| 12 PM Washington DC \| 5 PM UK \| 6 PM Oslo & Paris -<<<<<<< HEAD -======= -**Connection Information** - -Topic: AI-LAM Metadata Working Group - -Time: This is a recurring meeting. Meet anytime - -Join from PC, Mac, Linux, iOS or Android: -[*https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09*](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09) - -Password: 306295 - -Or iPhone one-tap (US Toll): +18333021536,,91421044393# or -+16507249799,,91421044393# - -Or Telephone: - -Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 -(US, Canada, Caribbean Toll Free) - -Meeting ID: 914 2104 4393 - -Password: 306295 - -International numbers available: https://stanford.zoom.us/u/aeoeCDrpd - -Meeting ID: 914 2104 4393 - -Password: 306295 - -SIP: 91421044393\@zoomcrc.com - -Password: 306295 ->>>>>>> 9c0c03e (Adds remaining 2021 meetings) - **Attending** - Jeremy Nelson (Stanford) diff --git a/meetings/2024-04-09.md b/meetings/2024-04-09.md new file mode 100644 index 0000000..770a7fe --- /dev/null +++ b/meetings/2024-04-09.md @@ -0,0 +1,57 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# Apr 9, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + +* Jeremy Nelson (Stanford) +* Andrew Elliot +* Sara Amato +* Joy Panigabutra-Roberts (University of Tennessee) +* Sarah Mann +* Erik Radio (University of Colorado) +* Ian Bogus (ReCAP) +* Craig Rosenbeck + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements + * Next meeting will be on May 7 2024 
for Abigail Potter presentation on AI at LOC +* Joy Panigabutra-Roberts’ presentation on AI authors and performers in the context of identity management. + * Head of Cataloging at the University of Tennessee Libraries + * What about AI Authors and A Robot Comedian? + * Beta Writer and Steffen Pauly, *Lithium-Ion Batteries: A Machine-Generated Summary of Current Research,* 2022\. From Artificial Intelligence in Libraries and Publishing + * Brent Katz, Josh Morgenthau, and Simon Rich, I Am Code: An Artificial Intelligence Speaks. Poems by code-davinci-002. 2023\. Simon Rich. New York, NY: Back Bay Books. + * Jon the Robot (comedian) + * PCC does not consider AIs to be authors + * Consider a named AI or generative computer program used to create a resource to be a related work, not as an agent… + * AstroLLaMA-Chat \- [https://huggingface.co/universeTBD](https://huggingface.co/universeTBD) + * The first open-source conversational AI tool tailored for the astronomy community [https://doi.org/10.48550/arXiv.2401.01916](https://doi.org/10.48550/arXiv.2401.01916) + * Excerpt from YouTube (https://www.youtube.com/watch?v=OkCoTixo-MM) \- A Bot and Costello \- Let's Power the Whole Thing Off + * Philosophical and legal implications are greater than cataloging these works generated by AI + * Questions: + * Amazon-generated books? + * Case in Wikipedia articles \- predatory publisher, declined to catalog. Wikipedia articles have varied quality. + * Comes down to collection development to screen out poor-quality works + * Attribution possible, but include a disclaimer that the work was AI-assisted or AI-generated + * Publishers and authors may not claim attribution if generated by AI. + * How do you screen for these works? Be up front about how you are using generated content + * Not too hard to tell now if a work is AI-generated; aesthetic judgements by the cataloger. What will cataloger training look like in 5 years? 
+ * Poor quality, don’t catalog + * Responsibility of author and publisher to disclose; they need to be up front and transparent. + * Run into this issue in your institutions? + * OCLC RLP Metadata Managers Focus Group will discuss a topic related to AI and cataloging/metadata with both domestic and international institutions next week + * Creating cataloging records for ebooks, converting publisher data; Ex Libris training a model on PDFs to extract publisher information for records + * Another presentation Joy attended recently on a library that experimented with using ChatGPT on OCLC records; once you feed the data to the model, the data is in the training set for OpenAI + diff --git a/meetings/2024-05-07.md b/meetings/2024-05-07.md new file mode 100644 index 0000000..f31bd6e --- /dev/null +++ b/meetings/2024-05-07.md @@ -0,0 +1,174 @@ +title: ai4lam Metadata/Discovery WG May Monthly Meeting + +# May 7, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + +* Name, institution +* Jeremy Nelson, Stanford +* Abigail Potter, Library of Congress +* Caroline Saccucci, Library of Congress +* Julia Kim, Library of Congress +* Erik Radio, Colorado +* Sara Amato, Eastern Academic Scholars’ Trust +* Ian Bogus, ReCAP + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* Presentation “Exploring Computational Description: An update on Library of Congress experiments to automatically create MARC metadata from ebooks” by Abigail Potter and Caroline Saccucci + * Update on FF 2023 presentation + * Cataloging and LC Labs collaborating on this experiment + * LC Labs AI 
Planning Framework + * Understand, Experiment, Implement \- Governance and Policy + * Come up with a quality baseline and then implement; have robust auditing and shared quality standards + * Tools + * Articulating Principles + * Use case Risks & Benefits + * Domain Profiles + * Data Assessment + * Acquisitions + * Goal of Exploring Machine Learning + * First task order in the Digital Innovation in IDIA which is scoped for experiments in AI and ML + * What are examples, benefits, risks, costs, and quality benchmarks + * What technologies and workflow models are most promising to support metadata creation and assist with cataloging workflows + * Similar activities being employed by other organizations + * This is an experiment, not building towards production + * E-Books with Cataloging Records Used to Train Models + * Ground Truth for testing + * CIP 13,802 items + * Open Access 5,825 items + * E-Deposit ebooks 403 items + * Legal reports 3,750 + * Test data + * Existing catalog records for the ebooks + * Test models against test data + * Generate performance reports + * Target data + * Uncataloged ebooks + * Run most models + * What are we testing? 
 + * Token classification \- extracting specific bibliographic metadata from the text, such as Title or Author name + * Text classification \- characterizing the whole of the text, for example into subject headings or genres + * Models: BERT, spaCy, GPTs with variations (NLP, NER, LLM, transformer and non-transformer models); vendor-picked models, 2022 was before ChatGPT + * Results: Token Classification for All Fields + * F1 score for each of the models, ranked from highest (best) to lowest (worst) + * 80% for some fields + * Results: Token Classification for One Field + * Fields + * 700 + * 655 + * 264 + * 245 + * Expectation \~80%; quality standard \~95% + * Exceeded \~80% for some fields (ISBN, Author, Title \- did pretty well in identifying these fields) + * Date \- not easily identified; the vendor's setting for the date parameter, not sure if it is the model. + * LCCN 100%, list of records almost always had the LCCN. Copyright page in the e-book would include the LCCN. + * Early Metrics Analysis: Matches and Non-Matches + * Results after applying Annif + * Green shaded areas are an exact match to MARC XML + * 1-1 match + * White rows are where the ML model wasn’t a match + * Didn’t put in fiction; nothing in the text indicated that it is fiction + * Ability to get a URL with freeform subdivisions. + * The ones that have fiction, established with URL, maybe, no URL + * Combination of positive and negative hits raises questions about relevancy and accuracy \- how well does the model do? + * Assisted Cataloging HITL Prototype + * Vendor met with a group of catalogers for a workshop conversation on benefitting catalogers with this work. + * Not just the name of the author, but the name of the author in the authority field + * Authorized form of name or concept. 
 + * Two low-fidelity prototypes + * Number of tabs for the cataloger to go through + * Abstracts and summaries + * Extractive summary comes directly from the text of the book; the machine picks the main sentences + * Abstractive summary from scanning through the text + * Cataloger selects model suggestions \- an opportunity for the cataloger to respond to what the machine is suggesting, and to see how well the ML workflow suggests Subjects and Names in the authority records. + * Both cases use narrower and broader terms from LCSH, Wikidata, and other linked-data resources + * How well did the cataloger appreciate the suggestions, what is and is not beneficial; provide feedback to the model + * Assessing Quality + * How to assess quality for these tools in different ways + * F1 score, highest performing scores by field + * Humans-in-the-loop prototypes to increase quality of the records. + * Contractor qualitative scoring of the models + * Reliability + * Compute cost + * Training data + * Activity + * Documentation + * Developer + * Compliance with security and privacy considerations + * Overall quality of the service, program evaluation factors + * Likelihood of maintaining quality over time + * Reasonable cost + * Benefits to staff/catalogers + * Benefits to users + * Benefits to organization + * Fairness and equity risks + * Risks to users + * Risks to organization + * Security risks + * Privacy risks + * Authenticity risks + * Reputation risks + * Compliance risks + * Want to collaborate with others + * Challenges + * Unbalanced data \- long tail of subject terms + * Create well-balanced training data to train and test models + * Exploring correcting over-representation of English language & other bias + * Short-term timeline \- several NLP tools have been in development for over 20 years, can’t reach state of the art + * Need to develop quality standards and policies for these approaches + * Stability of tools, unknown costs, tooling lock-in + * What did I learn? 
 + * There are two types of ML classification, text and token + * One of the models, Annif, had some success with text classification predicting subjects + * Some models very successful at token classification, predicting authors, titles, and identifiers, such as ISBN and LCCN + * ML requires lots of training data to improve results + * ½ of the training data contained similar patterns of LCSH + * ½ of the training data contained unique LCSH + * Catalogers reacted more positively to the results than expected + * Cataloger-assisted prototypes were really cool and have potential + * Catalogers interested in ML and seem less afraid of it than expected + * LLMs (ChatGPT) show promise but need more experimentation + * Room for both HITL and LLM + * What do I still want to learn? + * Would faceted subject headings (post-coordinated) be more successful than subject strings à la LCSH (pre-coordinated) in ML processes? + * Which subject categories are more successfully cataloged using ML? + * Could a model be trained to accurately predict LC Classification and/or Dewey Decimal Classification? + * What other metadata elements can be extracted from ebooks? + * Can LLMs like ChatGPT be trained to predict accurate bibliographic descriptions? + * What are ML policies/decisions that the Library needs to make, e.g.: + * Copyright concerns + * Accuracy vs. 
Relevance + * Training data biases + * Next steps + * ECD2: Toward Piloting Computational Description + * Where are the most effective combinations of automation and human intervention in generating high-quality catalog records that will be usable at the LOC + * What are the benefits, risks, and requirements for building a pilot application for ML-assisted cataloging workflows + * ECD3: Extending Experiments to Explore Computational Description + * How can ML methods support the CIP cataloging workflow + * How can CIP metadata generated through ML be ingested and used in BFDB + * How can additional elements added to BF descriptions improve quality and usefulness of the metadata compared to ECD1 and ECD2? + * Experiment with 3 ML models + * Use prepublication galleys, all in PDF, some with minimal text provided + * Created BIBFRAME descriptions that can be loaded to test BFDB + * Require more metadata beyond the 6 fields required in task order 1 + * Allows for cataloger review in the BIBFRAME Editor + * Extension of cataloger assistant prototypes + * Questions: + * Other LLMs? Currently using ChatGPT, maybe Claude; tested Llama, almost as good as ChatGPT 3.5; more permissive with later GPT-4.0 + * Timeline for next steps and BF work? Current task order, working on prototypes; task order ends in August, BF begins in August 2024\. + * Can you expand on faceted subjects as potentially more successful? Extreme text classification breaks strings down, going to use linked data; URLs necessary for BF. There is controversy about whether pre-coordinated vs post-coordinated strings are useful for users. If only one or two patterns of data across the training data, interesting question for policy + * AI metadata clean-up? National Library of Medicine, from Alvin Stockdale: using ChatGPT to format a TOC properly for a MARC record. Give prompts to format the TOC. Running a script to fix things is more automation than ML. Unlike an automated find/replace, ML makes choices about what it thinks it should be. 
Difference in application, ML method. + * Prototype integrated w/FOLIO? Ebook cataloging in a number of ways. Will the same ML method produce the same quality? Getting results; very open to plans for integrating with FOLIO. Not serialized in MARC format, in weird space next string. BF description is going into the BF test instance, then migrate to a FOLIO instance. Not near a pilot phase, but experiments. Learning about requirements and what is possible, creating future systems or creating requirements for future systems. Not dealing with production systems; experiments with prototypes. Fortunate to work with LC Labs and with Cataloging to experiment with ML methods. diff --git a/meetings/2024-06-11.md b/meetings/2024-06-11.md new file mode 100644 index 0000000..8564c2e --- /dev/null +++ b/meetings/2024-06-11.md @@ -0,0 +1,165 @@ +title: ai4lam Metadata/Discovery WG June 2024 Monthly Meeting + +# Jun 11, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + +* Name, institution +* Tim Thompson, Yale +* Gavin Mendel-Gleason, TerminusDB +* Jeremy Nelson, Stanford +* Stephen McConnachie, BFI +* Erik Radio, Colorado +* Sara Amato, Eastern Academic Scholars’ Trust +* Victor Mireles, National University of Mexico +* Kalli Mathios, Stanford +* Joy Panigabutra-Roberts + + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* Presentation by Dr Gavin Mendel-Gleason, who will present on the following topic: What are text embeddings, and how can we use them for AI-assisted information retrieval and data quality? 
 * AI For Document Retrieval and Data Quality [presentation](https://docs.google.com/presentation/d/1UVK4oTkKX4LgDknKJSzCw3wDDo1ZrKwDbHxXGCrU-Pc/edit#slide=id.p) + Using text embeddings to enhance the user experience of data collections + * What is a text embedding? + * LLMs offer a representation of text in a high-dimensional vector space + * These semantic spaces allow meaning to be represented with maths + * King \- man \+ woman \= queen + * Transformer models are the current state of the art for obtaining vector representations + * What can you do with a vector representation? + * Semantic record retrieval + * Improve data quality + * Entity matching + * Anomaly detection \- outliers in the vector space sometimes represent problems in the data + * RAG (Retrieval-Augmented Generation) promises to really help us find resources interactively + * Semantic record retrieval + * The semantic distance between two texts is the distance between their vectors + d(“dogs are the best”, “canines are the greatest”) \~ 0 + * Similar documents are near each other in the space, allowing clustering, matching, etc. d(“QUERY Who was president in 1964”, “ANSWER Lyndon Johnson”) \~ 0 + * Some transformers can also split the space into queries and answers, allowing us to obtain a vector representation from the question which is near the answer, rather than the question. d(“ANSWER + * Data Quality + * Matching with LLMs is flexible \- text can work across languages and orthographies, with spelling mistakes and without normalisation. 
This is very helpful in entity recognition tasks: + * d(“Jim”, “James”) \~ 0 + * d(“Khrushchev”, “Chruscthschow”) \~ 0 + * The strategy of “embed and cluster” can help to find duplicate records + * It also provides a strategy for controlling record matching at scale (1 billion x 1 billion \= 1 quintillion) \- it’s much faster just to search the neighbors (low-distance vectors) of each record + * Anomaly detection + * RAG (Retrieval-Augmented Generation) + * Multi-stage process using a chatbot to obtain answers + * Vectorize documents + * Ask a question + * Get the “QUESTION” embedding, and search for neighbors, e.g. “QUESTION What are some novels about the Cold War” + * Extract information about the neighbors from a traditional RDBMS or graph database (title, abstract, author, hyperlink, etc.) + * Use this information to produce a prompt + * “Answer the following question given the relevant documents and their hyperlinks: + * Author: Tom Clancy, Title: The Hunt for Red October + * Author: Tom Clancy, Title: Clear and Present Danger + * Feed the prompt and question to a chatbot + * Chatbot responds with the correct internal document record answer without hallucination (hopefully) + * What do I need to make these things work? + * An LLM tuned for embeddings (MxBai is good for small things, Ada is better for big things) + * A traditional database (graph or RDBMS) + * A vector database (currently HNSW \- Hierarchical Navigable Small World \- variants are the best performing) + * A way to create strings for the embeddings from records (JSON+handlebars templates?) + * Good prompt engineering (Good luck\!) + * Some glue code (python?) 
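The retrieval stage of the RAG flow described above can be sketched in a few lines of Python. The `embed` function here is a toy bag-of-words stand-in for a real embedding model (MxBai or Ada, as mentioned in the talk); the document strings echo the slide's Tom Clancy example, and the function names and prompt template are illustrative, not any particular library's API.

```python
import numpy as np

def tokens(text: str) -> list[str]:
    return [w.strip(".,:?\"").lower() for w in text.split()]

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding over a fixed vocabulary, normalised to
    unit length so the dot product is cosine similarity. A real system
    would call a transformer embedding model here instead."""
    toks = tokens(text)
    v = np.array([float(toks.count(w)) for w in vocab])
    n = np.linalg.norm(v)
    return v / n if n else v

def nearest(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are closest to the query's."""
    vocab = sorted({w for d in docs + [query] for w in tokens(d)})
    q = embed(query, vocab)
    return sorted(docs, key=lambda d: -float(q @ embed(d, vocab)))[:k]

# Vectorize documents, embed the question, search neighbors, build a prompt
docs = [
    "Author: Tom Clancy, Title: The Hunt for Red October",
    "Author: Tom Clancy, Title: Clear and Present Danger",
    "Author: Jane Austen, Title: Pride and Prejudice",
]
hits = nearest("QUESTION novels by Tom Clancy about the Cold War", docs)
prompt = ("Answer the following question given the relevant documents:\n"
          + "\n".join(hits))
```

In production the vector database (HNSW index) replaces the brute-force `sorted` scan, which is exactly the "search the neighbors instead of comparing 1 billion x 1 billion" point made above.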
 +* Question: Vector for graph documents + * Graph embeddings \- Transformers can embed information in a graph + * Easiest to create a query embedding from the query + * Get an answer back from the query, link back to the original elements in the graph + * Disaggregate a book into chapters and paragraphs, query, then bubble up; structure comes after the fact, after the query and vectors +* Question: MxBai + * Large data set with small documents; vector size relatively small, vectorizes faster +* Question: Thoughts about RAG with graph vs. relational databases + * Bias towards graph databases + * Easier to model hierarchical and more complex data +* Artificial Intelligence \- Deep Learning for Language Modelling [presentation](https://docs.google.com/presentation/d/19ihwzIppUGy0L1VedxoSDKbN49NSqO-PoM5KK_xbqgY/edit?pli=1#slide=id.p) + * A simple neural network + Input layer \-\> hidden layer \-\> output layer + * Inputs and weights create a new vector \- “higher order features from previously recognized features”; the last vector is what is provided as output + * Generative model output is fed back through the layers to approximate the input +* How do we vectorise language? + * No single answer, still an open question \- anything I say here is but one approach + * BUT we have amazingly good language models now + * Some questions: + * What is the unit of vectorisation? Word, sentence, paragraph, document + * How do we incorporate context? +* Vectorising words + * Input vector \- “1-hot vector”, exactly 1 element set in a 10k vector with remaining zeros + * Hidden layer \- linear neurons + * Output layer \- softmax classifier + * Probability that the word at a randomly chosen nearby position is “abandon” + * “Ability” +* Semantic context + * CBOW + * Input \-\> Projection \-\> Output + * Skip-gram +* How do we get the right weights? 
 + * Use a loss/cost/objective function for how good the answer is \- many possibilities + * Use a search strategy to alter the weights (gradient descent, for instance) +* Rise of the Transformers + * Sequence data with context was being addressed with recurrent neural networks + * These had problems with keeping good track of context + * Largely superseded by an attention model + * Attention tells us what part of a sequence we should be paying attention to in order to understand the next bit. +* Attention + * I want to go to the store + \\ + * Ich möchte zum Geschäft gehen + * Look forward and backward in time to construct a complex context +* Transformers (training) + * Word IDs “You are welcome” + * Embeddings and Position Encoding + * Encoder-1 \-\> Enc-1 Out \-\> Encoder-2 \-\> en +* The Transformer (training) process is as follows + * The input sequence is converted into Embeddings (with Position Encoding) and fed to the Encoder + * The stack of Encoders processes this and produces an encoded representation of the input sequence. + * The target sequence is prepended +* Transformer (inference) + * Builds up an answer based on the probability of the next word; looks back at information, feeds it back to the encoder, feeds it back into the loop. + * Train with enough data, very good to +* AI is the future of content + * These models allow sophisticated modelling of semantics + * We need to be at the forefront of semantic modelling of content to win + * Some ideas + * A librarian which knows about your content and can converse with you about it + * A library which knows about connections (attention model embeddings?) + * Content summarisation engine + * Automatic schema generation from examples + * Synthetic content generation + * Auto-clustering + * Entity resolution + * Anomaly detection +* Question: Why use an HNSW vector database? + * Vector database; one of the requirements is to be open source, TerminusDB open source graph database. 
 + * Not too many databases that can go to a billion vectors, smooth, no nice CLI + * Main problem is the recall, most important when dealing with these things. + * Product search for “fish and chips” needs 100%, 80% not good enough; high recall at scale \- 99.999% for 1 million documents + * Building own vector database, proud of recall and scale + * Vespa \- billion scale +* Question: Vectors in Postgres? + * Storage great + * Indexing is the issue + * Worth looking at the recall numbers +* Question: Different use cases you worked on? + * Library vs other industry use cases + * Entity recognition, fraud detection, sales, bankruptcy; matching records between records, similar to what is being done here. + * Older traditional matching techniques mixed together with vectors have the best results. + * Phone numbers and dates don’t work very well + * Reasoning by Transformers about time is very bad + * Semantic search + * Experiments with RAG, no big project with RAG; refocusing TerminusDB on very low-touch, easy-to-use RAG with a range of different options. 
 +* Question: Names and disambiguation + * Names and titles \- anything that can be misspelled is a good match diff --git a/meetings/2024-08-13.md b/meetings/2024-08-13.md new file mode 100644 index 0000000..2b5da23 --- /dev/null +++ b/meetings/2024-08-13.md @@ -0,0 +1,30 @@ +title: ai4lam Metadata/Discovery WG August 2024 Monthly Meeting + +# Aug 13, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + +* Name, institution +* Kendra Bouda +* Jeremy Nelson, Stanford +* Erik Radio, Colorado +* Kalli Mathios, Stanford +* Sara Amato, EAST + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* WOLFcon pre-conference workshop [Use cases](https://github.com/folio-labs/edge-ai/wiki/) + * edge-ai \- [https://github.com/folio-labs/edge-ai](https://github.com/folio-labs/edge-ai) + * ai-workflows \- https://github.com/folio-labs/ai-workflows diff --git a/meetings/2024-09-10.md b/meetings/2024-09-10.md new file mode 100644 index 0000000..ffb2352 --- /dev/null +++ b/meetings/2024-09-10.md @@ -0,0 +1,32 @@ +title: ai4lam Metadata/Discovery WG September 2024 Monthly Meeting + +Sep 10, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + +* Name, institution +* Jeremy Nelson, Stanford +* Erik Radio, Colorado +* Kendra Bouda, UW-Madison +* Kalli Mathios, Stanford + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive 
folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* WOLFcon pre-conference workshop update + * [https://sul-dlss-labs.github.io/wolfcon-2024-ai-workshop/](https://sul-dlss-labs.github.io/wolfcon-2024-ai-workshop/) Workshop website +* Ideas for future meetings + * {I’m not going to make today’s meeting but would love some ‘hack-a-thon’ type meetings around metadata \- e.g. ways to clean up diacritics in MARC records, or parse holding statements} + * Hands-on testing of AI technologies + * diff --git a/meetings/2024-10-08.md b/meetings/2024-10-08.md new file mode 100644 index 0000000..74540af --- /dev/null +++ b/meetings/2024-10-08.md @@ -0,0 +1,30 @@ +title: ai4lam Metadata/Discovery WG October 2024 Monthly Meeting + +# Oct 8, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + + +**Attending** + +* Name, institution +* Jeremy Nelson, Stanford +* Radek Svetlik +* Kalli Mathios, Stanford +* Kendra Bouda + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* WOLFcon pre-conference workshop Use Cases Review + [https://github.com/folio-labs/ai-workflows/wiki](https://github.com/folio-labs/ai-workflows/wiki) + diff --git a/meetings/2024-11-12.md b/meetings/2024-11-12.md new file mode 100644 index 0000000..4f070e9 --- /dev/null +++ b/meetings/2024-11-12.md @@ -0,0 +1,34 @@ +title: ai4lam Metadata/Discovery WG November 2024 Monthly Meeting + +# Nov 12, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + +* Name, institution +* Jeremy Nelson, Stanford +* Richard Urban, OCLC Research 
Library Partnership +* Jenn Colt, Cornell +* Kalli Mathios, Stanford +* Ian Bogus, ReCAP + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* Hack-a-thon using AI with MARC records +* Ideas for future hack-a-thons + * Linked Data from a spreadsheet of data; QuickStatements in Wikidata + * Generate QuickStatements + * Create entities in Wikidata + * Create a collection of BIBFRAME data for use in a RAG workflow and provide context for de-duping BIBFRAME Works and Instances + * Use OpenRefine and Jupyter Notebooks to convert spreadsheets into MARC records diff --git a/meetings/2024-12-10.md b/meetings/2024-12-10.md new file mode 100644 index 0000000..eecba10 --- /dev/null +++ b/meetings/2024-12-10.md @@ -0,0 +1,29 @@ +title: ai4lam Metadata/Discovery WG December 2024 Monthly Meeting + +# Dec 10, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + + +**Attending** + +* Name, institution +* Kalli Mathios, Stanford +* Jeremy Nelson, Stanford +* Jenn Colt, Cornell + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) + +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + +## Agenda + +* Announcements +* Linked Data from a spreadsheet of data; QuickStatements in Wikidata + * Generate QuickStatements + * Create entities in Wikidata
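The "generate quick statements" hack-a-thon idea in these agendas could be sketched as follows. The spreadsheet columns, the column-to-property mapping, and the occupation QID are assumptions for illustration; the output follows QuickStatements v1 syntax (a `CREATE` line per new item, then tab-separated `LAST` claims), with P31 = instance of, Q5 = human, P106 = occupation, and `Len` for the English label.

```python
import csv
import io

# Hypothetical column-to-property mapping; a real project would derive
# this from the spreadsheet being modelled.
PROPS = {"occupation": "P106"}

def rows_to_quickstatements(csv_text: str) -> str:
    """Emit QuickStatements v1 commands that create one Wikidata item per
    spreadsheet row, with an English label and P31=Q5 (human)."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out.append("CREATE")                       # new item
        out.append(f'LAST\tLen\t"{row["name"]}"')  # English label
        out.append("LAST\tP31\tQ5")                # instance of: human
        for col, prop in PROPS.items():
            if row.get(col):                       # skip empty cells
                out.append(f"LAST\t{prop}\t{row[col]}")
    return "\n".join(out)

# Example rows; the QIDs in the occupation column are illustrative.
sheet = "name,occupation\nAda Lovelace,Q82594\nGrace Hopper,Q82594\n"
print(rows_to_quickstatements(sheet))
```

The same row-to-statements loop would also be the natural seam for the OpenRefine/Jupyter-to-MARC idea: swap the QuickStatements emitter for a MARC record builder.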