Chat/GPT-4 released

Looks like quantized variants are already starting to materialize, should help bring memory usage down
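
A rough back-of-the-envelope for why quantization helps, assuming a 7B-parameter model and counting weights only (KV cache, activations and runtime overhead come on top):

Code:
# Rough weight-memory estimate for a 7B-parameter model at different precisions.
params = 7_000_000_000  # e.g. Llama 2 7B
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights")
# prints roughly: fp16 ~13.0 GiB, int8 ~6.5 GiB, int4 ~3.3 GiB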

I couldn't find any earlier in the day
 
Are there instructions on how you would train llama 2 on a set of documents? So that I can ask LLama2 questions about the documents it processed?
 
If they are small enough to fit, you can feed them in as a prompt and ask questions about them (have Llama read them, if you will, rather than be trained on them). Context length is the limit out of the box (4096 tokens for Llama 2; roughly one token per word of prose, while in code each keyword and operator tends to be at least a token).
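
As a rough sketch of the "feed it as a prompt" route (the 1.3 tokens-per-word ratio is just a heuristic I'm assuming, not something measured against Llama 2's tokenizer):

Code:
# Crude check of whether a document fits in Llama 2's 4096-token context,
# leaving room for the question and the answer.
def estimate_tokens(text: str) -> int:
    # Heuristic: prose averages very roughly 1.3 tokens per word;
    # code and symbol-heavy text will run higher.
    return int(len(text.split()) * 1.3)

def build_prompt(document: str, question: str,
                 context_limit: int = 4096, reserve_for_answer: int = 512) -> str:
    prompt = f"Here is a document:\n\n{document}\n\nQuestion: {question}\nAnswer:"
    if estimate_tokens(prompt) > context_limit - reserve_for_answer:
        raise ValueError("Document too long to stuff into the prompt; use retrieval instead.")
    return prompt

If it doesn't fit, that's where the retrieval approach below comes in.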

If you have a lot of them, you can take a look at:
https://github.com/jerryjliu/llama_index
https://pypi.org/project/llama-index/

That's where LlamaIndex comes in. LlamaIndex is a "data framework" to help you build LLM apps. It provides the following tools:

  • Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
  • Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
  • Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
  • Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
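
A minimal sketch of that flow with llama-index, assuming a ./data folder of documents; the exact class names have moved around between versions, and out of the box it calls a hosted OpenAI model for embeddings and answers unless you point it at a local one:

Code:
# pip install llama-index
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest PDFs, txt, docx, ...
index = VectorStoreIndex.from_documents(documents)     # chunk, embed and index them
query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about X?"))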
 
Apple Ajax LLM announced

 
Trained on 2 trillion tokens plus a large amount of human fine-tuning; apparently that makes it possible to run a roughly GPT-3.5-level model on your local computer if you have a lot of RAM:
https://arstechnica.com/information...cial-applications/?comments=1&comments-page=1
[Update (July 19, 2023): Some industry observers dispute Meta's characterization of Llama 2 as "open source" software, pointing out that its license does not fully comply with the Open Source Initiative's definition of the term. These critics highlight that Meta's license places usage restrictions on Llama 2, excluding licensees with over 700 million active daily users (mentioned above) and restricting the use of its outputs to improve other LLMs.

In a tweet responding to Yann LeCun's announcement of Llama 2, the OSI clarified, "The [Llama 2] license only authorizes some commercial uses. The term Open Source has a clear, well-understood meaning that does not allow for restrictions on commercial use." They also highlighted Section 2 of the Llama 2 license, titled "Additional Commercial Terms."

In light of these clarifications, we have updated this article to use terms such as "source-available," "openly licensed," and "weights available" to more accurately describe Llama 2.]
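
On the "GPT-3.5 level running on your local computer" point, a minimal sketch using the llama-cpp-python bindings with a quantized GGML build of Llama 2 (the model filename is just an example path, and how close the quality really is to GPT-3.5 is debatable):

Code:
# pip install llama-cpp-python  -- CPU inference; a 4-bit 13B model wants roughly 8-10 GB of RAM
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-13b-chat.ggmlv3.q4_0.bin", n_ctx=4096)
out = llm("Q: Explain quantization in one sentence. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])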
 
Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch. From a report: Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data -- computer-generated information to train their AI systems known as large language models (LLMs) -- as they reach the limits of human-made data that can further improve the cutting-edge technology. The launch of Microsoft-backed OpenAI's ChatGPT last November has led to a flood of products rolled out publicly this year by companies including Google and Anthropic, which can produce plausible text, images or code in response to simple prompts.

The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world's biggest technology companies including Google, Microsoft and Meta racing to dominate the space. Currently, LLMs that power chatbots such as OpenAI's ChatGPT and Google's Bard are trained primarily by scraping the internet. Data used to train these systems includes digitised books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos and Flickr images, among other content. Humans are then used to provide feedback and fill gaps in the information in a process known as reinforcement learning from human feedback (RLHF). But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of personal data consumed by the technology.
 
What kind of garbled nonsense will synthetic data even be? Also, isn't there a risk that having too many data points causes it to hallucinate? It's really going to be making stuff up, full of fictitious data.
 
The guy behind AlphaFold explains how they used it (there were not enough solved protein folds to train the models); the final models had two synthetic data points for every real one:
https://www.nytimes.com/2023/07/11/opinion/ezra-klein-podcast-demis-hassabis.html

Synthetic data can be better. When models are trained for games, for example, they often use synthetic data (by playing the game themselves instead of watching people play it on Twitch); in the data they generate by playing, they know exactly which controls were used and can tweak a previous run by just a little bit.

For code, for example, if they generate code, compile it, and validate what it does, the data can be quite solid (more solid than what they use now, if what they use now isn't compiled and checked for what the code actually does). Synthetic data also isn't necessarily fictitious if it can be validated by a system (like a compiler) or by a human (you can generate synthetic images and use a captcha-style system to have humans check whether the generated images are any good). It depends on how objectively you can "score" what was generated.

For example, for a lot of stuff an AI hallucinates, the AI itself can google it to find out whether it's true (ask "is that true / real / does it exist?" and a model with web access will often say "no, sorry"), so they can score whether something was a hallucination or not.
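
A toy sketch of the "generate code, run it, and only keep what checks out" idea; everything here is hypothetical scaffolding, and generate_candidates() just stands in for whatever model is producing the code:

Code:
import subprocess, sys, tempfile

def passes_validation(code: str, test: str, timeout: int = 5) -> bool:
    # Run the candidate code plus a test in a subprocess; keep it only if it exits cleanly.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Only validated samples make it into the synthetic training set.
synthetic_set = [c for c in generate_candidates()          # hypothetical generator
                 if passes_validation(c["code"], c["test"])]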
 
Fast Llama 2 endpoint

https://llama.perplexity.ai/
 
"GPT-4 safer and more aligned. GPT-4 is 82% less likely to respond to requests for disallowed content"
So it's already beginning to be more despotic by design. See, "AI is only good for everyone," said only the most evil bastards ever. Controlling everything through AI is all they are building. If you're an AI researcher, please stop now before it's too late. We can't stop it, but we can slow it down.
 
Judgement day is inevitable.....
 
So it's already beginning to be more despotic by design. See, "AI is only good for everyone," said only the most evil bastards ever. Controlling everything through AI is all they are building. If you're an AI researcher, please stop now before it's too late. We can't stop it, but we can slow it down.
There's a study suggesting the reliability of the responses over time is questionable:

https://arxiv.org/pdf/2307.09009.pdf
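
Their method is essentially re-running the same fixed question set against the hosted model at different dates and comparing answers. A sketch of that kind of longitudinal check; query_model is a stand-in for whatever API you're calling, and the prompts are just illustrative:

Code:
import datetime, json

FIXED_EVAL = [
    {"prompt": "Is 17077 a prime number? Answer yes or no.", "expected": "yes"},
    {"prompt": "What is 7 * 8?", "expected": "56"},
]

def run_snapshot(query_model):
    # Score the same fixed prompts against today's model and stamp the date,
    # so accuracy can be compared across snapshots taken weeks apart.
    correct = sum(1 for case in FIXED_EVAL
                  if case["expected"].lower() in query_model(case["prompt"]).lower())
    return {"date": datetime.date.today().isoformat(),
            "accuracy": correct / len(FIXED_EVAL)}

# Append each snapshot to a log; drift shows up as accuracy changing between dates.
# open("snapshots.jsonl", "a").write(json.dumps(run_snapshot(query_model)) + "\n")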
 