I have some basic questions about using Crawl4AI #375
Comments
I also think so. I have run into many problems while using it; the documentation is not clear, and I don't know how to solve them.
@mozou Thanks for trying the library. Could you show me some examples? For instance, describe a task you'd like to do and explain how you found it easier with other libraries.

I want to clarify that I don't compare Crawl4ai with Selenium or Playwright; those serve as wrappers around Chromium, and I find Playwright much faster than Selenium. Crawl4ai generates data suitable for large language models. This data can be structured output or high-quality markdown, which is what has motivated me from the very beginning. I focus on generating markdown quickly while giving developers the ability to intervene at any stage. The library excels at generating markdown efficiently, and I'm also working on making it scalable in the cloud. To make a fair comparison, please share a specific task you find difficult to accomplish with Crawl4ai, and I will create a version for you to compare.

@sz0811 is somewhat right; the documentation currently confuses users due to numerous changes, and it hasn't been updated properly. That's why I've been working heavily on the documentation for the last two weeks, and I will soon update it with more practical examples. I need help and support from community members, because my goal is to make this the best available tool for people who need data extraction for AI applications. I appreciate your support. Please share your case study, and I can create a code snippet for your feedback.
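[For readers following along, the basic markdown-generation flow described above looks roughly like this. A minimal sketch based on the library's documented quickstart; the URL is a placeholder:]

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Crawl a single page and print the LLM-ready markdown it produces.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")  # placeholder URL
        print(result.markdown)

asyncio.run(main())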
@unclecode First of all, thank you for your answer. I asked because the most common problem in crawling is the anti-bot measures on websites, especially shopping and social-media sites: slider CAPTCHAs, or click verification (such as clicking on all pictures containing "cats"). These are very troublesome to deal with. Traditional crawlers usually handle them with OCR or third-party solving services. Crawl4AI, however, integrates large models, so perhaps these barriers can be overcome with the power of AI. I used traditional crawler frameworks before and stumbled upon your very interesting project on GitHub, so I started learning it. I haven't used the new LLM features this time, but I will try them later, because that is the most attractive part of Crawl4AI.
@unclecode My question was based on a comparison with traditional crawler frameworks, so it may not be entirely fair; after all, Crawl4AI is built specifically for AI and LLMs. But running into anti-bot measures is a problem every crawler framework shares. Although Crawl4AI can simulate user behavior to reduce the risk of being blocked, it still struggles on sites with strict anti-bot protection. If Crawl4AI can handle these problems, it may become the greatest crawler framework ever, and even a beginner could become a crawling master. I really hope Crawl4AI becomes more powerful, because it is very interesting. And thank you for spending so much time writing documentation for us beginners.
@mozou Thank you for your kind words. You mentioned something crucial, and I will spend a lot of time on it. Right now, we have something called a managed browser. With the managed browser, you can do everything with Crawl4ai that you can do with your personal browser. I have covered this in multiple GitHub issues and included code examples in the new documentation.

The idea is to launch a browser from your terminal with a command line, assigning it a fresh folder as its user data (profile) directory. This opens a brand-new browser. You then visit all the pages you want to crawl: if you need to log in, you log in; if you need to pass anti-bot gates, you do that as well. Essentially, this becomes your browser and your identity. After closing the browser, you start Crawl4ai and pass it that folder. This time, Crawl4ai opens a browser attached to that profile, and everything you set up is magically there. You can crawl and interact because you are using your own identity, and you are entitled to, since it's your own data and browser.

This is just one of several approaches I have incorporated, and I can say it works with the majority of websites. In the last three to four months I received many requests, and I have already handled many of them, which has made Crawl4ai stand out. I tried to find links to those issues to share with you; they will also be included in the new documentation, so please stay tuned.
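[A minimal sketch of the workflow described above, assuming the use_managed_browser and user_data_dir parameters of BrowserConfig in recent crawl4ai releases; parameter names may differ in older versions, and the profile path and URL are placeholders:]

# Step 1 (outside Python): launch a browser once with a dedicated profile folder,
# log in and clear any anti-bot gates manually, then close it, e.g.:
#   google-chrome --user-data-dir=/path/to/my_profile
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(
        headless=False,                       # visible browser while testing
        use_managed_browser=True,             # attach to a persistent, real profile
        user_data_dir="/path/to/my_profile",  # the same folder used in step 1
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Step 2: crawl with your own logged-in identity.
        result = await crawler.arun(
            url="https://example.com/behind-login",  # placeholder URL
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print(result.markdown)

asyncio.run(main())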
@unclecode Thank you for your answer; I hope you succeed. I will keep learning Crawl4AI, and I hope it becomes even more powerful.
@unclecode Hello and congrats! I tried several features of the tool: page interaction, advanced session management, and the auth crawler strategy via hooks. I noticed that the media extraction and analysis capabilities don't work perfectly in real-world conditions. I mainly mean CSS background images, which media processing sometimes fails to detect. So I tried to work around it by pre-executing custom js_code to collect all background images before the browser returns the HTML. For example, for this URL https://akispetretzikis.com/recipe/8615/revithada-me-chwriatiko-loukaniko I created the following js_code:

// Collect every distinct computed background-image on the page.
const backgrounds = new Set(
  [...document.querySelectorAll('*')]
    .map(el => getComputedStyle(el).backgroundImage)
    .filter(img => img && img !== 'none')
);
const result = [...backgrounds];
let url = result[0] || undefined; // test with the first background image only

if (url) {
  // Match url("https://...ext") values for common image extensions.
  // Note: the backslashes below are written for plain JavaScript; if you embed
  // this in a normal (non-raw) Python string, double each backslash.
  const matches = url.match(/url\("https:\/\/[^"]+\.(jpg|jpeg|png|apng|webp|avif|gif)"\)/gi);
  if (matches) {
    // Loop through the matches and extract the bare URLs.
    matches.forEach(match => {
      const imageUrl = match.match(/url\("([^"]+)"/i)[1];
      // Create an <img> element so the crawler's media extraction can pick it up.
      const img = document.createElement('img');
      img.src = imageUrl;
      img.classList.add('bg-image-selection'); // marker class for wait_for
      document.body.appendChild(img);
    });
  }
}

But when I tried to wait_for that condition, it did not work properly:

bg_images = ".bg-image-selection"
result = await crawler.arun(
    # session_id=session_id,
    excluded_tags=['header', 'footer', 'nav', 'meta', 'link'],  # additional tags to remove
    # url="https://akispetretzikis.com/recipe/8733/christougenniatikh-mpolonez-me-kima-galopoulas-kai-kastana",
    url=url,
    js_code=js_code,  # js_code holds the snippet above
    process_iframes=True,  # extract iframe content
    remove_overlay_elements=True,  # remove popups/modals that might block content
    wait_for=f"css:{bg_images}",
    # css_selector=bg_images,
    screenshot=True,
    # Timing
    # delay_before_return_html=3.0,  # additional wait time
    magic=True,
    scan_full_page=True,  # enables scrolling
    scroll_delay=0.2,  # waits 200 ms between scrolls (optional)
    cache_mode=CacheMode.BYPASS,  # new way to handle cache
    wait_for_images=True,  # ensure images are fully loaded
    simulate_user=True,  # simulate human behavior
    override_navigator=True,  # override navigator properties
    # adjust_viewport_to_content=True,  # dynamically adjusts the viewport
    # markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
    js_only=True,  # only execute JS without reloading the page
)
# Access different media types
images = result.media["images"]  # list of image details
....

Call log:
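[One likely culprit here, as a reading of the code above rather than anything confirmed in the thread: js_only=True tells the crawler to reuse an already-open page instead of navigating, and it is meant to be paired with a session_id from an earlier call. With session_id commented out, there is no loaded page for the JS to run against, so the .bg-image-selection elements never appear and wait_for times out. A minimal sketch of the two-step session pattern, assuming the session_id/js_only semantics documented for crawl4ai; names reuse the thread's own variables:]

# Step 1: load the page normally inside a named session.
result = await crawler.arun(
    url=url,
    session_id="bg-images",       # keep the page alive between calls
    cache_mode=CacheMode.BYPASS,
)

# Step 2: run the background-image JS in the same session, without reloading.
result = await crawler.arun(
    url=url,
    session_id="bg-images",
    js_code=js_code,              # the snippet above that injects <img> tags
    js_only=True,                 # reuse the already-loaded page
    wait_for=f"css:{bg_images}",  # now the injected elements can appear
)
images = result.media["images"]

[Alternatively, simply dropping js_only=True from the original single call should let the navigation and the JS run in one step.]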
@prokopis3 I will check this URL and get back to you. The desired outcome is that we should be able to crawl it, and we definitely will. Since day zero, many people have reported different situations, and I keep updating; hopefully, within a few months, we can say we have covered them all. I will check this one and post an update.
I'm very sorry, but I still want to ask.
I did some simple reading and experimenting at https://crawl4ai.com/mkdocs/.
I tried to crawl some simple pages and found that, thanks to the built-in interface, it is very easy to get media resources, which is very interesting and great!
But I still have some additional questions.
I used Selenium and Pyppeteer for crawling before, and this time I used Crawl4AI, but I didn't really feel its power (I didn't use the LLM features this time). Maybe that's because I'm a beginner with Crawl4AI.
I found that it simplifies some common crawling problems and provides convenient interfaces, but it doesn't seem much stronger than traditional crawler frameworks when it comes to crawling and CAPTCHA handling. Can you tell me its advantages? Thank you.
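[As a pointer for the media-extraction part mentioned above, a minimal sketch of reading extracted media from a crawl result. The result.media["images"] structure follows the usage shown earlier in this thread; the key names inside each entry are an assumption and may vary by version:]

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")  # placeholder URL
        for img in result.media["images"]:  # same structure used earlier in the thread
            print(img.get("src"), img.get("alt"))  # key names assumed, not verified

asyncio.run(main())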