Not able to crawl github repo recursively #408

Open
AugustusLiConnect opened this issue Jan 4, 2025 · 0 comments
@AugustusLiConnect

I am using the LLM extraction strategy to recursively crawl the folders under the URL https://github.com/unclecode/crawl4ai/tree/main/docs with this instruction:
instruction="""Extract the information of the children directories under the given link. And crawl the children directories recursively"""

I am running the following code.

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import os, json
# from google.colab import userdata

class DirectoryInfo(BaseModel):
    name: str = Field(..., description="Name of the current directory.")
    url: str = Field(..., description="URL of the current directory.")
    children_directory_name_and_url: dict = Field(
        ..., description="Names and URLs of the children directories under the current directory."
    )
    files: list[str] = Field(
        ..., description="Names of the files under the current directory."
    )

# Page hooks: run custom code around navigation. Here they just attach a
# custom header and log the page URL.
async def before_goto_func(page, context=None, **kwargs):
    await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
    print("Before goto function on page: ", page.url)

async def after_goto_func(page, context=None, **kwargs):
    await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
    print("After goto function on page: ", page.url)


async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: dict = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    print(api_token)
    print(extra_headers)

    # Skip if API token is missing (for providers that require it)
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    extra_args = {"extra_headers": extra_headers} if extra_headers else {}

    async with AsyncWebCrawler(verbose=True) as crawler:

        crawler.crawler_strategy.set_hook("before_goto", before_goto_func)
        crawler.crawler_strategy.set_hook("after_goto", after_goto_func)

        result = await crawler.arun(
            url="https://github.com/unclecode/crawl4ai/tree/main/docs",
            word_count_threshold=1,
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_base=os.environ["AZURE_API_BASE"],
                api_token=api_token,
                schema=DirectoryInfo.model_json_schema(),
                extraction_type="schema",
                instruction="""Extract the information of the children directories under the given link. And crawl the the children directory recursively""",
                **extra_args
            ),
        )
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")
        print(json.loads(result.extracted_content)[:5])
        # print(f"List of all the internal links: {result.links['internal']}")
        # print(f"List of all the external links: {result.links['external']}")

# Top-level await works in a notebook; in a plain script, wrap this in asyncio.run(...)
await extract_structured_data_using_llm("azure/gpt-4o-mini", os.getenv("AZURE_API_KEY"))

However, it only extracts the information from the top-level directory; it doesn't recursively crawl into the subdirectories. Here is the output:
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[EXTRACT]. ■ Completed for https://github.com/unclecode/crawl4ai/tree/main/do... | Time: 2.7219022089848295s
[COMPLETE] ● https://github.com/unclecode/crawl4ai/tree/main/do... | Status: True | Total: 4.76s
Found 71 internal links
Found 2 external links
[{'name': 'docs', 'url': 'https://github.com/unclecode/crawl4ai/tree/main/docs', 'children_directory_name_and_url': {'assets': 'https://github.com/unclecode/crawl4ai/tree/main/docs/assets', 'deprecated': 'https://github.com/unclecode/crawl4ai/tree/main/docs/deprecated', 'examples': 'https://github.com/unclecode/crawl4ai/tree/main/docs/examples', 'md_v2': 'https://github.com/unclecode/crawl4ai/tree/main/docs/md_v2', 'md_v3/tutorials': 'https://github.com/unclecode/crawl4ai/tree/main/docs/md_v3/tutorials', 'notebooks': 'https://github.com/unclecode/crawl4ai/tree/main/docs/notebooks'}, 'files': [], 'error': False}]
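
If the extraction strategy only ever sees the single page passed to arun(), I assume the recursion has to be driven from the caller side, along the lines of the untested sketch below (crawl_directory and its max_depth guard are names I made up; the per-call arguments mirror the ones above):

import json
from crawl4ai import AsyncWebCrawler, CacheMode

# Untested sketch: re-run arun() on every child directory URL the LLM
# extracts, instead of expecting the instruction alone to trigger recursion.
# max_depth is a made-up guard so the whole repository isn't crawled.
async def crawl_directory(crawler, url, strategy, max_depth=2, depth=0):
    result = await crawler.arun(
        url=url,
        word_count_threshold=1,
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=strategy,
    )
    # extracted_content is a JSON string holding a list of schema dicts,
    # matching the output shown above
    for entry in json.loads(result.extracted_content):
        children = entry.get("children_directory_name_and_url") or {}
        print("  " * depth + f"{entry.get('name')}: {len(children)} children")
        if depth < max_depth:
            for child_url in children.values():
                await crawl_directory(crawler, child_url, strategy, max_depth, depth + 1)

Is there a built-in way to get this behavior, or is manual recursion like this expected?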

Thank you
