User Guide
==========

This guide covers how to use BICAM to download and work with congressional datasets.

Command Line Interface
----------------------

BICAM provides a simple command-line interface for downloading and managing datasets.

**List Available Datasets**

View all available datasets and their sizes:

.. code-block:: bash

   bicam list-datasets

For detailed information:

.. code-block:: bash

   bicam list-datasets --detailed

**Download a Dataset**

Download a specific dataset:

.. code-block:: bash

   bicam download bills

Download with options:

.. code-block:: bash

   # Force re-download
   bicam download bills --force

   # Use custom cache directory
   bicam download bills --cache-dir /path/to/cache

   # Skip confirmation for large (> 1GB) datasets
   bicam download complete --confirm

   # Suppress output
   bicam download bills --quiet
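
For batch jobs it can help to assemble the command in one place. The following is a minimal sketch that builds a ``bicam download`` argument list from the flags shown above. ``build_download_cmd`` and ``download_all`` are hypothetical helpers, not part of BICAM, and the sketch assumes the flags can be combined freely.

.. code-block:: python

   import subprocess
   from typing import List, Optional


   def build_download_cmd(dataset: str, *, force: bool = False, quiet: bool = False,
                          confirm: bool = False,
                          cache_dir: Optional[str] = None) -> List[str]:
       """Return the argument list for one ``bicam download`` call (hypothetical helper)."""
       cmd = ["bicam", "download", dataset]
       if force:
           cmd.append("--force")
       if cache_dir is not None:
           cmd += ["--cache-dir", cache_dir]
       if confirm:
           cmd.append("--confirm")
       if quiet:
           cmd.append("--quiet")
       return cmd


   def download_all(datasets: List[str], **flags) -> None:
       """Run one download per dataset and stop at the first failure."""
       for name in datasets:
           # check=True raises CalledProcessError on a non-zero exit code
           subprocess.run(build_download_cmd(name, **flags), check=True)


   # Example (requires the bicam CLI on PATH):
   # download_all(["bills", "members"], quiet=True, confirm=True)

Passing a list to ``subprocess.run`` instead of a shell string avoids quoting problems when the cache path contains spaces.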

**Get Dataset Information**

View detailed information about a dataset:

.. code-block:: bash

   bicam info bills

**Manage Cache**

View cache information:

.. code-block:: bash

   bicam cache

Clear a specific dataset:

.. code-block:: bash

   bicam clear bills

Clear all cached data:

.. code-block:: bash

   bicam clear --all

Python API
----------

BICAM also provides a Python API for programmatic access.

**Basic Usage**

.. code-block:: python

   import bicam

   # Download a dataset
   bills_path = bicam.download_dataset('bills')
   print(f"Bills data available at: {bills_path}")

**Loading Data as DataFrames**

The easiest way to work with BICAM data is to use the ``load_dataframe`` function:

.. code-block:: python

   import bicam
   import pandas as pd

   # Load bills data directly into a DataFrame (downloads if needed, auto-confirms for large datasets)
   bills_df = bicam.load_dataframe('bills', 'bills_metadata.csv', download=True)
   print(f"Loaded {len(bills_df)} bills")

   # Load members data (raises ValueError if the dataset is not cached)
   try:
       members_df = bicam.load_dataframe('members', 'members_current.csv')
   except ValueError as e:
       print(f"Dataset not cached: {e}")
       # Download it first
       members_df = bicam.load_dataframe('members', 'members_current.csv', download=True)

   # Load first available CSV file from a dataset
   df = bicam.load_dataframe('bills', download=True)

   # Force confirmation prompt for large datasets
   bills_df = bicam.load_dataframe('bills', download=True, confirm=False)

   # Suppress all output during download
   bills_df = bicam.load_dataframe('bills', download=True, quiet=True)

   # Use different DataFrame engines
   # Polars (included by default)
   bills_df = bicam.load_dataframe('bills', df_engine='polars')

   # Dask (requires dask installed)
   bills_df = bicam.load_dataframe('bills', df_engine='dask')

   # Spark (requires pyspark installed)
   bills_df = bicam.load_dataframe('bills', df_engine='spark')

   # DuckDB (requires duckdb installed)
   bills_df = bicam.load_dataframe('bills', df_engine='duckdb')

**Advanced Options**

.. code-block:: python

   import bicam

   # Force re-download
   bills_path = bicam.download_dataset('bills', force_download=True)

   # Custom cache directory
   bills_path = bicam.download_dataset('bills', cache_dir='/custom/path')

   # Skip confirmation for large (> 1GB) datasets
   bills_path = bicam.download_dataset('complete', confirm=True)

   # Suppress logging
   bills_path = bicam.download_dataset('bills', quiet=True)

**Dataset Information**

.. code-block:: python

   import bicam

   # List all datasets
   datasets = bicam.list_datasets()
   print(f"Available datasets: {datasets}")

   # Get info about a dataset
   info = bicam.get_dataset_info('bills')
   print(f"Size: {info['size_mb']} MB")
   print(f"Description: {info['description']}")

**Cache Management**

.. code-block:: python

   import bicam

   # Get cache size
   cache_info = bicam.get_cache_size()
   print(f"Total cache size: {cache_info['total']}")

   # Clear specific dataset
   bicam.clear_cache('bills')

   # Clear all cache
   bicam.clear_cache()

Working with Data
-----------------

**Using pandas with load_dataframe**

.. code-block:: python

   import bicam
   import pandas as pd

   # Load bills data directly into DataFrame
   bills_df = bicam.load_dataframe('bills', 'bills_metadata.csv', download=True)

   # Basic analysis
   print(f"Total bills: {len(bills_df)}")
   print(f"Congress range: {bills_df['congress'].min()} - {bills_df['congress'].max()}")

   # Filter recent bills
   recent_bills = bills_df[bills_df['congress'] >= 115]
   print(f"Recent bills: {len(recent_bills)}")

**Using different DataFrame engines**

.. code-block:: python

   import bicam

   # Load with polars (faster for large datasets)
   bills_df = bicam.load_dataframe('bills', 'bills_metadata.csv', df_engine='polars')
   print(f"Loaded {len(bills_df)} bills with polars")

   # Load with dask (for out-of-memory processing)
   bills_df = bicam.load_dataframe('bills', 'bills_metadata.csv', df_engine='dask')
   print(f"Loaded bills with dask: {bills_df.npartitions} partitions")

   # Load with spark (for distributed processing)
   bills_df = bicam.load_dataframe('bills', 'bills_metadata.csv', df_engine='spark')
   print(f"Loaded bills with spark: {bills_df.count()} rows")
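
Because Dask, Spark, and DuckDB are optional installs, a script may want to check what is available before choosing an engine. The sketch below does that with the standard library. ``first_installed`` and ``load_with_best_engine`` are hypothetical helpers, and the mapping from engine name to module name is an assumption based on the package names mentioned in this guide.

.. code-block:: python

   from importlib.util import find_spec
   from typing import Dict, Optional

   # Engine name (as passed to df_engine) -> module that must be importable.
   # The module names are assumptions, not taken from BICAM itself.
   ENGINE_MODULES: Dict[str, str] = {
       "polars": "polars",
       "dask": "dask",
       "spark": "pyspark",
       "duckdb": "duckdb",
   }


   def first_installed(candidates: Dict[str, str]) -> Optional[str]:
       """Return the first engine whose module can be imported, else None."""
       for engine, module in candidates.items():
           if find_spec(module) is not None:
               return engine
       return None


   def load_with_best_engine(dataset: str, filename: str):
       """Load a file with the first available optional engine (hypothetical helper)."""
       import bicam  # imported lazily so the helper above works without BICAM

       engine = first_installed(ENGINE_MODULES)
       if engine is None:
           # No optional engine found: fall back to BICAM's default engine
           return bicam.load_dataframe(dataset, filename)
       return bicam.load_dataframe(dataset, filename, df_engine=engine)

The dictionary order sets the preference, so reorder ``ENGINE_MODULES`` to favor a different engine.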

**Working with Multiple Datasets**

.. code-block:: python

   import bicam

   # Load multiple datasets as DataFrames
   bills_sponsors_df = bicam.load_dataframe('bills', 'bills_sponsors.csv', download=True)
   members_df = bicam.load_dataframe('members', 'members.csv', download=True)

   # Join data (example; assumes both tables share a 'bioguide_id' column)
   # bills_with_sponsors_detailed = bills_sponsors_df.merge(members_df, on='bioguide_id')

**Data Exploration**

.. code-block:: python

   import bicam

   # Explore bills dataset
   bills_df = bicam.load_dataframe('bills', 'bills_metadata.csv', download=True)

   # View columns
   print(bills_df.columns.tolist())

   # Basic statistics
   print(bills_df.describe())

   # Value counts
   print(bills_df['congress'].value_counts().sort_index())

Best Practices
--------------

**Dataset Selection**

* Start with smaller datasets like ``congresses`` or ``members``
* Use ``bills`` for legislative analysis
* Download ``complete`` only if you need all data
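
One way to apply the advice above is to rank datasets by size and pick the smallest ones that fit a download budget. This is a sketch only: ``smallest_first`` and ``sizes_from_bicam`` are hypothetical helpers, and ``sizes_from_bicam`` assumes that ``bicam.list_datasets()`` yields dataset names.

.. code-block:: python

   from typing import Dict, List


   def smallest_first(sizes_mb: Dict[str, float], budget_mb: float) -> List[str]:
       """Pick datasets in ascending size order until the budget is used up."""
       chosen: List[str] = []
       used = 0.0
       for name, size in sorted(sizes_mb.items(), key=lambda item: item[1]):
           if used + size > budget_mb:
               break
           chosen.append(name)
           used += size
       return chosen


   def sizes_from_bicam() -> Dict[str, float]:
       """Collect dataset sizes through the API (assumes list_datasets() yields names)."""
       import bicam

       return {
           name: bicam.get_dataset_info(name)["size_mb"]
           for name in bicam.list_datasets()
       }


   # Example: datasets that fit within roughly 500 MB
   # print(smallest_first(sizes_from_bicam(), budget_mb=500))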

**Performance Tips**

* Use ``--quiet`` for automated scripts
* Use ``--confirm`` to skip prompts in batch operations
* Monitor disk space before downloading large datasets
* Use ``df_engine='polars'`` for faster loading of large datasets
* Use ``df_engine='dask'`` for out-of-memory processing
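
The disk space check can be scripted with the standard library. The sketch below compares free space against the ``size_mb`` value from ``get_dataset_info``. ``has_room`` and ``download_if_room`` are hypothetical helpers, and the default headroom factor of 2 is an assumption, not a BICAM requirement.

.. code-block:: python

   import shutil


   def has_room(required_mb: float, path: str = ".", headroom: float = 2.0) -> bool:
       """Return True if `path` has at least required_mb * headroom megabytes free."""
       free_mb = shutil.disk_usage(path).free / (1024 * 1024)
       return free_mb >= required_mb * headroom


   def download_if_room(dataset: str, cache_dir: str):
       """Download only when the target filesystem has enough free space.

       `cache_dir` must already exist, because shutil.disk_usage needs a real path.
       """
       import bicam  # imported lazily so has_room works without BICAM

       size_mb = bicam.get_dataset_info(dataset)["size_mb"]
       if not has_room(size_mb, cache_dir):
           raise OSError(
               f"Not enough free space in {cache_dir} for {dataset} ({size_mb} MB)"
           )
       return bicam.download_dataset(dataset, cache_dir=cache_dir)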

**Data Management**

* Use ``bicam cache`` to monitor storage usage
* Clear unused datasets with ``bicam clear``
* Consider using custom cache directories for different projects
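
A per-project cache can be set up by deriving the directory from the project name and passing it as ``cache_dir``. The layout below (``<base>/<project>/bicam-cache``) and both helper names are hypothetical. Only the ``cache_dir`` parameter of ``download_dataset`` comes from this guide.

.. code-block:: python

   from pathlib import Path


   def project_cache_dir(project: str, base: Path) -> Path:
       """Create and return <base>/<project>/bicam-cache (hypothetical layout)."""
       cache = Path(base) / project / "bicam-cache"
       cache.mkdir(parents=True, exist_ok=True)
       return cache


   def download_for_project(dataset: str, project: str, base: Path):
       """Download a dataset into the cache directory of one project."""
       import bicam  # imported lazily so project_cache_dir works without BICAM

       cache = project_cache_dir(project, base)
       return bicam.download_dataset(dataset, cache_dir=str(cache))


   # Example:
   # download_for_project("bills", "floor-votes", Path.home() / "research")

Keeping one cache per project means clearing it does not affect data used by other projects.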

**Error Handling**

* Use try/except blocks to handle download or loading errors. For example:

  .. code-block:: python

     import bicam

     try:
         bills_df = bicam.load_dataframe('bills', download=True)
     except Exception as e:
         print(f"Download failed: {e}")
         # Handle error appropriately
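
For transient failures such as network errors, the try/except pattern can be extended into a small retry loop. The following is a generic sketch, not BICAM functionality: ``retry`` is a hypothetical helper, and it catches ``Exception`` only because this guide does not list the specific errors BICAM raises.

.. code-block:: python

   import time
   from typing import Callable, TypeVar

   T = TypeVar("T")


   def retry(action: Callable[[], T], attempts: int = 3, delay_s: float = 1.0) -> T:
       """Call `action` until it succeeds or `attempts` calls have failed.

       The delay doubles after each failure and the last error is re-raised.
       """
       for attempt in range(1, attempts + 1):
           try:
               return action()
           except Exception as exc:  # narrow this once the real error types are known
               if attempt == attempts:
                   raise
               print(f"Attempt {attempt} failed ({exc}); retrying in {delay_s:.1f}s")
               time.sleep(delay_s)
               delay_s *= 2
       raise ValueError("attempts must be at least 1")


   # Example:
   # import bicam
   # bills_df = retry(lambda: bicam.load_dataframe('bills', download=True))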

Examples
--------

**Legislative Analysis**

.. code-block:: python

   import bicam

   # Load bills data
   bills_df = bicam.load_dataframe('bills', 'bills.csv', download=True)

   # Count bills of each type within each congress
   bill_types = bills_df.groupby('congress')['bill_type'].value_counts()
   print("Number of bills of each type by congress:")
   print(bill_types)

**Committee Analysis**

.. code-block:: python

   import bicam

   # Load committee and hearing-committee mapping data
   committees_df = bicam.load_dataframe('committees', 'committees.csv', download=True)
   hearings_committees_df = bicam.load_dataframe('hearings', 'hearings_committees.csv', download=True)

   # Join on 'committee_code' to find committees with hearings that are current
   merged = hearings_committees_df.merge(
       committees_df[['committee_code', 'is_current']],
       on='committee_code',
       how='inner'
   )

   # Filter for current committees
   current_committees_with_hearings = merged[merged['is_current'] == True]

   print("Committees with hearings where is_current is True:")
   print(current_committees_with_hearings['committee_code'].unique())