civic_scraper.base
Submodules
civic_scraper.base.asset
- class civic_scraper.base.asset.Asset(url: str, asset_name: str = None, committee_name: str = None, place: str = None, place_name: str = None, state_or_province: str = None, asset_type: str = None, meeting_date: datetime = None, meeting_time: time = None, meeting_id: str = None, scraped_by: str = None, content_type: str = None, content_length: str = None)
Bases: object
- Parameters:
url (str) – URL to download an asset.
asset_name (str) – Title of an asset. Ex: City Council Regular Meeting
committee_name (str) – Name of committee that generated the asset. Ex: City Council
place (str) – Name of place associated with the asset. Lowercase with spaces and punctuation removed. Ex: menlopark
place_name (str) – Human-readable place name. Ex: Menlo Park
state_or_province (str) – Two-letter abbreviation for state or province associated with an asset. Ex: ca
asset_type (str) – One of SUPPORTED_ASSET_TYPES. Ex: agenda
meeting_date (datetime.datetime) – Date of meeting or None if no date given
meeting_time (datetime.time) – Time of meeting or None
meeting_id (str) – Unique meeting ID. For example, a combination of scraper type, subdomain, and numeric ID or date. Ex: civicplus-nc-nashcounty-05052020-382
scraped_by (str) – civic_scraper.__version__
content_type (str) – File type of the asset as given by HTTP headers. Ex: ‘application/pdf’
content_length (str) – Asset size in bytes
- Public methods:
download: downloads an asset to a given target directory (target_dir)
- download(target_dir, session=None, timeout=None)
Downloads an asset to a target directory.
- Parameters:
target_dir (str) – target directory name
session – optional requests.Session to reuse
timeout (int or float) – optional timeout in seconds for the HTTP request
- Returns:
Full path to downloaded file
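The naming conventions described for the place and meeting_id parameters can be sketched with the standard library alone. This is a minimal illustration, not code from civic_scraper: the helper names normalize_place and build_meeting_id are hypothetical, and the MMDDYYYY date layout is an assumption inferred from the single example ID above (05052020 would read the same as DDMMYYYY).

```python
import re
from datetime import date


def normalize_place(place_name: str) -> str:
    """Lowercase the name and drop spaces and punctuation.

    Assumption: only ASCII letters and digits are kept.
    """
    return re.sub(r"[^a-z0-9]", "", place_name.lower())


def build_meeting_id(scraper_type: str, state_or_province: str,
                     place_name: str, meeting_date: date,
                     numeric_id: int) -> str:
    """Join the documented components with hyphens (hypothetical helper)."""
    parts = [
        scraper_type.lower(),
        state_or_province.lower(),
        normalize_place(place_name),
        meeting_date.strftime("%m%d%Y"),  # assumed MMDDYYYY layout
        str(numeric_id),
    ]
    return "-".join(parts)


print(normalize_place("Menlo Park"))
print(build_meeting_id("civicplus", "nc", "Nash County", date(2020, 5, 5), 382))
```

With these assumptions, the two calls reproduce the example values given in the parameter list: menlopark and civicplus-nc-nashcounty-05052020-382.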
civic_scraper.base.cache
- class civic_scraper.base.cache.Cache(path=None)
Bases: object
- property artifacts_path
Path for HTML and other intermediate artifacts from scraping
- property assets_path
Path for agendas, minutes and other gov file assets
- property metadata_files_path
Path for metadata files related to file artifacts
- write(name, content)
civic_scraper.base.constants
civic_scraper.base.site
- class civic_scraper.base.site.Site(base_url, cache=None, parser_kls=None)
Bases: object
Base class for all Site scrapers.
- Parameters:
base_url (str) – URL to a government agency site
cache (Cache instance) – Optional Cache instance (default: “.civic-scraper” in user home dir)
parser_kls (class) – Optional parser class to extract data from government agency websites.
- scrape(*args, **kwargs) AssetCollection
Scrape the site and return an AssetCollection instance.
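The subclassing pattern implied by "Base class for all Site scrapers" might look roughly like the sketch below. To keep the snippet runnable without the package installed, it does not import civic_scraper: SiteStandIn and ExampleSite are hypothetical stand-ins that mirror only the documented constructor and scrape signatures. The stand-in stores the default cache location as a plain path and returns a list of dicts, whereas the real classes use Cache and AssetCollection instances.

```python
from pathlib import Path


class SiteStandIn:
    """Hypothetical stand-in mirroring the documented Site constructor."""

    def __init__(self, base_url, cache=None, parser_kls=None):
        self.base_url = base_url
        # Assumption: default to ".civic-scraper" in the user's home
        # directory, as the cache parameter description states.
        self.cache = cache if cache is not None else Path.home() / ".civic-scraper"
        self.parser_kls = parser_kls

    def scrape(self, *args, **kwargs):
        # Concrete scrapers are expected to override this method.
        raise NotImplementedError


class ExampleSite(SiteStandIn):
    """Hypothetical scraper that returns fixed metadata instead of fetching."""

    def scrape(self, *args, **kwargs):
        # A real scraper would download and parse pages from base_url.
        # The keys below reuse the documented Asset parameter names.
        return [
            {
                "url": f"{self.base_url}/agenda.pdf",  # made-up path
                "asset_name": "City Council Regular Meeting",
                "committee_name": "City Council",
                "asset_type": "agenda",
            }
        ]


site = ExampleSite("https://example.com")
assets = site.scrape()
print(assets[0]["asset_type"], site.cache.name)
```

The point of the sketch is the division of labour: the base class holds the URL, cache, and parser configuration, while each subclass supplies its own scrape logic.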