[Python] Extracting Site Information with BeautifulSoup
# Script/Python, 2021. 4. 29. 11:12
1. Installing BeautifulSoup
$ pip install beautifulsoup4
2. Using BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = 'https://darksharavim.tistory.com/'
response = requests.get(url)

if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    print(soup)
else:
    print(response.status_code)
The parsed HTML is printed only when the status code is 200; otherwise the status code itself is printed.
C:\Users\kajin7-ryzen3\PycharmProjects\darksharavim\venv\Scripts\python.exe C:/Users/kajin7-ryzen3/PycharmProjects/darksharavim/beatuifulsoup-test.py
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="https://t1.daumcdn.net/tistory_admin/lib/lightbox/css/lightbox.min.css" rel="stylesheet" type="text/css"/><link href="https://t1.daumcdn.net/tistory_admin/assets/blog/tistory-7254247c8aa83aa1312d04ac2cb0d36d24dad79c/blogs/style/content/font.css?_version_=tistory-7254247c8aa83aa1312d04ac2cb0d36d24dad79c" rel="stylesheet" type="text/css"/><link href="https://t1.daumcdn.net/tistory_admin/assets/blog/tistory-7254247c8aa83aa1312d04ac2cb0d36d24dad79c/blogs/style/content/content.css?_version_=tistory-7254247c8aa83aa1312d04ac2cb0d36d24dad79c" rel="stylesheet" type="text/css"/><!--[if lt IE 9]><script src="https://t1.daumcdn.net/tistory_admin/lib/jquery/jquery-1.12.4.min.js"></script><![endif]--><!--[if gte IE 9]>
<!--><script src="https://t1.daumcdn.net/tistory_admin/lib/jquery/jquery-3.2.1.min.js"></script><!--<![endif]-->
<script src="https://t1.daumcdn.net/tistory_admin/lib/lightbox/js/lightbox-plus-jquery.min.js"></script>
<script>
lightbox.options.fadeDuration = 200;
lightbox.options.resizeDuration = 200;
lightbox.options.wrapAround = false;
lightbox.options.albumLabel = "%1 / %2";
</script>
<script>var tjQuery = jQuery.noConflict(true);</script><style type="text/css">.tt_article_useless_p_margin p {padding-top:0 !important;padding-bottom:0 !important;margin-top:0 !important;margin-bottom:0 !important;}</style><meta content="always" name="referrer"/><link href="//t1.daumcdn.net/tistory_admin/static/top/favicon_0630.ico" rel="icon"/><link href="//img1.daumcdn.net/thumb/C180x180/?fname=https%3A%2F%2Ftistory3.daumcdn.net%2Ftistory%2F1915777%2Fattach%2F65db636337e643f18cd247c641f5ebe5" rel="apple-touch-icon"/>
<link href="//img1.daumcdn.net/thumb/C76x76/?fname=https%3A%2F%2Ftistory3.daumcdn.net%2Ftistory%2F1915777%2Fattach%2F65db636337e643f18cd247c641f5ebe5" rel="apple-touch-icon" sizes="76x76"/>
<link href="//img1.daumcdn.net/thumb/C120x120/?fname=https%3A%2F%2Ftistory3.daumcdn.net%2Ftistory%2F1915777%2Fattach%2F65db636337e643f18cd247c641f5ebe5" rel="apple-touch-icon" sizes="120x120"/>
<link href="//img1.daumcdn.net/thumb/C152x152/?fname=https%3A%2F%2Ftistory3.daumcdn.net%2Ftistory%2F1915777%2Fattach%2F65db636337e643f18cd247c641f5ebe5" rel="apple-touch-icon" sizes="152x152"/><meta content="안녕하세요. 이곳은 IT위주의 잡다한 정보를 올려두는 개인 블로그입니다." name="description"/>
<!-- BEGIN OPENGRAPH -->
<link href="https://darksharavim.tistory.com" rel="canonical"><meta content="website" property="og:type"><meta content="https://darksharavim.tistory.com" property="og:url"><meta content="다크쉐라빔의 주절주절" property="og:site_name"/><meta content="다크쉐라빔의 주절주절" property="og:title"/><meta content="안녕하세요. 이곳은 IT위주의 잡다한 정보를 올려두는 개인 블로그입니다." property="og:description"/><meta content="https://tistory4.daumcdn.net/tistory/1915777/attach/65db636337e643f18cd247c641f5ebe5" property="og:image"/>
<!-- END OPENGRAPH -->
<!-- BEGIN TWITTERCARD -->
<meta content="summary_large_image" name="twitter:card"/><meta content="@TISTORY" name="twitter:site"/><meta content="다크쉐라빔의 주절주절" name="twitter:title"/><meta content="안녕하세요. 이곳은 IT위주의 잡다한 정보를 올려두는 개인 블로그입니다." name="twitter:description"/><meta content="https://tistory4.daumcdn.net/tistory/1915777/attach/65db636337e643f18cd247c641f5ebe5" property="twitter:image"/>
<!-- END TWITTERCARD -->
<!-- BEGIN STRUCTURED_DATA -->
<script type="application/ld+json">{"@context":"http:\/\/schema.org","@type":"WebSite","url":"\/","potentialAction":{"@type":"SearchAction","target":"\/search\/{search_term_string}","query-input":"required name=search_term_string"}}</script>
<!-- END STRUCTURED_DATA -->
<script async="" data-ad-client="ca-pub-7635721508987260" data-ad-host="ca-host-pub-9691043933427338" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="user-scalable=no,width=device-width,initial-scale=1.0" name="viewport"/>
<meta content="다크쉐라빔" name="author"/>
<meta content="안녕하세요. 이곳은 IT위주의 잡다한 정보를 올려두는 개인 블로그입니다." name="description"/>
<meta content="M1react" name="generator"/>
<link href="/rss" rel="alternate" title="다크쉐라빔의 주절주절" type="application/rss+xml">
<link href="https://tistory2.daumcdn.net/tistory/1915777/skin/style.css?_T_=1547529681" media="screen" rel="stylesheet" type="text/css">
<link href="/favicon.ico" rel="shortcut icon">
<script src="https://tistory2.daumcdn.net/tistory/1915777/skin/images/m1react.js" type="text/javascript"></script>
<title>다크쉐라빔의 주절주절 :: 다크쉐라빔의 주절주절</title>
... (omitted)
<script src="//t1.daumcdn.net/tiara/js/v1/tiara.min.js" type="text/javascript"></script>
<script type="text/javascript">window.tiara = {"svcDomain":"user.tistory.com","section":"\ud648","trackPage":"\ud648_\ubcf4\uae30","page":"\ud648","key":"1915777","customProps":{"userId":0,"blogId":"1915777","role":"guest","filterTarget":false,"trackPage":"\ud648_\ubcf4\uae30"},"entry":[],"sentryDsn":"https:\/\/a53520229cd744e798d42900d76b0e2a@aem-collector.daumkakao.io\/713","kakaoAppKey":"b8aef3eeb03fa312b81795386484f051","appUserId":null};</script>
<script defer="" src="https://t1.daumcdn.net/tistory_admin/assets/blog/tistory-7254247c8aa83aa1312d04ac2cb0d36d24dad79c/blogs/script/tiara/tiara.min.js?_version_=tistory-7254247c8aa83aa1312d04ac2cb0d36d24dad79c" type="text/javascript"></script>
<script type="text/javascript">
window.roosevelt_params_queue = window.roosevelt_params_queue || [{channel_id: 'dk', channel_label: 'tistory'}];
</script>
<script async="" src="//t1.daumcdn.net/midas/rt/dk_bt/roosevelt_dk_bt.js" type="text/javascript"></script><script type="text/javascript">if(window.console!=undefined){setTimeout(console.log.bind(console,"%cTISTORY","font:8em Arial;color:#EC6521;font-weight:bold"),0);setTimeout(console.log.bind(console,"%c 나를 표현하는 블로그","font:2em sans-serif;color:#333;"),0);}</script><iframe id="editEntry" src="//darksharavim.tistory.com/api" style="position:absolute;width:1px;height:1px;left:-100px;top:-100px"></iframe><div class="layer_post" id="tistoryEtcLayer"></div><div class="layer_post" id="tistorySnsLayer"></div></body>
</html>
Process finished with exit code 0
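Printing the whole document is rarely the end goal. As a minimal sketch, assuming a small inline HTML string modeled on the output above (it is not fetched from the site), the parsed soup object can be queried directly for individual tags:

```python
from bs4 import BeautifulSoup

# Inline sample modeled on the <head> printed above; an assumption for
# illustration, not the live page.
SAMPLE_HTML = """
<html>
<head>
<meta content="다크쉐라빔" name="author"/>
<title>다크쉐라빔의 주절주절 :: 다크쉐라빔의 주절주절</title>
</head>
<body></body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.title is the first <title> tag; .string is its text content.
page_title = soup.title.string

# find() returns the first matching tag, or None if there is no match.
# attrs= is used because "name" is also find()'s own tag-name parameter.
author_tag = soup.find("meta", attrs={"name": "author"})
author = author_tag["content"] if author_tag is not None else None

print(page_title)
print(author)
```

Working on a saved or inline HTML string like this is also a convenient way to try out lookups without requesting the page again each time.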
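3. Example: Extracting a Specific Element
Let's extract a post title from the main page of my blog.
In Chrome, open the developer tools (F12), right-click the element in the Elements panel, and choose Copy > Copy selector.
Then modify the source as follows: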
import requests
from bs4 import BeautifulSoup

url = 'https://darksharavim.tistory.com/'
response = requests.get(url)

if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('#ttItem1547528381 > ul > li.tt-span-12.tt-last.tt-clear > div:nth-child(1) > div > p.tt-post-title > a')
    print(title)
else:
    print(response.status_code)
Result
C:\Users\ryzen3\PycharmProjects\darksharavim\venv\Scripts\python.exe C:/Users/ryzen3/PycharmProjects/darksharavim/beatuifulsoup-test.py
<a href="/588">[파이썬]윈도우10 환경 설치</a>
Process finished with exit code 0
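A selector copied from the developer tools is tied to the exact page layout (here an element ID and an nth-child position), and select_one returns None when nothing matches. The following is a sketch under assumptions: the inline HTML is a simplified stand-in for the blog markup, and the shorter selector `p.tt-post-title > a` is a guess based on the class name in the copied selector, not something verified against the live page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the blog markup (an assumption, not the real page).
SAMPLE_HTML = """
<div id="ttItem1547528381">
  <ul>
    <li class="tt-span-12 tt-last tt-clear">
      <div><div>
        <p class="tt-post-title"><a href="/588">[파이썬]윈도우10 환경 설치</a></p>
      </div></div>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Shorter selector that depends only on the class name (hypothetical choice).
title = soup.select_one("p.tt-post-title > a")

# A selector that matches nothing yields None rather than raising an error.
missing = soup.select_one("p.no-such-class > a")

# Check for None before using the result, so a layout change fails clearly.
if title is not None:
    print(title)
else:
    print("selector matched nothing")
```

Checking for None first avoids an AttributeError later, for example when get_text() is called on a result that did not match.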
To extract only the plain text, use get_text():
print(title.get_text())
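Besides the text, the link target and multiple matches are often needed as well. A minimal sketch, again on made-up inline HTML (the second post entry and its URL are hypothetical): get_text(strip=True) trims surrounding whitespace, get('href') reads an attribute, and select returns every match as a list.

```python
from bs4 import BeautifulSoup

# Made-up sample; the second entry and its "/000" URL are hypothetical.
SAMPLE_HTML = """
<p class="tt-post-title"><a href="/588"> [파이썬]윈도우10 환경 설치 </a></p>
<p class="tt-post-title"><a href="/000">Hypothetical second post</a></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of all matching tags (empty if none match).
posts = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("p.tt-post-title > a")
]

for text, href in posts:
    print(text, href)
```

Using a.get("href") instead of a["href"] returns None for a link without that attribute rather than raising a KeyError.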
@다크쉐라빔 :: 다크쉐라빔의 주절주절