Numpy & Pandas

Y.일월 2019. 10. 26. 23:26

NumPy (Numerical Python)

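: A package for numerical computing in Python, built around a fast N-dimensional array type and the functions that operate on it

NumPy arrays

- The class NumPy uses to represent arrays is numpy.ndarray

- numpy.ndarray was designed to make n-dimensional arrays easy and efficient to work with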
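As a small illustration of the two points above, the sketch below inspects the metadata an ndarray carries and compares an element-wise operation with the equivalent list comprehension. The variable names (`doubled`, `same_with_list`) are only for illustration and do not come from the original post.

```python
import numpy as np

a = np.array([1, 2.5, 4.0, 5.5, 7.0])

# Metadata carried by every ndarray
print(type(a))   # <class 'numpy.ndarray'>
print(a.ndim)    # 1  (number of dimensions)
print(a.shape)   # (5,)
print(a.dtype)   # float64 (the integer 1 is upcast to float)

# Element-wise arithmetic needs no explicit loop
doubled = a * 2

# The same result with a plain Python list requires a loop or comprehension
same_with_list = [x * 2 for x in a.tolist()]
```

Because the operation is applied to the whole array at once, the NumPy version is usually both shorter and faster than the list version for large inputs.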

import numpy as np

a = np.array([1, 2.5, 4.0, 5.5, 7.0])

print(a[3:]) # slice from index 3 (the fourth element) to the end: 5.5, 7.0

>> [5.5 7. ]

print(a.sum()) # sum

>> 20.0

print(a.std()) # standard deviation (population form, ddof=0)

>> 2.1213203435596424

print(a.cumsum()) # cumulative sum

>> [ 1. 3.5 7.5 13. 20. ]

print(a * 2) # multiply each element of a by 2

>> [ 2. 5. 8. 11. 14.]

print(a ** 2) # square each element of a

>> [ 1. 6.25 16. 30.25 49. ]

print(np.sqrt(a)) # square root of each element of a

>> [1. 1.58113883 2. 2.34520788 2.64575131]

b = np.array([a, a**2])

print(b)

[[ 1. 2.5 4. 5.5 7. ]

[ 1. 6.25 16. 30.25 49. ]]

print(b.sum(axis = 0)) # axis = 0: sum down each column (the rows are collapsed)

>> [ 2. 8.75 20. 35.75 56. ]

print(b.sum(axis = 1)) # axis = 1: sum along each row (the columns are collapsed)

>> [ 20. 102.5]
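The axis argument is easiest to read as "the dimension that disappears". The following sketch, using the same `b` as above, checks the resulting shapes; `keepdims=True` is an extra option not used in the original post and is shown here only to make the collapsed dimension visible.

```python
import numpy as np

a = np.array([1, 2.5, 4.0, 5.5, 7.0])
b = np.array([a, a ** 2])            # shape (2, 5)

col_sums = b.sum(axis=0)             # rows collapsed    -> shape (5,)
row_sums = b.sum(axis=1)             # columns collapsed -> shape (2,)
total = b.sum()                      # no axis: sum of every element

# keepdims=True keeps the collapsed axis with length 1
row_sums_2d = b.sum(axis=1, keepdims=True)   # shape (2, 1)

print(col_sums.shape, row_sums.shape, row_sums_2d.shape)
print(total)
```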

print(np.array([[0, 0, 0], [0, 0, 0]]))

[[0 0 0]

[0 0 0]]

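values = np.zeros((2, 3), dtype = 'i') # create a 2x3 array filled with zeros

# dtype = 'i' : integer (C int, typically 32-bit) / dtype = 'f' : single-precision float (float32), not complex; complex types use codes such as 'D' (complex128)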

print(values)

[[0 0 0]

[0 0 0]]
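To check what a one-character dtype code actually means, it can be passed to `np.dtype`. The sketch below does that for a few codes and also shows the more explicit spelling with NumPy type objects; the variable names are illustrative only.

```python
import numpy as np

int_code = np.dtype('i')      # C int; int32 on common platforms
float_code = np.dtype('f')    # float32
double_code = np.dtype('d')   # float64
complex_code = np.dtype('D')  # complex128

# Spelling the type out avoids having to remember the codes
zeros_int = np.zeros((2, 3), dtype=np.int64)
ones_float = np.ones((2, 3), dtype=np.float32)
filled = np.full((2, 3), 7, dtype=np.int64)   # every element set to 7

print(float_code, double_code, complex_code)
print(filled)
```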

Pandas

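: A package widely used for working with data because it makes it easy to read and write data from CSV files or databases

It also makes fast data analysis possible.

Pandas data structures: Series, DataFrame

Series example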

import pandas as pd

pandas_series = pd.Series([3000, 3200, 2700], index = ['2016-11-10', '2016-11-11', '2016-11-12'])

print(pandas_series)

2016-11-10 3000

2016-11-11 3200

2016-11-12 2700

dtype: int64

print(pandas_series[1:])

2016-11-11 3200

2016-11-12 2700

dtype: int64
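The slice above selects by position. A Series can also be addressed by its index labels, and the two styles differ in how they treat the end of a slice. A minimal sketch, reusing the same data under the shorter name `s` (an illustrative name, not from the original):

```python
import pandas as pd

s = pd.Series([3000, 3200, 2700],
              index=['2016-11-10', '2016-11-11', '2016-11-12'])

by_label = s.loc['2016-11-11']      # select by index label
by_position = s.iloc[1]             # select by integer position

# Label slices include both endpoints
label_slice = s.loc['2016-11-10':'2016-11-11']

# Position slices exclude the end, like ordinary Python slices
position_slice = s.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Using `.loc` and `.iloc` explicitly makes it clear which of the two behaviours is intended.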

DataFrame example

import pandas as pd

import numpy as np

# Create a DataFrame holding the values 100, 150, 200, 250, 300

# Column name: numbers

# Row labels (index): a, b, c, d, e

df = pd.DataFrame([100, 150, 200, 250, 300], columns = ['numbers'], index = ['a','b','c','d','e'])

print(df)

numbers

a 100

b 150

c 200

d 250

e 300

print(df.index) # print only the index

>> Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

print(df.sum())

>> numbers 1000

dtype: int64

print(df.numbers**2) # square each value in the 'numbers' column

a 10000

b 22500

c 40000

d 62500

e 90000

Name: numbers, dtype: int64

# Add a column named 'values' to the DataFrame.

df['values'] = (10,50,40,30,60)

print(df)

numbers values

a 100 10

b 150 50

c 200 40

d 250 30

e 300 60

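# Replace the 'values' column of the DataFrame.

# Each value is placed on the row whose index label matches, regardless of the order given here.

# Labels missing from the right-hand side become NaN; a plain list or tuple, by contrast, must have exactly as many items as there are rows, or a ValueError is raised.

df['values'] = pd.Series(['Fourth', 'Second', 'First', 'Fifth', 'Third'], index = ['d','b','a','e','c'])

print(df)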

numbers values

a 100 First

b 150 Second

c 200 Third

d 250 Fourth

e 300 Fifth
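The difference between label-aligned assignment and positional assignment can be seen on a smaller frame. This is a sketch with made-up column names (`rank`, `bad`), not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({'numbers': [100, 150, 200]}, index=['a', 'b', 'c'])

# A Series is aligned on its labels; 'c' has no match and becomes NaN
df['rank'] = pd.Series(['Second', 'First'], index=['b', 'a'])

# A plain list has no labels, so its length must equal the number of rows
try:
    df['bad'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(df)
print(type(length_error).__name__)
```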

del df['values']

print(df)

numbers

a 100

b 150

c 200

d 250

e 300

df_1 = pd.DataFrame(['1','2','3'], columns = ['A'])

df_2 = pd.DataFrame(['4','5','6','7'], columns = ['B'])

df = df_1.join(df_2, how = 'outer') # combine the two DataFrames with an outer join on the index

print(df)

A B

0 1 4

1 2 5

2 3 6

3 NaN 7
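For comparison, the sketch below runs the same join with the other `how` options. The result names (`left`, `inner`, `right`, `outer`) are illustrative only.

```python
import pandas as pd

df_1 = pd.DataFrame(['1', '2', '3'], columns=['A'])        # index 0..2
df_2 = pd.DataFrame(['4', '5', '6', '7'], columns=['B'])   # index 0..3

left = df_1.join(df_2)                  # default how='left': keeps df_1's index
inner = df_1.join(df_2, how='inner')    # only index values present in both
right = df_1.join(df_2, how='right')    # keeps df_2's index
outer = df_1.join(df_2, how='outer')    # union of both indexes

print(len(left), len(inner), len(right), len(outer))
print(outer['A'].isna().sum())          # rows of A with no match in df_1
```

Here `left` and `inner` happen to give the same rows because every index value of `df_1` also exists in `df_2`.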

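# Create a 5x5 DataFrame of random values drawn from the standard normal distribution (the numbers differ on every run)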

df = pd.DataFrame(np.random.randn(5,5))

df.columns = ['A','B','C','D','E']

print(df)

A B C D E

0 -1.007019 0.764202 0.487860 -0.573025 2.024442

1 -1.154769 1.752510 -0.845164 -0.802163 1.107832

2 0.947698 -0.010856 -0.111301 -0.670684 0.014977

3 -0.020861 -0.691743 -1.224263 -0.407138 1.756498

4 0.139213 -1.030757 1.297523 0.354125 0.086188
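Since the values above change on every run, a fixed seed is needed to reproduce a particular table. One possible sketch uses NumPy's `default_rng`; the helper `make_frame` and the seed value 42 are assumptions for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

def make_frame(seed):
    """Hypothetical helper: build a 5x5 frame of standard normal values."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((5, 5)), columns=list('ABCDE'))

df_a = make_frame(42)
df_b = make_frame(42)

# The same seed gives the same numbers
print(df_a.equals(df_b))
```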

print(df.max()) # maximum of each column

A 0.947698

B 1.752510

C 1.297523

D 0.354125

E 2.024442

dtype: float64

print(df.min()) # minimum of each column

A -1.154769

B -1.030757

C -1.224263

D -0.802163

E 0.014977

dtype: float64

print(df.mean()) # mean of each column

A -0.219147

B 0.156671

C -0.079069

D -0.419777

E 0.997987

dtype: float64

print(df.std()) # standard deviation of each column (sample form, ddof=1)

A 0.869684

B 1.126239

C 1.014374

D 0.456000

E 0.927187

dtype: float64
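One detail worth checking here: NumPy's `std` divides by n by default (ddof=0), while pandas' `std` divides by n - 1 (ddof=1), so the two give different numbers for the same data. A short sketch with the array values from the NumPy section:

```python
import math

import numpy as np
import pandas as pd

values = [1, 2.5, 4.0, 5.5, 7.0]

np_std = np.std(values)                    # population form, divides by n
pd_std = pd.Series(values).std()           # sample form, divides by n - 1

# Passing ddof explicitly makes the two libraries agree
np_sample = np.std(values, ddof=1)
pd_population = pd.Series(values).std(ddof=0)

print(np_std, pd_std)
print(math.isclose(np_sample, pd_std), math.isclose(pd_population, np_std))
```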

print(df.describe()) # summary statistics for each column

A B C D E

count 5.000000 5.000000 5.000000 5.000000 5.000000

mean -0.219147 0.156671 -0.079069 -0.419777 0.997987

std 0.869684 1.126239 1.014374 0.456000 0.927187

min -1.154769 -1.030757 -1.224263 -0.802163 0.014977

25% -1.007019 -0.691743 -0.845164 -0.670684 0.086188

50% -0.020861 -0.010856 -0.111301 -0.573025 1.107832

75% 0.139213 0.764202 0.487860 -0.407138 1.756498

max 0.947698 1.752510 1.297523 0.354125 2.024442

df['division'] = ['X','Y','X','Y','Z']

print(df)

>> A B C D E division

0 -1.007019 0.764202 0.487860 -0.573025 2.024442 X

1 -1.154769 1.752510 -0.845164 -0.802163 1.107832 Y

2 0.947698 -0.010856 -0.111301 -0.670684 0.014977 X

3 -0.020861 -0.691743 -1.224263 -0.407138 1.756498 Y

4 0.139213 -1.030757 1.297523 0.354125 0.086188 Z

print(df.groupby(['division']).mean())

# Use groupby to compute the mean of each column for each value of the 'division' column

>> A B C D E

division

X -0.029660 0.376673 0.188279 -0.621855 1.019710

Y -0.587815 0.530384 -1.034713 -0.604650 1.432165

Z 0.139213 -1.030757 1.297523 0.354125 0.086188
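The grouped mean is easier to verify by hand on fixed data. The sketch below uses a single made-up numeric column so the expected values are obvious, and also shows `agg` for computing several statistics at once; the data and names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'division': ['X', 'Y', 'X', 'Y', 'Z'],
})

# Mean of column A within each division
means = df.groupby('division')['A'].mean()

# Several aggregations in one call
summary = df.groupby('division')['A'].agg(['mean', 'count'])

print(means)
print(summary)
```

Group X holds the values 1.0 and 3.0, so its mean is 2.0; group Z has a single row, so its mean is that row's value.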
