<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Brainuke</title>
	<atom:link href="https://brainuke.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://brainuke.com</link>
	<description>Data-Driven Consulting &#38; Advisory</description>
	<lastBuildDate>Fri, 03 Nov 2023 03:15:58 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://brainuke.com/wp-content/uploads/2021/06/cropped-iconesite-32x32.png</url>
	<title>Brainuke</title>
	<link>https://brainuke.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Time-Series Forecast with Prophet</title>
		<link>https://brainuke.com/time-series-forecast-com-prophet/</link>
					<comments>https://brainuke.com/time-series-forecast-com-prophet/#respond</comments>
		
		<dc:creator><![CDATA[Lucas Rezende]]></dc:creator>
		<pubDate>Fri, 03 Nov 2023 02:51:50 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[DATA SCIENCE]]></category>
		<category><![CDATA[MACHINE LEARNING]]></category>
		<guid isPermaLink="false">https://brainuke.com/?p=2724</guid>

					<description><![CDATA[I recently had to carry out a case study that required forecasting the crime rate in a given location for the next 6 months. While researching the methods available for this kind of prediction, I came across time-series forecasting and chose it for the study. The problem was that I had never [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p id="78ba">I recently had to carry out a case study that required forecasting the crime rate in a given location for the next 6 months. While researching the methods available for this kind of prediction, I came across time-series forecasting and chose it for the study. The problem was that I had never worked with this method before, so the next step was to read up on the subject. After reading several articles I arrived at the&nbsp;<a href="https://facebook.github.io/prophet/" rel="noreferrer noopener" target="_blank"><strong>Prophet</strong></a>&nbsp;library, created by Facebook. It was by far the simplest and most direct library I found, so I decided to try it, and the result was quite satisfactory.</p>



<p id="cefc">In this post I will use the same crime dataset to walk through my line of reasoning, going from&nbsp;<a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis" rel="noreferrer noopener" target="_blank"><strong><em>EDA</em></strong></a>&nbsp;(exploratory data analysis) through to the&nbsp;<strong><em>TS Forecast</em></strong>&nbsp;and the final result. I will use&nbsp;<strong>Jupyter Notebook</strong>&nbsp;for the code, with&nbsp;<strong>Python 3.6.3</strong>.</p>



<h2 class="wp-block-heading has-ast-global-color-1-color has-text-color" id="df72">Installing the library:</h2>



<p></p>



<pre class="wp-block-code"><code>pip install fbprophet</code></pre>



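<p id="8408">Pretty simple, right? (Since version 1.0 the package is published on PyPI as <code>prophet</code>; this post uses the older <code>fbprophet</code> name throughout.)</p>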



<h2 class="wp-block-heading has-ast-global-color-1-color has-text-color" id="5a57">Importing the required libraries:</h2>



<p></p>



<pre class="wp-block-code"><code>import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Just because I think it looks nicer :)
plt.style.use('fivethirtyeight')</code></pre>



<h2 class="wp-block-heading has-ast-global-color-1-color has-text-color" id="518d">Reading the data file:</h2>



<p></p>



<pre class="wp-block-code"><code>%%time
df = pd.read_csv('./Crimes_2011_to_present.csv', index_col=False, error_bad_lines=False, engine='python')

# CPU times: user 29.1 s, sys: 1.69 s, total: 30.8 s
# Wall time: 31.4 s</code></pre>



<p id="71d7">Explaining the parameters:</p>



<p id="7575"><strong>index_col=False</strong>: tells pandas not to use the first column (the dataset's ID) as the index.</p>



<p id="5e25"><strong>error_bad_lines=False</strong>: if the reader hits a malformed line, pandas simply skips it instead of raising an error. (pandas 1.3 deprecated this parameter in favor of <code>on_bad_lines='skip'</code>, and pandas 2.0 removed it.)</p>



<p id="f56d"><strong>engine='python'</strong>: used purely out of habit. For decimal numbers in the dataset, it imports them with their decimal places&nbsp;<em>&#8220;as-is&#8221;</em>.</p>
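<p>As a minimal sketch of what skipping bad lines means in practice, the snippet below reads a small in-memory CSV in which one line has an extra field. The sample data is hypothetical, and the call assumes pandas 1.3 or later, where the option is spelled <code>on_bad_lines='skip'</code> rather than <code>error_bad_lines=False</code>.</p>

```python
import io
import pandas as pd

# Hypothetical three-column sample; the second data line has an extra field
raw = io.StringIO(
    "ID,Date,District\n"
    "1,2015-01-01,1\n"
    "2,2015-01-02,1,unexpected\n"
    "3,2015-01-03,2\n"
)

# on_bad_lines='skip' (pandas 1.3+) drops lines with too many fields
sample = pd.read_csv(raw, on_bad_lines='skip')
print(sample['ID'].tolist())
```

<p>Only the rows with IDs 1 and 3 survive, which is worth remembering: skipped lines disappear silently, so it can be useful to compare the row count against the raw file.</p>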



<h2 class="wp-block-heading has-ast-global-color-1-color has-text-color" id="3189">Removing features that will not be used:</h2>



<p></p>



<pre class="wp-block-code"><code>df.drop(&#91;'Case.Number', 'IUCR', 'X.Coordinate', 'Y.Coordinate', 'Block', 'Updated.On', 'Year', 'FBI.Code', 'Beat', 'Ward', 'Community.Area', 'Location'], inplace=True, axis=1)</code></pre>



<p id="a754">To make life easier later on, I converted the Date column to the correct type and set it as the index of my dataset:</p>



<pre class="wp-block-code"><code>df&#91;'Date'] = pd.to_datetime(df&#91;'Date'])
df.index = pd.DatetimeIndex(df&#91;'Date'])</code></pre>
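<p>To illustrate why a datetime index pays off, here is a small sketch on a hypothetical four-row sample. The date string format is an assumption for illustration; passing an explicit <code>format</code> usually speeds up parsing on large files.</p>

```python
import pandas as pd

# Hypothetical sample standing in for the crimes dataset
events = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Date': ['01/05/2015 10:30:00 PM', '01/20/2015 08:00:00 AM',
             '02/03/2015 01:15:00 PM', '02/03/2015 11:45:00 PM'],
})

# Parse the strings with an explicit format, then use them as the index
events['Date'] = pd.to_datetime(events['Date'], format='%m/%d/%Y %I:%M:%S %p')
events.index = pd.DatetimeIndex(events['Date'])

# With a DatetimeIndex, counting rows per month is a one-liner
per_month = events.resample('MS').size()
print(per_month)
```

<p>The <code>'MS'</code> alias labels each bucket with the first day of the month.</p>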



<p id="32ae">Some further preparation before the analyses:</p>



<pre class="wp-block-code"><code># Create two new columns, Year and Month
df&#91;'Year'] = pd.DatetimeIndex(df&#91;'Date']).year
df&#91;'Month'] = pd.DatetimeIndex(df&#91;'Date']).month

# Convert two columns to the boolean dtype
df&#91;'Arrest'] = df&#91;'Arrest'].astype('bool')
df&#91;'Domestic'] = df&#91;'Domestic'].astype('bool')

# Values that cannot be converted become NaN
df&#91;'Latitude'] = pd.to_numeric(df&#91;'Latitude'], errors='coerce')
df&#91;'Longitude'] = pd.to_numeric(df&#91;'Longitude'], errors='coerce')
df&#91;'District'] = pd.to_numeric(df&#91;'District'], errors='coerce')

# Drop the rows with NaN
df = df&#91;df&#91;'Latitude'].notnull()]
df = df&#91;df&#91;'Longitude'].notnull()]
df = df&#91;df&#91;'District'].notnull()]

# Collect every entry outside the top 20 (these will be relabeled)
loc_to_change = list(df&#91;'Location.Description'].value_counts()&#91;20:].index)
desc_to_change = list(df&#91;'Description'].value_counts()&#91;20:].index)

# Helper that maps anything outside the top 20 to 'OTHER'
def parse_location_names(location):
    if location in loc_to_change:
        return 'OTHER'
    else:
        return location

df&#91;'Location.Description'] = df&#91;'Location.Description'].map(parse_location_names)

# Another way to map anything outside the top 20 to 'OTHER'
# (isin() already returns False for NaN, so no dropna() is needed)
df.loc&#91;df&#91;'Description'].isin(desc_to_change), 'Description'] = 'OTHER'

# Converte features para o tipo categórico
df&#91;'Primary.Type'] = pd.Categorical(df&#91;'Primary.Type'])
df&#91;'Description'] = pd.Categorical(df&#91;'Description'])
df&#91;'Location.Description'] = pd.Categorical(df&#91;'Location.Description'])</code></pre>
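<p>The "keep the most frequent labels, lump the rest into OTHER" step can also be sketched with <code>Series.where</code>. The labels below are hypothetical and the cut-off is 2 instead of 20 to keep the example short.</p>

```python
import pandas as pd

# Hypothetical location labels
locations = pd.Series(['STREET', 'STREET', 'STREET', 'RESIDENCE', 'RESIDENCE',
                       'APARTMENT', 'BAR'])

top_n = 2  # the post keeps the top 20
keep = locations.value_counts().index[:top_n]

# Keep the frequent labels and replace everything else with 'OTHER'
grouped = locations.where(locations.isin(keep), 'OTHER')
print(grouped.tolist())
```

<p>Testing membership against the labels to keep, rather than the labels to change, keeps the lookup set small when there are many rare categories.</p>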



<h1 class="wp-block-heading has-ast-global-color-1-color has-text-color" id="5f39">Exploration and Visualization</h1>



<p></p>



<pre class="wp-block-code"><code>plt.figure(figsize=(11, 5))
df.resample('M').size().plot(legend=False)
plt.title('Number of crimes per month (2011 - 2015)')
plt.xlabel('Months')
plt.ylabel('Number of crimes')
plt.show()</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*Lib_LPNF5EYv6r54gi1DLg.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>
plt.figure(figsize=(11, 5))
df.resample('D').size().rolling(365).sum().plot(legend=False)
plt.title('Rolling sum of all crimes from 2011 - 2015')
plt.ylabel('Number of crimes')
plt.xlabel('Days')
plt.show()
</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*EomoG4xwM9hnvdRMQYp3-w.png" alt=""/></figure>
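<p>A rolling sum only produces a value once its window is full, which is why a 365-day rolling curve starts about a year after the first observation. A toy sketch with a 3-day window over hypothetical daily counts:</p>

```python
import pandas as pd

# Hypothetical daily crime counts for one week
daily = pd.Series([5, 3, 4, 6, 2, 7, 1],
                  index=pd.date_range('2015-01-01', periods=7, freq='D'))

# A 3-day window here; the post uses 365 days
rolling_total = daily.rolling(3).sum()
print(rolling_total)
```

<p>The first two entries are NaN because fewer than three days are available at that point.</p>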



<p></p>



<pre class="wp-block-code"><code>
crimes_count = df.pivot_table('ID', aggfunc=np.size, columns='Primary.Type', index=df.index.date, fill_value=0)
crimes_count.index = pd.DatetimeIndex(crimes_count.index)
plo = crimes_count.rolling(365).sum().plot(figsize=(12, 30), subplots=True, layout=(-1, 3), sharex=False, sharey=False)
</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*wEOlnoJRTWAoW6CL0zY-4Q.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>
days = &#91;'Monday','Tuesday','Wednesday',  'Thursday', 'Friday', 'Saturday', 'Sunday']
df.groupby(&#91;df.index.dayofweek]).size().plot(kind = 'barh')
plt.title('Number of crimes by day of week')
plt.xlabel('Number of Crimes')
plt.ylabel('Day of Week')
plt.yticks(np.arange(7), days)
plt.show()
</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1202/1*Q7Em3T-6kw2CDxoRNPAfSw.png" alt=""/></figure>
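<p>The tick labels in the chart above line up because pandas numbers weekdays from Monday (0) to Sunday (6). A quick check on a hypothetical week:</p>

```python
import pandas as pd

# 2015-01-05 was a Monday, so this range runs Monday to Sunday
week = pd.date_range('2015-01-05', periods=7, freq='D')
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# dayofweek numbers Monday as 0 and Sunday as 6
labels = [days[i] for i in week.dayofweek]
print(labels)
```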



<p></p>



<pre class="wp-block-code"><code>
df.groupby(&#91;df.index.month]).size().plot(kind = 'barh')
plt.title('Number of crimes by month')
plt.xlabel('Number of Crimes')
plt.ylabel('Month')
plt.show()
</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1104/1*nTgp93--M6cJ9tJIyrW6zw.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>
plt.figure(figsize=(8, 10))
df.groupby(&#91;df&#91;'Primary.Type']]).size().sort_values(ascending=True).plot(kind = 'barh')
plt.title('Number of crimes by Type')
plt.xlabel('Number of crimes')
plt.ylabel('Primary Type')
plt.show()
</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*o-RccTpiXO0gEJ3kV0feTw.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>
def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)

def plot_hmap(df, ix=None, cmap='bwr'):
    if ix is None:
        ix = np.arange(df.shape&#91;0])
    plt.imshow(df.iloc&#91;ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape&#91;0]), df.index&#91;ix])
    plt.xticks(np.arange(df.shape&#91;1]))
    plt.grid(False)
    plt.show()

def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort() # a trick to make better heatmaps
    # .values replaces as_matrix(), which was removed in pandas 1.0
    cap = np.min(&#91;np.max(df_marginal_scaled.values), np.abs(np.min(df_marginal_scaled.values))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df&#91;feature_name].max()
        min_value = df&#91;feature_name].min()
        result&#91;feature_name] = (df&#91;feature_name] - min_value) / (max_value - min_value)
    return result

dayofweek_by_location = df.pivot_table(values='ID', index='Location.Description', columns=df.index.dayofweek, aggfunc=np.size).fillna(0)
dayofweek_by_type = df.pivot_table(values='ID', index='Primary.Type', columns=df.index.dayofweek, aggfunc=np.size).fillna(0)
location_by_type  = df.pivot_table(values='ID', index='Location.Description', columns='Primary.Type', aggfunc=np.size).fillna(0)

from sklearn.cluster import AgglomerativeClustering as AC

plt.figure(figsize=(17,17))
scale_and_plot(dayofweek_by_type)
</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1266/1*Q6-iTChohCalwoBhvAdqqA.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>df2 = normalize(location_by_type)
ix = AC(3).fit(df2.T).labels_.argsort() # a trick to make better heatmaps
plt.figure(figsize=(17,13))
plt.imshow(df2.T.iloc&#91;ix,:], cmap='Reds')
plt.colorbar(fraction=0.03)
plt.xticks(np.arange(df2.shape&#91;0]), df2.index, rotation='vertical')
plt.yticks(np.arange(df2.shape&#91;1]), df2.columns)
plt.title('Location frequency for each crime')
plt.grid(False)
plt.show()</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*kaclIQTdgyF8OYF24bIcXw.png" alt=""/></figure>
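<p>The min-max scaling used for this heatmap maps every column to the range 0 to 1. A vectorized sketch of the same idea, on hypothetical counts:</p>

```python
import pandas as pd

# Hypothetical counts: rows are locations, columns are crime types
counts = pd.DataFrame({'THEFT': [10, 30, 50], 'BATTERY': [4, 4, 8]},
                      index=['STREET', 'RESIDENCE', 'BAR'])

# Column-wise min-max scaling, without an explicit loop over the columns
scaled = (counts - counts.min()) / (counts.max() - counts.min())
print(scaled)
```

<p>One caveat: a column whose values are all identical has a zero range, so the division yields NaN for that column.</p>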



<p></p>



<h1 class="wp-block-heading has-ast-global-color-1-color has-text-color" id="b580">TS Forecast Using Prophet</h1>



<p></p>



<pre class="wp-block-code"><code>from fbprophet import Prophet

# Count crimes per timestamp and district. The dataframe's index is also named
# 'Date', so it is dropped first to keep the 'Date' column unambiguous.
fc_df = df.reset_index(drop=True).groupby(&#91;'Date', 'District']).size()
fc_df = fc_df.reset_index()
fc_df.columns = &#91;'ds', 'District', 'y']

# District 1.0 was chosen for the forecast
# ========================================
aux = fc_df&#91;fc_df&#91;'District']==1.0]
aux = aux&#91;&#91;'ds', 'y']]

ax = aux.set_index('ds').plot(figsize=(16, 8))
ax.set_ylabel('Number of Crimes')
ax.set_xlabel('Date')

plt.show()</code></pre>



<p></p>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*Z0lsTb4rpwHW-nw5MXbmVg.png" alt=""/></figure>
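<p>Prophet reads the dates from a column named <code>ds</code> and the values from a column named <code>y</code>. The sketch below builds such a frame from hypothetical events, counting them per calendar day. Note that daily aggregation is an assumption made for this illustration, not necessarily the granularity used above.</p>

```python
import pandas as pd

# Hypothetical events for a single district
events = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-01 09:00', '2015-01-01 21:30',
                            '2015-01-02 14:00', '2015-01-04 08:15']),
    'District': [1.0, 1.0, 1.0, 1.0],
})

# Count events per calendar day; days without events get a count of 0
daily = events.set_index('Date').resample('D').size()

# Rename to the 'ds' / 'y' layout that Prophet works with
prophet_input = daily.rename('y').rename_axis('ds').reset_index()
print(prophet_input)
```

<p>Resampling fills the gap on 2015-01-03 with an explicit zero, whereas a plain <code>groupby</code> would simply omit that day.</p>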



<p></p>



<pre class="wp-block-code"><code>my_model = Prophet()
my_model.fit(aux)

future_dates = my_model.make_future_dataframe(periods=6, freq='MS')
forecast = my_model.predict(future_dates)

forecast&#91;&#91;'ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()</code></pre>



<p></p>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:750/1*lRJEGHRDoQf7w35QllZE-Q.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>my_model.plot(forecast, uncertainty=True)</code></pre>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*O08azPXs2IAvYW0sNWqbdQ.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>my_model.plot_components(forecast)</code></pre>



<p></p>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1278/1*wLqTccJZpKQ3o5v-CNpvlw.png" alt=""/></figure>



<p></p>



<h2 class="wp-block-heading" id="5832">Cross Validation</h2>



<p></p>



<pre class="wp-block-code"><code>
from fbprophet.diagnostics import cross_validation

df_cv = cross_validation(my_model, horizon = '365 days')
df_cv.head()</code></pre>



<p></p>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:868/1*rxABUfdNpd_6Jt0qDzladg.png" alt=""/></figure>



<p></p>



<pre class="wp-block-code"><code>df_cv&#91;&#91;'y', 'yhat']].plot(figsize=(16, 6))</code></pre>



<p></p>



<figure class="wp-block-image"><img decoding="async" src="https://miro.medium.com/v2/resize:fit:1400/1*ajKWpHq07S8p0wtSHz3Fxw.png" alt=""/></figure>



<p></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p id="c3d3"><strong>y</strong>&nbsp;= actual data /&nbsp;<strong>yhat</strong>&nbsp;= forecast</p>
</blockquote>
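<p>Beyond eyeballing the two curves, the gap between <strong>y</strong> and <strong>yhat</strong> can be summarized numerically. Below is a minimal sketch of two common error measures, computed on a hypothetical stand-in for the cross-validation frame. The library also offers a <code>performance_metrics</code> helper in <code>fbprophet.diagnostics</code> for the real output.</p>

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by cross_validation
cv = pd.DataFrame({'y': [10.0, 20.0, 40.0], 'yhat': [12.0, 18.0, 30.0]})

abs_error = (cv['y'] - cv['yhat']).abs()
mae = abs_error.mean()                # mean absolute error
mape = (abs_error / cv['y']).mean()   # mean absolute percentage error
print(mae, mape)
```

<p>MAPE is undefined wherever <strong>y</strong> is zero, so MAE is the safer choice for series with empty days.</p>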



<p id="e53f">So that's it. From here we can keep optimizing the model by adding seasonality to the training, changing the period, tuning the precision, and so on.</p>



<p id="ea80">In upcoming posts I will bring more examples with other time-series models, regression, classification, clustering, and ANNs (maybe?!).</p>
]]></content:encoded>
					
					<wfw:commentRss>https://brainuke.com/time-series-forecast-com-prophet/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
